
RLM-style prompt externalization: promising idea, negative results at current scale #134

@justrach


Summary

We explored implementing RLM-style (Recursive Language Model) auto-externalization of oversized user prompts, inspired by this paper analysis. The idea: when a user pipes in a large input, write it to a temp file and give the model a handle + metadata instead of inlining it, letting the model read slices on demand via tools.

Result: the feature works correctly but produces worse outcomes on realistic tasks. We're filing this as a research finding, not a feature request.

What We Built

A maybe_externalize() function in user_prompt.rs that:

  • Checks if user prompt or piped additional_context exceeds max_inline_prompt_chars (default 100K)
  • Writes the full content to a temp file (/tmp/forge_prompt_*.txt)
  • Replaces the inline content with an XML metadata message:
    <externalized_content label="piped input" total_lines="4521" total_chars="185234" path="/tmp/forge_prompt_abc.txt">
    The piped input is too large to include inline (185234 chars, 4521 lines).
    Use the read tool to examine relevant sections. Use grep/search to find specific patterns.
    </externalized_content>

The model then uses its existing tools (read, shell, grep) to inspect slices of the file.
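
The steps above can be sketched roughly as follows. This is a minimal illustration, not the reverted code from user_prompt.rs: the signature, the `externalized_message` helper, the `dir` parameter, and the process-id file suffix are all assumptions made for the example, and only the message text follows the XML shown above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Builds the metadata message that replaces the inline content.
/// Note: `label` and `path` are not XML-escaped in this sketch.
fn externalized_message(label: &str, content: &str, path: &Path) -> String {
    let total_chars = content.chars().count();
    let total_lines = content.lines().count();
    format!(
        "<externalized_content label=\"{label}\" total_lines=\"{total_lines}\" \
         total_chars=\"{total_chars}\" path=\"{}\">\n\
         The {label} is too large to include inline ({total_chars} chars, {total_lines} lines).\n\
         Use the read tool to examine relevant sections. Use grep/search to find specific patterns.\n\
         </externalized_content>",
        path.display()
    )
}

/// Returns `content` unchanged when it fits inline; otherwise writes it to a
/// file under `dir` and returns the metadata message instead.
fn maybe_externalize(
    label: &str,
    content: String,
    max_inline_chars: usize,
    dir: &Path,
) -> io::Result<String> {
    if content.chars().count() <= max_inline_chars {
        return Ok(content);
    }
    // Hypothetical naming scheme: a real implementation would need a suffix
    // that is unique per prompt, not just per process.
    let path: PathBuf = dir.join(format!("forge_prompt_{}.txt", std::process::id()));
    fs::write(&path, &content)?;
    Ok(externalized_message(label, &content, &path))
}
```

Calling `maybe_externalize("piped input", text, 100_000, &std::env::temp_dir())` would mirror the default threshold described above. Passing the threshold as a parameter keeps the sketch independent of how config.rs actually exposes max_inline_prompt_chars.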

Changes were ~70 lines across 4 files: user_prompt.rs, config.rs, .forge.toml, reader.rs.

Experiment Setup

Test 1: Simple retrieval (150K chars, "what's the first word?")

| Metric | Control (inline) | Treatment (externalized) |
|---|---|---|
| Prompt tokens | 31,436 | 13,473 |
| Total tokens | 31,474 | 13,634 |
| Turns | 1 | 2 |
| Tool calls | 0 | 1 |
| Wall time | 3.6s | 4.2s |

Result: 57% token savings. The model read 100 bytes via head and answered. This is the sweet spot — simple retrieval from a large corpus.

Test 2: SWE-bench Lite tasks (5 instances, 116-123K chars each)

Real bug analysis tasks: "Analyze this issue and the surrounding code context. What is the root cause? Describe the fix."

| Metric | Control (inline) | Treatment (externalized) |
|---|---|---|
| Avg prompt tokens | 50,830 | 233,103 |
| Avg completion tokens | 1,159 | 2,226 |
| Avg total tokens | 51,988 | 235,329 |
| Avg turns | 1.4 | 6.6 |
| Avg tool calls | 0.4 | 6.0 |
| Avg wall time | 27.7s | 54.8s |

Result: 353% more total tokens (4.5x as many) and 2x slower. With the input externalized, the model read it back through an average of 6 tool calls over 6.6 turns. Each turn re-sent the accumulated context (prior conversation plus new read results), so prompt tokens compounded. By the time the model had enough context to reason about the bug, it had consumed 4.5x as many tokens as it would have with everything inline from the start.

Per-instance breakdown

| Instance | Control total tokens | Treatment total tokens | Ratio |
|---|---|---|---|
| astropy-12907 | 33,762 | 210,272 | 6.2x worse |
| astropy-14182 | 33,342 | 318,177 | 9.5x worse |
| astropy-14365 | 124,945 | 440,421 | 3.5x worse |
| astropy-14995 | 35,098 | 76,461 | 2.2x worse |
| astropy-6938 | 32,794 | 131,314 | 4.0x worse |
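
As a cross-check, the per-instance figures can be recomputed with a few lines of Rust. The numbers are copied from the table above; the names `RUNS`, `ratio`, and `mean_totals` are made up for this sketch and do not come from the benchmark script.

```rust
/// (instance, control total tokens, treatment total tokens), from the table above.
const RUNS: [(&str, u64, u64); 5] = [
    ("astropy-12907", 33_762, 210_272),
    ("astropy-14182", 33_342, 318_177),
    ("astropy-14365", 124_945, 440_421),
    ("astropy-14995", 35_098, 76_461),
    ("astropy-6938", 32_794, 131_314),
];

/// Treatment-to-control token ratio for a single instance.
fn ratio(control: u64, treatment: u64) -> f64 {
    treatment as f64 / control as f64
}

/// Mean total tokens per arm, returned as (control, treatment).
fn mean_totals(runs: &[(&str, u64, u64)]) -> (f64, f64) {
    let n = runs.len() as f64;
    let control: u64 = runs.iter().map(|r| r.1).sum();
    let treatment: u64 = runs.iter().map(|r| r.2).sum();
    (control as f64 / n, treatment as f64 / n)
}
```

The means come out to 51,988.2 and 235,329, matching the average total tokens reported for Test 2, so the "353% more" headline is the ratio of the two means (about 4.53x). The mean of the five per-instance ratios is higher, roughly 5.1x, because astropy-14365 has a large control run that weighs heavily on the averages.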

Why It Fails

The RLM paper reports wins on 10M+ token inputs with 1000+ documents. At that scale the content cannot fit in the context window at all, so the model must decompose it. Our inputs were ~120K chars (~30K tokens), well within the context window of modern models.

The core issue: externalization converts one cheap read into many expensive round trips.

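  • Inline: typically 1 turn (1.4 on average), model sees everything, reasons once → ~32K prompt tokens (four of the five control runs totaled 33K-35K tokens; astropy-14365 is the outlier at 125K)
  • Externalized: 6.6 turns on average, each turn re-sends the full conversation history + new tool results → prompt tokens compound across turns, reaching 76K-440K total tokens per instance (235K on average)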
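
The compounding can be illustrated with a simple cost model. This is a sketch under stated assumptions: every turn re-sends the base prompt plus all earlier tool results, completion tokens are ignored, and no prompt caching or compaction is applied. The function name and the example numbers are hypothetical, not measured values.

```rust
/// Total prompt tokens across a multi-turn run, assuming each turn re-sends
/// the base prompt plus every tool result accumulated so far.
fn cumulative_prompt_tokens(base: u64, tool_results: &[u64]) -> u64 {
    let mut context = base; // tokens sent on the current turn
    let mut billed = context; // first turn: base prompt only
    for &result in tool_results {
        context += result; // the tool result joins the history
        billed += context; // the next turn re-sends everything so far
    }
    billed
}
```

For example, with a 10,000-token base prompt and six reads of 5,000 tokens each, the turns send 10K, 15K, 20K, 25K, 30K, 35K, and 40K prompt tokens, for 175K in total. Inlining the same 30K tokens of content would cost a single 40K-token turn. The total grows quadratically with the number of reads, which is why a handful of round trips is enough to erase the savings.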

Externalization only wins when:

  1. The input is truly massive — far beyond the context window, not just 100K chars
  2. The task requires only a small slice of the input (needle-in-haystack)
  3. The model can skip most content via grep/search and never read it all

For real engineering tasks (bug analysis, code review, refactoring), the model typically needs broad context. Inlining is cheaper because it's one round trip.

What Codegraff Already Does Better

The existing architecture already handles the RLM use case more naturally:

  • File tools: the model can read files in slices, grep for patterns
  • Sub-agents (AgentExecutor): recursive model calls with independent contexts
  • Shell tool: Python/grep/awk for programmatic filtering
  • Meta-tools (tools_list, tools_info, call_tool): lazy discovery pattern

These are all available when the model decides it needs them, rather than forcing decomposition upfront.

Recommendations

  1. Don't implement prompt externalization at 100K chars. The multi-turn overhead destroys savings.
  2. Consider it only for inputs that literally don't fit — e.g., when the input exceeds the model's context window limit. At that point the choice is "externalize or truncate," and externalization wins.
  3. The compaction system (Compactor) is the right approach for managing growing context within a session. It summarizes old turns rather than externalizing new input.
  4. If revisiting: the threshold should be tied to the model's context window, not a fixed char count. Something like "externalize when input > 80% of available context window" would be more principled.
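
A context-relative threshold along the lines of recommendation 4 might look like the sketch below. Everything here is an assumption for illustration: the function name, the `reserved_tokens` parameter (room kept for the system prompt, tools, and the response), and the 4 chars-per-token estimate, which is taken from the ~120K chars ≈ ~30K tokens figure above and will vary by tokenizer and content.

```rust
/// Rough chars-per-token estimate (assumption; varies by tokenizer and content).
const CHARS_PER_TOKEN: f64 = 4.0;

/// Decides whether to externalize, based on the estimated token count of the
/// input relative to the context window that is actually available.
fn should_externalize(
    input_chars: usize,
    context_window_tokens: usize,
    reserved_tokens: usize,
    fraction: f64,
) -> bool {
    // Tokens left for the user input after reserving space for everything else.
    let available = context_window_tokens.saturating_sub(reserved_tokens) as f64;
    let estimated_tokens = input_chars as f64 / CHARS_PER_TOKEN;
    estimated_tokens > fraction * available
}
```

With a hypothetical 200,000-token window, 20,000 tokens reserved, and a fraction of 0.8, the cutoff is 144,000 estimated tokens. A 120K-char input (about 30,000 tokens) stays inline, while a 1M-char input (about 250,000 tokens) is externalized. A real implementation would likely use the model's tokenizer rather than a character heuristic.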

Branch

All changes were on release/0.2.11 and have been reverted. The benchmark data is in the trajectory DB for conversations from 2026-05-25 ~21:52-21:55 UTC+8.

Raw Data

Benchmark script and SWE-bench test instances were at /tmp/swe_bench_test/. Results JSON:

{
  "treatment_avg": {"prompt_tokens": 233103, "completion_tokens": 2226, "turns": 6.6, "tool_calls": 6.0, "wall_s": 54.8},
  "control_avg": {"prompt_tokens": 50830, "completion_tokens": 1159, "turns": 1.4, "tool_calls": 0.4, "wall_s": 27.7}
}
