Summary
We explored implementing RLM-style (Recursive Language Model) auto-externalization of oversized user prompts, inspired by this paper analysis. The idea: when a user pipes in a large input, write it to a temp file and give the model a handle + metadata instead of inlining it, letting the model read slices on demand via tools.
Result: the feature works correctly but produces worse outcomes on realistic tasks. We're filing this as a research finding, not a feature request.
What We Built
A maybe_externalize() function in user_prompt.rs that:
- Checks if user prompt or piped additional_context exceeds max_inline_prompt_chars (default 100K)
- Writes the full content to a temp file (/tmp/forge_prompt_*.txt)
- Replaces the inline content with an XML metadata message:
<externalized_content label="piped input" total_lines="4521" total_chars="185234" path="/tmp/forge_prompt_abc.txt">
The piped input is too large to include inline (185234 chars, 4521 lines).
Use the read tool to examine relevant sections. Use grep/search to find specific patterns.
</externalized_content>
- The model then uses its existing tools (read, shell, grep) to inspect slices
Changes were ~70 lines across 4 files: user_prompt.rs, config.rs, .forge.toml, reader.rs.
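The steps above can be sketched roughly as follows. This is a Python illustration of the described behavior, not the Rust code from user_prompt.rs: the function signature, the `label` parameter, and the use of the system temp directory (rather than a hard-coded /tmp) are assumptions made for the example.

```python
import tempfile
from xml.sax.saxutils import quoteattr

# Mirrors the default of max_inline_prompt_chars described above.
MAX_INLINE_PROMPT_CHARS = 100_000


def maybe_externalize(content: str, label: str = "piped input",
                      max_inline_chars: int = MAX_INLINE_PROMPT_CHARS) -> str:
    """Return content unchanged if it fits inline, else a metadata stub."""
    total_chars = len(content)
    if total_chars <= max_inline_chars:
        return content

    # Write the full content to a temp file that outlives this call
    # (delete=False), so tools can read slices of it later.
    with tempfile.NamedTemporaryFile(
        mode="w", encoding="utf-8",
        prefix="forge_prompt_", suffix=".txt", delete=False,
    ) as handle:
        handle.write(content)
        path = handle.name

    total_lines = len(content.splitlines())
    # quoteattr() returns the value already wrapped in quotes and escaped.
    return (
        f"<externalized_content label={quoteattr(label)} "
        f'total_lines="{total_lines}" total_chars="{total_chars}" '
        f"path={quoteattr(path)}>\n"
        f"The {label} is too large to include inline "
        f"({total_chars} chars, {total_lines} lines).\n"
        "Use the read tool to examine relevant sections. "
        "Use grep/search to find specific patterns.\n"
        "</externalized_content>"
    )
```

Anything at or below the threshold passes through untouched, so the stub only ever replaces content that would otherwise be inlined in full.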
Experiment Setup
Test 1: Simple retrieval (150K chars, "what's the first word?")
| Metric | Control (inline) | Treatment (externalized) |
|---|---|---|
| Prompt tokens | 31,436 | 13,473 |
| Total tokens | 31,474 | 13,634 |
| Turns | 1 | 2 |
| Tool calls | 0 | 1 |
| Wall time | 3.6s | 4.2s |
Result: 57% token savings. The model read 100 bytes via head and answered. This is the sweet spot — simple retrieval from a large corpus.
Test 2: SWE-bench Lite tasks (5 instances, 116-123K chars each)
Real bug analysis tasks: "Analyze this issue and the surrounding code context. What is the root cause? Describe the fix."
| Metric | Control (inline) | Treatment (externalized) |
|---|---|---|
| Avg prompt tokens | 50,830 | 233,103 |
| Avg completion tokens | 1,159 | 2,226 |
| Avg total tokens | 51,988 | 235,329 |
| Avg turns | 1.4 | 6.6 |
| Avg tool calls | 0.4 | 6.0 |
| Avg wall time | 27.7s | 54.8s |
Result: 353% more tokens (about 4.5x as many) and 2x slower. With the input externalized, the model read it back across 6-8 tool calls. Each turn accumulated prompt tokens (prior context + new read results). By the time it had enough context to reason about the bug, it had consumed about 4.5x the tokens it would have used with everything inline from the start.
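For readers who want to check the headline figures, the short calculation below derives them from the averages in the table. The dictionary layout is just a convenience for this example; the numbers are the reported averages.

```python
# Reported averages for the five SWE-bench Lite instances.
control = {"prompt_tokens": 50_830, "completion_tokens": 1_159, "wall_s": 27.7}
treatment = {"prompt_tokens": 233_103, "completion_tokens": 2_226, "wall_s": 54.8}


def total_tokens(run):
    # Summing the rounded averages gives 51,989 for control; the table's
    # 51,988 comes from averaging per-instance totals before rounding.
    return run["prompt_tokens"] + run["completion_tokens"]


ratio = total_tokens(treatment) / total_tokens(control)
percent_more = (ratio - 1) * 100
slowdown = treatment["wall_s"] / control["wall_s"]

print(f"{ratio:.2f}x as many tokens ({percent_more:.0f}% more), "
      f"{slowdown:.2f}x wall time")
# prints: 4.53x as many tokens (353% more), 1.98x wall time
```

Note that "353% more" and "4.5x as many" describe the same ratio: 353% more means the treatment used the control's tokens plus another 3.53 times that amount.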
Per-instance breakdown
| Instance | Control tokens | Treatment tokens | Ratio |
|---|---|---|---|
| astropy-12907 | 33,762 | 210,272 | 6.2x worse |
| astropy-14182 | 33,342 | 318,177 | 9.5x worse |
| astropy-14365 | 124,945 | 440,421 | 3.5x worse |
| astropy-14995 | 35,098 | 76,461 | 2.2x worse |
| astropy-6938 | 32,794 | 131,314 | 4.0x worse |
Why It Fails
The RLM paper reports wins on 10M+ token inputs with 1000+ documents. At that scale, the content physically cannot be inlined and the model must decompose. Our inputs were ~120K chars (~30K tokens), well within context window limits for modern models.
The core issue: externalization converts one cheap read into many expensive round trips.
- Inline: 1 turn, model sees everything, reasons once → ~32K prompt tokens
- Externalized: 6-8 turns, each turn re-sends the full conversation history + new tool results → prompt tokens compound across turns, pushing per-instance totals to 76-440K (235K on average)
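The compounding effect can be illustrated with a toy cost model. This is a simplified sketch, not a reproduction of the benchmark: the 2,000-token overhead and the read sizes are made-up values, and it ignores assistant message tokens and any prompt caching a provider might apply.

```python
def inline_prompt_tokens(input_tokens, overhead_tokens=2_000):
    """One turn: system/instruction overhead plus the whole input."""
    return overhead_tokens + input_tokens


def externalized_prompt_tokens(read_sizes, overhead_tokens=2_000):
    """Total prompt tokens billed when every turn re-sends the history.

    read_sizes: tokens returned by each successive tool call (hypothetical).
    """
    history = overhead_tokens  # system prompt + externalization stub
    billed = 0
    for read in read_sizes:
        billed += history      # the turn that issues this tool call
        history += read        # its result is appended to the history
    billed += history          # final turn that produces the answer
    return billed


# A ~30K-token input, inlined once.
print(inline_prompt_tokens(30_000))               # 32000
# Six reads of 5K tokens each: history is re-sent on all seven turns.
print(externalized_prompt_tokens([5_000] * 6))    # 119000
# One tiny read (the "first word" case): externalization wins easily.
print(externalized_prompt_tokens([50]))           # 4050
```

With k equal reads of r tokens, the billed total grows as (k+1)·overhead + r·k(k+1)/2, so cost is quadratic in the number of reads. That is why a handful of medium-sized reads can cost more than inlining the whole input once.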
Externalization only wins when:
- The input is truly massive — far beyond the context window, not just 100K chars
- The task requires only a small slice of the input (needle-in-haystack)
- The model can skip most content via grep/search and never read it all
For real engineering tasks (bug analysis, code review, refactoring), the model typically needs broad context. Inlining is cheaper because it's one round trip.
What Codegraff Already Does Better
The existing architecture already handles the RLM use case more naturally:
- File tools: the model can read files in slices, grep for patterns
- Sub-agents (AgentExecutor): recursive model calls with independent contexts
- Shell tool: Python/grep/awk for programmatic filtering
- Meta-tools (tools_list, tools_info, call_tool): lazy discovery pattern
These are all available when the model decides it needs them, rather than forcing decomposition upfront.
Recommendations
- Don't implement prompt externalization at 100K chars. The multi-turn overhead destroys savings.
- Consider it only for inputs that literally don't fit — e.g., when the input exceeds the model's context window limit. At that point the choice is "externalize or truncate," and externalization wins.
- The compaction system (Compactor) is the right approach for managing growing context within a session. It summarizes old turns rather than externalizing new input.
- If revisiting: the threshold should be tied to the model's context window, not a fixed char count. Something like "externalize when input > 80% of available context window" would be more principled.
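A context-relative threshold along the lines of the last recommendation might look like the sketch below. The function name, the `reserved_tokens` parameter (room set aside for the system prompt, tools, and the response), and the 200,000-token window in the usage lines are all hypothetical; only the 80% figure comes from the recommendation.

```python
def should_externalize(input_tokens: int,
                       context_window_tokens: int,
                       reserved_tokens: int = 0,
                       max_fraction: float = 0.8) -> bool:
    """Externalize only when the input would crowd out the usable context."""
    available = context_window_tokens - reserved_tokens
    if available <= 0:
        # Nothing fits inline at all, so externalizing is the only option.
        return True
    return input_tokens > max_fraction * available


# A ~30K-token input in a hypothetical 200K-token window stays inline.
print(should_externalize(30_000, 200_000))    # False
# A 170K-token input exceeds 80% of the window and is externalized.
print(should_externalize(170_000, 200_000))   # True
```

Under this rule the benchmark inputs in this report (~30K tokens) would never have been externalized for a model with a large context window, which matches the finding that inlining was cheaper for them.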
Branch
All changes were on release/0.2.11 and have been reverted. The benchmark data is in the trajectory DB for conversations from 2026-05-25 ~21:52-21:55 UTC+8.
Raw Data
Benchmark script and SWE-bench test instances were at /tmp/swe_bench_test/. Results JSON:
{
"treatment_avg": {"prompt_tokens": 233103, "completion_tokens": 2226, "turns": 6.6, "tool_calls": 6.0, "wall_s": 54.8},
"control_avg": {"prompt_tokens": 50830, "completion_tokens": 1159, "turns": 1.4, "tool_calls": 0.4, "wall_s": 27.7}
}