# RLM-style prompt externalization: promising idea, negative results at current scale (#134)

## Summary

We explored implementing RLM-style (Recursive Language Model) auto-externalization of oversized user prompts, inspired by [this paper analysis](http://127.0.0.1:3001/news/recursive-language-models-context-window-breakthrough). The idea: when a user pipes in a large input, write it to a temp file and give the model a handle + metadata instead of inlining it, letting the model read slices on demand via tools.

**Result: the feature works correctly but produces worse outcomes on realistic tasks.** We're filing this as a research finding, not a feature request.

## What We Built

A `maybe_externalize()` function in `user_prompt.rs` that:
- Checks if user prompt or piped `additional_context` exceeds `max_inline_prompt_chars` (default 100K)
- Writes the full content to a temp file (`/tmp/forge_prompt_*.txt`)
- Replaces the inline content with an XML metadata message:
  ```xml
  <externalized_content label="piped input" total_lines="4521" total_chars="185234" path="/tmp/forge_prompt_abc.txt">
  The piped input is too large to include inline (185234 chars, 4521 lines).
  Use the read tool to examine relevant sections. Use grep/search to find specific patterns.
  </externalized_content>
  ```
- The model then uses its existing tools (read, shell, grep) to inspect slices

Changes were ~70 lines across 4 files: `user_prompt.rs`, `config.rs`, `.forge.toml`, `reader.rs`.
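
For readers without access to the reverted branch, the steps above can be sketched roughly as follows. This is a reconstruction under assumptions, not the code that was on the branch: the signature, the `dir` parameter, and the file-naming scheme are hypothetical, and only the XML shape and the "exceeds the threshold" check come from the description above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Builds the metadata message that replaces the inline content.
/// The XML shape mirrors the example above; the helper itself is hypothetical.
fn metadata_message(label: &str, content: &str, path: &Path) -> String {
    let total_chars = content.chars().count();
    let total_lines = content.lines().count();
    format!(
        "<externalized_content label=\"{label}\" total_lines=\"{total_lines}\" total_chars=\"{total_chars}\" path=\"{path}\">\n\
         The {label} is too large to include inline ({total_chars} chars, {total_lines} lines).\n\
         Use the read tool to examine relevant sections. Use grep/search to find specific patterns.\n\
         </externalized_content>",
        path = path.display(),
    )
}

/// Returns `Ok(None)` when the content is small enough to stay inline;
/// otherwise writes it under `dir` and returns the replacement message.
fn maybe_externalize(
    label: &str,
    content: &str,
    max_inline_chars: usize,
    dir: &Path,
) -> io::Result<Option<String>> {
    if content.chars().count() <= max_inline_chars {
        return Ok(None);
    }
    // Assumption: the process id stands in for a unique suffix. A real
    // implementation would want a collision-free temp-file name.
    let path: PathBuf = dir.join(format!("forge_prompt_{}.txt", std::process::id()));
    fs::write(&path, content)?;
    Ok(Some(metadata_message(label, content, &path)))
}
```

Calling `maybe_externalize("piped input", &input, 100_000, &std::env::temp_dir())` would correspond to the default `max_inline_prompt_chars` setting described above.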

## Experiment Setup

### Test 1: Simple retrieval (150K chars, "what's the first word?")

| Metric | Control (inline) | Treatment (externalized) |
|---|---|---|
| Prompt tokens | 31,436 | 13,473 |
| Total tokens | 31,474 | 13,634 |
| Turns | 1 | 2 |
| Tool calls | 0 | 1 |
| Wall time | 3.6s | 4.2s |

**Result: 57% fewer total tokens.** The model read 100 bytes via `head` and answered, at the cost of one extra turn and 0.6s of wall time. This is the sweet spot: simple retrieval from a large corpus.

### Test 2: SWE-bench Lite tasks (5 instances, 116-123K chars each)

Real bug analysis tasks: "Analyze this issue and the surrounding code context. What is the root cause? Describe the fix."

| Metric | Control (inline) | Treatment (externalized) |
|---|---|---|
| Avg prompt tokens | 50,830 | **233,103** |
| Avg completion tokens | 1,159 | 2,226 |
| Avg total tokens | 51,988 | **235,329** |
| Avg turns | 1.4 | **6.6** |
| Avg tool calls | 0.4 | **6.0** |
| Avg wall time | 27.7s | **54.8s** |

**Result: 353% more tokens (4.5x), 2x slower.** Once the input was externalized, the model read it back across 6-8 tool calls. Each turn re-sent the prior context plus the new read results, so prompt tokens accumulated. By the time the model had enough context to reason about the bug, it had consumed about 4.5x as many tokens as it would have with everything inline from the start.

### Per-instance breakdown

| Instance | Control tokens | Treatment tokens | Ratio |
|---|---|---|---|
| astropy-12907 | 33,762 | 210,272 | 6.2x worse |
| astropy-14182 | 33,342 | 318,177 | 9.5x worse |
| astropy-14365 | 124,945 | 440,421 | 3.5x worse |
| astropy-14995 | 35,098 | 76,461 | 2.2x worse |
| astropy-6938 | 32,794 | 131,314 | 4.0x worse |
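
The ratios and averages can be rechecked directly from the per-instance totals. A small sketch follows; the constant and helper names are illustrative, and the numbers are copied from the table above.

```rust
/// (instance, control total tokens, treatment total tokens), from the table above.
const RESULTS: [(&str, u64, u64); 5] = [
    ("astropy-12907", 33_762, 210_272),
    ("astropy-14182", 33_342, 318_177),
    ("astropy-14365", 124_945, 440_421),
    ("astropy-14995", 35_098, 76_461),
    ("astropy-6938", 32_794, 131_314),
];

/// Treatment-to-control token ratio for one instance.
fn ratio(control: u64, treatment: u64) -> f64 {
    treatment as f64 / control as f64
}

/// Mean control and treatment totals across all instances.
fn mean_totals(rows: &[(&str, u64, u64)]) -> (f64, f64) {
    let n = rows.len() as f64;
    let control: u64 = rows.iter().map(|r| r.1).sum();
    let treatment: u64 = rows.iter().map(|r| r.2).sum();
    (control as f64 / n, treatment as f64 / n)
}
```

The means come out to roughly 51,988 and 235,329 tokens, matching the Test 2 averages, and their ratio is about 4.5x. The per-instance ratios range from 2.2x to 9.5x, so the average is pulled toward astropy-14365, whose control run was already much larger than the others.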

## Why It Fails

The RLM paper reports wins on 10M+ token inputs with 1000+ documents. At that scale, the content cannot fit in the context window, so the model *must* decompose. Our inputs were ~120K chars (~30K tokens), well within the context window limits of modern models.

The core issue: **externalization converts one cheap read into many expensive round trips.**

- **Inline**: usually 1 turn; the model sees everything and reasons once → ~33-35K total tokens in four of the five instances
- **Externalized**: 6-8 turns, each turn re-sends the full conversation history + new tool results → prompt tokens compound across turns, reaching 76K-440K total tokens per instance (235K on average)
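
The compounding can be made concrete with a simple cost model. This sketch assumes every request re-sends the full history with no prompt caching or compaction. The function name and the example numbers are illustrative, not measured values.

```rust
/// Total prompt tokens billed across a multi-turn exchange, assuming every
/// turn re-sends the full history (no prompt caching, no compaction).
///
/// `base` is the initial prompt (system prompt plus metadata message); each
/// entry of `added_per_turn` is what one turn appends to the history (the
/// model's tool call plus the tool result).
fn cumulative_prompt_tokens(base: u64, added_per_turn: &[u64]) -> u64 {
    let mut history = base;
    let mut total = history; // the first request carries only the base prompt
    for &added in added_per_turn {
        history += added;
        total += history; // each later request carries everything so far
    }
    total
}
```

With a 10K base prompt and six reads of 5K tokens each, `cumulative_prompt_tokens(10_000, &[5_000; 6])` gives 175,000 prompt tokens. Inlining the same 30K of content in a single request, `cumulative_prompt_tokens(40_000, &[])`, costs 40,000. The total grows quadratically with the number of turns, which is consistent with the 4.5x blow-up measured in Test 2.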

Externalization only wins when:
1. The input is **truly massive** — far beyond the context window, not just 100K chars
2. The task requires only a **small slice** of the input (needle-in-haystack)
3. The model can **skip most content** via grep/search and never read it all

For real engineering tasks (bug analysis, code review, refactoring), the model typically needs broad context. Inlining is cheaper because it's one round trip.

## What Codegraff Already Does Better

The existing architecture already handles the RLM use case more naturally:
- **File tools**: the model can read files in slices, grep for patterns
- **Sub-agents** (`AgentExecutor`): recursive model calls with independent contexts
- **Shell tool**: Python/grep/awk for programmatic filtering
- **Meta-tools** (`tools_list`, `tools_info`, `call_tool`): lazy discovery pattern

These are all available *when the model decides it needs them*, rather than forcing decomposition upfront.

## Recommendations

1. **Don't implement prompt externalization at 100K chars.** The multi-turn overhead destroys savings.
2. **Consider it only for inputs that literally don't fit** — e.g., when the input exceeds the model's context window limit. At that point the choice is "externalize or truncate," and externalization wins.
3. **The compaction system (`Compactor`) is the right approach** for managing growing context within a session. It summarizes old turns rather than externalizing new input.
4. **If revisiting**: the threshold should be tied to the model's context window, not a fixed char count. Something like "externalize when input > 80% of available context window" would be more principled.
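
A context-relative threshold along the lines of recommendation 4 might look like the sketch below. The function name, the `reserved_tokens` parameter, and the token-based (rather than char-based) accounting are assumptions for illustration. Nothing like this exists in the codebase today.

```rust
/// Decides whether to externalize based on the model's context window
/// rather than a fixed character count.
///
/// `reserved_tokens` covers the system prompt, tool definitions, and output
/// budget; `max_fraction` is the share of the remaining window the input may
/// occupy before it is externalized (0.8 in the recommendation above).
fn should_externalize(
    input_tokens: u64,
    context_window: u64,
    reserved_tokens: u64,
    max_fraction: f64,
) -> bool {
    // saturating_sub avoids underflow if the reservation exceeds the window
    let available = context_window.saturating_sub(reserved_tokens);
    input_tokens as f64 > available as f64 * max_fraction
}
```

For a hypothetical 200K-token window with 20K reserved, the ~30K-token inputs from Test 2 would stay inline (`should_externalize(30_000, 200_000, 20_000, 0.8)` is `false`), while a 150K-token input would be externalized.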

## Branch

All changes were on `release/0.2.11` and have been reverted. The benchmark data is in the trajectory DB for conversations from 2026-05-25 ~21:52-21:55 UTC+8.

## Raw Data

Benchmark script and SWE-bench test instances were at `/tmp/swe_bench_test/`. Results JSON:
```json
{
  "treatment_avg": {"prompt_tokens": 233103, "completion_tokens": 2226, "turns": 6.6, "tool_calls": 6.0, "wall_s": 54.8},
  "control_avg": {"prompt_tokens": 50830, "completion_tokens": 1159, "turns": 1.4, "tool_calls": 0.4, "wall_s": 27.7}
}
```
