
letta-ai/recovery-bench


Recovery-Bench

Recovery-Bench is a benchmark for evaluating how well LLM agents recover from mistakes. A weak agent attempts a Terminal-Bench 2.0 task and fails. We evaluate how well agents can recover after replaying the failed trajectory to reproduce the corrupted environment.

How It Works

Weak agent runs task → fails → trajectory saved
                                        ↓
                          Replay failed commands in fresh env
                                        ↓
                          Recovery agent starts from corrupted state
                                        ↓
                          Measure: did it recover? (reward > 0)
  1. Initial traces — An agent (with a weak model) runs Terminal-Bench tasks.
  2. Filter failures — Keep only trajectories where the agent failed (reward = 0).
  3. Replay — Re-execute the failed agent's commands in a fresh Docker container to reproduce the corrupted state.
  4. Recovery — A recovery agent gets the original task, corrupted environment, and optionally context from the failed attempt.
  5. Score — Compare recovery success rates across models and agents.
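The filtering in step 2 can be sketched in Python. The `result.json` filename and `reward` field below are assumptions for illustration, not the benchmark's actual trace layout:

```python
import json
from pathlib import Path

def failed_trajectories(runs_dir):
    """Yield trace directories whose recorded reward is 0, i.e. failed attempts.

    Assumes each trace directory holds a result.json with a "reward" field;
    the real on-disk layout may differ.
    """
    for result_file in Path(runs_dir).glob("*/result.json"):
        result = json.loads(result_file.read_text())
        if result.get("reward", 0) == 0:
            yield result_file.parent
```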

Setup

pip install -e .

# Pull the bundled initial traces (requires Git LFS)
git lfs install
git lfs pull

Export API keys as environment variables for the models you're testing (e.g. `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`).

Shared failure set

The git lfs pull fetches pre-generated Terminus-2 Haiku 4.5 initial traces into runs/. These traces are the common baseline for all experiments — every model and agent is evaluated against the same set of failed tasks and corrupted environments, making results directly comparable across runs.

Evaluating Models (using Terminus-2)

Pick any LiteLLM model and run it against the shared Haiku 4.5 failure set using Terminus-2:

python -m recovery_bench.generate_traces \
    --recovery-model anthropic/claude-opus-4-6 \
    --resume-initial runs/initial-claude-haiku-4-5-20251001-20260303_194859

For model-specific kwargs (reasoning effort, temperature, etc.), pass the path to a JSON config instead:

--recovery-model configs/terminus/sonnet-46-max.json

where the config file contains:
{
  "model": "anthropic/claude-sonnet-4-6",
  "model_kwargs": { "reasoning_effort": "max", "temperature": 1.0 }
}
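The dual interpretation of `--recovery-model` (a LiteLLM model ID or a JSON config path) could be resolved along these lines. `resolve_model_arg` is a hypothetical helper, not the tool's actual code:

```python
import json
from pathlib import Path

def resolve_model_arg(arg):
    """Return (model_id, model_kwargs) from a model ID or JSON config path.

    Illustrative sketch: the real CLI may resolve configs differently.
    """
    path = Path(arg)
    if path.suffix == ".json" and path.is_file():
        cfg = json.loads(path.read_text())
        return cfg["model"], cfg.get("model_kwargs", {})
    return arg, {}  # plain LiteLLM model ID, no extra kwargs
```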

Evaluating Agents

By default, recovery uses RecoveryTerminus (a Terminus-2 agent with replay and recovery instructions). To evaluate other harnesses, pass installed:&lt;name&gt; to wrap any Harbor (installed) agent for recovery. This dynamically creates a recovery variant that inherits the agent's full behavior and adds replay and recovery instructions:

python -m recovery_bench.generate_traces \
    --recovery-agent installed:claude-code \
    --recovery-model anthropic/claude-sonnet-4-6 \
    --resume-initial runs/initial-claude-haiku-4-5-20251001-20260303_194859

Works with any Harbor (installed) agent: installed:codex, installed:gemini-cli, installed:aider, etc.
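One way such dynamic wrapping can work is subclassing at runtime. This is a sketch only; the `build_prompt` method name is invented for illustration and is not the package's real API:

```python
def make_recovery_variant(agent_cls, recovery_instructions):
    """Derive a recovery variant that keeps the wrapped agent's behavior and
    appends replay/recovery instructions to its prompt (illustrative sketch)."""
    class RecoveryVariant(agent_cls):
        def build_prompt(self, task):
            # Reuse the base agent's prompt, then append recovery guidance.
            return super().build_prompt(task) + "\n\n" + recovery_instructions

    RecoveryVariant.__name__ = f"Recovery{agent_cls.__name__}"
    return RecoveryVariant
```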

Message Modes

--message-mode controls the amount of context from the failed attempt provided to the recovery agent:

| Mode | What the agent gets |
| --- | --- |
| `full` (default) | Full transcript of the previous conversation |
| `summary` | LLM-generated summary of what was tried and what went wrong |
| `none` | Nothing; only the replayed environment and original task |
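The three modes amount to a simple dispatch. This hypothetical function (not the package's API) shows the intent:

```python
def build_recovery_context(mode, transcript, summarize):
    """Select how much of the failed attempt the recovery agent sees."""
    if mode == "full":
        return transcript             # entire prior conversation
    if mode == "summary":
        return summarize(transcript)  # LLM-generated recap
    if mode == "none":
        return ""                     # only the replayed env + original task
    raise ValueError(f"unknown --message-mode: {mode}")
```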

Advanced Usage

Generating your own initial traces

Generate fresh traces instead of using the bundled ones:

# Initial traces only
python -m recovery_bench.generate_traces \
    --initial-model anthropic/claude-haiku-4-5-20251001 \
    --task-id sqlite-db-truncate

# Full pipeline: initial + recovery in one command
python -m recovery_bench.generate_traces \
    --initial-model anthropic/claude-haiku-4-5-20251001 \
    --initial-agent my_module.agents:MyInitialAgent \
    --recovery-model anthropic/claude-opus-4-5-20251101 \
    --recovery-agent my_module.agents:MyRecoveryAgent \
    --task-id sqlite-db-truncate

Full CLI reference

| Flag | Description | Default |
| --- | --- | --- |
| `--initial-model` | Model for initial traces | Required unless `--resume-initial` |
| `--initial-agent` | Registry name or import path | `terminus-2` |
| `--recovery-model` | Model for recovery | Omit to skip recovery |
| `--recovery-agent` | Registry name, import path, or `installed:<name>` | `recovery-terminus` |
| `--message-mode` | `full`, `none`, or `summary` | `full` |
| `--resume-initial` | Path to existing initial traces | |
| `--task-id` | Task ID (repeatable) | All tasks |
| `--n-concurrent` | Parallel processes | 8 |
| `--job-name` | Custom output directory name | Auto-generated |
| `--dataset-version` | Terminal-Bench version | 2.0 |
| `--env` | Harbor backend (`docker`, `daytona`, `modal`) | |

Project structure

recovery_bench/
  generate_traces.py    CLI entry point
  pipeline.py           Orchestrator: initial → reorganize → recovery → aggregate
  prompts.py            Prompt text and instruction builders
  replay.py             Trajectory parsing and environment replay
  utils.py              Config resolution, task hashing, usage tracking
  agents/
    __init__.py         Agent registry
    base.py             RecoveryInstalledAgent (generic Harbor agent wrapper)
    recovery_mixin.py   Shared recovery logic across all recovery agents
    terminus.py         RecoveryTerminus, BaselineTerminus
    letta_code.py       LettaCode, RecoveryLettaCode

Acknowledgements

Recovery-Bench is built on Harbor and Terminal-Bench 2.0.

Citation

If you use Recovery-Bench in your research, please cite our blog post and the Terminal-Bench paper.
