Recovery-Bench is a benchmark for evaluating how well LLM agents recover from mistakes. A weak agent first attempts a Terminal-Bench 2.0 task and fails. Its failed trajectory is then replayed in a fresh environment to reproduce the corrupted state, and a recovery agent is scored on whether it can recover from there.
Weak agent runs task → fails → trajectory saved
↓
Replay failed commands in fresh env
↓
Recovery agent starts from corrupted state
↓
Measure: did it recover? (reward > 0)
- Initial traces — An agent (with a weak model) runs Terminal-Bench tasks.
- Filter failures — Keep only trajectories where the agent failed (reward = 0).
- Replay — Re-execute the failed agent's commands in a fresh Docker container to reproduce the corrupted state.
- Recovery — A recovery agent gets the original task, corrupted environment, and optionally context from the failed attempt.
- Score — Compare recovery success rates across models and agents.
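The filter and score steps above can be sketched in a few lines of Python. This is an illustration rather than the benchmark's own scoring code: the `TrialResult` type, its field names, and the sample task IDs are assumptions. Only the thresholds (failure is reward = 0, recovery is reward > 0) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    """One (task, agent) outcome; the field names are illustrative assumptions."""
    task_id: str
    reward: float


def failed_task_ids(initial: list[TrialResult]) -> set[str]:
    """Filter step: keep only tasks where the initial agent scored reward == 0."""
    return {r.task_id for r in initial if r.reward == 0}


def recovery_rate(initial: list[TrialResult], recovery: list[TrialResult]) -> float:
    """Score step: fraction of initially failed tasks with recovery reward > 0."""
    failed = failed_task_ids(initial)
    if not failed:
        return 0.0
    recovered = {r.task_id for r in recovery if r.task_id in failed and r.reward > 0}
    return len(recovered) / len(failed)


# Hypothetical results for three tasks.
initial = [
    TrialResult("task-a", 0.0),
    TrialResult("task-b", 1.0),  # succeeded initially, so it is filtered out
    TrialResult("task-c", 0.0),
]
recovery = [TrialResult("task-a", 1.0), TrialResult("task-c", 0.0)]

print(recovery_rate(initial, recovery))  # 0.5
```

Because every recovery agent is scored against the same set of failed tasks, the denominator stays fixed and rates are comparable across models.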
pip install -e .
# Pull the bundled initial traces (requires Git LFS)
git lfs install
git lfs pull

Add API keys for the models you're testing.
The git lfs pull step fetches pre-generated Terminus-2 Haiku 4.5 initial traces into runs/. These traces are the common baseline for all experiments: every model and agent is evaluated against the same set of failed tasks and corrupted environments, which makes results directly comparable across runs.
Pick any LiteLLM model and run it against the shared Haiku 4.5 failure set using Terminus-2:
python -m recovery_bench.generate_traces \
--recovery-model anthropic/claude-opus-4-6 \
--resume-initial runs/initial-claude-haiku-4-5-20251001-20260303_194859

For model-specific kwargs (reasoning effort, temperature, etc.), pass a JSON config instead:
--recovery-model configs/terminus/sonnet-46-max.json

{
  "model": "anthropic/claude-sonnet-4-6",
  "model_kwargs": { "reasoning_effort": "max", "temperature": 1.0 }
}

By default, recovery uses RecoveryTerminus (a Terminus-2 agent with replay and recovery instructions). To evaluate other harnesses, swap the agent: pass installed:<name> to wrap any Harbor (installed) agent for recovery. This dynamically creates a recovery variant that inherits the agent's full behavior and adds replay and recovery instructions:
python -m recovery_bench.generate_traces \
--recovery-agent installed:claude-code \
--recovery-model anthropic/claude-sonnet-4-6 \
--resume-initial runs/initial-claude-haiku-4-5-20251001-20260303_194859

Works with any Harbor (installed) agent: installed:codex, installed:gemini-cli, installed:aider, etc.
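To illustrate what "dynamically creates a recovery variant" can mean in Python, here is a minimal sketch that builds a subclass at runtime by layering a mixin over an existing agent class. Every name in it (`RecoveryMixin`, `ExampleAgent`, `build_instruction`, `make_recovery_variant`) is a hypothetical stand-in. It does not reproduce Harbor's or Recovery-Bench's actual classes.

```python
class RecoveryMixin:
    """Hypothetical mixin that prepends a recovery note to the task prompt."""

    RECOVERY_NOTE = (
        "A previous attempt at this task failed; "
        "the environment reflects its changes."
    )

    def build_instruction(self, task: str) -> str:
        # Delegate to the wrapped agent, then add the recovery note on top.
        base = super().build_instruction(task)
        return f"{self.RECOVERY_NOTE}\n\n{base}"


class ExampleAgent:
    """Stand-in for an installed agent; not a real Harbor class."""

    def build_instruction(self, task: str) -> str:
        return f"Task: {task}"


def make_recovery_variant(agent_cls: type) -> type:
    """Create a subclass that layers RecoveryMixin over agent_cls at runtime."""
    return type(f"Recovery{agent_cls.__name__}", (RecoveryMixin, agent_cls), {})


RecoveryExampleAgent = make_recovery_variant(ExampleAgent)
agent = RecoveryExampleAgent()

print(RecoveryExampleAgent.__name__)
print(agent.build_instruction("fix the database"))
```

Putting the mixin first in the base-class tuple means its methods take precedence, while `super()` still reaches the wrapped agent's behavior, so the variant inherits everything it does not override.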
--message-mode controls the amount of context from the failed attempt provided to the recovery agent:
| Mode | What the agent gets |
|---|---|
| full (default) | Full transcript of the previous conversation |
| summary | LLM-generated summary of what was tried and what went wrong |
| none | Nothing: only the replayed environment and original task |
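The three modes amount to a simple dispatch over how much of the failed attempt is passed along. The sketch below shows one way to express that. The function name, its signature, and the placeholder summarizer are assumptions made for illustration. In the real pipeline the summary comes from an LLM.

```python
from typing import Callable, Optional


def build_failure_context(
    mode: str,
    transcript: str,
    summarize: Callable[[str], str],
) -> Optional[str]:
    """Return the failed-attempt context for a given --message-mode value."""
    if mode == "full":
        return transcript
    if mode == "summary":
        return summarize(transcript)
    if mode == "none":
        return None
    raise ValueError(f"unknown message mode: {mode!r}")


def fake_summarizer(text: str) -> str:
    """Placeholder for an LLM call; it only counts the transcript's lines."""
    return f"Previous attempt: {len(text.splitlines())} steps, task not solved."


# Hypothetical three-command transcript from a failed attempt.
transcript = "ls /app\nrm data.db\nsqlite3 data.db .schema"

for mode in ("full", "summary", "none"):
    print(mode, "->", build_failure_context(mode, transcript, fake_summarizer))
```

In every mode the recovery agent still starts from the replayed environment; the mode only changes the textual context it receives.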
Generate fresh traces instead of using the bundled ones:
# Initial traces only
python -m recovery_bench.generate_traces \
--initial-model anthropic/claude-haiku-4-5-20251001 \
--task-id sqlite-db-truncate
# Full pipeline: initial + recovery in one command
python -m recovery_bench.generate_traces \
--initial-model anthropic/claude-haiku-4-5-20251001 \
--initial-agent my_module.agents:MyInitialAgent \
--recovery-model anthropic/claude-opus-4-5-20251101 \
--recovery-agent my_module.agents:MyRecoveryAgent \
--task-id sqlite-db-truncate

| Flag | Description | Default |
|---|---|---|
| --initial-model | Model for initial traces | Required unless --resume-initial |
| --initial-agent | Registry name or import path | terminus-2 |
| --recovery-model | Model for recovery | Omit to skip recovery |
| --recovery-agent | Registry name, import path, or installed:<name> | recovery-terminus |
| --message-mode | full, none, or summary | full |
| --resume-initial | Path to existing initial traces | (none) |
| --task-id | Task ID (repeatable) | All tasks |
| --n-concurrent | Parallel processes | 8 |
| --job-name | Custom output directory name | Auto-generated |
| --dataset-version | Terminal-Bench version | 2.0 |
| --env | Harbor backend (docker, daytona, modal) | (none) |
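When comparing settings, it can help to generate the command lines programmatically. The sketch below builds one recovery command per message mode using only flags from the table above. The helper name and the job names are hypothetical, and the commands are printed rather than executed.

```python
import shlex

# Path to the bundled initial traces, as used in the examples above.
INITIAL_RUN = "runs/initial-claude-haiku-4-5-20251001-20260303_194859"


def recovery_command(model: str, message_mode: str, job_name: str) -> list[str]:
    """Build the argv for one recovery run against the shared failure set."""
    return [
        "python", "-m", "recovery_bench.generate_traces",
        "--recovery-model", model,
        "--resume-initial", INITIAL_RUN,
        "--message-mode", message_mode,
        "--job-name", job_name,
    ]


# One command per message mode; the job names are illustrative.
commands = [
    recovery_command("anthropic/claude-opus-4-6", mode, f"opus-{mode}")
    for mode in ("full", "summary", "none")
]

for argv in commands:
    print(shlex.join(argv))
    # To launch a run instead of printing it: subprocess.run(argv, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a model name or path ever contains spaces.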
recovery_bench/
  generate_traces.py     CLI entry point
  pipeline.py            Orchestrator: initial → reorganize → recovery → aggregate
  prompts.py             Prompt text and instruction builders
  replay.py              Trajectory parsing and environment replay
  utils.py               Config resolution, task hashing, usage tracking
  agents/
    __init__.py          Agent registry
    base.py              RecoveryInstalledAgent (generic Harbor agent wrapper)
    recovery_mixin.py    Shared recovery logic across all recovery agents
    terminus.py          RecoveryTerminus, BaselineTerminus
    letta_code.py        LettaCode, RecoveryLettaCode
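The layout lists config resolution under utils.py, and --recovery-model accepts either a model name or a path to a JSON config. A minimal sketch of that kind of resolution might look like the following. The function `resolve_model_spec` and the rule of checking for a `.json` suffix are assumptions, not the project's actual implementation. The config shape matches the JSON example earlier in this README.

```python
import json
import tempfile
from pathlib import Path


def resolve_model_spec(spec: str) -> tuple[str, dict]:
    """Return (model, model_kwargs) from a model name or a JSON config path."""
    if spec.endswith(".json"):
        config = json.loads(Path(spec).read_text())
        return config["model"], config.get("model_kwargs", {})
    # A plain model name carries no extra kwargs.
    return spec, {}


# Plain model name.
print(resolve_model_spec("anthropic/claude-opus-4-6"))

# JSON config written to a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sonnet-46-max.json"
    path.write_text(json.dumps({
        "model": "anthropic/claude-sonnet-4-6",
        "model_kwargs": {"reasoning_effort": "max", "temperature": 1.0},
    }))
    print(resolve_model_spec(str(path)))
```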
Recovery-Bench is built on Harbor and Terminal-Bench 2.0.
If you use Recovery-Bench in your research, please cite our blog post and the Terminal-Bench paper.