
letta-ai/recovery-bench


Recovery-Bench

Recovery-Bench is a benchmark for evaluating how well LLM agents recover from mistakes. A weak agent attempts a Terminal-Bench 2.0 task and fails. We evaluate how well agents can recover after replaying the failed trajectory to reproduce the corrupted environment.

How It Works

Weak agent runs task → fails → trajectory saved
                                        ↓
                          Replay failed commands in fresh env
                                        ↓
                          Recovery agent starts from corrupted state
                                        ↓
                          Measure: did it recover? (reward > 0)
  1. Initial traces — An agent (with a weak model) runs Terminal-Bench tasks.
  2. Filter failures — Keep only trajectories where the agent failed (reward = 0).
  3. Replay — Re-execute the failed agent's commands in a fresh Docker container to reproduce the corrupted state.
  4. Recovery — A recovery agent gets the original task, corrupted environment, and optionally context from the failed attempt.
  5. Score — Compare recovery success rates across models and agents.
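The filtering in step 2 can be sketched in Python. The `result.json` filename and `reward` field below are assumptions for illustration, not the benchmark's actual trace layout:

```python
import json
from pathlib import Path

def failed_trajectories(runs_dir):
    """Yield trace directories whose recorded reward is 0, i.e. failed attempts.

    Assumes each trace directory holds a result.json with a "reward" field;
    the real on-disk layout may differ.
    """
    for result_file in Path(runs_dir).glob("*/result.json"):
        result = json.loads(result_file.read_text())
        if result.get("reward", 0) == 0:
            yield result_file.parent
```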

Setup

pip install -e .

# Pull the bundled initial traces (requires Git LFS)
git lfs install
git lfs pull

Export API keys as environment variables for the models you're testing (e.g. `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`).

Shared failure set

The git lfs pull fetches pre-generated Terminus-2 Haiku 4.5 initial traces into runs/. These traces are the common baseline for all experiments — every model and agent is evaluated against the same set of failed tasks and corrupted environments, making results directly comparable across runs.

Evaluating Models (using Terminus-2)

Pick any LiteLLM model and run it against the shared Haiku 4.5 failure set using Terminus-2:

python -m recovery_bench.generate_traces \
    --recovery-model anthropic/claude-opus-4-6 \
    --resume-initial runs/initial-claude-haiku-4-5-20251001-20260303_194859

For model-specific kwargs (reasoning effort, temperature, etc.), pass the path to a JSON config instead:

--recovery-model configs/terminus/sonnet-46-max.json

where the config file contains:
{
  "model": "anthropic/claude-sonnet-4-6",
  "model_kwargs": { "reasoning_effort": "max", "temperature": 1.0 }
}
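The dual interpretation of `--recovery-model` (a LiteLLM model ID or a JSON config path) could be resolved along these lines. `resolve_model_arg` is a hypothetical helper, not the tool's actual code:

```python
import json
from pathlib import Path

def resolve_model_arg(arg):
    """Return (model_id, model_kwargs) from a model ID or JSON config path.

    Illustrative sketch: the real CLI may resolve configs differently.
    """
    path = Path(arg)
    if path.suffix == ".json" and path.is_file():
        cfg = json.loads(path.read_text())
        return cfg["model"], cfg.get("model_kwargs", {})
    return arg, {}  # plain LiteLLM model ID, no extra kwargs
```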

Evaluating Agents

By default, recovery uses RecoveryTerminus (a Terminus-2 agent with replay and recovery instructions). To evaluate other harnesses, pass installed:&lt;name&gt; to wrap any Harbor (installed) agent for recovery. This dynamically creates a recovery variant that inherits the agent's full behavior and adds replay and recovery instructions:

python -m recovery_bench.generate_traces \
    --recovery-agent installed:claude-code \
    --recovery-model anthropic/claude-sonnet-4-6 \
    --resume-initial runs/initial-claude-haiku-4-5-20251001-20260303_194859

Works with any Harbor (installed) agent: installed:codex, installed:gemini-cli, installed:aider, etc.
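One way such dynamic wrapping can work is subclassing at runtime. This is a sketch only; the `build_prompt` method name is invented for illustration and is not the package's real API:

```python
def make_recovery_variant(agent_cls, recovery_instructions):
    """Derive a recovery variant that keeps the wrapped agent's behavior and
    appends replay/recovery instructions to its prompt (illustrative sketch)."""
    class RecoveryVariant(agent_cls):
        def build_prompt(self, task):
            # Reuse the base agent's prompt, then append recovery guidance.
            return super().build_prompt(task) + "\n\n" + recovery_instructions

    RecoveryVariant.__name__ = f"Recovery{agent_cls.__name__}"
    return RecoveryVariant
```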

Message Modes

--message-mode controls the amount of context from the failed attempt provided to the recovery agent:

| Mode | What the agent gets |
| --- | --- |
| `full` (default) | Full transcript of the previous conversation |
| `summary` | LLM-generated summary of what was tried and what went wrong |
| `none` | Nothing; only the replayed environment and original task |
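The three modes amount to a simple dispatch. This hypothetical function (not the package's API) shows the intent:

```python
def build_recovery_context(mode, transcript, summarize):
    """Select how much of the failed attempt the recovery agent sees."""
    if mode == "full":
        return transcript             # entire prior conversation
    if mode == "summary":
        return summarize(transcript)  # LLM-generated recap
    if mode == "none":
        return ""                     # only the replayed env + original task
    raise ValueError(f"unknown --message-mode: {mode}")
```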

Advanced Usage

Generating your own initial traces

Generate fresh traces instead of using the bundled ones:

# Initial traces only
python -m recovery_bench.generate_traces \
    --initial-model anthropic/claude-haiku-4-5-20251001 \
    --task-id sqlite-db-truncate

# Full pipeline: initial + recovery in one command
python -m recovery_bench.generate_traces \
    --initial-model anthropic/claude-haiku-4-5-20251001 \
    --initial-agent my_module.agents:MyInitialAgent \
    --recovery-model anthropic/claude-opus-4-5-20251101 \
    --recovery-agent my_module.agents:MyRecoveryAgent \
    --task-id sqlite-db-truncate

Full CLI reference

| Flag | Description | Default |
| --- | --- | --- |
| `--initial-model` | Model for initial traces | Required unless `--resume-initial` |
| `--initial-agent` | Registry name or import path | `terminus-2` |
| `--recovery-model` | Model for recovery | Omit to skip recovery |
| `--recovery-agent` | Registry name, import path, or `installed:<name>` | `recovery-terminus` |
| `--message-mode` | `full`, `none`, or `summary` | `full` |
| `--resume-initial` | Path to existing initial traces | |
| `--task-id` | Task ID (repeatable) | All tasks |
| `--n-concurrent` | Parallel processes | 8 |
| `--job-name` | Custom output directory name | Auto-generated |
| `--dataset-version` | Terminal-Bench version | 2.0 |
| `--env` | Harbor backend (`docker`, `daytona`, `modal`) | |

Project structure

recovery_bench/
  generate_traces.py    CLI entry point
  pipeline.py           Orchestrator: initial → reorganize → recovery → aggregate
  prompts.py            Prompt text and instruction builders
  replay.py             Trajectory parsing and environment replay
  utils.py              Config resolution, task hashing, usage tracking
  agents/
    __init__.py         Agent registry
    base.py             RecoveryInstalledAgent (generic Harbor agent wrapper)
    recovery_mixin.py   Shared recovery logic across all recovery agents
    terminus.py         RecoveryTerminus, BaselineTerminus
    letta_code.py       LettaCode, RecoveryLettaCode

Acknowledgements

Recovery-Bench is built on Harbor and Terminal-Bench 2.0.

Citation

If you use Recovery-Bench in your research, please cite our blog post and the Terminal-Bench paper.
