Compare coding agents head-to-head. Pass rate, cost, time, consistency -- one command.
Every "which coding agent is best?" discussion runs on vibes. There's no lightweight tool to systematically compare agents on your tasks, with your codebase, tracking what actually matters: does it work, how long did it take, and what did it cost?
```shell
pip install -e ".[dev]"
agent-eval run --tasks examples/tasks.yaml --agents claude-code,aider --runs 3 --output report.json
agent-eval report --input report.json
```

Sample Evaluation Suite
```text
-------------------------------------------------------------------
Agent         Pass Rate   Avg Time   Avg Cost   Consistency
-------------------------------------------------------------------
claude-code   80%         45.2s      $0.1200    0.95
aider         60%         30.1s      $0.0300    0.85
-------------------------------------------------------------------
Best pass rate: claude-code (80%) | Fastest: aider (30.1s avg) | Cheapest: aider ($0.0300 avg)
```
- Define tasks in YAML -- description, repo, test command, timeout
- Isolated execution -- each agent runs in a fresh git worktree (no Docker needed)
- Run agents -- Claude Code, Aider, or any CLI agent via adapters
- Judge results -- test commands (exit 0 = pass) or LLM-as-judge
- Report -- pass rate, avg time, avg cost, consistency across runs
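The "exit 0 = pass" judging rule can be sketched in a few lines. This is an illustrative helper, not agent-eval's internal API — the function name `judge_with_test_cmd` and its signature are assumptions:

```python
import subprocess


def judge_with_test_cmd(test_cmd: str, cwd: str, timeout_seconds: int) -> bool:
    """Run a task's test command in the agent's working copy.

    Exit code 0 means the agent's change passes; a non-zero exit
    or a timeout counts as a failure. Hypothetical helper for
    illustration only.
    """
    try:
        result = subprocess.run(
            test_cmd,
            shell=True,
            cwd=cwd,
            capture_output=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # A hung test suite is a failed run, not an indefinite wait.
        return False
    return result.returncode == 0
```

Capturing output (rather than streaming it) keeps the judge's verdict independent of whatever the test runner prints.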
| Agent | CLI | Status |
|---|---|---|
| Claude Code | claude | Supported |
| Aider | aider | Supported |
| Custom | Extend AgentAdapter | Extensible |
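A custom adapter might look like the sketch below. The method name `run`, the `AgentResult` fields, and the `EchoAgent` example are all assumptions for illustration — consult the actual `AgentAdapter` base class for the real interface:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class AgentResult:
    # Hypothetical result type; field names are illustrative.
    elapsed_seconds: float
    cost_usd: float


class AgentAdapter(ABC):
    """Sketch of the adapter contract: given a task and a working
    directory (a fresh git worktree), run the agent and report
    how long it took and what it cost."""

    @abstractmethod
    def run(self, task_description: str, workdir: str) -> AgentResult:
        ...


class EchoAgent(AgentAdapter):
    """Toy adapter. A real one would shell out to a CLI agent
    inside workdir and parse time/cost from its output."""

    def run(self, task_description: str, workdir: str) -> AgentResult:
        return AgentResult(elapsed_seconds=0.0, cost_usd=0.0)
```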
```yaml
name: "My Eval Suite"
tasks:
  - id: add-auth
    description: "Add JWT auth middleware to the Flask app"
    repo: "./my-flask-app"
    test_cmd: "python -m pytest tests/ -v"
    timeout_seconds: 180
    tags: [python, auth]
```

| | SWE-bench | agent-eval |
|---|---|---|
| Tasks | Fixed dataset (GitHub issues) | Your custom tasks |
| Setup | Docker, heavy infra | Git worktrees, no Docker |
| Cost tracking | No | Yes |
| Consistency measurement | No (1 run) | Yes (N runs, std dev) |
| Time to first result | Hours | Minutes |
```shell
agent-eval run --tasks YAML --agents NAME,NAME --runs N --concurrency N --output PATH
agent-eval report --input PATH
agent-eval list-agents
```

MIT