# agent-eval

Compare coding agents head-to-head. Pass rate, cost, time, consistency -- one command.

## The problem

Every "which coding agent is best?" discussion runs on vibes. There's no lightweight tool to systematically compare agents on your tasks, with your codebase, tracking what actually matters: does it work, how long does it take, and what does it cost?

## Quickstart

```sh
pip install -e ".[dev]"
agent-eval run --tasks examples/tasks.yaml --agents claude-code,aider --runs 3 --output report.json
agent-eval report --input report.json
```

## Example output

```text
Sample Evaluation Suite
-------------------------------------------------------------------
Agent            Pass Rate   Avg Time   Avg Cost   Consistency
-------------------------------------------------------------------
claude-code           80%      45.2s    $0.1200         0.95
aider                 60%      30.1s    $0.0300         0.85
-------------------------------------------------------------------
Best pass rate: claude-code (80%) | Fastest: aider (30.1s avg) | Cheapest: aider ($0.0300 avg)
```
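The consistency column summarizes how stable an agent's results are across repeated runs. A minimal sketch of one plausible definition -- 1 minus the population standard deviation of the per-run pass/fail indicators -- which is an assumption for illustration, not necessarily the formula agent-eval uses:

```python
from statistics import pstdev


def summarize(runs: list[bool]) -> dict:
    """Aggregate per-run pass/fail results into report metrics.

    `consistency` here is 1 - population std dev of the 0/1 outcomes,
    an assumed definition for illustration only.
    """
    passes = [1 if ok else 0 for ok in runs]
    pass_rate = sum(passes) / len(passes)
    consistency = 1 - pstdev(passes)
    return {"pass_rate": pass_rate, "consistency": consistency}


# 4 passes out of 5 runs -> pass_rate 0.8, consistency 0.6 under this definition
print(summarize([True, True, True, False, True]))
```

Under this definition an agent that always passes (or always fails) scores a consistency of 1.0, while a coin-flip agent scores 0.5.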

## How it works

1. Define tasks in YAML -- description, repo, test command, timeout
2. Isolated execution -- each agent runs in a fresh git worktree (no Docker needed)
3. Run agents -- Claude Code, Aider, or any CLI agent via adapters
4. Judge results -- test commands (exit 0 = pass) or LLM-as-judge
5. Report -- pass rate, avg time, avg cost, consistency across runs
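The worktree-isolation and exit-code-judging steps above can be sketched roughly as follows. The function name and layout are illustrative, not agent-eval's actual internals:

```python
import os
import subprocess
import tempfile


def run_task_in_worktree(repo: str, agent_cmd: list[str],
                         test_cmd: str, timeout: int) -> bool:
    """Run one agent attempt in a fresh git worktree; judge by test exit code."""
    parent = tempfile.mkdtemp(prefix="agent-eval-")
    workdir = os.path.join(parent, "wt")
    # A fresh detached worktree keeps the agent's edits out of the main checkout.
    subprocess.run(["git", "-C", repo, "worktree", "add", "--detach", workdir],
                   check=True)
    try:
        # Let the agent modify the code (failures here still get judged below).
        subprocess.run(agent_cmd, cwd=workdir, timeout=timeout, check=False)
        # Judge: test command exit code 0 = pass.
        result = subprocess.run(test_cmd, shell=True, cwd=workdir, timeout=timeout)
        return result.returncode == 0
    finally:
        subprocess.run(["git", "-C", repo, "worktree", "remove", "--force", workdir],
                       check=False)
```

Because worktrees share the repository's object store, each run starts from a clean checkout without a full clone or a container image.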

## Supported agents

| Agent       | CLI                   | Status     |
|-------------|-----------------------|------------|
| Claude Code | `claude`              | Supported  |
| Aider       | `aider`               | Supported  |
| Custom      | Extend `AgentAdapter` | Extensible |
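Custom agents plug in by extending `AgentAdapter`. The base class's real interface isn't documented here, so the method name and signature below are assumptions -- a sketch of the shape such an adapter might take:

```python
import subprocess


class AgentAdapter:
    """Assumed base interface: run an agent against a prompt in a workdir."""

    def run(self, prompt: str, workdir: str) -> None:
        raise NotImplementedError


class ShellAgent(AgentAdapter):
    """Hypothetical adapter wrapping any CLI agent that takes the
    task prompt as its first argument."""

    def __init__(self, binary: str):
        self.binary = binary

    def run(self, prompt: str, workdir: str) -> None:
        # Invoke the agent CLI inside the task's worktree.
        subprocess.run([self.binary, prompt], cwd=workdir, check=True)
```

An adapter like this is all that's needed to put a new CLI agent into the same pass-rate/cost/time comparison as the built-in ones.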

## Task format

```yaml
name: "My Eval Suite"
tasks:
  - id: add-auth
    description: "Add JWT auth middleware to the Flask app"
    repo: "./my-flask-app"
    test_cmd: "python -m pytest tests/ -v"
    timeout_seconds: 180
    tags: [python, auth]
```
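In Python terms, each task entry maps onto a simple record. The field names come from the YAML above; the dataclass itself (and its defaults) is an illustrative sketch, not agent-eval's internal model:

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One evaluation task, mirroring the YAML fields above."""
    id: str
    description: str
    repo: str
    test_cmd: str
    timeout_seconds: int = 180  # assumed default, for illustration
    tags: list[str] = field(default_factory=list)


task = Task(
    id="add-auth",
    description="Add JWT auth middleware to the Flask app",
    repo="./my-flask-app",
    test_cmd="python -m pytest tests/ -v",
    tags=["python", "auth"],
)
```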

## vs SWE-bench

|                         | SWE-bench                     | agent-eval               |
|-------------------------|-------------------------------|--------------------------|
| Tasks                   | Fixed dataset (GitHub issues) | Your custom tasks        |
| Setup                   | Docker, heavy infra           | Git worktrees, no Docker |
| Cost tracking           | No                            | Yes                      |
| Consistency measurement | No (1 run)                    | Yes (N runs, std dev)    |
| Time to first result    | Hours                         | Minutes                  |

## CLI reference

```sh
agent-eval run --tasks YAML --agents NAME,NAME --runs N --concurrency N --output PATH
agent-eval report --input PATH
agent-eval list-agents
```

## License

MIT
