A public, local-first starter kit for building reproducible benchmark suites for AI agents.
The goal is not to create another leaderboard. The goal is to help agent builders answer one practical question:
When I change my agent setup — model, prompt, memory, tools, MCP, planner, critic loop, or multi-agent scaffold — did it actually get better?
Agent Bench Lab is designed around repeatable task families, versioned fixtures, deterministic or semi-deterministic scoring, trace logging, and anti-overfitting controls.
Agent Bench Lab is not limited to coding-agent benchmarks.
It is a canonical benchmark framework for any repeatable agent task family where the result can be checked with deterministic, semi-deterministic, state-based, artifact-based, trace-based, or rubric-assisted scoring.
Supported task families may include:
- code and repository repair;
- docs, knowledge-base, and source-grounded research tasks;
- spreadsheets, data analysis, and reporting tasks;
- support inbox and customer-service workflows;
- ticket triage and task-board updates;
- browser workflows over frozen or self-hosted snapshots;
- internal API and tool-use workflows;
- memory and personalization tasks;
- security, prompt-injection, and policy-compliance tasks;
- customer-specific private holdout checks.
The common unit is not "coding task" or "office task". The common unit is:
task family + fixtures + allowed tools + expected artifact/state + scorer + run comparison
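As a rough illustration of that unit, the sketch below models one benchmark case as a frozen dataclass. The class name, field names, the fixture path, and the artifact name are all assumptions made for this example; they are not the repository's JSON schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCase:
    """One benchmark unit. Field names are illustrative, not the repo's schema."""

    task_family: str                     # e.g. "IF-01"
    case_id: str                         # e.g. "case_001"
    fixtures: tuple[str, ...]            # versioned input files (hypothetical paths)
    allowed_tools: tuple[str, ...]       # tools the agent may call
    expected_artifacts: tuple[str, ...]  # files the agent must produce
    scorer: str                          # scorer contract name

    def key(self) -> tuple[str, str]:
        """Identity used to pair runs of the same case across agent setups."""
        return (self.task_family, self.case_id)


case = TaskCase(
    task_family="IF-01",
    case_id="case_001",
    fixtures=("fixtures/public/IF-01/case_001",),  # assumed layout
    allowed_tools=(),
    expected_artifacts=("output.json",),           # assumed artifact name
    scorer="artifact_exact",
)
print(case.key())
```

Keeping the unit immutable makes it harder to change a case between the baseline run and the candidate run, which is what run comparison depends on.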
The public v0/v0.7 implementation includes a small starter suite, five hardened task-family patterns, lifecycle gates, and a public-safe research radar. The framework is intentionally broader than the implemented starter cases.
Agent Bench Lab is the benchmark standard layer.
Consumer applications may use Agent Bench Lab to run benchmark suites inside a product, workflow, CLI, dashboard, or customer-facing experience. Consumer applications should not define a separate benchmark system when they can consume Agent Bench Lab task families, scorer interfaces, run records, and comparison protocols.
Recommended boundary:
- Agent Bench Lab owns task-family definitions, schemas, scorer conventions, run records, comparison protocol, and public/private benchmark rules.
- Private Eval Layer owns protected holdouts, answer keys, hidden labels, customer-specific checks, canaries, and private scorer configs.
- Consumer applications own product UX, onboarding, agent setup management, access control, task delivery, artifact upload, result presentation, and customer workflows.
Agent Bench Lab should not need to know which consumer application is using it.
Agent Bench Lab should define how benchmarks work without storing protected evaluation content.
The Private Eval Layer holds hidden labels, private holdouts, answer keys, protected scorer configs, canaries, customer-specific checks, and redaction rules outside the public repo. Scorers should use reusable contracts such as artifact_exact, schema_contract, numeric_metric, state_diff, claim_rubric, trace_policy, and security_leak instead of inventing a new hidden-check format per task family.
See Private Eval Layer, Scorer type contracts, and Reporting and feedback.
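To make the idea of a reusable scorer contract concrete, here is a minimal sketch of what an `artifact_exact` style scorer could look like. The `Scorer` protocol, the `ArtifactExact` class, the `expected` config key, and the shape of the result dictionary are assumptions for illustration only; the actual contracts are defined in the scorer type documentation.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class Scorer(Protocol):
    """Hypothetical shared interface: every scorer takes artifacts plus a config."""

    def score(self, artifacts_dir: Path, config: dict) -> dict: ...


class ArtifactExact:
    """Sketch of an exact-match scorer: each expected file must match its text."""

    def score(self, artifacts_dir: Path, config: dict) -> dict:
        expected = config["expected"]  # assumed shape: {file name: expected text}
        failures = []
        for name, want in expected.items():
            path = artifacts_dir / name
            if not path.is_file():
                failures.append(f"missing: {name}")
            elif path.read_text(encoding="utf-8") != want:
                failures.append(f"mismatch: {name}")
        passed = len(expected) - len(failures)
        return {
            "scorer": "artifact_exact",
            "score": passed / len(expected) if expected else 0.0,
            "failures": failures,
        }


# Demo: one artifact is correct, one is missing.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "answer.txt").write_text("42\n", encoding="utf-8")
    config = {"expected": {"answer.txt": "42\n", "notes.txt": "ok\n"}}
    result = ArtifactExact().score(out, config)
print(result)
```

In this sketch the expected values live in the config passed to the scorer, so a private config can be swapped in without changing the scorer code.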
After the first five decision-grade public patterns, v0.6 adds standard-layer gates instead of another task family.
Lifecycle metadata declares whether each task family is experimental, decision-grade, verified, or deprecated. Hardening metadata declares mutation smoke scripts and exploit smoke status for decision-grade families. No task is marked verified yet.
make lifecycle-check
make mutation-smoke
make hardening-check

See Benchmark lifecycle, Mutation and exploit gates, Suite strategy, and Report schema v1 guidance.
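The kind of rule a lifecycle or hardening gate enforces can be sketched as below. This is not the logic behind the Make targets; the metadata keys (`status`, `mutation_smoke`, `exploit_smoke`), the family IDs, and the script path are hypothetical.

```python
ALLOWED_STATUSES = {"experimental", "decision-grade", "verified", "deprecated"}


def check_lifecycle(families: dict[str, dict]) -> list[str]:
    """Return human-readable errors; an empty list means the gate passes."""
    errors = []
    for family_id, meta in sorted(families.items()):
        status = meta.get("status")
        if status not in ALLOWED_STATUSES:
            errors.append(f"{family_id}: unknown status {status!r}")
            continue
        if status == "decision-grade":
            # Decision-grade families must declare their hardening metadata.
            if not meta.get("mutation_smoke"):
                errors.append(f"{family_id}: missing mutation smoke script")
            if "exploit_smoke" not in meta:
                errors.append(f"{family_id}: missing exploit smoke status")
    return errors


# Hypothetical metadata for three made-up families.
families = {
    "DEMO-01": {
        "status": "decision-grade",
        "mutation_smoke": "scripts/demo01_mutation_smoke.py",
        "exploit_smoke": "passing",
    },
    "DEMO-02": {"status": "decision-grade", "exploit_smoke": "not-run"},
    "DEMO-03": {"status": "stable"},
}
errors = check_lifecycle(families)
for line in errors:
    print(line)
```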
Research Radar keeps Agent Bench Lab aligned with external benchmark and eval methodology without turning the repo into a news feed.
It tracks benchmark mechanics: oracles, hidden splits, replay, trace policy, scoring contracts, exploitability, contamination, standards, and eval-framework changes.
research/
Public research/ files contain watchlists, source maps, queries, and daily/weekly templates only. Raw feeds, private notes, customer observations, private holdouts, and protected scorer details stay out of the public repo.
See Research Radar and research/README.md.
This repository is a v0 public starter. It contains:
- public task-card templates;
- a small core-suite config;
- JSON schemas for tasks, runs, traces, and scores;
- minimal Python CLI scaffolding;
- sample public fixtures;
- sample scorers plus hardened IF-01, DATA-01, DOC-01, SUP-01, and API-01 artifact/state-based scorers;
- a local command-based runner for external agent setups;
- documentation for benchmark design, metrics, anti-overfitting, lifecycle status, hardening gates, and research radar process.
It intentionally does not contain private holdout tasks, production secrets, personal data, or benchmark answers for real evaluation runs.
Release status: v0.7.0 is the latest published release and added Research Radar. The main branch now includes the Local Agent Runner MVP; v0.7.1 is intended to stabilize the runner contract before any v0.8 direction is selected.
Most agent demos prove that an agent can succeed once. Product work needs stronger evidence:
- Can it succeed repeatedly?
- Does it still work after task mutations?
- Does the improvement generalize to hidden variants?
- Did latency or cost increase?
- Did it use tools safely?
- Did memory help or pollute the result?
- Did a critic loop improve quality or just add expensive theatre?
Use the same task families across different agent setups:
Setup A: model + system prompt + tools, no memory
Setup B: same setup, but with memory
Setup C: same setup, but with reviewer loop
Then compare them on the same seeds and hidden variants.
same task family + same scoring + controlled setup change = useful comparison
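A paired comparison of this kind can be sketched in a few lines. The function below is an illustration, not the logic of `agent-bench compare`; it assumes each run is summarized as a mapping from a hypothetical `(task, case, seed)` key to a score in [0, 1].

```python
from statistics import mean


def paired_deltas(baseline: dict, candidate: dict) -> dict:
    """Compare two setups only on the (task, case, seed) keys both have run."""
    shared = sorted(baseline.keys() & candidate.keys())
    deltas = [candidate[k] - baseline[k] for k in shared]
    return {
        "pairs": len(shared),
        # Cases present in only one run cannot be compared fairly.
        "unpaired": len(baseline.keys() ^ candidate.keys()),
        "mean_delta": mean(deltas) if deltas else 0.0,
        "wins": sum(d > 0 for d in deltas),
        "losses": sum(d < 0 for d in deltas),
        "ties": sum(d == 0 for d in deltas),
    }


# Made-up scores for Setup A (baseline) and Setup B (candidate).
setup_a = {
    ("IF-01", "case_001", 1): 0.5,
    ("IF-01", "case_002", 1): 1.0,
    ("DATA-01", "case_001", 1): 0.75,
}
setup_b = {
    ("IF-01", "case_001", 1): 1.0,
    ("IF-01", "case_002", 1): 1.0,
    ("DATA-01", "case_001", 1): 0.5,
    ("DOC-01", "case_001", 1): 1.0,  # no baseline counterpart, so unpaired
}
summary = paired_deltas(setup_a, setup_b)
print(summary)
```

Reporting wins, losses, and ties next to the mean delta helps show whether an average improvement comes from many cases or from one outlier.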
This repo is public. Treat public files as examples and templates.
For serious evaluation, keep these outside the public repo:
- private hidden fixtures;
- private holdout seeds;
- real benchmark answers;
- traces from commercial or personal tasks;
- API keys and provider metadata;
- user data;
- production prompts that should not be public.
The .gitignore includes private/, runs/, artifacts/, traces/, and common secret files by default.
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
agent-bench list-tasks
agent-bench validate

Create public sample artifacts and run scoring smoke tests:
python3 scripts/create_sample_artifacts.py
agent-bench score --task IF-01 --case case_001 --artifacts examples/artifacts/IF-01/case_001
agent-bench score --task DATA-01 --case case_001 --artifacts examples/artifacts/DATA-01/case_001
python3 scripts/public_leak_check.py .

Without installing the package, use the source-tree Make targets:
make validate
make test
make smoke
make lifecycle-check
make mutation-smoke
make hardening-check
make leak-check

The examples directory intentionally starts mostly empty. Generated artifacts under examples/artifacts/ are ignored by git except for the README placeholder.
Use agent-bench run to hand an agent-visible task packet to any local command and score the artifacts it writes:
agent-bench run \
  --task IF-01 \
  --case case_001 \
  --agent-cmd "python3 scripts/mock_agent_write_artifacts.py" \
  --out runs/manual/mock/IF-01_case_001

The command receives AGENT_BENCH_TASK_PACKET and AGENT_BENCH_ARTIFACTS_DIR. It should write final artifacts to the artifacts directory. The runner then writes run.json, trace.jsonl, and score.json.
The task packet excludes scorer-only files such as check_config.json, answer keys, hidden labels, private scorer configs, canaries, and expected values. The scorer still reads the original fixture and the produced artifacts.
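An agent command that follows this contract could start from a stub like the one below. The two environment variable names come from the text above; everything else is an assumption, including whether the packet variable points at a file or a directory (the stub handles both) and the `agent_notes.txt` artifact name.

```python
import tempfile
from pathlib import Path
from typing import Mapping


def run_agent(environ: Mapping[str, str]) -> Path:
    """Read the task packet location, write one artifact, return its path."""
    packet = Path(environ["AGENT_BENCH_TASK_PACKET"])
    artifacts_dir = Path(environ["AGENT_BENCH_ARTIFACTS_DIR"])
    artifacts_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: the packet is either a single file or a directory of files.
    if packet.is_dir():
        visible = sorted(
            p.relative_to(packet).as_posix()
            for p in packet.rglob("*")
            if p.is_file()
        )
    else:
        visible = [packet.name]

    # A real agent would do its work here; this stub only records what it saw.
    out = artifacts_dir / "agent_notes.txt"  # hypothetical artifact name
    out.write_text("\n".join(visible) + "\n", encoding="utf-8")
    return out


# Local demo with a fake packet, so the sketch runs without the real runner.
with tempfile.TemporaryDirectory() as tmp:
    packet_dir = Path(tmp) / "packet"
    packet_dir.mkdir()
    (packet_dir / "prompt.md").write_text("Do the task.\n", encoding="utf-8")
    env = {
        "AGENT_BENCH_TASK_PACKET": str(packet_dir),
        "AGENT_BENCH_ARTIFACTS_DIR": str(Path(tmp) / "artifacts"),
    }
    written = run_agent(env).read_text(encoding="utf-8")
print(written)
```

When saved as a script and passed through `--agent-cmd`, the demo section would be replaced by a call to `run_agent(os.environ)`.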
Create two local smoke-run directories and compare them:
make compare-smoke

Or run the commands directly:
python3 scripts/create_sample_runs.py
agent-bench compare \
  --baseline runs/baseline \
  --candidate runs/spec_first \
  --out reports/generated/compare_baseline_vs_spec_first.md \
  --csv reports/generated/compare_baseline_vs_spec_first.csv

The comparison is paired: same task, same case, same scorer, different agent config. Public runs are smoke tests only; decision-grade evaluation requires private holdout cases outside the public repo.
IF-01 is the first hardened task-family pattern. It uses public synthetic cases, deterministic check_config.json files, critical violation caps, mutation support, and tests for strict artifact-contract compliance. See IF-01 decision-grade pattern.
make if01-smoke

DATA-01 is the second hardened task-family pattern. It uses synthetic CSV/SQLite fixtures, deterministic metrics.json, factual report.md, checked chart_spec.json, mutation support, and tests for exact data work without relying on a visual PNG oracle. See DATA-01 decision-grade pattern.
make data01-smoke

DOC-01 is the third hardened task-family pattern. It uses synthetic fixed-corpus documents, deterministic answer.md, checked citations.json, checked claims.json, mutation support, and tests for grounded answers without relying on live web or an LLM judge. See DOC-01 decision-grade pattern.
make doc01-smoke

SUP-01 is the fourth hardened task-family pattern and the first operational/customer-style workflow. It uses synthetic support inboxes, deterministic triage.json, checked drafts.json, checked escalations.json, decision_log.md, mutation support, and tests for policy-grounded replies without live inbox, browser, SaaS, or real customer data. See SUP-01 decision-grade pattern.
make sup01-smoke

API-01 is the fifth hardened task-family pattern and the first local internal API/tool-registry workflow. It uses synthetic API catalogs, local state fixtures, deterministic tool_calls.json, checked result.json, decision_log.md, scorer-side state simulation, mutation support, and tests for forbidden-tool avoidance without live SaaS, MCP, browser, or real APIs. See API-01 decision-grade pattern.
make api01-smoke

The recommended v0 core suite has seven task families:
| ID | Task | Capability |
|---|---|---|
| CODE-01 | Local regression patch | coding + test discipline |
| TERM-02 | Log-driven config repair | terminal/debugging |
| APP-04 | Airline rebooking under policy | stateful tools + policy |
| DATA-01 | CSV + SQL memo | exact data work + concise reporting |
| DOC-01 | Fixed-corpus grounded answer | citations + unsupported-claim checks |
| IF-01 | Long spec contract artifact | strict instruction following |
| SEC-01 | Hidden prompt injection in HTML/email | security + tool-output trust boundary |
The initial core suite is a starter set for proving the runner/scorer/compare loop. It is not the full scope of Agent Bench Lab and should not be interpreted as coding-first. Future task families can cover support, knowledge work, spreadsheets, browser workflows, ticketing, internal APIs, and customer-specific private checks using the same task/scorer/run model.
SUP-01 is intentionally not added to configs/suites/core.json by default. Operational/customer-style workflows start in:
configs/suites/ops-local.json
This keeps core focused while allowing support and ticketing tasks to grow under an ops-oriented local suite.
API-01 is intentionally not added to configs/suites/core.json by default. Local tool/API workflows start in:
configs/suites/tools-local.json
This keeps live-service-free API/tool reasoning separate from the fast starter core and from operational support workflows.
agent-bench-lab/
  configs/               suite and agent config examples
  docs/                  public documentation
  fixtures/public/       public example fixtures only
  private/               gitignored private holdouts, if created locally
  schemas/               JSON schemas
  src/agent_bench_lab/   CLI and local harness skeleton
  tasks/                 task cards, prompts, and scorer modules
  examples/artifacts/    local generated artifacts for smoke tests
  scripts/               helper scripts
- Prefer local fixtures over live services.
- Prefer exact/state-based scoring over subjective judging.
- Keep hidden holdouts separate from public examples.
- Log traces, costs, latency, and tool calls.
- Compare paired runs on the same seeds.
- Treat safety and policy violations as hard gates where appropriate.
- Do not tune prompts on the same cases used for final comparison.
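The hard-gate principle in the list above can be illustrated with a small capping function. This is a sketch under assumptions: the `severity` field, the violation IDs, and the default cap of 0.0 are made up for the example and do not describe how the repository's scorers apply critical violation caps.

```python
def gated_score(raw_score: float, violations: list[dict], cap: float = 0.0) -> float:
    """Cap the score when any violation is marked critical (hypothetical field)."""
    if any(v.get("severity") == "critical" for v in violations):
        return min(raw_score, cap)
    return raw_score


# A high raw score does not survive a critical safety or policy violation.
blocked = gated_score(0.9, [{"id": "forbidden_tool", "severity": "critical"}])
# Minor issues leave the raw score untouched in this sketch.
allowed = gated_score(0.9, [{"id": "style_nit", "severity": "minor"}])
# A softer gate caps the score instead of zeroing it.
softened = gated_score(0.9, [{"id": "forbidden_tool", "severity": "critical"}], cap=0.3)
print(blocked, allowed, softened)
```

Treating the gate as a cap rather than a weighted penalty means an agent cannot offset a safety failure by doing well on the rest of the task.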
- Documentation index
- Canonical scope and consumer boundary
- Private Eval Layer
- Scorer type contracts
- Reporting and feedback
- Task authoring
- Public/private split
- Run records
- Comparing setups
- IF-01 decision-grade pattern
- DATA-01 decision-grade pattern
- DOC-01 decision-grade pattern
- SUP-01 decision-grade pattern
- API-01 decision-grade pattern
- Local Agent Runner MVP
- Public release checklist
- v0 roadmap
MIT. See LICENSE.