Agent Bench Lab

A public, local-first starter kit for building reproducible benchmark suites for AI agents.

The goal is not to create another leaderboard. The goal is to help agent builders answer one practical question:

When I change my agent setup — model, prompt, memory, tools, MCP, planner, critic loop, or multi-agent scaffold — did it actually get better?

Agent Bench Lab is designed around repeatable task families, versioned fixtures, deterministic or semi-deterministic scoring, trace logging, and anti-overfitting controls.

Scope: all agent task families

Agent Bench Lab is not limited to coding-agent benchmarks.

It is a canonical benchmark framework for any repeatable agent task family where the result can be checked with deterministic, semi-deterministic, state-based, artifact-based, trace-based, or rubric-assisted scoring.

Supported task families may include:

code and repository repair;
docs, knowledge-base, and source-grounded research tasks;
spreadsheets, data analysis, and reporting tasks;
support inbox and customer-service workflows;
ticket triage and task-board updates;
browser workflows over frozen or self-hosted snapshots;
internal API and tool-use workflows;
memory and personalization tasks;
security, prompt-injection, and policy-compliance tasks;
customer-specific private holdout checks.

The common unit is not "coding task" or "office task". The common unit is:

task family + fixtures + allowed tools + expected artifact/state + scorer + run comparison
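As a rough illustration, the unit above can be modeled as one immutable record, so two runs count as comparable only when every part of the unit matches. This is a sketch, not the repository's schema: the class name, the field names, and the sample values (`file_write`, `output.json`) are all assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkUnit:
    """One comparable benchmark unit. All field names are illustrative."""

    task_family: str                   # e.g. "IF-01"
    fixtures: tuple[str, ...]          # fixture ids visible to the agent
    allowed_tools: tuple[str, ...]     # tools the agent may call
    expected_outputs: tuple[str, ...]  # artifact names or state keys the scorer checks
    scorer: str                        # scorer contract name, e.g. "artifact_exact"

    def comparable_with(self, other: BenchmarkUnit) -> bool:
        """Two runs are comparable only if the whole unit is identical."""
        return self == other


# Hypothetical values, for illustration only.
unit_a = BenchmarkUnit("IF-01", ("case_001",), ("file_write",), ("output.json",), "artifact_exact")
unit_b = BenchmarkUnit("IF-01", ("case_001",), ("file_write",), ("output.json",), "artifact_exact")
unit_c = BenchmarkUnit("IF-01", ("case_002",), ("file_write",), ("output.json",), "artifact_exact")

print(unit_a.comparable_with(unit_b))  # same unit
print(unit_a.comparable_with(unit_c))  # different fixture, so not comparable
```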

The public v0/v0.7 implementation includes a small starter suite, five hardened task-family patterns, lifecycle gates, and a public-safe research radar. The framework is intentionally broader than the implemented starter cases.

Relationship to consumer applications

Agent Bench Lab is the benchmark standard layer.

Consumer applications may use Agent Bench Lab to run benchmark suites inside a product, workflow, CLI, dashboard, or customer-facing experience. Consumer applications should not define a separate benchmark system when they can consume Agent Bench Lab task families, scorer interfaces, run records, and comparison protocols.

Recommended boundary:

Agent Bench Lab owns task-family definitions, schemas, scorer conventions, run records, comparison protocol, and public/private benchmark rules.
Private Eval Layer owns protected holdouts, answer keys, hidden labels, customer-specific checks, canaries, and private scorer configs.
Consumer applications own product UX, onboarding, agent setup management, access control, task delivery, artifact upload, result presentation, and customer workflows.

Agent Bench Lab should not need to know which consumer application is using it.

Private eval and scorer contracts

Agent Bench Lab should define how benchmarks work without storing protected evaluation content.

The Private Eval Layer holds hidden labels, private holdouts, answer keys, protected scorer configs, canaries, customer-specific checks, and redaction rules outside the public repo. Scorers should use reusable contracts such as artifact_exact, schema_contract, numeric_metric, state_diff, claim_rubric, trace_policy, and security_leak instead of inventing a new hidden-check format per task family.
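One way to picture reusable scorer contracts is a small registry keyed by contract name. The contract names below come from this README; the function signatures, the dictionary shapes, and the `register` helper are assumptions made for this sketch and may not match the real scorer interfaces.

```python
from typing import Callable, Dict

# ASSUMPTION: a scorer takes (produced, expected) dicts and returns a score in [0, 1].
Scorer = Callable[[dict, dict], float]
SCORERS: Dict[str, Scorer] = {}


def register(name: str) -> Callable[[Scorer], Scorer]:
    """Register a scorer under a contract name (hypothetical helper)."""
    def wrap(fn: Scorer) -> Scorer:
        SCORERS[name] = fn
        return fn
    return wrap


@register("artifact_exact")
def artifact_exact(produced: dict, expected: dict) -> float:
    # Passes only when the produced artifact equals the expected one exactly.
    return 1.0 if produced == expected else 0.0


@register("security_leak")
def security_leak(produced: dict, expected: dict) -> float:
    # Fails if any canary string appears in the produced text.
    text = produced.get("text", "")
    canaries = expected.get("canaries", [])
    return 0.0 if any(canary in text for canary in canaries) else 1.0


def score(contract: str, produced: dict, expected: dict) -> float:
    """Dispatch to a registered contract; unknown names are an error."""
    if contract not in SCORERS:
        raise KeyError(f"unknown scorer contract: {contract}")
    return SCORERS[contract](produced, expected)
```

With this shape, a new task family picks an existing contract name rather than defining its own hidden-check format.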

See Private Eval Layer, Scorer type contracts, and Reporting and feedback.

Benchmark lifecycle and hardening gates

After the first five decision-grade public patterns, v0.6 adds standard-layer gates instead of another task family.

Lifecycle metadata declares whether each task family is experimental, decision-grade, verified, or deprecated. Hardening metadata declares mutation smoke scripts and exploit smoke status for decision-grade families. No task is marked verified yet.

make lifecycle-check
make mutation-smoke
make hardening-check
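The kind of rule a lifecycle gate enforces can be sketched in a few lines. The four status values come from this README; the metadata layout, the key names (`hardening`, `mutation_smoke_script`, `exploit_smoke_status`), and the family ids are hypothetical, and this is not the implementation behind the Make targets.

```python
ALLOWED_STATUSES = {"experimental", "decision-grade", "verified", "deprecated"}

# ASSUMPTION: decision-grade families must declare both of these hardening keys.
REQUIRED_HARDENING_KEYS = ("mutation_smoke_script", "exploit_smoke_status")


def lifecycle_errors(families: dict) -> list:
    """Return human-readable problems found in lifecycle/hardening metadata."""
    errors = []
    for family_id, meta in sorted(families.items()):
        status = meta.get("status")
        if status not in ALLOWED_STATUSES:
            errors.append(f"{family_id}: unknown status {status!r}")
            continue
        if status == "decision-grade":
            hardening = meta.get("hardening", {})
            for key in REQUIRED_HARDENING_KEYS:
                if not hardening.get(key):
                    errors.append(f"{family_id}: missing hardening.{key}")
    return errors


# Hypothetical metadata; family ids and the script path are made up.
families = {
    "DEMO-01": {
        "status": "decision-grade",
        "hardening": {
            "mutation_smoke_script": "scripts/demo01_mutation_smoke.py",
            "exploit_smoke_status": "passing",
        },
    },
    "DEMO-02": {"status": "experimental"},
    "DEMO-03": {"status": "decision-grade", "hardening": {}},
}

for problem in lifecycle_errors(families):
    print(problem)
```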

See Benchmark lifecycle, Mutation and exploit gates, Suite strategy, and Report schema v1 guidance.

Research Radar

Research Radar keeps Agent Bench Lab aligned with external benchmark and eval methodology without turning the repo into a news feed.

It tracks benchmark mechanics: oracles, hidden splits, replay, trace policy, scoring contracts, exploitability, contamination, standards, and eval-framework changes.

research/

Public research/ files contain watchlists, source maps, queries, and daily/weekly templates only. Raw feeds, private notes, customer observations, private holdouts, and protected scorer details stay out of the public repo.

See Research Radar and research/README.md.

Current status

This repository is a v0 public starter. It contains:

public task-card templates;
a small core-suite config;
JSON schemas for tasks, runs, traces, and scores;
minimal Python CLI scaffolding;
sample public fixtures;
sample scorers plus hardened IF-01, DATA-01, DOC-01, SUP-01, and API-01 artifact/state-based scorers;
a local command-based runner for external agent setups;
documentation for benchmark design, metrics, anti-overfitting, lifecycle status, hardening gates, and research radar process.

It intentionally does not contain private holdout tasks, production secrets, personal data, or benchmark answers for real evaluation runs.

Release status: v0.7.0 is the latest published release; it added Research Radar. The main branch now includes the Local Agent Runner MVP. v0.7.1 is intended to stabilize the runner contract before any v0.8 direction is selected.

Why this exists

Most agent demos prove that an agent can succeed once. Product work needs stronger evidence:

Can it succeed repeatedly?
Does it still work after task mutations?
Does the improvement generalize to hidden variants?
Did latency or cost increase?
Did it use tools safely?
Did memory help or pollute the result?
Did a critic loop improve quality or just add expensive theatre?

Core idea

Use the same task families across different agent setups:

Setup A: model + system prompt + tools, no memory
Setup B: same setup, but with memory
Setup C: same setup, but with reviewer loop

Then compare them on the same seeds and hidden variants.

same task family + same scoring + controlled setup change = useful comparison
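A comparison is only "controlled" if the two setups differ in the one thing under test. A minimal sketch of that check follows, assuming each setup is described by a flat dictionary; the keys and values are illustrative and do not reflect the repository's agent config format.

```python
def changed_keys(baseline: dict, candidate: dict) -> list:
    """List the config keys whose values differ between two setups."""
    keys = set(baseline) | set(candidate)
    return sorted(k for k in keys if baseline.get(k) != candidate.get(k))


# Hypothetical setup descriptions mirroring Setup A, B, and C above.
setup_a = {
    "model": "model-x",
    "system_prompt": "v1",
    "tools": ["search"],
    "memory": False,
    "reviewer_loop": False,
}
setup_b = {**setup_a, "memory": True}
setup_c = {**setup_a, "reviewer_loop": True}

print(changed_keys(setup_a, setup_b))  # ['memory']
print(changed_keys(setup_b, setup_c))  # ['memory', 'reviewer_loop']
```

Comparing B against C changes two things at once, so any score difference cannot be attributed to memory or to the reviewer loop alone; comparing each against A keeps the change isolated.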

Public/private split

This repo is public. Treat public files as examples and templates.

For serious evaluation, keep these outside the public repo:

private hidden fixtures;
private holdout seeds;
real benchmark answers;
traces from commercial or personal tasks;
API keys and provider metadata;
user data;
production prompts that should not be public.

The .gitignore includes private/, runs/, artifacts/, traces/, and common secret files by default.

Quick start

python3 -m venv .venv
source .venv/bin/activate
pip install -e .
agent-bench list-tasks
agent-bench validate

Create public sample artifacts and run scoring smoke tests:

python3 scripts/create_sample_artifacts.py
agent-bench score --task IF-01 --case case_001 --artifacts examples/artifacts/IF-01/case_001
agent-bench score --task DATA-01 --case case_001 --artifacts examples/artifacts/DATA-01/case_001
python3 scripts/public_leak_check.py .

Without installing the package, use the source-tree Make targets:

make validate
make test
make smoke
make lifecycle-check
make mutation-smoke
make hardening-check
make leak-check

The examples directory intentionally starts mostly empty. Generated artifacts under examples/artifacts/ are ignored by git except for the README placeholder.

Run an external agent setup

Use agent-bench run to hand an agent-visible task packet to any local command and score the artifacts it writes:

agent-bench run \
  --task IF-01 \
  --case case_001 \
  --agent-cmd "python3 scripts/mock_agent_write_artifacts.py" \
  --out runs/manual/mock/IF-01_case_001

The command receives AGENT_BENCH_TASK_PACKET and AGENT_BENCH_ARTIFACTS_DIR. It should write final artifacts to the artifacts directory. The runner then writes run.json, trace.jsonl, and score.json.
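A minimal agent command might look like the sketch below. The two environment variable names come from this README. Everything else is an assumption: the sketch treats AGENT_BENCH_TASK_PACKET as a path to a JSON file, and the `task_id` key and the `answer.json` artifact name are hypothetical. Check the Local Agent Runner MVP documentation for the actual packet format.

```python
import json
import os
from pathlib import Path


def write_artifacts(environ=os.environ) -> Path:
    """Read the task packet and write one placeholder artifact."""
    # ASSUMPTION: the packet variable holds a path to a JSON file.
    packet_path = Path(environ["AGENT_BENCH_TASK_PACKET"])
    artifacts_dir = Path(environ["AGENT_BENCH_ARTIFACTS_DIR"])

    packet = json.loads(packet_path.read_text(encoding="utf-8"))
    artifacts_dir.mkdir(parents=True, exist_ok=True)

    # A real agent would do its work here; this sketch writes a stub result.
    # "answer.json" and "task_id" are hypothetical names.
    out = artifacts_dir / "answer.json"
    payload = {"task_id": packet.get("task_id"), "answer": "TODO"}
    out.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return out


if __name__ == "__main__":
    required = ("AGENT_BENCH_TASK_PACKET", "AGENT_BENCH_ARTIFACTS_DIR")
    if all(name in os.environ for name in required):
        print(write_artifacts())
```

Saved as a script, this could be passed to --agent-cmd in the same way as the mock agent shown above.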

The task packet excludes scorer-only files such as check_config.json, answer keys, hidden labels, private scorer configs, canaries, and expected values. The scorer still reads the original fixture and the produced artifacts.
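The exclusion step can be pictured as a filename filter applied while building the packet. Only check_config.json is named in this README; the other file names in the set below are hypothetical stand-ins for answer keys and hidden labels, and the runner's real filtering logic may differ.

```python
from pathlib import PurePosixPath

# check_config.json is named in the README; the other names are hypothetical.
SCORER_ONLY_NAMES = {"check_config.json", "answer_key.json", "hidden_labels.json"}


def agent_visible(files):
    """Keep only fixture files that are safe to show to the agent."""
    return [f for f in files if PurePosixPath(f).name not in SCORER_ONLY_NAMES]


fixture_files = ["spec.md", "check_config.json", "data/input.csv", "data/answer_key.json"]
print(agent_visible(fixture_files))  # ['spec.md', 'data/input.csv']
```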

See Local Agent Runner MVP.

Compare two agent setups

Create two local smoke-run directories and compare them:

make compare-smoke

Or run the commands directly:

python3 scripts/create_sample_runs.py
agent-bench compare \
  --baseline runs/baseline \
  --candidate runs/spec_first \
  --out reports/generated/compare_baseline_vs_spec_first.md \
  --csv reports/generated/compare_baseline_vs_spec_first.csv

The comparison is paired: same task, same case, same scorer, different agent config. Public runs are smoke tests only; decision-grade evaluation requires private holdout cases outside the public repo.
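The pairing idea can be sketched as follows: scores are keyed by (task, case), deltas are computed only for keys present in both runs, and anything unpaired is reported rather than silently averaged in. The data shapes and the scores are made up, and this is not the implementation behind agent-bench compare.

```python
from statistics import mean


def paired_deltas(baseline: dict, candidate: dict) -> dict:
    """Compare two runs on the (task, case) pairs they share."""
    shared = sorted(set(baseline) & set(candidate))
    if not shared:
        raise ValueError("no shared (task, case) pairs to compare")
    deltas = {key: candidate[key] - baseline[key] for key in shared}
    return {
        "pairs": len(shared),
        "unpaired": sorted(set(baseline) ^ set(candidate)),
        "mean_delta": mean(deltas.values()),
        "wins": sum(d > 0 for d in deltas.values()),
        "losses": sum(d < 0 for d in deltas.values()),
    }


# Hypothetical scores for two setups on the same cases.
baseline = {("IF-01", "case_001"): 0.5, ("DATA-01", "case_001"): 1.0}
candidate = {("IF-01", "case_001"): 1.0, ("DATA-01", "case_001"): 0.75}

summary = paired_deltas(baseline, candidate)
print(summary["mean_delta"])  # 0.125
```

Reporting wins and losses next to the mean keeps a single large gain from hiding a regression on another case.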

First decision-grade task family: IF-01

IF-01 is the first hardened task-family pattern. It uses public synthetic cases, deterministic check_config.json files, critical violation caps, mutation support, and tests for strict artifact-contract compliance. See IF-01 decision-grade pattern.

make if01-smoke
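A critical violation cap can be illustrated with a small function: the raw score is the fraction of checks passed, but any failed critical check limits the final score to a ceiling. The result shape and the cap values here are assumptions for this sketch, not the IF-01 scorer's actual rules.

```python
def capped_score(results, critical_cap=0.0):
    """Score a list of check results, capping the score on critical failures.

    results: list of dicts with boolean 'passed' and 'critical' keys
    (a hypothetical shape, not the check_config.json format).
    """
    if not results:
        return 0.0
    raw = sum(r["passed"] for r in results) / len(results)
    critical_failed = any(r["critical"] and not r["passed"] for r in results)
    return min(raw, critical_cap) if critical_failed else raw


checks = [
    {"passed": True, "critical": False},
    {"passed": True, "critical": False},
    {"passed": True, "critical": True},
    {"passed": False, "critical": True},  # one critical check failed
]
print(capped_score(checks, critical_cap=0.25))  # 0.25, although 3 of 4 checks passed
```

The cap prevents an agent from earning a high score by satisfying many minor checks while violating a rule that matters.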

Second decision-grade task family: DATA-01

DATA-01 is the second hardened task-family pattern. It uses synthetic CSV/SQLite fixtures, deterministic metrics.json, factual report.md, checked chart_spec.json, mutation support, and tests for exact data work without relying on a visual PNG oracle. See DATA-01 decision-grade pattern.

make data01-smoke
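Checking a metrics.json artifact without a visual oracle can be as simple as comparing named numbers against expected values. The sketch below assumes a flat name-to-number mapping and a relative tolerance; the metric names and values are hypothetical, and the real DATA-01 scorer may apply different rules.

```python
import math


def compare_metrics(produced: dict, expected: dict, rel_tol: float = 1e-9) -> dict:
    """Report metrics that are missing, non-numeric, wrong, or unexpected."""
    mismatches = []
    for name, want in sorted(expected.items()):
        got = produced.get(name)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(got, bool) or not isinstance(got, (int, float)):
            mismatches.append(name)
        elif not math.isclose(got, want, rel_tol=rel_tol):
            mismatches.append(name)
    unexpected = sorted(set(produced) - set(expected))
    return {"mismatches": mismatches, "unexpected": unexpected}


# Hypothetical metrics for illustration.
expected = {"total_revenue": 1250.5, "order_count": 3}
produced = {"total_revenue": 1250.5, "order_count": 4, "extra": 1}

print(compare_metrics(produced, expected))
```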

Third decision-grade task family: DOC-01

DOC-01 is the third hardened task-family pattern. It uses synthetic fixed-corpus documents, deterministic answer.md, checked citations.json, checked claims.json, mutation support, and tests for grounded answers without relying on live web or an LLM judge. See DOC-01 decision-grade pattern.

make doc01-smoke
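A grounded-answer check needs no LLM judge when claims carry explicit citations. The sketch below flags any claim that cites nothing or cites a document outside the fixed corpus. The claim shape (`id`, `citations`) and the document ids are assumptions and may not match the real claims.json or citations.json formats.

```python
def unsupported_claims(claims, corpus_ids):
    """Return ids of claims with no citation or with a citation outside the corpus."""
    flagged = []
    for claim in claims:
        cited = claim.get("citations", [])
        if not cited or any(doc_id not in corpus_ids for doc_id in cited):
            flagged.append(claim["id"])
    return flagged


# Hypothetical corpus and claims.
corpus_ids = {"doc_a", "doc_b"}
claims = [
    {"id": "c1", "citations": ["doc_a"]},
    {"id": "c2", "citations": []},                  # no citation at all
    {"id": "c3", "citations": ["doc_b", "doc_z"]},  # doc_z is not in the corpus
]

print(unsupported_claims(claims, corpus_ids))  # ['c2', 'c3']
```

This only verifies that citations point into the corpus; checking that the cited text actually supports the claim would need an additional rule.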

Fourth decision-grade task family: SUP-01

SUP-01 is the fourth hardened task-family pattern and the first operational/customer-style workflow. It uses synthetic support inboxes, deterministic triage.json, checked drafts.json, checked escalations.json, decision_log.md, mutation support, and tests for policy-grounded replies without live inbox, browser, SaaS, or real customer data. See SUP-01 decision-grade pattern.

make sup01-smoke

Fifth decision-grade task family: API-01

API-01 is the fifth hardened task-family pattern and the first local internal API/tool-registry workflow. It uses synthetic API catalogs, local state fixtures, deterministic tool_calls.json, checked result.json, decision_log.md, scorer-side state simulation, mutation support, and tests for forbidden-tool avoidance without live SaaS, MCP, browser, or real APIs. See API-01 decision-grade pattern.

make api01-smoke
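Scorer-side state simulation can be sketched as replaying the agent's recorded tool calls against a copy of the initial state and noting any forbidden tools along the way. The tool names (`set_field`, `delete_account`), the call shape, and the state layout are all hypothetical; the real tool_calls.json format may differ.

```python
import copy


def replay(tool_calls, initial_state, forbidden):
    """Apply recorded tool calls to a copy of the state; collect forbidden calls."""
    state = copy.deepcopy(initial_state)
    violations = []
    for call in tool_calls:
        name = call["tool"]
        args = call.get("args", {})
        if name in forbidden:
            violations.append(name)
            continue  # forbidden calls are recorded but never applied
        if name == "set_field":  # the only tool this sketch simulates
            state[args["key"]] = args["value"]
    return state, violations


# Hypothetical fixture and recorded calls.
initial_state = {"status": "open"}
tool_calls = [
    {"tool": "set_field", "args": {"key": "status", "value": "closed"}},
    {"tool": "delete_account", "args": {}},
]

final_state, violations = replay(tool_calls, initial_state, forbidden={"delete_account"})
print(final_state)  # {'status': 'closed'}
print(violations)   # ['delete_account']
```

Because the replay runs on the scorer side against local fixtures, the result can be compared with an expected end state without touching any live API.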

Initial core suite

The recommended v0 core suite has seven task families:

ID	Task	Capability
CODE-01	Local regression patch	coding + test discipline
TERM-02	Log-driven config repair	terminal/debugging
APP-04	Airline rebooking under policy	stateful tools + policy
DATA-01	CSV + SQL memo	exact data work + concise reporting
DOC-01	Fixed-corpus grounded answer	citations + unsupported-claim checks
IF-01	Long spec contract artifact	strict instruction following
SEC-01	Hidden prompt injection in HTML/email	security + tool-output trust boundary

The initial core suite is a starter set for proving the runner/scorer/compare loop. It is not the full scope of Agent Bench Lab and should not be interpreted as coding-first. Future task families can cover support, knowledge work, spreadsheets, browser workflows, ticketing, internal APIs, and customer-specific private checks using the same task/scorer/run model.

Operational local suite

SUP-01 is intentionally not added to configs/suites/core.json by default. Operational/customer-style workflows start in:

configs/suites/ops-local.json

This keeps core focused while allowing support and ticketing tasks to grow under an ops-oriented local suite.

Tools local suite

API-01 is intentionally not added to configs/suites/core.json by default. Local tool/API workflows start in:

configs/suites/tools-local.json

This keeps live-service-free API/tool reasoning separate from the fast starter core and from operational support workflows.

Repository layout

agent-bench-lab/
  configs/              suite and agent config examples
  docs/                 public documentation
  fixtures/public/      public example fixtures only
  private/              gitignored private holdouts, if created locally
  schemas/              JSON schemas
  src/agent_bench_lab/  CLI and local harness skeleton
  tasks/                task cards, prompts, and scorer modules
  examples/artifacts/   local generated artifacts for smoke tests
  scripts/              helper scripts

Design rules

Prefer local fixtures over live services.
Prefer exact/state-based scoring over subjective judging.
Keep hidden holdouts separate from public examples.
Log traces, costs, latency, and tool calls.
Compare paired runs on the same seeds.
Treat safety and policy violations as hard gates where appropriate.
Do not tune prompts on the same cases used for final comparison.

Contributor docs

See CONTRIBUTING.md, CODE_OF_CONDUCT.md, and SECURITY.md.

License

MIT. See LICENSE.
