A/B testing platform for Claude agents. Automates the plan → run → evaluate → repeat loop to find optimal agent configurations.
Install with:

```bash
pip install -e .
```

Requires `claude-agent-sdk` (uses a Claude Max subscription; no API key needed).
```bash
# Run a manually written experiment
openbench run experiments/quicktest_model.py

# Automated research from a natural language goal
openbench research "Find the best system prompt for a concise Q&A assistant" --max-iter 3

# View results
openbench list
openbench compare <experiment-name>
```

- Plan — an LLM generates an A/B experiment testing one hypothesis (e.g., a system prompt variant)
- Run — both agents execute every task in isolated temp directories; metrics are collected
- Evaluate — an LLM judge scores each output on quality, accuracy, and conciseness
- Repeat — the winner becomes the new baseline; the next hypothesis is proposed
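The plan → run → evaluate → repeat loop above can be sketched in plain Python. This is an illustrative simulation only: `propose_variant` and `judge_score` are hypothetical stand-ins (here a string tweak and a seeded random score), not openbench's actual API.

```python
# Hypothetical sketch of the plan -> run -> evaluate -> repeat loop.
# propose_variant and judge_score are illustrative stubs, not real openbench calls.
import random


def propose_variant(baseline: dict) -> dict:
    """Plan: derive a config differing from the baseline in exactly one variable."""
    variant = dict(baseline)
    variant["system_prompt"] = baseline["system_prompt"] + " Be concise."
    return variant


def judge_score(config: dict) -> float:
    """Evaluate: stand-in for an LLM judge scoring an agent's outputs."""
    random.seed(config["system_prompt"])  # deterministic stub score per prompt
    return random.random()


def research_loop(baseline: dict, max_iter: int = 3) -> dict:
    for _ in range(max_iter):
        variant = propose_variant(baseline)                  # Plan
        score_a = judge_score(baseline)                      # Run + Evaluate A
        score_b = judge_score(variant)                       # Run + Evaluate B
        if score_b > score_a:                                # Repeat: winner is new baseline
            baseline = variant
    return baseline


best = research_loop({"system_prompt": "You are a Q&A assistant."})
print(best["system_prompt"])
```

The key property the loop preserves is that each iteration changes at most one variable, so any score difference is attributable to that change.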
- Experiment: two agent configs (`agent_a` vs `agent_b`) differing in exactly one variable
- DiffSpec: the single variable being tested (`system_prompt`, `model`, `max_turns`, etc.)
- ResearchProgram: natural language objective driving the auto-loop
- Results persist to `results/<experiment-name>/` as JSONL + metadata JSON
- Human-readable reports in `reports/`
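The Experiment/DiffSpec relationship above can be pictured as plain data shapes. These dataclasses are illustrative only (the field names and the `claude-sonnet`/`claude-haiku` values are assumptions, not openbench's real classes):

```python
# Illustrative data shapes only -- not openbench's actual classes.
from dataclasses import dataclass


@dataclass
class DiffSpec:
    """The single variable an experiment tests."""
    field: str       # e.g. "system_prompt", "model", "max_turns"
    value_a: object  # value used by agent_a
    value_b: object  # value used by agent_b


@dataclass
class Experiment:
    """Two agent configs differing in exactly one variable."""
    name: str
    base_config: dict
    diff: DiffSpec

    def configs(self) -> tuple:
        """Expand the shared base config into the agent_a / agent_b pair."""
        a = {**self.base_config, self.diff.field: self.diff.value_a}
        b = {**self.base_config, self.diff.field: self.diff.value_b}
        return a, b


exp = Experiment(
    name="quicktest_model",
    base_config={"max_turns": 5},
    diff=DiffSpec(field="model", value_a="claude-sonnet", value_b="claude-haiku"),
)
agent_a, agent_b = exp.configs()
```

Deriving both configs from one base plus one `DiffSpec` makes it structurally impossible for the two agents to drift apart in more than one variable.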
```
src/openbench/   # Core library
experiments/     # Example experiment definitions
programs/        # Saved ResearchProgram JSON configs
results/         # Raw trial data (JSONL)
reports/         # Human-readable experiment reports
docs/            # Guides, memory, plans
```
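Since `results/` holds raw trial data as JSONL (one record per line), it can be aggregated with the standard library. A minimal sketch, assuming each record carries hypothetical `"agent"` and `"score"` fields (inspect your own files for the actual schema):

```python
# Hypothetical reader for results/<experiment-name>/ trial records.
# The "agent" and "score" field names are illustrative assumptions.
import json
from collections import defaultdict
from pathlib import Path


def summarize(results_dir: str) -> dict:
    """Average score per agent across every *.jsonl file in a results directory."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for path in Path(results_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            trial = json.loads(line)  # one JSON object per line
            totals[trial["agent"]] += trial["score"]
            counts[trial["agent"]] += 1
    return {agent: totals[agent] / counts[agent] for agent in totals}
```

JSONL keeps trial logging append-only: each finished trial is one `json.dumps(...) + "\n"` write, so a crashed run loses at most the record in flight.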

