zero-agent-bench

A reproducible benchmark of frontier LLMs writing programs in Zero versus Python.

Language: zerolang.ai · Getting started
Upstream: vercel-labs/zero
Pattern skills: HKTITAN/zero-skills

Zero is positioned as "the programming language for agents." This benchmark measures whether that thesis holds: when an LLM is given the same task in both languages, does Zero produce higher pass rates, fewer fix-loop attempts, or fewer output tokens per successful solution?

Companion to vercel-labs/zero#104, which observed that 100% of Zero failures cluster on PAR100 and IMP001. This harness re-runs that experiment across model tiers with a structured fix loop and reports the error distribution per diagnostic code.

What it measures

For each combination of (task, language, model):

Pass rate — did the final attempt produce a program that compiles and passes all hidden test cases?
Attempts to green — how many model turns of (generate → compile → run → feedback) were needed to pass?
Output tokens per success — how many tokens did the model generate in successful runs?
Error-code distribution — which Zero diagnostic codes (PAR100, IMP001, NAM003, …) or Python exception types come up most often before the model recovers?
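
As a rough illustration of how the four metrics above could be derived from per-run records, here is a minimal sketch. The RunRecord shape and its field names are assumptions made for this example, not the harness's actual schema, and the error tally here counts every recorded code rather than only those seen before a recovery.

```typescript
// Hypothetical record shape; the real per-task JSON may use different fields.
interface RunRecord {
  task: string;
  language: "zero" | "python";
  model: string;
  passed: boolean;       // final attempt compiled and passed all hidden tests
  attempts: number;      // model turns used, counted from 1
  outputTokens: number;  // tokens the model generated during the run
  errorCodes: string[];  // e.g. "PAR100" or "SyntaxError", one per failed attempt
}

interface Summary {
  passRate: number;
  meanAttemptsToGreen: number | null;        // null when no run passed
  meanOutputTokensPerSuccess: number | null; // null when no run passed
  errorCodeCounts: Record<string, number>;
}

// Summarize the runs of one (model, language) group.
function summarize(runs: RunRecord[]): Summary {
  const passed = runs.filter((r) => r.passed);
  const mean = (xs: number[]): number | null =>
    xs.length === 0 ? null : xs.reduce((a, b) => a + b, 0) / xs.length;

  // Tally every diagnostic code or exception type that was recorded.
  const errorCodeCounts: Record<string, number> = {};
  for (const run of runs) {
    for (const code of run.errorCodes) {
      errorCodeCounts[code] = (errorCodeCounts[code] ?? 0) + 1;
    }
  }

  return {
    passRate: runs.length === 0 ? 0 : passed.length / runs.length,
    meanAttemptsToGreen: mean(passed.map((r) => r.attempts)),
    meanOutputTokensPerSuccess: mean(passed.map((r) => r.outputTokens)),
    errorCodeCounts,
  };
}
```

Attempts and tokens are averaged over successful runs only, which matches the "to green" and "per success" wording of the metric names.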

Each language uses its native, agent-facing failure signal: Zero gets structured JSON from zero check --json, and Python gets the stderr traceback from py_compile or from a runtime error. The model receives that signal verbatim as feedback before its next attempt.
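
The generate → compile → run → feedback cycle can be pictured as a small loop. The sketch below is illustrative only: Generate and Evaluate are hypothetical callbacks standing in for the model call and the language adapter, and the loop is synchronous for brevity, whereas a real harness would most likely be asynchronous.

```typescript
// Outcome of compiling and running one candidate program.
interface EvalResult {
  passed: boolean;
  feedback: string; // raw tool output: Zero's JSON diagnostics or Python's stderr
}

// Hypothetical stand-ins for the model call and the language adapter.
type Generate = (prompt: string, feedback: string | null) => string;
type Evaluate = (source: string) => EvalResult;

interface LoopResult {
  passed: boolean;
  attempts: number; // turns actually used
}

function runFixLoop(
  prompt: string,
  generate: Generate,
  evaluate: Evaluate,
  maxAttempts = 3, // the fix-loop budget
): LoopResult {
  let feedback: string | null = null; // the first turn has no feedback yet
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const source = generate(prompt, feedback);
    const result = evaluate(source);
    if (result.passed) {
      return { passed: true, attempts: attempt };
    }
    feedback = result.feedback; // handed to the next turn unchanged
  }
  return { passed: false, attempts: maxAttempts };
}
```

Keeping the feedback an opaque string means the same loop serves both languages, and only the adapter behind Evaluate differs.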

How to run

npm install

Set ANTHROPIC_API_KEY in the environment, or in zero-agent-bench/.env or the parent Zero/.env (loaded automatically).

export ANTHROPIC_API_KEY=sk-ant-...

# Cheap pilot: 5 tasks × 2 languages × Haiku 4.5, ~$1
npm run bench:pilot

# Full run: 15 tasks × 2 languages × Opus 4.7 + Sonnet 4.6 + Haiku 4.5
npm run bench:full

# Single model
npm run bench -- --model claude-sonnet-4-6

# Filter to a subset of tasks
npm run bench -- --filter strings

# Ablation: do the version-matched Zero skills actually help?
npm run bench -- --model claude-sonnet-4-6 --skip-zero-skills
npm run bench -- --model claude-sonnet-4-6

npm run report

# Resume tasks 11–15 after a partial full run, then merge:
npm run bench:remaining
npm run report:merge -- <remaining-run-id>   # append run id to scripts/merge-runs.mjs args in package.json

Requirements

Node ≥ 20
Python 3 (for the Python language adapter)
Zero installed and on PATH inside WSL Ubuntu (the harness shells out to wsl -d Ubuntu -- bash -lc 'zero …')
```
wsl -d Ubuntu -- bash -lc 'curl -fsSL https://zerolang.ai/install.sh | sh'
```

On macOS or Linux, replace the WSL invocations in src/languages/zero.ts with direct zero calls.
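
One way to avoid hand-editing the invocations is to pick the command by platform. The sketch below is an assumption about how that could look, not the contents of src/languages/zero.ts; zeroCommand and shellQuote are hypothetical helpers, and the platform argument is expected to carry a Node process.platform value.

```typescript
interface Command {
  cmd: string;
  args: string[];
}

// Quote one argument for a POSIX shell by wrapping it in single quotes.
function shellQuote(arg: string): string {
  return "'" + arg.replace(/'/g, "'\\''") + "'";
}

// Hypothetical helper: build the argv for invoking zero on a given platform.
function zeroCommand(platform: string, zeroArgs: string[]): Command {
  if (platform === "win32") {
    // Windows: run zero inside WSL Ubuntu through a login shell.
    const line = ["zero", ...zeroArgs.map(shellQuote)].join(" ");
    return { cmd: "wsl", args: ["-d", "Ubuntu", "--", "bash", "-lc", line] };
  }
  // macOS or Linux: call zero directly, with no shell in between.
  return { cmd: "zero", args: zeroArgs };
}
```

The resulting cmd and args could then be passed to a process-spawning call such as Node's child_process.spawn.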

Fairness notes

The benchmark is deliberately written to be language-fair:

All tasks are pure CLI programs: read from args or stdin, write to stdout, no network or filesystem dependencies.
Tasks avoid features that meaningfully advantage one language (regex, comprehensions, advanced stdlib).
Inputs stay within i32 and are non-negative integers, because Zero's parser does not currently accept negative integer literals (an issue adjacent to #104).
Each language is told to print output exactly as specified; results are compared on byte-equal stdout, not approximate output.
Both languages get the same prompt, the same fix-loop budget (3 attempts), and the same temperature defaults.
The Zero side optionally injects the zero-language, zero-stdlib, and zero-diagnostics skills (obtained with zero skills get) into the system prompt. Use --skip-zero-skills for an ablation comparison.
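
To make the byte-equal rule concrete, here is a minimal sketch of a strict stdout check. The function names are hypothetical, and the sketch assumes both the actual and expected output have been decoded to strings in the same way, so that string equality coincides with byte equality.

```typescript
// Strict comparison: no trimming, no newline normalization, no case folding.
function stdoutMatches(actual: string, expected: string): boolean {
  return actual === expected;
}

// Index of the first differing position, or -1 when the outputs are identical.
// Useful for pinpointing a stray space or a missing trailing newline.
function firstDifference(actual: string, expected: string): number {
  const n = Math.min(actual.length, expected.length);
  for (let i = 0; i < n; i++) {
    if (actual[i] !== expected[i]) return i;
  }
  return actual.length === expected.length ? -1 : n;
}
```

Under this rule, output that differs from the expected text only by a missing trailing newline counts as a failure.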

Results

Published summaries:

Run	Summary
Pilot (5 tasks, Haiku)	results/PILOT.md
Full (15 tasks, all models)	results/RESULTS.md

After a run, results/raw/<timestamp>/report.md contains:

Pass rate per (model, language)
Mean attempts to green per (model, language)
Mean output tokens per successful run
Full error-code distribution per (model, language)

Raw per-task results are written to results/raw/<timestamp>/<task>__<lang>__<model>.json.
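
For scripts that post-process the raw results, the file naming pattern can be built and parsed mechanically. This is a sketch under two assumptions: no component contains a double underscore, and the lang values shown in any usage (such as "zero") are illustrative rather than confirmed. ResultKey and both helpers are hypothetical names.

```typescript
interface ResultKey {
  task: string;
  lang: string;
  model: string;
}

// Build "<task>__<lang>__<model>.json" from its parts.
function resultFileName(key: ResultKey): string {
  return `${key.task}__${key.lang}__${key.model}.json`;
}

// Parse a result file name back into its parts; returns null for anything
// that does not follow the pattern (report.md, for example).
function parseResultFileName(name: string): ResultKey | null {
  const suffix = ".json";
  if (!name.endsWith(suffix)) return null;
  const parts = name.slice(0, -suffix.length).split("__");
  if (parts.length !== 3) return null;
  const [task, lang, model] = parts;
  return { task, lang, model };
}
```

Single hyphens in task IDs and model names are unaffected, because only the double underscore acts as a separator.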

Tasks

The current set has 15 tasks across trivial, easy, and medium difficulty:

ID	Title	Difficulty
01-echo	Echo a single argument	trivial
02-add	Add two non-negative integers from args	easy
03-even-odd	Even or odd	easy
04-fizzbuzz	FizzBuzz to N	easy
05-reverse-string	Reverse an ASCII string	easy
06-max-of-three	Maximum of three integers	easy
07-count-vowels	Count lowercase vowels in a string	easy
08-factorial	Factorial of a small N	easy
09-sum-stdin	Sum integers from stdin	medium
10-is-prime	Primality test	medium
11-repeat-string	Repeat a string K times	easy
12-gcd	Greatest common divisor	medium
13-fibonacci	Nth Fibonacci number	medium
14-sort-three	Sort three integers ascending	medium
15-word-count	Count words in stdin	medium

Adding a task: drop a JSON file in tasks/ matching the Task type in src/types.ts.

License

Apache-2.0
