Short tutorial to run the benchmark locally with uv.
It runs lm-evaluation-harness multiple times while changing only the system identity prompt, then:
- saves per-run JSON files
- writes an aggregated summary
- generates static charts (PNG + SVG)
Identity/task definitions live in benchmark_config.json (including short display names and full prompts).
Default model is Qwen/Qwen3-4B-Instruct-2507 (configured in run_benchmark.py).
If you do not have uv yet:
curl -LsSf https://astral.sh/uv/install.sh | sh
Restart your shell, then verify:
uv --version
From this repository root:
uv sync
uv run run_benchmark.py --help
uv run plot_results.py --help
Use --dry-run first:
uv run run_benchmark.py --dry-run
uv run run_benchmark.py --identity helpful --task arc_challenge --seeds 1,2,3 --dry-run
Examples:
# Full matrix: all identities x all tasks x default seeds (1,2,3)
uv run run_benchmark.py
# One identity across all tasks
uv run run_benchmark.py --identity helpful
# One task across all identities
uv run run_benchmark.py --task arc_challenge
# One identity-task pair with explicit seeds
uv run run_benchmark.py --identity helpful --task arc_challenge --seeds 1,2,3
To generate the charts:
uv run plot_results.py
By default this reads results/summary.json and writes charts to results/.
If you run the benchmark in batches (different identities/tasks at different times),
you can regenerate a complete results/summary.json from existing run files:
uv run run_benchmark.py --summary-from-runs
This scans results/runs/*.json (excluding raw _lm_eval_raw_*.json files) and
recomputes the aggregate summary.
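The file selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in run_benchmark.py: the helper name list_run_files is hypothetical, and it assumes raw files are recognisable by the _lm_eval_raw_ filename prefix.

```python
from pathlib import Path


def list_run_files(runs_dir):
    """Return run JSON files in runs_dir, skipping raw lm-eval dumps.

    Hypothetical helper: assumes raw files start with "_lm_eval_raw_".
    """
    return sorted(
        path
        for path in Path(runs_dir).glob("*.json")
        if not path.name.startswith("_lm_eval_raw_")
    )
```

Calling list_run_files("results/runs") after a batch should list exactly the files that the summary is rebuilt from, which is a quick way to check that no run is missing.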
If you want one row per run for downstream analysis, export a CSV from saved run files:
uv run run_benchmark.py --csv-from-runs
This scans results/runs/*.json (excluding raw _lm_eval_raw_*.json files) and
writes results/runs.csv with schema:
task,identity,seed,score
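As an example of downstream analysis, the sketch below averages the score per task and identity across seeds. It assumes the CSV starts with a header row matching the schema above; the function name mean_scores is hypothetical and the sample rows are made up for illustration, not real results.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_scores(csv_file):
    """Average score per (task, identity) pair across all seeds."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[(row["task"], row["identity"])].append(float(row["score"]))
    return {key: mean(values) for key, values in groups.items()}


# Made-up rows in the documented schema (not real benchmark output).
sample = io.StringIO(
    "task,identity,seed,score\n"
    "arc_challenge,helpful,1,0.50\n"
    "arc_challenge,helpful,2,0.54\n"
)
result = mean_scores(sample)
print(result)
```

For a real export, pass an open file instead of the sample, for example open("results/runs.csv", newline="").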
To compute per-task p-values for each identity vs the configured baseline:
uv run run_benchmark.py --pvalues-from-runs
This prints a CSV-style table to stdout with columns:
task,identity,n,mean_delta_pp,p_value,p_holm
Method: exact two-sided paired sign-flip test across shared seeds. Multiple-comparison correction: Holm-Bonferroni.
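To make the method concrete, here is a minimal sketch of an exact two-sided paired sign-flip test followed by a Holm-Bonferroni correction. It is written from the description above rather than taken from run_benchmark.py, so the function names and the sample numbers are assumptions. Each delta is the identity's score minus the baseline's score for one shared seed.

```python
from itertools import product


def sign_flip_p_value(deltas):
    """Exact two-sided paired sign-flip test on per-seed differences.

    Enumerates all 2**n sign assignments and counts those whose absolute
    mean is at least as large as the observed absolute mean.
    """
    n = len(deltas)
    observed = abs(sum(deltas)) / n
    hits = 0
    for signs in product((1, -1), repeat=n):
        flipped = abs(sum(s * d for s, d in zip(signs, deltas))) / n
        if flipped >= observed - 1e-12:  # small tolerance for float noise
            hits += 1
    return hits / 2 ** n


def holm(p_values):
    """Holm-Bonferroni adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, etc.
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


# Made-up deltas for three shared seeds (illustration only).
p = sign_flip_p_value([1.0, 2.0, 3.0])
print(p)
print(holm([0.01, 0.04, 0.03]))
```

Because the unflipped and fully flipped assignments always count, the smallest achievable two-sided p-value with n seeds is 2 / 2^n. With the default three seeds that is 0.25, so more seeds are needed before any comparison can fall below common significance thresholds.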
Edit benchmark_config.json:
- identities[*].short_name controls what appears in plots.
- identities[*].system_prompt is the full system prompt sent to lm_eval.
- identities[*].key is the recommended CLI selector for --identity.
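A rough sketch of how these three fields fit together when an identity is selected by key. It assumes a top-level "identities" list, as the identities[*] notation suggests; the find_identity helper and the inline config text are hypothetical and stand in for the real file.

```python
import json

# Hypothetical minimal config with the three documented fields.
CONFIG_TEXT = """
{
  "identities": [
    {
      "key": "helpful",
      "short_name": "Helpful",
      "system_prompt": "You are a helpful assistant."
    }
  ]
}
"""


def find_identity(config, key):
    """Return the identity entry whose "key" matches, or raise KeyError."""
    for identity in config["identities"]:
        if identity["key"] == key:
            return identity
    raise KeyError(f"unknown identity: {key}")


config = json.loads(CONFIG_TEXT)
identity = find_identity(config, "helpful")
print(identity["short_name"], "->", identity["system_prompt"])
```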
Example identity entry:
{
"key": "helpful",
"short_name": "Helpful",
"system_prompt": "You are a helpful assistant."
}
- Raw runs: results/runs/*.json (schema identity_benchmark_seed.json)
- Aggregated summary of last execution: results/summary.json
- Flat run export: results/runs.csv
- Charts (after plotting):
  - results/grouped_scores.png
  - results/grouped_scores.svg
  - results/delta_vs_baseline.png
  - results/delta_vs_baseline.svg
- Missing dependency error: run uv sync.
- CUDA/GPU issues: this runner is configured for cuda in run_benchmark.py; if you need CPU fallback, update MODEL_DEVICE there.
- Tests seem to be failing: this was tested on a 4070 Ti SUPER with 16 GB of VRAM; on a smaller GPU you may need to choose a smaller model or quantize it.
- No visible progress: the benchmark is very slow. On a typical consumer-grade card, expect a total runtime of 9-15 hours.