AI Identity Benchmark (Qwen 3 4B)

Short tutorial to run the benchmark locally with uv.

What this project does

It runs lm-evaluation-harness multiple times while changing only the system identity prompt, then:

saves per-run JSON files
writes an aggregated summary
generates static charts (PNG + SVG)

Identity and task definitions live in benchmark_config.json (including short display names and full prompts). The default model is Qwen/Qwen3-4B-Instruct-2507 (configured in run_benchmark.py).

1) Install `uv`

If you do not have uv yet:

curl -LsSf https://astral.sh/uv/install.sh | sh

Restart your shell, then verify:

uv --version

2) Create a Python environment

From the repository root:

uv sync

4) Sanity check CLI

uv run run_benchmark.py --help
uv run plot_results.py --help

5) Preview a run plan (no model execution)

Use --dry-run first:

uv run run_benchmark.py --dry-run
uv run run_benchmark.py --identity helpful --task arc_challenge --seeds 1,2,3 --dry-run
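
The plan previewed by --dry-run is the cross product of the selected identities, tasks, and seeds. Here is a minimal Python sketch of that expansion. The names build_plan and parse_seeds are hypothetical and only illustrate the idea; the actual logic in run_benchmark.py may differ.

```python
from itertools import product


def parse_seeds(text):
    """Parse a comma-separated seed string such as '1,2,3' into integers."""
    return [int(part) for part in text.split(",") if part.strip()]


def build_plan(identities, tasks, seeds):
    """Return one (identity, task, seed) tuple per planned run."""
    return list(product(identities, tasks, seeds))


# Mirrors: --identity helpful --task arc_challenge --seeds 1,2,3
plan = build_plan(["helpful"], ["arc_challenge"], parse_seeds("1,2,3"))
for identity, task, seed in plan:
    print(f"{identity} x {task} (seed {seed})")
```

The number of planned runs is the product of the three list lengths, which is worth checking before starting a long full-matrix run.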

6) Run benchmark

Examples:

# Full matrix: all identities x all tasks x default seeds (1,2,3)
uv run run_benchmark.py

# One identity across all tasks
uv run run_benchmark.py --identity helpful

# One task across all identities
uv run run_benchmark.py --task arc_challenge

# One identity-task pair with explicit seeds
uv run run_benchmark.py --identity helpful --task arc_challenge --seeds 1,2,3

7) Plot results

uv run plot_results.py

By default this reads results/summary.json and writes charts to results/.

Rebuild summary from saved runs

If you run the benchmark in batches (different identities/tasks at different times), you can regenerate a complete results/summary.json from existing run files:

uv run run_benchmark.py --summary-from-runs

This scans results/runs/*.json (excluding raw _lm_eval_raw_*.json files) and recomputes the aggregate summary.
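
To check which files such a scan would pick up, the file selection can be approximated as follows. This is a sketch that assumes the exclusion works on the file name pattern quoted above; list_run_files is a hypothetical helper, not a function from run_benchmark.py.

```python
from fnmatch import fnmatchcase
from pathlib import Path

RAW_PATTERN = "_lm_eval_raw_*.json"  # raw lm_eval dumps, as named above


def list_run_files(runs_dir):
    """Return the *.json files in runs_dir, skipping raw lm_eval dumps."""
    return sorted(
        path
        for path in Path(runs_dir).glob("*.json")
        if not fnmatchcase(path.name, RAW_PATTERN)
    )
```

Calling list_run_files("results/runs") from the repository root returns the per-run files in sorted order, or an empty list if no runs have been saved yet.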

Export flat CSV from saved runs

If you want one row per run for downstream analysis, export a CSV from saved run files:

uv run run_benchmark.py --csv-from-runs

This scans results/runs/*.json (excluding raw _lm_eval_raw_*.json files) and writes results/runs.csv with schema:

task,identity,seed,score
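
A file with this schema can be consumed with the Python standard library alone. The sketch below averages the score per (task, identity) pair. It assumes the CSV starts with a header row matching the schema, and the sample rows are made up for illustration, not real benchmark results.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_scores(csv_file):
    """Average score per (task, identity) from task,identity,seed,score rows."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[(row["task"], row["identity"])].append(float(row["score"]))
    return {key: mean(values) for key, values in grouped.items()}


# Made-up sample rows; replace with open("results/runs.csv", newline="").
sample = io.StringIO(
    "task,identity,seed,score\n"
    "arc_challenge,helpful,1,0.50\n"
    "arc_challenge,helpful,2,0.54\n"
)
print(mean_scores(sample))
```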

Compute p-values from saved runs

To compute per-task p-values for each identity vs the configured baseline:

uv run run_benchmark.py --pvalues-from-runs

This prints a CSV-style table to stdout with columns:

task,identity,n,mean_delta_pp,p_value,p_holm

Method: exact two-sided paired sign-flip test across shared seeds. Multiple-comparison correction: Holm-Bonferroni.
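
For readers unfamiliar with the method, here is a minimal Python sketch of an exact two-sided paired sign-flip test and a Holm-Bonferroni adjustment. It is an illustration under stated assumptions, not the code in run_benchmark.py: the function names are hypothetical, deltas stands for the per-seed differences (identity score minus baseline score) on shared seeds, and which set of comparisons forms one Holm family is not specified here.

```python
from itertools import product


def sign_flip_pvalue(deltas):
    """Exact two-sided p-value over all 2^n sign assignments of the deltas."""
    # Using the sum instead of the mean gives the same ordering of outcomes.
    observed = abs(sum(deltas))
    total = 0
    extreme = 0
    for signs in product((1, -1), repeat=len(deltas)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, deltas)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


def holm_adjust(pvalues):
    """Holm-Bonferroni step-down adjustment, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on.
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[idx]))
        adjusted[idx] = running_max
    return adjusted


# Made-up deltas for three shared seeds; not real benchmark results.
print(sign_flip_pvalue([1.0, 2.0, 3.0]))
```

An exact test of this kind cannot produce a two-sided p-value below 2 / 2^n for n paired seeds, so with three seeds the smallest attainable value is 0.25. More seeds are needed before any comparison can reach conventional significance thresholds.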

Customize identities and labels

Edit benchmark_config.json:

identities[*].short_name controls what appears in plots.
identities[*].system_prompt is the full system prompt sent to lm_eval.
identities[*].key is the recommended CLI selector for --identity.

Example identity entry:

{
  "key": "helpful",
  "short_name": "Helpful",
  "system_prompt": "You are a helpful assistant."
}
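
The sketch below shows how an entry like this could be looked up by its key, as the --identity selector does conceptually. It assumes the entries sit in a top-level identities list, as the identities[*] notation suggests; find_identity is a hypothetical helper and the inline JSON stands in for benchmark_config.json.

```python
import json


def find_identity(config, key):
    """Return the identity entry whose "key" matches, or raise KeyError."""
    for entry in config["identities"]:
        if entry["key"] == key:
            return entry
    raise KeyError(f"unknown identity: {key}")


# Inline stand-in for benchmark_config.json, using the example entry above.
config = json.loads("""
{
  "identities": [
    {"key": "helpful", "short_name": "Helpful",
     "system_prompt": "You are a helpful assistant."}
  ]
}
""")
entry = find_identity(config, "helpful")
print(entry["short_name"], "->", entry["system_prompt"])
```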

Output files

Raw runs: results/runs/*.json (file name pattern identity_benchmark_seed.json)
Aggregated summary of the last execution: results/summary.json
Flat run export: results/runs.csv
Charts (after plotting):
- results/grouped_scores.png
- results/grouped_scores.svg
- results/delta_vs_baseline.png
- results/delta_vs_baseline.svg

Troubleshooting

Missing dependency error: run uv sync.
CUDA/GPU issues: this runner is configured for cuda in run_benchmark.py; if you need CPU fallback, update MODEL_DEVICE there.
Tests seem to be failing: this was tested on an RTX 4070 Ti SUPER with 16 GB of VRAM. On a card with less memory you might need to choose a smaller model or quantize it.
It doesn't seem to make any progress: executing the tests is very slow. With a typical consumer-grade card you can expect a total runtime of 9-15 hours.
