Jojodicus/ai-identity-benchmark

AI Identity Benchmark (Qwen 3 4B)

A short tutorial for running the benchmark locally with uv.

What this project does

It runs lm-evaluation-harness multiple times while changing only the system identity prompt, then:

  • saves per-run JSON files
  • writes an aggregated summary
  • generates static charts (PNG + SVG)

Identity and task definitions live in benchmark_config.json (including short display names and full prompts). The default model is Qwen/Qwen3-4B-Instruct-2507 (configured in run_benchmark.py).

1) Install uv

If you do not have uv yet:

curl -LsSf https://astral.sh/uv/install.sh | sh

Restart your shell, then verify:

uv --version

2) Create a Python environment

From this repository root:

uv sync

4) Sanity check CLI

uv run run_benchmark.py --help
uv run plot_results.py --help

5) Preview a run plan (no model execution)

Use --dry-run first:

uv run run_benchmark.py --dry-run
uv run run_benchmark.py --identity helpful --task arc_challenge --seeds 1,2,3 --dry-run
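
To make the idea of a run plan concrete, here is a minimal sketch of how an identities x tasks x seeds matrix can be enumerated. It is not taken from run_benchmark.py: the helper names `parse_seeds` and `build_plan` and the identity key `baseline` are assumptions made for illustration.

```python
from itertools import product


def parse_seeds(text):
    """Turn a comma-separated string such as "1,2,3" into a list of ints."""
    return [int(part) for part in text.split(",") if part.strip()]


def build_plan(identities, tasks, seeds):
    """Return one (identity, task, seed) tuple per planned run."""
    return list(product(identities, tasks, seeds))


# "baseline" is a hypothetical identity key; real keys come from benchmark_config.json.
identities = ["baseline", "helpful"]
tasks = ["arc_challenge"]
plan = build_plan(identities, tasks, parse_seeds("1,2,3"))

for identity, task, seed in plan:
    print(f"{identity:10s} {task:15s} seed={seed}")
print(f"{len(plan)} runs planned")
```

The number of runs is the product of the three list lengths, which is worth checking before committing to a full matrix.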

6) Run benchmark

Examples:

# Full matrix: all identities x all tasks x default seeds (1,2,3)
uv run run_benchmark.py

# One identity across all tasks
uv run run_benchmark.py --identity helpful

# One task across all identities
uv run run_benchmark.py --task arc_challenge

# One identity-task pair with explicit seeds
uv run run_benchmark.py --identity helpful --task arc_challenge --seeds 1,2,3

7) Plot results

uv run plot_results.py

By default this reads results/summary.json and writes charts to results/.

Rebuild summary from saved runs

If you run the benchmark in batches (different identities/tasks at different times), you can regenerate a complete results/summary.json from existing run files:

uv run run_benchmark.py --summary-from-runs

This scans results/runs/*.json (excluding raw _lm_eval_raw_*.json files) and recomputes the aggregate summary.
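
A rough sketch of what such a rebuild can look like, for readers who want to post-process run files themselves. The per-run JSON layout is an assumption here (keys `task`, `identity`, `seed`, `score`, borrowed from the CSV schema); the real files and the real summary format may differ, and `load_runs` and `aggregate` are hypothetical helper names.

```python
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean


def load_runs(run_dir="results/runs"):
    """Read every per-run JSON file, skipping raw lm_eval dumps."""
    records = []
    for path in sorted(Path(run_dir).glob("*.json")):
        if path.name.startswith("_lm_eval_raw_"):
            continue  # raw harness output, not a per-run result
        records.append(json.loads(path.read_text()))
    return records


def aggregate(records):
    """Average scores over seeds for each (task, identity) pair."""
    grouped = defaultdict(list)
    for rec in records:
        # Assumed keys: task, identity, score.
        grouped[(rec["task"], rec["identity"])].append(rec["score"])
    return {key: {"n": len(vals), "mean": mean(vals)} for key, vals in grouped.items()}


# In-memory demo with made-up scores; use aggregate(load_runs()) on real files.
demo = [
    {"task": "arc_challenge", "identity": "helpful", "seed": 1, "score": 0.5},
    {"task": "arc_challenge", "identity": "helpful", "seed": 2, "score": 0.75},
]
print(aggregate(demo))
```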

Export flat CSV from saved runs

If you want one row per run for downstream analysis, export a CSV from saved run files:

uv run run_benchmark.py --csv-from-runs

This scans results/runs/*.json (excluding raw _lm_eval_raw_*.json files) and writes results/runs.csv with schema:

task,identity,seed,score
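
As an illustration of that schema, the sketch below writes two made-up run records in the same column order using the standard csv module. The function name `write_runs_csv`, the sort order, and the scores are assumptions, not the exporter's actual code.

```python
import csv
import io

FIELDS = ["task", "identity", "seed", "score"]


def write_runs_csv(records, handle):
    """Write one row per run, in the column order task,identity,seed,score."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    # Sorting is an assumption that keeps the output stable across exports.
    for rec in sorted(records, key=lambda r: (r["task"], r["identity"], r["seed"])):
        writer.writerow(rec)


# Made-up records for demonstration only.
records = [
    {"task": "arc_challenge", "identity": "helpful", "seed": 2, "score": 0.75},
    {"task": "arc_challenge", "identity": "helpful", "seed": 1, "score": 0.5},
]
buffer = io.StringIO()
write_runs_csv(records, buffer)
print(buffer.getvalue())
```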

Compute p-values from saved runs

To compute per-task p-values for each identity vs the configured baseline:

uv run run_benchmark.py --pvalues-from-runs

This prints a CSV-style table to stdout with columns:

task,identity,n,mean_delta_pp,p_value,p_holm

Method: exact two-sided paired sign-flip test across shared seeds. Multiple-comparison correction: Holm-Bonferroni.
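
For intuition, here is a minimal sketch of both steps, written from the textbook definitions rather than from run_benchmark.py. The function names and the example deltas are hypothetical. `deltas` holds one identity-minus-baseline difference per shared seed.

```python
from itertools import product


def sign_flip_pvalue(deltas):
    """Exact two-sided paired sign-flip test on per-seed differences."""
    observed = abs(sum(deltas))
    flips = list(product((1, -1), repeat=len(deltas)))
    # Count sign assignments whose |sum| is at least as extreme as observed.
    # The small tolerance guards against floating-point ties.
    extreme = sum(
        abs(sum(s * d for s, d in zip(signs, deltas))) >= observed - 1e-12
        for signs in flips
    )
    return extreme / len(flips)


def holm(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is scaled by m, the next by m - 1, and so on.
        running_max = max(running_max, min(1.0, (m - rank) * p_values[idx]))
        adjusted[idx] = running_max
    return adjusted


# Made-up deltas in percentage points for three shared seeds.
p = sign_flip_pvalue([1.2, 0.8, 1.5])
print(p)
print(holm([p, 0.5, 1.0]))
```

In this sketch, three seeds give only 2^3 = 8 sign assignments, so the smallest two-sided p-value it can produce is 2/8 = 0.25. More seeds are needed before any comparison can reach conventional significance thresholds.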

Customize identities and labels

Edit benchmark_config.json:

  • identities[*].short_name controls what appears in plots.
  • identities[*].system_prompt is the full system prompt sent to lm_eval.
  • identities[*].key is the recommended CLI selector for --identity.

Example identity entry:

{
  "key": "helpful",
  "short_name": "Helpful",
  "system_prompt": "You are a helpful assistant."
}
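
A small sketch of how an entry like this can be looked up by its key. It assumes the file has a top-level `identities` list, as the `identities[*]` notation above suggests; `find_identity` is a hypothetical helper, and the inline JSON stands in for benchmark_config.json.

```python
import json


def find_identity(config, key):
    """Return the identity entry whose "key" matches, or raise KeyError."""
    for entry in config["identities"]:
        if entry["key"] == key:
            return entry
    raise KeyError(f"unknown identity: {key}")


# Inline stand-in for the contents of benchmark_config.json.
config = json.loads("""
{
  "identities": [
    {
      "key": "helpful",
      "short_name": "Helpful",
      "system_prompt": "You are a helpful assistant."
    }
  ]
}
""")

entry = find_identity(config, "helpful")
print(entry["short_name"], "->", entry["system_prompt"])
```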

Output files

  • Per-run results: results/runs/*.json (file name pattern: identity_benchmark_seed.json)
  • Aggregated summary of last execution: results/summary.json
  • Flat run export: results/runs.csv
  • Charts (after plotting):
    • results/grouped_scores.png
    • results/grouped_scores.svg
    • results/delta_vs_baseline.png
    • results/delta_vs_baseline.svg

Troubleshooting

  • Missing dependency error: run uv sync.
  • CUDA/GPU issues: this runner is configured for cuda in run_benchmark.py; if you need CPU fallback, update MODEL_DEVICE there.
  • Tests seem to be failing: this was tested on an RTX 4070 Ti SUPER with 16 GB of VRAM. With less memory, you might need to choose a smaller model or quantize it.
  • It doesn't seem to make any progress: executing the tests is very slow. With a typical consumer-grade card you can expect a total runtime of 9-15 hours.
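
If you would rather not edit MODEL_DEVICE by hand, one possible approach is to pick the device at startup. This is only a sketch: it assumes the model runs on PyTorch, `pick_device` is a hypothetical helper, and run_benchmark.py does not necessarily work this way.

```python
import importlib.util


def pick_device(preferred="cuda"):
    """Return "cuda" only when PyTorch is installed and sees a GPU."""
    if preferred != "cuda":
        return preferred
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so CUDA cannot be used
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


# MODEL_DEVICE mirrors the constant named in run_benchmark.py.
MODEL_DEVICE = pick_device()
print(MODEL_DEVICE)
```

Keep in mind that a CPU run will be much slower than the GPU runtimes quoted above.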
