Skip to content

Latest commit

 

History

History
280 lines (216 loc) · 11.4 KB

File metadata and controls

280 lines (216 loc) · 11.4 KB

eval_framework

A lightweight LLM evaluation harness built for Rubrics RL research — fast, repeatable scoring of the many checkpoints an RL run produces. It wraps a single vLLM-compatible inference path with task runners for IF-EVAL, IFBench, WritingBench, HealthBench, Arena-Hard, and AlpacaEval.

Why it's fast. Evaluation is driven through vLLM's continuous batching: the runner fires --num-threads concurrent requests at one server so the GPU stays saturated end-to-end (a single H100 holds ~700 W throughout the run). On one H100 a 4B checkpoint clears every task in roughly 15 minutes; examples/shard_parallel_eval.sh shards a single model across N GPUs to go faster still.

Installation

git clone https://github.com/THUAIS-Lab/eval_framework.git
cd eval_framework
uv venv && source .venv/bin/activate
uv pip install -e .
uv pip install vllm --torch-backend=auto

Or install the released package from PyPI (no clone needed):

pip install llm-eval-framework
# llm-eval-framework does NOT pull in vLLM (vLLM needs a torch backend matched
# to your CUDA). Install it separately depending on how you run the model:
#   • server mode — run `vllm serve` in a separate env/terminal (see Quick Start)
#   • --local mode — vLLM is imported in-process, so it must live in THIS env:
uv pip install vllm --torch-backend=auto   # or: pip install vllm

After installation the eval-framework command is available in the venv.

Most benchmark data ships with the package — Arena-Hard v2.0 questions/baselines and the AlpacaEval GPT-4 baseline are bundled under tasks/arena_hard/data/ and tasks/alpaca_eval/data/. IFBench is the one exception: its verifier source is not redistributed here, so clone it only if you plan to run that task:

git clone https://github.com/allenai/IFBench .external/IFBench   # only for --tasks ifbench

(or point --ifbench-dir at an existing checkout). Every other task runs without this step.

Quick Start

The runner needs a model to call. There are two ways to provide one.

Option A — server mode (recommended, saturates the GPU)

Start a vLLM server in one terminal:

vllm serve Qwen/Qwen3-4B \
  --served-model-name Qwen3-4B \
  --gpu-memory-utilization 0.95

Then point the runner at it. Begin with a 4-example smoke test to confirm the setup is wired up correctly (~30 s) before launching a full run:

eval-framework \
  --tasks ifeval \
  --model Qwen3-4B \
  --base-url http://localhost:8000/v1 \
  --max-examples 4 \
  --output-dir outputs/smoke

Drop --max-examples for the full task. The --model value must match the server's --served-model-name.

Option B — --local (zero setup, single process)

--local loads vLLM in-process, so no separate server is needed — handy for a quick one-off:

eval-framework \
  --tasks ifeval \
  --model Qwen/Qwen3-4B \
  --local \
  --max-examples 4 \
  --output-dir outputs/smoke

Use --local for convenience; use server mode for throughput. In server mode the runner drives vLLM with --num-threads concurrent requests, which vLLM batches continuously to keep the GPU fully utilised — this is what gets a single card to ~700 W and a 4B checkpoint through all tasks in ~15 minutes.

To sweep a whole RL run across many checkpoints and GPUs, use the ready-made scripts in examples/ — see Multi-GPU Batch Evaluation below.

Tasks

Task Judge needed? Key flags
ifeval No (rule-based) --ifeval-input
ifbench No (rule-based) --ifbench-dir, --ifbench-input
writingbench Yes --writingbench-query, --writingbench-write-excel
healthbench Yes --healthbench-data
arena-hard Yes --arena-hard-dir, --arena-hard-benchmark
alpaca-eval Yes --alpaca-eval-reference, --alpaca-eval-hf-dataset

Modes

  • --inference-only — generate responses, skip judging. Judge later with --judge-only.
  • --judge-only — score existing responses. Only supports writingbench / healthbench / arena-hard / alpaca-eval (ifeval and ifbench are rule-based and score during inference).

Multi-GPU Batch Evaluation

For RL experiments you typically need to evaluate many checkpoints across all benchmarks. We provide ready-to-use scripts in examples/:

Script Use case
examples/shard_parallel_eval.sh Evaluate ONE model on all benchmarks — shards data across N GPUs for max throughput
examples/batch_eval.sh Evaluate one training run — auto-detects checkpoints, schedules across N GPUs in rounds, judges, plots

Usage:

# 1. Copy and edit the CONFIG section at the top of the script
cp examples/batch_eval.sh my_eval.sh
vim my_eval.sh   # edit CKPT_DIR, OUT_DIR, STEPS, etc.

# 2. Run
bash my_eval.sh

What the scripts handle automatically:

  • Multi-round scheduling — if you have more checkpoints than GPUs, the script runs them in rounds and cleans up vLLM between rounds
  • vLLM lifecycle — starts servers, waits for health checks, kills process groups after eval
  • Judge batching — runs judge jobs in small batches to respect API rate limits (configurable JUDGE_BATCH_SIZE)
  • Phase control — set RUN_INFERENCE=0 / RUN_JUDGE=0 / RUN_PLOT=0 to skip phases (e.g. re-run judge only after fixing an issue)
  • Logging — all vLLM and eval logs go to LOG_DIR for debugging; judge stderr (tqdm) is tee'd to terminal

vLLM Tips

  • Do NOT set --max-model-len unless you know exactly what you're doing. Let the model use its native context length (e.g. 32768 for Qwen3-4B). Setting it too low causes VLLMValidationError on long prompts.
  • --gpu-memory-utilization 0.95 is safe for H100s and maximizes KV cache.
  • Increase --num-threads when GPU utilization is low and the serving backend has available capacity.
  • Kill process groups, not just PIDskill -- -${pid} ensures all vLLM child processes are cleaned up. Follow with pkill -f "vllm serve" between rounds.

Output structure

outputs/
├── step_120/
│   ├── run_0/                     # one subdir per sample (mean@N evaluation)
│   │   ├── ifeval/       # summary.json, responses.jsonl
│   │   ├── ifbench/      # summary.json, responses.jsonl, eval_results_*.jsonl
│   │   ├── writingbench/ # responses.jsonl, scores.jsonl, summary.json
│   │   ├── healthbench/  # responses.jsonl, scores.jsonl, summary.json
│   │   ├── arena-hard/   # model_answer/, model_judgment/, summary.json
│   │   └── alpaca-eval/  # model_answer/, model_judgment/, summary.json
│   ├── run_1/ ...                 # up to run_{N-1}
│   ├── ifeval/summary_agg.json    # aggregated mean / std / sem / per_run
│   ├── healthbench/summary_agg.json
│   └── ...
├── step_240/
│   └── ...
└── plots/
    ├── ifeval.png
    ├── ifbench.png
    ├── healthbench.png
    ├── writingbench.png
    ├── arena-hard.png
    ├── alpaca-eval.png
    └── all_tasks.png

run_k/ holds the k-th sample's raw artifacts; summary_agg.json at the step root is what plotting consumes. With N=1 everything still works but error bars collapse to zero width.

Sampling variance (mean@N + error bars)

batch_eval.sh runs each checkpoint N times per task and then aggregates. Because the same live vLLM server handles all N samples, prefix caching amortises prefill — wall time is roughly decode(N)×, not cold starts.

Per-task defaults (override with env vars):

Task Default N Why
ifeval / ifbench 8 Rule-based scoring, cost is only GPU decode
healthbench 8 Rubric-based, judge cost 8× but gives honest error bars
writingbench 4 Large rubric per prompt; 4 samples is usually enough
arena-hard / alpaca-eval 1 These already report internal bootstrap CI; extra sampling rarely helps

Override any of them:

N_SAMPLES_HEALTHBENCH=4 N_SAMPLES_WRITINGBENCH=1 bash examples/batch_eval.sh

Set them all to 1 to reproduce the original single-run behavior.

Plotting

After inference + judge + aggregate, combine eval results from any set of checkpoints into training curves. Steps that carry a summary_agg.json get error bars automatically; those without fall back to a plain line.

python tools/plot_training_curves.py \
  --runs "run_a=outputs/run_a" \
  --runs "run_b=outputs/run_b" \
  --name-pattern "run_a=step_{step}" \
  --name-pattern "run_b=step_{step}" \
  --steps "120,240,360,480,600" \
  --tasks "ifeval,ifbench,healthbench,writingbench,arena-hard,alpaca-eval" \
  --plot-dir outputs/plots \
  --show-errorbar ci95          # ci95 (1.96·SEM) | sem | std | none

batch_eval.sh runs aggregate_runs.py during its plotting phase. To aggregate manually:

python tools/aggregate_runs.py \
  --out-dir outputs/run_a \
  --steps   120,240,360,480,600 \
  --tasks   ifeval,ifbench,healthbench,writingbench,arena-hard,alpaca-eval \
  --n-samples ifeval=8,ifbench=8,healthbench=8,writingbench=4,arena-hard=1,alpaca-eval=1

Judge comparison

Compare scores from different judge models:

python tools/judge_compare.py \
  --judges flash=outputs/qwen3-4B \
  --judges plus=outputs/qwen3-4B-judge-qwen-plus \
  --out outputs/judge_compare.json

Global request throttle

When running many judge jobs in parallel (e.g. 5 background eval-framework processes), all remote API requests share a file-lock-based global throttle to prevent 429 rate-limit errors.

Env var Default Description
MIN_INTERVAL_S 0.005 (≈200 QPS) Minimum interval between consecutive API requests across all threads/processes
EVAL_THROTTLE_STATE_PATH /tmp/eval_framework_global_throttle.state Shared state file path; processes using the same path share one throttle
export MIN_INTERVAL_S=0.01          # ~100 QPS global cap
export EVAL_THROTTLE_STATE_PATH=/tmp/eval_framework_global_throttle.state

Set MIN_INTERVAL_S=0 to disable throttling entirely.

Notes

  • --output-dir controls where responses/scores/summaries go. With --tasks, output is written to <output-dir>/<task>/.
  • If you set --served-model-name in vllm serve, pass that same name via --model.
  • IFBench test data is bundled at tasks/ifbench/data/IFBench_test.jsonl. The AllenAI verifier source resolves from .external/IFBench unless you pass --ifbench-dir.
  • Arena-Hard questions and baselines (o3-mini-2025-01-31, gemini-2.0-flash-001 for v2.0) are bundled at tasks/arena_hard/data/. Falls back to .external/arena-hard-auto if present. Override with --arena-hard-dir to use a custom repo (e.g. a newer bench version).
  • AlpacaEval reference outputs auto-download from HuggingFace. Override with --alpaca-eval-reference.
  • IFBench also needs emoji + syllapy installed (included in pyproject.toml deps).
  • setuptools<81 is pinned because syllapy depends on pkg_resources which was removed in setuptools 82.

License And Third-Party Assets

The framework code is released under Apache-2.0. Bundled benchmark assets remain under their original upstream licenses and citation requirements. Before redistributing modified benchmark data, check the upstream projects for the current license and attribution terms.