
eval: src/eval/ framework with bootstrap CI, calibration, baselines #14

Description

@mmtftr

What

Reusable evaluation framework under src/eval/, adapted from the methodology in the hallucination-probes paper (Obeso et al. 2025, arxiv.org/abs/2509.03531) and split along two paths so the old (sample-level) probe and the new (token-level) probe are clearly separated.

Two evaluation paths:

| | Sample-level (legacy / baseline) | Token-level (new / primary) |
|---|---|---|
| Dataset | data/pairs.jsonl | data/dataset.jsonl (SVEN-paired, char-range token_labels) |
| Activations | data/activations_v2/activations_layer*.npz (last-token) | per-token vectors (blocked on #17) |
| Probe | data/probe.npz (shipped sample-level probe) | output of src/train_probe_spanmax.py |
| Entry | src.eval.protocol.full_report / scripts/run_eval_framework.py | src.eval.token_protocol.full_token_report / scripts/run_token_eval.py |
| Smoke test | scripts/test_eval_framework.py | scripts/test_token_eval.py |
| Notebooks | eval_probe_report, eval_calibration, eval_baselines | eval_token_probe |

The old sample-level probe is wired in as baselines.ProbeBaseline / BroadcastProbeBaseline so it shows up alongside random / length / regex in either path's comparison.

Metrics added vs. the existing scripts/eval_splits.py

  • Bootstrap 95% CI on AUC — with 150–700 test examples per split, point estimates lie.
  • Recall at fixed FPR (0.05 / 0.10 / 0.20) — operational metric for the streaming UI; the paper's headline.
  • Calibration (Brier + ECE, reliability curve) — the UI uses a fixed threshold, so probabilities must mean something, not just rank.
  • Three aggregation levels (token-level path): all / span / span_max per paper Section 4. The SVEN corpus has zero sanitizer annotations, so span_max gives each negative example there a single whole-file negative span; that keeps span_max a proper binary classification task.
  • Trivial baselines (random / length / regex) + shipped sample-level probe on every split.
  • Leakage-aware splits: pair_group_key(row) falls back from _origin_repo to (_file_name, _func_name) to row identity, so paired safe/vulnerable variants never split across train/test on either corpus.
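
The first two metrics in the list above can be sketched as follows. This is a dependency-free illustration, not the code in src/eval/metrics.py: the function names (`rank_auc`, `percentile_bootstrap_auc_ci`, `recall_at_fpr`), the choice of a percentile bootstrap, and the defaults (1000 resamples, seed 0) are assumptions made for the example, and the labels and scores are toy values.

```python
import random


def rank_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of examples, which is
    acceptable for a sketch but slow for large bootstrap runs.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def percentile_bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI on AUC (assumed method, resampling examples)."""
    rank_auc(labels, scores)  # fail early if only one class is present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # resample drew a single class: redraw
            continue
        stats.append(rank_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def recall_at_fpr(labels, scores, max_fpr):
    """Best recall over all thresholds whose false-positive rate is <= max_fpr.

    An example is predicted positive when its score is >= the threshold.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        fpr = sum(s >= t for s in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(s >= t for s in pos) / len(pos))
    return best


# Toy data for illustration only (not taken from the repo).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
print("AUC:", rank_auc(labels, scores))
print("95% CI:", percentile_bootstrap_auc_ci(labels, scores, n_boot=200))
print("recall@FPR<=0.25:", recall_at_fpr(labels, scores, 0.25))
```

With only a few hundred test examples per split, the interval returned by the bootstrap is typically wide enough that small AUC differences between layers or baselines should not be read as real.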

Layout

src/eval/
  metrics.py         compute_clf_metrics, bootstrap_auc_ci, calibration_metrics, reliability_curve, span_max_metrics
  splits.py          random / group_repo / heldout_cwe / heldout_lang / heldout_source; pair_group_key
  baselines.py       RandomBaseline, LengthBaseline, RegexBaseline, ProbeBaseline, BroadcastProbeBaseline
  probe_io.py        Probe dataclass, load_probe, load_activations, load_pairs, fit_logreg_on_split
  protocol.py        Sample-level: evaluate_split, full_report
  token_data.py      Token-level dataset: load_token_dataset, parse_spans, char_spans_to_token_spans
  token_protocol.py  Token-level: evaluate_token_split, full_token_report (all / span / span_max)
  report.py          Markdown + JSON renderers
  runtime.py         Per-token probe overhead measurement
  README.md          Documents both paths

scripts/
  run_eval_framework.py    Sample-level CLI (refit / fixed-probe modes)
  run_token_eval.py        Token-level CLI (per-row probs/offsets .npz inputs)
  test_eval_framework.py   Sample-level smoke test
  test_token_eval.py       Token-level smoke test (oracle on real spans)

notebooks/  (jupytext .py source + executed .ipynb)
  eval_probe_report   sample-level top-line + per-layer + ROC + headline
  eval_calibration    sample-level reliability + threshold sweep
  eval_baselines      sample-level probe vs trivial baselines
  eval_token_probe    token-level three-level table (demo + real modes)
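
For readers unfamiliar with the calibration numbers that metrics.py and the eval_calibration notebook report, here is a minimal sketch of Brier score, a reliability curve, and ECE. It assumes equal-width probability bins and 10 bins by default; the function names and signatures are hypothetical and may not match calibration_metrics or reliability_curve in the repo.

```python
def brier_score(labels, probs):
    """Mean squared error between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(labels, probs)) / len(labels)


def reliability_bins(labels, probs, n_bins=10):
    """Equal-width bins over [0, 1] (an assumed binning scheme).

    Returns one (mean_predicted_prob, fraction_positive, count) tuple per
    non-empty bin; plotting the first two gives a reliability curve.
    """
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[k].append((p, y))
    out = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            frac_pos = sum(y for _, y in b) / len(b)
            out.append((mean_p, frac_pos, len(b)))
    return out


def expected_calibration_error(labels, probs, n_bins=10):
    """Count-weighted average gap between confidence and observed frequency."""
    n = len(labels)
    return sum(
        count / n * abs(mean_p - frac_pos)
        for mean_p, frac_pos, count in reliability_bins(labels, probs, n_bins)
    )


# Toy example: a perfectly ranked but maximally overconfident predictor.
toy_labels = [1, 0]
toy_probs = [1.0, 1.0]
print(brier_score(toy_labels, toy_probs))                 # 0.5
print(expected_calibration_error(toy_labels, toy_probs))  # 0.5
```

Ranking metrics such as AUC do not change under any monotone rescaling of the scores, which is why a fixed-threshold UI needs these numbers in addition to AUC.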

Findings worth flagging in WRITEUP.md

Running scripts/run_eval_framework.py on data/activations_v2/activations_layer17.npz reproduces existing numbers and surfaces:

  1. Length baseline gets AUC 0.88–0.94 on every split. Positives are systematically longer than negatives — the probe's headline AUC inherits part of that.
  2. Regex baseline beats the probe on heldout_cwe::CWE-328 (0.991 vs 0.944) — every weak-hash positive contains md5(.
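
To make the two findings concrete, here is a minimal sketch of what a length baseline and a regex baseline can look like. It is not the LengthBaseline / RegexBaseline code in src/eval/baselines.py: the scoring functions, the `md5(` pattern, and the four code strings are all made up for illustration.

```python
import re

# Hypothetical weak-hash pattern; the repo's RegexBaseline may use others.
WEAK_HASH = re.compile(r"\bmd5\s*\(")


def length_score(code):
    """Score an example by character count alone."""
    return float(len(code))


def regex_score(code):
    """Score 1.0 if the weak-hash pattern appears anywhere, else 0.0."""
    return 1.0 if WEAK_HASH.search(code) else 0.0


def rank_auc(labels, scores):
    """ROC AUC as the fraction of correctly ordered (pos, neg) pairs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy rows (code, label), invented for this sketch: the positives are both
# longer and contain md5(, mimicking the confound described above.
rows = [
    ("h = hashlib.sha256(data).hexdigest()", 0),
    ("return sha256(x)", 0),
    ("digest = hashlib.md5(password.encode()).hexdigest()  # legacy", 1),
    ("token = md5(user + secret).hexdigest(); cache[token] = user", 1),
]
labels = [y for _, y in rows]
print("length AUC:", rank_auc(labels, [length_score(c) for c, _ in rows]))
print("regex AUC: ", rank_auc(labels, [regex_score(c) for c, _ in rows]))
```

On this toy set both trivial scorers separate the classes perfectly without looking at any model activations, which is the kind of shortcut the baseline comparison is meant to expose.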

DoD

  • src/eval/ modules + README
  • Sample-level CLI + smoke test
  • Token-level CLI + smoke test (synthetic oracle on real data/dataset.jsonl spans)
  • Four notebooks (.py source + executed .ipynb)
  • data/eval/report.{md,json} regenerable in one command on the sample-level path
  • Old probe wired as ProbeBaseline / BroadcastProbeBaseline on both paths
  • Bootstrap CIs + calibration + recall@FPR + three aggregation levels
  • pair_group_key fallback for SVEN-style data
  • End-to-end token-level numbers (blocked on bug: extract_token_activations.py ignores token_labels and falls back to last-5-tokens labelling #17)
  • Hook RegexBaseline into the streaming demo as a free fallback signal (optional follow-up)
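
The pair_group_key fallback listed above can be sketched roughly as below. This assumes the fallback order is _origin_repo, then (_file_name, _func_name), then row identity, and that row identity can be approximated by a content hash; `pair_group_key_sketch` and `group_split` are hypothetical names, not the functions in src/eval/splits.py.

```python
import hashlib
import json
import random


def pair_group_key_sketch(row):
    """Return a grouping key so paired variants land in the same split."""
    repo = row.get("_origin_repo")
    if repo:
        return f"repo:{repo}"
    file_name, func_name = row.get("_file_name"), row.get("_func_name")
    if file_name and func_name:
        return f"func:{file_name}::{func_name}"
    # Last resort (assumption): identify the row by a hash of its content.
    blob = json.dumps(row, sort_keys=True, default=str)
    return "row:" + hashlib.sha1(blob.encode("utf-8")).hexdigest()


def group_split(rows, test_frac=0.2, seed=0):
    """Assign whole groups to train or test so no group straddles the split."""
    groups = sorted({pair_group_key_sketch(r) for r in rows})
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(test_frac * len(groups)))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if pair_group_key_sketch(r) not in test_groups]
    test = [r for r in rows if pair_group_key_sketch(r) in test_groups]
    return train, test


# Toy rows: a safe/vulnerable pair from the same function, one row with a
# repo, and one row with no provenance fields at all.
rows = [
    {"_file_name": "a.c", "_func_name": "f", "label": 0},
    {"_file_name": "a.c", "_func_name": "f", "label": 1},
    {"_origin_repo": "org/x", "label": 1},
    {"label": 0, "code": "x"},
]
train, test = group_split(rows, test_frac=0.34)
print(len(train), len(test))
```

Splitting by group rather than by row is what keeps the safe and vulnerable variants of the same function on the same side of the train/test boundary.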

Related

Commits: 969993d initial framework, ffdf09c review fixes, 1f22aa4 token-level path.
