What
Reusable evaluation framework under src/eval/, adapted from the methodology in the hallucination-probes paper (Obeso et al. 2025, arxiv.org/abs/2509.03531) and split along two paths so the old (sample-level) probe and the new (token-level) probe are clearly separated.
Two evaluation paths:
| | Sample-level (legacy / baseline) | Token-level (new / primary) |
|---|---|---|
| Dataset | data/pairs.jsonl | data/dataset.jsonl (SVEN-paired, char-range token_labels) |
| Activations | data/activations_v2/activations_layer*.npz (last-token) | per-token vectors (blocked on #17) |
| Probe | data/probe.npz (shipped sample-level probe) | output of src/train_probe_spanmax.py |
| Entry | src.eval.protocol.full_report / scripts/run_eval_framework.py | src.eval.token_protocol.full_token_report / scripts/run_token_eval.py |
| Smoke test | scripts/test_eval_framework.py | scripts/test_token_eval.py |
| Notebooks | eval_probe_report, eval_calibration, eval_baselines | eval_token_probe |
The old sample-level probe is wired in as baselines.ProbeBaseline / BroadcastProbeBaseline so it shows up alongside random / length / regex in either path's comparison.
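One plausible reading of how a sample-level score can take part in the token-level comparison is sketched below. This is an assumption, not the code in baselines.py: the helper names `broadcast_sample_score` and `span_max` are hypothetical, and spans are taken to be (start, end) token indices with an exclusive end.

```python
def broadcast_sample_score(sample_score: float, n_tokens: int) -> list:
    """Repeat one sample-level probability for every token of the sample."""
    return [sample_score] * n_tokens


def span_max(token_scores, spans):
    """Return the maximum token score inside each (start, end) span, end exclusive."""
    return [max(token_scores[start:end]) for start, end in spans]


# A sample scored 0.7 by the sample-level probe, tokenized into 5 tokens.
token_scores = broadcast_sample_score(0.7, 5)
span_scores = span_max(token_scores, [(1, 3), (0, 5)])
```

Under this reading every span of a sample receives the same score, so the broadcast baseline can only rank samples against each other, never tokens within a sample.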
Metrics added vs. the existing scripts/eval_splits.py
- Bootstrap 95% CI on AUC — with 150–700 test examples per split, point estimates lie.
- Recall at fixed FPR (0.05 / 0.10 / 0.20) — operational metric for the streaming UI; the paper's headline.
- Calibration (Brier + ECE, reliability curve) — the UI uses a fixed threshold, so probabilities must mean something, not just rank.
- Three aggregation levels (token-level path): all / span / span_max per paper Section 4. span_max pads negative examples with a whole-file negative span on the SVEN corpus (zero sanitizer annotations there) so it remains a proper binary classifier.
- Trivial baselines (random / length / regex) + shipped sample-level probe on every split.
- Leakage-aware splits with pair_group_key(row), which falls back through _origin_repo → (_file_name, _func_name) → row identity, so paired safe/vulnerable variants are never split across train/test on either corpus.
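The fallback order in the last bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in splits.py: the extra `index` argument (used as the row identity), the key prefixes, and the `group_split` helper are all hypothetical.

```python
import random
from collections import defaultdict


def pair_group_key(row: dict, index: int) -> str:
    """Hypothetical fallback chain: repo, then (file, function), then row identity."""
    repo = row.get("_origin_repo")
    if repo:
        return f"repo:{repo}"
    file_name, func_name = row.get("_file_name"), row.get("_func_name")
    if file_name and func_name:
        return f"func:{file_name}::{func_name}"
    return f"row:{index}"  # no grouping information: the row is its own group


def group_split(rows, test_frac=0.2, seed=0):
    """Assign whole groups to train or test so paired variants stay together."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[pair_group_key(row, i)].append(i)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_frac * len(keys)))
    test_keys = set(keys[:n_test])
    train = sorted(i for k in keys if k not in test_keys for i in groups[k])
    test = sorted(i for k in test_keys for i in groups[k])
    return train, test


rows = [
    {"_origin_repo": "a/x", "label": 1},
    {"_origin_repo": "a/x", "label": 0},
    {"_file_name": "f.c", "_func_name": "g", "label": 1},
    {"_file_name": "f.c", "_func_name": "g", "label": 0},
    {"label": 1},
]
train_idx, test_idx = group_split(rows, test_frac=0.34)
```

Splitting on group keys rather than rows is what keeps a vulnerable function and its patched twin on the same side of the split.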
Layout
src/eval/
  metrics.py         compute_clf_metrics, bootstrap_auc_ci, calibration_metrics, reliability_curve, span_max_metrics
  splits.py          random / group_repo / heldout_cwe / heldout_lang / heldout_source; pair_group_key
  baselines.py       RandomBaseline, LengthBaseline, RegexBaseline, ProbeBaseline, BroadcastProbeBaseline
  probe_io.py        Probe dataclass, load_probe, load_activations, load_pairs, fit_logreg_on_split
  protocol.py        Sample-level: evaluate_split, full_report
  token_data.py      Token-level dataset: load_token_dataset, parse_spans, char_spans_to_token_spans
  token_protocol.py  Token-level: evaluate_token_split, full_token_report (all / span / span_max)
  report.py          Markdown + JSON renderers
  runtime.py         Per-token probe overhead measurement
  README.md          Documents both paths
scripts/
  run_eval_framework.py   Sample-level CLI (refit / fixed-probe modes)
  run_token_eval.py       Token-level CLI (per-row probs/offsets .npz inputs)
  test_eval_framework.py  Sample-level smoke test
  test_token_eval.py      Token-level smoke test (oracle on real spans)
notebooks/ (jupytext .py source + executed .ipynb)
  eval_probe_report  sample-level top-line + per-layer + ROC + headline
  eval_calibration   sample-level reliability + threshold sweep
  eval_baselines     sample-level probe vs trivial baselines
  eval_token_probe   token-level three-level table (demo + real modes)
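For readers unfamiliar with two of the metrics named under metrics.py, here is a dependency-free sketch of a percentile-bootstrap CI on AUC and of recall at a fixed FPR. The signatures and defaults are assumptions chosen for illustration; the functions in metrics.py may differ in signature and in details such as tie handling or resampling scheme. Labels are assumed to be 0/1 integers.

```python
import random
from bisect import bisect_left, bisect_right


def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = sorted(s for y, s in zip(labels, scores) if y == 0)
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        below = bisect_left(neg, p)          # negatives scored strictly lower
        tied = bisect_right(neg, p) - below  # ties count as half a win
        wins += below + 0.5 * tied
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample examples with replacement, recompute AUC."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to bootstrap AUC")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one class
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


def recall_at_fpr(labels, scores, max_fpr):
    """Best recall over thresholds t (predict positive if score >= t) with FPR <= max_fpr."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = sorted(s for y, s in zip(labels, scores) if y == 0)
    best = 0.0
    for t in sorted(set(scores)):
        fpr = (len(neg) - bisect_left(neg, t)) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= t for p in pos) / len(pos))
    return best


# Toy data for illustration only, not results from this PR.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
point = auc(labels, scores)
lo, hi = bootstrap_auc_ci(labels, scores, n_boot=200)
```

Reporting the interval next to the point estimate matters most on the smaller held-out splits, where a few examples can move AUC noticeably.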
Findings worth flagging in WRITEUP.md
Running scripts/run_eval_framework.py on data/activations_v2/activations_layer17.npz reproduces existing numbers and surfaces:
- Length baseline gets AUC 0.88–0.94 on every split. Positives are systematically longer than negatives — the probe's headline AUC inherits part of that.
- Regex baseline beats the probe on heldout_cwe::CWE-328 (0.991 vs 0.944) because every weak-hash positive contains md5(.
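To make the two findings concrete, the sketch below scores a handful of toy snippets with a length baseline and a regex baseline and computes AUC for each. The snippets, the `WEAK_HASH` pattern, and the function names are illustrative assumptions; they are not the LengthBaseline or RegexBaseline implementations and not data from the evaluated corpus.

```python
import re

# Hypothetical pattern standing in for the weak-hash rule of a regex baseline.
WEAK_HASH = re.compile(r"\bmd5\s*\(")


def length_score(code: str) -> float:
    """Length baseline: longer snippets get higher scores."""
    return float(len(code))


def regex_score(code: str) -> float:
    """Regex baseline: 1.0 if the weak-hash pattern appears, else 0.0."""
    return 1.0 if WEAK_HASH.search(code) else 0.0


def auc(labels, scores):
    """Pairwise ROC AUC; ties between a positive and a negative count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy snippets (label 1 = vulnerable), for illustration only.
samples = [
    ("h = hashlib.md5(data).hexdigest()", 1),
    ("h = hashlib.sha256(data).hexdigest()", 0),
    ("digest = md5(password.encode()).hexdigest()  # legacy path", 1),
    ("ok = True", 0),
]
labels = [y for _, y in samples]
length_auc = auc(labels, [length_score(c) for c, _ in samples])
regex_auc = auc(labels, [regex_score(c) for c, _ in samples])
```

When a single surface pattern separates a held-out class this cleanly, a high probe AUC on that split says little about what the probe has learned, which is why the baselines are reported on every split.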
DoD
- src/eval/ modules + README
- [...] data/dataset.jsonl spans)
- [...] .py source + executed .ipynb)
- data/eval/report.{md,json} regenerable in one command on the sample-level path
- ProbeBaseline / BroadcastProbeBaseline on both paths
- pair_group_key fallback for SVEN-style data
- [...] RegexBaseline into the streaming demo as a free fallback signal (optional follow-up)
Related
- [...] --source flag (local-only refactor)
Commits: 969993d initial framework, ffdf09c review fixes, 1f22aa4 token-level path.