eval: src/eval/ framework with bootstrap CI, calibration, baselines

## What

Reusable evaluation framework under `src/eval/`, adapted from the methodology in the hallucination-probes paper (Obeso et al. 2025, [arxiv.org/abs/2509.03531](https://arxiv.org/abs/2509.03531)) and split along two paths so the old (sample-level) probe and the new (token-level) probe are clearly separated.

Two evaluation paths:

|   | Sample-level (legacy / baseline) | Token-level (new / primary) |
|---|---|---|
| Dataset | `data/pairs.jsonl` | `data/dataset.jsonl` (SVEN-paired, char-range `token_labels`) |
| Activations | `data/activations_v2/activations_layer*.npz` (last-token) | per-token vectors (blocked on #17) |
| Probe | `data/probe.npz` (shipped sample-level probe) | output of `src/train_probe_spanmax.py` |
| Entry | `src.eval.protocol.full_report` / `scripts/run_eval_framework.py` | `src.eval.token_protocol.full_token_report` / `scripts/run_token_eval.py` |
| Smoke test | `scripts/test_eval_framework.py` | `scripts/test_token_eval.py` |
| Notebooks | `eval_probe_report`, `eval_calibration`, `eval_baselines` | `eval_token_probe` |

The old sample-level probe is wired in as `baselines.ProbeBaseline` / `BroadcastProbeBaseline` so it shows up alongside random / length / regex in either path's comparison.

## Metrics added vs. the existing `scripts/eval_splits.py`

- **Bootstrap 95% CI on AUC** — with 150–700 test examples per split, point estimates lie.
- **Recall at fixed FPR** (0.05 / 0.10 / 0.20) — operational metric for the streaming UI; the paper's headline.
- **Calibration** (Brier + ECE, reliability curve) — the UI uses a fixed threshold, so probabilities must mean something, not just rank.
- **Three aggregation levels** (token-level path): `all` / `span` / `span_max` per paper Section 4. `span_max` pads negative examples with a whole-file negative span on the SVEN corpus (zero sanitizer annotations there) so it remains a proper binary classifier.
- **Trivial baselines** (random / length / regex) + **shipped sample-level probe** on every split.
- **Leakage-aware splits** with `pair_group_key(row)` falling back `_origin_repo` → `(_file_name, _func_name)` → row identity, so paired safe/vulnerable variants never split across train/test on either corpus.

## Layout

```
src/eval/
  metrics.py         compute_clf_metrics, bootstrap_auc_ci, calibration_metrics, reliability_curve, span_max_metrics
  splits.py          random / group_repo / heldout_cwe / heldout_lang / heldout_source; pair_group_key
  baselines.py       RandomBaseline, LengthBaseline, RegexBaseline, ProbeBaseline, BroadcastProbeBaseline
  probe_io.py        Probe dataclass, load_probe, load_activations, load_pairs, fit_logreg_on_split
  protocol.py        Sample-level: evaluate_split, full_report
  token_data.py      Token-level dataset: load_token_dataset, parse_spans, char_spans_to_token_spans
  token_protocol.py  Token-level: evaluate_token_split, full_token_report (all / span / span_max)
  report.py          Markdown + JSON renderers
  runtime.py         Per-token probe overhead measurement
  README.md          Documents both paths

scripts/
  run_eval_framework.py    Sample-level CLI (refit / fixed-probe modes)
  run_token_eval.py        Token-level CLI (per-row probs/offsets .npz inputs)
  test_eval_framework.py   Sample-level smoke test
  test_token_eval.py       Token-level smoke test (oracle on real spans)

notebooks/  (jupytext .py source + executed .ipynb)
  eval_probe_report   sample-level top-line + per-layer + ROC + headline
  eval_calibration    sample-level reliability + threshold sweep
  eval_baselines      sample-level probe vs trivial baselines
  eval_token_probe    token-level three-level table (demo + real modes)
```

## Findings worth flagging in WRITEUP.md

Running `scripts/run_eval_framework.py` on `data/activations_v2/activations_layer17.npz` reproduces existing numbers and surfaces:

1. **Length baseline gets AUC 0.88–0.94 on every split.** Positives are systematically longer than negatives — the probe's headline AUC inherits part of that.
2. **Regex baseline beats the probe on `heldout_cwe::CWE-328`** (0.991 vs 0.944) — every weak-hash positive contains `md5(`.

## DoD

- [x] `src/eval/` modules + README
- [x] Sample-level CLI + smoke test
- [x] Token-level CLI + smoke test (synthetic oracle on real `data/dataset.jsonl` spans)
- [x] Four notebooks (`.py` source + executed `.ipynb`)
- [x] `data/eval/report.{md,json}` regenerable in one command on the sample-level path
- [x] Old probe wired as `ProbeBaseline` / `BroadcastProbeBaseline` on both paths
- [x] Bootstrap CIs + calibration + recall@FPR + three aggregation levels
- [x] `pair_group_key` fallback for SVEN-style data
- [ ] End-to-end token-level numbers (blocked on #17)
- [ ] Hook `RegexBaseline` into the streaming demo as a free fallback signal (optional follow-up)

## Related

- Builds on the split bundle established in #12.
- Materials for the per-language coverage matrix tracked in #7.
- Token-level path's end-to-end runnability is blocked by the upstream issues:
  - #15 canonical dataset path
  - #16 Colab `--source` flag (local-only refactor)
  - #17 token-level extractor (blocks token end-to-end)
  - #18 train-side leakage in sample-level training scripts

Commits: `969993d` initial framework, `ffdf09c` review fixes, `1f22aa4` token-level path.

	Sample-level (legacy / baseline)	Token-level (new / primary)
Dataset	`data/pairs.jsonl`	`data/dataset.jsonl` (SVEN-paired, char-range `token_labels`)
Activations	`data/activations_v2/activations_layer*.npz` (last-token)	per-token vectors (blocked on #17)
Probe	`data/probe.npz` (shipped sample-level probe)	output of `src/train_probe_spanmax.py`
Entry	`src.eval.protocol.full_report` / `scripts/run_eval_framework.py`	`src.eval.token_protocol.full_token_report` / `scripts/run_token_eval.py`
Smoke test	`scripts/test_eval_framework.py`	`scripts/test_token_eval.py`
Notebooks	`eval_probe_report`, `eval_calibration`, `eval_baselines`	`eval_token_probe`

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

eval: src/eval/ framework with bootstrap CI, calibration, baselines #14

What

Metrics added vs. the existing `scripts/eval_splits.py`

Layout

Findings worth flagging in WRITEUP.md

DoD

Related

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

eval: src/eval/ framework with bootstrap CI, calibration, baselines #14

Description

What

Metrics added vs. the existing scripts/eval_splits.py

Layout

Findings worth flagging in WRITEUP.md

DoD

Related

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

Metrics added vs. the existing `scripts/eval_splits.py`