research: Fix train/inference semantic mismatch (pre-emission vs emitted-token) #13

**Current state:**
- Training: `h_last(full_code_snippet) → vulnerable/safe` (one snippet-level label, last-token activation)
- Inference: `h_before_next_token → risk for the token about to be sampled`

These are different objectives: at inference the probe is asked to do something it was never trained for. That mismatch is the likely source of the per-token noise the writeup acknowledges.

Two options:

**Option A: emitted-token risk.** Feed the generated token back in, then score the hidden state at that token's position: `h_after_token_t → risk(token_t)`. This is closer to what the probe was trained on, since training also reads the activation at a token the model has already consumed. It loses the "before the token lands" framing.
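The difference between Option A and the current inference path comes down to which position's hidden state is scored for token `t`. The sketch below is illustrative only: the function names, the linear-plus-sigmoid probe form, and the convention that `hidden[i]` is the activation after the model has consumed tokens `0..i` are assumptions, not code from the repo.

```python
import math
from typing import Sequence


def probe_score(h: Sequence[float], w: Sequence[float], b: float) -> float:
    """Hypothetical linear probe: sigmoid(w . h + b), computed stably."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def emitted_token_risk(hidden, t: int, w, b: float) -> float:
    """Option A: score the state at token t's own position (h_after_token_t)."""
    return probe_score(hidden[t], w, b)


def pre_emission_risk(hidden, t: int, w, b: float) -> float:
    """Current inference: score the state before token t is sampled.

    Assumes hidden[t - 1] is the last prefix state, so t must be >= 1.
    """
    if t < 1:
        raise ValueError("no prefix state exists before the first token")
    return probe_score(hidden[t - 1], w, b)
```

For the same token index `t`, the two functions read adjacent positions, so a probe trained on one convention is evaluated one step off under the other.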

**Option B: pre-token risk.** Keep current inference semantics and retrain on prefixes: `h_before_token_t → do the next K tokens enter a vulnerable span?` Requires per-position prefix labels. The hackathon's narrative depends on this option being viable.

Note: `src/train_probe_spanmax.py` (already in repo, see issue #1) implements span-max, which is closer to Option B but labels positions inside the span rather than positions whose next K tokens enter it.
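To make the labelling difference concrete, here is a minimal sketch of both schemes built from a per-token span mask. Everything in it is hypothetical: `in_span`, `inside_span_labels`, and `next_k_labels` are illustrative names, `inside_span_labels` is a stand-in for the inside-the-span labelling described above rather than the actual contents of `src/train_probe_spanmax.py`, and "enter a vulnerable span" is read as "any of the next K tokens lies in a span", which also marks positions that are already inside one.

```python
from typing import List


def inside_span_labels(in_span: List[bool]) -> List[int]:
    """Inside-the-span labels: position i is positive iff token i is in a span."""
    return [int(flag) for flag in in_span]


def next_k_labels(in_span: List[bool], k: int) -> List[int]:
    """Option B labels: position i is positive iff any of tokens i+1..i+k is in a span.

    Position i stands for the hidden state after consuming tokens 0..i,
    i.e. the state the probe sees before token i+1 is sampled.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [int(any(in_span[i + 1 : i + 1 + k])) for i in range(len(in_span))]


# Toy mask: tokens 3 and 4 form the vulnerable span.
in_span = [False, False, False, True, True, False]
print(inside_span_labels(in_span))   # [0, 0, 0, 1, 1, 0]
print(next_k_labels(in_span, k=2))   # [0, 1, 1, 1, 0, 0]
```

With next-K labels the positives shift earlier by up to K positions, so the probe is trained to fire before the span begins, which is the behaviour the pre-token framing needs.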

DoD: pick A or B (B preferred for the narrative); retrain; verify per-token visualisation makes intuitive sense.
