Current state:
- Training:
h_last(full_code_snippet) → vulnerable/safe (one snippet-level label, last-token activation)
- Inference:
h_before_next_token → risk for the token about to be sampled
These are different objectives. The probe is being asked at inference to do something it was never trained for. Hence the per-token noise the writeup acknowledges.
Two options:
Option A — emitted-token risk. Feed the generated token back in, then score the hidden state AT that token's position. h_after_token_t → risk(token_t). Matches what the probe was trained on. Loses the "before the token lands" framing.
Option B — pre-token risk. Keep current inference semantics, retrain on prefixes: h_before_token_t → does the next K tokens enter a vulnerable span? Requires per-position prefix labels. The hackathon's narrative depends on this option being viable.
Note: src/train_probe_spanmax.py (already in repo, see issue #1) implements span-max which is closer to Option B but uses inside-the-span labels rather than next-K-tokens.
DoD: pick A or B (B preferred for the narrative); retrain; verify per-token visualisation makes intuitive sense.
Current state:
h_last(full_code_snippet) → vulnerable/safe(one snippet-level label, last-token activation)h_before_next_token → risk for the token about to be sampledThese are different objectives. The probe is being asked at inference to do something it was never trained for. Hence the per-token noise the writeup acknowledges.
Two options:
Option A — emitted-token risk. Feed the generated token back in, then score the hidden state AT that token's position.
h_after_token_t → risk(token_t). Matches what the probe was trained on. Loses the "before the token lands" framing.Option B — pre-token risk. Keep current inference semantics, retrain on prefixes:
h_before_token_t → does the next K tokens enter a vulnerable span?Requires per-position prefix labels. The hackathon's narrative depends on this option being viable.Note:
src/train_probe_spanmax.py(already in repo, see issue #1) implements span-max which is closer to Option B but uses inside-the-span labels rather than next-K-tokens.DoD: pick A or B (B preferred for the narrative); retrain; verify per-token visualisation makes intuitive sense.