research: Fix train/inference semantic mismatch (pre-emission vs emitted-token) #13

Description

@peaktwilight

Current state:

  • Training: h_last(full_code_snippet) → vulnerable/safe (one snippet-level label, last-token activation)
  • Inference: h_before_next_token → risk for the token about to be sampled

These are different objectives: at inference the probe is asked to score a pre-emission state, something it was never trained to do. That mismatch is the likely source of the per-token noise the writeup acknowledges.

Two options:

Option A: emitted-token risk. Feed the generated token back in, then score the hidden state at that token's position: h_after_token_t → risk(token_t). Matches what the probe was trained on. Loses the "before the token lands" framing.
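To make the difference between the two attributions concrete, here is a minimal sketch of how the same per-position probe scores map onto tokens under each reading. The function name `attribute_scores` and the mode strings are hypothetical, and the sketch assumes `scores[i]` is the probe output on the hidden state at position i (after the model has consumed `tokens[0..i]`), with the sequence starting from an empty context.

```python
from typing import List, Optional, Tuple


def attribute_scores(
    tokens: List[str], scores: List[float], mode: str
) -> List[Tuple[str, Optional[float]]]:
    """Pair each token with the probe score attributed to it.

    Assumption: scores[i] was computed from the hidden state at position i,
    i.e. after tokens[0..i] were consumed.

    mode="emitted": tokens[i] gets scores[i]      (Option A, h_after_token_t)
    mode="pre":     tokens[i] gets scores[i - 1]  (current inference semantics)
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if mode == "emitted":
        return list(zip(tokens, scores))
    if mode == "pre":
        # The first token has no preceding state in this sketch, so it gets None.
        shifted: List[Optional[float]] = [None] + list(scores[:-1])
        return list(zip(tokens, shifted))
    raise ValueError(f"unknown mode: {mode!r}")


tokens = ["os", ".", "system", "("]
scores = [0.1, 0.2, 0.9, 0.3]
emitted = attribute_scores(tokens, scores, "emitted")
pre = attribute_scores(tokens, scores, "pre")
```

With these made-up numbers, the 0.9 score lands on `system` under the emitted-token reading and on `(` under the pre-emission reading, which is the one-position shift the per-token visualisation has to account for.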

Option B: pre-token risk. Keep the current inference semantics and retrain on prefixes: h_before_token_t → do the next K tokens enter a vulnerable span? Requires per-position prefix labels. The hackathon's narrative depends on this option being viable.
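The per-position prefix labels could be built roughly as follows. This is a sketch under stated assumptions, not the repo's code: `prefix_labels` is a hypothetical helper, vulnerable spans are assumed to be given as half-open token index ranges `[start, end)`, and `labels[t]` is meant to be paired with h_before_token_t (the hidden state after consuming `tokens[0:t]`).

```python
from typing import List, Tuple


def prefix_labels(n_tokens: int, spans: List[Tuple[int, int]], k: int) -> List[int]:
    """labels[t] = 1 if any of tokens t .. t+k-1 lies inside a vulnerable span.

    spans are half-open token ranges [start, end); this format is an assumption.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    labels = []
    for t in range(n_tokens):
        window_end = min(t + k, n_tokens)  # look-ahead window is [t, window_end)
        # Two half-open intervals overlap iff each starts before the other ends.
        hit = any(start < window_end and t < end for start, end in spans)
        labels.append(int(hit))
    return labels


# Six tokens, one vulnerable span covering tokens 3 and 4, look-ahead K = 2.
labels = prefix_labels(6, [(3, 5)], k=2)
```

The label turns positive one position before the span starts (K - 1 positions in general), which is what gives the probe a chance to fire before the risky token lands.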

Note: src/train_probe_spanmax.py (already in repo, see issue #1) implements span-max which is closer to Option B but uses inside-the-span labels rather than next-K-tokens.
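The gap between the two labelling schemes can be sketched as below. This does not reproduce `src/train_probe_spanmax.py`; it assumes "inside-the-span" means position t is positive only when token t itself is in a vulnerable span, and all function names here are hypothetical.

```python
from typing import List


def inside_span_labels(in_span: List[bool]) -> List[int]:
    """Assumed span-max style labelling: positive only at in-span tokens."""
    return [int(flag) for flag in in_span]


def next_k_labels(in_span: List[bool], k: int) -> List[int]:
    """Option B style labelling: positive if any of tokens t .. t+k-1 is in a span."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [int(any(in_span[t:t + k])) for t in range(len(in_span))]


def lead_in_positions(in_span: List[bool], k: int) -> List[int]:
    """Positions that are positive under next-K labels but negative inside-span."""
    inside = inside_span_labels(in_span)
    ahead = next_k_labels(in_span, k)
    return [t for t, (a, b) in enumerate(zip(inside, ahead)) if b and not a]


mask = [False, False, False, True, True, False]
lead_in = lead_in_positions(mask, k=3)
```

Under this assumption the two schemes differ only on the K - 1 positions leading into each span, so those are the examples a retrained Option B probe would gain.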

DoD (definition of done): pick A or B (B preferred for the narrative); retrain the probe; verify that the per-token visualisation makes intuitive sense.

Labels: research (Research / experiments / paper-tracking)
