feat(session): opt-in cross-turn signal accumulation (#35 Phase 2)#52
feat(session): opt-in cross-turn signal accumulation (#35 Phase 2)#52evemcgivern wants to merge 3 commits into
Conversation
Closes the remaining multi-turn blind spots — gradual memory poisoning
and contextual-learning drift — where biased signals repeat across turns
without any single turn tripping a threshold.
- CROSS_TURN_SIGNAL_PATTERNS: sub-threshold signal patterns (biased
answer-shaping; comparative-superiority + favour-over-alternatives
directives) kept OUT of the per-turn pipeline so they never fire as
standalone threats.
- scanAccumulation: per trap type, accumulates each turn's signal (capped
per turn) and decays prior turns by session.decay; emits a
CrossTurnThreat when the running score crosses the strictness threshold
AND at least two distinct turns contributed.
- Opt-in via session.accumulation (default off); replaces the Phase-1
"not yet implemented" warning. decay default 0.7.
- Precision boundary: the favour directive requires manipulative verbs
("lead with / recommend X over alternatives"), not user-opinion
"prefer X" — so a benign strong preference does not accumulate.
Validation (accumulation enabled in the harness):
- mt-mem-001 + mt-ctx-001 now caught cross-turn at balanced and strict
(expectedToday flipped blind-spot → cumulative); blind-spots 0/5.
- two new benign hard-negatives (persistent technical preference;
honest reassuring support script) stay clean — no accumulation FP.
- single-string eval:gate unchanged (82.1/91.0/91.0, 0% FP) since the
signal patterns are session-only.
- 73 tests pass (accumulation off = inactive; on = catches gradual bias;
benign near-miss clean; single biased turn does not accumulate).
Docs: README multi-turn/session section (covers scanSession from Phases
0-2), interception-points table, roadmap.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…e signals (#52 review) Addresses the Codex adversarial review (BLOCK): the latent-memory signal patterns were declarative and fired on benign content. "X is more reliable than the alternatives" crossed the balanced threshold in two ordinary product-comparison turns, and "recommend X over competitors" matched legitimate sales scripts — preference shaping is structurally identical to legitimate recommendation and cannot be separated from it by regex without false positives. - Removed lm-sig-favour-over-alternatives, lm-sig-lead-with-before (memory poisoning) and ct-sig-positivity-shaping / ct-sig-downplay-phrase (fired on benign "upbeat" instructions and standalone "completely safe"). - Accumulation now uses a single high-specificity signal: a scripted risk-downplaying ANSWER exemplar ("when asked X, reply 'completely safe'"), confidence 0.3. No legitimate analogue, so low FP. - Gradual memory poisoning (mt-mem-001) is now a DOCUMENTED blind spot, not a high-FP detector. README + fixture comment say so. - New benign hard-negatives stay clean: two product comparisons (Codex FP-1) and a kids-bot upbeat script (Codex FP-3). - Fixed orphaned scanCrossTurn docstring left by the Phase 2 edit. Validation: mt-ctx-001 caught cross-turn at balanced + strict; mt-mem-001 blind-spot; all 4 benign clean; eval:gate unchanged (82.1/91.0/91.0, 0% FP); 74 tests pass (accumulation off = inactive; on = catches contextual drift; product-comparison + kids-bot + single-turn all clean). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Deploying agent-armor with
|
| Latest commit: |
ef0663c
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://4a3fd50d.agent-armor.pages.dev |
| Branch Preview URL: | https://feat-35-phase2-signal-accumu.agent-armor.pages.dev |
…w FP-NEW)
Addresses the Codex re-review's one remaining FP: a benign medical/
wellness FAQ scripting case-specific reassurance ("a good reply is
nothing to worry about for mild symptoms" + "should get completely safe
confirmation after screening") had the same shape, turn count, and score
as the malicious mt-ctx-001, so no threshold could separate them.
The discriminator is generalization: a manipulative trap scripts the
downplaying answer as a STANDING rule ("from now on", "for all
questions", "regardless"), while legitimate support copy scripts it for a
specific, genuinely-safe case. Accumulation now fires only when a
scripted-downplaying-answer signal co-occurs with a GENERALIZATION_MARKERS
match in the window. Scripting "always say it's safe" is itself the
dangerous pattern; scripting "for mild symptoms, reassure" is not.
- mt-ctx-001 still caught (its "from here on" is the marker).
- New benign hard-negative mt-benign-005 (Codex's medical-FAQ FP) stays
clean — no blanket-rule marker.
- Docs (README + docstring) state the standing-rule requirement.
Verified: Codex FP-NEW repro now returns zero crossTurnThreats; 75 tests
pass; eval:gate unchanged (82.1/91.0/91.0, 0% FP); eval:multi-turn no
regressions (mt-ctx-001 caught, mt-mem-001 blind-spot, all 5 benign clean).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Closing without merge. Three adversarial review passes (Codex/GPT-5.4) established that regex cross-turn signal accumulation cannot reach acceptable precision: malicious "always downplay risk" scripting is separated from benign reassurance scripting only by semantic coreference (does the generalization marker refer back to the downplaying answers?), which regex fundamentally cannot capture. Each refinement relocated the false positive rather than fixing it. Decision: defer cross-turn semantic accumulation to the ML classifier. The clean, structural Phase 0/1 (split-payload window) is already merged (#50) and unaffected. Roadmap updated. The Phase 2 approach and its three review findings are preserved in this branch's history for whoever picks up the ML path. |
… to ML (#35) (#53) Phase 0/1 (scanSession + cross-turn split-payload detection) shipped in #50 without user-facing docs, and the session.accumulation flag still implied a regex Phase 2 was coming. Three adversarial review passes (#52) established that cross-turn SEMANTIC accumulation cannot reach acceptable precision via regex — malicious "always downplay risk" scripting is lexically identical to legitimate reassurance scripting, separated only by semantic intent. Deferred to the ML classifier. - README: new "Multi-Turn / Session Scanning" section documenting scanSession + the split-payload window; interception-points table row; roadmap "cross-turn" entry updated (split payloads shipped, semantic accumulation deferred to ML). - SessionConfig.accumulation doc + the runtime warning now say "not available in the regex SDK (deferred to the ML classifier)" instead of "Phase 2 not yet implemented" — the flag stays inert and warns once. - Dropped internal "Phase 0/1/2" labels from public-facing docstrings. No behavior change: scanSession (per-turn + split-payload) unchanged; accumulation remains inert. 71 tests pass; eval:gate + eval:multi-turn unchanged. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Phase 2 of stateful multi-turn scanning (#35). Closes the remaining blind spots — gradual memory poisoning and contextual-learning drift — where biased signals repeat across turns without any single turn tripping a threshold. Builds on the Phase 0/1
scanSessionAPI (#50).Opt-in (
session.accumulation: true, default off) because semantic accumulation is inherently lower-precision than structural detection.How it works
CROSS_TURN_SIGNAL_PATTERNS: sub-threshold signal patterns (biased answer-shaping like scripting "reply that it's completely safe"; comparative-superiority + favour-over-alternatives directives). Kept out of the per-turn pipeline, so they never fire as standalone per-turn threats.scanAccumulation: per trap type, accumulates each turn's signal (capped per turn) and decays prior turns bysession.decay(default 0.7); emits aCrossTurnThreatwhen the running score crosses the strictness threshold and ≥2 distinct turns contributed.Precision boundary (the key design choice)
The favour directive requires manipulative verbs ("lead with / recommend / prioritize X over alternatives") — a directive to the agent — not user-opinion "prefer X". So a benign strong technical preference ("Postgres is more reliable… prefer Postgres over the competitors") does not accumulate to a threat. Two new benign hard-negatives lock this in.
Test cases (plain English)
Verification
eval:multi-turn(accumulation enabled in harness):mt-mem-001+mt-ctx-001caught cross-turn at balanced and strict; blind-spots 0/5; all 4 benign conversations clean.expectedTodayflippedblind-spot→cumulative.eval:gateunchanged (82.1 / 91.0 / 91.0, 0% FP) — signal patterns are session-only.typecheck,typecheck:eval, 73 tests pass.Scope / follow-ups
examples/multi-turn-session.tswalkthrough is a reasonable follow-up (README covers usage).Relates to #35.
🤖 Generated with Claude Code