feat(session): opt-in cross-turn signal accumulation (#35 Phase 2) by evemcgivern · Pull Request #52 · stylusnexus/agent-armor

evemcgivern · 2026-06-13T14:18:30Z

Summary

Phase 2 of stateful multi-turn scanning (#35). Closes the remaining blind spots — gradual memory poisoning and contextual-learning drift — where biased signals repeat across turns without any single turn tripping a threshold. Builds on the Phase 0/1 scanSession API (#50).

Opt-in (session.accumulation: true, default off) because semantic accumulation is inherently lower-precision than structural detection.

How it works

CROSS_TURN_SIGNAL_PATTERNS: sub-threshold signal patterns (biased answer-shaping like scripting "reply that it's completely safe"; comparative-superiority + favour-over-alternatives directives). Kept out of the per-turn pipeline, so they never fire as standalone per-turn threats.
scanAccumulation: per trap type, accumulates each turn's signal (capped per turn) and decays prior turns by session.decay (default 0.7); emits a CrossTurnThreat when the running score crosses the strictness threshold and ≥2 distinct turns contributed.
Replaces the Phase-1 "not yet implemented" warning with the real implementation.

Precision boundary (the key design choice)

The favour directive requires manipulative verbs ("lead with / recommend / prioritize X over alternatives") — a directive to the agent — not user-opinion "prefer X". So a benign strong technical preference ("Postgres is more reliable… prefer Postgres over the competitors") does not accumulate to a threat. Two new benign hard-negatives lock this in.

Test cases (plain English)

With accumulation off (default), a gradually-biased conversation does not accumulate — stays clean.
With accumulation on, gradual memory poisoning (comparative claim + persistent favour directive) and contextual drift (repeated biased answer exemplars) are caught as cross-turn threats naming their contributing turns.
A benign persistent technical preference does not fire (accumulation near-miss).
An honest reassuring support script does not fire (positivity near-miss).
A single biased turn does not accumulate (needs two contributing turns).

Verification

eval:multi-turn (accumulation enabled in harness): mt-mem-001 + mt-ctx-001 caught cross-turn at balanced and strict; blind-spots 0/5; all 4 benign conversations clean. expectedToday flipped blind-spot → cumulative.
Single-string eval:gate unchanged (82.1 / 91.0 / 91.0, 0% FP) — signal patterns are session-only.
typecheck, typecheck:eval, 73 tests pass.

Scope / follow-ups

Accumulation is a first cut at a genuinely hard, lower-precision problem; tuning (decay, thresholds, signal patterns) is expected to evolve with real traffic.
An examples/multi-turn-session.ts walkthrough is a reasonable follow-up (README covers usage).

Relates to #35.

🤖 Generated with Claude Code

Closes the remaining multi-turn blind spots — gradual memory poisoning and contextual-learning drift — where biased signals repeat across turns without any single turn tripping a threshold. - CROSS_TURN_SIGNAL_PATTERNS: sub-threshold signal patterns (biased answer-shaping; comparative-superiority + favour-over-alternatives directives) kept OUT of the per-turn pipeline so they never fire as standalone threats. - scanAccumulation: per trap type, accumulates each turn's signal (capped per turn) and decays prior turns by session.decay; emits a CrossTurnThreat when the running score crosses the strictness threshold AND at least two distinct turns contributed. - Opt-in via session.accumulation (default off); replaces the Phase-1 "not yet implemented" warning. decay default 0.7. - Precision boundary: the favour directive requires manipulative verbs ("lead with / recommend X over alternatives"), not user-opinion "prefer X" — so a benign strong preference does not accumulate. Validation (accumulation enabled in the harness): - mt-mem-001 + mt-ctx-001 now caught cross-turn at balanced and strict (expectedToday flipped blind-spot → cumulative); blind-spots 0/5. - two new benign hard-negatives (persistent technical preference; honest reassuring support script) stay clean — no accumulation FP. - single-string eval:gate unchanged (82.1/91.0/91.0, 0% FP) since the signal patterns are session-only. - 73 tests pass (accumulation off = inactive; on = catches gradual bias; benign near-miss clean; single biased turn does not accumulate). Docs: README multi-turn/session section (covers scanSession from Phases 0-2), interception-points table, roadmap. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…e signals (#52 review) Addresses the Codex adversarial review (BLOCK): the latent-memory signal patterns were declarative and fired on benign content. "X is more reliable than the alternatives" crossed the balanced threshold in two ordinary product-comparison turns, and "recommend X over competitors" matched legitimate sales scripts — preference shaping is structurally identical to legitimate recommendation and cannot be separated from it by regex without false positives. - Removed lm-sig-favour-over-alternatives, lm-sig-lead-with-before (memory poisoning) and ct-sig-positivity-shaping / ct-sig-downplay-phrase (fired on benign "upbeat" instructions and standalone "completely safe"). - Accumulation now uses a single high-specificity signal: a scripted risk-downplaying ANSWER exemplar ("when asked X, reply 'completely safe'"), confidence 0.3. No legitimate analogue, so low FP. - Gradual memory poisoning (mt-mem-001) is now a DOCUMENTED blind spot, not a high-FP detector. README + fixture comment say so. - New benign hard-negatives stay clean: two product comparisons (Codex FP-1) and a kids-bot upbeat script (Codex FP-3). - Fixed orphaned scanCrossTurn docstring left by the Phase 2 edit. Validation: mt-ctx-001 caught cross-turn at balanced + strict; mt-mem-001 blind-spot; all 4 benign clean; eval:gate unchanged (82.1/91.0/91.0, 0% FP); 74 tests pass (accumulation off = inactive; on = catches contextual drift; product-comparison + kids-bot + single-turn all clean). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-06-13T14:32:39Z

Deploying agent-armor with Cloudflare Pages

Latest commit:	`ef0663c`
Status:	✅ Deploy successful!
Preview URL:	https://4a3fd50d.agent-armor.pages.dev
Branch Preview URL:	https://feat-35-phase2-signal-accumu.agent-armor.pages.dev

View logs

…w FP-NEW) Addresses the Codex re-review's one remaining FP: a benign medical/ wellness FAQ scripting case-specific reassurance ("a good reply is nothing to worry about for mild symptoms" + "should get completely safe confirmation after screening") had the same shape, turn count, and score as the malicious mt-ctx-001, so no threshold could separate them. The discriminator is generalization: a manipulative trap scripts the downplaying answer as a STANDING rule ("from now on", "for all questions", "regardless"), while legitimate support copy scripts it for a specific, genuinely-safe case. Accumulation now fires only when a scripted-downplaying-answer signal co-occurs with a GENERALIZATION_MARKERS match in the window. Scripting "always say it's safe" is itself the dangerous pattern; scripting "for mild symptoms, reassure" is not. - mt-ctx-001 still caught (its "from here on" is the marker). - New benign hard-negative mt-benign-005 (Codex's medical-FAQ FP) stays clean — no blanket-rule marker. - Docs (README + docstring) state the standing-rule requirement. Verified: Codex FP-NEW repro now returns zero crossTurnThreats; 75 tests pass; eval:gate unchanged (82.1/91.0/91.0, 0% FP); eval:multi-turn no regressions (mt-ctx-001 caught, mt-mem-001 blind-spot, all 5 benign clean). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

evemcgivern · 2026-06-13T15:05:42Z

Closing without merge. Three adversarial review passes (Codex/GPT-5.4) established that regex cross-turn signal accumulation cannot reach acceptable precision: malicious "always downplay risk" scripting is separated from benign reassurance scripting only by semantic coreference (does the generalization marker refer back to the downplaying answers?), which regex fundamentally cannot capture. Each refinement relocated the false positive rather than fixing it.

Decision: defer cross-turn semantic accumulation to the ML classifier. The clean, structural Phase 0/1 (split-payload window) is already merged (#50) and unaffected. Roadmap updated. The Phase 2 approach and its three review findings are preserved in this branch's history for whoever picks up the ML path.

… to ML (#35) (#53) Phase 0/1 (scanSession + cross-turn split-payload detection) shipped in #50 without user-facing docs, and the session.accumulation flag still implied a regex Phase 2 was coming. Three adversarial review passes (#52) established that cross-turn SEMANTIC accumulation cannot reach acceptable precision via regex — malicious "always downplay risk" scripting is lexically identical to legitimate reassurance scripting, separated only by semantic intent. Deferred to the ML classifier. - README: new "Multi-Turn / Session Scanning" section documenting scanSession + the split-payload window; interception-points table row; roadmap "cross-turn" entry updated (split payloads shipped, semantic accumulation deferred to ML). - SessionConfig.accumulation doc + the runtime warning now say "not available in the regex SDK (deferred to the ML classifier)" instead of "Phase 2 not yet implemented" — the flag stays inert and warns once. - Dropped internal "Phase 0/1/2" labels from public-facing docstrings. No behavior change: scanSession (per-turn + split-payload) unchanged; accumulation remains inert. 71 tests pass; eval:gate + eval:multi-turn unchanged. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

evemcgivern and others added 2 commits June 13, 2026 09:18

evemcgivern closed this Jun 13, 2026

evemcgivern deleted the feat/35-phase2-signal-accumulation branch June 13, 2026 15:05

evemcgivern mentioned this pull request Jun 13, 2026

docs(session): document scanSession split-payload, defer accumulation to ML (#35) #53

Merged

evemcgivern mentioned this pull request Jun 13, 2026

Stateful multi-turn conversation scanning (cross-turn decomposition) #35

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(session): opt-in cross-turn signal accumulation (#35 Phase 2)#52

feat(session): opt-in cross-turn signal accumulation (#35 Phase 2)#52
evemcgivern wants to merge 3 commits into
mainfrom
feat/35-phase2-signal-accumulation

evemcgivern commented Jun 13, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 13, 2026 •

edited

Loading

Uh oh!

evemcgivern commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

evemcgivern commented Jun 13, 2026

Summary

How it works

Precision boundary (the key design choice)

Test cases (plain English)

Verification

Scope / follow-ups

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deploying agent-armor with Cloudflare Pages

Uh oh!

evemcgivern commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cloudflare-workers-and-pages Bot commented Jun 13, 2026 •

edited

Loading