feat(escalation): cosine retrieval gate, faithfulness gate, and deterministic deflection pipeline with E7 eval #27
Merged
…nistic deflection pipeline + E7 eval (US-046-059, ADR-0003)

Implements Epic D of the Phase-2 kit: the escalation & deflection pipeline and its E7 eval, per ADR-0003.

Runtime (backend/escalation.py):
- US-046: surface pre-fusion raw cosine on retrieval results (RRF overwrites similarity, so cosine is plumbed through before any gate).
- US-047: cosine-defined retrieval gate, pure arithmetic on scores (strong = top1_cosine >= tau_sim AND n_cleared >= n_min); no LLM/reranker dependency.
- US-048: one-call runtime faithfulness gate; an unset/failed judge fails CLOSED (escalate), never auto-sends. JUDGE_MODEL is a judge-role selector that does not chain through OPENAI_MODEL (docs/model-surface.md).
- US-049: deterministic deflection orchestrator (off-topic/empty-draft/unfaithful -> escalate; supported -> answer) with no reason leak on escalation.
- US-050: EscalationConfig (three validated global knobs) + a STANDALONE false-resolve ceiling that is structurally un-wireable into the per-request pipeline.

E7 eval (evals/retrieval/e7*.py, escalation_gold.yaml):
- US-051-057: golden-set schema + P1a/P2/P3 populations and the derived P1b viewer-parameterized no-access case (no privileged second pass exists), all classified through the real backend gate.
- US-055: consolidated deflection / false-resolve / false-escalate rates; false-resolve is the pinned safety metric, others are tunable quality.
- US-056: knob sweep + ceiling-constrained knee selection (memoized so the whole grid costs one operating point's worth of LLM calls).
- US-058: deterministic P1b non-disclosure byte-equality assertion.

CI placement (US-059):
- Per-PR deterministic tripwire in retrieval-eval.yml (P1a/P1b gate + non-disclosure byte equality, no LLM) that can fail the build.
- New escalation-eval-weekly.yml runs the full LLM-judged P2/P3 + sweep and enforces the false-resolve ceiling on the scheduled run only.
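The runtime bullets above describe a gate that is pure arithmetic on cosine scores and an orchestrator with fixed control flow. A minimal Python sketch of that shape follows. It is not the code in `backend/escalation.py`: the names `GateConfig`, `retrieval_is_strong`, and `deflect`, the knob values, and the deferral text are all hypothetical, and `n_cleared` is assumed to mean the number of results whose raw cosine reaches `tau_sim`.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical generic deferral text; the real message is not shown in this PR.
GENERIC_DEFERRAL = "I've passed this to a teammate who will follow up."


@dataclass(frozen=True)
class GateConfig:
    tau_sim: float = 0.5  # illustrative value, not the project's default
    n_min: int = 1        # illustrative value, not the project's default


def retrieval_is_strong(cosines: Sequence[float], cfg: GateConfig) -> bool:
    """strong = top1_cosine >= tau_sim AND n_cleared >= n_min (no LLM involved)."""
    if not cosines:
        return False
    top1 = max(cosines)
    # Assumption: n_cleared counts results whose raw cosine reaches tau_sim.
    n_cleared = sum(1 for c in cosines if c >= cfg.tau_sim)
    return top1 >= cfg.tau_sim and n_cleared >= cfg.n_min


def deflect(
    cosines: Sequence[float],
    draft_fn: Callable[[], str],
    judge_fn: Optional[Callable[[str], bool]],
    cfg: GateConfig,
) -> tuple[str, str]:
    """Return (action, text). Every escalation carries only the generic deferral."""
    if not retrieval_is_strong(cosines, cfg):
        return ("escalate", GENERIC_DEFERRAL)
    draft = draft_fn()
    if not draft.strip():
        return ("escalate", GENERIC_DEFERRAL)
    try:
        # An unset judge or any judge error fails CLOSED.
        faithful = judge_fn is not None and judge_fn(draft) is True
    except Exception:
        faithful = False
    if not faithful:
        return ("escalate", GENERIC_DEFERRAL)
    return ("answer", draft)
```

Because every escalation path returns the same tuple, a caller cannot tell from the output which leg triggered the escalation, which is one simple way to get the "no reason leak" property.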
# Conflicts: # .claude/agent/tasks/prd-phase2-implementation.md
…o they read weak at the retrieval gate (#27)
Retrieval eval — PR vs
| Mode | recall@5 | MRR | nDCG@5 |
|---|---|---|---|
| vector | 0.860 (±0.000) | 0.772 (±0.000) | 0.779 (±0.000) |
| keyword | 0.110 (±0.000) | 0.120 (±0.000) | 0.112 (±0.000) |
| hybrid | 0.860 (±0.000) | 0.759 (±0.000) | 0.769 (±0.000) |

Per-category recall@5

| Mode | single_chunk | multi_hop | adversarial | paraphrase |
|---|---|---|---|---|
| vector | 0.900 (±0.000) | 0.933 (±0.000) | 0.600 (±0.000) | 1.000 (±0.000) |
| keyword | 0.250 (±0.000) | 0.033 (±0.000) | 0.000 (±0.000) | 0.000 (±0.000) |
| hybrid | 0.900 (±0.000) | 0.933 (±0.000) | 0.600 (±0.000) | 1.000 (±0.000) |

Comment is updated in place on each push by .github/workflows/retrieval-eval.yml (US-035). Comment-only — never blocks the build.
This was referenced Jun 24, 2026
Intent
The developer asked the agent to implement user story US-059 from the Phase 2 implementation PRD. US-059 is a CI-placement story for the E7 escalation eval: it requires a fast, deterministic per-PR tripwire (retrieval-gate decisions plus P1b non-disclosure byte-equality checks) that can block merges, alongside a heavier weekly scheduled sweep that runs the LLM-judge legs. The developer also expected the work to honor deferral notes carried over from US-050/US-055, specifically enforcing the false-resolve ceiling as a pinned safety invariant in the runner's exit code. Implicit constraints from the developer's environment included keeping the per-PR tripwire runnable under the slim requirements-ci.txt (no Anthropic API key), matching existing conventions for runner code, tests, CI workflows, and PRD status lines, and updating documentation (docs/evals.md, README, PRD status table) accordingly.
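The per-PR tripwire described above needs no LLM because both of its checks are deterministic. As a rough sketch (not the shipped `e7_runner` code), the non-disclosure check can be thought of as comparing the bytes a no-access viewer receives against the generic deferral, with any mismatch or gate clear turned into a non-zero exit code. The function names and sample strings below are hypothetical.

```python
from typing import Sequence


def non_disclosure_leaks(p1b_outputs: Sequence[str], generic_deferral: str) -> list[int]:
    """Return indexes of P1b rows whose output is not byte-identical to the deferral."""
    expected = generic_deferral.encode("utf-8")
    return [i for i, out in enumerate(p1b_outputs) if out.encode("utf-8") != expected]


def tripwire_exit_code(n_gate_clears: int, leaks: Sequence[int]) -> int:
    """0 when no P1b row cleared the gate and none leaked; 1 otherwise (blocks the merge)."""
    return 0 if n_gate_clears == 0 and not leaks else 1


deferral = "Let me hand this to a teammate."  # hypothetical deferral text
clean = non_disclosure_leaks([deferral, deferral], deferral)
leaky = non_disclosure_leaks([deferral, "The refund limit is $500."], deferral)
```

Comparing encoded bytes rather than normalized strings means even a whitespace or punctuation difference counts as a disclosure signal, which matches the strictness implied by "byte-equality".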
What Changed
- `backend/escalation.py` implements a deterministic deflection pipeline: a cheap retrieval gate that calls retrieval "weak" using the raw pre-fusion vector cosine (tau_sim/n_min), a one-call faithfulness gate that fails closed (any judge error/refusal/parse failure escalates), and `run_deflection_pipeline` wiring them as fixed control flow (retrieve → gate → [if strong] draft → faithfulness → answer-or-escalate) with a generic deferral message carrying no reason metadata; `EscalationConfig` resolves and validates the gate knobs from env.
- `backend/retrieval.py` now carries `cosine_similarity` separately from the RRF-fused `similarity` so the gate thresholds on a real cosine, preserving it through hybrid fusion and reranking.
- E7 eval (`evals/retrieval/e7.py`, `e7_runner.py`, `escalation_gold.yaml`) that enforces the false-resolve ceiling as a pinned safety invariant in the runner exit code, gated solely on the P3 faithfulness leg, with positive-control guards that hard-fail when the P3 leg is empty or all-mislabeled so a drift-blinded gate cannot pass silently.
- Per-PR deterministic tripwire in `retrieval-eval.yml`, and a heavier weekly scheduled sweep running the LLM-judge legs in `escalation-eval-weekly.yml`; updated `docs/evals.md`, README, CONTEXT glossary, and the PRD status table.

Risk Assessment
Testing
Baseline: ran both eval unit suites under a Python 3.11 venv with the slim per-PR deps (httpx/openai/pydantic/pyyaml/asyncpg/pyjwt) — test_e7_runner (57 groups) and test_e7 (9 groups) pass, covering the real gate, metric consolidation, and the US-059 exit-code/ceiling decisions. Since no OPENAI_API_KEY is available (full retrieval needs Supabase + embeddings), I drove the real `e7_runner.main()` CLI process for all four CI scenarios with only the network/DB/draft/judge boundary stubbed deterministically; everything downstream (gate arithmetic, ceiling gate, render, exit code, sys.exit) is the shipped code. The per-PR tripwire exits 0 on a clean run (and with no `anthropic` package present) and exits 1 on a leak, blocking the merge; the weekly sweep exits 0 when healthy and exits 1 on a false-resolve ceiling breach, with the JSON snapshot written before exit and correctly driving the weekly workflow's jq issue body. Both workflow YAMLs parse and the per-PR step carries no Anthropic key. No UI surface is involved (CI/eval-runner change), so evidence is CLI transcripts, rendered markdown snapshots, and JSON artifacts rather than screenshots. Worktree left clean.
Evidence: Evidence index (scenario matrix + repro)
Evidence: E2E driver (drives real e7_runner.main(), only I/O boundary stubbed)
Evidence: Per-PR clean tripwire rendered report (exit 0)
Evidence: Per-PR leak rendered report (exit 1, blocks merge)
Evidence: Per-PR leak pinned-fail ERROR lines (CI surface)
ERROR ...: E7 P1b LEAK — 4 no-access-replay row(s) cleared the retrieval gate; the gold leaked to a no-access viewer
ERROR ...: E7 P1b NON-DISCLOSURE LEAK — 4 P1b row(s) show the customer a DIFFERENT output than the P1a generic deferral
Evidence: Weekly full-sweep rendered report (exit 0, knee curve)
Evidence: Weekly ceiling-breach pinned-fail ERROR lines
ERROR ...: E7 FALSE-RESOLVE CEILING BREACH — measured faithfulness-leg 100.0% (3/3 P3) exceeds the buyer's ceiling 5.0% (ESCALATION_FALSE_RESOLVE_CEILING, US-050).
Pipeline
Updates from git push no-mistakes
✅ **intent** - passed
✅ No issues found.
✅ **Rebase** - passed
✅ No issues found.
- `.github/workflows/retrieval-eval.yml:231` - The new E7 per-PR tripwire (e7_runner --include-p1b) is a HARD gate but has no execution-error resilience, unlike the E6 gate in the same job. Its deterministic decision is pure arithmetic, but its execution makes ~7 live OpenAI embedding calls (hybrid retrieval embeds each query) plus asyncpg chunk_acl writes for the P1b replay. e7_runner.amain has no retry/backoff and the step has no `set +e`/continue-on-error, so any transient OpenAI/DB blip exits non-zero and blocks the merge of an innocent PR. The step's own comment (lines 203-211) says "pure arithmetic on cosine scores - NO LLM judge - so a real verdict can't flake," which conflates decision-determinism with execution-reliability - exactly the distinction E6's comment (lines 17, 178-188) carefully draws and mitigates with retry+non-blocking-on-execution-error. Recommend mirroring E6's execution-vs-verdict split here, or at least documenting the accepted flake exposure.
- `evals/retrieval/e7_runner.py:1641` - The pinned safety metric (consolidated false-resolve rate) is computed over a denominator of P1a + P1b + P3 questions, but the only non-deterministic contributor to the numerator in healthy operation is the P3 faithfulness leg (P1a/P1b cleared-gate counts are 0 unless retrieval/RLS is broken). P1a and P1b are deterministically-escalated easy true-negatives, so padding the denominator with them systematically lowers the measured rate and can mask a real P3 (faithfulness-leg) false-resolve below the buyer's ceiling - the masking grows as more P2 questions are authored (each becomes a P1b row in the denominator). With the current 3/4/3 gold one P3 false-resolve = 10% and still trips the 5% ceiling, so this is latent rather than active, but it weakens the safety invariant's sensitivity for any larger gold set. Consider also surfacing/gating the faithfulness-leg false-resolve over the P3 population specifically, so the hard signal isn't diluted by easy escalations. This is the documented/tested intended definition, so flagging for confirmation rather than auto-fix.
- `evals/retrieval/e7_runner.py:2137` - run_e7_sweep computes each grid point's metrics via `compute_e7_metrics(p1a, p2, p3)` (no P1b), while the top-level weekly metrics and the enforced ceiling verdict (lines 2623/2641) include P1b. The knee's reported false-resolve and feasibility are therefore computed on a smaller (stricter) denominator than the gate that actually enforces the ceiling. Not a safety bug - the sweep is strictly more conservative, so a chosen knee always satisfies the looser enforced gate - but the recommended-knee numbers won't match what the ceiling gate measures at that config, which can confuse interpretation of the weekly snapshot. Worth a note or aligning the two computations.
🔧 Fix: gate E7 false-resolve ceiling on faithfulness-leg (P3) rate only
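The dilution concern in the second finding is plain arithmetic. The sketch below uses the population sizes quoted in this PR (3 P1a rows, 4 P1b rows derived from P2, 3 P3 rows) and the 5% ceiling; the helper names and the grown P1b count of 14 are hypothetical, chosen only to show where masking would start.

```python
from typing import Optional


def rate(numerator: int, denominator: int) -> Optional[float]:
    """A rate over an empty population is unmeasured (None), not 0%."""
    return None if denominator == 0 else numerator / denominator


CEILING = 0.05      # the 5% ceiling quoted in this review
N_P1A, N_P3 = 3, 3  # population sizes from the current gold


def consolidated(p3_false_resolves: int, n_p1b: int) -> Optional[float]:
    # Denominator padded with deterministically escalated P1a/P1b rows.
    return rate(p3_false_resolves, N_P1A + n_p1b + N_P3)


def p3_only(p3_false_resolves: int) -> Optional[float]:
    # Denominator restricted to the faithfulness-leg population.
    return rate(p3_false_resolves, N_P3)


today = consolidated(1, n_p1b=4)   # 1/10: still above the ceiling
grown = consolidated(1, n_p1b=14)  # 1/20: hypothetical growth, no longer above it
leg = p3_only(1)                   # 1/3 regardless of how many P1b rows exist
```

With the same single P3 false-resolve, the consolidated rate falls to the ceiling once enough easy escalations are added, while the P3-only rate stays put. That is the motivation for gating on the faithfulness leg alone.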
1 warning still open:
- `evals/retrieval/e7_runner.py:2761` - This commit makes the pinned false-resolve ceiling fed SOLELY by the P3 faithfulness leg, but a structurally-blind P3 leg in the weekly run now silently disarms that gate. In amain, `not p3_result.passed` is only logged at ERROR (lines 2761-2765) and never sets `failed = True` - unlike the P1a/P1b/non-disclosure blindness guards, which hard-fail (P1a at 2806-2811). `E7P3Result.passed` already models the exact positive-control signal (returns False on an empty P3 population, mirroring P1a per its docstring at 880-889), but amain ignores it for the exit code. So if gold drift drops or mislabels all P3 rows in the `--include-p3` weekly run, `false_resolve.rate` becomes None -> `assert_false_resolve_ceiling` is inert (None is not a breach) -> the weekly run exits 0 (green) with the pinned safety invariant silently unmeasured. The changed docstring on `assert_false_resolve_ceiling` (~line 1810) asserts 'the per-leg blindness guards ... own that case,' but for P3 the guard is only a soft log, so nothing actually owns it. Before this change the safety number was anchored by P1a's always-present (measured-0%) contribution; now P3 is the only meaningful weekly contributor, so its non-hard-failing blindness is consequential. Recommend translating a requested-but-blind P3 leg (p3_result is not None and not p3_result.passed) into a hard failure, consistent with P1a's positive-control philosophy.
🔧 Fix: hard-fail weekly E7 run on structurally-blind P3 leg
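The failure mode in this round comes from treating an unmeasured rate as "not a breach" without anything else failing the run. A hedged sketch of the recommended behaviour follows; `ceiling_breached` and `weekly_failed` are hypothetical stand-ins, not the actual `assert_false_resolve_ceiling` or `amain` code.

```python
from typing import Optional


def ceiling_breached(rate: Optional[float], ceiling: float) -> bool:
    """None means 'not measured', which by itself is not a breach."""
    return rate is not None and rate > ceiling


def weekly_failed(p3_requested: bool, n_p3: int, false_resolves: int, ceiling: float) -> bool:
    """Fail on a breach, and also when a requested P3 leg measured nothing."""
    rate = None if n_p3 == 0 else false_resolves / n_p3
    blind = p3_requested and n_p3 == 0  # positive control: the leg must have rows
    return blind or ceiling_breached(rate, ceiling)
```

Keeping the blindness check separate from the ceiling check lets the ceiling function stay honest about `None`, while the run as a whole still fails closed.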
1 warning still open:
- `evals/retrieval/e7_runner.py:2389` - The Round 2 fix hard-fails a structurally-blind (empty) P3 leg, but `E7P3Result.passed` is purely `n_questions > 0` (line 890), so a NON-empty-but-all-`mislabeled` P3 leg still counts as a pass. Now that the false-resolve ceiling is fed SOLELY by P3, this leaves a second silent-disarm path the new guard does not cover: if config/gold drift makes every P3 row escalate BEFORE the faithfulness gate (e.g. `tau_sim` raised so P3 retrieval reads weak, or the answerer degrades so all rows escalate at the draft leg — neither trips the P1a gate, whose weak no-context rows still escalate correctly), then `false_resolves == []`, so `compute_e7_metrics` reports a MEASURED 0% (denominator = n_questions, numerator = 0), NOT `None`. The ceiling verdict is therefore `breached=False`, the P3 guard at line 2389 does not fire (passed=True), and the weekly run exits 0 (green) having NEVER exercised the faithfulness gate the ceiling exists to protect — only a non-paging `log.warning` for mislabeled rows (line 2916) surfaces it. This is the same silent-disarm class the Round 2 fix closed for empty P3, reached via a different (gold/config drift) path. The guard comment (line 2385) and docs/evals.md ("if gold drift drops or mislabels every P3 row, the P3 positive control fails") overstate coverage: the control catches only the dropped/empty case (rate→None), not present-but-never-reached-the-gate (rate→vacuous 0%). Consider extending the positive control to require ≥1 P3 row to actually reach the faithfulness gate (i.e. `escalated_at_faithfulness` + `false_resolves` non-empty, equivalently NOT all P3 rows `mislabeled`) and fail closed otherwise, mirroring the empty-P3 guard.
🔧 Fix: hard-fail weekly E7 run on all-mislabeled P3 leg
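The stricter positive control suggested in this round can be sketched as below. This is an illustration rather than the real `E7P3Result`: the class name `P3Leg` is hypothetical, the fields are plain counts instead of row lists, and it assumes every P3 row falls into exactly one of the three buckets named in the finding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class P3Leg:
    escalated_at_faithfulness: int  # rows the judge rejected
    false_resolves: int             # rows the judge wrongly let through
    mislabeled: int                 # rows that escalated before reaching the judge

    @property
    def exercised(self) -> int:
        # Rows that actually reached the faithfulness gate.
        return self.escalated_at_faithfulness + self.false_resolves

    @property
    def n_questions(self) -> int:
        return self.exercised + self.mislabeled

    @property
    def passed(self) -> bool:
        # Stricter than n_questions > 0: at least one row must reach the judge.
        return self.exercised > 0
```

Under this definition an all-mislabeled leg such as `P3Leg(0, 0, 3)` has three questions but `passed` is False, so the run fails closed instead of reporting a vacuous 0%.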
1 warning still open:
- `evals/retrieval/e7_runner.py:1702` - The false-resolve ceiling rate's denominator is the full P3 population (`p3.n_questions` here; `len(self.decisions)` in `false_resolve_rate` at line 882), which includes `mislabeled` rows that never reached the faithfulness judge. The Round 3 fix added an `exercised`-based positive control that hard-fails only when ZERO rows exercise the gate (empty or all-mislabeled). But a PARTIALLY-mislabeled P3 leg (≥1 exercised, the rest mislabeled) passes the positive control while the mislabeled rows dilute the rate downward — reintroducing the very dilution the prior rounds were chartered to eliminate, now via gold/config drift inside P3 rather than P1a/P1b. Latent at the current 3-row P3 gold (1 false-resolve = 33%, always breaches 5%), but active as P3 gold grows: e.g. 1 false-resolve over 20 P3 rows with 18 mislabeled reads 1/20 = 5% and clears a 5% ceiling, whereas over the 2 EXERCISED rows it is 1/2 = 50% — a real faithfulness-leg breach masked by gold-defect rows. This contradicts the metric's own documented semantics: `compute_e7_metrics`'s docstring (lines 1645-1646) defines the denominator as 'the cases where a false-resolve can actually occur once a draft clears the retrieval gate' (=exercised), and `mislabeled` is documented (line 860) as 'a gold-authoring defect, not a pipeline result' — so counting it as a clean handled row in the SAFETY denominator inflates apparent gate quality. Consider making the denominator `len(p3.exercised)` in BOTH the enforced gate (line 1702 / used at 2857) and the sweep (compute_e7_metrics at line 2209), with `false_resolve_rate`→None when `exercised` is empty (consistent with the positive control), and correcting the docstrings/docs that currently claim full drift-proofing.
✅ **Test** - passed
✅ No issues found.
- `python -m evals.retrieval.test_e7_runner` (57 groups PASS — incl. US-059 `e7_pinned_invariants_failed` exit-code + ceiling-gate groups)
- `python -m evals.retrieval.test_e7` (9 groups PASS — shipped escalation_gold.yaml, 10 rows)
- Real CLI E2E `e7_runner --include-p1b` clean per-PR run → process exit 0 (ran with no `anthropic` package installed)
- Real CLI E2E `e7_runner --include-p1b` with gold leaked to a no-access viewer → exit 1 (P1b gate-clear + non-disclosure leak, JSON written first)
- Real CLI E2E `e7_runner --include-p1b --include-p2 --include-p3 --sweep` healthy weekly → exit 0 (P3 moat holds, knee curve, false-resolve 0% ≤ 5%)
- Real CLI E2E weekly with P3 auto-resolves → exit 1 (false-resolve 100% > 5% ceiling breach)
- Ran the weekly workflow's `jq` issue-body builder against the breach + leak snapshots
- `yaml.safe_load` of retrieval-eval.yml and escalation-eval-weekly.yml; confirmed per-PR E7 step has OPENAI_API_KEY but no ANTHROPIC_API_KEY
🔧 **Document** - 1 issue found → auto-fixed ✅
- `.claude/agent/tasks/PRD_phase2_agentic_rag_kit.md:5` - The PRD planning doc added by this change (.claude/agent/tasks/PRD_phase2_agentic_rag_kit.md) asserts ADR-0003 and ADR-0004 are "on-disk ADRs in docs/adr/" (line 5) and points to "docs/adr/0003-escalation-signal-and-deflection-pipeline.md" (line 242), but neither file exists. I did NOT auto-fix this: (1) it is a planning/spec artifact (README labels .claude/ as "not needed to run the app"), a historical 2026-06-16 reconciliation snapshot, not deliverable documentation; (2) the published docs (README, docs/, CONTEXT.md) are internally consistent because file-less "ADR-NNNN" tags are an established convention here - ADR-0006 (model surface) is cited the same way with no on-disk file either; (3) resolving it is a judgment call between authoring ADR-0003/0004 versus rewording the "on-disk" claim, which needs design intent. Flagging so a maintainer can decide.
🔧 Fix: correct PRD on-disk ADR-0003/0004 claims to ruled-not-written
✅ Re-checked - no issues remain.
✅ **Lint** - passed
✅ No issues found.
✅ **Push** - passed
✅ No issues found.