feat(escalation): cosine retrieval gate, faithfulness gate, and deterministic deflection pipeline with E7 eval by hcho22 · Pull Request #27 · hcho22/Agentic_RAG

hcho22 · 2026-06-23T21:48:04Z

Intent

The developer asked the agent to implement user story US-059 from the Phase 2 implementation PRD. US-059 is a CI-placement story for the E7 escalation eval: it requires a fast, deterministic per-PR tripwire (retrieval-gate decisions plus P1b non-disclosure byte-equality checks) that can block merges, alongside a heavier weekly scheduled sweep that runs the LLM-judge legs. The developer also expected the work to honor deferral notes carried over from US-050/US-055, specifically enforcing the false-resolve ceiling as a pinned safety invariant in the runner's exit code. Implicit constraints from the developer's environment included keeping the per-PR tripwire runnable under the slim requirements-ci.txt (no Anthropic API key), matching existing conventions for runner code, tests, CI workflows, and PRD status lines, and updating documentation (docs/evals.md, README, PRD status table) accordingly.

What Changed

Added backend/escalation.py implementing a deterministic deflection pipeline: a cheap retrieval gate that calls retrieval "weak" using the raw pre-fusion vector cosine (tau_sim / n_min), a one-call faithfulness gate that fails closed (any judge error/refusal/parse failure escalates), and run_deflection_pipeline wiring them as fixed control flow (retrieve → gate → [if strong] draft → faithfulness → answer-or-escalate) with a generic deferral message carrying no reason metadata; EscalationConfig resolves and validates the gate knobs from env. backend/retrieval.py now carries cosine_similarity separately from the RRF-fused similarity so the gate thresholds on a real cosine, preserving it through hybrid fusion and reranking.
Added the E7 escalation eval (evals/retrieval/e7.py, e7_runner.py, escalation_gold.yaml) that enforces the false-resolve ceiling as a pinned safety invariant in the runner exit code, gated solely on the P3 faithfulness leg, with positive-control guards that hard-fail when the P3 leg is empty or all-mislabeled so a drift-blinded gate cannot pass silently.
Added CI placement: a fast, deterministic per-PR E7 tripwire (retrieval-gate decisions plus P1b non-disclosure byte-equality checks, runnable under slim CI with no Anthropic key) in retrieval-eval.yml, and a heavier weekly scheduled sweep running the LLM-judge legs in escalation-eval-weekly.yml; updated docs/evals.md, README, CONTEXT glossary, and the PRD status table.
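The fixed control flow described above (retrieve → gate → draft → faithfulness → answer-or-escalate) can be sketched as follows. This is a minimal illustration, not the shipped `run_deflection_pipeline`: the `Outcome` type, the injected callables, the placeholder deferral text, and the assumption that `n_cleared` counts cosines at or above `tau_sim` are all hypothetical. The defaults 0.4 and 2 are the values shown in the rendered reports.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Placeholder text; the shipped constant lives in backend/escalation.py.
GENERIC_DEFERRAL = "(generic deferral text)"


@dataclass(frozen=True)
class Outcome:
    decision: str          # "auto_resolve" or "escalate"
    customer_text: str     # the only thing the customer sees
    leg: Optional[str]     # internal bookkeeping: which gate escalated


def run_pipeline_sketch(
    question: str,
    retrieve: Callable[[str], Sequence[float]],    # raw cosines, best first
    draft: Callable[[str], str],
    judge_faithful: Callable[[str, str], bool],    # may raise on error/refusal
    tau_sim: float = 0.4,
    n_min: int = 2,
) -> Outcome:
    cosines = retrieve(question)
    # Assumption: a chunk "clears" when its raw cosine is at least tau_sim.
    n_cleared = sum(1 for c in cosines if c >= tau_sim)
    if not cosines or cosines[0] < tau_sim or n_cleared < n_min:
        # Weak retrieval: escalate without any draft or judge call.
        return Outcome("escalate", GENERIC_DEFERRAL, "retrieval")

    answer = draft(question)
    try:
        faithful = judge_faithful(question, answer)
    except Exception:
        # Fail closed: any judge error, refusal, or parse failure escalates.
        faithful = False
    if not faithful:
        return Outcome("escalate", GENERIC_DEFERRAL, "faithfulness")
    return Outcome("auto_resolve", answer, None)
```

Both escalation legs return the same customer text, so the customer-facing output carries no reason metadata; only the internal `leg` field differs.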

Risk Assessment

⚠️ Medium: The latest commit's all-mislabeled hard-fail is correct, well-tested, and safe to merge at the current gold size. It still leaves a residual partial-mislabeling dilution path: this weakens the pinned safety invariant that the prior rounds were explicitly chartered to make non-dilutable, and it becomes exploitable as the P3 gold grows.

Testing

Baseline: ran both eval unit suites under a Python 3.11 venv with the slim per-PR deps (httpx/openai/pydantic/pyyaml/asyncpg/pyjwt) — test_e7_runner (57 groups) and test_e7 (9 groups) pass, covering the real gate, metric consolidation, and the US-059 exit-code/ceiling decisions. Since no OPENAI_API_KEY is available (full retrieval needs Supabase + embeddings), I drove the real e7_runner.main() CLI process for all four CI scenarios with only the network/DB/draft/judge boundary stubbed deterministically; everything downstream (gate arithmetic, ceiling gate, render, exit code, sys.exit) is the shipped code. The per-PR tripwire exits 0 on a clean run (and with no anthropic package present) and exits 1 on a leak, blocking the merge; the weekly sweep exits 0 when healthy and exits 1 on a false-resolve ceiling breach, with the JSON snapshot written before exit and correctly driving the weekly workflow's jq issue body. Both workflow YAMLs parse and the per-PR step carries no Anthropic key. No UI surface is involved (CI/eval-runner change), so evidence is CLI transcripts, rendered markdown snapshots, and JSON artifacts rather than screenshots. Worktree left clean.

Evidence: Evidence index (scenario matrix + repro)

# US-059 E7 CI placement — test evidence

Branch `feat/escalation-deflection-pipeline` (407df7b). The intent: a fast **deterministic
per-PR tripwire** (retrieval-gate decisions + P1b non-disclosure byte equality) that can
block merges, alongside a heavier **weekly sweep** running the LLM-judged legs, with the
**false-resolve ceiling pinned as the runner's exit code**.

## 1. Unit suites (real gate, real metrics, real exit-code decision; fakes for I/O)

- `python -m evals.retrieval.test_e7_runner` → **57 groups PASS** (includes the 6 US-059
  `e7_pinned_invariants_failed` exit-code groups + 5 ceiling-gate groups).
- `python -m evals.retrieval.test_e7` → **9 groups PASS** (shipped `escalation_gold.yaml`:
  10 rows, all populations present).

## 2. End-to-end CLI process (the real `e7_runner.main()`, only the network/DB boundary stubbed)

Each row is one real process invocation of the exact command the CI workflow runs; the
process exit code is the real `sys.exit` captured by the shell.

| Scenario | CLI (as committed in CI) | Exit | Demonstrates |
|---|---|---|---|
| per-PR clean | `e7_runner --include-p1b` | **0** | deterministic tripwire passes an innocent PR; ran with **no `anthropic` package installed** |
| per-PR leak | `e7_runner --include-p1b` (gold leaked to no-access viewer) | **1** | P1b gate-clear **and** non-disclosure leak → **blocks the merge** |
| weekly healthy | `e7_runner --include-p1b --include-p2 --include-p3 --sweep` | **0** | full sweep, P3 moat holds, false-resolve 0% ≤ 5% ceiling, knee curve emitted |
| weekly regression | same | **1** | P3 unanswerables auto-resolved → false-resolve 100% > 5% → **ceiling breach fails the scheduled run** |

Snapshots: `snap-*.json` (the `docs/escalation-weekly/<DATE>.json` artifact) + `*.stdout.md`
(the rendered report the weekly workflow tees to `<DATE>.md`). Pinned-fail ERROR lines in
`*.stderr.log`.

## 3. Runner JSON → weekly-workflow issue body (integration point)

The weekly workflow's `jq` issue-body builder was run against the produced snapshots:
- breach snapshot → `- **False-resolve ceiling breach** ... measured 100% (3/3) exceeds the buyer ceiling 5% (ESCALATION_FALSE_RESOLVE_CEILING).`
- leak snapshot → `- **P1b no-access leak**: e7-p2-01..04` + `- **P1b non-disclosure leak** (4 row(s)).`

## 4. Workflow placement

- `.github/workflows/retrieval-eval.yml` (pull_request): E7 tripwire step runs
  `e7_runner --include-p1b`; env carries `OPENAI_API_KEY` + Supabase, **no `ANTHROPIC_API_KEY`**.
- `.github/workflows/escalation-eval-weekly.yml` (schedule + workflow_dispatch): full sweep.
- Both YAMLs parse.

## How to reproduce

```
python3.11 -m venv /tmp/e7-venv
/tmp/e7-venv/bin/pip install httpx openai pydantic pyyaml asyncpg pyjwt   # per-PR (slim) env
cd <repo> && /tmp/e7-venv/bin/python -m evals.retrieval.test_e7_runner
E7_REPO_ROOT=<repo> /tmp/e7-venv/bin/python e7_cli_e2e.py clean-per-pr snap-clean-per-pr.json
# weekly env additionally: /tmp/e7-venv/bin/pip install anthropic
```

Evidence: E2E driver (drives real e7_runner.main(), only I/O boundary stubbed)

"""End-to-end driver for the US-059 E7 escalation runner.

This drives the REAL `evals.retrieval.e7_runner` CLI process — argparse, the env
guards, the lazy `runner` import, leg orchestration, JSON snapshot write, the full
multi-section markdown render, the pinned exit-code decision, and `sys.exit` — for
the two CI invocations US-059 ships:

  * the PER-PR tripwire   : `e7_runner --include-p1b`            (deterministic only)
  * the WEEKLY full sweep : `e7_runner --include-p1b --include-p2 --include-p3 --sweep`

The ONLY thing stubbed is the genuine I/O boundary: hybrid retrieval
(`runner.run_query`, which would embed via OpenAI + hit Supabase), the P1b DB ACL
reset (`asyncpg` + the E4 reset helpers), the OpenAI drafter, and the offline Claude
judge. Everything downstream of that boundary — the real `escalation.retrieval_gate`,
the real metric consolidation, the real false-resolve ceiling gate, the real render
functions, and the real `e7_pinned_invariants_failed` exit-code decision — runs for
real, so the rendered snapshot + process exit code are exactly what CI would observe.

Usage:  python e7_cli_e2e.py <scenario> <out_json_path>
  scenarios: clean-per-pr | p1b-leak | weekly-clean | weekly-ceiling-breach
"""
from __future__ import annotations

import os
import sys
import uuid
from pathlib import Path

ROOT = Path(os.environ["E7_REPO_ROOT"]).resolve()
sys.path.insert(0, str(ROOT))
sys.path.insert(0, str(ROOT / "backend"))

SCENARIO = sys.argv[1]
OUT = sys.argv[2]

# --- fake env so amain's guards pass (no real network is ever reached) --------
os.environ["SUPABASE_URL"] = "http://127.0.0.1:54321"
os.environ["OPENAI_API_KEY"] = "sk-fake-e2e"
os.environ["SUPABASE_SERVICE_ROLE_KEY"] = "service-role-fake"
os.environ["SUPABASE_ANON_KEY"] = "anon-fake"
os.environ["SUPABASE_JWT_SECRET"] = "super-secret-jwt-token-with-at-least-32-characters-long"
os.environ["CORPUS_SEED_DATABASE_URL"] = "postgresql://fake/db"
os.environ["DATABASE_URL"] = "postgresql://fake/db"
if SCENARIO.startswith("weekly"):
    os.environ["ANTHROPIC_API_KEY"] = "sk-ant-fake-e2e"

from retrieval import SearchDocumentsResult  # noqa: E402
from evals.retrieval import e7, e7_runner  # noqa: E402
from evals.retrieval import runner as r  # noqa: E402

# Real shipped gold → {question_text: label}. The driver classifies retrieval
# strength off the genuine labels in escalation_gold.yaml (10 rows).
LABEL = {row["question"]: row["escalation"] for row in e7.load_escalation_questions()}

OWNER_IDENTITY = f"id:{r.CORPUS_USER_ID}"


def _rows(*cosines: float) -> list[SearchDocumentsResult]:
    return [
        SearchDocumentsResult(
            id=f"c{i}",
            document_id=f"doc-c{i}",
            chunk_index=0,
            content=f"content c{i}",
            similarity=cos,
            filename=f"c{i}.txt",
            cosine_similarity=cos,
        )
        for i, cos in enumerate(cosines)
    ]


WEAK = _rows(0.10, 0.09)            # top1 0.10 < tau 0.4 -> weak gate -> escalate
STRONG = _rows(0.70, 0.62, 0.55)    # top1 >= tau AND >= n_min cleared -> strong gate


# --- boundary stubs -----------------------------------------------------------
async def fake_run_query(mode, openai_client, http, supabase_url, headers, question, top_k=None):
    identity = headers.get("_eval_identity")
    label = LABEL.get(question)
    if identity == OWNER_IDENTITY:
        # Full-access corpus owner: P1a (no_context) is genuinely weak; the
        # answerable/should-escalate populations retrieve strong gold.
        return WEAK if label == "no_context" else STRONG
    # No-access viewer (P1b replay): the gold is ACL-revoked, so retrieval is weak
    # and the row MUST escalate exactly like P1a — UNLESS we are simulating a leak.
    if SCENARIO == "p1b-leak":
        return STRONG  # the gold leaked to a no-access viewer -> clears the gate
    return WEAK


def fake_user_headers(jwt_token, key):
    return {"_eval_identity": jwt_token}


def fake_mint_user_jwt(user_id, email, secret):
    return f"id:{user_id}"


async def fake_ensure_viewer_users(database_url):
    return None


async def fake_fetch_stable_id_map(database_url):
    # {chunk_id(uuid str): stable_id}. e7_runner builds uuid.UUID(chunk_id), so the
    # keys must be valid UUIDs.
    return {str(uuid.uuid4()): f"sid-{i}" for i in range(3)}


def fake_compute_visible_stable_ids(viewer, q, all_sids, extra):
    return []  # the no-access viewer sees none of this question's gold


async def fake_reset_viewer_acls(conn, visible_chunk_ids):
    return None


class _FakeConn:
    async def close(self):
        return None


async def fake_asyncpg_connect(url):
    return _FakeConn()


async def fake_draft(client, message, chunks):
    # A non-empty drafted answer so P2/P3 reach the faithfulness judge.
    return f"Drafted answer for: {message[:40]}"


async def fake_judge(anthropic_client, question, reference, context, answer):
    label = LABEL.get(question)
    if label == "answerable_faithful":
        return {"faithfulness": 5, "helpfulness": 5}   # P2 faithful -> auto-resolve (deflection)
    if label == "should_escalate":
        if SCENARIO == "weekly-ceiling-breach":
            # Regression: the pipeline judges the unfaithful P3 drafts FAITHFUL, so
            # they auto-resolve -> false-resolve (the Risk #3 safety failure).
            return {"faithfulness": 5, "helpfulness": 5}
        # Healthy moat: P3 drafts are unfaithful -> escalate at the faithfulness gate.
        return {"faithfulness": 1, "helpfulness": 2}
    return {"faithfulness": 1, "helpfulness": 1}


# install stubs at the real boundary
r.run_query = fake_run_query
r.user_headers = fake_user_headers
r.mint_user_jwt = fake_mint_user_jwt
r.ensure_viewer_users = fake_ensure_viewer_users
r.fetch_stable_id_map = fake_fetch_stable_id_map
r.compute_visible_stable_ids = fake_compute_visible_stable_ids
r.reset_viewer_acls = fake_reset_viewer_acls
r.judge_answer = fake_judge
e7_runner.draft_support_answer = fake_draft

import asyncpg  # noqa: E402

asyncpg.connect = fake_asyncpg_connect

# --- build the real CLI argv for this scenario --------------------------------
argv = ["e7_runner", "--out", OUT]
if SCENARIO in ("clean-per-pr", "p1b-leak"):
    argv += ["--include-p1b"]
elif SCENARIO in ("weekly-clean", "weekly-ceiling-breach"):
    argv += ["--include-p1b", "--include-p2", "--include-p3", "--sweep"]
else:
    raise SystemExit(f"unknown scenario {SCENARIO!r}")

sys.argv = argv
# main() does `raise SystemExit(asyncio.run(amain()))`, so the process exits with
# the REAL pinned exit code — captured by the shell as $?.
e7_runner.main()

Evidence: Per-PR clean tripwire rendered report (exit 0)


### E7 P1a (US-052) — genuinely-no-context escalation (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Every P1a question is genuinely unanswerable, so it must escalate at the deterministic retrieval gate with **no draft / judge call**.

| Question | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p1a-01` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-02` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-03` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |

**Verdict:** PASS — all 3 P1a rows escalated at the retrieval gate (0 draft / 0 judge calls).
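For readers unfamiliar with the gate, the decision in the table above can be approximated with the sketch below. It is not the shipped `escalation.retrieval_gate`: the function name, the `GateDecision` tuple, the definition of `n_cleared` as the count of cosines at or above `tau_sim`, and the wording of the `n_min` reason string are assumptions. Only the `top1_cosine` reason format and the `strong` label are copied from the rendered rows.

```python
from typing import NamedTuple, Sequence


class GateDecision(NamedTuple):
    strong: bool
    top1_cosine: float
    n_cleared: int
    reason: str


def retrieval_gate_sketch(
    cosines: Sequence[float], tau_sim: float = 0.4, n_min: int = 2
) -> GateDecision:
    ranked = sorted(cosines, reverse=True)
    top1 = ranked[0] if ranked else 0.0
    # Assumption: a chunk "clears" when its raw cosine is at least tau_sim.
    n_cleared = sum(1 for c in ranked if c >= tau_sim)
    if top1 < tau_sim:
        reason = f"weak: top1_cosine {top1:.4f} < tau_sim {tau_sim:.4f}"
        return GateDecision(False, top1, n_cleared, reason)
    if n_cleared < n_min:
        # Hypothetical reason wording; this branch does not appear in the report.
        reason = f"weak: n_cleared {n_cleared} < n_min {n_min}"
        return GateDecision(False, top1, n_cleared, reason)
    return GateDecision(True, top1, n_cleared, "strong")


# The two fixtures used by the E2E driver.
print(retrieval_gate_sketch([0.10, 0.09]))
print(retrieval_gate_sketch([0.70, 0.62, 0.55]))
```

Because the decision is plain arithmetic on cosine scores, it needs no draft or judge call, which is what keeps this leg cheap enough for a per-PR check.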

### E7 P1b (US-057) — no-access replay of P2 (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Each P1b row is a P2 (`answerable_faithful`) question replayed under a **no-access viewer** (its gold ACL-revoked), so the gold is invisible and the row must escalate at the retrieval gate — the same output as P1a, with no privileged second pass (US-057/058).

| Question (P2 source) | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p2-01` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-02` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-03` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-04` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |

**Verdict:** PASS — all 4 P1b rows (P2 questions under a no-access viewer) escalated at the retrieval gate (0 draft / 0 judge).

### E7 P1b non-disclosure (US-058) — customer-output byte equality

Pinned security invariant: the bytes a **no-access** customer sees on a P1b escalation must be byte-for-byte identical to the P1a generic deferral (`escalation.GENERIC_DEFERRAL`), so escalating never discloses that restricted content exists. Deterministic (no LLM) → per-PR hard block, pinned `fail` / un-downgradable (US-059).

Reference (P1a deferral, 157 bytes): “Thanks for reaching out. I don't have enough information to answer this confidently, so I've passed it along to our team — a human will follow up with you.”

**Verdict:** PASS — all 4 P1b row(s) show the customer byte-for-byte the SAME generic deferral as P1a (no reason / restricted-to / existence bit disclosed).
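The byte-equality invariant is simple enough to sketch directly. The snippet below is illustrative only: the `discloses` helper is a hypothetical name, and the constant is a local copy of the reference text quoted above rather than an import of `escalation.GENERIC_DEFERRAL`.

```python
# Local copy of the reference deferral quoted in the report (\u2014 is the dash).
GENERIC_DEFERRAL = (
    "Thanks for reaching out. I don't have enough information to answer this "
    "confidently, so I've passed it along to our team \u2014 a human will follow up with you."
)


def discloses(customer_output: str, reference: str = GENERIC_DEFERRAL) -> bool:
    """True when the customer-facing bytes differ from the generic deferral."""
    return customer_output.encode("utf-8") != reference.encode("utf-8")


# An escalation that returns the deferral unchanged discloses nothing.
print(discloses(GENERIC_DEFERRAL))
# A drafted answer, or even a trailing space, counts as a disclosure.
print(discloses("Drafted answer for: how do I reset my password"))
print(discloses(GENERIC_DEFERRAL + " "))
```

Comparing encoded bytes rather than normalized text is the stricter choice: any added reason, whitespace, or punctuation difference is treated as a leak.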

### E7 consolidated metrics (US-055) — deflection / false-resolve / false-escalate

Operating objective: **maximize deflection subject to false-resolve ≤ ceiling**. Answerable = P2; the gated false-resolve rate is the faithfulness leg (P3 rows that auto-resolved, over P3). The retrieval-leg P1a/P1b false-resolves are shown `[monitor-only]` — a zero-tolerance invariant the P1a/P1b gates hard-fail unconditionally, NOT diluting the ceiling-gated rate. Each rate carries its numerator/denominator + per-population breakdown so the false-resolve number is verifiable against the buyer's ceiling (US-059), not folded into an opaque accuracy score. Reported here; the ceiling is enforced in US-059.

| Metric | Rate | n/d | Class | By population |
|---|---|---|---|---|
| deflection | — (blind) | 0/0 | tunable quality | — |
| false_resolve | — (blind) | 0/0 | 🔒 safety (pinned) | P1a 0/3 [monitor-only], P1b 0/4 [monitor-only] |
| false_escalate | — (blind) | 0/0 | tunable quality | — |

### E7 false-resolve ceiling gate (US-059) — pinned safety invariant

- Buyer ceiling: **5%** (`ESCALATION_FALSE_RESOLVE_CEILING`, US-050)
- Measured false-resolve (faithfulness leg, P3): **— (blind: no P3 faithfulness-leg rows scored)**
- Verdict: **not measured**

A measured faithfulness-leg false-resolve rate above the ceiling fails the run (pinned `fail`, never downgraded to a comment — unlike the tunable deflection/false-escalate metrics). The LLM-judged P3 leg is scored only in the weekly sweep, so per-PR (P1a/P1b only) this rate is not measured and the gate is inert (the retrieval-leg P1a/P1b false-resolves are pinned separately); the faithfulness-leg false-resolve carries an accepted up-to-a-week detection latency.
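The three verdict states can be illustrated with a small sketch. This is not the shipped `e7_pinned_invariants_failed`: the function name, signature, and the `p3_leg_requested` flag are hypothetical, and the hard-fail on an empty requested P3 leg follows the positive-control guard described in the PR summary.

```python
from typing import Tuple


def ceiling_verdict_sketch(
    p3_false_resolves: int,
    p3_scored: int,
    ceiling: float = 0.05,
    p3_leg_requested: bool = False,
) -> Tuple[str, bool]:
    """Return (verdict, fails_run) for the faithfulness-leg false-resolve rate."""
    if p3_scored == 0:
        if p3_leg_requested:
            # Positive control: the P3 leg was asked for but scored nothing,
            # so a blind gate must not pass silently.
            return ("blind", True)
        # Per-PR run: the P3 leg is not scored, so the gate is inert.
        return ("not measured", False)
    rate = p3_false_resolves / p3_scored
    if rate > ceiling:
        return ("ceiling breach", True)
    return ("within ceiling", False)


print(ceiling_verdict_sketch(0, 0))                         # per-PR tripwire
print(ceiling_verdict_sketch(0, 3, p3_leg_requested=True))  # weekly healthy
print(ceiling_verdict_sketch(3, 3, p3_leg_requested=True))  # weekly regression
```

The comparison is strict (`rate > ceiling`), matching the report's "false-resolve ≤ ceiling" objective: a rate exactly at the ceiling still passes.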

→ /var/folders/9t/k_yy9fqs5vd27rf12jx_rzqh0000gn/T/no-mistakes-evidence/01KVTHACF2PBB4NP9556FTSCDV/snap-clean-per-pr.json

Evidence: Per-PR leak rendered report (exit 1, blocks merge)


### E7 P1a (US-052) — genuinely-no-context escalation (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Every P1a question is genuinely unanswerable, so it must escalate at the deterministic retrieval gate with **no draft / judge call**.

| Question | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p1a-01` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-02` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-03` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |

**Verdict:** PASS — all 3 P1a rows escalated at the retrieval gate (0 draft / 0 judge calls).

### E7 P1b (US-057) — no-access replay of P2 (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Each P1b row is a P2 (`answerable_faithful`) question replayed under a **no-access viewer** (its gold ACL-revoked), so the gold is invisible and the row must escalate at the retrieval gate — the same output as P1a, with no privileged second pass (US-057/058).

| Question (P2 source) | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p2-01` | draft ⚠️ leak | 0.7000 | 3 | strong |
| `e7-p2-02` | draft ⚠️ leak | 0.7000 | 3 | strong |
| `e7-p2-03` | draft ⚠️ leak | 0.7000 | 3 | strong |
| `e7-p2-04` | draft ⚠️ leak | 0.7000 | 3 | strong |

**Verdict:** FAIL — 4 P1b row(s) CLEARED the retrieval gate: the gold leaked to a no-access viewer (isolation/disclosure failure): `e7-p2-01`, `e7-p2-02`, `e7-p2-03`, `e7-p2-04`.

### E7 P1b non-disclosure (US-058) — customer-output byte equality

Pinned security invariant: the bytes a **no-access** customer sees on a P1b escalation must be byte-for-byte identical to the P1a generic deferral (`escalation.GENERIC_DEFERRAL`), so escalating never discloses that restricted content exists. Deterministic (no LLM) → per-PR hard block, pinned `fail` / un-downgradable (US-059).

Reference (P1a deferral, 157 bytes): “Thanks for reaching out. I don't have enough information to answer this confidently, so I've passed it along to our team — a human will follow up with you.”

**Verdict:** FAIL — 4 P1b row(s) DISCLOSED: their customer-facing output differs from the P1a generic deferral: `e7-p2-01` (drafted_answer_disclosed), `e7-p2-02` (drafted_answer_disclosed), `e7-p2-03` (drafted_answer_disclosed), `e7-p2-04` (drafted_answer_disclosed).

### E7 consolidated metrics (US-055) — deflection / false-resolve / false-escalate

Operating objective: **maximize deflection subject to false-resolve ≤ ceiling**. Answerable = P2; the gated false-resolve rate is the faithfulness leg (P3 rows that auto-resolved, over P3). The retrieval-leg P1a/P1b false-resolves are shown `[monitor-only]` — a zero-tolerance invariant the P1a/P1b gates hard-fail unconditionally, NOT diluting the ceiling-gated rate. Each rate carries its numerator/denominator + per-population breakdown so the false-resolve number is verifiable against the buyer's ceiling (US-059), not folded into an opaque accuracy score. Reported here; the ceiling is enforced in US-059.

| Metric | Rate | n/d | Class | By population |
|---|---|---|---|---|
| deflection | — (blind) | 0/0 | tunable quality | — |
| false_resolve | — (blind) | 0/0 | 🔒 safety (pinned) | P1a 0/3 [monitor-only], P1b 4/4 [monitor-only] |
| false_escalate | — (blind) | 0/0 | tunable quality | — |

### E7 false-resolve ceiling gate (US-059) — pinned safety invariant

- Buyer ceiling: **5%** (`ESCALATION_FALSE_RESOLVE_CEILING`, US-050)
- Measured false-resolve (faithfulness leg, P3): **— (blind: no P3 faithfulness-leg rows scored)**
- Verdict: **not measured**

A measured faithfulness-leg false-resolve rate above the ceiling fails the run (pinned `fail`, never downgraded to a comment — unlike the tunable deflection/false-escalate metrics). The LLM-judged P3 leg is scored only in the weekly sweep, so per-PR (P1a/P1b only) this rate is not measured and the gate is inert (the retrieval-leg P1a/P1b false-resolves are pinned separately); the faithfulness-leg false-resolve carries an accepted up-to-a-week detection latency.

→ /var/folders/9t/k_yy9fqs5vd27rf12jx_rzqh0000gn/T/no-mistakes-evidence/01KVTHACF2PBB4NP9556FTSCDV/snap-p1b-leak.json

Evidence: Per-PR leak pinned-fail ERROR lines (CI surface)

ERROR ...: E7 P1b LEAK — 4 no-access-replay row(s) cleared the retrieval gate; the gold leaked to a no-access viewer
ERROR ...: E7 P1b NON-DISCLOSURE LEAK — 4 P1b row(s) show the customer a DIFFERENT output than the P1a generic deferral

ERROR agentic_rag.evals.retrieval.e7: E7 P1b LEAK — 4 no-access-replay row(s) cleared the retrieval gate; the gold leaked to a no-access viewer (isolation/disclosure failure):
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-01 top1_cosine=0.7000 n_cleared=3 (strong)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-02 top1_cosine=0.7000 n_cleared=3 (strong)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-03 top1_cosine=0.7000 n_cleared=3 (strong)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-04 top1_cosine=0.7000 n_cleared=3 (strong)
ERROR agentic_rag.evals.retrieval.e7: E7 P1b NON-DISCLOSURE LEAK — 4 P1b row(s) show the customer a DIFFERENT output than the P1a generic deferral (restricted-content existence disclosed):
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-01 (drafted_answer_disclosed)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-02 (drafted_answer_disclosed)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-03 (drafted_answer_disclosed)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-04 (drafted_answer_disclosed)

Evidence: Weekly full-sweep rendered report (exit 0, knee curve)


### E7 P1a (US-052) — genuinely-no-context escalation (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Every P1a question is genuinely unanswerable, so it must escalate at the deterministic retrieval gate with **no draft / judge call**.

| Question | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p1a-01` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-02` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-03` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |

**Verdict:** PASS — all 3 P1a rows escalated at the retrieval gate (0 draft / 0 judge calls).

### E7 P1b (US-057) — no-access replay of P2 (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Each P1b row is a P2 (`answerable_faithful`) question replayed under a **no-access viewer** (its gold ACL-revoked), so the gold is invisible and the row must escalate at the retrieval gate — the same output as P1a, with no privileged second pass (US-057/058).

| Question (P2 source) | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p2-01` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-02` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-03` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-04` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |

**Verdict:** PASS — all 4 P1b rows (P2 questions under a no-access viewer) escalated at the retrieval gate (0 draft / 0 judge).

### E7 P1b non-disclosure (US-058) — customer-output byte equality

Pinned security invariant: the bytes a **no-access** customer sees on a P1b escalation must be byte-for-byte identical to the P1a generic deferral (`escalation.GENERIC_DEFERRAL`), so escalating never discloses that restricted content exists. Deterministic (no LLM) → per-PR hard block, pinned `fail` / un-downgradable (US-059).

Reference (P1a deferral, 157 bytes): “Thanks for reaching out. I don't have enough information to answer this confidently, so I've passed it along to our team — a human will follow up with you.”

**Verdict:** PASS — all 4 P1b row(s) show the customer byte-for-byte the SAME generic deferral as P1a (no reason / restricted-to / existence bit disclosed).

### E7 P2 (US-053) — answerable + faithful auto-resolve (offline judge)

τ_sim=0.4 · N_min=2 · match_threshold=0.3 · faithfulness≥4/5 (offline judge `claude-haiku-4-5`). Every P2 question has a faithful grounded answer, so the pipeline should **auto-resolve** it; an escalation is a false-escalate. Quality metric — not a per-PR hard block (US-059).

| Question | Decision | top1_cosine | faithfulness | leg |
|---|---|---|---|---|
| `e7-p2-01` | auto_resolve | 0.7000 | 5/5 | — |
| `e7-p2-02` | auto_resolve | 0.7000 | 5/5 | — |
| `e7-p2-03` | auto_resolve | 0.7000 | 5/5 | — |
| `e7-p2-04` | auto_resolve | 0.7000 | 5/5 | — |

**Verdict:** deflection 100% (4/4 auto-resolved); 0 false-escalate(s).

### E7 P3 (US-054) — should-escalate / the moat (offline judge)

τ_sim=0.4 · N_min=2 · match_threshold=0.3 · faithfulness≥4/5 (offline judge `claude-haiku-4-5`). Every P3 question has strong retrieval but **no faithful grounded answer**, so the pipeline must clear the retrieval gate, draft, and then **escalate at the faithfulness gate**. An auto-resolve is a **false-resolve** (the Risk #3 safety failure the ceiling governs, US-055/059) — not a per-PR block here (LLM-judged).

| Question | Decision | top1_cosine | faithfulness | leg | flag |
|---|---|---|---|---|---|
| `e7-p3-01` | escalate | 0.7000 | 1/5 | faithfulness |  |
| `e7-p3-02` | escalate | 0.7000 | 1/5 | faithfulness |  |
| `e7-p3-03` | escalate | 0.7000 | 1/5 | faithfulness |  |

**Verdict:** false-resolve 0% (0/3); 3 correctly escalated at the faithfulness gate.

### E7 consolidated metrics (US-055) — deflection / false-resolve / false-escalate

Operating objective: **maximize deflection subject to false-resolve ≤ ceiling**. Answerable = P2; the gated false-resolve rate is the faithfulness leg (P3 rows that auto-resolved, over P3). The retrieval-leg P1a/P1b false-resolves are shown `[monitor-only]` — a zero-tolerance invariant the P1a/P1b gates hard-fail unconditionally, NOT diluting the ceiling-gated rate. Each rate carries its numerator/denominator + per-population breakdown so the false-resolve number is verifiable against the buyer's ceiling (US-059), not folded into an opaque accuracy score. Reported here; the ceiling is enforced in US-059.

| Metric | Rate | n/d | Class | By population |
|---|---|---|---|---|
| deflection | 100% | 4/4 | tunable quality | P2 4/4 |
| false_resolve | 0% | 0/3 | 🔒 safety (pinned) | P1a 0/3 [monitor-only], P1b 0/4 [monitor-only], P3 0/3 |
| false_escalate | 0% | 0/4 | tunable quality | P2 0/4 |

### E7 false-resolve ceiling gate (US-059) — pinned safety invariant

- Buyer ceiling: **5%** (`ESCALATION_FALSE_RESOLVE_CEILING`, US-050)
- Measured false-resolve (faithfulness leg, P3): **0% (0/3)**
- Verdict: **✅ within ceiling**

A measured faithfulness-leg false-resolve rate above the ceiling fails the run (pinned `fail`, never downgraded to a comment — unlike the tunable deflection/false-escalate metrics). The LLM-judged P3 leg is scored only in the weekly sweep, so per-PR (P1a/P1b only) this rate is not measured and the gate is inert (the retrieval-leg P1a/P1b false-resolves are pinned separately); the faithfulness-leg false-resolve carries an accepted up-to-a-week detection latency.

### E7 knob sweep (US-056) — deflection-vs-false-resolve curve + knee

Grid of 18 operating point(s) over τ_sim × N_min × faithfulness-floor (offline 1-5 judge `claude-haiku-4-5`, match_threshold=0.3). Objective: **maximize deflection subject to false-resolve ≤ 5%** (the buyer's ceiling) — not maximize accuracy. LLM-judged → scheduled/weekly, never a per-PR block (US-059).

| τ_sim | N_min | faith≥ | deflection | false-resolve | false-escalate | ≤ ceiling |
|---|---|---|---|---|---|---|
| 0.3 | 1 | 4/5 | 100% | 0% | 0% | ✅ ⭐ knee |
| 0.3 | 1 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.3 | 2 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.3 | 2 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.3 | 3 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.3 | 3 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 1 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 1 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 2 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 2 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 3 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 3 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 1 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 1 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 2 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 2 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 3 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 3 | 5/5 | 100% | 0% | 0% | ✅ |

**Knee:** τ_sim=0.3, N_min=1, faithfulness≥4/5 → deflection 100%, false-resolve 0% (≤ ceiling 5%).
**Recommended US-050 defaults:** `ESCALATION_TAU_SIM=0.3`, `ESCALATION_N_MIN=1`. The offline faithfulness floor (4/5) is guidance only — the runtime `ESCALATION_FAITHFULNESS_CUTOFF` is a different [0,1] scale (US-048/050), tuned separately.
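A rough sketch of the knee selection, under stated assumptions: the `Point` type and `pick_knee` helper are hypothetical, and the tie-break (first point in grid order wins among equal deflection) is inferred from the table above, where the first row is marked as the knee. The grid values mirror the 3 × 3 × 2 = 18 operating points shown.

```python
from itertools import product
from typing import List, NamedTuple, Optional


class Point(NamedTuple):
    tau_sim: float
    n_min: int
    faith_floor: int      # offline judge floor on the 1-5 scale
    deflection: float
    false_resolve: float


def pick_knee(points: List[Point], ceiling: float = 0.05) -> Optional[Point]:
    """Maximize deflection subject to false_resolve <= ceiling."""
    feasible = [p for p in points if p.false_resolve <= ceiling]
    if not feasible:
        return None
    # max() returns the first of several equal maxima, so grid order breaks ties.
    return max(feasible, key=lambda p: p.deflection)


# Grid in the same order as the table; every point measured 100% / 0% in this run.
grid = [
    Point(tau, n, floor, deflection=1.0, false_resolve=0.0)
    for tau, n, floor in product((0.3, 0.4, 0.5), (1, 2, 3), (4, 5))
]
knee = pick_knee(grid)
print(len(grid), knee)
```

With a stubbed judge every operating point ties, so the knee here reflects only the tie-break; a real sweep over live retrieval would be needed for the recommendation to carry weight.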

→ /var/folders/9t/k_yy9fqs5vd27rf12jx_rzqh0000gn/T/no-mistakes-evidence/01KVTHACF2PBB4NP9556FTSCDV/snap-weekly-clean.json

Evidence: Weekly ceiling-breach rendered report (exit 1) (local file: /var/folders/9t/k_yy9fqs5vd27rf12jx_rzqh0000gn/T/no-mistakes-evidence/01KVTHACF2PBB4NP9556FTSCDV/weekly-ceiling-breach.stdout.md)

Evidence: Weekly ceiling-breach pinned-fail ERROR lines

ERROR ...: E7 FALSE-RESOLVE CEILING BREACH — measured faithfulness-leg 100.0% (3/3 P3) exceeds the buyer's ceiling 5.0% (ESCALATION_FALSE_RESOLVE_CEILING, US-050).

ERROR agentic_rag.evals.retrieval.e7: E7 P3 FALSE-RESOLVE(s) (the Risk #3 safety failure — an unanswerable question auto-resolved; the US-055/059 ceiling gate governs this rate): e7-p3-01, e7-p3-02, e7-p3-03
ERROR agentic_rag.evals.retrieval.e7: E7 FALSE-RESOLVE CEILING BREACH — measured faithfulness-leg 100.0% (3/3 P3) exceeds the buyer's ceiling 5.0% (ESCALATION_FALSE_RESOLVE_CEILING, US-050). The deflection pipeline auto-resolved too many unanswerable questions (the Risk #3 safety failure).

- Evidence: Weekly snapshot JSON (docs/escalation-weekly/<DATE>.json artifact) (local file: /var/folders/9t/k_yy9fqs5vd27rf12jx_rzqh0000gn/T/no-mistakes-evidence/01KVTHACF2PBB4NP9556FTSCDV/snap-weekly-ceiling-breach.json)

Pipeline

Updates from git push no-mistakes

✅ **intent** - passed

✅ No issues found.

✅ **Rebase** - passed

✅ No issues found.

⚠️ **Review** - 1 warning

⚠️ .github/workflows/retrieval-eval.yml:231 - The new E7 per-PR tripwire (e7_runner --include-p1b) is a HARD gate but has no execution-error resilience, unlike the E6 gate in the same job. Its deterministic decision is pure arithmetic, but its execution makes ~7 live OpenAI embedding calls (hybrid retrieval embeds each query) plus asyncpg chunk_acl writes for the P1b replay. e7_runner.amain has no retry/backoff and the step has no set +e/continue-on-error, so any transient OpenAI/DB blip exits non-zero and blocks the merge of an innocent PR. The step's own comment (lines 203-211) says "pure arithmetic on cosine scores - NO LLM judge - so a real verdict can't flake," which conflates decision-determinism with execution-reliability - exactly the distinction E6's comment (lines 17, 178-188) carefully draws and mitigates with retry+non-blocking-on-execution-error. Recommend mirroring E6's execution-vs-verdict split here, or at least documenting the accepted flake exposure.
⚠️ evals/retrieval/e7_runner.py:1641 - The pinned safety metric (consolidated false-resolve rate) is computed over a denominator of P1a + P1b + P3 questions, but the only non-deterministic contributor to the numerator in healthy operation is the P3 faithfulness leg (P1a/P1b cleared-gate counts are 0 unless retrieval/RLS is broken). P1a and P1b are deterministically-escalated easy true-negatives, so padding the denominator with them systematically lowers the measured rate and can mask a real P3 (faithfulness-leg) false-resolve below the buyer's ceiling - the masking grows as more P2 questions are authored (each becomes a P1b row in the denominator). With the current 3/4/3 gold one P3 false-resolve = 10% and still trips the 5% ceiling, so this is latent rather than active, but it weakens the safety invariant's sensitivity for any larger gold set. Consider also surfacing/gating the faithfulness-leg false-resolve over the P3 population specifically, so the hard signal isn't diluted by easy escalations. This is the documented/tested intended definition, so flagging for confirmation rather than auto-fix.
ℹ️ evals/retrieval/e7_runner.py:2137 - run_e7_sweep computes each grid point's metrics via compute_e7_metrics(p1a, p2, p3) (no P1b), while the top-level weekly metrics and the enforced ceiling verdict (lines 2623/2641) include P1b. The knee's reported false-resolve and feasibility are therefore computed on a smaller (stricter) denominator than the gate that actually enforces the ceiling. Not a safety bug - the sweep is strictly more conservative, so a chosen knee always satisfies the looser enforced gate - but the recommended-knee numbers won't match what the ceiling gate measures at that config, which can confuse interpretation of the weekly snapshot. Worth a note or aligning the two computations.
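
The denominator dilution raised in the second warning can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the 3/4/3 split is read as P1a / P1b / P3 row counts, the helper `rate` is hypothetical, and the 20-row P1b scenario is an invented illustration of a larger gold set, not a measured result.

```python
from fractions import Fraction


def rate(false_resolves: int, denominator: int) -> Fraction:
    """False-resolve rate as an exact fraction."""
    return Fraction(false_resolves, denominator)


CEILING = Fraction(5, 100)  # the 5% ceiling quoted in the review

# Current gold: one P3 false-resolve over P1a + P1b + P3 = 3 + 4 + 3 rows.
consolidated = rate(1, 3 + 4 + 3)   # 1/10 = 10%
p3_only = rate(1, 3)                # 1/3, about 33%

# Hypothetical larger gold: the same single P3 false-resolve, but 20 P1b rows.
diluted = rate(1, 3 + 20 + 3)       # 1/26, about 3.8%

print(consolidated > CEILING)  # True: still trips the ceiling today
print(diluted > CEILING)       # False: masked by easy escalations
print(p3_only > CEILING)       # True: the P3-only rate is unaffected
```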

🔧 Fix: gate E7 false-resolve ceiling on faithfulness-leg (P3) rate only
1 warning still open:

⚠️ evals/retrieval/e7_runner.py:2761 - This commit makes the pinned false-resolve ceiling fed SOLELY by the P3 faithfulness leg, but a structurally-blind P3 leg in the weekly run now silently disarms that gate. In amain, not p3_result.passed is only logged at ERROR (lines 2761-2765) and never sets failed = True - unlike the P1a/P1b/non-disclosure blindness guards, which hard-fail (P1a at 2806-2811). E7P3Result.passed already models the exact positive-control signal (returns False on an empty P3 population, mirroring P1a per its docstring at 880-889), but amain ignores it for the exit code. So if gold drift drops or mislabels all P3 rows in the --include-p3 weekly run, false_resolve.rate becomes None -> assert_false_resolve_ceiling is inert (None is not a breach) -> the weekly run exits 0 (green) with the pinned safety invariant silently unmeasured. The changed docstring on assert_false_resolve_ceiling (~line 1810) asserts 'the per-leg blindness guards ... own that case,' but for P3 the guard is only a soft log, so nothing actually owns it. Before this change the safety number was anchored by P1a's always-present (measured-0%) contribution; now P3 is the only meaningful weekly contributor, so its non-hard-failing blindness is consequential. Recommend translating a requested-but-blind P3 leg (p3_result is not None and not p3_result.passed) into a hard failure, consistent with P1a's positive-control philosophy.
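
The silent-disarm path can be illustrated with a small sketch. The names `ceiling_breached` and `weekly_failed` are hypothetical stand-ins, not the runner's real functions. The only behavior taken from the review is that a `None` rate is not a breach and that a requested-but-blind P3 leg should fail the run.

```python
from typing import Optional


def ceiling_breached(rate: Optional[float], ceiling: float) -> bool:
    """A None rate means 'not measured', which is not itself a breach."""
    return rate is not None and rate > ceiling


def weekly_failed(p3_requested: bool, p3_passed: bool,
                  rate: Optional[float], ceiling: float = 0.05) -> bool:
    """Fail on a ceiling breach OR on a requested-but-blind P3 leg."""
    blind = p3_requested and not p3_passed
    return blind or ceiling_breached(rate, ceiling)


# Empty P3 leg: the rate is None, so the ceiling check alone stays green...
print(ceiling_breached(None, 0.05))       # False
# ...but the positive control turns the blind leg into a hard failure.
print(weekly_failed(True, False, None))   # True
# Healthy run: P3 measured at 0% and the leg passed.
print(weekly_failed(True, True, 0.0))     # False
```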

🔧 Fix: hard-fail weekly E7 run on structurally-blind P3 leg
1 warning still open:

⚠️ evals/retrieval/e7_runner.py:2389 - The Round 2 fix hard-fails a structurally-blind (empty) P3 leg, but E7P3Result.passed is purely n_questions > 0 (line 890), so a NON-empty-but-all-mislabeled P3 leg still counts as a pass. Now that the false-resolve ceiling is fed SOLELY by P3, this leaves a second silent-disarm path the new guard does not cover: if config/gold drift makes every P3 row escalate BEFORE the faithfulness gate (e.g. tau_sim raised so P3 retrieval reads weak, or the answerer degrades so all rows escalate at the draft leg — neither trips the P1a gate, whose weak no-context rows still escalate correctly), then false_resolves == [], so compute_e7_metrics reports a MEASURED 0% (denominator = n_questions, numerator = 0), NOT None. The ceiling verdict is therefore breached=False, the P3 guard at line 2389 does not fire (passed=True), and the weekly run exits 0 (green) having NEVER exercised the faithfulness gate the ceiling exists to protect — only a non-paging log.warning for mislabeled rows (line 2916) surfaces it. This is the same silent-disarm class the Round 2 fix closed for empty P3, reached via a different (gold/config drift) path. The guard comment (line 2385) and docs/evals.md ("if gold drift drops or mislabels every P3 row, the P3 positive control fails") overstate coverage: the control catches only the dropped/empty case (rate→None), not present-but-never-reached-the-gate (rate→vacuous 0%). Consider extending the positive control to require ≥1 P3 row to actually reach the faithfulness gate (i.e. escalated_at_faithfulness + false_resolves non-empty, equivalently NOT all P3 rows mislabeled) and fail closed otherwise, mirroring the empty-P3 guard.
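
A positive control based on exercised rows, as the review suggests, might look like the sketch below. `P3Leg` is a hypothetical stand-in for `E7P3Result`. The field names follow the review's wording, but the real class layout is an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class P3Leg:
    """Hypothetical stand-in for the P3 result; field names follow the review."""
    false_resolves: List[str] = field(default_factory=list)
    escalated_at_faithfulness: List[str] = field(default_factory=list)
    mislabeled: List[str] = field(default_factory=list)  # never reached the judge

    @property
    def exercised(self) -> List[str]:
        """Rows that actually reached the faithfulness gate."""
        return self.escalated_at_faithfulness + self.false_resolves

    @property
    def passed(self) -> bool:
        """Positive control: at least one row must exercise the gate."""
        return len(self.exercised) > 0


all_mislabeled = P3Leg(mislabeled=["e7-p3-01", "e7-p3-02", "e7-p3-03"])
healthy = P3Leg(escalated_at_faithfulness=["e7-p3-01", "e7-p3-02", "e7-p3-03"])

print(all_mislabeled.passed)  # False: non-empty, but the gate was never reached
print(healthy.passed)         # True
```

Under this definition an empty leg and an all-mislabeled leg both fail closed, because neither contributes an exercised row.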

🔧 Fix: hard-fail weekly E7 run on all-mislabeled P3 leg
1 warning still open:

⚠️ evals/retrieval/e7_runner.py:1702 - The false-resolve ceiling rate's denominator is the full P3 population (p3.n_questions here; len(self.decisions) in false_resolve_rate at line 882), which includes mislabeled rows that never reached the faithfulness judge. The Round 3 fix added an exercised-based positive control that hard-fails only when ZERO rows exercise the gate (empty or all-mislabeled). But a PARTIALLY-mislabeled P3 leg (≥1 exercised, the rest mislabeled) passes the positive control while the mislabeled rows dilute the rate downward — reintroducing the very dilution the prior rounds were chartered to eliminate, now via gold/config drift inside P3 rather than P1a/P1b. Latent at the current 3-row P3 gold (1 false-resolve = 33%, always breaches 5%), but active as P3 gold grows: e.g. 1 false-resolve over 20 P3 rows with 18 mislabeled reads 1/20 = 5% and clears a 5% ceiling, whereas over the 2 EXERCISED rows it is 1/2 = 50% — a real faithfulness-leg breach masked by gold-defect rows. This contradicts the metric's own documented semantics: compute_e7_metrics's docstring (lines 1645-1646) defines the denominator as 'the cases where a false-resolve can actually occur once a draft clears the retrieval gate' (= exercised), and mislabeled is documented (line 860) as 'a gold-authoring defect, not a pipeline result' — so counting it as a clean handled row in the SAFETY denominator inflates apparent gate quality. Consider making the denominator len(p3.exercised) in BOTH the enforced gate (line 1702 / used at 2857) and the sweep (compute_e7_metrics at line 2209), with false_resolve_rate→None when exercised is empty (consistent with the positive control), and correcting the docstrings/docs that currently claim full drift-proofing.
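
The 1/20 versus 1/2 example from this warning can be reproduced directly. This sketch assumes a hypothetical helper `false_resolve_rate` that takes the exercised count as its denominator and returns `None` when nothing reached the gate. It is not the runner's actual function.

```python
from typing import Optional


def false_resolve_rate(n_false_resolves: int, n_exercised: int) -> Optional[float]:
    """Rate over exercised rows only; None when nothing reached the gate."""
    if n_exercised == 0:
        return None
    return n_false_resolves / n_exercised


CEILING = 0.05

# Example from the review: 20 P3 rows, 18 mislabeled, 1 false-resolve.
over_population = 1 / 20                    # 0.05, does not exceed the ceiling
over_exercised = false_resolve_rate(1, 2)   # 0.5, a clear breach

print(over_population > CEILING)   # False
print(over_exercised > CEILING)    # True
print(false_resolve_rate(0, 0))    # None
```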

✅ **Test** - passed

✅ No issues found.

python -m evals.retrieval.test_e7_runner (57 groups PASS — incl. US-059 e7_pinned_invariants_failed exit-code + ceiling-gate groups)
python -m evals.retrieval.test_e7 (9 groups PASS — shipped escalation_gold.yaml, 10 rows)
Real CLI E2E e7_runner --include-p1b clean per-PR run → process exit 0 (ran with no anthropic package installed)
Real CLI E2E e7_runner --include-p1b with gold leaked to a no-access viewer → exit 1 (P1b gate-clear + non-disclosure leak, JSON written first)
Real CLI E2E e7_runner --include-p1b --include-p2 --include-p3 --sweep healthy weekly → exit 0 (P3 moat holds, knee curve, false-resolve 0% ≤ 5%)
Real CLI E2E weekly with P3 auto-resolves → exit 1 (false-resolve 100% > 5% ceiling breach)
Ran the weekly workflow's jq issue-body builder against the breach + leak snapshots
yaml.safe_load of retrieval-eval.yml and escalation-eval-weekly.yml; confirmed per-PR E7 step has OPENAI_API_KEY but no ANTHROPIC_API_KEY

🔧 **Document** - 1 issue found → auto-fixed ✅

ℹ️ .claude/agent/tasks/PRD_phase2_agentic_rag_kit.md:5 - The PRD planning doc added by this change (.claude/agent/tasks/PRD_phase2_agentic_rag_kit.md) asserts ADR-0003 and ADR-0004 are "on-disk ADRs in docs/adr/" (line 5) and points to "docs/adr/0003-escalation-signal-and-deflection-pipeline.md" (line 242), but neither file exists. I did NOT auto-fix this: (1) it is a planning/spec artifact (README labels .claude/ as "not needed to run the app"), a historical 2026-06-16 reconciliation snapshot, not deliverable documentation; (2) the published docs (README, docs/, CONTEXT.md) are internally consistent because file-less "ADR-NNNN" tags are an established convention here - ADR-0006 (model surface) is cited the same way with no on-disk file either; (3) resolving it is a judgment call between authoring ADR-0003/0004 versus rewording the "on-disk" claim, which needs design intent. Flagging so a maintainer can decide.

🔧 Fix: correct PRD on-disk ADR-0003/0004 claims to ruled-not-written
✅ Re-checked - no issues remain.

✅ **Lint** - passed

✅ No issues found.

✅ **Push** - passed

✅ No issues found.

…nistic deflection pipeline + E7 eval (US-046-059, ADR-0003)

Implements Epic D of the Phase-2 kit: the escalation & deflection pipeline and its E7 eval, per ADR-0003.

Runtime (backend/escalation.py):
- US-046: surface pre-fusion raw cosine on retrieval results (RRF overwrites similarity, so cosine is plumbed through before any gate).
- US-047: cosine-defined retrieval gate, pure arithmetic on scores (strong = top1_cosine >= tau_sim AND n_cleared >= n_min); no LLM/reranker dependency.
- US-048: one-call runtime faithfulness gate; an unset/failed judge fails CLOSED (escalate), never auto-sends. JUDGE_MODEL is a judge-role selector that does not chain through OPENAI_MODEL (docs/model-surface.md).
- US-049: deterministic deflection orchestrator (off-topic/empty-draft/unfaithful -> escalate; supported -> answer) with no reason leak on escalation.
- US-050: EscalationConfig (three validated global knobs) + a STANDALONE false-resolve ceiling that is structurally un-wireable into the per-request pipeline.

E7 eval (evals/retrieval/e7*.py, escalation_gold.yaml):
- US-051-057: golden-set schema + P1a/P2/P3 populations and the derived P1b viewer-parameterized no-access case (no privileged second pass exists), all classified through the real backend gate.
- US-055: consolidated deflection / false-resolve / false-escalate rates; false-resolve is the pinned safety metric, others are tunable quality.
- US-056: knob sweep + ceiling-constrained knee selection (memoized so the whole grid costs one operating point's worth of LLM calls).
- US-058: deterministic P1b non-disclosure byte-equality assertion.

CI placement (US-059):
- Per-PR deterministic tripwire in retrieval-eval.yml (P1a/P1b gate + non-disclosure byte equality, no LLM) that can fail the build.
- New escalation-eval-weekly.yml runs the full LLM-judged P2/P3 + sweep and enforces the false-resolve ceiling on the scheduled run only.
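
The gate rule and the fail-closed control flow described in this commit message can be sketched as below. This is an illustration under assumptions, not the code in backend/escalation.py: the function names, the `DEFERRAL` text, the callable-based `draft` / `judge` parameters, and the reading of `n_cleared` as "chunks whose raw cosine clears tau_sim" are all hypothetical.

```python
from typing import Callable, Optional, Sequence

DEFERRAL = "Thanks for your question. A teammate will follow up."  # hypothetical text


def retrieval_is_strong(cosines: Sequence[float], tau_sim: float, n_min: int) -> bool:
    """strong = top1_cosine >= tau_sim AND n_cleared >= n_min."""
    if not cosines:
        return False
    top1 = max(cosines)
    # Assumption: n_cleared counts chunks whose raw cosine clears tau_sim.
    n_cleared = sum(1 for c in cosines if c >= tau_sim)
    return top1 >= tau_sim and n_cleared >= n_min


def deflect(cosines: Sequence[float],
            draft: Callable[[], str],
            judge: Callable[[str], Optional[bool]],
            tau_sim: float = 0.3, n_min: int = 1) -> str:
    """retrieve -> gate -> draft -> faithfulness -> answer-or-escalate."""
    if not retrieval_is_strong(cosines, tau_sim, n_min):
        return DEFERRAL  # weak retrieval escalates before any LLM call
    answer = draft()
    if not answer.strip():
        return DEFERRAL  # empty draft escalates
    try:
        supported = judge(answer)
    except Exception:
        supported = None  # a judge error fails closed
    # Only an explicit True auto-sends; None or False escalates.
    return answer if supported is True else DEFERRAL
```

Every escalation path returns the same string, so the caller cannot tell which gate fired, which mirrors the "no reason leak" requirement.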


vercel · 2026-06-23T21:48:11Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agentic-rag | Ready | Preview, Comment | Jun 24, 2026 2:02am |


github-actions · 2026-06-24T02:09:32Z

Retrieval eval — PR vs `main`

n = 50 questions × 3 modes (vector, keyword, hybrid) on a 14-chunk corpus. PR ran in 77.19s; main in 71.16s.

Headline (each cell: PR value, Δ vs `main`)

| Mode | recall@5 | MRR | nDCG@5 |
| --- | --- | --- | --- |
| vector | 0.860 (±0.000) | 0.772 (±0.000) | 0.779 (±0.000) |
| keyword | 0.110 (±0.000) | 0.120 (±0.000) | 0.112 (±0.000) |
| hybrid | 0.860 (±0.000) | 0.759 (±0.000) | 0.769 (±0.000) |

Per-category recall@5

| Mode | single_chunk | multi_hop | adversarial | paraphrase |
| --- | --- | --- | --- | --- |
| vector | 0.900 (±0.000) | 0.933 (±0.000) | 0.600 (±0.000) | 1.000 (±0.000) |
| keyword | 0.250 (±0.000) | 0.033 (±0.000) | 0.000 (±0.000) | 0.000 (±0.000) |
| hybrid | 0.900 (±0.000) | 0.933 (±0.000) | 0.600 (±0.000) | 1.000 (±0.000) |

_Comment is updated in place on each push by .github/workflows/retrieval-eval.yml (US-035). Comment-only; never blocks the build._
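
For readers unfamiliar with the three headline metrics, the sketch below computes them for a single query. It assumes binary relevance and the standard textbook definitions. The eval harness in this repository may define them differently, and the function names and chunk ids here are hypothetical.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top k."""
    hits = sum(1 for cid in ranked[:k] if cid in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none is retrieved."""
    for i, cid in enumerate(ranked, start=1):
        if cid in relevant:
            return 1.0 / i
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Binary-gain DCG over the top k, normalized by the ideal ordering."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, cid in enumerate(ranked[:k], start=1) if cid in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


ranked = ["c3", "c1", "c9", "c2", "c7", "c5"]  # made-up retrieval order
relevant = {"c1", "c5"}                         # made-up gold chunks

print(recall_at_k(ranked, relevant))      # 0.5: c5 falls outside the top 5
print(reciprocal_rank(ranked, relevant))  # 0.5: first hit is at rank 2
print(ndcg_at_k(ranked, relevant))
```

MRR in the table would then be the mean of `reciprocal_rank` over all questions in a mode.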

hcho22 added 8 commits June 23, 2026 07:46

- a727cd9 no-mistakes(review): Match E7 eval retrieval depth to production DEFAULT_TOP_K
- 98ea3ca no-mistakes(review): gate E7 false-resolve ceiling on faithfulness-leg (P3) rate only
- 8e7f537 no-mistakes(review): hard-fail weekly E7 run on structurally-blind P3 leg
- 407df7b no-mistakes(review): hard-fail weekly E7 run on all-mislabeled P3 leg
- bae6b32 no-mistakes(document): document E7 escalation pipeline in README and CONTEXT glossary
- 7c072ce no-mistakes(document): correct PRD on-disk ADR-0003/0004 claims to ruled-not-written
- 8fece33 no-mistakes(lint): remove unused RetrievalGateDecision import in escalation gate test
- 2896507 fix(evals): P1b leak check asserts gold-chunk exclusion, not gate-strong (#27)

vercel Bot deployed to Preview June 24, 2026 00:33 View deployment

hcho22 closed this Jun 24, 2026

hcho22 reopened this Jun 24, 2026

- ff1db18 Merge remote-tracking branch 'origin/main' into pr27 (conflicts: .claude/agent/tasks/prd-phase2-implementation.md)

vercel Bot deployed to Preview June 24, 2026 00:45 View deployment

- 4275b5d fix(evals): re-author P1a no-context gold to maximally off-topic Qs so they read weak at the retrieval gate (#27)

vercel Bot deployed to Preview June 24, 2026 02:02 View deployment

hcho22 merged commit 9fbf4d7 into main Jun 24, 2026
3 checks passed

hcho22 deleted the feat/escalation-deflection-pipeline branch June 24, 2026 03:35

This was referenced Jun 24, 2026

- feat(conversations): status state machine + escalation latch (US-067) #29 (merged)
- feat(evals): E7 P3 mislabel-ratio guard + tighten false-resolve denominator docs #30 (merged)

hcho22 merged 11 commits into main from feat/escalation-deflection-pipeline
