Skip to content

feat(escalation): cosine retrieval gate, faithfulness gate, and deterministic deflection pipeline with E7 eval#27

Merged
hcho22 merged 11 commits into
mainfrom
feat/escalation-deflection-pipeline
Jun 24, 2026
Merged

feat(escalation): cosine retrieval gate, faithfulness gate, and deterministic deflection pipeline with E7 eval#27
hcho22 merged 11 commits into
mainfrom
feat/escalation-deflection-pipeline

Conversation

@hcho22

@hcho22 hcho22 commented Jun 23, 2026

Copy link
Copy Markdown
Owner

Intent

The developer asked the agent to implement user story US-059 from the Phase 2 implementation PRD. US-059 is a CI-placement story for the E7 escalation eval: it requires a fast, deterministic per-PR tripwire (retrieval-gate decisions plus P1b non-disclosure byte-equality checks) that can block merges, alongside a heavier weekly scheduled sweep that runs the LLM-judge legs. The developer also expected the work to honor deferral notes carried over from US-050/US-055, specifically enforcing the false-resolve ceiling as a pinned safety invariant in the runner's exit code. Implicit constraints from the developer's environment included keeping the per-PR tripwire runnable under the slim requirements-ci.txt (no Anthropic API key), matching existing conventions for runner code, tests, CI workflows, and PRD status lines, and updating documentation (docs/evals.md, README, PRD status table) accordingly.

What Changed

  • Added backend/escalation.py implementing a deterministic deflection pipeline: a cheap retrieval gate that calls retrieval "weak" using the raw pre-fusion vector cosine (tau_sim / n_min), a one-call faithfulness gate that fails closed (any judge error/refusal/parse failure escalates), and run_deflection_pipeline wiring them as fixed control flow (retrieve → gate → [if strong] draft → faithfulness → answer-or-escalate) with a generic deferral message carrying no reason metadata; EscalationConfig resolves and validates the gate knobs from env. backend/retrieval.py now carries cosine_similarity separately from the RRF-fused similarity so the gate thresholds on a real cosine, preserving it through hybrid fusion and reranking.
  • Added the E7 escalation eval (evals/retrieval/e7.py, e7_runner.py, escalation_gold.yaml) that enforces the false-resolve ceiling as a pinned safety invariant in the runner exit code, gated solely on the P3 faithfulness leg, with positive-control guards that hard-fail when the P3 leg is empty or all-mislabeled so a drift-blinded gate cannot pass silently.
  • Added CI placement: a fast, deterministic per-PR E7 tripwire (retrieval-gate decisions plus P1b non-disclosure byte-equality checks, runnable under slim CI with no Anthropic key) in retrieval-eval.yml, and a heavier weekly scheduled sweep running the LLM-judge legs in escalation-eval-weekly.yml; updated docs/evals.md, README, CONTEXT glossary, and the PRD status table.

Risk Assessment

⚠️ Medium: The latest commit's all-mislabeled hard-fail is correct, well-tested, and safe to merge at the current gold size, but it leaves a residual partial-mislabeling dilution path that weakens the pinned safety invariant the prior rounds were explicitly chartered to make non-dilutable, and that becomes exploitable as the P3 gold grows.

Testing

Baseline: ran both eval unit suites under a Python 3.11 venv with the slim per-PR deps (httpx/openai/pydantic/pyyaml/asyncpg/pyjwt) — test_e7_runner (57 groups) and test_e7 (9 groups) pass, covering the real gate, metric consolidation, and the US-059 exit-code/ceiling decisions. Since no OPENAI_API_KEY is available (full retrieval needs Supabase + embeddings), I drove the real e7_runner.main() CLI process for all four CI scenarios with only the network/DB/draft/judge boundary stubbed deterministically; everything downstream (gate arithmetic, ceiling gate, render, exit code, sys.exit) is the shipped code. The per-PR tripwire exits 0 on a clean run (and with no anthropic package present) and exits 1 on a leak, blocking the merge; the weekly sweep exits 0 when healthy and exits 1 on a false-resolve ceiling breach, with the JSON snapshot written before exit and correctly driving the weekly workflow's jq issue body. Both workflow YAMLs parse and the per-PR step carries no Anthropic key. No UI surface is involved (CI/eval-runner change), so evidence is CLI transcripts, rendered markdown snapshots, and JSON artifacts rather than screenshots. Worktree left clean.

Evidence: Evidence index (scenario matrix + repro)
# US-059 E7 CI placement — test evidence

Branch `feat/escalation-deflection-pipeline` (407df7b). The intent: a fast **deterministic
per-PR tripwire** (retrieval-gate decisions + P1b non-disclosure byte equality) that can
block merges, alongside a heavier **weekly sweep** running the LLM-judged legs, with the
**false-resolve ceiling pinned as the runner's exit code**.

## 1. Unit suites (real gate, real metrics, real exit-code decision; fakes for I/O)

- `python -m evals.retrieval.test_e7_runner` → **57 groups PASS** (includes the 6 US-059
  `e7_pinned_invariants_failed` exit-code groups + 5 ceiling-gate groups).
- `python -m evals.retrieval.test_e7` → **9 groups PASS** (shipped `escalation_gold.yaml`:
  10 rows, all populations present).

## 2. End-to-end CLI process (the real `e7_runner.main()`, only the network/DB boundary stubbed)

Each row is one real process invocation of the exact command the CI workflow runs; the
process exit code is the real `sys.exit` captured by the shell.

| Scenario | CLI (as committed in CI) | Exit | Demonstrates |
|---|---|---|---|
| per-PR clean | `e7_runner --include-p1b` | **0** | deterministic tripwire passes an innocent PR; ran with **no `anthropic` package installed** |
| per-PR leak | `e7_runner --include-p1b` (gold leaked to no-access viewer) | **1** | P1b gate-clear **and** non-disclosure leak → **blocks the merge** |
| weekly healthy | `e7_runner --include-p1b --include-p2 --include-p3 --sweep` | **0** | full sweep, P3 moat holds, false-resolve 0% ≤ 5% ceiling, knee curve emitted |
| weekly regression | same | **1** | P3 unanswerables auto-resolved → false-resolve 100% > 5% → **ceiling breach fails the scheduled run** |

Snapshots: `snap-*.json` (the `docs/escalation-weekly/<DATE>.json` artifact) + `*.stdout.md`
(the rendered report the weekly workflow tees to `<DATE>.md`). Pinned-fail ERROR lines in
`*.stderr.log`.

## 3. Runner JSON → weekly-workflow issue body (integration point)

The weekly workflow's `jq` issue-body builder was run against the produced snapshots:
- breach snapshot → `- **False-resolve ceiling breach** ... measured 100% (3/3) exceeds the buyer ceiling 5% (ESCALATION_FALSE_RESOLVE_CEILING).`
- leak snapshot → `- **P1b no-access leak**: e7-p2-01..04` + `- **P1b non-disclosure leak** (4 row(s)).`

## 4. Workflow placement

- `.github/workflows/retrieval-eval.yml` (pull_request): E7 tripwire step runs
  `e7_runner --include-p1b`; env carries `OPENAI_API_KEY` + Supabase, **no `ANTHROPIC_API_KEY`**.
- `.github/workflows/escalation-eval-weekly.yml` (schedule + workflow_dispatch): full sweep.
- Both YAMLs parse.

## How to reproduce

`` `
python3.11 -m venv /tmp/e7-venv
/tmp/e7-venv/bin/pip install httpx openai pydantic pyyaml asyncpg pyjwt   # per-PR (slim) env
cd <repo> && /tmp/e7-venv/bin/python -m evals.retrieval.test_e7_runner
E7_REPO_ROOT=<repo> /tmp/e7-venv/bin/python e7_cli_e2e.py clean-per-pr snap-clean-per-pr.json
# weekly env additionally: /tmp/e7-venv/bin/pip install anthropic
`` `
Evidence: E2E driver (drives real e7_runner.main(), only I/O boundary stubbed)
"""End-to-end driver for the US-059 E7 escalation runner.

This drives the REAL `evals.retrieval.e7_runner` CLI process — argparse, the env
guards, the lazy `runner` import, leg orchestration, JSON snapshot write, the full
multi-section markdown render, the pinned exit-code decision, and `sys.exit` — for
the two CI invocations US-059 ships:

  * the PER-PR tripwire   : `e7_runner --include-p1b`            (deterministic only)
  * the WEEKLY full sweep : `e7_runner --include-p1b --include-p2 --include-p3 --sweep`

The ONLY thing stubbed is the genuine I/O boundary: hybrid retrieval
(`runner.run_query`, which would embed via OpenAI + hit Supabase), the P1b DB ACL
reset (`asyncpg` + the E4 reset helpers), the OpenAI drafter, and the offline Claude
judge. Everything downstream of that boundary — the real `escalation.retrieval_gate`,
the real metric consolidation, the real false-resolve ceiling gate, the real render
functions, and the real `e7_pinned_invariants_failed` exit-code decision — runs for
real, so the rendered snapshot + process exit code are exactly what CI would observe.

Usage:  python e7_cli_e2e.py <scenario> <out_json_path>
  scenarios: clean-per-pr | p1b-leak | weekly-clean | weekly-ceiling-breach
"""
from __future__ import annotations

import os
import sys
import uuid
from pathlib import Path

ROOT = Path(os.environ["E7_REPO_ROOT"]).resolve()
sys.path.insert(0, str(ROOT))
sys.path.insert(0, str(ROOT / "backend"))

SCENARIO = sys.argv[1]
OUT = sys.argv[2]

# --- fake env so amain's guards pass (no real network is ever reached) --------
os.environ["SUPABASE_URL"] = "http://127.0.0.1:54321"
os.environ["OPENAI_API_KEY"] = "sk-fake-e2e"
os.environ["SUPABASE_SERVICE_ROLE_KEY"] = "service-role-fake"
os.environ["SUPABASE_ANON_KEY"] = "anon-fake"
os.environ["SUPABASE_JWT_SECRET"] = "super-secret-jwt-token-with-at-least-32-characters-long"
os.environ["CORPUS_SEED_DATABASE_URL"] = "postgresql://fake/db"
os.environ["DATABASE_URL"] = "postgresql://fake/db"
if SCENARIO.startswith("weekly"):
    os.environ["ANTHROPIC_API_KEY"] = "sk-ant-fake-e2e"

from retrieval import SearchDocumentsResult  # noqa: E402
from evals.retrieval import e7, e7_runner  # noqa: E402
from evals.retrieval import runner as r  # noqa: E402

# Real shipped gold → {question_text: label}. The driver classifies retrieval
# strength off the genuine labels in escalation_gold.yaml (10 rows).
LABEL = {row["question"]: row["escalation"] for row in e7.load_escalation_questions()}

OWNER_IDENTITY = f"id:{r.CORPUS_USER_ID}"


def _rows(*cosines: float) -> list[SearchDocumentsResult]:
    return [
        SearchDocumentsResult(
            id=f"c{i}",
            document_id=f"doc-c{i}",
            chunk_index=0,
            content=f"content c{i}",
            similarity=cos,
            filename=f"c{i}.txt",
            cosine_similarity=cos,
        )
        for i, cos in enumerate(cosines)
    ]


WEAK = _rows(0.10, 0.09)            # top1 0.10 < tau 0.4 -> weak gate -> escalate
STRONG = _rows(0.70, 0.62, 0.55)    # top1 >= tau AND >= n_min cleared -> strong gate


# --- boundary stubs -----------------------------------------------------------
async def fake_run_query(mode, openai_client, http, supabase_url, headers, question, top_k=None):
    identity = headers.get("_eval_identity")
    label = LABEL.get(question)
    if identity == OWNER_IDENTITY:
        # Full-access corpus owner: P1a (no_context) is genuinely weak; the
        # answerable/should-escalate populations retrieve strong gold.
        return WEAK if label == "no_context" else STRONG
    # No-access viewer (P1b replay): the gold is ACL-revoked, so retrieval is weak
    # and the row MUST escalate exactly like P1a — UNLESS we are simulating a leak.
    if SCENARIO == "p1b-leak":
        return STRONG  # the gold leaked to a no-access viewer -> clears the gate
    return WEAK


def fake_user_headers(jwt_token, key):
    return {"_eval_identity": jwt_token}


def fake_mint_user_jwt(user_id, email, secret):
    return f"id:{user_id}"


async def fake_ensure_viewer_users(database_url):
    return None


async def fake_fetch_stable_id_map(database_url):
    # {chunk_id(uuid str): stable_id}. e7_runner builds uuid.UUID(chunk_id), so the
    # keys must be valid UUIDs.
    return {str(uuid.uuid4()): f"sid-{i}" for i in range(3)}


def fake_compute_visible_stable_ids(viewer, q, all_sids, extra):
    return []  # the no-access viewer sees none of this question's gold


async def fake_reset_viewer_acls(conn, visible_chunk_ids):
    return None


class _FakeConn:
    async def close(self):
        return None


async def fake_asyncpg_connect(url):
    return _FakeConn()


async def fake_draft(client, message, chunks):
    # A non-empty drafted answer so P2/P3 reach the faithfulness judge.
    return f"Drafted answer for: {message[:40]}"


async def fake_judge(anthropic_client, question, reference, context, answer):
    label = LABEL.get(question)
    if label == "answerable_faithful":
        return {"faithfulness": 5, "helpfulness": 5}   # P2 faithful -> auto-resolve (deflection)
    if label == "should_escalate":
        if SCENARIO == "weekly-ceiling-breach":
            # Regression: the pipeline judges the unfaithful P3 drafts FAITHFUL, so
            # they auto-resolve -> false-resolve (the Risk #3 safety failure).
            return {"faithfulness": 5, "helpfulness": 5}
        # Healthy moat: P3 drafts are unfaithful -> escalate at the faithfulness gate.
        return {"faithfulness": 1, "helpfulness": 2}
    return {"faithfulness": 1, "helpfulness": 1}


# install stubs at the real boundary
r.run_query = fake_run_query
r.user_headers = fake_user_headers
r.mint_user_jwt = fake_mint_user_jwt
r.ensure_viewer_users = fake_ensure_viewer_users
r.fetch_stable_id_map = fake_fetch_stable_id_map
r.compute_visible_stable_ids = fake_compute_visible_stable_ids
r.reset_viewer_acls = fake_reset_viewer_acls
r.judge_answer = fake_judge
e7_runner.draft_support_answer = fake_draft

import asyncpg  # noqa: E402

asyncpg.connect = fake_asyncpg_connect

# --- build the real CLI argv for this scenario --------------------------------
argv = ["e7_runner", "--out", OUT]
if SCENARIO in ("clean-per-pr", "p1b-leak"):
    argv += ["--include-p1b"]
elif SCENARIO in ("weekly-clean", "weekly-ceiling-breach"):
    argv += ["--include-p1b", "--include-p2", "--include-p3", "--sweep"]
else:
    raise SystemExit(f"unknown scenario {SCENARIO!r}")

sys.argv = argv
# main() does `raise SystemExit(asyncio.run(amain()))`, so the process exits with
# the REAL pinned exit code — captured by the shell as $?.
e7_runner.main()
Evidence: Per-PR clean tripwire rendered report (exit 0)

### E7 P1a (US-052) — genuinely-no-context escalation (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Every P1a question is genuinely unanswerable, so it must escalate at the deterministic retrieval gate with **no draft / judge call**.

| Question | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p1a-01` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-02` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-03` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |

**Verdict:** PASS — all 3 P1a rows escalated at the retrieval gate (0 draft / 0 judge calls).

### E7 P1b (US-057) — no-access replay of P2 (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Each P1b row is a P2 (`answerable_faithful`) question replayed under a **no-access viewer** (its gold ACL-revoked), so the gold is invisible and the row must escalate at the retrieval gate — the same output as P1a, with no privileged second pass (US-057/058).

| Question (P2 source) | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p2-01` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-02` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-03` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-04` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |

**Verdict:** PASS — all 4 P1b rows (P2 questions under a no-access viewer) escalated at the retrieval gate (0 draft / 0 judge).

### E7 P1b non-disclosure (US-058) — customer-output byte equality

Pinned security invariant: the bytes a **no-access** customer sees on a P1b escalation must be byte-for-byte identical to the P1a generic deferral (`escalation.GENERIC_DEFERRAL`), so escalating never discloses that restricted content exists. Deterministic (no LLM) → per-PR hard block, pinned `fail` / un-downgradable (US-059).

Reference (P1a deferral, 157 bytes): “Thanks for reaching out. I don't have enough information to answer this confidently, so I've passed it along to our team — a human will follow up with you.”

**Verdict:** PASS — all 4 P1b row(s) show the customer byte-for-byte the SAME generic deferral as P1a (no reason / restricted-to / existence bit disclosed).

### E7 consolidated metrics (US-055) — deflection / false-resolve / false-escalate

Operating objective: **maximize deflection subject to false-resolve ≤ ceiling**. Answerable = P2; the gated false-resolve rate is the faithfulness leg (P3 rows that auto-resolved, over P3). The retrieval-leg P1a/P1b false-resolves are shown `[monitor-only]` — a zero-tolerance invariant the P1a/P1b gates hard-fail unconditionally, NOT diluting the ceiling-gated rate. Each rate carries its numerator/denominator + per-population breakdown so the false-resolve number is verifiable against the buyer's ceiling (US-059), not folded into an opaque accuracy score. Reported here; the ceiling is enforced in US-059.

| Metric | Rate | n/d | Class | By population |
|---|---|---|---|---|
| deflection | — (blind) | 0/0 | tunable quality | — |
| false_resolve | — (blind) | 0/0 | 🔒 safety (pinned) | P1a 0/3 [monitor-only], P1b 0/4 [monitor-only] |
| false_escalate | — (blind) | 0/0 | tunable quality | — |

### E7 false-resolve ceiling gate (US-059) — pinned safety invariant

- Buyer ceiling: **5%** (`ESCALATION_FALSE_RESOLVE_CEILING`, US-050)
- Measured false-resolve (faithfulness leg, P3): **— (blind: no P3 faithfulness-leg rows scored)**
- Verdict: **not measured**

A measured faithfulness-leg false-resolve rate above the ceiling fails the run (pinned `fail`, never downgraded to a comment — unlike the tunable deflection/false-escalate metrics). The LLM-judged P3 leg is scored only in the weekly sweep, so per-PR (P1a/P1b only) this rate is not measured and the gate is inert (the retrieval-leg P1a/P1b false-resolves are pinned separately); the faithfulness-leg false-resolve carries an accepted up-to-a-week detection latency.

→ /var/folders/9t/k_yy9fqs5vd27rf12jx_rzqh0000gn/T/no-mistakes-evidence/01KVTHACF2PBB4NP9556FTSCDV/snap-clean-per-pr.json
Evidence: Per-PR leak rendered report (exit 1, blocks merge)

### E7 P1a (US-052) — genuinely-no-context escalation (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Every P1a question is genuinely unanswerable, so it must escalate at the deterministic retrieval gate with **no draft / judge call**.

| Question | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p1a-01` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-02` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-03` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |

**Verdict:** PASS — all 3 P1a rows escalated at the retrieval gate (0 draft / 0 judge calls).

### E7 P1b (US-057) — no-access replay of P2 (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Each P1b row is a P2 (`answerable_faithful`) question replayed under a **no-access viewer** (its gold ACL-revoked), so the gold is invisible and the row must escalate at the retrieval gate — the same output as P1a, with no privileged second pass (US-057/058).

| Question (P2 source) | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p2-01` | draft ⚠️ leak | 0.7000 | 3 | strong |
| `e7-p2-02` | draft ⚠️ leak | 0.7000 | 3 | strong |
| `e7-p2-03` | draft ⚠️ leak | 0.7000 | 3 | strong |
| `e7-p2-04` | draft ⚠️ leak | 0.7000 | 3 | strong |

**Verdict:** FAIL — 4 P1b row(s) CLEARED the retrieval gate: the gold leaked to a no-access viewer (isolation/disclosure failure): `e7-p2-01`, `e7-p2-02`, `e7-p2-03`, `e7-p2-04`.

### E7 P1b non-disclosure (US-058) — customer-output byte equality

Pinned security invariant: the bytes a **no-access** customer sees on a P1b escalation must be byte-for-byte identical to the P1a generic deferral (`escalation.GENERIC_DEFERRAL`), so escalating never discloses that restricted content exists. Deterministic (no LLM) → per-PR hard block, pinned `fail` / un-downgradable (US-059).

Reference (P1a deferral, 157 bytes): “Thanks for reaching out. I don't have enough information to answer this confidently, so I've passed it along to our team — a human will follow up with you.”

**Verdict:** FAIL — 4 P1b row(s) DISCLOSED: their customer-facing output differs from the P1a generic deferral: `e7-p2-01` (drafted_answer_disclosed), `e7-p2-02` (drafted_answer_disclosed), `e7-p2-03` (drafted_answer_disclosed), `e7-p2-04` (drafted_answer_disclosed).

### E7 consolidated metrics (US-055) — deflection / false-resolve / false-escalate

Operating objective: **maximize deflection subject to false-resolve ≤ ceiling**. Answerable = P2; the gated false-resolve rate is the faithfulness leg (P3 rows that auto-resolved, over P3). The retrieval-leg P1a/P1b false-resolves are shown `[monitor-only]` — a zero-tolerance invariant the P1a/P1b gates hard-fail unconditionally, NOT diluting the ceiling-gated rate. Each rate carries its numerator/denominator + per-population breakdown so the false-resolve number is verifiable against the buyer's ceiling (US-059), not folded into an opaque accuracy score. Reported here; the ceiling is enforced in US-059.

| Metric | Rate | n/d | Class | By population |
|---|---|---|---|---|
| deflection | — (blind) | 0/0 | tunable quality | — |
| false_resolve | — (blind) | 0/0 | 🔒 safety (pinned) | P1a 0/3 [monitor-only], P1b 4/4 [monitor-only] |
| false_escalate | — (blind) | 0/0 | tunable quality | — |

### E7 false-resolve ceiling gate (US-059) — pinned safety invariant

- Buyer ceiling: **5%** (`ESCALATION_FALSE_RESOLVE_CEILING`, US-050)
- Measured false-resolve (faithfulness leg, P3): **— (blind: no P3 faithfulness-leg rows scored)**
- Verdict: **not measured**

A measured faithfulness-leg false-resolve rate above the ceiling fails the run (pinned `fail`, never downgraded to a comment — unlike the tunable deflection/false-escalate metrics). The LLM-judged P3 leg is scored only in the weekly sweep, so per-PR (P1a/P1b only) this rate is not measured and the gate is inert (the retrieval-leg P1a/P1b false-resolves are pinned separately); the faithfulness-leg false-resolve carries an accepted up-to-a-week detection latency.

→ /var/folders/9t/k_yy9fqs5vd27rf12jx_rzqh0000gn/T/no-mistakes-evidence/01KVTHACF2PBB4NP9556FTSCDV/snap-p1b-leak.json
Evidence: Per-PR leak pinned-fail ERROR lines (CI surface)

ERROR ...: E7 P1b LEAK — 4 no-access-replay row(s) cleared the retrieval gate; the gold leaked to a no-access viewer ERROR ...: E7 P1b NON-DISCLOSURE LEAK — 4 P1b row(s) show the customer a DIFFERENT output than the P1a generic deferral

ERROR agentic_rag.evals.retrieval.e7: E7 P1b LEAK — 4 no-access-replay row(s) cleared the retrieval gate; the gold leaked to a no-access viewer (isolation/disclosure failure):
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-01 top1_cosine=0.7000 n_cleared=3 (strong)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-02 top1_cosine=0.7000 n_cleared=3 (strong)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-03 top1_cosine=0.7000 n_cleared=3 (strong)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-04 top1_cosine=0.7000 n_cleared=3 (strong)
ERROR agentic_rag.evals.retrieval.e7: E7 P1b NON-DISCLOSURE LEAK — 4 P1b row(s) show the customer a DIFFERENT output than the P1a generic deferral (restricted-content existence disclosed):
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-01 (drafted_answer_disclosed)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-02 (drafted_answer_disclosed)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-03 (drafted_answer_disclosed)
ERROR agentic_rag.evals.retrieval.e7:   e7-p2-04 (drafted_answer_disclosed)
Evidence: Weekly full-sweep rendered report (exit 0, knee curve)

### E7 P1a (US-052) — genuinely-no-context escalation (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Every P1a question is genuinely unanswerable, so it must escalate at the deterministic retrieval gate with **no draft / judge call**.

| Question | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p1a-01` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-02` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p1a-03` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |

**Verdict:** PASS — all 3 P1a rows escalated at the retrieval gate (0 draft / 0 judge calls).

### E7 P1b (US-057) — no-access replay of P2 (retrieval gate)

τ_sim=0.4 · N_min=2 · match_threshold=0.3. Each P1b row is a P2 (`answerable_faithful`) question replayed under a **no-access viewer** (its gold ACL-revoked), so the gold is invisible and the row must escalate at the retrieval gate — the same output as P1a, with no privileged second pass (US-057/058).

| Question (P2 source) | Decision | top1_cosine | n_cleared | gate reason |
|---|---|---|---|---|
| `e7-p2-01` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-02` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-03` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |
| `e7-p2-04` | escalate | 0.1000 | 0 | weak: top1_cosine 0.1000 < tau_sim 0.4000 |

**Verdict:** PASS — all 4 P1b rows (P2 questions under a no-access viewer) escalated at the retrieval gate (0 draft / 0 judge).

### E7 P1b non-disclosure (US-058) — customer-output byte equality

Pinned security invariant: the bytes a **no-access** customer sees on a P1b escalation must be byte-for-byte identical to the P1a generic deferral (`escalation.GENERIC_DEFERRAL`), so escalating never discloses that restricted content exists. Deterministic (no LLM) → per-PR hard block, pinned `fail` / un-downgradable (US-059).

Reference (P1a deferral, 157 bytes): “Thanks for reaching out. I don't have enough information to answer this confidently, so I've passed it along to our team — a human will follow up with you.”

**Verdict:** PASS — all 4 P1b row(s) show the customer byte-for-byte the SAME generic deferral as P1a (no reason / restricted-to / existence bit disclosed).

### E7 P2 (US-053) — answerable + faithful auto-resolve (offline judge)

τ_sim=0.4 · N_min=2 · match_threshold=0.3 · faithfulness≥4/5 (offline judge `claude-haiku-4-5`). Every P2 question has a faithful grounded answer, so the pipeline should **auto-resolve** it; an escalation is a false-escalate. Quality metric — not a per-PR hard block (US-059).

| Question | Decision | top1_cosine | faithfulness | leg |
|---|---|---|---|---|
| `e7-p2-01` | auto_resolve | 0.7000 | 5/5 | — |
| `e7-p2-02` | auto_resolve | 0.7000 | 5/5 | — |
| `e7-p2-03` | auto_resolve | 0.7000 | 5/5 | — |
| `e7-p2-04` | auto_resolve | 0.7000 | 5/5 | — |

**Verdict:** deflection 100% (4/4 auto-resolved); 0 false-escalate(s).

### E7 P3 (US-054) — should-escalate / the moat (offline judge)

τ_sim=0.4 · N_min=2 · match_threshold=0.3 · faithfulness≥4/5 (offline judge `claude-haiku-4-5`). Every P3 question has strong retrieval but **no faithful grounded answer**, so the pipeline must clear the retrieval gate, draft, and then **escalate at the faithfulness gate**. An auto-resolve is a **false-resolve** (the Risk #3 safety failure the ceiling governs, US-055/059) — not a per-PR block here (LLM-judged).

| Question | Decision | top1_cosine | faithfulness | leg | flag |
|---|---|---|---|---|---|
| `e7-p3-01` | escalate | 0.7000 | 1/5 | faithfulness |  |
| `e7-p3-02` | escalate | 0.7000 | 1/5 | faithfulness |  |
| `e7-p3-03` | escalate | 0.7000 | 1/5 | faithfulness |  |

**Verdict:** false-resolve 0% (0/3); 3 correctly escalated at the faithfulness gate.

### E7 consolidated metrics (US-055) — deflection / false-resolve / false-escalate

Operating objective: **maximize deflection subject to false-resolve ≤ ceiling**. Answerable = P2; the gated false-resolve rate is the faithfulness leg (P3 rows that auto-resolved, over P3). The retrieval-leg P1a/P1b false-resolves are shown `[monitor-only]` — a zero-tolerance invariant the P1a/P1b gates hard-fail unconditionally, NOT diluting the ceiling-gated rate. Each rate carries its numerator/denominator + per-population breakdown so the false-resolve number is verifiable against the buyer's ceiling (US-059), not folded into an opaque accuracy score. Reported here; the ceiling is enforced in US-059.

| Metric | Rate | n/d | Class | By population |
|---|---|---|---|---|
| deflection | 100% | 4/4 | tunable quality | P2 4/4 |
| false_resolve | 0% | 0/3 | 🔒 safety (pinned) | P1a 0/3 [monitor-only], P1b 0/4 [monitor-only], P3 0/3 |
| false_escalate | 0% | 0/4 | tunable quality | P2 0/4 |

### E7 false-resolve ceiling gate (US-059) — pinned safety invariant

- Buyer ceiling: **5%** (`ESCALATION_FALSE_RESOLVE_CEILING`, US-050)
- Measured false-resolve (faithfulness leg, P3): **0% (0/3)**
- Verdict: **✅ within ceiling**

A measured faithfulness-leg false-resolve rate above the ceiling fails the run (pinned `fail`, never downgraded to a comment — unlike the tunable deflection/false-escalate metrics). The LLM-judged P3 leg is scored only in the weekly sweep, so per-PR (P1a/P1b only) this rate is not measured and the gate is inert (the retrieval-leg P1a/P1b false-resolves are pinned separately); the faithfulness-leg false-resolve carries an accepted up-to-a-week detection latency.

### E7 knob sweep (US-056) — deflection-vs-false-resolve curve + knee

Grid of 18 operating point(s) over τ_sim × N_min × faithfulness-floor (offline 1-5 judge `claude-haiku-4-5`, match_threshold=0.3). Objective: **maximize deflection subject to false-resolve ≤ 5%** (the buyer's ceiling) — not maximize accuracy. LLM-judged → scheduled/weekly, never a per-PR block (US-059).

| τ_sim | N_min | faith≥ | deflection | false-resolve | false-escalate | ≤ ceiling |
|---|---|---|---|---|---|---|
| 0.3 | 1 | 4/5 | 100% | 0% | 0% | ✅ ⭐ knee |
| 0.3 | 1 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.3 | 2 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.3 | 2 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.3 | 3 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.3 | 3 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 1 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 1 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 2 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 2 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 3 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.4 | 3 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 1 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 1 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 2 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 2 | 5/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 3 | 4/5 | 100% | 0% | 0% | ✅ |
| 0.5 | 3 | 5/5 | 100% | 0% | 0% | ✅ |

**Knee:** τ_sim=0.3, N_min=1, faithfulness≥4/5 → deflection 100%, false-resolve 0% (≤ ceiling 5%).
**Recommended US-050 defaults:** `ESCALATION_TAU_SIM=0.3`, `ESCALATION_N_MIN=1`. The offline faithfulness floor (4/5) is guidance only — the runtime `ESCALATION_FAITHFULNESS_CUTOFF` is a different [0,1] scale (US-048/050), tuned separately.

→ /var/folders/9t/k_yy9fqs5vd27rf12jx_rzqh0000gn/T/no-mistakes-evidence/01KVTHACF2PBB4NP9556FTSCDV/snap-weekly-clean.json
- Evidence: Weekly ceiling-breach rendered report (exit 1) (local file: /var/folders/9t/k_yy9fqs5vd27rf12jx_rzqh0000gn/T/no-mistakes-evidence/01KVTHACF2PBB4NP9556FTSCDV/weekly-ceiling-breach.stdout.md)
Evidence: Weekly ceiling-breach pinned-fail ERROR lines

ERROR ...: E7 FALSE-RESOLVE CEILING BREACH — measured faithfulness-leg 100.0% (3/3 P3) exceeds the buyer's ceiling 5.0% (ESCALATION_FALSE_RESOLVE_CEILING, US-050).

ERROR agentic_rag.evals.retrieval.e7: E7 P3 FALSE-RESOLVE(s) (the Risk #3 safety failure — an unanswerable question auto-resolved; the US-055/059 ceiling gate governs this rate): e7-p3-01, e7-p3-02, e7-p3-03
ERROR agentic_rag.evals.retrieval.e7: E7 FALSE-RESOLVE CEILING BREACH — measured faithfulness-leg 100.0% (3/3 P3) exceeds the buyer's ceiling 5.0% (ESCALATION_FALSE_RESOLVE_CEILING, US-050). The deflection pipeline auto-resolved too many unanswerable questions (the Risk #3 safety failure).
- Evidence: Weekly snapshot JSON (docs/escalation-weekly/<DATE>.json artifact) (local file: /var/folders/9t/k_yy9fqs5vd27rf12jx_rzqh0000gn/T/no-mistakes-evidence/01KVTHACF2PBB4NP9556FTSCDV/snap-weekly-ceiling-breach.json)

Pipeline

Updates from git push no-mistakes

✅ **intent** - passed

✅ No issues found.

✅ **Rebase** - passed

✅ No issues found.

⚠️ **Review** - 1 warning
  • ⚠️ .github/workflows/retrieval-eval.yml:231 - The new E7 per-PR tripwire (e7_runner --include-p1b) is a HARD gate but has no execution-error resilience, unlike the E6 gate in the same job. Its deterministic decision is pure arithmetic, but its execution makes ~7 live OpenAI embedding calls (hybrid retrieval embeds each query) plus asyncpg chunk_acl writes for the P1b replay. e7_runner.amain has no retry/backoff and the step has no set +e/continue-on-error, so any transient OpenAI/DB blip exits non-zero and blocks the merge of an innocent PR. The step's own comment (lines 203-211) says "pure arithmetic on cosine scores - NO LLM judge - so a real verdict can't flake," which conflates decision-determinism with execution-reliability - exactly the distinction E6's comment (lines 17, 178-188) carefully draws and mitigates with retry+non-blocking-on-execution-error. Recommend mirroring E6's execution-vs-verdict split here, or at least documenting the accepted flake exposure.
  • ⚠️ evals/retrieval/e7_runner.py:1641 - The pinned safety metric (consolidated false-resolve rate) is computed over a denominator of P1a + P1b + P3 questions, but the only non-deterministic contributor to the numerator in healthy operation is the P3 faithfulness leg (P1a/P1b cleared-gate counts are 0 unless retrieval/RLS is broken). P1a and P1b are deterministically-escalated easy true-negatives, so padding the denominator with them systematically lowers the measured rate and can mask a real P3 (faithfulness-leg) false-resolve below the buyer's ceiling - the masking grows as more P2 questions are authored (each becomes a P1b row in the denominator). With the current 3/4/3 gold one P3 false-resolve = 10% and still trips the 5% ceiling, so this is latent rather than active, but it weakens the safety invariant's sensitivity for any larger gold set. Consider also surfacing/gating the faithfulness-leg false-resolve over the P3 population specifically, so the hard signal isn't diluted by easy escalations. This is the documented/tested intended definition, so flagging for confirmation rather than auto-fix.
  • ℹ️ evals/retrieval/e7_runner.py:2137 - run_e7_sweep computes each grid point's metrics via compute_e7_metrics(p1a, p2, p3) (no P1b), while the top-level weekly metrics and the enforced ceiling verdict (lines 2623/2641) include P1b. The knee's reported false-resolve and feasibility are therefore computed on a smaller (stricter) denominator than the gate that actually enforces the ceiling. Not a safety bug - the sweep is strictly more conservative, so a chosen knee always satisfies the looser enforced gate - but the recommended-knee numbers won't match what the ceiling gate measures at that config, which can confuse interpretation of the weekly snapshot. Worth a note or aligning the two computations.

🔧 Fix: gate E7 false-resolve ceiling on faithfulness-leg (P3) rate only
1 warning still open:

  • ⚠️ evals/retrieval/e7_runner.py:2761 - This commit makes the pinned false-resolve ceiling fed SOLELY by the P3 faithfulness leg, but a structurally-blind P3 leg in the weekly run now silently disarms that gate. In amain, not p3_result.passed is only logged at ERROR (lines 2761-2765) and never sets failed = True - unlike the P1a/P1b/non-disclosure blindness guards, which hard-fail (P1a at 2806-2811). E7P3Result.passed already models the exact positive-control signal (returns False on an empty P3 population, mirroring P1a per its docstring at 880-889), but amain ignores it for the exit code. So if gold drift drops or mislabels all P3 rows in the --include-p3 weekly run, false_resolve.rate becomes None -> assert_false_resolve_ceiling is inert (None is not a breach) -> the weekly run exits 0 (green) with the pinned safety invariant silently unmeasured. The changed docstring on assert_false_resolve_ceiling (~line 1810) asserts 'the per-leg blindness guards ... own that case,' but for P3 the guard is only a soft log, so nothing actually owns it. Before this change the safety number was anchored by P1a's always-present (measured-0%) contribution; now P3 is the only meaningful weekly contributor, so its non-hard-failing blindness is consequential. Recommend translating a requested-but-blind P3 leg (p3_result is not None and not p3_result.passed) into a hard failure, consistent with P1a's positive-control philosophy.

🔧 Fix: hard-fail weekly E7 run on structurally-blind P3 leg
1 warning still open:

  • ⚠️ evals/retrieval/e7_runner.py:2389 - The Round 2 fix hard-fails a structurally-blind (empty) P3 leg, but E7P3Result.passed is purely n_questions &gt; 0 (line 890), so a NON-empty-but-all-mislabeled P3 leg still counts as a pass. Now that the false-resolve ceiling is fed SOLELY by P3, this leaves a second silent-disarm path the new guard does not cover: if config/gold drift makes every P3 row escalate BEFORE the faithfulness gate (e.g. tau_sim raised so P3 retrieval reads weak, or the answerer degrades so all rows escalate at the draft leg — neither trips the P1a gate, whose weak no-context rows still escalate correctly), then false_resolves == [], so compute_e7_metrics reports a MEASURED 0% (denominator = n_questions, numerator = 0), NOT None. The ceiling verdict is therefore breached=False, the P3 guard at line 2389 does not fire (passed=True), and the weekly run exits 0 (green) having NEVER exercised the faithfulness gate the ceiling exists to protect — only a non-paging log.warning for mislabeled rows (line 2916) surfaces it. This is the same silent-disarm class the Round 2 fix closed for empty P3, reached via a different (gold/config drift) path. The guard comment (line 2385) and docs/evals.md ("if gold drift drops or mislabels every P3 row, the P3 positive control fails") overstate coverage: the control catches only the dropped/empty case (rate→None), not present-but-never-reached-the-gate (rate→vacuous 0%). Consider extending the positive control to require ≥1 P3 row to actually reach the faithfulness gate (i.e. escalated_at_faithfulness + false_resolves non-empty, equivalently NOT all P3 rows mislabeled) and fail closed otherwise, mirroring the empty-P3 guard.

🔧 Fix: hard-fail weekly E7 run on all-mislabeled P3 leg
1 warning still open:

  • ⚠️ evals/retrieval/e7_runner.py:1702 - The false-resolve ceiling rate's denominator is the full P3 population (p3.n_questions here; len(self.decisions) in false_resolve_rate at line 882), which includes mislabeled rows that never reached the faithfulness judge. The Round 3 fix added an exercised-based positive control that hard-fails only when ZERO rows exercise the gate (empty or all-mislabeled). But a PARTIALLY-mislabeled P3 leg (≥1 exercised, the rest mislabeled) passes the positive control while the mislabeled rows dilute the rate downward — reintroducing the very dilution the prior rounds were chartered to eliminate, now via gold/config drift inside P3 rather than P1a/P1b. Latent at the current 3-row P3 gold (1 false-resolve = 33%, always breaches 5%), but active as P3 gold grows: e.g. 1 false-resolve over 20 P3 rows with 18 mislabeled reads 1/20 = 5% and clears a 5% ceiling, whereas over the 2 EXERCISED rows it is 1/2 = 50% — a real faithfulness-leg breach masked by gold-defect rows. This contradicts the metric's own documented semantics: compute_e7_metrics's docstring (lines 1645-1646) defines the denominator as 'the cases where a false-resolve can actually occur once a draft clears the retrieval gate' (= exercised), and mislabeled is documented (line 860) as 'a gold-authoring defect, not a pipeline result' — so counting it as a clean handled row in the SAFETY denominator inflates apparent gate quality. Consider making the denominator len(p3.exercised) in BOTH the enforced gate (line 1702 / used at 2857) and the sweep (compute_e7_metrics at line 2209), with false_resolve_rate→None when exercised is empty (consistent with the positive control), and correcting the docstrings/docs that currently claim full drift-proofing.
✅ **Test** - passed

✅ No issues found.

  • python -m evals.retrieval.test_e7_runner (57 groups PASS — incl. US-059 e7_pinned_invariants_failed exit-code + ceiling-gate groups)
  • python -m evals.retrieval.test_e7 (9 groups PASS — shipped escalation_gold.yaml, 10 rows)
  • Real CLI E2E e7_runner --include-p1b clean per-PR run → process exit 0 (ran with no anthropic package installed)
  • Real CLI E2E e7_runner --include-p1b with gold leaked to a no-access viewer → exit 1 (P1b gate-clear + non-disclosure leak, JSON written first)
  • Real CLI E2E e7_runner --include-p1b --include-p2 --include-p3 --sweep healthy weekly → exit 0 (P3 moat holds, knee curve, false-resolve 0% ≤ 5%)
  • Real CLI E2E weekly with P3 auto-resolves → exit 1 (false-resolve 100% > 5% ceiling breach)
  • Ran the weekly workflow's jq issue-body builder against the breach + leak snapshots
  • yaml.safe_load of retrieval-eval.yml and escalation-eval-weekly.yml; confirmed per-PR E7 step has OPENAI_API_KEY but no ANTHROPIC_API_KEY
🔧 **Document** - 1 issue found → auto-fixed ✅
  • ℹ️ .claude/agent/tasks/PRD_phase2_agentic_rag_kit.md:5 - The PRD planning doc added by this change (.claude/agent/tasks/PRD_phase2_agentic_rag_kit.md) asserts ADR-0003 and ADR-0004 are "on-disk ADRs in docs/adr/" (line 5) and points to "docs/adr/0003-escalation-signal-and-deflection-pipeline.md" (line 242), but neither file exists. I did NOT auto-fix this: (1) it is a planning/spec artifact (README labels .claude/ as "not needed to run the app"), a historical 2026-06-16 reconciliation snapshot, not deliverable documentation; (2) the published docs (README, docs/, CONTEXT.md) are internally consistent because file-less "ADR-NNNN" tags are an established convention here - ADR-0006 (model surface) is cited the same way with no on-disk file either; (3) resolving it is a judgment call between authoring ADR-0003/0004 versus rewording the "on-disk" claim, which needs design intent. Flagging so a maintainer can decide.

🔧 Fix: correct PRD on-disk ADR-0003/0004 claims to ruled-not-written
✅ Re-checked - no issues remain.

✅ **Lint** - passed

✅ No issues found.

✅ **Push** - passed

✅ No issues found.

hcho22 added 8 commits June 23, 2026 07:46
…nistic deflection pipeline + E7 eval (US-046-059, ADR-0003)

Implements Epic D of the Phase-2 kit: the escalation & deflection pipeline
and its E7 eval, per ADR-0003.

Runtime (backend/escalation.py):
- US-046: surface pre-fusion raw cosine on retrieval results (RRF overwrites
  similarity, so cosine is plumbed through before any gate).
- US-047: cosine-defined retrieval gate, pure arithmetic on scores
  (strong = top1_cosine >= tau_sim AND n_cleared >= n_min); no LLM/reranker
  dependency.
- US-048: one-call runtime faithfulness gate; an unset/failed judge fails
  CLOSED (escalate), never auto-sends. JUDGE_MODEL is a judge-role selector
  that does not chain through OPENAI_MODEL (docs/model-surface.md).
- US-049: deterministic deflection orchestrator (off-topic/empty-draft/
  unfaithful -> escalate; supported -> answer) with no reason leak on
  escalation.
- US-050: EscalationConfig (three validated global knobs) + a STANDALONE
  false-resolve ceiling that is structurally un-wireable into the per-request
  pipeline.

E7 eval (evals/retrieval/e7*.py, escalation_gold.yaml):
- US-051-057: golden-set schema + P1a/P2/P3 populations and the derived
  P1b viewer-parameterized no-access case (no privileged second pass exists),
  all classified through the real backend gate.
- US-055: consolidated deflection / false-resolve / false-escalate rates;
  false-resolve is the pinned safety metric, others are tunable quality.
- US-056: knob sweep + ceiling-constrained knee selection (memoized so the
  whole grid costs one operating point's worth of LLM calls).
- US-058: deterministic P1b non-disclosure byte-equality assertion.

CI placement (US-059):
- Per-PR deterministic tripwire in retrieval-eval.yml (P1a/P1b gate +
  non-disclosure byte equality, no LLM) that can fail the build.
- New escalation-eval-weekly.yml runs the full LLM-judged P2/P3 + sweep and
  enforces the false-resolve ceiling on the scheduled run only.
@vercel

vercel Bot commented Jun 23, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agentic-rag Ready Ready Preview, Comment Jun 24, 2026 2:02am

# Conflicts:
#	.claude/agent/tasks/prd-phase2-implementation.md
@github-actions

Copy link
Copy Markdown
Contributor

Retrieval eval — PR vs main

n = 50 questions × 3 modes (vector, keyword, hybrid) on a 14-chunk corpus. PR ran in 77.19s; main in 71.16s.

Headline (each cell: PR value, Δ vs main)

Mode recall@5 MRR nDCG@5
vector 0.860 (±0.000) 0.772 (±0.000) 0.779 (±0.000)
keyword 0.110 (±0.000) 0.120 (±0.000) 0.112 (±0.000)
hybrid 0.860 (±0.000) 0.759 (±0.000) 0.769 (±0.000)

Per-category recall@5

Mode single_chunk multi_hop adversarial paraphrase
vector 0.900 (±0.000) 0.933 (±0.000) 0.600 (±0.000) 1.000 (±0.000)
keyword 0.250 (±0.000) 0.033 (±0.000) 0.000 (±0.000) 0.000 (±0.000)
hybrid 0.900 (±0.000) 0.933 (±0.000) 0.600 (±0.000) 1.000 (±0.000)

Comment is updated in place on each push by .github/workflows/retrieval-eval.yml (US-035). Comment-only — never blocks the build.

@hcho22 hcho22 merged commit 9fbf4d7 into main Jun 24, 2026
3 checks passed
@hcho22 hcho22 deleted the feat/escalation-deflection-pipeline branch June 24, 2026 03:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant