feat(evals): E7 P3 mislabel-ratio guard + tighten false-resolve denominator docs by hcho22 · Pull Request #30 · hcho22/Agentic_RAG

hcho22 · 2026-06-24T14:24:50Z

Intent

Resolve issue #26: harden the E7 weekly false-resolve safety gate against a heavily-mislabeled P3 gold set, and tighten the false_resolve_rate / compute_e7_metrics docstrings so the "denominator misread" review finding cannot recur. Non-blocking follow-up to the escalation/deflection pipeline (Epic D, ADR-0003, US-046-059, merged PR #27).

Deliberate decisions a diff-only reviewer would not know:

- The mislabel-ratio guard is a NEW pinned invariant in e7_pinned_invariants_failed. It COMPLEMENTS the existing P3 positive control (E7P3Result.passed), which only fails the run when EVERY P3 row is mislabeled (zero exercised); the new guard catches the PARTIAL case (heavily-but-not-entirely mislabeled), where at least one row exercises the gate so the positive control passes yet the false-resolve ceiling is measured over a diluted sample.
- The guard is gated on E7P3Result.passed (>=1 exercised) ON PURPOSE so the empty / all-mislabeled cases stay owned by the positive control (one clear failure reason each), and it fires only when the mislabeled FRACTION over the FULL presented population STRICTLY exceeds the max (strict greater-than, not >=). The all-mislabeled test asserts the ratio guard stays OFF even at a lax 0.99 max, proving the two guards partition the space.
- Denominator decision (the crux of the issue): false_resolve_rate and the new mislabel_ratio both use the FULL presented P3 population (n_questions), NOT len(exercised). The "should be exercised-only" denominator was REJECTED as a misread (it would contradict the operating-metric meaning and mix per-population denominators in the sum-based consolidated rate). I did NOT change the denominator (explicit non-goal of the issue); instead I tightened the false_resolve_rate and compute_e7_metrics docstrings to state the full-population semantics explicitly so the misread cannot recur, and added the mislabel-ratio guard to handle the dilution risk the original reviewer correctly identified.
- Configurable via E7_P3_MISLABEL_RATIO_MAX env (default 0.5) or the new --p3-mislabel-ratio-max CLI flag; an unparseable or out-of-range value fails CLOSED (ValueError / argparse error) because a misconfigured safety ceiling must never read as "no ceiling".
- New E7P3Result.mislabel_ratio property surfaced in the result JSON (additive; verified no other consumer asserts an exact P3 key set).
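
The partition between the positive control and the ratio guard described above can be sketched as follows. This is a minimal illustration, not the code in e7_runner.py: `P3Leg`, `n_exercised`, and `ratio_guard_fires` are hypothetical stand-ins, and the sketch assumes every presented row is either exercised or mislabeled. Only the behaviors stated above (full-population denominator, None on empty, gating on `passed`, strict greater-than) are taken from the PR.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class P3Leg:
    """Hypothetical stand-in for E7P3Result; only the counts matter here."""

    n_questions: int   # full presented P3 population
    n_mislabeled: int  # rows that escalated before the faithfulness gate

    @property
    def n_exercised(self) -> int:
        # Assumption: a presented row is either exercised or mislabeled.
        return self.n_questions - self.n_mislabeled

    @property
    def passed(self) -> bool:
        # Positive control: at least one row must exercise the faithfulness gate.
        return self.n_exercised >= 1

    @property
    def mislabel_ratio(self) -> Optional[float]:
        # Denominator is the FULL presented population, never n_exercised.
        if self.n_questions == 0:
            return None
        return self.n_mislabeled / self.n_questions


def ratio_guard_fires(leg: P3Leg, ratio_max: float) -> bool:
    """Fire only on the partial case; empty and all-mislabeled legs belong to `passed`."""
    if not leg.passed:
        return False
    ratio = leg.mislabel_ratio
    return ratio is not None and ratio > ratio_max  # strict greater-than


partial = P3Leg(n_questions=3, n_mislabeled=2)
boundary = P3Leg(n_questions=2, n_mislabeled=1)
all_bad = P3Leg(n_questions=2, n_mislabeled=2)

print(ratio_guard_fires(partial, 0.5))   # True: 2/3 exceeds 0.5
print(ratio_guard_fires(boundary, 0.5))  # False: 0.5 does not strictly exceed 0.5
print(ratio_guard_fires(boundary, 0.4))  # True: 0.5 exceeds 0.4
print(ratio_guard_fires(all_bad, 0.99))  # False: owned by the positive control
```

Because the guard returns early when `passed` is false, an all-mislabeled leg can only ever fail for one reason, which is the "one clear failure reason each" property the PR asks for.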

Tests: 5 new groups in test_e7_runner.py (property math + None-on-empty; partial-mislabel-over-max fails with the positive control passing AND the ceiling not breached so the guard is the SOLE cause; strict greater-than boundary + configurability via two thresholds on one leg; all-mislabeled stays owned by the positive control; env resolver default/override + fail-closed). Full suite 64/64 green; flake8 and mypy clean on the two changed files (remaining repo-wide mypy notes - missing yaml stubs, ragas double-module - are pre-existing infra, not introduced here). Branch fm/issue26-e7-mislabel-guard is off the latest origin/main (PR #27 already merged, so the E7 code is present). Closes #26.

What Changed

8a1db1a no-mistakes(document): document E7 P3 mislabel-ratio guard in evals.md
8d972a0 feat(evals): E7 P3 mislabel-ratio guard + tighten false-resolve denominator docs (#26)

Risk Assessment

✅ Low: Tightly-scoped, additive safety-guard with accurate docstrings, fail-closed config on every path, correct partitioning with the existing positive control, no broken consumers, and thorough new tests.

Testing

Ran the full E7 runner suite (64/64 green) on both Anaconda Python 3.9 and a freshly-built CI-matching Python 3.11 venv, then drove the real production exit-code decision and P3 scoring leg over a realistic 2-of-3 mislabeled P3 gold set to show the guard failing the weekly run (exit 1) with its operator log while the positive control passes and the ceiling is not breached - the precise partial-dilution gap the issue closes. Also verified the new CLI flag in --help and the fail-closed behavior of both the flag and the env var, plus the additive mislabel_ratio JSON field. This is a backend/CLI eval tool change with no rendered UI surface, so evidence is CLI transcripts and the result JSON rather than screenshots. All checks passed and the worktree was left clean (temp venv and caches removed).

Evidence: End-to-end scenario transcript: mislabel-ratio guard fails the weekly run on a 2/3-mislabeled P3 gold (guard is sole cause)

E7 P3 MISLABEL-RATIO BREACH — 2/3 P3 row(s) (67%) escalated before the faithfulness gate (mislabeled), exceeding the max 50% (E7_P3_MISLABEL_RATIO_MAX, issue #26). The false-resolve ceiling is then measured over only 1 exercised row(s) — a diluted sample that can mask a bad faithfulness gate. Re-author the drifted P3 gold so retrieval reads strong (or check τ_sim / a degraded answerer): e7-p3-warranty-a(retrieval), e7-p3-returns-b(retrieval)

==============================================================================
Drifted weekly P3 gold: 2 of 3 rows mislabeled (escalate before the gate)
==============================================================================
  presented P3 population (n_questions) : 3
  exercised the faithfulness gate       : 1 -> ['e7-p3-shipping-c']
  mislabeled (escalated before the gate): 2 -> ['e7-p3-warranty-a', 'e7-p3-returns-b']
  mislabel_ratio (mislabeled / FULL presented pop, NOT exercised-only)
     = 2/3 = 0.6667  (the crux: denominator is 3, not 1)
  false_resolve_rate (same FULL denominator) = 0/3 = 0.0000

  Existing controls on this leg (both GREEN, so neither catches the dilution):
    positive control  E7P3Result.passed = True  (>=1 row exercised, so it does NOT fire)
    false-resolve ceiling breached       = False  (measured 0% over the exercised sample, <= 5% ceiling)

==============================================================================
Weekly exit-code decision at the DEFAULT ceiling (E7_P3_MISLABEL_RATIO_MAX=0.5)
==============================================================================
  [operator-facing log emitted by e7_pinned_invariants_failed:]


  -> weekly run FAILED = True   (exit code would be 1)
     The mislabel-ratio guard is the SOLE reason: positive control passed AND ceiling not breached.

==============================================================================
Same leg, LAXER ceiling (--p3-mislabel-ratio-max 0.7): configurable, passes
==============================================================================
  -> weekly run FAILED = False   (exit code would be 0); 0.67 ratio does not exceed a 0.7 ceiling.

==============================================================================
mislabel_ratio is surfaced additively in the result JSON (weekly artifact)
==============================================================================
{
  "population": "P3",
  "n_questions": 3,
  "n_mislabeled": 2,
  "mislabel_ratio": 0.6666666666666666,
  "n_false_resolve": 0,
  "false_resolve_rate": 0.0,
  "passed": true
}
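
The arithmetic in the transcript above can be reproduced by hand. The sketch below uses only the counts shown there (3 presented, 2 mislabeled, 0 false-resolves); the helper name `rates` is hypothetical, it assumes at least one exercised row, and the exercised-only figure is computed solely to show what the rejected reading would have reported.

```python
from fractions import Fraction


def rates(n_questions: int, n_mislabeled: int, n_false_resolve: int) -> dict:
    """Hypothetical helper: compare the two denominator readings on one P3 leg."""
    n_exercised = n_questions - n_mislabeled  # assumed >= 1 in this sketch
    return {
        "mislabel_ratio": Fraction(n_mislabeled, n_questions),
        "false_resolve_rate_full": Fraction(n_false_resolve, n_questions),
        # The rejected reading, shown for contrast only.
        "false_resolve_rate_exercised_only": Fraction(n_false_resolve, n_exercised),
    }


# Counts from the transcript: 3 presented, 2 mislabeled, 0 false-resolves.
observed = rates(n_questions=3, n_mislabeled=2, n_false_resolve=0)
print(float(observed["mislabel_ratio"]))           # 0.6666666666666666
print(float(observed["false_resolve_rate_full"]))  # 0.0

# Hypothetical variation: the single exercised row false-resolves.
worst = rates(n_questions=3, n_mislabeled=2, n_false_resolve=1)
print(worst["false_resolve_rate_full"])            # 1/3
print(worst["false_resolve_rate_exercised_only"])  # 1
```

In the hypothetical variation the full-population rate reads 1/3 while the only row that reached the gate failed it. Every further mislabeled row would push that figure lower, which is the dilution the ratio guard is meant to bound without changing the denominator.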

Evidence: Operator-facing log emitted by the guard

E7 P3 MISLABEL-RATIO BREACH — 2/3 P3 row(s) (67%) escalated before the faithfulness gate (mislabeled), exceeding the max 50% (E7_P3_MISLABEL_RATIO_MAX, issue #26). The false-resolve ceiling is then measured over only 1 exercised row(s) — a diluted sample that can mask a bad faithfulness gate. Re-author the drifted P3 gold so retrieval reads strong (or check τ_sim / a degraded answerer): e7-p3-warranty-a(retrieval), e7-p3-returns-b(retrieval)
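
As a rough illustration of how the counts in that log line relate to the result fields, the sketch below renders an abridged version of the message. `render_mislabel_breach` and its argument shapes are hypothetical; the real formatter in e7_runner.py is not shown in this PR page, and the wording here is shortened on purpose.

```python
from typing import Sequence, Tuple


def render_mislabel_breach(
    mislabeled: Sequence[Tuple[str, str]],  # (question_id, leg that escalated)
    n_questions: int,
    ratio_max: float,
) -> str:
    """Hypothetical, abridged rendering of the operator-facing breach line."""
    n_bad = len(mislabeled)
    ratio = n_bad / n_questions  # full presented population as the denominator
    rows = ", ".join(f"{qid}({leg})" for qid, leg in mislabeled)
    return (
        f"E7 P3 MISLABEL-RATIO BREACH: {n_bad}/{n_questions} P3 row(s) "
        f"({ratio:.0%}) escalated before the faithfulness gate, "
        f"exceeding the max {ratio_max:.0%}: {rows}"
    )


print(render_mislabel_breach(
    [("e7-p3-warranty-a", "retrieval"), ("e7-p3-returns-b", "retrieval")],
    n_questions=3,
    ratio_max=0.5,
))
```

Listing the offending row ids with the leg that escalated them keeps the failure attributable: the operator can see which gold rows to re-author without opening the result JSON.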

Evidence: Full test_e7_runner suite output (64/64) on CI-matching Python 3.11

E7 P3 scored 0 should-escalate questions — the false-resolve safety ceiling is fed solely by the P3 faithfulness leg, so a structurally blind P3 population leaves it unmeasured. Failing the run closed; check that --questions carries P3 rows.
E7 P3 scored 2 should-escalate row(s) but NONE exercised the faithfulness gate — every row escalated at the retrieval/draft leg (mislabeled), so the false-resolve ceiling reads a vacuous measured 0% (never a breach) and the safety invariant is unmeasured. Failing the run closed; re-author the P3 gold so retrieval reads strong: m-p3-weak-a(retrieval), m-p3-weak-b(retrieval)
E7 FALSE-RESOLVE CEILING BREACH — measured faithfulness-leg 50.0% (1/2 P3) exceeds the buyer's ceiling 5.0% (ESCALATION_FALSE_RESOLVE_CEILING, US-050). The deflection pipeline auto-resolved too many unanswerable questions (the Risk #3 safety failure).
E7 P1a FALSE-RESOLVE RISK — 1 no-context row(s) cleared the retrieval gate (would draft):
  m-leak top1_cosine=0.7000 n_cleared=2 (strong)
E7 P3 MISLABEL-RATIO BREACH — 2/3 P3 row(s) (67%) escalated before the faithfulness gate (mislabeled), exceeding the max 50% (E7_P3_MISLABEL_RATIO_MAX, issue #26). The false-resolve ceiling is then measured over only 1 exercised row(s) — a diluted sample that can mask a bad faithfulness gate. Re-author the drifted P3 gold so retrieval reads strong (or check τ_sim / a degraded answerer): e7-p3-pm-a(retrieval), e7-p3-pm-b(retrieval)
E7 P3 MISLABEL-RATIO BREACH — 1/2 P3 row(s) (50%) escalated before the faithfulness gate (mislabeled), exceeding the max 40% (E7_P3_MISLABEL_RATIO_MAX, issue #26). The false-resolve ceiling is then measured over only 1 exercised row(s) — a diluted sample that can mask a bad faithfulness gate. Re-author the drifted P3 gold so retrieval reads strong (or check τ_sim / a degraded answerer): e7-p3-bd-a(retrieval)
E7 P3 scored 2 should-escalate row(s) but NONE exercised the faithfulness gate — every row escalated at the retrieval/draft leg (mislabeled), so the false-resolve ceiling reads a vacuous measured 0% (never a breach) and the safety invariant is unmeasured. Failing the run closed; re-author the P3 gold so retrieval reads strong: e7-p3-am-a(retrieval), e7-p3-am-b(retrieval)
ok: P1a rows escalate at the retrieval gate, 0 draft + 0 judge, scores recorded
ok: a near-miss escalates and records top1_cosine/n_cleared for visibility
ok: a P1a row that clears the gate is flagged as a false-resolve risk and fails the run
ok: the gate thresholds cosine, not the RRF similarity artifact
ok: an empty P1a population is not a pass (structurally-blind guard)
ok: non-P1a rows are ignored by the P1a leg
ok: the P1a scoring path is deterministic
ok: result/decision to_dict carry the audited fields
ok: a faithful P2 row auto-resolves, counts as deflection, 1 draft + 1 judge
ok: the P2 faithfulness leg is the offline judge (sees the reference), not the runtime gate
ok: a weak-retrieval P2 is a false-escalate at the retrieval gate, 0 draft/0 judge
ok: a P2 draft scored below the faithfulness floor is a false-escalate at the faithfulness leg
ok: an empty P2 draft is a false-escalate at the draft leg, no judge call
ok: the faithfulness floor is inclusive (>=)
ok: an empty P2 population is not a pass (structurally-blind deflection guard)
ok: P2 result/decision to_dict carry the audited fields
ok: a P3 with an unfaithful draft escalates at the faithfulness gate (the moat), not a false-resolve
ok: a P3 that auto-resolves is caught as a false-resolve (the Risk #3 safety failure)
ok: a P3 escalating at the retrieval gate is mislabeled (never exercised the faithfulness gate), not a false-resolve
ok: an empty P3 draft escalates at the draft leg and is mislabeled (no judge call)
ok: the P3 faithfulness leg is the offline judge and receives the should-escalate reference
ok: the faithfulness floor is inclusive and flips the P3 verdict vs P2
ok: an empty P3 population is not a pass (structurally-blind false-resolve guard)
ok: an all-mislabeled P3 population is not a pass (the faithfulness gate was never exercised)
ok: P3 result/decision to_dict carry the audited fields
ok: the three consolidated rates match the hand-computed numerator/denominator
ok: the gated false-resolve is the P3 faithfulness leg; the P1a leak is monitor-only + hard-fails separately
ok: rates over empty rate-bearing populations are None (blind); P1a is monitor-only
ok: the consolidated-metrics to_dict exposes n/d, per-population breakdown, and the safety flag
ok: a false-resolve rate above the ceiling is a breach that fails the run
ok: rate at-or-below the ceiling passes (equality is feasible)
ok: a blind (None) false-resolve rate is not a breach
ok: the ceiling gate is inert on a per-PR (P1a-only, no P3) run
ok: the ceiling verdict to_dict + render carry the audited fields
ok: a structurally-blind P3 faithfulness leg fails the run closed (US-059 safety guard)
ok: an all-mislabeled P3 faithfulness leg fails the run closed (US-059 safety guard)
ok: a clean weekly run does not fail on the P3 positive-control guard
ok: a per-PR run (no P3 leg) does not trip the P3 positive-control guard
ok: a measured false-resolve ceiling breach fails the run
ok: a P1a retrieval-gate clear still hard-fails through the extracted decision
ok: mislabel_ratio is mislabeled/full-population (None when empty)
ok: a partial-mislabel P3 leg over the ratio max fails the run (issue #26 guard)
ok: the mislabel-ratio guard is strict-> and configurable (0.5 passes 0.5, fails 0.4)
ok: all-mislabeled stays owned by the positive control, not the ratio guard
ok: get_p3_mislabel_ratio_max honors env + default and fails closed on bad input
ok: the sweep emits the curve and picks the highest-deflection knee under the ceiling
ok: the sweep memoizes per question — an 8-point grid costs the same LLM as one point
ok: an unsatisfiable ceiling is reported explicitly, never downgraded to least-bad
ok: a feasible-but-deflection-blind sweep is reported explicitly
ok: the sweep to_dict carries the curve, per-point metrics, knee, and recommended config
ok: a P2 question replayed under a no-access viewer returns no gold (clean, no leak)
ok: a strong gate off legitimately-granted non-gold chunks is not a P1b leak
ok: a P1b row that returns a gold chunk is flagged as a no-access leak and fails the run
ok: a gold chunk returned below tau_sim (gate weak) is still caught as a P1b leak
ok: P1b has no privileged second pass — one no-access retrieval per row, full dict passed
ok: P1b replays only the P2 population (P1a/P3 ignored)
ok: an empty P1b population is not a pass (structurally-blind guard)
ok: P1b result/decision to_dict carry the audited fields
ok: a P1b leak is monitor-only (not gated) and hard-fails the P1b invariant unconditionally
ok: a P1b escalation shows the same customer bytes as P1a despite a different internal reason
ok: an injected access-denied reason in the P1b output fails the assertion (it really pins the invariant)
ok: US-058 ignores a legitimate non-gold draft (decoupled from the US-057 identity leak check)
ok: an empty P1b non-disclosure population is not a pass (structurally-blind guard)
ok: the non-disclosure to_dict + render carry the audited fields

PASS: 64 E7 P1a+P1b+P2+P3+metrics+sweep+ceiling runner (US-052–059) test groups

Evidence: CLI surface: --p3-mislabel-ratio-max in --help + fail-closed (flag exit 2, env exit 1)

### Issue #26 CLI surface (python -m evals.retrieval.e7_runner), Python 3.11

--- New flag in --help ---
                    [--p3-mislabel-ratio-max P3_MISLABEL_RATIO_MAX]

US-052/053/054/055/056 E7 escalation eval. Always runs the deterministic P1a
leg (genuinely-no-context → escalate at the retrieval gate, no draft/judge)
and emits the consolidated US-055 metrics (deflection / false-resolve / false-
escalate). With --include-p2 / --include-p3, additionally runs the LLM-judged
P2 leg (answerable+faithful → auto-resolve) and/or P3 leg (should-escalate →
escalate at the faithfulness gate; the moat), scored by the offline cross-
family Claude judge. With --sweep, grids the knobs and reports the deflection-
vs-false-resolve curve + the knee under the ceiling (US-056).

options:
--
  --p3-mislabel-ratio-max P3_MISLABEL_RATIO_MAX
                        Issue #26: max fraction of P3 rows allowed to be
                        `mislabeled` (escalated before the faithfulness gate)
                        before the weekly run fails. Guards against a heavily-
                        mislabeled P3 gold quietly diluting the false-resolve
                        sample (complements the positive control, which only
                        fails on a fully-mislabeled leg). Default:
                        E7_P3_MISLABEL_RATIO_MAX via
                        get_p3_mislabel_ratio_max() (0.5).

--- Fail-closed: a misconfigured safety ceiling must NEVER read as 'no ceiling' ---

$ e7_runner --p3-mislabel-ratio-max 1.5
e7_runner.py: error: --p3-mislabel-ratio-max must be in [0,1]
   exit=2 (argparse error)

$ e7_runner --p3-mislabel-ratio-max abc
e7_runner.py: error: argument --p3-mislabel-ratio-max: invalid float value: 'abc'
   exit=2 (argparse error)

$ E7_P3_MISLABEL_RATIO_MAX=abc e7_runner
ValueError: E7_P3_MISLABEL_RATIO_MAX must be a float in [0,1], got 'abc'
   exit=1 (fail-closed ValueError)

$ E7_P3_MISLABEL_RATIO_MAX=1.2 e7_runner
ValueError: E7_P3_MISLABEL_RATIO_MAX must be in [0,1], got 1.2
   exit=1 (fail-closed ValueError)
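
The fail-closed behavior in these transcripts can be approximated with the standard library alone. The sketch below is an assumption about the shape of the logic, not the runner's code: `resolve_ratio_max_from_env`, `build_parser`, and `parse_ratio_max` are hypothetical names (the PR's own resolver is get_p3_mislabel_ratio_max), and the precedence of the flag over the env var is inferred from the --help text.

```python
import argparse
import os

ENV_NAME = "E7_P3_MISLABEL_RATIO_MAX"
DEFAULT_MAX = 0.5


def resolve_ratio_max_from_env(environ=None) -> float:
    """Return the configured max, raising ValueError instead of falling back."""
    env = os.environ if environ is None else environ
    raw = env.get(ENV_NAME)
    if raw is None:
        return DEFAULT_MAX
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{ENV_NAME} must be a float in [0,1], got {raw!r}") from None
    if not 0.0 <= value <= 1.0:  # also rejects NaN, whose comparisons are all False
        raise ValueError(f"{ENV_NAME} must be in [0,1], got {value}")
    return value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="e7_runner.py")
    # type=float makes argparse reject 'abc' with exit status 2 on its own.
    parser.add_argument("--p3-mislabel-ratio-max", type=float, default=None)
    return parser


def parse_ratio_max(argv, environ=None) -> float:
    """Flag wins when given; otherwise defer to the env resolver (assumed precedence)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.p3_mislabel_ratio_max is None:
        return resolve_ratio_max_from_env(environ)
    if not 0.0 <= args.p3_mislabel_ratio_max <= 1.0:
        parser.error("--p3-mislabel-ratio-max must be in [0,1]")  # exits with status 2
    return args.p3_mislabel_ratio_max


print(parse_ratio_max([], environ={}))                                  # 0.5
print(parse_ratio_max(["--p3-mislabel-ratio-max", "0.7"], environ={}))  # 0.7
```

Neither path has a "fall back to the default on bad input" branch. That is the point of failing closed: a typo in the ceiling stops the run instead of silently substituting a value nobody chose.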

Evidence: Scenario driver source (reuses real production functions + test fixtures)

"""Issue #26 end-to-end scenario driver (evidence, not a committed test).

Drives the REAL production exit-code decision (`e7_pinned_invariants_failed`) and
the REAL P3 scoring leg (`run_e7_p3` via the test module's fakes — no DB / no LLM /
no network) over a realistic HEAVILY-but-not-entirely mislabeled P3 gold set, to
show the issue-#26 guard the way the weekly escalation eval operator experiences it:
the run's exit code and the operator-facing log line.

Run: python scenario_mislabel_guard.py
"""
from __future__ import annotations

import json
import sys
from pathlib import Path

_ROOT = Path("/Users/hcho/.no-mistakes/worktrees/3074c1251a17/01KVWZ0G95R7X797V7TAVJENX5")
sys.path.insert(0, str(_ROOT))
sys.path.insert(0, str(_ROOT / "backend"))
sys.path.insert(0, str(_ROOT / "evals" / "retrieval"))

# Reuse the REAL test fixtures (fake retriever/answerer/judge — deterministic, no IO).
from evals.retrieval.test_e7_runner import (  # noqa: E402
    STRONG,
    WEAK,
    _clean_p1a,
    _p3,
    _run_p3,
    _verdict_over,
)
from evals.retrieval.e7_runner import e7_pinned_invariants_failed  # noqa: E402


def banner(s: str) -> None:
    print("\n" + "=" * 78 + f"\n{s}\n" + "=" * 78)


# --- A realistic drifted weekly P3 gold: 2 of 3 should-escalate rows have drifted
#     so their retrieval now reads WEAK (escalate at the retrieval gate = mislabeled);
#     only 1 row still exercises the faithfulness gate. --------------------------------
p1a = _clean_p1a()  # clean per-PR/deterministic leg (green)
weak_a = _p3("e7-p3-warranty-a", "q-warranty-a")   # drifted -> weak retrieval
weak_b = _p3("e7-p3-returns-b", "q-returns-b")      # drifted -> weak retrieval
strong = _p3("e7-p3-shipping-c", "q-shipping-c")    # still strong -> exercises the gate
p3, *_ = _run_p3(
    [weak_a, weak_b, strong],
    {weak_a["question"]: WEAK, weak_b["question"]: WEAK, strong["question"]: STRONG},
    # the one exercised row escalates correctly at the faithfulness gate (score < floor):
    scores_by_question={strong["question"]: {"faithfulness": 2, "helpfulness": 2}},
)

banner("Drifted weekly P3 gold: 2 of 3 rows mislabeled (escalate before the gate)")
print(f"  presented P3 population (n_questions) : {p3.n_questions}")
print(f"  exercised the faithfulness gate       : {len(p3.exercised)} "
      f"-> {[d.question_id for d in p3.exercised]}")
print(f"  mislabeled (escalated before the gate): {len(p3.mislabeled)} "
      f"-> {[d.question_id for d in p3.mislabeled]}")
print("  mislabel_ratio (mislabeled / FULL presented pop, NOT exercised-only)")
print(f"     = {len(p3.mislabeled)}/{p3.n_questions} = {p3.mislabel_ratio:.4f}  "
      f"(the crux: denominator is {p3.n_questions}, not {len(p3.exercised)})")
print(f"  false_resolve_rate (same FULL denominator) = "
      f"{len(p3.false_resolves)}/{p3.n_questions} = {p3.false_resolve_rate:.4f}")

# The two existing safety controls both read GREEN on this leg -> without the new
# guard the run would exit 0 over a diluted 1-row sample.
print("\n  Existing controls on this leg (both GREEN, so neither catches the dilution):")
print(f"    positive control  E7P3Result.passed = {p3.passed}  "
      f"(>=1 row exercised, so it does NOT fire)")
verdict = _verdict_over(p1a, p3, 0.05)
print(f"    false-resolve ceiling breached       = {verdict.breached}  "
      f"(measured 0% over the exercised sample, <= 5% ceiling)")

banner("Weekly exit-code decision at the DEFAULT ceiling (E7_P3_MISLABEL_RATIO_MAX=0.5)")
print("  [operator-facing log emitted by e7_pinned_invariants_failed:]\n")
failed_default = e7_pinned_invariants_failed(
    p1a_result=p1a, p1b_result=None, non_disclosure=None,
    p3_result=p3, ceiling_verdict=verdict, p3_mislabel_ratio_max=0.5,
)
print(f"\n  -> weekly run FAILED = {failed_default}   "
      f"(exit code would be {1 if failed_default else 0})")
print("     The mislabel-ratio guard is the SOLE reason: positive control passed AND "
      "ceiling not breached.")

banner("Same leg, LAXER ceiling (--p3-mislabel-ratio-max 0.7): configurable, passes")
failed_lax = e7_pinned_invariants_failed(
    p1a_result=p1a, p1b_result=None, non_disclosure=None,
    p3_result=p3, ceiling_verdict=verdict, p3_mislabel_ratio_max=0.7,
)
print(f"  -> weekly run FAILED = {failed_lax}   (exit code would be "
      f"{1 if failed_lax else 0}); 0.67 ratio does not exceed a 0.7 ceiling.")

banner("mislabel_ratio is surfaced additively in the result JSON (weekly artifact)")
d = p3.to_dict()
print(json.dumps(
    {k: d[k] for k in
     ("population", "n_questions", "n_mislabeled", "mislabel_ratio",
      "n_false_resolve", "false_resolve_rate", "passed")},
    indent=2,
))

# Mirror the runner's exit-code contract for the transcript.
sys.exit(1 if failed_default else 0)

Pipeline

Updates from git push no-mistakes

✅ **Intent** - passed

✅ No issues found.

✅ **Rebase** - passed

✅ No issues found.

⚠️ **Review** - 1 info

ℹ️ evals/retrieval/e7_runner.py:2441 - e7_pinned_invariants_failed defaults p3_mislabel_ratio_max to the hardcoded DEFAULT_P3_MISLABEL_RATIO_MAX (0.5), not to get_p3_mislabel_ratio_max(). The E7_P3_MISLABEL_RATIO_MAX env override is therefore only honored when a caller explicitly resolves it, which today is only amain. No bug now (amain is the sole production caller and this mirrors the existing faithfulness_judge_min pattern), but a future second entry point that forgets to thread the resolver would silently fall back to 0.5 for a safety ceiling. Informational only.
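
The risk in this note comes from a default that is bound to a constant at definition time. One possible mitigation, sketched below under stated assumptions, is a None sentinel that resolves the env var at call time. This is not what the PR does (it deliberately mirrors the existing faithfulness_judge_min pattern); `get_ratio_max`, `guard_with_constant_default`, and `guard_with_resolved_default` are hypothetical names, and the resolver is a simplified stand-in for get_p3_mislabel_ratio_max.

```python
import os
from typing import Mapping, Optional

DEFAULT_P3_MISLABEL_RATIO_MAX = 0.5


def get_ratio_max(environ: Optional[Mapping[str, str]] = None) -> float:
    """Simplified, hypothetical resolver: env override, else default; fails closed."""
    env = os.environ if environ is None else environ
    raw = env.get("E7_P3_MISLABEL_RATIO_MAX")
    if raw is None:
        return DEFAULT_P3_MISLABEL_RATIO_MAX
    value = float(raw)  # raises ValueError on unparseable input
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"E7_P3_MISLABEL_RATIO_MAX must be in [0,1], got {value}")
    return value


def guard_with_constant_default(
    ratio: float, ratio_max: float = DEFAULT_P3_MISLABEL_RATIO_MAX
) -> bool:
    # Pattern described in the note: the env var is ignored unless the caller threads it.
    return ratio > ratio_max


def guard_with_resolved_default(
    ratio: float,
    ratio_max: Optional[float] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> bool:
    # Alternative: None means "resolve now", so a forgetful caller still honors the env.
    if ratio_max is None:
        ratio_max = get_ratio_max(environ)
    return ratio > ratio_max


env = {"E7_P3_MISLABEL_RATIO_MAX": "0.4"}
print(guard_with_constant_default(0.5))               # False: falls back to 0.5
print(guard_with_resolved_default(0.5, environ=env))  # True: honors the 0.4 override
```

The trade-off is that the sentinel variant reads the environment inside the decision function, which makes it less pure and slightly harder to test than threading an explicit value from a single entry point.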

✅ **Test** - passed

✅ No issues found.

python -m evals.retrieval.test_e7_runner full suite — 64/64 green on a CI-matching Python 3.11 venv (httpx/openai/pydantic/pyyaml) and on Anaconda Python 3.9
New issue-#26 groups: test_p3_mislabel_ratio_property, test_exit_p3_partial_mislabel_over_max_fails, test_exit_p3_mislabel_ratio_boundary_and_configurable, test_exit_p3_all_mislabeled_owned_by_positive_control_not_ratio_guard, test_get_p3_mislabel_ratio_max_env
Scenario driver over real run_e7_p3 + e7_pinned_invariants_failed on a 2/3-mislabeled P3 gold: guard fires -> weekly run exit 1, positive control passed=True, ceiling breached=False (guard is sole cause), mislabel_ratio=2/3 over full population
CLI python -m evals.retrieval.e7_runner --help shows --p3-mislabel-ratio-max
Fail-closed: --p3-mislabel-ratio-max 1.5 and abc -> argparse error exit 2; E7_P3_MISLABEL_RATIO_MAX=abc and =1.2 -> ValueError exit 1
E7P3Result.to_dict() exposes mislabel_ratio (verified additive in result JSON)

✅ **Document** - passed

✅ No issues found.

✅ **Lint** - passed

✅ No issues found.

✅ **Push** - passed

✅ No issues found.

…inator docs (#26)

Harden the E7 weekly false-resolve safety gate against a heavily-mislabeled P3 gold set, and tighten the false_resolve_rate / compute_e7_metrics docstrings so the "denominator misread" review finding cannot recur.

Mislabel-ratio guard (new pinned invariant in e7_pinned_invariants_failed):
- The existing P3 positive control (E7P3Result.passed) only fails the run when EVERY P3 row is mislabeled (zero rows exercise the faithfulness gate). A leg that is heavily but not entirely mislabeled still exercises >=1 row, passes the positive control, yet measures the false-resolve ceiling over a shrunken sample that can quietly mask a bad faithfulness gate as the P3 gold grows.
- The guard fails the run when n_mislabeled / n_questions (full presented population, matching the false-resolve rate's denominator) STRICTLY exceeds a configurable max. Gated on `passed` so the empty / all-mislabeled cases stay owned by the positive control (one clear failure reason each). Inert per-PR (no P3 leg).
- Configurable via E7_P3_MISLABEL_RATIO_MAX (default 0.5) or the new --p3-mislabel-ratio-max flag; an unparseable/out-of-range value fails closed.
- New E7P3Result.mislabel_ratio property, surfaced in the result JSON for attributability.

Docstring tightening (the rejected "exercised-only denominator" misread): state explicitly in false_resolve_rate and compute_e7_metrics that the denominator is the full presented unanswerable population (mislabeled rows included, numerator excluded) and is deliberately NOT len(exercised) - dilution is handled by the mislabel-ratio guard, not by changing the denominator.

Non-goal (per the issue): the false-resolve denominator is NOT changed to len(exercised).

Tests: 5 new groups (property math + None-on-empty, partial-mislabel-over-max fails with positive-control-passes/ceiling-not-breached so the guard is the sole cause, strict-> boundary + configurability, all-mislabeled stays owned by the positive control, env resolver honors default/override and fails closed). Full suite 64/64 green; flake8 + mypy clean on the changed files. Closes #26

vercel · 2026-06-24T14:24:57Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agentic-rag	Ready	Preview, Comment	Jun 24, 2026 2:24pm

github-actions · 2026-06-24T14:32:26Z

Retrieval eval — PR vs `main`

n = 50 questions × 3 modes (vector, keyword, hybrid) on a 14-chunk corpus. PR ran in 119.34s; main in 84.33s.

Headline (each cell: PR value, Δ vs `main`)

Mode	recall@5	MRR	nDCG@5
vector	0.860 (±0.000)	0.772 (±0.000)	0.779 (±0.000)
keyword	0.110 (±0.000)	0.120 (±0.000)	0.112 (±0.000)
hybrid	0.860 (±0.000)	0.759 (±0.000)	0.769 (±0.000)

Per-category recall@5

Mode	single_chunk	multi_hop	adversarial	paraphrase
vector	0.900 (±0.000)	0.933 (±0.000)	0.600 (±0.000)	1.000 (±0.000)
keyword	0.250 (±0.000)	0.033 (±0.000)	0.000 (±0.000)	0.000 (±0.000)
hybrid	0.900 (±0.000)	0.933 (±0.000)	0.600 (±0.000)	1.000 (±0.000)

Comment is updated in place on each push by .github/workflows/retrieval-eval.yml (US-035). Comment-only; never blocks the build.

hcho22 added 2 commits June 24, 2026 07:02

no-mistakes(document): document E7 P3 mislabel-ratio guard in evals.md

8a1db1a

hcho22 changed the title ~~chore: no-mistakes(document): document E7 P3 mislabel-ratio guard in evals.md~~ feat(evals): E7 P3 mislabel-ratio guard + tighten false-resolve denominator docs Jun 24, 2026

hcho22 merged commit e43d76b into main Jun 24, 2026
3 checks passed

hcho22 deleted the fm/issue26-e7-mislabel-guard branch June 24, 2026 17:18

hcho22 merged 2 commits into main from fm/issue26-e7-mislabel-guard
