feat(evals): E7 P3 mislabel-ratio guard + tighten false-resolve denominator docs #30
Merged
…inator docs (#26)

Harden the E7 weekly false-resolve safety gate against a heavily-mislabeled P3 gold set, and tighten the false_resolve_rate / compute_e7_metrics docstrings so the "denominator misread" review finding cannot recur.

Mislabel-ratio guard (new pinned invariant in e7_pinned_invariants_failed):
- The existing P3 positive control (E7P3Result.passed) only fails the run when EVERY P3 row is mislabeled (zero rows exercise the faithfulness gate). A leg that is heavily but not entirely mislabeled still exercises >=1 row, passes the positive control, yet measures the false-resolve ceiling over a shrunken sample that can quietly mask a bad faithfulness gate as the P3 gold grows.
- The guard fails the run when n_mislabeled / n_questions (full presented population, matching the false-resolve rate's denominator) STRICTLY exceeds a configurable max. Gated on `passed` so the empty / all-mislabeled cases stay owned by the positive control (one clear failure reason each). Inert per-PR (no P3 leg).
- Configurable via E7_P3_MISLABEL_RATIO_MAX (default 0.5) or the new --p3-mislabel-ratio-max flag; an unparseable/out-of-range value fails closed.
- New E7P3Result.mislabel_ratio property, surfaced in the result JSON for attributability.

Docstring tightening (the rejected "exercised-only denominator" misread): state explicitly in false_resolve_rate and compute_e7_metrics that the denominator is the full presented unanswerable population (mislabeled rows included, numerator excluded) and is deliberately NOT len(exercised) - dilution is handled by the mislabel-ratio guard, not by changing the denominator.

Non-goal (per the issue): the false-resolve denominator is NOT changed to len(exercised).

Tests: 5 new groups (property math + None-on-empty, partial-mislabel-over-max fails with positive-control-passes/ceiling-not-breached so the guard is the sole cause, strict-> boundary + configurability, all-mislabeled stays owned by the positive control, env resolver honors default/override and fails closed). Full suite 64/64 green; flake8 + mypy clean on the changed files.

Closes #26
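The guard described in the commit message can be sketched as follows. This is a minimal illustration, not the code in e7_runner.py: `P3Summary` and `mislabel_guard_failed` are hypothetical names, and the definition of `passed` (at least one row exercised the gate) is an assumption inferred from the description of the positive control.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class P3Summary:
    """Hypothetical stand-in for E7P3Result; only the fields the guard needs."""

    n_questions: int   # full presented P3 population
    n_mislabeled: int  # rows that never exercised the faithfulness gate

    @property
    def passed(self) -> bool:
        # Assumed positive control: fails when the leg is empty or when
        # every row is mislabeled (zero rows exercised the gate).
        return self.n_questions > 0 and self.n_mislabeled < self.n_questions

    @property
    def mislabel_ratio(self) -> Optional[float]:
        # None on an empty leg, so callers cannot mistake "no data" for 0.0.
        if self.n_questions == 0:
            return None
        return self.n_mislabeled / self.n_questions


def mislabel_guard_failed(p3: P3Summary, ratio_max: float = 0.5) -> bool:
    """Return True when the mislabel-ratio guard alone should fail the run."""
    if not p3.passed:
        # Empty and all-mislabeled legs are owned by the positive control,
        # so each failure has exactly one reported reason.
        return False
    ratio = p3.mislabel_ratio
    # Strict comparison: a ratio exactly equal to the max still passes.
    return ratio is not None and ratio > ratio_max
```

With the default max of 0.5, a 2-of-3 mislabeled leg trips the guard, a 1-of-2 leg sits exactly on the boundary and passes, and a 3-of-3 leg is left to the positive control.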
Retrieval eval — PR vs
| Mode | recall@5 | MRR | nDCG@5 |
|---|---|---|---|
| vector | 0.860 (±0.000) | 0.772 (±0.000) | 0.779 (±0.000) |
| keyword | 0.110 (±0.000) | 0.120 (±0.000) | 0.112 (±0.000) |
| hybrid | 0.860 (±0.000) | 0.759 (±0.000) | 0.769 (±0.000) |
Per-category recall@5
| Mode | single_chunk | multi_hop | adversarial | paraphrase |
|---|---|---|---|---|
| vector | 0.900 (±0.000) | 0.933 (±0.000) | 0.600 (±0.000) | 1.000 (±0.000) |
| keyword | 0.250 (±0.000) | 0.033 (±0.000) | 0.000 (±0.000) | 0.000 (±0.000) |
| hybrid | 0.900 (±0.000) | 0.933 (±0.000) | 0.600 (±0.000) | 1.000 (±0.000) |
Comment is updated in place on each push by .github/workflows/retrieval-eval.yml (US-035). Comment-only — never blocks the build.
Intent
Resolve issue #26: harden the E7 weekly false-resolve safety gate against a heavily-mislabeled P3 gold set, and tighten the false_resolve_rate / compute_e7_metrics docstrings so the "denominator misread" review finding cannot recur. Non-blocking follow-up to the escalation/deflection pipeline (Epic D, ADR-0003, US-046-059, merged PR #27).
Deliberate decisions a diff-only reviewer would not know:
Tests: 5 new groups in test_e7_runner.py (property math + None-on-empty; partial-mislabel-over-max fails with the positive control passing AND the ceiling not breached so the guard is the SOLE cause; strict greater-than boundary + configurability via two thresholds on one leg; all-mislabeled stays owned by the positive control; env resolver default/override + fail-closed). Full suite 64/64 green; flake8 and mypy clean on the two changed files (remaining repo-wide mypy notes - missing yaml stubs, ragas double-module - are pre-existing infra, not introduced here). Branch fm/issue26-e7-mislabel-guard is off the latest origin/main (PR #27 already merged, so the E7 code is present). Closes #26.
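The denominator convention stated in the Intent can be illustrated with a small worked example. This is a sketch under assumptions, not the repository's false_resolve_rate: `P3Row`, `false_resolve_rate_sketch` and `exercised_only_rate` are hypothetical names, the row counts are made up for illustration, and returning None on an empty population is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class P3Row:
    """Hypothetical per-question record for the unanswerable (P3) leg."""

    mislabeled: bool      # row never exercised the faithfulness gate
    false_resolved: bool  # system resolved a question it should not have


def false_resolve_rate_sketch(rows: Sequence[P3Row]) -> Optional[float]:
    """False resolves over the FULL presented population."""
    if not rows:
        return None  # assumption: no rate is reported for an empty leg
    # Mislabeled rows stay in the denominator but never count in the numerator.
    numerator = sum(1 for r in rows if not r.mislabeled and r.false_resolved)
    return numerator / len(rows)


def exercised_only_rate(rows: Sequence[P3Row]) -> Optional[float]:
    """The rejected misreading: divide by len(exercised) instead."""
    exercised = [r for r in rows if not r.mislabeled]
    if not exercised:
        return None
    return sum(1 for r in exercised if r.false_resolved) / len(exercised)


# Illustrative leg: 10 rows presented, 8 mislabeled, 2 exercised, 1 false resolve.
rows = [P3Row(mislabeled=True, false_resolved=False)] * 8 + [
    P3Row(mislabeled=False, false_resolved=True),
    P3Row(mislabeled=False, false_resolved=False),
]
full_rate = false_resolve_rate_sketch(rows)  # 1 / 10
misread_rate = exercised_only_rate(rows)     # 1 / 2
```

The gap between the two numbers is the dilution the PR describes: the full-population rate looks small because most rows never exercised the gate. The PR keeps the full-population denominator and lets the mislabel-ratio guard fail such a run instead.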
What Changed
8a1db1a no-mistakes(document): document E7 P3 mislabel-ratio guard in evals.md
8d972a0 feat(evals): E7 P3 mislabel-ratio guard + tighten false-resolve denominator docs (#26)
Risk Assessment
✅ Low: Tightly-scoped, additive safety-guard with accurate docstrings, fail-closed config on every path, correct partitioning with the existing positive control, no broken consumers, and thorough new tests.
Testing
Ran the full E7 runner suite (64/64 green) on both Anaconda Python 3.9 and a freshly-built CI-matching Python 3.11 venv, then drove the real production exit-code decision and P3 scoring leg over a realistic 2-of-3 mislabeled P3 gold set to show the guard failing the weekly run (exit 1) with its operator log while the positive control passes and the ceiling is not breached - the precise partial-dilution gap the issue closes. Also verified the new CLI flag in --help and the fail-closed behavior of both the flag and the env var, plus the additive mislabel_ratio JSON field. This is a backend/CLI eval tool change with no rendered UI surface, so evidence is CLI transcripts and the result JSON rather than screenshots. All checks passed and the worktree was left clean (temp venv and caches removed).
Evidence: End-to-end scenario transcript: mislabel-ratio guard fails the weekly run on a 2/3-mislabeled P3 gold (guard is sole cause)
Evidence: Operator-facing log emitted by the guard
Evidence: Full test_e7_runner suite output (64/64) on CI-matching Python 3.11
Evidence: CLI surface: --p3-mislabel-ratio-max in --help + fail-closed (flag exit 2, env exit 1)
Evidence: Scenario driver source (reuses real production functions + test fixtures)
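The fail-closed behavior verified above (flag exit 2, env ValueError) can be reproduced with a small sketch. It is not the repository's get_p3_mislabel_ratio_max: `parse_ratio`, `ratio_from_env` and `build_parser` are hypothetical names, and the accepted range of 0 to 1 inclusive is an assumption (the PR only states that 1.2 and 1.5 are rejected).

```python
import argparse
import os
from typing import Mapping, Optional

DEFAULT_RATIO_MAX = 0.5  # default stated in the PR
ENV_VAR = "E7_P3_MISLABEL_RATIO_MAX"


def parse_ratio(raw: str) -> float:
    """Parse a ratio, raising ValueError instead of guessing a fallback."""
    value = float(raw)  # raises ValueError for input such as "abc"
    # Assumed valid range; the chained comparison also rejects NaN.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"ratio must be within [0, 1], got {raw!r}")
    return value


def ratio_from_env(env: Optional[Mapping[str, str]] = None) -> float:
    """Resolve the max from the environment; a bad value fails closed."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if raw is None:
        return DEFAULT_RATIO_MAX
    return parse_ratio(raw)  # no silent fallback to the default


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="e7_sketch")
    # argparse turns a ValueError from `type` into a usage error (exit 2).
    parser.add_argument("--p3-mislabel-ratio-max", type=parse_ratio, default=None)
    return parser
```

Reusing one parser for both the flag and the environment variable keeps the two paths from drifting apart: an invalid value is rejected the same way regardless of where it came from, and only the exit path differs.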
Pipeline
Updates from git push no-mistakes
✅ **intent** - passed
✅ No issues found.
✅ **Rebase** - passed
✅ No issues found.
evals/retrieval/e7_runner.py:2441 - e7_pinned_invariants_failed defaults p3_mislabel_ratio_max to the hardcoded DEFAULT_P3_MISLABEL_RATIO_MAX (0.5), not to get_p3_mislabel_ratio_max(). The E7_P3_MISLABEL_RATIO_MAX env override is therefore only honored when a caller explicitly resolves it, which today is only amain. No bug now (amain is the sole production caller and this mirrors the existing faithfulness_judge_min pattern), but a future second entry point that forgets to thread the resolver would silently fall back to 0.5 for a safety ceiling. Informational only.
✅ **Test** - passed
✅ No issues found.
- `python -m evals.retrieval.test_e7_runner` full suite: 64/64 green on a CI-matching Python 3.11 venv (httpx/openai/pydantic/pyyaml) and on Anaconda Python 3.9
- New issue-#26 groups: `test_p3_mislabel_ratio_property`, `test_exit_p3_partial_mislabel_over_max_fails`, `test_exit_p3_mislabel_ratio_boundary_and_configurable`, `test_exit_p3_all_mislabeled_owned_by_positive_control_not_ratio_guard`, `test_get_p3_mislabel_ratio_max_env`
- Scenario driver over real `run_e7_p3` + `e7_pinned_invariants_failed` on a 2/3-mislabeled P3 gold: guard fires -> weekly run exit 1, positive control passed=True, ceiling breached=False (guard is sole cause), mislabel_ratio=2/3 over full population
- CLI: `python -m evals.retrieval.e7_runner --help` shows `--p3-mislabel-ratio-max`
- Fail-closed: `--p3-mislabel-ratio-max 1.5` and `abc` -> argparse error exit 2; `E7_P3_MISLABEL_RATIO_MAX=abc` and `=1.2` -> ValueError exit 1
- `E7P3Result.to_dict()` exposes `mislabel_ratio` (verified additive in result JSON)
✅ **Document** - passed
✅ No issues found.
✅ **Lint** - passed
✅ No issues found.
✅ **Push** - passed
✅ No issues found.