Skip to content

Commit 302c35b

Browse files
igerberclaude
andcommitted
reviewer-eval: harden the Codex-reviewer A/B harness (PR #510 review + local review)
Addresses the PR #510 round-3 CI review, then 16 rounds of local Codex review (~42 findings, including one P0) that hardened the harness end-to-end. All changes are local tooling under tools/reviewer-eval/ + its tests; no shipped diff-diff estimator/inference code is touched. Run identity & resume (fail closed): - experiment_tag folds in the codex-invocation contract (hash of _build_codex_cmd / call_codex source) + action_version, so a backend-wrapper or confound change busts the cache; resume re-verifies the built prompt so prompt-builder/case edits also bust it. - run_matrix fails closed on experiment_tag errors (no empty-tag fallback) and on a pinned-CLI mismatch or an unobtainable CLI version (single-arm included); aborts a multi-arm run that drifts in effort/sandbox/action_version (ConfoundMismatch); review() fail-closes on non-xhigh effort / non-read-only sandbox / non-v1 action. Manifest & compare (one run = one experiment): - cmd_run invalidates an existing manifest the moment it commits to a run attempt and writes a failure marker on every non-success exit; records base_prompt_sha. - compare refuses a failed/incomplete manifest, a rubric drift (live pr_review.md != the run's), and incomplete manifest coverage; fails closed on a missing manifest unless --allow-mixed; groups by (case_id, snapshot) so versions never conflate; renders grading context (notes, rerun prior review), must_catch, and known-FP file/severity; the grading table sizes to the actual config set. Corpus & worktree (integrity + safety): - verify() matches expected files by exact post-diff path (rename destination only); negative controls must declare an exact expected_files contract; the stratum dir must equal case.json's; duplicate/reserved case ids are rejected. - worktree leaf is a path-safe digest (no case.id traversal/collision — fixes a P0 where case.id=".." could rmtree runs/); cleanup is realpath-contained; stored-patch paths are containment-checked; materialize cleans up + rewraps any post-add failure as a case-scoped MaterializeError. Operator surface: - smoke defaults to 1 case and always runs codex live (clears its cache); --subdir is containment-validated; the notebook guard is scoped to docs/tutorials/*.ipynb. Deferred (tracked in TODO.md): port CI's <notebook-prose> extraction so docs/tutorials notebook cases can be reviewed with CI-equivalent context. 77 eval tests (tests/test_evals_{adapters,engine,runtime}.py); ruff + black clean; verify-corpus 2/2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 43704b6 commit 302c35b

17 files changed

Lines changed: 1713 additions & 133 deletions

File tree

TODO.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -169,7 +169,7 @@ Deferred items from PR reviews that were not addressed before merge.
169169
| Issue | Location | PR | Priority |
170170
|-------|----------|----|----------|
171171
| Drift test for tutorial 24 qualitative power claims (monotonic dilution fast→slow; CS-vs-2×2 MDE crossover/near-parity at slow rollout) — pins the prose against estimator-default/simulation drift | `docs/tutorials/24_staggered_vs_collapsed_power.ipynb` | staggered-analysis-2x2 | Low |
172-
| Port the CI `<notebook-prose>` extraction into the reviewer-eval harness (notebook cases are currently guarded out of `verify-corpus`/`run`) | `tools/reviewer-eval/adapters/ci_prompt.py` | local-review | Low |
172+
| Port the CI `<notebook-prose>` extraction into the reviewer-eval harness so `docs/tutorials/*.ipynb` cases (currently guarded out of `verify-corpus`/`run`) can be reviewed with CI-equivalent context | `tools/reviewer-eval/adapters/ci_prompt.py` | local-review | Low |
173173
| R comparison tests spawn separate `Rscript` per test (slow CI) | `tests/test_methodology_twfe.py:294` | #139 | Low |
174174
| CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path | `tests/test_methodology_callaway.py` | #202 | Low |
175175
| Doc-snippet smoke tests only cover `.rst` files; `.txt` AI guides outside CI validation | `tests/test_doc_snippets.py` | #239 | Low |

tests/test_evals_adapters.py

Lines changed: 23 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -121,21 +121,35 @@ def test_corpus_loads_seed_cases():
121121
assert s3.expect_no_blockers is True
122122

123123

124-
def test_seed_cases_match_schema_required_fields():
125-
"""Lightweight schema check (no jsonschema dep): required fields + enums."""
124+
def test_seed_cases_match_schema_constraints():
125+
"""Lightweight schema check (no jsonschema dep): required fields, enums, the
126+
top-level additionalProperties=false allowlist, and the per-kind fixture
127+
requirements — all mirrored from manifest.schema.json so typos in optional
128+
metadata can't be silently defaulted by the loader."""
126129
schema = json.loads((_EVAL_ROOT / "corpus" / "manifest.schema.json").read_text())
127130
required = set(schema["required"])
131+
allowed_top = set(schema["properties"])
132+
assert schema.get("additionalProperties") is False, "schema must forbid unknown top-level keys"
128133
severities = set(
129134
schema["properties"]["ground_truth"]["items"]["properties"]["expected_severity"]["enum"]
130135
)
131136
kinds = set(schema["properties"]["fixture"]["properties"]["kind"]["enum"])
137+
# mirror the fixture allOf conditionals (kind -> the field it requires)
138+
kind_req = {"git_range": "head_sha", "stored_patch": "patch", "git_revert": "revert_commit"}
132139
cases_dir = _EVAL_ROOT / "corpus" / "cases"
133140
found = 0
134141
for case_json in cases_dir.glob("*/*/case.json"):
135142
d = json.loads(case_json.read_text())
136143
found += 1
137144
assert required <= set(d), f"{case_json} missing {required - set(d)}"
138-
assert d["fixture"]["kind"] in kinds
145+
assert (
146+
set(d) <= allowed_top
147+
), f"{case_json} has unknown top-level keys {set(d) - allowed_top}"
148+
kind = d["fixture"]["kind"]
149+
assert kind in kinds
150+
assert (
151+
kind_req[kind] in d["fixture"]
152+
), f"{case_json} {kind} fixture missing {kind_req[kind]}"
139153
for bug in d.get("ground_truth", []):
140154
assert bug["expected_severity"] in severities
141155
assert found >= 2, "expected at least the two seed cases"
@@ -205,9 +219,13 @@ def test_s1_inject_diff_undrifted_at_base():
205219
def test_touches_notebook_predicate():
206220
from adapters.ci_prompt import touches_notebook
207221

222+
# Only TUTORIAL notebooks (docs/tutorials/*.ipynb) are special-cased by CI.
208223
assert touches_notebook("M\tdocs/tutorials/foo.ipynb") is True
209-
# rename line: STATUS \t old \t new — the .ipynb target must still trip it
210-
assert touches_notebook("R100\told.py\tdocs/x.ipynb") is True
224+
# rename TO a tutorial notebook trips it (destination column is a tutorial nb)
225+
assert touches_notebook("R100\told.py\tdocs/tutorials/new.ipynb") is True
226+
# a NON-tutorial .ipynb rides the normal diff path (same as CI) -> not guarded
227+
assert touches_notebook("M\tnotebooks/foo.ipynb") is False
228+
assert touches_notebook("R100\told.py\tdocs/x.ipynb") is False
211229
# the seed cases touch .py / .md, not notebooks
212230
assert touches_notebook("M\tdiff_diff/estimators.py") is False
213231
assert touches_notebook("A\tCHANGELOG.md\nM\tdiff_diff/x.py") is False

0 commit comments

Comments
 (0)