Commit 302c35b
reviewer-eval: harden the Codex-reviewer A/B harness (PR #510 review + local review)
Addresses the PR #510 round-3 CI review, plus the findings from 16 subsequent rounds of
local Codex review (~42 findings, including one P0), hardening the harness end-to-end.
All changes are local tooling under tools/reviewer-eval/ + its tests; no shipped diff-diff
estimator/inference code is touched.
Run identity & resume (fail closed):
- experiment_tag folds in the codex-invocation contract (hash of _build_codex_cmd /
call_codex source) + action_version, so a backend-wrapper or confound change busts
the cache; resume re-verifies the built prompt so prompt-builder/case edits also bust it.
- run_matrix fails closed on experiment_tag errors (no empty-tag fallback) and on a
pinned-CLI mismatch or an unobtainable CLI version (single-arm included); aborts a
multi-arm run that drifts in effort/sandbox/action_version (ConfoundMismatch);
review() fail-closes on non-xhigh effort / non-read-only sandbox / non-v1 action.
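The tag-folding idea above can be sketched roughly as follows. This is a minimal illustration, not the harness's code: the function names contract_hash and experiment_tag as written here, the config keys, and the 16-character truncation are all assumptions. The source strings would presumably come from something like inspect.getsource() on _build_codex_cmd / call_codex; they are passed in as plain text here so the sketch stays self-contained.

```python
import hashlib
import json


def contract_hash(sources):
    """Hash the source text of the functions that define the codex invocation."""
    h = hashlib.sha256()
    for src in sources:
        data = src.encode("utf-8")
        # Length-prefix each piece so ["ab", "c"] and ["a", "bc"] hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def experiment_tag(config, action_version, sources):
    """Derive a cache key that changes when config, action version, or sources change.

    Hypothetical signature: the real experiment_tag may take different inputs.
    """
    if not action_version:
        # Fail closed: never fall back to an empty or partial tag.
        raise ValueError("action_version is required; refusing to build a tag")
    payload = json.dumps(
        {
            "config": config,
            "action_version": action_version,
            "contract": contract_hash(sources),
        },
        sort_keys=True,  # stable key order so equal inputs give equal tags
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Because the source text is part of the hashed payload, editing the wrapper changes the tag, and any cache keyed on the old tag is no longer hit.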
Manifest & compare (one run = one experiment):
- cmd_run invalidates an existing manifest the moment it commits to a run attempt and
writes a failure marker on every non-success exit; records base_prompt_sha.
- compare refuses a failed/incomplete manifest, a rubric drift (live pr_review.md !=
the run's), and incomplete manifest coverage; fails closed on a missing manifest
unless --allow-mixed; groups by (case_id, snapshot) so versions never conflate;
renders grading context (notes, rerun prior review), must_catch, and known-FP
file/severity; the grading table sizes to the actual config set.
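As a rough sketch of the compare-side checks above, assuming a dict-shaped manifest: the field names ("status", "rubric_sha", "case_id", "snapshot"), the ManifestError class, and both function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class ManifestError(Exception):
    """Hypothetical error type for a manifest that compare must refuse."""


def check_manifest(manifest, live_rubric_sha, allow_mixed=False):
    """Fail closed unless the manifest is present, complete, and rubric-consistent."""
    if manifest is None:
        if allow_mixed:
            return  # the operator explicitly opted in to comparing without a manifest
        raise ManifestError("missing manifest (pass --allow-mixed to override)")
    if manifest.get("status") != "complete":
        raise ManifestError("manifest is failed or incomplete")
    if manifest.get("rubric_sha") != live_rubric_sha:
        raise ManifestError("rubric drift: live rubric differs from the run's")


def group_results(results):
    """Group results by (case_id, snapshot) so snapshots of one case never merge."""
    groups = defaultdict(list)
    for result in results:
        groups[(result["case_id"], result["snapshot"])].append(result)
    return dict(groups)
```

Keying on the pair rather than on case_id alone is what keeps two versions of the same case in separate rows of the grading table.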
Corpus & worktree (integrity + safety):
- verify() matches expected files by exact post-diff path (rename destination only);
negative controls must declare an exact expected_files contract; the stratum dir must
equal case.json's; duplicate/reserved case ids are rejected.
- worktree leaf is a path-safe digest (no case.id traversal/collision — fixes a P0 where
case.id=".." could rmtree runs/); cleanup is realpath-contained; stored-patch paths are
containment-checked; materialize cleans up + rewraps any post-add failure as a
case-scoped MaterializeError.
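The P0 fix above comes down to two ideas: never use case.id as a path component, and never delete anything that does not resolve strictly inside the worktree root. A minimal sketch, assuming a SHA-256 digest and a "wt-" prefix; the helper names worktree_leaf and is_contained are hypothetical, not the harness's API.

```python
import hashlib
import os


def worktree_leaf(case_id):
    """Map an arbitrary case id to a fixed-alphabet directory name.

    A hex digest cannot contain "/" or "..", so ids such as ".." or "a/b"
    cannot traverse out of the root or collide after sanitizing.
    """
    return "wt-" + hashlib.sha256(case_id.encode("utf-8")).hexdigest()[:16]


def is_contained(root, candidate):
    """True only if candidate resolves to a path strictly below root."""
    root_real = os.path.realpath(root)
    cand_real = os.path.realpath(candidate)
    if cand_real == root_real:
        return False  # the root itself is never a valid cleanup target
    try:
        return os.path.commonpath([root_real, cand_real]) == root_real
    except ValueError:
        return False  # e.g. paths on different drives on Windows
```

A cleanup routine would check is_contained(root, path) before calling shutil.rmtree(path) and refuse otherwise.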
Operator surface:
- smoke defaults to 1 case and always runs codex live (clears its cache); --subdir is
containment-validated; the notebook guard is scoped to docs/tutorials/*.ipynb.
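One way to scope a guard to docs/tutorials/*.ipynb is sketched below. It assumes the guard sees repo-relative POSIX paths and that only notebooks directly inside docs/tutorials/ count; both are assumptions, and the function names are hypothetical. Explicit path parts are used instead of fnmatch because fnmatch's "*" also matches "/", which would pull in nested directories.

```python
from pathlib import PurePosixPath


def is_tutorial_notebook(path):
    """True for docs/tutorials/<name>.ipynb, false for everything else."""
    parts = PurePosixPath(path).parts
    return (
        len(parts) == 3
        and parts[:2] == ("docs", "tutorials")
        and parts[2].endswith(".ipynb")
    )


def touches_tutorial_notebook(changed_paths):
    """True if any changed path is a tutorial notebook."""
    return any(is_tutorial_notebook(p) for p in changed_paths)
```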
Deferred (tracked in TODO.md): port CI's <notebook-prose> extraction so docs/tutorials
notebook cases can be reviewed with CI-equivalent context.
77 eval tests (tests/test_evals_{adapters,engine,runtime}.py); ruff + black clean;
verify-corpus 2/2.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 43704b6 commit 302c35b
17 files changed
Lines changed: 1713 additions & 133 deletions
File tree
- tests
- tools/reviewer-eval
- adapters
- corpus
- cases/s3_negative/s3-changelog-prose
- engine