Files: scripts/benchmark.py:189 (zero-score gate), scripts/benchmark.py:_dispatch (~line 143), scripts/aggregate_bench.py:_rescore_run
The PR #12 zero-score gate only checks exit_code != 0. A reviewer whose CLI exits 0 but prints unparseable prose (extract_json → None) yields findings=[], error=None, exit_code=0. That result flows into _score and earns P=R=F1=1.0 on clean-baseline. This is the exact "broken reviewer earns perfect F1" bug class the gate was added to close; it survives via the parse-failure path (the documented copilot-gpt5 "returns prose not JSON" failure mode). _dispatch doesn't propagate a parse_error flag, so neither benchmark.py nor aggregate_bench.py can detect the failure.
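A minimal sketch of the failure path, for illustration only. The functions `score` and `gated_score_exit_code_only` are hypothetical stand-ins, not the real `_score` or gate in scripts/benchmark.py. The convention that an empty finding set scored against an empty ground truth gives P=R=1.0 is assumed from the behaviour described above.

```python
def score(findings, expected):
    """Hypothetical stand-in for _score: precision, recall, F1 over finding IDs."""
    found, truth = set(findings), set(expected)
    tp = len(found & truth)
    # Assumed convention: nothing expected and nothing reported counts as perfect.
    precision = tp / len(found) if found else (1.0 if not truth else 0.0)
    recall = tp / len(truth) if truth else (1.0 if not found else 0.0)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def gated_score_exit_code_only(run, expected):
    """Hypothetical gate that only looks at exit_code, as described above."""
    if run["exit_code"] != 0:
        return 0.0, 0.0, 0.0
    return score(run["findings"], expected)


# A reviewer that printed prose: parsing failed, but nothing marks the run as bad.
prose_run = {"findings": [], "error": None, "exit_code": 0}

# clean-baseline has no expected findings, so the broken run scores perfectly.
clean_baseline_scores = gated_score_exit_code_only(prose_run, expected=[])
```

Under these assumptions the broken run is indistinguishable from a reviewer that correctly reported no findings.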
Fix: _dispatch records parse_error; the zero-score gate and _rescore_run treat parse_error like a failed call; run rows persist the flag.
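One way the fix could look, as a hedged sketch rather than the actual patch. The helpers `extract_json_sketch`, `dispatch_result`, and `is_failed_call` are hypothetical names; the row shape (a dict with `findings`, `error`, `exit_code`, `parse_error`) and the rule that only a JSON object counts as a successful parse are assumptions.

```python
import json


def extract_json_sketch(text):
    """Hypothetical stand-in for extract_json: parsed JSON, or None if unparseable."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None


def dispatch_result(stdout, exit_code):
    """Build a run row that records whether the reviewer output could be parsed."""
    parsed = extract_json_sketch(stdout) if exit_code == 0 else None
    # Assumption: a successful call must yield a JSON object; anything else is a parse error.
    parse_error = exit_code == 0 and not isinstance(parsed, dict)
    findings = parsed.get("findings", []) if isinstance(parsed, dict) else []
    return {
        "findings": findings,
        "error": None,
        "exit_code": exit_code,
        "parse_error": parse_error,  # persisted with the run row
    }


def is_failed_call(row):
    """Shared predicate for the zero-score gate and the rescoring path."""
    # .get keeps rows persisted before the flag existed readable.
    return row.get("exit_code", 0) != 0 or bool(row.get("parse_error", False))


prose_row = dispatch_result("Looks good to me, no issues found!", 0)
clean_row = dispatch_result('{"findings": []}', 0)
```

Sharing one predicate between the gate and the rescoring path keeps the two from drifting apart. Note that rows written before the flag existed would still be scored as successes under this sketch.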
Found by /code-review round 2 (3 finder angles independently, CONFIRMED).