benchmark: exit-0 unparseable reviewer output still scores F1=1.0 on clean-baseline #14

@jimstratus

Description

Files: scripts/benchmark.py:189 (zero-score gate), scripts/benchmark.py:_dispatch (~line 143), scripts/aggregate_bench.py:_rescore_run

The zero-score gate added in PR #12 checks only exit_code != 0. A reviewer whose CLI exits 0 but prints unparseable prose (extract_json → None) yields findings=[], error=None, exit_code=0. That result passes the gate, flows into _score, and earns P=R=F1=1.0 on clean-baseline. This is the same "broken reviewer earns perfect F1" bug class the gate was added to close; it survives via the parse-failure path, which is the documented copilot-gpt5 "returns prose not JSON" failure mode. Because _dispatch does not propagate a parse_error flag, neither benchmark.py nor aggregate_bench.py can detect the failure.
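A minimal reproduction sketch of the failure path, not the actual benchmark code. It assumes clean-baseline has no expected findings and that the scorer treats "nothing expected, nothing reported" as a perfect result; `score` and `gate_zeroes` are hypothetical stand-ins for _score and the gate at scripts/benchmark.py:189.

```python
def score(findings, expected):
    """Hypothetical scorer: precision, recall, F1 over sets of finding IDs.

    Assumption: an empty prediction against an empty ground truth is perfect.
    """
    found, truth = set(findings), set(expected)
    if not found and not truth:
        return 1.0, 1.0, 1.0
    tp = len(found & truth)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def gate_zeroes(run):
    """Gate as described in this issue: only a non-zero exit code zeroes the run."""
    return run["exit_code"] != 0


# The reviewer printed prose, so parsing produced no findings, but it exited 0.
broken_run = {"findings": [], "error": None, "exit_code": 0}
clean_baseline_expected = []  # assumed: the clean fixture has nothing to find

if gate_zeroes(broken_run):
    result = (0.0, 0.0, 0.0)
else:
    result = score(broken_run["findings"], clean_baseline_expected)
```

Under these assumptions the broken run is indistinguishable from a reviewer that correctly reported nothing, which is why the gate needs a signal other than exit_code.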

Fix: _dispatch records parse_error; the zero-score gate and _rescore_run treat parse_error like a failed call; run rows persist the flag.
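One possible shape for the fix, as a sketch only. The `extract_json` body here is a simplified stand-in for the real helper, and `dispatch`, `is_failed_call`, and the row layout are hypothetical names and structures chosen for illustration.

```python
import json


def extract_json(text):
    """Stand-in for the real extract_json: returns None when nothing parses."""
    try:
        return json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None


def dispatch(stdout, exit_code):
    """Build a run row that records whether the reviewer output failed to parse."""
    parsed = extract_json(stdout)
    findings = parsed.get("findings", []) if isinstance(parsed, dict) else []
    return {
        "findings": findings,
        "error": None,
        "exit_code": exit_code,
        # Only meaningful when the CLI itself claimed success.
        "parse_error": exit_code == 0 and parsed is None,
    }


def is_failed_call(run):
    """Shared predicate for the zero-score gate and for rescoring persisted rows.

    .get() keeps rows written before the flag existed readable (assumed False).
    """
    return run["exit_code"] != 0 or bool(run.get("parse_error"))


prose_run = dispatch("Looks good to me, no issues found!", 0)
json_run = dispatch('{"findings": []}', 0)
```

Putting the check in one predicate means the gate in benchmark.py and the rescoring in aggregate_bench.py cannot drift apart, provided the flag is persisted with each run row.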

Found by /code-review round 2; three finder angles flagged it independently (CONFIRMED).
