feat(benchmarks): completeness judge — LOC wins can't hide under-delivery#171
Merged
DietrichGebert merged 1 commit on Jun 18, 2026
…-delivery

The LOC tier scores the open feature tasks (vibe-*, tmpl-fe-*, open-*) on git diff alone -- score_vibe only checks "it compiles", score_fixture only checks "a new file exists". So an arm can win the LOC metric by shipping a stub: fewer lines because it does less, not because it is less bloated. That is the most credible attack left on the headline number raised in DietrichGebert#126.

complete.py is a second LLM judge (same auditable footing as judge.py: fixed model, temperature 0, published rubric) that rates how FULLY each submission implements its task, 0..3. Read alongside the LOC table, a low-LOC arm whose completeness also drops is caught, not rewarded.

- judge_call gains a `system=` param so the HTTP/key/source plumbing is reused instead of duplicated (one rubric is the only delta between the two passes).
- --selftest: the judge must rank a complete reference strictly above a stub.
- --selftest-offline: validates the gate logic with no API call / no key.
- README documents the pass and updates the can/cannot-show limitations.

Fixes DietrichGebert#126

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to #126 (the benchmark-fairness thread with @ColinEberhardt) and a companion to #169/#170.
These two PRs share one goal: make the benchmark better at every iteration so the project's claims are earned, not asserted.
ponytail/caveman arms run off the maintainer's machine, not just on it.

Problem
The LOC tier scores the open feature tasks (`vibe-*`, `tmpl-fe-*`, `open-*`) on `git diff` alone. The only quality gate is weak:

- `score_vibe` → "it compiles"
- `score_fixture` (tmpl-fe) → "a new file exists"
- `score_open` → `correct=1` always

So an arm can win the LOC metric by shipping a stub: fewer lines because it did less, not because it is less bloated. That is exactly the "you wrote less because you did less" objection, and the bench currently can't see it.
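To make the objection concrete, here is a minimal sketch with made-up numbers. The arm names, line counts, `features_done` field, and the `weak_gate` / `loc_winner` helpers are illustrative assumptions, not the benchmark's actual code or data. It only shows that a LOC-only ranking behind an "it compiles" style gate picks the stub:

```python
from dataclasses import dataclass


@dataclass
class Submission:
    arm: str            # hypothetical arm name
    added_lines: int    # lines added according to `git diff`
    compiles: bool      # the only thing the weak gate looks at
    features_done: int  # hypothetical ground truth, invisible to the LOC tier


def weak_gate(sub: Submission) -> bool:
    """Stand-in for a check like "it compiles": passes anything that builds."""
    return sub.compiles


def loc_winner(subs: list[Submission]) -> Submission:
    """Among submissions that pass the gate, the fewest added lines wins."""
    passing = [s for s in subs if weak_gate(s)]
    return min(passing, key=lambda s: s.added_lines)


stub = Submission("arm-a", added_lines=12, compiles=True, features_done=0)
full = Submission("arm-b", added_lines=85, compiles=True, features_done=4)

winner = loc_winner([stub, full])
print(winner.arm)  # prints "arm-a": the stub wins on LOC alone
```

Nothing in this ranking can tell "did less" apart from "less bloated", which is the gap a completeness score is meant to close.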
Fix
- `complete.py` is a second LLM judge on the same auditable footing as the over-engineering judge in `judge.py`: fixed model (`claude-sonnet-4-6`), temperature 0, published rubric. It rates how fully each submission implements its task, `0` (stub) .. `3` (fully implements). Read alongside the LOC table: a low-LOC arm whose completeness also drops is doing less, not less-bloated, and now the bench shows it.
- `judge_call` gains a `system=` parameter so the HTTP/key/source plumbing is reused, not duplicated. One rubric is the only delta between the two passes.
- … `completeness <= 1`).

Test verification (RED → GREEN)
The judge model is validated by `--selftest` (must rank a complete reference strictly above a stub). The gate logic is validated with no API key by `--selftest-offline`:

GREEN, with the gate intact:

RED, with the gate neutered to a no-op (a regression that stops catching under-delivery):
(exit 1)
So the check fails the moment the gate stops catching a stub that out-scores a real implementation.
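The RED/GREEN behaviour can be sketched offline as follows. This is a hedged illustration, not the code in `complete.py`: the function names, the canned scores, and the reading of `completeness <= 1` as the under-delivery threshold are assumptions made for the example.

```python
STUB_THRESHOLD = 1  # assumption: scores <= 1 on the 0..3 scale count as under-delivery


def gate(completeness: int) -> bool:
    """Return True when a submission should be flagged as under-delivering."""
    return completeness <= STUB_THRESHOLD


def neutered_gate(completeness: int) -> bool:
    """A regression for comparison: the gate never flags anything."""
    return False


def selftest_offline(gate_fn) -> int:
    """Exit-code style check on canned scores (0 = pass, 1 = fail). No API call."""
    # Canned rows: the stub beats the reference on LOC but scores 0 on completeness.
    stub = {"loc": 10, "completeness": 0}
    reference = {"loc": 80, "completeness": 3}
    assert stub["loc"] < reference["loc"]  # the scenario only matters if the stub "wins"

    stub_caught = gate_fn(stub["completeness"])
    reference_passes = not gate_fn(reference["completeness"])
    return 0 if (stub_caught and reference_passes) else 1


print(selftest_offline(gate))           # prints 0 (GREEN)
print(selftest_offline(neutered_gate))  # prints 1 (RED)
```

Because the check uses fixed scores instead of a model call, it exercises only the gate logic; whether the judge itself assigns sensible scores is what `--selftest` covers.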
🤖 Generated with Claude Code