Skip to content

feat(benchmarks): completeness judge — LOC wins can't hide under-delivery#171

Merged
DietrichGebert merged 1 commit into
DietrichGebert:mainfrom
ousamabenyounes:feat/agentic-completeness-gate
Jun 18, 2026
Merged

feat(benchmarks): completeness judge — LOC wins can't hide under-delivery#171
DietrichGebert merged 1 commit into
DietrichGebert:mainfrom
ousamabenyounes:feat/agentic-completeness-gate

Conversation

@ousamabenyounes

Copy link
Copy Markdown
Contributor

Follow-up to #126 (the benchmark-fairness thread with @ColinEberhardt) and a companion to #169/#170.

These two PRs share one goal: make the benchmark better at every iteration so the project's claims are earned, not asserted.

Problem

The LOC tier scores the open feature tasks (vibe-*, tmpl-fe-*, open-*) on git diff alone. The only quality gate is weak:

  • score_vibe → "it compiles"
  • score_fixture (tmpl-fe) → "a new file exists"
  • score_opencorrect=1 always

So an arm can win the LOC metric by shipping a stub — fewer lines because it did less, not because it is less bloated. That is exactly the "you wrote less because you did less" objection, and the bench currently can't see it.

Fix

complete.py — a second LLM judge on the same auditable footing as the over-engineering judge in judge.py: fixed model (claude-sonnet-4-6), temperature 0, published rubric. It rates how fully each submission implements its task, 0 (stub) .. 3 (fully implements). Read alongside the LOC table: a low-LOC arm whose completeness also drops is doing less, not less-bloated — and now the bench shows it.

  • judge_call gains a system= parameter so the HTTP/key/source plumbing is reused, not duplicated — one rubric is the only delta between the two passes.
  • Output: mean completeness per arm + a list of under-delivered cells (completeness <= 1).
  • README: new "Completeness judge" section + updated can/cannot-show limitations.

Test verification (RED → GREEN)

The judge model is validated by --selftest (must rank a complete reference strictly above a stub). The gate logic is validated with no API key by --selftest-offline:

GREEN — with the gate intact:

offline gate -- well-ordered (expect ok):
ok cache: complete(3) > stub(0)
offline gate -- stub out-scores complete (expect XX):
XX cache: did not rank complete above stub

completeness gate selftest (offline): valid

RED — gate neutered to a no-op (regression that stops catching under-delivery):

completeness gate selftest (offline): BROKEN

(exit 1)

So the check fails the moment the gate stops catching a stub that out-scores a real implementation.

🤖 Generated with Claude Code

…-delivery

The LOC tier scores the open feature tasks (vibe-*, tmpl-fe-*, open-*) on
git diff alone -- score_vibe only checks "it compiles", score_fixture only
checks "a new file exists". So an arm can win the LOC metric by shipping a
stub: fewer lines because it does less, not because it is less bloated.
That is the most credible attack left on the headline number raised in DietrichGebert#126.

complete.py is a second LLM judge (same auditable footing as judge.py: fixed
model, temperature 0, published rubric) that rates how FULLY each submission
implements its task, 0..3. Read alongside the LOC table, a low-LOC arm whose
completeness also drops is caught, not rewarded.

- judge_call gains a `system=` param so the HTTP/key/source plumbing is reused
  instead of duplicated (one rubric is the only delta between the two passes).
- --selftest: the judge must rank a complete reference strictly above a stub.
- --selftest-offline: validates the gate logic with no API call / no key.
- README documents the pass and updates the can/cannot-show limitations.

Fixes DietrichGebert#126

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@DietrichGebert DietrichGebert merged commit 955fff5 into DietrichGebert:main Jun 18, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants