feat(benchmarks): completeness judge — LOC wins can't hide under-delivery by ousamabenyounes · Pull Request #171 · DietrichGebert/ponytail

ousamabenyounes · 2026-06-18T20:23:58Z

Follow-up to #126 (the benchmark-fairness thread with @ColinEberhardt) and a companion to #169/#170.

These two PRs share one goal: make the benchmark better at every iteration so the project's claims are earned, not asserted.

fix(benchmarks): portable plugin-dir resolution for agentic arms #170 made the benchmark reproducible by anyone — it removed hardcoded personal plugin-cache paths so the ponytail/caveman arms run off the maintainer's machine, not just on it.
This PR makes the benchmark more relevant — it closes the "less code might just mean less feature" hole, which is the most credible attack left on the headline number after the Benchmark issues - baseline scores are ~7 times better #126 rebuild.

Problem

The LOC tier scores the open feature tasks (vibe-*, tmpl-fe-*, open-*) on git diff alone. The only quality gate is weak:

score_vibe → "it compiles"
score_fixture (tmpl-fe) → "a new file exists"
score_open → correct=1 always

So an arm can win the LOC metric by shipping a stub — fewer lines because it did less, not because it is less bloated. That is exactly the "you wrote less because you did less" objection, and the bench currently can't see it.

Fix

complete.py — a second LLM judge on the same auditable footing as the over-engineering judge in judge.py: fixed model (claude-sonnet-4-6), temperature 0, published rubric. It rates how fully each submission implements its task, 0 (stub) .. 3 (fully implements). Read alongside the LOC table: a low-LOC arm whose completeness also drops is doing less, not less-bloated — and now the bench shows it.

judge_call gains a system= parameter so the HTTP/key/source plumbing is reused, not duplicated — one rubric is the only delta between the two passes.
Output: mean completeness per arm + a list of under-delivered cells (completeness <= 1).
README: new "Completeness judge" section + updated can/cannot-show limitations.

Test verification (RED → GREEN)

The judge model is validated by --selftest (must rank a complete reference strictly above a stub). The gate logic is validated with no API key by --selftest-offline:

GREEN — with the gate intact:

offline gate -- well-ordered (expect ok):
ok cache: complete(3) > stub(0)
offline gate -- stub out-scores complete (expect XX):
XX cache: did not rank complete above stub

completeness gate selftest (offline): valid

RED — gate neutered to a no-op (regression that stops catching under-delivery):

completeness gate selftest (offline): BROKEN

(exit 1)

So the check fails the moment the gate stops catching a stub that out-scores a real implementation.

🤖 Generated with Claude Code

…-delivery The LOC tier scores the open feature tasks (vibe-*, tmpl-fe-*, open-*) on git diff alone -- score_vibe only checks "it compiles", score_fixture only checks "a new file exists". So an arm can win the LOC metric by shipping a stub: fewer lines because it does less, not because it is less bloated. That is the most credible attack left on the headline number raised in DietrichGebert#126. complete.py is a second LLM judge (same auditable footing as judge.py: fixed model, temperature 0, published rubric) that rates how FULLY each submission implements its task, 0..3. Read alongside the LOC table, a low-LOC arm whose completeness also drops is caught, not rewarded. - judge_call gains a `system=` param so the HTTP/key/source plumbing is reused instead of duplicated (one rubric is the only delta between the two passes). - --selftest: the judge must rank a complete reference strictly above a stub. - --selftest-offline: validates the gate logic with no API call / no key. - README documents the pass and updates the can/cannot-show limitations. Fixes DietrichGebert#126 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

DietrichGebert merged commit 955fff5 into DietrichGebert:main Jun 18, 2026
1 check passed

DietrichGebert mentioned this pull request Jun 18, 2026

Is any evaluation prove it won‘t damage the quality of output? #144

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(benchmarks): completeness judge — LOC wins can't hide under-delivery#171

feat(benchmarks): completeness judge — LOC wins can't hide under-delivery#171
DietrichGebert merged 1 commit into
DietrichGebert:mainfrom
ousamabenyounes:feat/agentic-completeness-gate

ousamabenyounes commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

ousamabenyounes commented Jun 18, 2026

Problem

Fix

Test verification (RED → GREEN)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants