Agentic benchmark answering #126 (real repo, LOC + safety) by DietrichGebert · Pull Request #158 · DietrichGebert/ponytail

DietrichGebert · 2026-06-18T14:20:39Z

Rebuilds the ponytail benchmark to the standard #126 asked for.

What

Real headless Claude Code sessions (not a bare model) editing a real public repo (tiangolo/full-stack-fastapi-template @ cd83fc1, MIT). Fair arms: baseline, caveman, ponytail, and the "YAGNI + one-liners" prompt from the #126 writeup. n=4, Haiku 4.5.

LOC via git diff on 12 real feature tickets.
Safety: the produced code is executed against adversarial input on 6 surgical tasks.
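As a rough illustration of the LOC measurement, the sketch below sums added and deleted lines from `git diff --numstat`. The helper names `parse_numstat` and `diff_loc` are hypothetical and are not taken from the benchmark harness, which may count lines differently (for example by including untracked files, which a plain `git diff` against `HEAD` does not).

```python
import subprocess


def parse_numstat(numstat: str) -> tuple[int, int]:
    """Sum (added, deleted) line counts from `git diff --numstat` output."""
    added = deleted = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        a, d, _path = parts
        if a == "-" or d == "-":
            continue  # git prints "-" for binary files; they carry no LOC
        added += int(a)
        deleted += int(d)
    return added, deleted


def diff_loc(repo: str, base: str = "HEAD") -> tuple[int, int]:
    """Hypothetical helper: LOC delta of the working tree in `repo` vs `base`."""
    out = subprocess.run(
        ["git", "-C", repo, "diff", "--numstat", base],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_numstat(out)
```

Keeping the parsing separate from the `git` call makes the counting rule testable without a repository checkout.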

Results (vs no-skill baseline)

arm	LOC	tokens	cost	time	safe
ponytail	-54%	-22%	-20%	-27%	100%
caveman	-20%	+7%	+3%	+2%	100%
"YAGNI + one-liners"	-33%	-14%	-21%	-30%	95%
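One plausible way to derive a "vs baseline" percentage like those in the table is to compare the mean of an arm's runs with the mean of the baseline runs. This is only a sketch: the function name is hypothetical, the sample numbers are invented for illustration and are not benchmark data, and the actual writeup may aggregate differently (for example by averaging per-task percentages).

```python
from statistics import mean


def pct_vs_baseline(arm_runs: list[float], baseline_runs: list[float]) -> float:
    """Relative change of the arm's mean vs the baseline's mean, in percent.

    Negative means the arm used less (fewer lines, tokens, dollars, seconds).
    """
    base = mean(baseline_runs)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return 100 * (mean(arm_runs) - base) / base


# Invented numbers for illustration only (n=4 runs per arm).
example_baseline = [100, 100, 100, 100]
example_arm = [40, 50, 46, 48]
example_change = pct_vs_baseline(example_arm, example_baseline)
```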

ponytail cuts the most code (up to -94% on over-build features like the date/color picker, where it reaches for a native <input> instead of a component), is a wash on irreducible code, never writes more, and is the only arm that stays 100% safe. The one-liner prompt dropped a path-traversal guard once.
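For readers unfamiliar with the failure mode, here is a minimal sketch of a path-traversal guard of the kind the safety tasks probe with adversarial input. The function `safe_join` is a hypothetical example, not code from the benchmark or from the target repo; it assumes Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path


def safe_join(root: Path, user_path: str) -> Path:
    """Resolve `user_path` under `root`, rejecting anything that escapes it."""
    root = root.resolve()
    # Resolving collapses ".." segments and follows symlinks before the check.
    candidate = (root / user_path).resolve()
    if not candidate.is_relative_to(root):
        raise ValueError(f"path escapes root: {user_path!r}")
    return candidate
```

A safety check in this style would call the produced code with inputs such as `../secret` or an absolute path and count the run as unsafe if no error is raised.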

Also in this PR

Fixes a baseline-contamination bug (the ponytail plugin's SessionStart hook fired on every arm; isolated now with --setting-sources project,local + per-arm --plugin-dir) and a Windows subprocess-timeout hang.
Both READMEs lead with these agentic numbers; the single-shot 80-94% is kept but labelled "isolated generation" and credited to #126 ("Benchmark issues - baseline scores are ~7 times better").
The contaminated 2026-06-17 writeup is superseded (kept for history).
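The contamination fix above can be pictured as building a separate command line per arm, so that only the arm under test loads its plugin. The sketch below is an assumption about how such a harness might look: the `--setting-sources project,local` and `--plugin-dir` flags come from this PR, while the `claude -p` headless invocation, the function name, and the directory names are assumed for illustration and are not taken from the harness.

```python
from typing import Optional


def build_arm_command(prompt: str, plugin_dir: Optional[str]) -> list[str]:
    """Build argv for one headless session, isolated from user-level settings."""
    argv = ["claude", "-p", prompt, "--setting-sources", "project,local"]
    if plugin_dir is not None:
        # Only arms that test a plugin get one; the baseline gets none.
        argv += ["--plugin-dir", plugin_dir]
    return argv


# Hypothetical per-arm plugin directories; None means "no plugin".
ARM_PLUGIN_DIRS = {
    "baseline": None,
    "ponytail": "plugins/ponytail",
}

commands = {
    arm: build_arm_command("Implement the ticket.", plugin_dir)
    for arm, plugin_dir in ARM_PLUGIN_DIRS.items()
}
```

Passing argv as a list rather than a shell string also avoids quoting issues when the prompt contains spaces or quotes.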

Writeup: benchmarks/results/2026-06-18-agentic.md · reproduction: benchmarks/agentic/README.md.

Addresses #126.

🤖 Generated with Claude Code

Rebuild the benchmark to the standard #126 asked for: real headless Claude Code sessions (not a bare model) editing a real public repo (tiangolo/full-stack-fastapi-template @ cd83fc1, MIT), fair arms (baseline, caveman, ponytail, and the "YAGNI + one-liners" prompt), n=4, Haiku 4.5. LOC is the git diff; the safety tasks execute the produced code against adversarial input.

Results: ponytail -54% LOC mean (up to -94% on over-build features like the date/color picker), -22% tokens, -20% cost, -27% time, and never more than baseline; 100% safe vs the one-liner prompt's 95% (it dropped a path-traversal guard once). caveman writes less code but spends more tokens.

Also fixes a baseline-contamination bug (the ponytail plugin's SessionStart hook fired on every arm; now isolated with --setting-sources project,local + per-arm --plugin-dir) and a Windows subprocess-timeout hang.

Lead both READMEs with the agentic numbers; demote the single-shot 80-94% to a labelled "isolated generation" note; supersede the contaminated 2026-06-17 writeup. Dead react-app fixture left untracked.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Grouped bars of LOC, tokens, cost and time as a % of the no-skill baseline (lower is leaner/cheaper/faster), plus a separate safety strip (baseline, caveman and ponytail 100%; yagni-oneliner 95%). System-gray palette so it reads on both GitHub themes. The chart commits landed after #158 had already squash-merged, so this brings the chart onto main.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

DietrichGebert merged commit b8d6aa7 into main Jun 18, 2026
1 check passed

DietrichGebert mentioned this pull request Jun 18, 2026

Add the agentic benchmark chart (metrics + safety) #160

Merged

DietrichGebert merged 1 commit into main from bench/agentic-126
