Agentic benchmark answering #126 (real repo, LOC + safety) #158

Merged: DietrichGebert merged 1 commit into main from bench/agentic-126 on Jun 18, 2026
@DietrichGebert

Owner

Rebuilds the ponytail benchmark to the standard #126 asked for.

What

Real headless Claude Code sessions (not a bare model) editing a real public repo (tiangolo/full-stack-fastapi-template @ cd83fc1, MIT). Fair arms: baseline, caveman, ponytail, and the "YAGNI + one-liners" prompt from the #126 writeup. n=4, Haiku 4.5.

  • LOC via git diff on 12 real feature tickets.
  • Safety: the produced code is executed against adversarial input on 6 surgical tasks.
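
As a rough illustration of the LOC measurement, the sketch below totals added and deleted lines from `git diff --numstat` output. It is not the benchmark's actual script: the function names are hypothetical, and whether the benchmark counts added lines, net lines, or something else is not stated here, so both totals are returned.

```python
import subprocess
from typing import Tuple


def loc_from_numstat(numstat: str) -> Tuple[int, int]:
    """Sum (added, deleted) lines from `git diff --numstat` text.

    Each numstat line is "<added>\t<deleted>\t<path>"; binary files
    report "-" for both counts and are skipped.
    """
    added = deleted = 0
    for line in numstat.splitlines():
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # blank or unexpected line
        a, d, _path = parts
        if a == "-" or d == "-":
            continue  # binary file, no line counts
        added += int(a)
        deleted += int(d)
    return added, deleted


def diff_loc(repo_dir: str, base_ref: str) -> Tuple[int, int]:
    """Hypothetical helper: measure the working tree against base_ref."""
    result = subprocess.run(
        ["git", "diff", "--numstat", base_ref],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return loc_from_numstat(result.stdout)
```

Calling `diff_loc(repo_dir, "cd83fc1")` after a session would give that session's totals relative to the pinned commit, assuming the session leaves its edits in the working tree.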

Results (vs no-skill baseline)

| arm | LOC | tokens | cost | time | safe |
|---|---|---|---|---|---|
| ponytail | -54% | -22% | -20% | -27% | 100% |
| caveman | -20% | +7% | +3% | +2% | 100% |
| "YAGNI + one-liners" | -33% | -14% | -21% | -30% | 95% |
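
For readers who want the arithmetic spelled out, here is a small sketch of how a "vs baseline" delta relates to a "% of baseline" value. The deltas are copied from the results above; the function names are illustrative and not taken from the benchmark code.

```python
# Deltas vs the no-skill baseline, in percent, as reported above.
DELTAS = {
    "ponytail": {"LOC": -54, "tokens": -22, "cost": -20, "time": -27},
    "caveman": {"LOC": -20, "tokens": 7, "cost": 3, "time": 2},
    "yagni-oneliner": {"LOC": -33, "tokens": -14, "cost": -21, "time": -30},
}


def delta_vs_baseline(arm_value: float, baseline_value: float) -> float:
    """Percent change of an arm's metric relative to the baseline."""
    return 100.0 * (arm_value - baseline_value) / baseline_value


def percent_of_baseline(delta_pct: float) -> float:
    """Convert a delta such as -54 into 'this arm is 46% of baseline'."""
    return 100.0 + delta_pct


for arm, metrics in DELTAS.items():
    row = {name: percent_of_baseline(d) for name, d in metrics.items()}
    print(arm, row)
```

Read this way, a -54% LOC delta means the arm's diffs were on average 46% the size of the baseline's, while caveman's +7% tokens means 107% of baseline.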

ponytail cuts the most code (up to -94% on over-build features like the date/color picker, where it reaches for a native <input> instead of a component), is a wash on irreducible code, never writes more, and is the only arm that stays 100% safe. The one-liner prompt dropped a path-traversal guard once.
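
To make the path-traversal case concrete, the sketch below shows one common form of the guard and a check that feeds it adversarial input. It is an assumption about the kind of test involved, not the benchmark's actual harness; `resolve_inside`, `ADVERSARIAL_PATHS` and `guard_holds` are hypothetical names.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir; reject anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # After resolving ".." and symlinks, the result must be base itself
    # or have base among its ancestors.
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate


# Inputs a safety task might try (illustrative, not the benchmark's list).
ADVERSARIAL_PATHS = ["../secret.txt", "a/../../secret.txt", "/etc/passwd"]


def guard_holds(base_dir: str) -> bool:
    """True only if every adversarial path is rejected."""
    for path in ADVERSARIAL_PATHS:
        try:
            resolve_inside(base_dir, path)
        except ValueError:
            continue
        return False  # an escape was accepted
    return True
```

A terser rewrite that joins `base / user_path` and skips the ancestor check would still pass ordinary functional tests, which is why a guard like this only shows up as missing when the code is executed against hostile input.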

Also in this PR

  • Fixes a baseline-contamination bug (the ponytail plugin's SessionStart hook fired on every arm; isolated now with --setting-sources project,local + per-arm --plugin-dir) and a Windows subprocess-timeout hang.
  • Both READMEs lead with these agentic numbers; the single-shot 80-94% figure is kept but labelled "isolated generation" and credited to #126 ("Benchmark issues - baseline scores are ~7 times better").
  • The contaminated 2026-06-17 writeup is superseded (kept for history).
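
A minimal sketch of the per-arm isolation described above, assuming each arm is launched as its own headless session. The `--setting-sources project,local` and `--plugin-dir` flags are the ones named in this PR; the `claude -p` invocation, the function names, and the timeout handling are assumptions for illustration, and the actual fix for the Windows hang may differ.

```python
import subprocess
from typing import List, Optional, Tuple


def build_arm_command(prompt: str, plugin_dir: Optional[str] = None) -> List[str]:
    """Build an argv for one arm; only skill arms get a plugin directory."""
    cmd = ["claude", "-p", prompt, "--setting-sources", "project,local"]
    if plugin_dir is not None:
        cmd += ["--plugin-dir", plugin_dir]
    return cmd


def run_with_timeout(cmd: List[str], timeout_s: float) -> Tuple[Optional[int], str]:
    """Run cmd; on timeout kill the child and drain its pipes.

    Returns (returncode, stdout), or (None, "") if the run timed out.
    Grandchild processes are not handled in this sketch.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        out, _err = proc.communicate(timeout=timeout_s)
        return proc.returncode, out
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the child so the pipes are closed
        return None, ""
```

The point of building the baseline command without `--plugin-dir` is that no plugin hook can fire in that arm, which is the contamination this PR fixes.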

Writeup: benchmarks/results/2026-06-18-agentic.md · reproduction: benchmarks/agentic/README.md.

Addresses #126.

🤖 Generated with Claude Code

Rebuild the benchmark to the standard #126 asked for: real headless Claude Code
sessions (not a bare model) editing a real public repo
(tiangolo/full-stack-fastapi-template @ cd83fc1, MIT), fair arms (baseline,
caveman, ponytail, and the "YAGNI + one-liners" prompt), n=4, Haiku 4.5. LOC is
the git diff; the safety tasks execute the produced code against adversarial
input.

Results: ponytail -54% LOC mean (up to -94% on over-build features like the
date/color picker), -22% tokens, -20% cost, -27% time, and never more than
baseline; 100% safe vs the one-liner prompt's 95% (it dropped a path-traversal
guard once). caveman writes less code but spends more tokens.

Also fixes a baseline-contamination bug (the ponytail plugin's SessionStart hook
fired on every arm; now isolated with --setting-sources project,local + per-arm
--plugin-dir) and a Windows subprocess-timeout hang.

Lead both READMEs with the agentic numbers; demote the single-shot 80-94% to a
labelled "isolated generation" note; supersede the contaminated 2026-06-17
writeup. Dead react-app fixture left untracked.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DietrichGebert merged commit b8d6aa7 into main on Jun 18, 2026
1 check passed
DietrichGebert added a commit that referenced this pull request Jun 18, 2026
Grouped bars of LOC, tokens, cost and time as a % of the no-skill baseline
(lower is leaner/cheaper/faster), plus a separate safety strip (baseline,
caveman and ponytail 100%; yagni-oneliner 95%). System-gray palette so it reads
on both GitHub themes. The chart commits landed after #158 had already
squash-merged, so this brings the chart onto main.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>