benchmarks: add system prompt to baseline arm so it doesn't ramble (closes #126) by Fato07 · Pull Request #128 · DietrichGebert/ponytail

Fato07 · 2026-06-17T01:14:00Z

What

Add a one-line system prompt to the baseline benchmark arm so the comparison against ponytail and caveman is apples-to-apples.

Why

Closes #126. Without a system prompt, the baseline model often responds with multiple options and usage examples, inflating LOC ~7× for reasons unrelated to the test. The system prompt is the one Colin Eberhardt proposed in the issue and matches the style of the other arm files.
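To make the proposed change concrete, here is a minimal sketch of how a benchmark arm might turn into a chat request with and without a system prompt. Everything in it is an assumption for illustration: the arm shape (`name`, `systemPrompt`), the `buildMessages` helper, and the prompt wording (paraphrased from the review discussion) are hypothetical and not taken from the repository's arm files.

```javascript
// Hypothetical arm shape { name, systemPrompt }; not the repository's actual format.
function buildMessages(arm, task) {
  const messages = [];
  // Only arms that define a system prompt send a system message.
  if (arm.systemPrompt) {
    messages.push({ role: "system", content: arm.systemPrompt });
  }
  messages.push({ role: "user", content: task });
  return messages;
}

// Baseline before this PR: the task and nothing else.
const bareBaseline = { name: "baseline", systemPrompt: null };

// Baseline as proposed here: a one-line prompt (wording is illustrative).
const promptedBaseline = {
  name: "baseline",
  systemPrompt: "Provide just one example, no commentary.",
};
```

With this shape, `buildMessages(bareBaseline, task)` yields a single user message, while `buildMessages(promptedBaseline, task)` prepends a system message, which is the behavioral difference the PR introduces.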

Diff

1 file, +6 / -2. Doesn't touch SKILL.md, AGENTS.md, the canary, or any trust-boundary code. node scripts/check-rule-copies.js ✅ · npm test ✅ (11/11).


DietrichGebert · 2026-06-18T23:44:33Z

Heads up @Fato07: I reverted this in #175 and want to explain why, since the intent was good.

A baseline arm has to be the pure model: the task and nothing else. Adding a system prompt ("provide just one example, no commentary") changes the control's behavior, which means two things break. The "baseline (no skill)" label stops being true, and the published single-shot numbers (baseline 518/693/256 LOC, the 80-94% figure) were measured against the bare baseline and never recomputed, so the code and the table no longer match.

The deeper reason: any prompt on the baseline tilts the result. "Write the minimum amount of code" would erase ponytail's gap; the opposite would inflate it. So the only honest control is no prompt at all. The rambling concern from #126 is real, but the agentic benchmark already answers it (that's the headline number now), so the single-shot baseline doesn't need de-rambling.
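One way to keep the control arm bare over time would be a small guard in the test suite. The sketch below assumes a hypothetical arm list where each arm is `{ name, systemPrompt }`; the `assertBareControl` helper and the placeholder prompt strings are invented for illustration and are not part of the repository.

```javascript
// Hypothetical guard: fails if the control arm carries any system prompt.
function assertBareControl(arms, controlName = "baseline") {
  const control = arms.find((arm) => arm.name === controlName);
  if (!control) {
    throw new Error(`no arm named "${controlName}"`);
  }
  const prompt = control.systemPrompt;
  // Treat null, undefined, and whitespace-only strings as "no prompt".
  if (prompt != null && String(prompt).trim() !== "") {
    throw new Error(`control arm "${controlName}" must not carry a system prompt`);
  }
  return true;
}

// Example arm list; the skill prompt contents are placeholders.
const exampleArms = [
  { name: "baseline", systemPrompt: null },
  { name: "ponytail", systemPrompt: "<ponytail skill text>" },
  { name: "caveman", systemPrompt: "<caveman skill text>" },
];

assertBareControl(exampleArms); // passes: the baseline has no prompt
```

A check like this would have flagged the change in this PR at test time, instead of leaving the code and the published table out of sync.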

Thanks for the contribution, the instinct was right, it just can't live on the control arm.

benchmarks: add system prompt to baseline arm so it doesn't ramble (closes DietrichGebert#126) · 80f9fcb

DietrichGebert merged commit 37f46b8 into DietrichGebert:main Jun 18, 2026

DietrichGebert mentioned this pull request Jun 18, 2026

Revert #128: a benchmark baseline must be the bare model, no system prompt #175

Merged

DietrichGebert added a commit that referenced this pull request Jun 18, 2026

Revert "benchmarks: add system prompt to baseline arm so it doesn't ramble (closes #126) (#128)" (#175) · 48cdf05

This reverts commit 37f46b8.

DietrichGebert merged 1 commit into DietrichGebert:main from Fato07:ponytail/2026-06-17-baseline-sysprompt

Fato07 commented Jun 17, 2026

Uh oh!

DietrichGebert commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Fato07 commented Jun 17, 2026

Uh oh!

DietrichGebert commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants