benchmarks: add system prompt to baseline arm so it doesn't ramble (closes #126)#128

Merged
DietrichGebert merged 1 commit into DietrichGebert:main from Fato07:ponytail/2026-06-17-baseline-sysprompt
Jun 18, 2026

@Fato07 commented Jun 17, 2026

Contributor

What

Add a one-line system prompt to the baseline benchmark arm so the comparison against ponytail and caveman is apples-to-apples.
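For illustration only, the difference between a bare arm and a prompted arm can be sketched as below, assuming a chat-style message list. `buildMessages`, the task text, and the prompt wording are hypothetical and are not taken from the repo's benchmark code.

```javascript
// Hypothetical sketch: the function name and message shape are assumptions,
// not the repo's actual benchmark harness.
function buildMessages(task, systemPrompt) {
  const messages = [];
  // A bare arm passes no system prompt, so the model sees only the task.
  if (systemPrompt) {
    messages.push({ role: "system", content: systemPrompt });
  }
  messages.push({ role: "user", content: task });
  return messages;
}

// Illustrative task and prompt text (assumed wording).
const task = "Write a function that reverses a string.";
const bare = buildMessages(task);
const prompted = buildMessages(task, "Provide just one example, no commentary.");

console.log(bare.length, prompted.length); // prints "1 2"
```

The only difference between the two arms in this sketch is the extra `system` message, which is the change this PR makes to the baseline arm.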

Why

Closes #126. Without a system prompt, the baseline model often responds with multiple options and usage examples, inflating LOC ~7× for reasons unrelated to the test. The system prompt is the one Colin Eberhardt proposed in the issue and matches the style of the other arm files.

Diff

1 file, +6 / -2. Doesn't touch SKILL.md, AGENTS.md, the canary, or any trust-boundary code. node scripts/check-rule-copies.js ✅ · npm test ✅ (11/11).

@DietrichGebert

Owner

Heads up @Fato07: I reverted this in #175 and want to explain why, since the intent was good.

A baseline arm has to be the pure model: the task and nothing else. Adding a system prompt ("provide just one example, no commentary") changes the control's behavior, which means two things break. The "baseline (no skill)" label stops being true, and the published single-shot numbers (baseline 518/693/256 LOC, the 80-94% figure) were measured against the bare baseline and never recomputed, so the code and the table no longer match.

The deeper reason: any prompt on the baseline tilts the result. "Write the minimum amount of code" would erase ponytail's gap; the opposite would inflate it. So the only honest control is no prompt at all. The rambling concern from #126 is real, but it's already answered by the agentic benchmark (that's the headline number now), so the single-shot baseline doesn't need de-rambling.
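A small arithmetic sketch of why the baseline's prompt moves the headline percentage. Only the 518 LOC baseline and the ~7× factor come from this thread; the skill arm's LOC (60) is a made-up number for illustration, and `reductionPct` is a hypothetical helper, not a function from the repo.

```javascript
// Percent reduction in lines of code relative to a baseline.
function reductionPct(baselineLoc, skillLoc) {
  return ((baselineLoc - skillLoc) / baselineLoc) * 100;
}

const bareBaseline = 518; // published bare-baseline LOC quoted in this thread
const skillLoc = 60;      // HYPOTHETICAL skill-arm LOC, for illustration only
// Assumes the system prompt shrinks baseline output by the ~7x cited in #126.
const promptedBaseline = Math.round(bareBaseline / 7); // 74

console.log(reductionPct(bareBaseline, skillLoc).toFixed(1));     // prints "88.4"
console.log(reductionPct(promptedBaseline, skillLoc).toFixed(1)); // prints "18.9"
```

With the skill arm held fixed, the reported reduction depends entirely on which baseline it is divided by, so numbers measured against the bare baseline cannot be reused once the baseline carries a prompt.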

Thanks for the contribution. The instinct was right; it just can't live on the control arm.

Development

Successfully merging this pull request may close these issues.

Benchmark issues - baseline scores are ~7 times better

2 participants