Skip to content

Revert #128: a benchmark baseline must be the bare model, no system prompt#175

Merged
DietrichGebert merged 1 commit into
mainfrom
revert/128-baseline-system-prompt
Jun 18, 2026
Merged

Revert #128: a benchmark baseline must be the bare model, no system prompt#175
DietrichGebert merged 1 commit into
mainfrom
revert/128-baseline-system-prompt

Conversation

@DietrichGebert

Copy link
Copy Markdown
Owner

Reverts #128.

A baseline arm is the control: the bare model with the task and nothing else. #128 gave it a system prompt ("Provide just one example for any given task, and no commentary or usage examples"), which:

  • makes the "baseline (no skill)" label false (it now carries a hand-written instruction),
  • desyncs the published numbers from the code: the single-shot table (baseline 518/693/256 LOC, "80-94% less code") was measured against the bare baseline and was never recomputed, so main shipped a baseline that would produce different numbers than the ones printed beside it,
  • was undisclosed.

Any prompt on the baseline tilts the comparison ("write the minimum amount of code" would erase the gap; the opposite would inflate it). The fair control is no prompt at all. The rambling critique in #126 is already answered by the agentic benchmark, so the single-shot baseline does not need de-rambling.

🤖 Generated with Claude Code

@tosage05

Copy link
Copy Markdown

lol?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants