Fix/examples issue 127 #131

Merged
DietrichGebert merged 7 commits into main from fix/examples-issue-127 on Jun 17, 2026
Conversation

@DietrichGebert

Owner

No description provided.

DietrichGebert and others added 7 commits June 17, 2026 03:47
Re-ran the cost benchmark at 30 reps per cell on Claude (Haiku/Sonnet/Opus):
ponytail is 42-75% cheaper than no-skill, not the previously published 47-77%.
The direction holds, but both ends came in a few points lower. Updates the
README headline and body, the benchmark chart subtitle, and the
benchmarks/README cost table, and adds a dated results doc with full method.

Also adds the OpenAI (gpt-4.1-mini/gpt-5.4-mini/gpt-5.5) and Gemini configs. On
OpenAI reasoning models ponytail costs more, not less, so the claim stays
Claude-scoped. Gemini run pending a fresh-quota day.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
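For readers checking the "N% cheaper" figures above, the following is a minimal sketch of how such a percentage could be derived from per-run costs. The helper names `mean` and `percentCheaper` and the sample numbers are hypothetical; the repository's benchmark script may aggregate differently (for example per cell rather than per arm).

```javascript
// Hypothetical helper: arithmetic mean of per-run costs (USD).
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Percent saved by the skill arm relative to the no-skill baseline.
// Positive means the skill arm is cheaper, negative means it costs more.
function percentCheaper(baselineCosts, skillCosts) {
  const base = mean(baselineCosts);
  const skill = mean(skillCosts);
  return ((base - skill) / base) * 100;
}

// Illustrative numbers only, not benchmark data.
const cheaper = percentCheaper([0.010, 0.010], [0.005, 0.005]);
const dearer = percentCheaper([0.010, 0.010], [0.012, 0.012]);
console.log(cheaper.toFixed(1), dearer.toFixed(1));
```

A negative result corresponds to the OpenAI reasoning-model case described above, where the skill arm costs more than the baseline.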
"on every model" read as cross-provider, but the 30-rep verification shows
the cost win reverses on OpenAI reasoning models. Match the caption and
benchmarks/README, which already say Claude.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cost/code/latency numbers vary by model and on some (terse reasoning
models like GPT-5.5) ponytail costs more, so leading with them as a universal
win was misleading. Adds model-variance to the headline caption and a paragraph
making the stated point the mental model: write only what the task needs,
safety kept, maintainable code. Savings are a model-dependent side effect.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ladder is a deliberation step: on reasoning models the agent spends
thinking tokens working through the rungs before it saves any output, which
together with the always-on ruleset can outweigh the shorter code. Makes the
GPT-5.5 cost increase legible rather than just stating it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The benchmark is single-shot (one prompt, one completion); it does not measure
a real multi-turn agent session, where the ruleset re-injects and the ladder
deliberates every turn. Adds that caveat to the README, and corrects the
benchmarks/README note that claimed caching widens the gap "in ponytail's
favor" (unverified, and a measured agentic A/B in #121 found the opposite can
happen). Per-session cost can land either way.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
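To make the single-shot versus multi-turn caveat concrete, here is a toy cost model. It is only a sketch: the prices, token counts, and field names are hypothetical placeholders, not values from the benchmark or from #121, and it ignores prompt tokens that both arms share as well as any caching.

```javascript
// Toy per-session cost model. All numbers are hypothetical placeholders.
const PRICE_IN = 1;  // cost units per input token (assumed)
const PRICE_OUT = 5; // cost units per output or thinking token (assumed)

// Cost without the skill: only the output tokens of each turn.
function baselineCost(turns) {
  return turns.reduce((sum, t) => sum + t.baselineOutput * PRICE_OUT, 0);
}

// Cost with the skill: every turn also pays for the re-injected ruleset
// (input) and the ladder deliberation (thinking), whatever it outputs.
function skillCost(turns, rulesetTokens, thinkingTokens) {
  const overhead = rulesetTokens * PRICE_IN + thinkingTokens * PRICE_OUT;
  return turns.reduce((sum, t) => sum + overhead + t.skillOutput * PRICE_OUT, 0);
}

// Single-shot: one large code-generation turn.
const singleShot = [{ baselineOutput: 1000, skillOutput: 300 }];

// Multi-turn: the same code turn plus nine short turns with little to trim.
const shortTurn = { baselineOutput: 50, skillOutput: 40 };
const session = [singleShot[0], ...Array(9).fill(shortTurn)];

const RULESET = 500;  // assumed ruleset size in tokens
const THINKING = 200; // assumed deliberation tokens per turn

console.log(baselineCost(singleShot), skillCost(singleShot, RULESET, THINKING));
console.log(baselineCost(session), skillCost(session, RULESET, THINKING));
```

Under these made-up numbers the skill arm is cheaper in the single-shot case and more expensive across the ten-turn session, because the fixed per-turn overhead is paid even on turns where there is little output to save. Different assumptions can flip the result, which is the point of the caveat.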
Cost was re-verified at 30 reps; code and latency are still the original 10.
The headline caption said "10 runs" across the board, which undersold the cost
verification. Now states the split, matching benchmarks/README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The examples/ before/after blocks were authored by hand, not produced by a
model. Issue #127 correctly noted that nobody hand-rolls quicksort for "sort
this array" - every model just calls .sort(). Regenerate all examples verbatim
from a real benchmark run (Claude Haiku 4.5, no-skill arm vs ponytail arm,
benchmarks/output.json) so the before/after is reproducible, not authored:

  email 75->3, debounce 116->10, csv 20->3, countdown 267->9, rate-limit 128->10 LOC

- Delete sorting.md (pure strawman) plus the other hand-written caricatures
  (api-endpoint, caching, date-picker)
- Add benchmarks/generate-examples.mjs to regenerate examples from any run
- examples/README.md indexes the set and documents how to reproduce

Closes #127

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
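As an illustration of how "before -> after" LOC figures like those above could be derived from a run, here is a minimal sketch. It is not the contents of benchmarks/generate-examples.mjs: the shape of the `run` object, the `countLoc` definition (non-blank lines), and the `demo` task are all assumptions made for this example.

```javascript
// Count non-blank lines in a completion. This is one simple LOC definition;
// the repository's script may count differently.
function countLoc(code) {
  return code.split("\n").filter((line) => line.trim() !== "").length;
}

// Hypothetical shape for one benchmark run: task name -> completion per arm.
const run = {
  demo: {
    noSkill: "function add(a, b) {\n  const sum = a + b;\n\n  return sum;\n}",
    ponytail: "const add = (a, b) => a + b;",
  },
};

// Build one "task before->after" summary entry per task.
function summarize(runData) {
  return Object.entries(runData).map(
    ([task, arms]) => `${task} ${countLoc(arms.noSkill)}->${countLoc(arms.ponytail)}`
  );
}

const summary = summarize(run);
console.log(summary.join(", "));
```

Deriving the numbers from stored completions, rather than typing them by hand, is what makes the before/after comparison reproducible from any run.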
DietrichGebert merged commit 45f7d2f into main on Jun 17, 2026
1 check passed