Fix/examples issue 127 by DietrichGebert · Pull Request #131 · DietrichGebert/ponytail

DietrichGebert · 2026-06-17T02:34:48Z

No description provided.

Re-ran the cost benchmark at 30 reps per cell on Claude (Haiku/Sonnet/Opus): ponytail is 42-75% cheaper than no-skill, not the previously published 47-77%. The direction holds, but both ends came in a few points lower. Updates the README headline and body, the benchmark chart subtitle, and the benchmarks/README cost table, and adds a dated results doc with the full method. Also adds the OpenAI (gpt-4.1-mini/gpt-5.4-mini/gpt-5.5) and Gemini configs. On OpenAI reasoning models ponytail costs more, not less, so the claim stays Claude-scoped. The Gemini run is pending a fresh-quota day. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
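As a rough illustration of how a "percent cheaper" figure for one benchmark cell can be derived from repeated runs, here is a minimal sketch. The function names and the sample costs are hypothetical and are not taken from the repository or its benchmark output; it assumes the figure is the relative difference of mean per-run cost between the no-skill arm and the ponytail arm.

```javascript
// Sketch only: assumes "percent cheaper" means the relative difference
// between the mean cost of the no-skill arm and the mean cost of the
// ponytail arm for one cell (one model, one task).

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Positive result: the skill arm is cheaper. Negative: it costs more.
function percentCheaper(noSkillCosts, skillCosts) {
  const base = mean(noSkillCosts);
  const skill = mean(skillCosts);
  return ((base - skill) / base) * 100;
}

// Hypothetical per-run costs in arbitrary units (NOT benchmark data).
const noSkillRuns = [100, 120, 80];
const ponytailRuns = [40, 50, 30];
console.log(percentCheaper(noSkillRuns, ponytailRuns)); // 60

// A negative value represents the reversed case, where the skill arm costs more.
console.log(percentCheaper([100, 100], [150, 150])); // -50
```

With 30 reps per cell, each array would hold 30 entries; the sign of the result is what distinguishes the Claude case from the reversed one.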

"on every model" read as cross-provider, but the 30-rep verification shows the cost win reverses on OpenAI reasoning models. Match the caption and benchmarks/README, which already say Claude. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The cost/code/latency numbers vary by model, and on some (terse reasoning models like GPT-5.5) ponytail costs more, so leading with them as a universal win was misleading. Adds a model-variance note to the headline caption and a paragraph that makes the stated point the mental model: write only what the task needs, with safety kept and the code maintainable. Savings are a model-dependent side effect. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The ladder is a deliberation step: on reasoning models the agent spends thinking tokens working through the rungs before it saves any output tokens, and that overhead, together with the always-on ruleset, can outweigh the shorter code. Makes the GPT-5.5 cost increase legible rather than just stating it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The benchmark is single-shot (one prompt, one completion); it does not measure a real multi-turn agent session, where the ruleset re-injects and the ladder deliberates every turn. Adds that caveat to the README, and corrects the benchmarks/README note that claimed caching widens the gap "in ponytail's favor" (unverified, and a measured agentic A/B in #121 found the opposite can happen). Per-session cost can land either way. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
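To make the "per-session cost can land either way" point concrete, the sketch below uses a deliberately simplified cost model. Everything in it is an assumption for illustration: the parameter names, the token counts, the prices, and the premise that thinking tokens are billed at the output rate. It also ignores conversation history growth and caching, so it is a toy, not a measurement.

```javascript
// Toy model (assumptions, not benchmark data): each turn pays for the
// re-injected ruleset as input, and for thinking plus visible output
// as output tokens.
function sessionCost({ turns, rulesetTokens, thinkingTokens, outputTokens, inputPrice, outputPrice }) {
  const inputCost = turns * rulesetTokens * inputPrice;
  const outputCost = turns * (thinkingTokens + outputTokens) * outputPrice;
  return inputCost + outputCost;
}

const prices = { inputPrice: 1, outputPrice: 5 }; // arbitrary units per token
const withSkill = { turns: 10, rulesetTokens: 400, thinkingTokens: 100, outputTokens: 50, ...prices };

// Hypothetical verbose model: long baseline output, so the skill saves money.
const verboseBaseline = { turns: 10, rulesetTokens: 0, thinkingTokens: 0, outputTokens: 1000, ...prices };
// Hypothetical terse model: short baseline output, so the overhead dominates.
const terseBaseline = { turns: 10, rulesetTokens: 0, thinkingTokens: 0, outputTokens: 200, ...prices };

console.log(sessionCost(verboseBaseline)); // 50000
console.log(sessionCost(terseBaseline));   // 10000
console.log(sessionCost(withSkill));       // 11500
```

Under these made-up numbers the skill arm is cheaper than the verbose baseline and more expensive than the terse one, which is the shape of the caveat above.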

Cost was re-verified at 30 reps; code and latency are still at the original 10 reps. The headline caption said "10 runs" across the board, which undersold the cost verification. It now states the split, matching benchmarks/README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The examples/ before/after blocks were authored by hand, not produced by a model. Issue #127 correctly noted that nobody hand-rolls quicksort for "sort this array" - every model just calls .sort().
Regenerate all examples verbatim from a real benchmark run (Claude Haiku 4.5, no-skill arm vs ponytail arm, benchmarks/output.json) so the before/after is reproducible, not authored: email 75->3, debounce 116->10, csv 20->3, countdown 267->9, rate-limit 128->10 LOC
- Delete sorting.md (pure strawman) plus the other hand-written caricatures (api-endpoint, caching, date-picker)
- Add benchmarks/generate-examples.mjs to regenerate examples from any run
- examples/README.md indexes the set and documents how to reproduce
Closes #127
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
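For reference, the LOC pairs quoted in this commit message translate into the following relative reductions. This is a small arithmetic sketch, not part of the repository: the `reductionPercent` helper and the object layout are illustrative names, while the before/after numbers are the ones stated above.

```javascript
// LOC before (no-skill arm) and after (ponytail arm), as quoted in the commit message.
const locPairs = {
  email: [75, 3],
  debounce: [116, 10],
  csv: [20, 3],
  countdown: [267, 9],
  "rate-limit": [128, 10],
};

// Relative reduction in percent, rounded to one decimal place.
function reductionPercent(before, after) {
  return Math.round(((before - after) / before) * 1000) / 10;
}

for (const [name, [before, after]] of Object.entries(locPairs)) {
  console.log(`${name}: ${before} -> ${after} LOC (${reductionPercent(before, after)}% fewer lines)`);
}
// email: 96, debounce: 91.4, csv: 85, countdown: 96.6, rate-limit: 92.2
```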

DietrichGebert and others added 7 commits June 17, 2026 03:47

DietrichGebert merged commit 45f7d2f into main Jun 17, 2026
1 check passed

DietrichGebert merged 7 commits into main from fix/examples-issue-127
