Fix/examples issue 127 #131
Merged
Re-ran the cost benchmark at 30 reps per cell on Claude (Haiku/Sonnet/Opus): ponytail is 42-75% cheaper than no-skill, not the previously published 47-77%. The direction holds; both ends came in a few points lower. Updates the README headline and body, the benchmark chart subtitle, and the benchmarks/README cost table, and adds a dated results doc with the full method. Also adds the OpenAI (gpt-4.1-mini/gpt-5.4-mini/gpt-5.5) and Gemini configs. On OpenAI reasoning models ponytail costs more, not less, so the claim stays Claude-scoped. The Gemini run is pending a fresh-quota day. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
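For readers checking the "N% cheaper" figures, here is a minimal sketch of the arithmetic, assuming each cell is summarized by the mean cost of its runs. The helper names `mean` and `percentCheaper` are hypothetical and are not taken from the benchmark code.

```javascript
// Hypothetical helper: arithmetic mean of per-run costs for one cell.
function mean(costs) {
  if (costs.length === 0) throw new Error("no runs in this cell");
  return costs.reduce((sum, c) => sum + c, 0) / costs.length;
}

// Percent by which the skill arm is cheaper than the no-skill arm.
// A negative result means the skill arm cost more.
function percentCheaper(noSkillCosts, skillCosts) {
  const baseline = mean(noSkillCosts);
  return ((baseline - mean(skillCosts)) / baseline) * 100;
}
```

A negative value from `percentCheaper` corresponds to the reversal reported for OpenAI reasoning models.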
"on every model" read as cross-provider, but the 30-rep verification shows the cost win reverses on OpenAI reasoning models. Match the caption and benchmarks/README, which already say Claude. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The cost/code/latency numbers vary by model, and on some (terse reasoning models like GPT-5.5) ponytail costs more, so leading with them as a universal win was misleading. Adds model variance to the headline caption and a paragraph that makes the mental model the stated point: write only what the task needs, with safety kept and the code maintainable. Savings are a model-dependent side effect. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The ladder is a deliberation step: on reasoning models the agent spends thinking tokens working through the rungs before it saves any output tokens, and that spend, together with the always-on ruleset, can outweigh the shorter code. Makes the GPT-5.5 cost increase legible rather than just stating it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
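The trade-off described above can be sketched as a toy cost model. It assumes thinking tokens are billed at the output rate and the ruleset is billed as extra input. The function name, field names, and every number in the usage note are hypothetical placeholders, not measurements from the benchmark.

```javascript
// Toy single-completion cost model (an illustration, not the benchmark's accounting).
// Assumption: thinking tokens are billed like output tokens; ruleset tokens are extra input.
function completionCost(usage, price) {
  const { promptTokens, rulesetTokens = 0, thinkingTokens = 0, outputTokens } = usage;
  const inputCost = (promptTokens + rulesetTokens) * price.inputPerToken;
  const outputCost = (thinkingTokens + outputTokens) * price.outputPerToken;
  return inputCost + outputCost;
}
```

With placeholder prices of 1 unit per input token and 4 per output token, a run with a 100-token prompt, 500 thinking tokens, and 1000 output tokens costs 6100 units. Adding an 800-token ruleset, raising thinking to 1500 tokens, and cutting output to 100 tokens costs 7300 units, so the shorter code does not pay for itself. With no thinking tokens in either arm, the same two runs cost 4100 and 1300 units.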
The benchmark is single-shot (one prompt, one completion); it does not measure a real multi-turn agent session, where the ruleset re-injects and the ladder deliberates every turn. Adds that caveat to the README, and corrects the benchmarks/README note that claimed caching widens the gap "in ponytail's favor" (unverified, and a measured agentic A/B in #121 found the opposite can happen). Per-session cost can land either way. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cost was re-verified at 30 reps; code and latency are still at the original 10 reps. The headline caption said "10 runs" across the board, which undersold the cost verification. It now states the split, matching benchmarks/README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The examples/ before/after blocks were authored by hand, not produced by a model. Issue #127 correctly noted that nobody hand-rolls quicksort for "sort this array"; every model just calls .sort(). Regenerate all examples verbatim from a real benchmark run (Claude Haiku 4.5, no-skill arm vs ponytail arm, benchmarks/output.json) so the before/after is reproducible, not authored: email 75->3, debounce 116->10, csv 20->3, countdown 267->9, rate-limit 128->10 LOC
- Delete sorting.md (pure strawman) plus the other hand-written caricatures (api-endpoint, caching, date-picker)
- Add benchmarks/generate-examples.mjs to regenerate examples from any run
- examples/README.md indexes the set and documents how to reproduce
Closes #127
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
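As an illustration of how before/after LOC pairs like those above could be derived from a run, here is a minimal sketch. It is not the actual benchmarks/generate-examples.mjs. The record shape `{ task, arm, code }`, the helper names, and the choice to count only non-blank lines are all assumptions.

```javascript
// Count non-blank lines. Whether the repo's LOC figures exclude blank lines is an assumption.
function countLoc(code) {
  return code.split("\n").filter((line) => line.trim() !== "").length;
}

// Hypothetical record shape: { task, arm, code }, where arm is "no-skill" or "ponytail".
// Returns one "task before->after" string per task, in first-seen order.
function summarizeLoc(records) {
  const byTask = new Map();
  for (const { task, arm, code } of records) {
    const entry = byTask.get(task) ?? {};
    entry[arm] = countLoc(code);
    byTask.set(task, entry);
  }
  return [...byTask].map(
    ([task, loc]) => `${task} ${loc["no-skill"]}->${loc["ponytail"]}`
  );
}
```

Deriving the numbers from the recorded output, rather than typing them by hand, is what keeps the examples reproducible from any run.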
No description provided.