fix(benchmarks): portable plugin-dir resolution for agentic arms #170
Merged
DietrichGebert merged 1 commit on Jun 18, 2026
The ponytail/caveman arms hardcoded one machine's Windows plugin-cache paths (C:\Users\Dietr\...), so only baseline/yagni/yagni-oneliner were reproducible off the maintainer's box, undercutting the "fully reproducible" claim the rebuilt benchmark (DietrichGebert#126) was meant to establish.
Resolve per-arm at use-site: env override (PONYTAIL_PLUGIN_DIR / CAVEMAN_PLUGIN_DIR) -> latest version dir under ~/.claude/plugins/cache -> clear sys.exit. No pinned version/hash. Selftest extended to cover env-override and missing-install paths.
Fixes DietrichGebert#169
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Owner
This is exactly what I wanted. Ran the selftest, env override and the loud exit both pass, existing stuff still green. Clean. One tiny thing for later: Merging as-is. Thanks for this.
Fixes #169.
Problem
benchmarks/agentic/run.py hardcoded the maintainer's personal Windows plugin-cache paths (C:\Users\Dietr\...). Passed to --plugin-dir, these only exist on one machine. Everyone else's ponytail/caveman arms get a non-existent dir, the activation smoke test fails, and the skill never loads, so the headline 54% result and the safety-tier comparison can't be independently reproduced. That is the exact "fully reproducible" bar the rebuilt benchmark (#126, answering the Scott Logic critique) set out to meet.
Fix
Resolve each plugin dir at use-site, portably:
1. Env override: PONYTAIL_PLUGIN_DIR / CAVEMAN_PLUGIN_DIR.
2. Otherwise, the latest version dir under ~/.claude/plugins/cache/<plugin>/<plugin>/ (glob, no pinned version or content hash).
3. Otherwise, a clear sys.exit naming the env var to set.
Resolved only for the arm actually run, so a missing caveman install can't block a ponytail-only run. No hardcoded paths, explicit filesystem error handling.
Test verification (RED → GREEN)
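For orientation, here is a minimal sketch of the resolution order from the Fix section, i.e. the behavior the selftest exercises. It is not the code in run.py: the function name resolve_plugin_dir, the ENV_VARS mapping, the env/home parameters, and the numeric version ordering used to pick the "latest" directory are all assumptions made for illustration.

```python
import os
import sys
from pathlib import Path

# Env var names are taken from the PR description; the mapping itself is hypothetical.
ENV_VARS = {"ponytail": "PONYTAIL_PLUGIN_DIR", "caveman": "CAVEMAN_PLUGIN_DIR"}


def _version_key(path):
    # Assumption: version dirs are dotted numbers; compare parts as integers
    # so that "1.10.0" sorts after "1.9.0". Non-numeric parts sort lowest.
    return [int(part) if part.isdigit() else -1 for part in path.name.split(".")]


def resolve_plugin_dir(plugin, env=None, home=None):
    """Resolve one plugin dir: env override, then latest cached version, then exit."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    var = ENV_VARS[plugin]

    # 1. Explicit override wins (returned as given; no existence check in this sketch).
    override = env.get(var)
    if override:
        return Path(override)

    # 2. Latest version dir under the plugin cache, with no pinned version or hash.
    cache = home / ".claude" / "plugins" / "cache" / plugin / plugin
    try:
        versions = [p for p in cache.iterdir() if p.is_dir()]
    except OSError:
        # Missing or unreadable cache dir is treated as "not installed".
        versions = []
    if versions:
        return max(versions, key=_version_key)

    # 3. Fail loudly, naming the env var the user can set.
    sys.exit(f"{plugin} plugin not found under {cache}; set {var} to its directory")
```

Calling resolve_plugin_dir("ponytail") only when the ponytail arm is actually run is what keeps a missing caveman install from blocking a ponytail-only run.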
Extended the existing --selftest with two checks: env-override honored, and missing-install fails loudly (sys.exit).
RED (resolver reverted to the old hardcoded behavior, test kept):
GREEN (with the fix):
🤖 Generated with Claude Code