Skip to content

fix(benchmarks): portable plugin-dir resolution for agentic arms#170

Merged
DietrichGebert merged 1 commit into
DietrichGebert:mainfrom
ousamabenyounes:fix/agentic-portable-plugin-dir
Jun 18, 2026
Merged

fix(benchmarks): portable plugin-dir resolution for agentic arms#170
DietrichGebert merged 1 commit into
DietrichGebert:mainfrom
ousamabenyounes:fix/agentic-portable-plugin-dir

Conversation

@ousamabenyounes

Copy link
Copy Markdown
Contributor

Fixes #169.

Problem

benchmarks/agentic/run.py hardcoded the maintainer's personal Windows plugin-cache paths:

PLUGIN_DIRS = {
    "ponytail": r"C:\Users\Dietr\.claude\plugins\cache\ponytail\ponytail\4.2.0",
    "caveman":  r"C:\Users\Dietr\.claude\plugins\cache\caveman\caveman\63e797cd753b",
}

Passed to --plugin-dir, these only exist on one machine. Everyone else's ponytail/caveman arms get a non-existent dir, the activation smoke test fails, and the skill never loads — so the headline 54% result and the safety-tier comparison can't be independently reproduced. That is the exact "fully reproducible" bar the rebuilt benchmark (#126, answering the Scott Logic critique) set out to meet.

Fix

Resolve each plugin dir at use-site, portably:

  1. Env override: PONYTAIL_PLUGIN_DIR / CAVEMAN_PLUGIN_DIR.
  2. Else latest version dir under ~/.claude/plugins/cache/<plugin>/<plugin>/ (glob — no pinned version or content hash).
  3. Else sys.exit naming the env var to set.

Resolved only for the arm actually run, so a missing caveman install can't block a ponytail-only run. No hardcoded paths, explicit filesystem error handling.

Test verification (RED → GREEN)

Extended the existing --selftest with two checks: env-override honored, and missing-install fails loudly (sys.exit).

RED — resolver reverted to the old hardcoded behavior, test kept:

XX  plugin_dir   env  override honored
XX  plugin_dir   miss clear error (sys.exit)
selftest: 2 BROKEN

GREEN — with the fix:

ok  plugin_dir   env  override honored
ok  plugin_dir   miss clear error (sys.exit)

selftest: all instruments valid

🤖 Generated with Claude Code

The ponytail/caveman arms hardcoded one machine's Windows plugin-cache
paths (C:\Users\Dietr\...), so only baseline/yagni/yagni-oneliner were
reproducible off the maintainer's box — undercutting the "fully
reproducible" claim the rebuilt benchmark (DietrichGebert#126) was meant to establish.

Resolve per-arm at use-site: env override (PONYTAIL_PLUGIN_DIR /
CAVEMAN_PLUGIN_DIR) -> latest version dir under ~/.claude/plugins/cache
-> clear sys.exit. No pinned version/hash. Selftest extended to cover
env-override and missing-install paths.

Fixes DietrichGebert#169

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@DietrichGebert

Copy link
Copy Markdown
Owner

This is exactly what I wanted. Ran the selftest, env override and the loud exit both pass, existing stuff still green. Clean.

One tiny thing for later: sorted(...)[-1] for "latest version" is lexical not semver, so if the cache ever has two versions, 4.9.0 beats 4.10.0 and you grab the wrong one. Cache usually has one dir so it basically never happens, not a blocker. Either max() with a version key or a ponytail: comment naming the limit, whenever you feel like it.

Merging as-is. Thanks for this.

@DietrichGebert DietrichGebert merged commit 44babb2 into DietrichGebert:main Jun 18, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Agentic benchmark: hardcoded personal plugin-cache paths break reproducibility of ponytail/caveman arms

2 participants