Skip to content

Audit-driven fixes: correctness, gate integrity, config-dir, CI/doc coverage#136

Open
ferran-valvia wants to merge 4 commits into
DietrichGebert:mainfrom
ferran-valvia:ponytail-audit-fixes
Open

Audit-driven fixes: correctness, gate integrity, config-dir, CI/doc coverage#136
ferran-valvia wants to merge 4 commits into
DietrichGebert:mainfrom
ferran-valvia:ponytail-audit-fixes

Conversation

@ferran-valvia

Copy link
Copy Markdown

What this is

A read-only senior audit of the repo turned into a set of small, verified fixes.
Each finding was adversarially verified before landing, and every change ships
with the runnable check that fails if it regresses. No behavior is changed beyond
the bullets below; the multi-platform rule copies (the product) are untouched.

All commits keep node scripts/check-rule-copies.js and npm test green
(77 checks), and the series applies cleanly on top of main.

Fixes

Correctness / state machine

  • mode-tracker double-output: a prompt that both switches mode and contains
    "stop ponytail" / "normal mode" emitted two JSON objects on one stdout,
    breaking the host's JSON.parse (Codex/Copilot) and leaving the emitted
    systemMessage contradicting the final state. The deactivation branch is now
    mutually exclusive with the command branch.
  • silent level reset on a typo: /ponytail <unknown-arg> fell through to
    getDefaultMode() and silently reset the active level in Claude Code and
    OpenCode. It is now a no-op (the active level is left alone), matching what the
    pi extension already does. Bare /ponytail still re-applies the default.

Benchmark correctness gate (benchmarks/correctness.js)

  • prose scored as a pass: the unfenced-fallback block is now tagged, and the
    run-free structural checks (countdown/ratelimit) require a real code construct,
    so prose that name-drops useState/FastAPI/429 no longer scores 1.
  • exit-0 scored as a pass: a model demo that calls sys.exit(0) /
    process.exit(0) before the appended asserts run used to pass on exit code
    alone. Each harness already prints a PASS sentinel on success; the gate now
    requires it.

Cross-platform config dir

  • The statusline scripts (.sh + .ps1) and the SessionStart setup nudge
    hardcoded ~/.claude, but the hooks write the flag under $CLAUDE_CONFIG_DIR
    when set. They now mirror getClaudeDir(), so the badge and nudge work when the
    config dir is relocated.

CI coverage

  • Fold the orphaned benchmarks/correctness.test.js (the issue Impact on model performance? #65 guard, never
    run by npm test/CI) into tests/correctness.test.js, plus regression cases
    for the two gate fixes above.
  • Add a bash smoke test for the statusline badge.
  • Extend the Windows hooks test to cover copilot-hooks.json (script existence)
    and assert hooks.jsoncopilot-hooks.json wire the same scripts.
  • Add a parity gate over the three marketplace.json manifests, and a
    name+description gate over the skill source frontmatter.

Docs / metadata

  • Reconcile the host/command matrix: Antigravity reuses gemini-extension.json
    (same as Gemini CLI) and surfaces commands as chat-skills, so it is documented
    as command-capable; Copilot CLI and OpenClaw are marked command-capable and
    "Copilot (editor)" as the instruction-only one; the bare /ponytail row is
    corrected (it re-applies the default, it does not report the level).
  • Document .agents/plugins/marketplace.json and add the missing Aider row.
  • Drop the unused license: MIT frontmatter key from skills/ponytail.

Deliberately not changed

  • mode-tracker still accepts a $ponytail prefix. It is undocumented and no
    known host emits it, but narrowing an input parser on an unverified assumption
    risks breaking a niche host for a cosmetic win, so it is left as-is.

Limits

  • Antigravity's exact command surface and the precise CLI verb that resolves
    .agents/plugins/marketplace.json were reasoned from the shared manifest and
    docs, not confirmed against a live host.
  • CI is ubuntu-only, so the PowerShell statusline path is static-checked, not
    executed; the bash statusline smoke test does run.

Ferran Torres and others added 4 commits June 17, 2026 13:05
- mode-tracker: a prompt that both switches mode and says "stop ponytail"
  no longer emits two JSON objects on one stdout (broke the host JSON.parse
  in Codex/Copilot); the deactivation branch is now mutually exclusive.
- mode-tracker / opencode: an unknown /ponytail arg (a typo) no longer
  silently resets the active level — it is left untouched, matching what the
  pi extension already does. Bare /ponytail still re-applies the default.
- benchmarks/correctness gate: prose that name-drops the structural keywords
  no longer scores as a pass on the run-free checks (countdown/ratelimit),
  and a model demo that exits 0 before the appended asserts run no longer
  counts as a pass — the harness already prints a PASS sentinel; we require it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hooks write the mode flag under $CLAUDE_CONFIG_DIR when set (getClaudeDir),
but the statusline scripts hardcoded ~/.claude, so the badge vanished whenever
a user relocated Claude's config dir. Both the bash and PowerShell statusline
now resolve the same dir. The SessionStart setup nudge likewise pointed at a
literal ~/.claude/settings.json; it now interpolates the resolved settings path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Fold the orphaned benchmarks/correctness.test.js (issue DietrichGebert#65 guard, never run
  by `npm test` or CI) into tests/correctness.test.js, and add regression cases
  for the prose-as-pass and exit-0-as-pass fixes.
- Add a bash smoke test for the statusline badge (both CLAUDE_CONFIG_DIR and the
  ~/.claude fallback).
- Extend the Windows hooks test to also validate copilot-hooks.json (script
  existence) and assert hooks.json and copilot-hooks.json wire the same scripts,
  so a hook added to one host manifest can't be silently forgotten in the other.
- Add a parity gate over the three marketplace.json manifests (parse, plugin
  name, shared description) and a name+description gate over the skill source
  frontmatter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- README + agent-portability: Antigravity reuses gemini-extension.json (same as
  Gemini CLI) and surfaces /ponytail commands as chat-skills, so it is listed as
  command-capable, resolving the README's self-contradiction.
- Mark Copilot CLI and OpenClaw as command-capable and qualify "Copilot (editor)"
  as the instruction-only one.
- Correct the bare /ponytail description: it re-applies the default level, it
  does not report the current one.
- Document .agents/plugins/marketplace.json (the .agents-standard marketplace
  manifest) and add the missing Aider row.
- Drop the unused `license: MIT` frontmatter key from skills/ponytail (no host
  reads it; the openclaw generator injects license uniformly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant