Audit-driven fixes: correctness, gate integrity, config-dir, CI/doc coverage #136
Open
ferran-valvia wants to merge 4 commits into
- mode-tracker: a prompt that both switches mode and says "stop ponytail" no longer emits two JSON objects on one stdout (broke the host JSON.parse in Codex/Copilot); the deactivation branch is now mutually exclusive.
- mode-tracker / opencode: an unknown /ponytail arg (a typo) no longer silently resets the active level; it is left untouched, matching what the pi extension already does. Bare /ponytail still re-applies the default.
- benchmarks/correctness gate: prose that name-drops the structural keywords no longer scores as a pass on the run-free checks (countdown/ratelimit), and a model demo that exits 0 before the appended asserts run no longer counts as a pass. The harness already prints a PASS sentinel; we require it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The hooks write the mode flag under $CLAUDE_CONFIG_DIR when set (getClaudeDir), but the statusline scripts hardcoded ~/.claude, so the badge vanished whenever a user relocated Claude's config dir. Both the bash and PowerShell statusline now resolve the same dir. The SessionStart setup nudge likewise pointed at a literal ~/.claude/settings.json; it now interpolates the resolved settings path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Fold the orphaned benchmarks/correctness.test.js (issue DietrichGebert#65 guard, never run by `npm test` or CI) into tests/correctness.test.js, and add regression cases for the prose-as-pass and exit-0-as-pass fixes.
- Add a bash smoke test for the statusline badge (both CLAUDE_CONFIG_DIR and the ~/.claude fallback).
- Extend the Windows hooks test to also validate copilot-hooks.json (script existence) and assert hooks.json and copilot-hooks.json wire the same scripts, so a hook added to one host manifest can't be silently forgotten in the other.
- Add a parity gate over the three marketplace.json manifests (parse, plugin name, shared description) and a name+description gate over the skill source frontmatter.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- README + agent-portability: Antigravity reuses gemini-extension.json (same as Gemini CLI) and surfaces /ponytail commands as chat-skills, so it is listed as command-capable, resolving the README's self-contradiction.
- Mark Copilot CLI and OpenClaw as command-capable and qualify "Copilot (editor)" as the instruction-only one.
- Correct the bare /ponytail description: it re-applies the default level, it does not report the current one.
- Document .agents/plugins/marketplace.json (the .agents-standard marketplace manifest) and add the missing Aider row.
- Drop the unused `license: MIT` frontmatter key from skills/ponytail (no host reads it; the openclaw generator injects license uniformly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What this is
A read-only senior audit of the repo turned into a set of small, verified fixes.
Each finding was adversarially verified before landing, and every change ships
with the runnable check that fails if it regresses. No behavior is changed beyond
the bullets below; the multi-platform rule copies (the product) are untouched.
All commits keep `node scripts/check-rule-copies.js` and `npm test` green (77 checks), and the series applies cleanly on top of `main`.
Fixes
Correctness / state machine
- A prompt that both switched mode and said "stop ponytail" / "normal mode" emitted two JSON objects on one stdout, breaking the host's `JSON.parse` (Codex/Copilot) and leaving the emitted `systemMessage` contradicting the final state. The deactivation branch is now mutually exclusive with the command branch.
- `/ponytail <unknown-arg>` fell through to `getDefaultMode()` and silently reset the active level in Claude Code and OpenCode. It is now a no-op (the active level is left alone), matching what the pi extension already does. Bare `/ponytail` still re-applies the default.
Benchmark correctness gate (`benchmarks/correctness.js`)
- The run-free structural checks (countdown/ratelimit) now require a real code construct, so prose that name-drops `useState`/`FastAPI`/`429` no longer scores 1.
- A model demo that calls `sys.exit(0)`/`process.exit(0)` before the appended asserts run used to pass on exit code alone. Each harness already prints a `PASS` sentinel on success; the gate now requires it.
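The two gate rules above can be illustrated with a small sketch. It is not the code in `benchmarks/correctness.js`: the names `gatePasses`, `extractCode` and `hasUseStateCall` are hypothetical, and reading "a real code construct" as "a call inside a fenced code block" is an assumption made only for illustration.

```javascript
// Hypothetical sketch; identifiers here are illustrative, not the repo's.
const PASS_SENTINEL = 'PASS';
const FENCE = '`'.repeat(3); // three backticks, built indirectly

// A run passes only if it exited 0 AND the harness reached the line that
// prints the sentinel. An early exit(0) in the model's demo never gets there.
function gatePasses({ exitCode, stdout }) {
  if (exitCode !== 0) return false;
  return stdout.split(/\r?\n/).some((line) => line.trim() === PASS_SENTINEL);
}

// Collect the contents of fenced code blocks in a model answer.
function extractCode(answer) {
  const fenceRe = new RegExp(FENCE + '[^\\n]*\\n([\\s\\S]*?)' + FENCE, 'g');
  const blocks = [];
  let match;
  while ((match = fenceRe.exec(answer)) !== null) blocks.push(match[1]);
  return blocks.join('\n');
}

// Run-free structural check (assumed form): the keyword must appear as a
// call inside code, so prose that merely mentions it does not count.
function hasUseStateCall(answer) {
  return /\buseState\s*\(/.test(extractCode(answer));
}
```

With this shape, an answer that only says "I would use useState" yields no code to inspect, and a demo that exits 0 without printing the sentinel is rejected.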
Cross-platform config dir
- The statusline scripts (`.sh` + `.ps1`) and the SessionStart setup nudge hardcoded `~/.claude`, but the hooks write the flag under `$CLAUDE_CONFIG_DIR` when set. They now mirror `getClaudeDir()`, so the badge and nudge work when the config dir is relocated.
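As a rough illustration of the resolution rule described above, the sketch below prefers `$CLAUDE_CONFIG_DIR` and falls back to `~/.claude`. The function names `resolveClaudeDir` and `settingsPath` are hypothetical; the real `getClaudeDir()` may handle edge cases differently.

```javascript
const os = require('node:os');
const path = require('node:path');

// Assumption: mirrors the behavior this PR attributes to getClaudeDir().
// An unset or blank CLAUDE_CONFIG_DIR falls back to ~/.claude.
function resolveClaudeDir(env = process.env) {
  const override = env.CLAUDE_CONFIG_DIR;
  if (typeof override === 'string' && override.trim() !== '') {
    return override;
  }
  return path.join(os.homedir(), '.claude');
}

// The setup nudge can then interpolate the resolved path rather than a
// literal ~/.claude/settings.json.
function settingsPath(env = process.env) {
  return path.join(resolveClaudeDir(env), 'settings.json');
}
```

Every consumer (hooks, statusline, nudge) calling the same resolver is what keeps the badge and the flag file in the same directory.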
CI coverage
- Fold the orphaned `benchmarks/correctness.test.js` (the guard for issue #65, "Impact on model performance?", never run by `npm test`/CI) into `tests/correctness.test.js`, plus regression cases for the two gate fixes above.
- Extend the Windows hooks test to also validate `copilot-hooks.json` (script existence) and assert `hooks.json` ↔ `copilot-hooks.json` wire the same scripts.
- Add a parity gate over the three `marketplace.json` manifests, and a name+description gate over the skill source frontmatter.
Docs / metadata
- Antigravity reuses `gemini-extension.json` (same as Gemini CLI) and surfaces commands as chat-skills, so it is documented as command-capable; Copilot CLI and OpenClaw are marked command-capable and "Copilot (editor)" as the instruction-only one; the bare `/ponytail` row is corrected (it re-applies the default, it does not report the level).
- Document `.agents/plugins/marketplace.json` and add the missing Aider row.
- Drop the unused `license: MIT` frontmatter key from `skills/ponytail`.
Deliberately not changed
- `mode-tracker` still accepts a `$ponytail` prefix. It is undocumented and no known host emits it, but narrowing an input parser on an unverified assumption risks breaking a niche host for a cosmetic win, so it is left as-is.
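The mode-tracker behavior described in this PR (one output per prompt, unknown arguments as a no-op, the retained `$ponytail` prefix) can be sketched as a single decision function. Everything here is illustrative: the level names, the default, the message text, and the choice that the command branch takes precedence over deactivation are assumptions, not taken from the repo.

```javascript
// Hypothetical level names and default; the real set lives in the repo.
const LEVELS = new Set(['lite', 'full']);
const DEFAULT_LEVEL = 'full';
const STOP_PHRASES = ['stop ponytail', 'normal mode'];

// Returns the next level plus at most ONE message for the host.
function nextState(prompt, activeLevel) {
  const text = prompt.trim().toLowerCase();

  // Command branch: "/ponytail" or the undocumented "$ponytail" prefix.
  const cmd = text.match(/^[\/$]ponytail(?:\s+(\S+)|$)/);
  if (cmd) {
    const arg = cmd[1];
    if (arg === undefined) {
      // Bare command re-applies the default level.
      return { level: DEFAULT_LEVEL, message: 'ponytail ' + DEFAULT_LEVEL };
    }
    if (LEVELS.has(arg)) return { level: arg, message: 'ponytail ' + arg };
    // Unknown argument (for example a typo): leave the active level alone.
    return { level: activeLevel, message: null };
  }

  // Deactivation branch: only reached when no command matched, so the two
  // branches can never both emit (precedence here is an assumption).
  if (STOP_PHRASES.some((phrase) => text.includes(phrase))) {
    return { level: null, message: 'ponytail off' };
  }
  return { level: activeLevel, message: null };
}

// Serialize at most one JSON object, so the host's JSON.parse sees one value.
function render(state) {
  return state.message ? JSON.stringify({ systemMessage: state.message }) : '';
}
```

Returning a single state object and serializing it once at the end is one way to make a double emit structurally impossible, rather than relying on each branch to remember not to print.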
Limits
- The Antigravity claims and `.agents/plugins/marketplace.json` were reasoned from the shared manifest and docs, not confirmed against a live host.
- […] executed; the bash statusline smoke test does run.