Add automatic skill evaluation pipeline (structural + rubric) #44
Open
chigichan24 wants to merge 5 commits into
- New `scripts/skill-evaluator.ts` exposing `validateStructure`, `scoreWithRubric`, `smokeFireTest` (stub), and `evaluateSkill`. Structural validation uses zod over a minimal frontmatter parser (name, description, allowed-tools/requires/next). Rubric scoring spawns `claude -p` with a strict-JSON prompt aligned to skill-creator conventions. Smoke firing test is intentionally a stub for now.
- Wire evaluator into `scripts/cli.ts` with `--skip-eval` and `--eval-model` flags. Per-candidate console output now reports structural pass/fail, rubric score, and improvement hints. Files are still written on failure so reviewers can inspect; gating is left to a follow-up issue.
- Extend `SkillCandidate` in `src/types/session.ts` with optional `evaluation` field so persisted data still parses.
- Unit tests cover frontmatter happy path, missing required fields, oversize/short description, malformed YAML, and JSON extraction helpers. Tests do not call `claude -p`.

Refs #20

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
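For readers without the diff open, the structural layer described above can be sketched roughly as follows. This is a dependency-free illustration, not the PR's code: the PR validates with zod, whereas this sketch uses hand-written checks, and the names `parseFrontmatter` and `checkFrontmatter` as well as the description length limits are assumptions made up for the example.

```typescript
// Dependency-free sketch. The PR itself validates with zod; everything
// named here is hypothetical and exists only for illustration.
type Frontmatter = Record<string, string>;

interface StructureResult {
  passed: boolean;
  errors: string[];
}

// Hypothetical limits; the PR's real thresholds are not shown on this page.
const MIN_DESCRIPTION = 20;
const MAX_DESCRIPTION = 1024;

// Parse flat `key: value` lines. Nested YAML is deliberately unsupported.
function parseFrontmatter(block: string): Frontmatter {
  const fields: Frontmatter = {};
  for (const line of block.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    const match = /^([A-Za-z][\w-]*):\s*(.*)$/.exec(line);
    if (!match) throw new Error(`Malformed frontmatter line: ${line}`);
    fields[match[1]] = match[2].trim();
  }
  return fields;
}

// Report every problem instead of stopping at the first one.
function checkFrontmatter(block: string): StructureResult {
  let fields: Frontmatter;
  try {
    fields = parseFrontmatter(block);
  } catch (err) {
    return { passed: false, errors: [(err as Error).message] };
  }
  const errors: string[] = [];
  if (!fields.name) errors.push("name is required");
  const description = fields.description ?? "";
  if (!description) errors.push("description is required");
  else if (description.length < MIN_DESCRIPTION) errors.push("description is too short");
  else if (description.length > MAX_DESCRIPTION) errors.push("description is too long");
  return { passed: errors.length === 0, errors };
}
```

Collecting all errors in one pass matches the "improvement hints" style of output the CLI reports, though how the PR actually shapes its result object is not visible here.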
…osing fence

The body-extraction regex in `validateStructure` required a newline after the closing `---` fence, so a SKILL.md ending immediately after its body content (no trailing newline) was incorrectly reported as having an empty body. Make the trailing newline optional and add regression coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
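The regex itself is not quoted in the commit message, so the following is only a sketch of the general idea: the newline after the closing `---` is made optional. The pattern and the `splitSkill` helper are hypothetical and may differ from what `validateStructure` actually does.

```typescript
// Hypothetical pattern, not the PR's actual regex. The group after the
// closing fence is optional, so a file may end right at `---`.
const FRONTMATTER = /^---\r?\n([\s\S]*?)\r?\n---[ \t]*(?:\r?\n([\s\S]*))?$/;

function splitSkill(source: string): { frontmatter: string; body: string } | null {
  const match = FRONTMATTER.exec(source);
  if (!match) return null; // no frontmatter block at the top of the file
  // match[2] is undefined when nothing follows the closing fence.
  return { frontmatter: match[1], body: (match[2] ?? "").trim() };
}
```

A regression test in this style would feed the same document with and without a final newline and assert that the extracted body is identical.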
…blocks

SKILL.md bodies routinely contain triple-backtick code blocks. The previous `buildRubricPrompt` wrapped the markdown in a triple-backtick fence, so any nested `` ``` `` would close the outer block prematurely and confuse the rubric LLM. Pick a fence one backtick longer than the longest run in the input (minimum 4) so nested fences stay verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
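The fence-selection rule in this commit message is small enough to sketch directly. The helper names `pickFence` and `fenceMarkdown` are assumptions; the PR applies the rule inside `buildRubricPrompt`, whose code is not shown here.

```typescript
// Choose a fence one backtick longer than the longest backtick run in the
// input, with a floor of four, so nested fences cannot close the outer one.
function pickFence(markdown: string): string {
  const runs = markdown.match(/`+/g) ?? [];
  const longest = runs.reduce((max, run) => Math.max(max, run.length), 0);
  return "`".repeat(Math.max(4, longest + 1));
}

// Wrap the markdown verbatim inside the chosen fence.
function fenceMarkdown(markdown: string): string {
  const fence = pickFence(markdown);
  return `${fence}markdown\n${markdown}\n${fence}`;
}
```

With an ordinary SKILL.md containing triple-backtick blocks this yields a four-backtick outer fence; an input that already contains a five-backtick run gets a six-backtick fence.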
…eWithRubric

Two failure modes were unsafe in the rubric path:

1. The child 'error' handler only treated ENOENT specially. For other spawn errors (EACCES, EPERM, ...) the 'close' callback fired with code=null and produced the misleading message "claude exited with code null", obscuring the real reason.
2. There was no 'error' listener on child.stdin. When spawn fails, the subsequent stdin.write triggers an unhandled 'error' event that crashes the host process before we can return the structured RubricResult.

Capture all spawn errors with a structured message, attach a no-op stdin 'error' listener, and wrap the write/end in try/catch so the failure is always reported through the resolved promise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
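The three safeguards named in this commit message can be sketched with Node's `child_process` API as below. This is an illustrative helper under assumed names (`runWithStdin`, `RunResult`), not the PR's `scoreWithRubric`; the result shape and error message wording are made up for the example.

```typescript
import { spawn } from "node:child_process";
import type { ChildProcessWithoutNullStreams } from "node:child_process";

// Hypothetical result shape; the PR's RubricResult is not shown on this page.
interface RunResult {
  ok: boolean;
  stdout: string;
  error?: string;
}

// Run a command, feed it `input` on stdin, and always resolve (never reject).
function runWithStdin(command: string, args: string[], input: string): Promise<RunResult> {
  return new Promise((resolve) => {
    let stdout = "";
    let settled = false;
    const finish = (result: RunResult): void => {
      if (settled) return; // the first reported outcome wins
      settled = true;
      resolve(result);
    };

    let child: ChildProcessWithoutNullStreams;
    try {
      child = spawn(command, args);
    } catch (err) {
      finish({ ok: false, stdout, error: `spawn threw: ${(err as Error).message}` });
      return;
    }

    // Every spawn error (ENOENT, EACCES, EPERM, ...) is reported with its code,
    // before 'close' can produce a vague "exited with code null".
    child.on("error", (err: NodeJS.ErrnoException) => {
      finish({ ok: false, stdout, error: `spawn failed (${err.code ?? "unknown"}): ${err.message}` });
    });

    child.stdout.on("data", (chunk: Buffer | string) => {
      stdout += chunk.toString();
    });
    child.stderr.resume(); // drain stderr so the child cannot block on it

    child.on("close", (code) => {
      if (code === 0) finish({ ok: true, stdout });
      else finish({ ok: false, stdout, error: `exited with code ${code}` });
    });

    // No-op listener: without it a failed write becomes an unhandled 'error' event.
    child.stdin.on("error", () => {});

    try {
      child.stdin.write(input);
      child.stdin.end();
    } catch (err) {
      finish({ ok: false, stdout, error: `stdin write failed: ${(err as Error).message}` });
    }
  });
}
```

The `settled` guard is one way to make sure the 'error' outcome is not overwritten by a later 'close' event; the PR may order its handlers differently.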
…with tests

The orchestrator's `overallScore` branches (structural fail -> 0, structural pass + skipped rubric -> 50) and the `smokeFireTest` stub had no test coverage. Add deterministic cases that exercise both code paths without spawning the claude CLI (`skipRubric: true`), so a regression in either the score combination or the stub return shape is now caught locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
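The two branches named in this commit message can be written down as a small function. Only the `0` and `50` outcomes come from the commit text. The types, the function signature, and especially the third branch (passing a real rubric score through, clamped to 0..100) are assumptions, since the PR page does not say how a rubric score is combined.

```typescript
// Hypothetical shapes; the PR's real types are not shown on this page.
interface StructuralOutcome {
  passed: boolean;
}
type RubricOutcome = { skipped: true } | { skipped: false; score: number };

function overallScore(structural: StructuralOutcome, rubric: RubricOutcome): number {
  if (!structural.passed) return 0; // from the commit message: structural fail -> 0
  if (rubric.skipped) return 50; // from the commit message: pass + skipped rubric -> 50
  // ASSUMPTION: placeholder combination, not taken from the PR.
  return Math.max(0, Math.min(100, rubric.score));
}
```

Because both documented branches are pure, they can be tested deterministically without spawning the claude CLI, which is the point the commit makes.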
Summary
- New `scripts/skill-evaluator.ts` implementing the three-layer evaluator from issue #20 ("Add an automatic Skill evaluation pipeline (skill-creator-style quality checks)"): deterministic structural validation (zod over a minimal SKILL.md frontmatter parser), LLM rubric scoring via `claude -p` aligned to skill-creator conventions, and a stubbed smoke firing test (returns `{ skipped: true }` until Claude Code itself can run the fixtures).
- Wire the evaluator into `scripts/cli.ts` with `--skip-eval` and `--eval-model` flags. Per-candidate output now surfaces structural pass/fail, rubric score, and improvement hints; files are still written when scores are low so this PR does not silently drop output (gating is intentionally left to a follow-up issue).
- Extend `SkillCandidate` in `src/types/session.ts` with an optional `evaluation` field so persisted data still parses.

Notes for reviewers
- The smoke firing test (`smokeFireTest`) is in place but always resolves `{ skipped: true }`. Expect a follow-up issue to wire fixture prompts and Claude Code activation tracing.
- Tests do not call `claude -p`. Only structural validation, the JSON-extraction helpers, and the prompt builder are covered. The rubric path is exercised through mocking-friendly helpers (`extractFirstJsonObject`, `buildRubricPrompt`).
- `zod` is added as a direct dependency (it was previously only a transitive dependency of `eslint-plugin-react-hooks`).

Test plan
- `npm run lint` (no new errors; pre-existing warnings unchanged)
- `npm test` (230 tests pass, including 18 new evaluator tests + new CLI flag tests)
- `npx tsc --noEmit -p tsconfig.app.json`
- `npx tsc --noEmit -p tsconfig.node.json`
- `npx tsc --noEmit -p tsconfig.cli.json`
- `npm run build` and `npm run build:cli`
- `npx tsx scripts/cli.ts --dry-run` against real session logs (requires local data)
- `--skip-eval` to confirm the flag short-circuits the evaluator
- `--eval-model haiku` to confirm the rubric scoring path

Refs #20
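The reviewer notes mention a JSON-extraction helper, `extractFirstJsonObject`, that is covered by unit tests. Its implementation is not visible on this page, so the following is only one plausible way such a helper could work: scan for a balanced `{...}` span while ignoring braces inside string literals, then hand the span to `JSON.parse`. Only the function name comes from the PR; the behavior (including returning `null` when nothing parses) is an assumption.

```typescript
// Hypothetical re-implementation; only the name comes from the PR.
// Returns the first balanced {...} span that parses as JSON, or null.
function extractFirstJsonObject(text: string): unknown {
  for (let start = text.indexOf("{"); start !== -1; start = text.indexOf("{", start + 1)) {
    let depth = 0;
    let inString = false;
    let escaped = false;
    for (let i = start; i < text.length; i++) {
      const ch = text[i];
      if (inString) {
        // Inside a string literal: braces do not count, track escapes.
        if (escaped) escaped = false;
        else if (ch === "\\") escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') inString = true;
      else if (ch === "{") depth++;
      else if (ch === "}" && --depth === 0) {
        try {
          return JSON.parse(text.slice(start, i + 1));
        } catch {
          break; // balanced but not valid JSON; try the next "{"
        }
      }
    }
  }
  return null;
}
```

A helper in this style lets the rubric path tolerate model output that wraps the strict-JSON answer in extra prose, and it can be tested without calling `claude -p`.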
🤖 Generated with Claude Code