Skip to content

Add automatic skill evaluation pipeline (structural + rubric)#44

Open
chigichan24 wants to merge 5 commits into
mainfrom
feature/20-skill-evaluator
Open

Add automatic skill evaluation pipeline (structural + rubric)#44
chigichan24 wants to merge 5 commits into
mainfrom
feature/20-skill-evaluator

Conversation

@chigichan24
Copy link
Copy Markdown
Owner

Summary

  • Adds scripts/skill-evaluator.ts implementing the three-layer evaluator from Add an automatic Skill evaluation pipeline (skill-creator-style quality checks) #20: deterministic structural validation (zod over a minimal SKILL.md frontmatter parser), LLM rubric scoring via claude -p aligned to skill-creator conventions, and a stubbed smoke firing test (returns { skipped: true } until Claude Code itself can run the fixtures).
  • Wires the evaluator into scripts/cli.ts with --skip-eval and --eval-model flags. Per-candidate output now surfaces structural pass/fail, rubric score, and improvement hints; files are still written when scores are low so this PR does not silently drop output (gating is intentionally left to a follow-up issue).
  • Extends SkillCandidate in src/types/session.ts with an optional evaluation field so persisted data still parses.

Notes for reviewers

  • Smoke firing test is a stub in this PR. The interface (smokeFireTest) is in place but always resolves { skipped: true }. Expect a follow-up issue to wire fixture prompts and Claude Code activation tracing.
  • Tests do not invoke claude -p. Only structural validation + JSON-extraction helpers + prompt builder are covered. Rubric path is exercised by mocking-friendly helpers (extractFirstJsonObject, buildRubricPrompt).
  • zod is added as a direct dependency (it was previously only a transitive of eslint-plugin-react-hooks).

Test plan

  • npm run lint (no new errors; pre-existing warnings unchanged)
  • npm test (230 tests pass, including 18 new evaluator tests + new CLI flag tests)
  • npx tsc --noEmit -p tsconfig.app.json
  • npx tsc --noEmit -p tsconfig.node.json
  • npx tsc --noEmit -p tsconfig.cli.json
  • npm run build and npm run build:cli
  • Manual: run npx tsx scripts/cli.ts --dry-run against real session logs (requires local data)
  • Manual: run with --skip-eval to confirm flag short-circuits the evaluator
  • Manual: run with --eval-model haiku to confirm rubric scoring path

Refs #20

🤖 Generated with Claude Code

chigichan24 and others added 5 commits April 26, 2026 02:05
- New `scripts/skill-evaluator.ts` exposing `validateStructure`,
  `scoreWithRubric`, `smokeFireTest` (stub), and `evaluateSkill`.
  Structural validation uses zod over a minimal frontmatter parser
  (name, description, allowed-tools/requires/next). Rubric scoring
  spawns `claude -p` with a strict-JSON prompt aligned to skill-creator
  conventions. Smoke firing test is intentionally a stub for now.
- Wire evaluator into `scripts/cli.ts` with `--skip-eval` and
  `--eval-model` flags. Per-candidate console output now reports
  structural pass/fail, rubric score, and improvement hints. Files are
  still written on failure so reviewers can inspect; gating is left to
  a follow-up issue.
- Extend `SkillCandidate` in `src/types/session.ts` with optional
  `evaluation` field so persisted data still parses.
- Unit tests cover frontmatter happy path, missing required fields,
  oversize/short description, malformed YAML, and JSON extraction
  helpers. Tests do not call `claude -p`.

Refs #20

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…osing fence

The body-extraction regex in validateStructure required a newline after the
closing '---' fence, so a SKILL.md ending immediately after its body content
(no trailing newline) was incorrectly reported as having an empty body. Make
the trailing newline optional and add regression coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…blocks

SKILL.md bodies routinely contain triple-backtick code blocks. The previous
buildRubricPrompt wrapped the markdown in a triple-backtick fence, so any
nested ``` would close the outer block prematurely and confuse the rubric
LLM. Pick a fence one backtick longer than the longest run in the input
(minimum 4) so nested fences stay verbatim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eWithRubric

Two failure modes were unsafe in the rubric path:

1. The child 'error' handler only treated ENOENT specially. For other spawn
   errors (EACCES, EPERM, ...) the 'close' callback fired with code=null and
   produced the misleading message "claude exited with code null", obscuring
   the real reason.

2. There was no 'error' listener on child.stdin. When spawn fails, the
   subsequent stdin.write triggers an unhandled 'error' event that crashes
   the host process before we can return the structured RubricResult.

Capture all spawn errors with a structured message, attach a no-op stdin
'error' listener, and wrap the write/end in try/catch so the failure is
always reported through the resolved promise.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…with tests

The orchestrator's overallScore branches (structural-fail -> 0, structural
pass + skipped rubric -> 50) and the smokeFireTest stub had no test
coverage. Add deterministic cases that exercise both code paths without
spawning the claude CLI (skipRubric: true), so a regression in either the
score combination or the stub return shape is now caught locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant