Add automatic skill evaluation pipeline (structural + rubric) by chigichan24 · Pull Request #44 · chigichan24/crune

chigichan24 · 2026-04-25T17:05:58Z

Summary

Adds scripts/skill-evaluator.ts implementing the three-layer evaluator from Add an automatic Skill evaluation pipeline (skill-creator-style quality checks) #20: deterministic structural validation (zod over a minimal SKILL.md frontmatter parser), LLM rubric scoring via claude -p aligned to skill-creator conventions, and a stubbed smoke firing test (returns { skipped: true } until Claude Code itself can run the fixtures).
Wires the evaluator into scripts/cli.ts with --skip-eval and --eval-model flags. Per-candidate output now surfaces structural pass/fail, rubric score, and improvement hints; files are still written when scores are low so this PR does not silently drop output (gating is intentionally left to a follow-up issue).
Extends SkillCandidate in src/types/session.ts with an optional evaluation field so persisted data still parses.

Notes for reviewers

Smoke firing test is a stub in this PR. The interface (smokeFireTest) is in place but always resolves { skipped: true }. Expect a follow-up issue to wire fixture prompts and Claude Code activation tracing.
Tests do not invoke claude -p. Only structural validation + JSON-extraction helpers + prompt builder are covered. Rubric path is exercised by mocking-friendly helpers (extractFirstJsonObject, buildRubricPrompt).
zod is added as a direct dependency (it was previously only a transitive of eslint-plugin-react-hooks).

Test plan

npm run lint (no new errors; pre-existing warnings unchanged)
npm test (230 tests pass, including 18 new evaluator tests + new CLI flag tests)
npx tsc --noEmit -p tsconfig.app.json
npx tsc --noEmit -p tsconfig.node.json
npx tsc --noEmit -p tsconfig.cli.json
npm run build and npm run build:cli
Manual: run npx tsx scripts/cli.ts --dry-run against real session logs (requires local data)
Manual: run with --skip-eval to confirm flag short-circuits the evaluator
Manual: run with --eval-model haiku to confirm rubric scoring path

Refs #20

🤖 Generated with Claude Code

- New `scripts/skill-evaluator.ts` exposing `validateStructure`, `scoreWithRubric`, `smokeFireTest` (stub), and `evaluateSkill`. Structural validation uses zod over a minimal frontmatter parser (name, description, allowed-tools/requires/next). Rubric scoring spawns `claude -p` with a strict-JSON prompt aligned to skill-creator conventions. Smoke firing test is intentionally a stub for now. - Wire evaluator into `scripts/cli.ts` with `--skip-eval` and `--eval-model` flags. Per-candidate console output now reports structural pass/fail, rubric score, and improvement hints. Files are still written on failure so reviewers can inspect; gating is left to a follow-up issue. - Extend `SkillCandidate` in `src/types/session.ts` with optional `evaluation` field so persisted data still parses. - Unit tests cover frontmatter happy path, missing required fields, oversize/short description, malformed YAML, and JSON extraction helpers. Tests do not call `claude -p`. Refs #20 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…osing fence The body-extraction regex in validateStructure required a newline after the closing '---' fence, so a SKILL.md ending immediately after its body content (no trailing newline) was incorrectly reported as having an empty body. Make the trailing newline optional and add regression coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…blocks SKILL.md bodies routinely contain triple-backtick code blocks. The previous buildRubricPrompt wrapped the markdown in a triple-backtick fence, so any nested ``` would close the outer block prematurely and confuse the rubric LLM. Pick a fence one backtick longer than the longest run in the input (minimum 4) so nested fences stay verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eWithRubric Two failure modes were unsafe in the rubric path: 1. The child 'error' handler only treated ENOENT specially. For other spawn errors (EACCES, EPERM, ...) the 'close' callback fired with code=null and produced the misleading message "claude exited with code null", obscuring the real reason. 2. There was no 'error' listener on child.stdin. When spawn fails, the subsequent stdin.write triggers an unhandled 'error' event that crashes the host process before we can return the structured RubricResult. Capture all spawn errors with a structured message, attach a no-op stdin 'error' listener, and wrap the write/end in try/catch so the failure is always reported through the resolved promise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…with tests The orchestrator's overallScore branches (structural-fail -> 0, structural pass + skipped rubric -> 50) and the smokeFireTest stub had no test coverage. Add deterministic cases that exercise both code paths without spawning the claude CLI (skipRubric: true), so a regression in either the score combination or the stub return shape is now caught locally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chigichan24 and others added 5 commits April 26, 2026 02:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add automatic skill evaluation pipeline (structural + rubric)#44

Add automatic skill evaluation pipeline (structural + rubric)#44
chigichan24 wants to merge 5 commits into
mainfrom
feature/20-skill-evaluator

chigichan24 commented Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

chigichan24 commented Apr 25, 2026

Summary

Notes for reviewers

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant