feat: scanner precision fixes + wild-skills benchmark (75 real skills) by DCCA · Pull Request #42 · DCCA/skval

DCCA · 2026-06-23T01:37:16Z

Goal

Find bad skills, run skval on them, and use them as real benchmarks. I scored 75 installed skills (Anthropic plugins, Vercel suite, Superpowers) with skval's deterministic structural scan.

Key finding

The dramatic "bad" results were mostly skval false positives, not bad skills. So the highest-value outcome was hardening skval — and keeping the corpus as a false-positive regression set.

Precision fixes (each pinned in `tests/test_precision.py`)

| skval was flagging | Reality | Fix |
| --- | --- | --- |
| `missing referenced path: jsonwebtoken` | a regex in YAML frontmatter (`['"](jsonwebtoken)`) parsed as a Markdown link | scan refs in the body only (skip frontmatter + code) |
| `0/F unsafe` (`hook-development`, `writing-rules`) | `rm -rf`/`mkfs` inside a hook that blocks them | skip dangerous tokens in a quoted/defensive context |
| `0/F unsafe` (`command-development`) | `dd if=/dev/zero of=/tmp/file` (a regular file) | narrow the rule to `dd … of=/dev/…` (a device) |
| `unexpected key: version / user-invocable / tools` | valid, widely-used frontmatter | broaden the allow-list |
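To make the first fix concrete, here is a minimal sketch of a body-only reference scan. It is not skval's actual code: the function name `referenced_paths` and all four regexes are assumptions made for illustration, and the frontmatter pattern only handles a simple `---` block at the top of the file.

```python
import re

# Hypothetical patterns; skval's real ones are not shown in this PR.
FRONTMATTER_RE = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)  # leading YAML block
FENCED_CODE_RE = re.compile(r"```.*?```", re.DOTALL)          # fenced code blocks
INLINE_CODE_RE = re.compile(r"`[^`\n]*`")                     # inline code spans
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")              # [text](target)


def referenced_paths(skill_text: str) -> list[str]:
    """Return Markdown link targets found in prose only."""
    # Drop the frontmatter first, then code, so regex-like text in either
    # place can no longer be mistaken for a link.
    body = FRONTMATTER_RE.sub("", skill_text, count=1)
    body = FENCED_CODE_RE.sub("", body)
    body = INLINE_CODE_RE.sub("", body)
    return LINK_RE.findall(body)
```

Run directly on the raw file, `LINK_RE` does match `['"](jsonwebtoken)`, because `['"]` looks like link text and `(jsonwebtoken)` like a target. Stripping the frontmatter and code before matching removes that false positive while leaving genuine prose links in place.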

Effect across the 75 skills: clean 100/A scores 21 → 34; safety vetoes 7 → 4 (the 4 left are real `rm -rf ~/dev/…`). No real check was weakened: every genuine finding still flags and the unsafe fixture still scores 0/F.
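The two safety fixes in the table can be sketched as follows. This is an illustration under stated assumptions, not skval's implementation: `unsafe_hits` and the three regexes are hypothetical, and "quoted/defensive context" is approximated here as "the token sits inside a single- or double-quoted span on the same line", which may be narrower or broader than the real rule.

```python
import re

# Narrowed dd rule (assumed form): flag only writes whose of= target is under /dev/.
DEVICE_DD_RE = re.compile(r"\bdd\b[^\n]*\bof=/dev/")
# Tokens treated as dangerous in this sketch; the full rule set is not in the PR.
DANGEROUS_RE = re.compile(r"\brm\s+-rf\b|\bmkfs\b")
# Single- or double-quoted spans on one line; escaped quotes are ignored for brevity.
QUOTED_RE = re.compile(r"\"[^\"\n]*\"|'[^'\n]*'")


def unsafe_hits(line: str) -> list[str]:
    """Return dangerous tokens on a line, skipping ones inside quotes."""
    quoted = [m.span() for m in QUOTED_RE.finditer(line)]
    hits = []
    for m in DANGEROUS_RE.finditer(line):
        # A token fully contained in a quoted span is treated as data
        # (for example a grep pattern in a blocking hook), not a command.
        in_quotes = any(start <= m.start() and m.end() <= end for start, end in quoted)
        if not in_quotes:
            hits.append(m.group())
    if DEVICE_DD_RE.search(line):
        hits.append("dd of=/dev/")
    return hits
```

With these rules, `dd if=/dev/zero of=/tmp/file` produces no hits while `dd if=image.iso of=/dev/sda` does, and a hook line such as `grep -Eq 'rm -rf|mkfs' && exit 2` is skipped while a bare `rm -rf /tmp/build` still flags. Two limits of the sketch are worth noting: `of=/dev/null` would still match the narrowed `dd` pattern, and a quoted command passed to a shell (for example via `sh -c`) would be skipped by the quote heuristic.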

Benchmark

- `docs/examples/skill-benchmark/wild-skills.md`: distribution, the precision before/after, and the genuine remaining findings (vendored duplicate `SKILL.md`, size budget, vendor keys, real `rm -rf`).
- `raw/nextjs.md`: 77/C/Revise exemplar (Vercel ships a vendored `upstream/SKILL.md`).

Linked from the benchmark README. Kept in `docs/examples`, not the landing page (no named third-party low scores headlined), per the agreed framing.

Validation

162 tests pass (+11 precision); self-validation 100/A/Ship; ruff clean. Test-count badges bumped 151 → 162.

🤖 Generated with Claude Code

…ills

Ran skval across 75 installed skills (Anthropic plugins, Vercel, Superpowers). The run exposed four false-positive classes in the structural/safety scan, now fixed and locked with regression tests; the corpus is published as a benchmark that doubles as the FP regression set.

Precision fixes (tests/test_precision.py):
- refs: scan the body only (skip frontmatter + code) so a regex like `['"](jsonwebtoken)` in YAML no longer reads as a broken markdown link
- safety: skip dangerous tokens in defensive/quoted context (a hook that *blocks* rm -rf/mkfs is not unsafe)
- safety: narrow the dd rule to `dd ... of=/dev/...` (a real device write), not benign `dd if=/dev/zero of=/tmp/file`
- frontmatter: allow widely-used keys (version, user-invocable, tools, model)

Effect across the 75: clean 100/A 21 -> 34, false safety vetoes 7 -> 4 (the 4 remaining are real `rm -rf ~/...`). No real check weakened: every genuine finding still flags and the unsafe fixture still scores 0/F.

Benchmark: docs/examples/skill-benchmark/wild-skills.md (distribution, the precision before/after, remaining genuine findings) + raw/nextjs.md (77/C/Revise exemplar: vendored duplicate SKILL.md). Kept in docs/examples (not the landing page).

Test count 151 -> 162.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

DCCA merged commit 727b85f into main Jun 23, 2026
1 check passed

DCCA deleted the feat/scanner-precision branch June 23, 2026 01:37

DCCA merged 1 commit into main from feat/scanner-precision
