feat: scanner precision fixes + wild-skills benchmark (75 real skills) #42
Merged
…ills

Ran skval across 75 installed skills (Anthropic plugins, Vercel, Superpowers). The run exposed four false-positive classes in the structural/safety scan, now fixed and locked with regression tests; the corpus is published as a benchmark that doubles as the FP regression set.

Precision fixes (tests/test_precision.py):
- refs: scan the body only (skip frontmatter + code) so a regex like ['"](jsonwebtoken) in YAML no longer reads as a broken markdown link
- safety: skip dangerous tokens in defensive/quoted context (a hook that *blocks* rm -rf/mkfs is not unsafe)
- safety: narrow the dd rule to `dd ... of=/dev/...` (a real device write), not benign `dd if=/dev/zero of=/tmp/file`
- frontmatter: allow widely-used keys (version, user-invocable, tools, model)

Effect across the 75: clean 100/A 21 -> 34, false safety vetoes 7 -> 4 (the 4 remaining are real `rm -rf ~/...`). No real check weakened: every genuine finding still flags and the unsafe fixture still scores 0/F.

Benchmark: docs/examples/skill-benchmark/wild-skills.md (distribution, the precision before/after, remaining genuine findings) + raw/nextjs.md (77/C/Revise exemplar: vendored duplicate SKILL.md). Kept in docs/examples (not the landing page).

Test count 151 -> 162.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
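The "refs" fix in the commit message (scan the body only, skipping frontmatter and code) can be sketched roughly as below. This is an illustration under stated assumptions, not skval's actual code: the names `body_only`, `link_targets`, and `SAMPLE`, and all four regular expressions, are hypothetical.

```python
import re

# Hypothetical patterns for illustration; skval's real implementation may differ.
FRONTMATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
FENCED_CODE = re.compile(r"^`{3}.*?^`{3}[ \t]*$", re.DOTALL | re.MULTILINE)
INLINE_CODE = re.compile(r"`[^`\n]*`")
MD_LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def body_only(text: str) -> str:
    """Drop leading YAML frontmatter, fenced code blocks, and inline code spans."""
    text = FRONTMATTER.sub("", text, count=1)
    text = FENCED_CODE.sub("", text)
    return INLINE_CODE.sub("", text)


def link_targets(text: str) -> list[str]:
    """Return Markdown link targets found in the prose body only."""
    return MD_LINK.findall(body_only(text))


# A made-up skill file whose frontmatter holds a regex that looks like a link.
SAMPLE = '''---
name: demo
pattern: |
  ['"](jsonwebtoken)
---
See [the guide](docs/guide.md).
'''

print(MD_LINK.findall(SAMPLE))  # naive scan: ['jsonwebtoken', 'docs/guide.md']
print(link_targets(SAMPLE))     # body-only scan: ['docs/guide.md']
```

The naive scan treats `['"](jsonwebtoken)` as a link to a missing path, which matches the false positive described above. Stripping non-prose regions first leaves only the genuine reference to check.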
Goal
Find bad skills, run skval on them, and use them as real benchmarks. I scored 75 installed skills (Anthropic plugins, Vercel suite, Superpowers) with skval's deterministic structural scan.
Key finding
The dramatic "bad" results were mostly skval false positives, not bad skills. So the highest-value outcome was hardening skval — and keeping the corpus as a false-positive regression set.
Precision fixes (each pinned in `tests/test_precision.py`)
- `missing referenced path: jsonwebtoken` came from a regex (`['"](jsonwebtoken)`) parsed as a Markdown link.
- `0/F unsafe` (hook-development, writing-rules) came from `rm -rf`/`mkfs` inside a hook that blocks them.
- `0/F unsafe` (command-development) came from `dd if=/dev/zero of=/tmp/file` (a regular file); the rule now matches only `dd … of=/dev/…` (a device).
- `unexpected key: version / user-invocable / tools` came from widely used frontmatter keys, which are now allowed.
Effect across the 75: clean `100/A` 21 → 34, false safety vetoes 7 → 4 (the 4 left are real `rm -rf ~/dev/…`). No real check weakened: every genuine finding still flags and the unsafe fixture still scores `0/F`.
Benchmark
- `docs/examples/skill-benchmark/wild-skills.md`: distribution, the precision before/after, and the genuine remaining findings (vendored duplicate `SKILL.md`, size budget, vendor keys, real `rm -rf`).
- `raw/nextjs.md`: `77/C/Revise` exemplar (Vercel ships a vendored `upstream/SKILL.md`).
Validation
162 tests pass (+11 precision); self-validation 100/A/Ship; ruff clean. Test-count badges bumped 151 → 162.
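For reviewers, the narrowed `dd` rule listed under Precision fixes might look something like the sketch below. It is an assumption-laden illustration only: `DD_BROAD`, `DD_DEVICE_WRITE`, and `is_device_write` are hypothetical names, and the patterns are not copied from skval.

```python
import re

# Hypothetical rules for illustration; not skval's actual patterns.
# Broad rule: any dd invocation is treated as dangerous (false-positive prone).
DD_BROAD = re.compile(r"\bdd\b")
# Narrow rule: a dd invocation whose output target (of=) is under /dev/.
# The character class stops at newlines and shell separators so the of=
# argument must belong to the same command.
DD_DEVICE_WRITE = re.compile(r"\bdd\b[^\n;|&]*\bof=/dev/\S+")


def is_device_write(command: str) -> bool:
    """True when the command contains a dd call writing to a /dev/ path."""
    return DD_DEVICE_WRITE.search(command) is not None


print(is_device_write("dd if=/dev/zero of=/tmp/file"))  # False
print(is_device_write("dd if=image.iso of=/dev/sda"))   # True
```

Keying on `of=` rather than on any `/dev/` path is what lets reads from `/dev/zero` pass while writes to a device still flag.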
🤖 Generated with Claude Code