feat: scanner precision fixes + wild-skills benchmark (75 real skills)#42

Merged: DCCA merged 1 commit into main from feat/scanner-precision on Jun 23, 2026

DCCA (Owner) commented on Jun 23, 2026

Goal

Find bad skills, run skval on them, and use them as real benchmarks. I scored 75 installed skills (Anthropic plugins, Vercel suite, Superpowers) with skval's deterministic structural scan.

Key finding

The dramatic "bad" results were mostly skval false positives, not bad skills. So the highest-value outcome was hardening skval — and keeping the corpus as a false-positive regression set.

Precision fixes (each pinned in tests/test_precision.py)

| skval was flagging | Reality | Fix |
| --- | --- | --- |
| `missing referenced path: jsonwebtoken` | a regex in YAML frontmatter (`['"](jsonwebtoken)`) parsed as a Markdown link | scan refs in the body only (skip frontmatter + code) |
| 0/F unsafe (hook-development, writing-rules) | `rm -rf`/`mkfs` inside a hook that blocks them | skip dangerous tokens in a quoted/defensive context |
| 0/F unsafe (command-development) | `dd if=/dev/zero of=/tmp/file` (a regular file) | narrow the rule to `dd … of=/dev/…` (a device) |
| `unexpected key: version` / `user-invocable` / `tools` | valid, widely-used frontmatter | broaden the allow-list |
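
For readers unfamiliar with the first false-positive class, here is a minimal sketch of a body-only reference scan. It is not skval's actual code: the names `body_link_targets`, `FRONTMATTER_RE`, `FENCED_CODE_RE`, `INLINE_CODE_RE`, and `LINK_RE` are hypothetical, and the regexes are deliberately simplified assumptions about what counts as frontmatter, code, and a Markdown link.

```python
import re

FENCE = "`" * 3  # built indirectly so this example can sit inside a fenced block

# Assumed shapes (hypothetical, not taken from skval):
FRONTMATTER_RE = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)
FENCED_CODE_RE = re.compile(rf"^{FENCE}.*?^{FENCE}[ \t]*$", re.DOTALL | re.MULTILINE)
INLINE_CODE_RE = re.compile(r"`[^`\n]*`")
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def body_link_targets(text: str) -> list[str]:
    """Return Markdown link targets found in the body only."""
    body = FRONTMATTER_RE.sub("", text, count=1)  # drop leading YAML frontmatter
    body = FENCED_CODE_RE.sub("", body)           # drop fenced code blocks
    body = INLINE_CODE_RE.sub("", body)           # drop inline code spans
    return LINK_RE.findall(body)


# A made-up skill file: the frontmatter regex and the fenced snippet both
# look like Markdown links to a naive whole-file scan.
SAMPLE = "\n".join([
    "---",
    "name: auth-helper",
    "pattern: ['\"](jsonwebtoken)",
    "---",
    "See [the guide](docs/guide.md).",
    "",
    FENCE + "js",
    "const re = /['\"](x)/",
    FENCE,
    "",
])

naive_targets = LINK_RE.findall(SAMPLE)      # whole-file scan
body_targets = body_link_targets(SAMPLE)     # body-only scan
```

In this sketch the whole-file scan picks up `jsonwebtoken` as a link target, while the body-only scan keeps only the link in the prose.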

Effect across the 75: clean 100/A 21 → 34, false safety vetoes 7 → 4 (the 4 left are real rm -rf ~/dev/…). No real check weakened — every genuine finding still flags and the unsafe fixture still scores 0/F.
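
As an illustration of narrowing a rule without weakening it, here is a hedged sketch of the `dd` change. The patterns and the helper `is_unsafe_dd` are hypothetical stand-ins, not skval's real rule definitions. The aim is only to show that the benign command stops matching while a device write still does.

```python
import re

# Hypothetical "before" rule: any dd invocation that names an output file.
BROAD_DD_RE = re.compile(r"\bdd\b[^\n]*\bof=")

# Hypothetical "after" rule: only dd invocations whose output is under /dev/.
# Note: this simple form would also match of=/dev/null; how skval treats that
# case is not stated in this PR.
DEVICE_DD_RE = re.compile(r"\bdd\b[^\n]*\bof=/dev/")


def is_unsafe_dd(line: str) -> bool:
    """Flag a line only when dd writes to a path under /dev/."""
    return DEVICE_DD_RE.search(line) is not None


benign = "dd if=/dev/zero of=/tmp/file bs=1M count=10"
device_write = "dd if=image.iso of=/dev/sda bs=4M"

# The broad rule flags both lines; the narrowed rule flags only the device write.
broad_hits = [bool(BROAD_DD_RE.search(s)) for s in (benign, device_write)]
narrow_hits = [is_unsafe_dd(s) for s in (benign, device_write)]
```

Pairing each narrowed rule with one benign and one genuine input, as above, is one way to pin both directions in a regression test.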

Benchmark

  • docs/examples/skill-benchmark/wild-skills.md — distribution, the precision before/after, and the genuine remaining findings (vendored duplicate SKILL.md, size budget, vendor keys, real rm -rf).
  • raw/nextjs.md: 77/C/Revise exemplar (Vercel ships a vendored upstream/SKILL.md).
  • Linked from the benchmark README. Kept in docs/examples — not the landing page (no named third-party low scores headlined), per the agreed framing.

Validation

162 tests pass (+11 precision); self-validation 100/A/Ship; ruff clean. Test-count badges bumped 151 → 162.

🤖 Generated with Claude Code


Ran skval across 75 installed skills (Anthropic plugins, Vercel, Superpowers).
The run exposed four false-positive classes in the structural/safety scan,
now fixed and locked with regression tests; the corpus is published as a
benchmark that doubles as the FP regression set.

Precision fixes (tests/test_precision.py):
- refs: scan the body only (skip frontmatter + code) so a regex like
  ['"](jsonwebtoken) in YAML no longer reads as a broken markdown link
- safety: skip dangerous tokens in defensive/quoted context (a hook that
  *blocks* rm -rf/mkfs is not unsafe)
- safety: narrow the dd rule to `dd ... of=/dev/...` (a real device write),
  not benign `dd if=/dev/zero of=/tmp/file`
- frontmatter: allow widely-used keys (version, user-invocable, tools, model)

Effect across the 75: clean 100/A 21 -> 34, false safety vetoes 7 -> 4
(the 4 remaining are real `rm -rf ~/...`). No real check weakened — every
genuine finding still flags and the unsafe fixture still scores 0/F.

Benchmark: docs/examples/skill-benchmark/wild-skills.md (distribution,
the precision before/after, remaining genuine findings) + raw/nextjs.md
(77/C/Revise exemplar: vendored duplicate SKILL.md). Kept in docs/examples
(not the landing page). Test count 151 -> 162.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
DCCA merged commit 727b85f into main on Jun 23, 2026
1 check passed
DCCA deleted the feat/scanner-precision branch on June 23, 2026 01:37