feat(eval): eval-run shortcut + runner label injection, calibrated baselines, full eval docs (#15) by evemcgivern · Pull Request #69 · stylusnexus/defect-scan

evemcgivern · 2026-06-16T17:48:03Z

Summary

Completes the #15 self-improving eval: it's now real-model-validated, fully calibrated, and documented.

Genuinely-new since v1.7.0:

feat(eval): scripts/eval-run wrapper — auto-selects a runner, forwards all flags (feat(eval): scripts/eval-run shortcut for the harness (#15) #63).
fix(eval): runners inject the language's valid label set (eval-categories) so the model emits grader-matchable labels (fix(eval): runner label injection + measured baselines for rust/shell/yaml/swift (#15) #67) — fixed a measurement bug that scored correct finds as FP+FN for custom-label languages.
chore(eval): measured seen baselines for all 13 languages at 1.00/1.00 (chore(eval): record measured baselines — pilot (python/go/react-typescript) (#15) #58, chore(eval): measured baselines for 6 more languages (csharp/dart/java/kotlin/php/ruby) (#15) #59, fix(eval): runner label injection + measured baselines for rust/shell/yaml/swift (#15) #67) — replacing conservative placeholders; the gate now enforces a measured erosion floor (~0.90 precision) instead of a guess.
docs(eval): "how the full eval works" architecture overview, language-vs-runner clarity, runner-injection callout, baseline-calibration note (docs(eval): baseline calibration note + label-set caveat (#15) #60, docs(eval): 'how the full eval works' overview (#15) #61, docs(eval): clarify language vs runner (#15) #64, docs(eval): call out runner injection in architecture overview (#15) #65).

Validation

End-to-end against real models: every language's seen corpus measured 1.00/1.00. The exercise surfaced and fixed a real harness bug (label injection) and confirmed the safety machinery (overfit FLAG + PARTIAL-refuses-baseline both fired correctly).

Known follow-ups (filed, non-blocking)

Sharpen java held-out fixtures (real-model scores 0.25; persistent overfit FLAG) #68 java held-out fixtures are ambiguous (real-model 0.25); kept at placeholder, overfit FLAG correctly firing.
Supply-chain defect pattern pack (npm-first): malicious install scripts, dependency confusion, lockfile tampering #66 supply-chain pattern pack (npm-first). Optional cost-capped scheduled eval-run (regression heartbeat) #62 optional scheduled eval-run.

Test plan

bats 147/147; sh -n clean; BSD+GNU
all 13 seen baselines measured; calibration via CODEOWNERS PRs

🤖 Generated with Claude Code

…cript (#15) (#58) Scoped pilot, claude runner, --runs 2: all three measured precision 1.00 / recall 1.00 (±0.00), 0 clean-FP. Sets precision_baseline/recall_baseline to measured; floors left conservative (erosion gate now effectively precision >= 0.90). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

#59) Overnight sweep, claude runner, --runs 2: csharp, dart, java, kotlin, php, ruby all measured precision 1.00 / recall 1.00 (±0.00), 0 clean-FP. Sets *_baseline to measured; floors unchanged. Deliberately EXCLUDED (measurement artifact, not committed): rust/shell/yaml seen and java held-out scored 0.67/0.50 because the runner doesn't enumerate each language's valid category labels — the model emitted correct-line, correct-concept findings with synonym labels (panic->unwrap-panic, quoting->SC2086, coerce-> norway-problem, cat#2->swallowed-interrupt). Root-cause fix + re-measure tracked separately. swift came back PARTIAL (transient) and was not written. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…eat (#15) (#60) Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…ow + safety) (#15) (#61) Single mental-model overview at the top of the canonical eval doc: the four parts (validator/harness/critic/runner), the corpus→runner→sentinel→grader→gate data flow, and the load-bearing safety properties. No new file — avoids a third source of truth drifting against the README and the design spec. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

… (#63) Convenience entry point for the maintainer harness: locates detect.sh, auto-selects a runner (claude preferred, codex fallback) if DEFECT_SCAN_EVAL_RUNNER is unset, and forwards all flags to 'detect.sh eval-run'. README leads with the shortcut; CI sh -n covers it; bats verifies forwarding (stub) + the no-runner exit-3 path. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…ness section (#15) (#64) Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…e architecture overview (#15) (#65) Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…/yaml/swift (#15) (#67) * fix(eval): runners inject the language's valid label set (eval-categories) (#15) The eval-mode prompt named 'a language-specific label' without enumerating the valid ones, so for custom-label languages the model invented synonyms (panic-> unwrap-panic, quoting->SC2086, coerce->norway-problem) that the exact-match grader scored as FP+FN — artificially depressing rust/shell/yaml to 0.67 and java held-out to 0.50 (a FALSE overfit flag). Runners now run 'detect.sh eval-categories <lang>' and inject the exact valid set. Verified: rust unwrap now emits '3:panic' (was 'unwrap-panic'), an exact match. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * chore(eval): record measured seen baselines for rust/shell/yaml/swift (#15) Re-measured with the label-fixed runner: all four now 1.00/1.00 (±0.00). rust/shell/ yaml were the 0.67 synonym-label artifacts (now resolved by injecting eval-categories); swift's earlier PARTIAL transient cleared. Completes measured SEEN baselines for all 13 languages. java held-out (0.25, persistent overfit FLAG) intentionally NOT committed — fixture-quality issue tracked separately; it keeps its conservative placeholder. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

evemcgivern and others added 9 commits June 15, 2026 21:34

chore(deploy): back-merge release 1.7.0 stamp into dev

c3400f5

docs(eval): note baselines are measured (calibration) + label-set cav…

9851375

…eat (#15) (#60) Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

docs(eval): clarify language (corpus) vs runner (which AI) in the har…

3b5e7c4

…ness section (#15) (#64) Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

docs(eval): call out runner injection (DEFECT_SCAN_EVAL_RUNNER) in th…

549ba08

…e architecture overview (#15) (#65) Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

evemcgivern requested a review from a team as a code owner June 16, 2026 17:48

evemcgivern merged commit bbe0343 into main Jun 16, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eval): eval-run shortcut + runner label injection, calibrated baselines, full eval docs (#15)#69

feat(eval): eval-run shortcut + runner label injection, calibrated baselines, full eval docs (#15)#69
evemcgivern merged 9 commits into
mainfrom
dev

evemcgivern commented Jun 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

evemcgivern commented Jun 16, 2026

Summary

Validation

Known follow-ups (filed, non-blocking)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant