fix(eval): runner label injection + measured baselines for rust/shell/yaml/swift (#15) by evemcgivern · Pull Request #67 · stylusnexus/defect-scan

evemcgivern · 2026-06-16T13:24:56Z

Summary

Closes out baseline calibration. Two related changes:

Runner label injection (the fix). The eval-mode prompt named "a language-specific label" without enumerating the valid ones, so for custom-label languages the model invented synonyms, each of which the exact-match grader scored as both a false positive (FP) and a false negative (FN). Runners now run detect.sh eval-categories <lang> and inject the exact valid label set into the prompt.
Corrected measured baselines (re-measured with the fixed runner, --runs 2):

Language	before	after
rust	0.67 (`unwrap-panic` ≠ `panic`)	1.00 / 1.00
shell	0.67 (`SC2086` ≠ `quoting`)	1.00 / 1.00
yaml	0.67 (`norway-problem` ≠ `coerce`)	1.00 / 1.00
swift	PARTIAL (transient)	1.00 / 1.00

All 13 languages now have measured seen baselines at 1.00/1.00.
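
To illustrate why one synonym label costs both an FP and an FN under exact matching, here is a minimal sketch. It is not the project's grader: the `score` function, the one-finding-per-line file layout, and the findings on lines 9 and 14 are assumptions made for illustration. Only `3:panic` and `unwrap-panic` come from this PR.

```shell
#!/bin/sh
# Sketch of an exact-match grader. Hypothetical: not the project's real grader.
# score EXPECTED_FILE ACTUAL_FILE -> prints "TP FP FN"
# Each file holds one "line:label" finding per line.
score() {
  exp_sorted=$(mktemp)
  act_sorted=$(mktemp)
  # comm needs sorted input; force the C locale so sort and comm agree.
  LC_ALL=C sort -u "$1" > "$exp_sorted"
  LC_ALL=C sort -u "$2" > "$act_sorted"
  # Findings present in both files are true positives.
  tp=$(LC_ALL=C comm -12 "$exp_sorted" "$act_sorted" | wc -l | tr -d '[:space:]')
  # Findings only in the model output are false positives.
  fp=$(LC_ALL=C comm -13 "$exp_sorted" "$act_sorted" | wc -l | tr -d '[:space:]')
  # Findings only in the expected set are false negatives.
  fn=$(LC_ALL=C comm -23 "$exp_sorted" "$act_sorted" | wc -l | tr -d '[:space:]')
  rm -f "$exp_sorted" "$act_sorted"
  echo "$tp $fp $fn"
}

expected=$(mktemp)
actual=$(mktemp)
# Only 3:panic and unwrap-panic come from this PR; the other findings are made up.
printf '%s\n' '3:panic' '9:panic' '14:panic' > "$expected"
printf '%s\n' '3:unwrap-panic' '9:panic' '14:panic' > "$actual"
score "$expected" "$actual"   # prints: 2 1 1
rm -f "$expected" "$actual"
```

Under this assumed three-finding set, the mismatched label leaves 2 matches out of 3 on each side, a ratio of about 0.67. That is consistent with the "before" column, though the PR does not state the real fixture count.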

Deliberately NOT included

java held-out re-measured 0.25/0.25 with the overfit FLAG still firing — and this is not the label artifact (the same fix made rust/shell/yaml perfect). It's a fixture-quality problem: the agent-authored held-out fixtures (e.g. BugPathTraversal — a new File("/data/"+name) constructor, not a realized read/write) are ambiguous, so a real model reasonably doesn't flag them as labeled. Left at its conservative placeholder; tracked as a follow-up to sharpen those fixtures, then re-measure that split. The overfit FLAG is working as intended until then.

Test plan

bats tests/detect.bats 147/147; sh -n clean
label fix verified end-to-end (rust unwrap now emits 3:panic)
re-measure done on a clean checkout (no branch-switching mid-run)

CODEOWNERS review required (tests/eval/). Part of #15.

🤖 Generated with Claude Code

…ries) (#15)

The eval-mode prompt named 'a language-specific label' without enumerating the valid ones, so for custom-label languages the model invented synonyms (panic -> unwrap-panic, quoting -> SC2086, coerce -> norway-problem) that the exact-match grader scored as FP+FN, artificially depressing rust/shell/yaml to 0.67 and java held-out to 0.50 (a FALSE overfit flag). Runners now run 'detect.sh eval-categories <lang>' and inject the exact valid set. Verified: rust unwrap now emits '3:panic' (was 'unwrap-panic'), an exact match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
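
The injection step described in this commit could look roughly like the sketch below. It is not the actual runner code: `build_prompt`, the `DETECT` variable, the prompt wording, and the assumption that `eval-categories` prints one label per line are all hypothetical. Only the `detect.sh eval-categories <lang>` invocation comes from the PR.

```shell
#!/bin/sh
# Hypothetical runner helper, not the real implementation.
# DETECT is an assumed path to detect.sh; override it as needed.
DETECT=${DETECT:-./detect.sh}

# build_prompt LANG BASE_PROMPT -> prints the prompt with the valid labels appended.
build_prompt() {
  lang=$1
  base_prompt=$2
  # Ask detect.sh for the valid labels; fail loudly rather than
  # send an open-ended prompt that invites invented synonyms.
  labels=$("$DETECT" eval-categories "$lang") || {
    echo "eval-categories failed for $lang" >&2
    return 1
  }
  # Assumption: one label per line on stdout. Join them with commas.
  label_list=$(printf '%s\n' "$labels" | paste -sd, -)
  printf '%s\nUse ONLY these labels, spelled exactly: %s\n' "$base_prompt" "$label_list"
}
```

A runner would call something like `build_prompt rust "$base_prompt"` before invoking the model. Returning non-zero when the label lookup fails is a design choice in this sketch, so that a missing label set cannot silently reintroduce the synonym problem.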

…#15)

Re-measured with the label-fixed runner: all four now 1.00/1.00 (±0.00). rust/shell/yaml were the 0.67 synonym-label artifacts (now resolved by injecting eval-categories); swift's earlier PARTIAL transient cleared. Completes measured SEEN baselines for all 13 languages. java held-out (0.25, persistent overfit FLAG) intentionally NOT committed: it is a fixture-quality issue tracked separately, and it keeps its conservative placeholder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

evemcgivern and others added 2 commits June 16, 2026 06:51

evemcgivern requested a review from a team as a code owner June 16, 2026 13:24

evemcgivern merged commit 272f773 into dev Jun 16, 2026
4 checks passed

evemcgivern mentioned this pull request Jun 16, 2026

feat(eval): eval-run shortcut + runner label injection, calibrated baselines, full eval docs (#15) #69

Merged

2 tasks

github-actions Bot mentioned this pull request Jun 16, 2026

chore(main): release 1.8.0 #70

Merged

evemcgivern merged 2 commits into dev from fix/15-runner-category-labels
