fix(eval): runner label injection + measured baselines for rust/shell/yaml/swift (#15) #67

Merged
evemcgivern merged 2 commits into dev from fix/15-runner-category-labels on Jun 16, 2026
Conversation

@evemcgivern

Contributor

Summary

Closes out baseline calibration. Two related changes:

  1. Runner label injection (the fix). The eval-mode prompt named "a language-specific label" without enumerating the valid ones, so for custom-label languages the model invented synonyms the exact-match grader scored as FP+FN. Runners now run detect.sh eval-categories <lang> and inject the exact valid set.
  2. Corrected measured baselines (re-measured with the fixed runner, --runs 2):
| Language | Before | After |
|---|---|---|
| rust | 0.67 (unwrap-panic instead of panic) | 1.00 / 1.00 |
| shell | 0.67 (SC2086 instead of quoting) | 1.00 / 1.00 |
| yaml | 0.67 (norway-problem instead of coerce) | 1.00 / 1.00 |
| swift | PARTIAL (transient) | 1.00 / 1.00 |

All 13 languages now have measured seen baselines at 1.00/1.00.
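For readers unfamiliar with the runner, the label injection in change 1 can be sketched roughly as follows. This is a minimal Python illustration, not the runner's actual code. The script path `./detect.sh`, the whitespace-separated output format of `eval-categories`, the prompt wording, and the helper names `fetch_categories` and `build_eval_prompt` are all assumptions made for the example.

```python
import subprocess
from typing import Callable, List


def fetch_categories(lang: str, detect_script: str = "./detect.sh") -> List[str]:
    """Run `detect.sh eval-categories <lang>` and split stdout into labels.

    Assumption: the command prints the valid labels separated by
    whitespace or newlines; the real output format may differ.
    """
    result = subprocess.run(
        [detect_script, "eval-categories", lang],
        capture_output=True,
        text=True,
        check=True,  # fail loudly rather than build a prompt with no labels
    )
    return result.stdout.split()


def build_eval_prompt(
    base_prompt: str,
    lang: str,
    get_categories: Callable[[str], List[str]] = fetch_categories,
) -> str:
    """Append the exact valid label set to the eval-mode prompt."""
    labels = get_categories(lang)
    if not labels:
        raise ValueError(f"no eval categories returned for {lang!r}")
    return (
        f"{base_prompt}\n"
        f"Use exactly one of these labels, spelled as shown: {', '.join(labels)}\n"
    )


# Stubbed category source so the sketch runs without detect.sh.
# "panic" comes from this PR; "other-label" is a hypothetical placeholder.
def stub_categories(lang: str) -> List[str]:
    return ["panic", "other-label"]


prompt = build_eval_prompt(
    "Report each finding as <line>:<label>.", "rust", stub_categories
)
print(prompt)
```

Passing the category source as a parameter keeps the prompt builder testable without shelling out, which is one plausible design and not necessarily how the runners are structured.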

Deliberately NOT included

  • java held-out re-measured 0.25/0.25 with the overfit FLAG still firing — and this is not the label artifact (the same fix made rust/shell/yaml perfect). It's a fixture-quality problem: the agent-authored held-out fixtures (e.g. BugPathTraversal — a new File("/data/"+name) constructor, not a realized read/write) are ambiguous, so a real model reasonably doesn't flag them as labeled. Left at its conservative placeholder; tracked as a follow-up to sharpen those fixtures, then re-measure that split. The overfit FLAG is working as intended until then.

Test plan

  • bats tests/detect.bats 147/147; sh -n clean
  • label fix verified end-to-end (rust unwrap now emits 3:panic)
  • re-measure done on a clean checkout (no branch-switching mid-run)
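To illustrate why an invented synonym counted as both a false positive and a false negative, here is a small Python sketch of exact-match grading. It assumes findings are compared as `line:label` strings (the `3:panic` form quoted above). The function name `grade_exact` is hypothetical and the real grader may differ in detail.

```python
from typing import Iterable, Tuple


def grade_exact(expected: Iterable[str], predicted: Iterable[str]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives).

    A prediction counts only if its "line:label" string matches an
    expected finding exactly; a synonym label on the right line does not.
    """
    exp, pred = set(expected), set(predicted)
    tp = len(exp & pred)   # exact matches
    fp = len(pred - exp)   # emitted but not expected
    fn = len(exp - pred)   # expected but not emitted
    return tp, fp, fn


# Before the fix: right line, invented synonym label.
before = grade_exact(["3:panic"], ["3:unwrap-panic"])
# After the fix: the injected label set yields the exact label.
after = grade_exact(["3:panic"], ["3:panic"])

print(before)  # (0, 1, 1)
print(after)   # (1, 0, 0)
```

A single mislabeled finding therefore costs twice, once as a spurious prediction and once as a missed expectation.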

CODEOWNERS review required (tests/eval/). Part of #15.

🤖 Generated with Claude Code

evemcgivern and others added 2 commits June 16, 2026 06:51
…ries) (#15)

The eval-mode prompt named 'a language-specific label' without enumerating the
valid ones, so for custom-label languages the model invented synonyms (panic->
unwrap-panic, quoting->SC2086, coerce->norway-problem) that the exact-match grader
scored as FP+FN — artificially depressing rust/shell/yaml to 0.67 and java held-out
to 0.50 (a FALSE overfit flag). Runners now run 'detect.sh eval-categories <lang>'
and inject the exact valid set. Verified: rust unwrap now emits '3:panic' (was
'unwrap-panic'), an exact match.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…#15)

Re-measured with the label-fixed runner: all four now 1.00/1.00 (±0.00). rust/shell/
yaml were the 0.67 synonym-label artifacts (now resolved by injecting eval-categories);
swift's earlier PARTIAL transient cleared. Completes measured SEEN baselines for all 13
languages. java held-out (0.25, persistent overfit FLAG) intentionally NOT committed —
fixture-quality issue tracked separately; it keeps its conservative placeholder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@evemcgivern evemcgivern requested a review from a team as a code owner June 16, 2026 13:24
@evemcgivern evemcgivern merged commit 272f773 into dev Jun 16, 2026
4 checks passed