feat(eval): eval-run shortcut + runner label injection, calibrated baselines, full eval docs (#15) #69

Merged
evemcgivern merged 9 commits into main from dev
Jun 16, 2026
@evemcgivern

Contributor

Summary

Completes the #15 self-improving eval: it's now real-model-validated, fully calibrated, and documented.

Genuinely-new since v1.7.0:

Validation

Run end-to-end against real models: the seen corpus for every language measured precision 1.00 / recall 1.00. The exercise surfaced and fixed a real harness bug (runners did not inject each language's valid label set) and confirmed the safety machinery: the overfit FLAG and the PARTIAL-refuses-baseline guard both fired correctly.

Known follow-ups (filed, non-blocking)

Test plan

  • bats 147/147; sh -n clean; BSD+GNU
  • all 13 seen baselines measured; calibration via CODEOWNERS PRs

🤖 Generated with Claude Code

evemcgivern and others added 9 commits June 15, 2026 21:34
…cript (#15) (#58)

Scoped pilot, claude runner, --runs 2: all three measured precision 1.00 / recall
1.00 (±0.00), 0 clean-FP. Sets precision_baseline/recall_baseline to measured;
floors left conservative (erosion gate now effectively precision >= 0.90).

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
#59)

Overnight sweep, claude runner, --runs 2: csharp, dart, java, kotlin, php, ruby
all measured precision 1.00 / recall 1.00 (±0.00), 0 clean-FP. Sets *_baseline to
measured; floors unchanged.

Deliberately EXCLUDED (measurement artifact, not committed): rust/shell/yaml seen
and java held-out scored 0.67/0.50 because the runner doesn't enumerate each
language's valid category labels — the model emitted correct-line, correct-concept
findings with synonym labels (panic->unwrap-panic, quoting->SC2086, coerce->
norway-problem, cat#2->swallowed-interrupt). Root-cause fix + re-measure tracked
separately. swift came back PARTIAL (transient) and was not written.
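
To make the artifact above concrete, here is a minimal sketch of exact-match grading over `line:label` findings. It is not the project's grader: the `grade` function, the file layout, and every label other than `panic` and `unwrap-panic` are assumptions for illustration. It shows how a single synonym label counts as one false positive plus one false negative, which is one way a three-finding fixture could land on 0.67/0.67.

```shell
#!/bin/sh
# Hypothetical exact-match grader (illustrative only, not detect.sh's code).
# Each input file holds one "line:label" finding per line, with no duplicates.
# Prints "precision recall" rounded to two decimals.
grade() {
    expected_file=$1
    emitted_file=$2
    awk '
        NR == FNR { want[$0] = 1; n_want++; next }  # first file: expected findings
        { n_got++; if ($0 in want) tp++ }           # second file: emitted findings
        END {
            # A non-matching emitted finding is an FP; its unmatched expected
            # counterpart is an FN, so one synonym hurts both metrics.
            precision = (n_got  > 0) ? tp / n_got  : 0
            recall    = (n_want > 0) ? tp / n_want : 0
            printf "%.2f %.2f\n", precision, recall
        }
    ' "$expected_file" "$emitted_file"
}

# Demo with made-up fixture data: the right line, but a synonym label.
demo_dir=$(mktemp -d)
printf '3:panic\n10:overflow\n17:unsafe\n'        > "$demo_dir/expected"
printf '3:unwrap-panic\n10:overflow\n17:unsafe\n' > "$demo_dir/emitted"
grade "$demo_dir/expected" "$demo_dir/emitted"   # prints: 0.67 0.67
rm -rf "$demo_dir"
```

The sketch assumes a non-empty expected file; a real grader would also need to handle duplicates and the clean-file false-positive count mentioned in these commits.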

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…eat (#15) (#60)

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…ow + safety) (#15) (#61)

Single mental-model overview at the top of the canonical eval doc: the four parts
(validator/harness/critic/runner), the corpus→runner→sentinel→grader→gate data flow,
and the load-bearing safety properties. No new file — avoids a third source of truth
drifting against the README and the design spec.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
… (#63)

Convenience entry point for the maintainer harness: locates detect.sh, auto-selects
a runner (claude preferred, codex fallback) if DEFECT_SCAN_EVAL_RUNNER is unset, and
forwards all flags to 'detect.sh eval-run'. README leads with the shortcut; CI sh -n
covers it; bats verifies forwarding (stub) + the no-runner exit-3 path.
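
The runner auto-selection described above could look roughly like the sketch below. It is an assumption-laden illustration, not the shipped script: the function name `select_runner` and the error message are hypothetical, and only the `DEFECT_SCAN_EVAL_RUNNER` variable, the claude-then-codex preference, and the exit status 3 come from the commit message.

```shell
#!/bin/sh
# Hypothetical sketch of runner auto-selection (not the project's actual script).
# Prints the runner name on stdout, or returns 3 when no runner is available.
select_runner() {
    # An explicit choice in the environment always wins.
    if [ -n "${DEFECT_SCAN_EVAL_RUNNER:-}" ]; then
        printf '%s\n' "$DEFECT_SCAN_EVAL_RUNNER"
        return 0
    fi
    # Otherwise prefer claude and fall back to codex.
    for candidate in claude codex; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no eval runner found (tried: claude, codex)" >&2
    return 3
}

# A wrapper would then forward every flag unchanged, for example:
#   runner=$(select_runner) || exit $?
#   DEFECT_SCAN_EVAL_RUNNER=$runner exec ./detect.sh eval-run "$@"
# (the ./detect.sh path is a placeholder; the real script locates it itself)
```

Keeping the selection in a function that only prints a name makes it easy to stub in bats, which fits the forwarding and exit-3 checks the commit mentions.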

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…ness section (#15) (#64)

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…e architecture overview (#15) (#65)

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
…/yaml/swift (#15) (#67)

* fix(eval): runners inject the language's valid label set (eval-categories) (#15)

The eval-mode prompt named 'a language-specific label' without enumerating the
valid ones, so for custom-label languages the model invented synonyms (panic->
unwrap-panic, quoting->SC2086, coerce->norway-problem) that the exact-match grader
scored as FP+FN — artificially depressing rust/shell/yaml to 0.67 and java held-out
to 0.50 (a FALSE overfit flag). Runners now run 'detect.sh eval-categories <lang>'
and inject the exact valid set. Verified: rust unwrap now emits '3:panic' (was
'unwrap-panic'), an exact match.
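
As a rough illustration of the fix, a runner could build its prompt clause from the enumerated labels as sketched below. The only piece taken from this commit is the `detect.sh eval-categories <lang>` subcommand; its output format (assumed here to be one label per line), the `DETECT_SH` variable, both function names, and the prompt wording are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of label injection (not the project's runner code).

# Thin wrapper around the real subcommand; assumes it prints one label per line.
list_categories() {
    "${DETECT_SH:-./detect.sh}" eval-categories "$1"
}

# Print a prompt clause that enumerates the exact valid labels for a language.
# Returns 1 if the label set cannot be obtained or is empty, so a runner can
# refuse to proceed rather than let the model guess.
build_label_clause() {
    lang=$1
    labels=$(list_categories "$lang") || return 1
    [ -n "$labels" ] || return 1
    # Join the newline-separated labels with ", " (assumes no commas in labels).
    joined=$(printf '%s\n' "$labels" | paste -sd, - | sed 's/,/, /g')
    printf 'Use exactly one of these category labels: %s. Do not invent synonyms.\n' "$joined"
}
```

Failing closed when the label set is missing mirrors the point of the fix: without the enumerated set, the exact-match grader would again penalize correct findings that carry synonym labels.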

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* chore(eval): record measured seen baselines for rust/shell/yaml/swift (#15)

Re-measured with the label-fixed runner: all four now 1.00/1.00 (±0.00). rust/shell/
yaml were the 0.67 synonym-label artifacts (now resolved by injecting eval-categories);
swift's earlier PARTIAL transient cleared. Completes measured SEEN baselines for all 13
languages. java held-out (0.25, persistent overfit FLAG) intentionally NOT committed —
fixture-quality issue tracked separately; it keeps its conservative placeholder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@evemcgivern evemcgivern requested a review from a team as a code owner June 16, 2026 17:48
@evemcgivern evemcgivern merged commit bbe0343 into main Jun 16, 2026
8 checks passed