fix(eval): deterministic self-contained eval runners + sharpened java held-out corpus by evemcgivern · Pull Request #74 · stylusnexus/defect-scan

evemcgivern · 2026-06-17T13:05:35Z

Eval-harness reliability + corpus quality. Three eval-scoped changes since 1.8.0:

fix(eval): deterministic self-contained eval runners (#71 → #72)
The eval runners injected category labels as opaque IDs (cat#1..5) without their
definitions; the headless, sandboxed scan couldn't read baseline-categories.md, so it
thrashed on denied reads (intermittently missing the <<<EVAL>>> block → PARTIAL) and
guessed categories (cat#2-vs-cat#3 flips). Now both runners inject a cat#N = <definition>
legend inline (read-only, from the repo) and steer the model to reason directly on the single
fixture. java/held-out went from 0.33(±0.47)/PARTIAL to 1.00/1.00(±0.00). Affects every
language's calibration, not just java.
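
A minimal sketch of the inline-legend idea, for illustration only. The real runners are shell scripts and their prompt text is not shown in this PR; Java is used here purely to match the fixtures' language. The names `LegendSketch`, `renderLegend`, and `buildPrompt`, the prompt wording, and the placeholder definitions are all assumptions, not the project's code.

```java
import java.util.List;

// Hypothetical illustration: turn an ordered list of category definitions
// into a "cat#N = <definition>" legend and place it in the prompt itself,
// so the sandboxed scan never has to read a definitions file.
final class LegendSketch {

    // Renders one "cat#N = <definition>" line per category, numbered from 1.
    static String renderLegend(List<String> definitions) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < definitions.size(); i++) {
            sb.append("cat#").append(i + 1)
              .append(" = ").append(definitions.get(i)).append('\n');
        }
        return sb.toString();
    }

    // Assumed prompt wording; only the structure (legend inline, single
    // fixture, "always emit the block") mirrors the PR description.
    static String buildPrompt(String legend, String fixtureSource) {
        return "Category definitions are provided below; no other files need to be read.\n"
            + legend
            + "Reason directly on this single fixture and always emit the <<<EVAL>>> block.\n"
            + "--- fixture ---\n"
            + fixtureSource;
    }

    public static void main(String[] args) {
        // Placeholder definitions: the real text lives in baseline-categories.md.
        String legend = renderLegend(List.of("<definition 1>", "<definition 2>"));
        System.out.print(buildPrompt(legend, "class Example {}\n"));
    }
}
```

Because the legend travels with the prompt, the scan's sandbox stays unchanged; only the prompt grows.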

test(eval): sharpen java held-out fixtures + calibrate baseline (#68 → #73)
The agent-authored held-out java fixtures were ambiguous (a bare File constructor for
"traversal"; a swallowed interrupt that reads as cat#2 or cat#5), so a real model scored the
split 0.25/0.25 and tripped the overfit FLAG — a fixture-quality problem, not profile overfit.
Rewrote them as realized, unambiguous defects (realized path-traversal read → cat#3; swallowed
parse → cat#2), each with a clean precision twin, and recalibrated baseline.held-out.txt to the
measured 1.00/1.00. The seen-vs-held-out gap closes; the FLAG correctly clears.
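
To make "realized defect with a clean precision twin" concrete, here is a hedged sketch of what the two pairs could look like. The method names follow the fixture names in the #73 commit message (BugPathTraversal, CleanSafePath, BugSwallowedParse, CleanReportedParse), but the bodies, the allowed-file list, and the `-1` sentinel are assumptions; the actual fixture sources are not shown in this PR.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;

// Hypothetical stand-ins for the four held-out fixtures.
final class HeldOutFixtureSketch {

    // BugPathTraversal (cat#3): the user-controlled name reaches a real read,
    // so "../secret.txt" escapes baseDir. Nothing validates the resolved path.
    static byte[] bugPathTraversal(Path baseDir, String userName) throws IOException {
        return Files.readAllBytes(baseDir.resolve(userName));
    }

    // Assumed whitelist for the clean twin.
    private static final Set<String> ALLOWED = Set.of("report.txt", "summary.txt");

    // CleanSafePath: the same read, but only behind a strict whitelist.
    static byte[] cleanSafePath(Path baseDir, String userName) throws IOException {
        if (!ALLOWED.contains(userName)) {
            throw new IllegalArgumentException("file not allowed: " + userName);
        }
        return Files.readAllBytes(baseDir.resolve(userName));
    }

    // BugSwallowedParse (cat#2): the NumberFormatException is swallowed and
    // replaced by a sentinel, so callers cannot tell bad input from a real -1.
    static int bugSwallowedParse(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // CleanReportedParse: the parse error propagates; nothing is swallowed.
    static int cleanReportedParse(String raw) {
        return Integer.parseInt(raw);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("fixture-sketch");
        Path base = Files.createDirectory(root.resolve("public"));
        Files.write(root.resolve("secret.txt"), "secret".getBytes(StandardCharsets.UTF_8));
        Files.write(base.resolve("report.txt"), "report".getBytes(StandardCharsets.UTF_8));

        // The buggy read follows "../" out of the base directory.
        System.out.println(new String(
            bugPathTraversal(base, "../secret.txt"), StandardCharsets.UTF_8));

        // The clean twin rejects the same input before any read happens.
        try {
            cleanSafePath(base, "../secret.txt");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        System.out.println(bugSwallowedParse("abc"));
        try {
            cleanReportedParse("abc");
        } catch (NumberFormatException e) {
            System.out.println("reported: " + e.getMessage());
        }
    }
}
```

Each clean twin keeps the same surface operation (a file read, a parse) so that a scan flagging the operation itself, rather than the missing safeguard, would register as a false positive.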

docs(eval): CLAUDE.md model-free grading / eval-run details
Documents the two model-free eval layers (grader vs orchestrator) and DEFECT_SCAN_EVAL_RUNNER.

Gates: bats 147/147 (ubuntu + macos), runners POSIX-clean, no version/compat gates touched.


…s scoring (#72)

The runners injected the valid label set as opaque IDs (cat#1..5) but not their definitions. The headless scan is sandboxed to the temp work dir, so the model couldn't read baseline-categories.md to learn what the numbers mean. It thrashed on denied reads (intermittently never emitting the <<<EVAL>>> block, scored as PARTIAL) and guessed categories (cat#2-vs-cat#3 flips on the same fixture).

Inject the category legend inline from baseline-categories.md (read-only, no sandbox loosening, no plugin-path discovery), tell the model the definitions are provided so it need not read skill files, steer it to reason directly on the single fixture, and strengthen "always emit the block."

Symmetric across claude.sh and codex.sh; divergence between runners is a bug.

java/held-out went from 0.33(±0.47)/PARTIAL to 1.00/1.00(±0.00), deterministic.

Closes #71

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
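
As a reading aid for the "0.33(±0.47)" and "1.00(±0.00)" figures, the sketch below shows one way such a mean and spread can arise. It assumes three runs scored 0 or 1 and a population standard deviation; the per-run scores and the harness's actual formula are not given in this PR, so both are assumptions, as are all names in the code.

```java
import java.util.Locale;

// Hypothetical illustration of a "mean(±stddev)" summary over eval runs.
final class RunSpreadSketch {

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Population standard deviation (divides by n, not n - 1).
    static double populationStdDev(double[] xs) {
        double m = mean(xs);
        double sq = 0.0;
        for (double x : xs) {
            sq += (x - m) * (x - m);
        }
        return Math.sqrt(sq / xs.length);
    }

    static String summarize(double[] xs) {
        return String.format(Locale.ROOT, "%.2f(\u00b1%.2f)", mean(xs), populationStdDev(xs));
    }

    public static void main(String[] args) {
        // Assumed per-run scores: one hit and two misses before the fix.
        System.out.println(summarize(new double[] {1.0, 0.0, 0.0})); // 0.33(±0.47)
        // Three identical runs after the fix give zero spread.
        System.out.println(summarize(new double[] {1.0, 1.0, 1.0})); // 1.00(±0.00)
    }
}
```

With these assumed scores the mean is 1/3 and the variance is 2/9, whose square root is about 0.47. A sample standard deviation (dividing by n - 1) would give about 0.58 instead.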

… (#73)

The held-out java fixtures were agent-authored in #15 and never real-model validated; they were ambiguous, so a real model scored the split 0.25/0.25 and tripped the seen-vs-held-out overfit FLAG: a fixture-quality problem, not a profile-overfit one.

Rewrite them as unambiguous, realized defects, each with a clean precision twin:

- BugPathTraversal: realized Files.readAllBytes(...) of the user-controlled path (was a bare new File(...) constructor a model reasonably won't flag) -> cat#3
- BugSwallowedInterrupt -> BugSwallowedParse: swallowed NumberFormatException to a sentinel, no concurrency flavor, so no cat#2-vs-cat#5 ambiguity -> cat#2
- CleanSafePath: path read behind a strict whitelist (precision twin for cat#3)
- CleanReportedParse: parse error propagates, nothing swallowed (twin for cat#2)

Recalibrate baseline.held-out.txt to the measured 1.00/1.00 (runs=3, ±0.00, clean_fp_runs=0, gate PASS) now that the runners are deterministic (#71/#72). The seen-vs-held-out gap closes, so the overfit FLAG correctly clears.

Closes #68

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

evemcgivern and others added 4 commits June 16, 2026 12:52

chore(deploy): back-merge release 1.8.0 stamp into dev

0cdc037

docs(eval): enhance evaluation process documentation with model-free …

0ea5534

…grading details

evemcgivern requested a review from a team as a code owner June 17, 2026 13:05

evemcgivern merged commit 257b77d into main Jun 17, 2026
8 checks passed

evemcgivern merged 4 commits into main from dev
