test(eval): sharpen java held-out fixtures and calibrate baseline (#68) by evemcgivern · Pull Request #73 · stylusnexus/defect-scan

evemcgivern · 2026-06-17T12:36:14Z

What

Sharpen the java held-out eval fixtures and calibrate the split's baseline.

The held-out fixtures were agent-authored in #15 and never validated against a real model.
They were ambiguous, so a real model scored the split 0.25/0.25 and tripped the
seen-vs-held-out overfit FLAG. That is a fixture-quality problem, not a profile-overfit
one (the same runner fix made rust/shell/yaml/swift perfect).

Changes

Every defect is now realized and unambiguous, and each buggy fixture has a clean
precision twin (mirrors the seen split's 1.00-scoring balance):

| fixture | category | why unambiguous |
| --- | --- | --- |
| `BugPathTraversal` | cat#3 | realized `Files.readAllBytes(Paths.get("/data/" + name))`: an actual read, not a bare `new File(...)` constructor |
| `BugSwallowedParse` (was `BugSwallowedInterrupt`) | cat#2 | swallowed `NumberFormatException` → sentinel; no concurrency flavor, so no cat#2-vs-cat#5 split |
| `CleanSafePath` (was `CleanRestoresInterrupt`) | clean | path read behind a strict whitelist; precision twin for cat#3 |
| `CleanReportedParse` (new) | clean | parse error propagates, nothing swallowed; precision twin for cat#2 |
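
For orientation, the four fixture shapes described in the table could look roughly like the sketch below. This is an illustration reconstructed from the table only, not the fixture files in this PR: the class name `HeldOutFixtureSketch`, the method names, the whitelist contents, and the `-1` sentinel are all assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Set;

// Illustrative sketch only; names and bodies are assumptions, not the real fixtures.
class HeldOutFixtureSketch {

    // cat#3 (buggy): the caller-supplied name is concatenated into the path and
    // actually read, so "../" segments can escape /data/.
    static byte[] bugPathTraversal(String name) throws IOException {
        return Files.readAllBytes(Paths.get("/data/" + name));
    }

    // Hypothetical whitelist; the real fixture's allowed names are not shown in the PR.
    private static final Set<String> ALLOWED = Set.of("report.txt", "summary.txt");

    // Clean twin for cat#3: the same read, but only for names on a strict whitelist.
    static byte[] cleanSafePath(String name) throws IOException {
        if (!ALLOWED.contains(name)) {
            throw new IllegalArgumentException("file not allowed: " + name);
        }
        return Files.readAllBytes(Paths.get("/data/" + name));
    }

    // cat#2 (buggy): the parse failure is swallowed and replaced by a sentinel,
    // so callers cannot tell bad input apart from a legitimate -1.
    static int bugSwallowedParse(String raw) {
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return -1; // assumed sentinel value
        }
    }

    // Clean twin for cat#2: no catch block, so NumberFormatException propagates.
    static int cleanReportedParse(String raw) {
        return Integer.parseInt(raw);
    }
}
```

Each buggy method differs from its clean twin in exactly one respect (the missing whitelist check, the swallowing `catch`), which is what lets a twin test precision as well as recall.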

Calibration

baseline.held-out.txt recalibrated from the conservative placeholder (0.80/0.50) to the
measured 1.00/1.00 (runs=3, ±0.00, clean_fp_runs=0, gate PASS), now that the
runners are deterministic (#71 / #72). Floors and bands unchanged. The seen-vs-held-out gap
closes → the overfit FLAG correctly clears.
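
The arithmetic behind the FLAG clearing can be sketched as follows. This is a minimal illustration, not the runner's actual logic: the class `OverfitGapSketch`, the method `overfitFlag`, and the `0.20` threshold are hypothetical, since this PR does not state the real band.

```java
// Illustrative sketch only; the real runner's gap check is not shown in this PR.
class OverfitGapSketch {

    // Hypothetical threshold; the actual band is not stated in the PR.
    static final double MAX_GAP = 0.20;

    // Flags when the seen split outscores the held-out split by more than MAX_GAP.
    static boolean overfitFlag(double seenScore, double heldOutScore) {
        return (seenScore - heldOutScore) > MAX_GAP;
    }
}
```

Under these assumptions, a seen score of 1.00 against the old held-out 0.25 gives a gap of 0.75 and raises the flag, while 1.00 against the recalibrated 1.00 gives a gap of 0.00 and clears it.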

Depends on

#72 (runner determinism) — already merged to dev. Without it the 1.00/1.00 baseline
would not be reproducible.

Closes #68

🤖 Generated with Claude Code

The held-out java fixtures were agent-authored in #15 and never real-model validated; they were ambiguous, so a real model scored the split 0.25/0.25 and tripped the seen-vs-held-out overfit FLAG, a fixture-quality problem, not a profile-overfit one.

Rewrite them as unambiguous, realized defects, each with a clean precision twin:

- BugPathTraversal: realized Files.readAllBytes(...) of the user-controlled path (was a bare new File(...) constructor a model reasonably won't flag) -> cat#3
- BugSwallowedInterrupt -> BugSwallowedParse: swallowed NumberFormatException to a sentinel, no concurrency flavor, so no cat#2-vs-cat#5 ambiguity -> cat#2
- CleanSafePath: path read behind a strict whitelist (precision twin for cat#3)
- CleanReportedParse: parse error propagates, nothing swallowed (twin for cat#2)

Recalibrate baseline.held-out.txt to the measured 1.00/1.00 (runs=3, ±0.00, clean_fp_runs=0, gate PASS) now that the runners are deterministic (#71/#72). The seen-vs-held-out gap closes, so the overfit FLAG correctly clears.

Closes #68

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

evemcgivern requested a review from a team as a code owner June 17, 2026 12:36

evemcgivern merged commit db7dc11 into dev Jun 17, 2026
4 checks passed

evemcgivern deleted the feat/68-java-held-out-fixtures branch June 17, 2026 12:38

This was referenced Jun 17, 2026

fix(eval): deterministic self-contained eval runners + sharpened java held-out corpus #74

Merged

Re-baseline all languages against the deterministic eval runner (#72 follow-up) #76

Open

evemcgivern merged 1 commit into dev from feat/68-java-held-out-fixtures
