fix(eval): make eval runners self-contained for deterministic headless scoring by evemcgivern · Pull Request #72 · stylusnexus/defect-scan

evemcgivern · 2026-06-17T12:32:31Z

What

Make the eval runners self-contained so headless scoring is deterministic.

The runners injected the valid label set as opaque IDs (cat#1..5) but not their
definitions. The headless scan is sandboxed to the temp work dir, so the model could
not read the skill's baseline-categories.md to learn what those numbers mean. It both:

thrashed on denied reads → intermittently never emitted the <<<EVAL>>> block (PARTIAL), and
guessed categories → cat#2-vs-cat#3 flips on the same fixture.

Fix

Build a cat#N = <definition> legend from baseline-categories.md (read-only, in the
dev repo — no sandbox loosening, no plugin-path discovery) and inject it into the prompt.
Tell the model the definitions are provided inline, so it need not read skill files; steer
it to reason directly on the single fixture (the scope/triage/tool stages are N/A for one
file and only generate denied-read noise).
Strengthen "always emit the block."
Applied symmetrically to claude.sh and codex.sh — divergence between runners is a bug.

Evidence

java/held-out, --runs 3:

before: mean_precision=0.33(±0.47) recall=0.17 + PARTIAL
after: mean_precision=1.00(±0.00) recall=1.00(±0.00) + gate PASS

The ±0.00 is the point — the runner is now deterministic.

Scope / safety

Read-only invariant preserved (write-deny tools unchanged; legend is read at runner-build
time from the dev repo). bats tests/detect.bats green (147/0), sh -n clean.
Cross-cutting: this fixes nondeterminism for every language's calibration. Other
languages' baselines were set with the old flaky runner and may now measure higher — a
re-baseline sweep is noted as follow-up in Eval runners inject opaque category IDs without definitions — nondeterministic headless scoring #71.

Closes #71

🤖 Generated with Claude Code

…s scoring The runners injected the valid label set as opaque IDs (cat#1..5) but not their definitions. The headless scan is sandboxed to the temp work dir, so the model couldn't read baseline-categories.md to learn what the numbers mean — it thrashed on denied reads (intermittently never emitting the <<<EVAL>>> block, scored as PARTIAL) and guessed categories (cat#2-vs-cat#3 flips on the same fixture). Inject the category legend inline from baseline-categories.md (read-only, no sandbox loosening, no plugin-path discovery), tell the model the definitions are provided so it need not read skill files, steer it to reason directly on the single fixture, and strengthen "always emit the block." Symmetric across claude.sh and codex.sh — divergence between runners is a bug. java/held-out went from 0.33(±0.47)/PARTIAL to 1.00/1.00(±0.00), deterministic. Closes #71 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… (#73) The held-out java fixtures were agent-authored in #15 and never real-model validated; they were ambiguous, so a real model scored the split 0.25/0.25 and tripped the seen-vs-held-out overfit FLAG — a fixture-quality problem, not a profile-overfit one. Rewrite them as unambiguous, realized defects, each with a clean precision twin: - BugPathTraversal: realized Files.readAllBytes(...) of the user-controlled path (was a bare new File(...) constructor a model reasonably won't flag) -> cat#3 - BugSwallowedInterrupt -> BugSwallowedParse: swallowed NumberFormatException to a sentinel, no concurrency flavor, so no cat#2-vs-cat#5 ambiguity -> cat#2 - CleanSafePath: path read behind a strict whitelist (precision twin for cat#3) - CleanReportedParse: parse error propagates, nothing swallowed (twin for cat#2) Recalibrate baseline.held-out.txt to the measured 1.00/1.00 (runs=3, ±0.00, clean_fp_runs=0, gate PASS) now that the runners are deterministic (#71/#72). The seen-vs-held-out gap closes, so the overfit FLAG correctly clears. Closes #68 Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

…bels (#77) #72 made the runner inject the cat#N definitions inline (from baseline-categories.md), not only the eval-categories label set — that was the fix for nondeterministic headless scoring. Update the data-flow diagram to match. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

evemcgivern requested a review from a team as a code owner June 17, 2026 12:32

evemcgivern merged commit aa57158 into dev Jun 17, 2026
4 checks passed

evemcgivern deleted the fix/71-eval-runner-self-contained branch June 17, 2026 12:35

evemcgivern mentioned this pull request Jun 17, 2026

test(eval): sharpen java held-out fixtures and calibrate baseline (#68) #73

Merged

evemcgivern mentioned this pull request Jun 17, 2026

fix(eval): deterministic self-contained eval runners + sharpened java held-out corpus #74

Merged

github-actions Bot mentioned this pull request Jun 17, 2026

chore(main): release 1.8.1 #75

Merged

This was referenced Jun 17, 2026

Re-baseline all languages against the deterministic eval runner (#72 follow-up) #76

Open

docs(eval): runner injects category definitions, not just labels #77

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(eval): make eval runners self-contained for deterministic headless scoring#72

fix(eval): make eval runners self-contained for deterministic headless scoring#72
evemcgivern merged 1 commit into
devfrom
fix/71-eval-runner-self-contained

evemcgivern commented Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

evemcgivern commented Jun 17, 2026

What

Fix

Evidence

Scope / safety

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant