fix(eval): deterministic self-contained eval runners + sharpened java held-out corpus #74

Merged: evemcgivern merged 4 commits into main from dev on Jun 17, 2026

@evemcgivern

Contributor

Eval-harness reliability + corpus quality. Three eval-scoped changes since 1.8.0:

fix(eval): deterministic self-contained eval runners (#71, #72)
The eval runners injected category labels as opaque IDs (cat#1..5) without their
definitions; the headless, sandboxed scan couldn't read baseline-categories.md, so it
thrashed on denied reads (intermittently missing the <<<EVAL>>> block → PARTIAL) and
guessed categories (cat#2-vs-cat#3 flips). Now both runners inject a cat#N = <definition>
legend inline (read-only, from the repo) and steer the model to reason directly on the single
fixture. java/held-out went from 0.33(±0.47)/PARTIAL to 1.00/1.00(±0.00). Affects every
language's calibration, not just java.
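
For illustration, the inline legend injection described above could look roughly like the sketch below. It is not the code from claude.sh or codex.sh: the function names `build_legend` and `build_prompt`, the prompt wording, and the assumed `cat#N: <definition>` line layout of baseline-categories.md are all hypothetical.

```shell
#!/bin/sh
# Sketch only. Hypothetical helpers; the real runners and the real layout of
# baseline-categories.md are not shown in this PR.
# Assumed layout: one "cat#N: <definition>" line per category.

# build_legend FILE
# Print one "cat#N = <definition>" line per category line found in FILE.
# FILE is only read; nothing is written and the sandbox is left unchanged.
build_legend() {
  sed -n 's/^\(cat#[0-9][0-9]*\):[[:space:]]*\(.*\)$/\1 = \2/p' "$1"
}

# build_prompt CATEGORIES_FILE FIXTURE_NAME
# Assemble a scan prompt that carries the legend inline, so the sandboxed
# model does not need to open any skill file to learn what cat#N means.
build_prompt() {
  printf '%s\n' 'Category definitions (provided inline; do not read skill files):'
  build_legend "$1"
  printf '%s\n' "Reason directly on the single fixture: $2"
  printf '%s\n' 'Always emit the <<<EVAL>>> block.'
}
```

Keeping both helpers in a shared snippet that each runner sources would be one way to keep the two runners symmetric, though the PR does not say how that symmetry is enforced.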

test(eval): sharpen java held-out fixtures + calibrate baseline (#68, #73)
The agent-authored held-out java fixtures were ambiguous (a bare File constructor for
"traversal"; a swallowed interrupt that reads as cat#2 or cat#5), so a real model scored the
split 0.25/0.25 and tripped the overfit FLAG — a fixture-quality problem, not profile overfit.
Rewrote them as realized, unambiguous defects (realized path-traversal read → cat#3; swallowed
parse → cat#2), each with a clean precision twin, and recalibrated baseline.held-out.txt to the
measured 1.00/1.00. The seen-vs-held-out gap closes; the FLAG correctly clears.

docs(eval): CLAUDE.md model-free grading / eval-run details
Documents the two model-free eval layers (grader vs orchestrator) and DEFECT_SCAN_EVAL_RUNNER.

Gates: bats 147/147 (ubuntu + macos), runners POSIX-clean, no version/compat gates touched.

evemcgivern and others added 4 commits June 16, 2026 12:52
…s scoring (#72)

The runners injected the valid label set as opaque IDs (cat#1..5) but not their
definitions. The headless scan is sandboxed to the temp work dir, so the model
couldn't read baseline-categories.md to learn what the numbers mean — it thrashed
on denied reads (intermittently never emitting the <<<EVAL>>> block, scored as
PARTIAL) and guessed categories (cat#2-vs-cat#3 flips on the same fixture).

Inject the category legend inline from baseline-categories.md (read-only, no
sandbox loosening, no plugin-path discovery), tell the model the definitions are
provided so it need not read skill files, steer it to reason directly on the single
fixture, and strengthen "always emit the block." Symmetric across claude.sh and
codex.sh — divergence between runners is a bug.

java/held-out went from 0.33(±0.47)/PARTIAL to 1.00/1.00(±0.00), deterministic.
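
As a sanity check on those figures: 0.33(±0.47) is what you get from three runs scoring 0, 0, 1 when the spread is a population standard deviation, and three runs of 1 give 1.00(±0.00). The per-run scores and the choice of population SD are assumptions here, since the message reports only the aggregates; `mean_sd` is a hypothetical helper, not part of the eval harness.

```shell
#!/bin/sh
# Sketch only. mean_sd is a hypothetical helper for checking the arithmetic.

# mean_sd
# Read one numeric score per line on stdin and print "mean(±sd)",
# where sd is the population standard deviation (divide by n, not n-1).
mean_sd() {
  awk '{ s += $1; q += $1 * $1; n++ }
       END {
         if (n == 0) exit 1            # no input: fail instead of dividing by zero
         m = s / n
         v = q / n - m * m             # population variance: E[x^2] - (E[x])^2
         if (v < 0) v = 0              # guard against tiny negative rounding error
         printf "%.2f(±%.2f)\n", m, sqrt(v)
       }'
}
```

For example, `printf '0\n0\n1\n' | mean_sd` prints `0.33(±0.47)` and `printf '1\n1\n1\n' | mean_sd` prints `1.00(±0.00)`.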

Closes #71

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
… (#73)

The held-out java fixtures were agent-authored in #15 and never real-model
validated; they were ambiguous, so a real model scored the split 0.25/0.25 and
tripped the seen-vs-held-out overfit FLAG — a fixture-quality problem, not a
profile-overfit one.

Rewrite them as unambiguous, realized defects, each with a clean precision twin:
- BugPathTraversal: realized Files.readAllBytes(...) of the user-controlled path
  (was a bare new File(...) constructor a model reasonably won't flag) -> cat#3
- BugSwallowedInterrupt -> BugSwallowedParse: swallowed NumberFormatException to a
  sentinel, no concurrency flavor, so no cat#2-vs-cat#5 ambiguity -> cat#2
- CleanSafePath: path read behind a strict whitelist (precision twin for cat#3)
- CleanReportedParse: parse error propagates, nothing swallowed (twin for cat#2)

Recalibrate baseline.held-out.txt to the measured 1.00/1.00 (runs=3, ±0.00,
clean_fp_runs=0, gate PASS) now that the runners are deterministic (#71/#72). The
seen-vs-held-out gap closes, so the overfit FLAG correctly clears.

Closes #68

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
@evemcgivern evemcgivern requested a review from a team as a code owner June 17, 2026 13:05
@evemcgivern evemcgivern merged commit 257b77d into main Jun 17, 2026
8 checks passed