fix(eval): make eval runners self-contained for deterministic headless scoring#72

Merged
evemcgivern merged 1 commit into dev from fix/71-eval-runner-self-contained
Jun 17, 2026

@evemcgivern

Contributor

What

Make the eval runners self-contained so headless scoring is deterministic.

The runners injected the valid label set as opaque IDs (cat#1..5) but not their
definitions. The headless scan is sandboxed to the temp work dir, so the model could
not read the skill's baseline-categories.md to learn what those numbers mean. It both:

  • thrashed on denied reads → intermittently never emitted the <<<EVAL>>> block (PARTIAL), and
  • guessed categories → cat#2-vs-cat#3 flips on the same fixture.
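
To make the two failure modes concrete, here is a minimal sketch of how a scorer might classify one run's output. It is not the repo's scorer: only the <<<EVAL>>> marker, the cat#N label shape, and the PARTIAL status come from this PR. The function names, the "OK" status, and the assumption that everything after the marker is the block are hypothetical.

```python
import re

EVAL_MARKER = "<<<EVAL>>>"         # marker named in this PR
LABEL_RE = re.compile(r"cat#\d+")  # label shape named in this PR


def score_status(model_output: str) -> tuple[str, list[str]]:
    """Classify one run: PARTIAL if the block never appeared, else OK plus labels.

    Assumption: everything after the marker counts as the block.
    """
    _, found, tail = model_output.partition(EVAL_MARKER)
    if not found:
        return "PARTIAL", []
    return "OK", LABEL_RE.findall(tail)


def is_stable(outputs: list[str]) -> bool:
    """True only when every run on the same fixture yields the same result."""
    results = {(status, tuple(labels)) for status, labels in map(score_status, outputs)}
    return len(results) == 1
```

Under these assumptions, a run that spends its budget on denied reads comes back as PARTIAL, and two runs that answer cat#2 and cat#3 for the same fixture make is_stable return False.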

Fix

  • Build a cat#N = <definition> legend from baseline-categories.md (read-only, in the
    dev repo — no sandbox loosening, no plugin-path discovery) and inject it into the prompt.
  • Tell the model the definitions are provided inline, so it need not read skill files; steer
    it to reason directly on the single fixture (the scope/triage/tool stages are N/A for one
    file and only generate denied-read noise).
  • Strengthen the "always emit the <<<EVAL>>> block" instruction.
  • Applied symmetrically to claude.sh and codex.sh — divergence between runners is a bug.
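
The legend step above can be sketched roughly as follows. This is an illustration, not the code in claude.sh or codex.sh. It assumes baseline-categories.md lists each category as a numbered item ("N. definition"), which this PR does not confirm. The function names and the prompt wording are hypothetical.

```python
import re

# Assumed (hypothetical) layout: one numbered list item per category, "N. definition".
ITEM_RE = re.compile(r"^\s*(\d+)\.\s+(.+?)\s*$")


def build_legend(markdown_text: str) -> str:
    """Turn numbered items into 'cat#N = <definition>' lines."""
    lines = []
    for raw in markdown_text.splitlines():
        match = ITEM_RE.match(raw)
        if match:
            lines.append(f"cat#{match.group(1)} = {match.group(2)}")
    if not lines:
        # Fail loudly: an empty legend would bring back the opaque-ID problem.
        raise ValueError("no category definitions found")
    return "\n".join(lines)


def build_prompt(legend: str, fixture_name: str) -> str:
    """Assemble the inline-definition prompt (wording is illustrative only)."""
    return (
        "Category definitions are provided inline below; "
        "you do not need to read any skill files.\n"
        f"{legend}\n"
        f"Reason directly on the single fixture {fixture_name}.\n"
        "Always emit the <<<EVAL>>> block.\n"
    )
```

A runner would read baseline-categories.md from the dev repo, pass its text to build_legend, and embed the result through build_prompt. Sharing one such helper between both runners is one way to keep them from diverging.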

Evidence

java/held-out, --runs 3:

  • before: mean_precision=0.33(±0.47) recall=0.17 + PARTIAL
  • after: mean_precision=1.00(±0.00) recall=1.00(±0.00) + gate PASS

The ±0.00 is the point — the runner is now deterministic.
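
As a worked example of the notation, the sketch below formats a mean and spread over three runs. The per-run values are hypothetical, since the PR reports only the aggregates. Population standard deviation is also an assumption; it is the choice under which one perfect run out of three reproduces the reported 0.33(±0.47).

```python
from statistics import mean, pstdev


def summarize(values: list[float]) -> str:
    """Format per-run scores as mean(±population standard deviation)."""
    return f"{mean(values):.2f}(±{pstdev(values):.2f})"


# Hypothetical per-run precision values, chosen to match the reported aggregates.
before = summarize([1.0, 0.0, 0.0])  # "0.33(±0.47)"
after = summarize([1.0, 1.0, 1.0])   # "1.00(±0.00)"
```

A spread of exactly 0.00 requires every run to produce the same score, which is why it serves as the determinism signal here.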

Scope / safety

Closes #71

🤖 Generated with Claude Code

…s scoring

The runners injected the valid label set as opaque IDs (cat#1..5) but not their
definitions. The headless scan is sandboxed to the temp work dir, so the model
couldn't read baseline-categories.md to learn what the numbers mean — it thrashed
on denied reads (intermittently never emitting the <<<EVAL>>> block, scored as
PARTIAL) and guessed categories (cat#2-vs-cat#3 flips on the same fixture).

Inject the category legend inline from baseline-categories.md (read-only, no
sandbox loosening, no plugin-path discovery), tell the model the definitions are
provided so it need not read skill files, steer it to reason directly on the single
fixture, and strengthen "always emit the block." Symmetric across claude.sh and
codex.sh — divergence between runners is a bug.

java/held-out went from 0.33(±0.47)/PARTIAL to 1.00/1.00(±0.00), deterministic.

Closes #71

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@evemcgivern evemcgivern requested a review from a team as a code owner June 17, 2026 12:32
@evemcgivern evemcgivern merged commit aa57158 into dev Jun 17, 2026
4 checks passed
@evemcgivern evemcgivern deleted the fix/71-eval-runner-self-contained branch June 17, 2026 12:35
evemcgivern added a commit that referenced this pull request Jun 17, 2026
… (#73)

The held-out java fixtures were agent-authored in #15 and never real-model
validated; they were ambiguous, so a real model scored the split 0.25/0.25 and
tripped the seen-vs-held-out overfit FLAG — a fixture-quality problem, not a
profile-overfit one.

Rewrite them as unambiguous, realized defects, each with a clean precision twin:
- BugPathTraversal: realized Files.readAllBytes(...) of the user-controlled path
  (was a bare new File(...) constructor a model reasonably won't flag) -> cat#3
- BugSwallowedInterrupt -> BugSwallowedParse: swallowed NumberFormatException to a
  sentinel, no concurrency flavor, so no cat#2-vs-cat#5 ambiguity -> cat#2
- CleanSafePath: path read behind a strict whitelist (precision twin for cat#3)
- CleanReportedParse: parse error propagates, nothing swallowed (twin for cat#2)
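
The role of the clean twins can be illustrated with a small scoring sketch. The fixture names and labels come from the list above. The function itself, and the convention of scoring 1.0 when a denominator is empty, are assumptions rather than the repo's scorer.

```python
def precision_recall(
    expected: dict[str, set[str]], predicted: dict[str, set[str]]
) -> tuple[float, float]:
    """Micro-averaged precision and recall over a set of fixtures."""
    tp = fp = fn = 0
    for fixture, want in expected.items():
        got = predicted.get(fixture, set())
        tp += len(want & got)  # labels correctly reported
        fp += len(got - want)  # labels reported but not expected
        fn += len(want - got)  # labels expected but missed
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 1.0     # assumed convention
    return precision, recall


expected = {
    "BugPathTraversal": {"cat#3"},
    "BugSwallowedParse": {"cat#2"},
    "CleanSafePath": set(),       # clean twin: any label here is a false positive
    "CleanReportedParse": set(),  # clean twin
}
```

With these expectations, a model that flags both bugs but also reports cat#3 on CleanSafePath keeps recall at 1.0 while precision drops to 2/3. That drop is what the clean twins are there to expose.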

Recalibrate baseline.held-out.txt to the measured 1.00/1.00 (runs=3, ±0.00,
clean_fp_runs=0, gate PASS) now that the runners are deterministic (#71/#72). The
seen-vs-held-out gap closes, so the overfit FLAG correctly clears.
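
For illustration only, a seen-vs-held-out gap check might look like the sketch below. The function name, the threshold, and the seen-split score are hypothetical; this commit message gives only the held-out figures (0.25 before, 1.00 after).

```python
def overfit_flag(seen_score: float, held_out_score: float, max_gap: float = 0.2) -> bool:
    """Flag when the seen split outscores the held-out split by more than max_gap.

    max_gap is a hypothetical threshold, not a value from the repo.
    """
    return (seen_score - held_out_score) > max_gap


# Hypothetical seen-split score of 1.00 against the two held-out figures.
flag_before = overfit_flag(1.00, 0.25)
flag_after = overfit_flag(1.00, 1.00)
```

Under these assumptions flag_before is True and flag_after is False, which matches the described behavior: the flag clears once the held-out fixtures are unambiguous.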

Closes #68

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
evemcgivern added a commit that referenced this pull request Jun 17, 2026
…bels (#77)

#72 made the runner inject the cat#N definitions inline (from baseline-categories.md),
not only the eval-categories label set — that was the fix for nondeterministic headless
scoring. Update the data-flow diagram to match.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>