run: scope non-canonical --mode job overlay patches to the eval Job #5
Open
elronbandel wants to merge 1 commit into
The synthesized Kustomize overlay for non-canonical `--mode job` runs
patched the runner/gateway env and pod-template labels with an unscoped
`target: { kind: Job }`. On a benchmark that ships a bespoke second Job in
the same manifest — only tau-bench today (a user-sim `harness` Job beside
the eval `tau-bench-task-0` Job) — those strategic-merge patches leaked
into the harness Job too, injecting imageless `runner` and `gateway`
containers. The result is an invalid Job spec, so
`run tau-bench --mode job --task-id N --agent X` produced a manifest that
fails admission.
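A minimal sketch of why the leak produces imageless containers, assuming the usual strategic-merge behavior in which container lists are merged by `name`. The `Container` type, the `merge_containers` function, and the image string in the usage note are illustrative and not taken from `run.rs`; ordering of merged entries is not modeled.

```rust
/// A container in a pod template. `image` is optional because an env patch
/// names the container without repeating its image.
#[derive(Clone, Debug, PartialEq)]
struct Container {
    name: String,
    image: Option<String>,
    env: Vec<(String, String)>,
}

/// Simplified strategic merge of a container list, keyed by `name`.
/// A patch entry that matches an existing container extends its env;
/// a patch entry with no match is added as a new container.
fn merge_containers(base: &[Container], patch: &[Container]) -> Vec<Container> {
    let mut merged: Vec<Container> = base.to_vec();
    for p in patch {
        let idx = merged.iter().position(|c| c.name == p.name);
        match idx {
            Some(i) => {
                merged[i].env.extend(p.env.iter().cloned());
                if let Some(img) = &p.image {
                    merged[i].image = Some(img.clone());
                }
            }
            // No container with this name: the patch entry becomes a new
            // container, and it carries no image if the patch had none.
            None => merged.push(p.clone()),
        }
    }
    merged
}
```

Applied to a pod template that only contains `harness`, an env-only patch for `runner` and `gateway` yields three containers, two of which have `image: None`.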
Scope all three patches (runner env, pod labels, gateway env) to the
canonical Job by name, and move the rename patch LAST so the name-scoped
patches still match `<bench>-task-0` before the rename changes it.
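The ordering constraint can be sketched as follows. This is an illustration under stated assumptions, not the code in `run.rs`: `Target`, `PatchKind`, `Patch`, `overlay_patches`, and `target_matches` are hypothetical names, and the target selector is reduced to `kind` plus an optional `name`.

```rust
/// Reduced patch target: `name: None` matches every resource of `kind`.
#[derive(Debug, PartialEq)]
struct Target {
    kind: &'static str,
    name: Option<String>,
}

#[derive(Debug, PartialEq)]
enum PatchKind {
    RunnerEnv,
    PodLabels,
    GatewayEnv,
    Rename,
}

#[derive(Debug)]
struct Patch {
    kind: PatchKind,
    target: Target,
}

/// Builds the patch list with every target scoped to the canonical Job name.
/// The rename comes last so the earlier patches still see `<bench>-task-0`.
fn overlay_patches(bench: &str) -> Vec<Patch> {
    let canonical = format!("{}-task-0", bench);
    let scoped = |kind| Patch {
        kind,
        target: Target { kind: "Job", name: Some(canonical.clone()) },
    };
    vec![
        scoped(PatchKind::RunnerEnv),
        scoped(PatchKind::PodLabels),
        scoped(PatchKind::GatewayEnv),
        scoped(PatchKind::Rename),
    ]
}

/// True if `target` selects a resource with the given kind and name.
fn target_matches(target: &Target, kind: &str, name: &str) -> bool {
    target.kind == kind && target.name.as_deref().map_or(true, |n| n == name)
}
```

With this shape, no patch built by `overlay_patches("tau-bench")` selects a Job named `harness`, whereas a target with `name: None` would.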
Verified by rendering the actual generated overlay:
- tau-bench: harness Job now has only [harness] (was [gateway, runner,
harness], two of them imageless); the task-N Job is unchanged.
- single-Job benchmarks (mmlu-pro, ...): render identically — no regression.
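The check described in the list above could be automated along these lines. This is only a sketch: `RenderedJob` and `imageless_containers` are hypothetical names, and the rendered manifest is assumed to be already parsed into Job names and container entries rather than read from YAML.

```rust
/// One Job from a rendered manifest, reduced to what the check needs.
struct RenderedJob {
    name: String,
    /// (container name, image) pairs; `None` means no image was rendered.
    containers: Vec<(String, Option<String>)>,
}

/// Returns "job/container" for every container that has no image.
/// An empty result means every container in every Job carries an image.
fn imageless_containers(jobs: &[RenderedJob]) -> Vec<String> {
    let mut offenders = Vec::new();
    for job in jobs {
        for (container, image) in &job.containers {
            if image.is_none() {
                offenders.push(format!("{}/{}", job.name, container));
            }
        }
    }
    offenders
}
```

Run against the harness Job as rendered before the change, such a check would report the `gateway` and `runner` entries; against the render after the change it would report nothing.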
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
f1ca5a8 to 6058ed2
6058ed2 to 4bddc01
4bddc01 to 9e3e5b6
Problem
`eval-containers run <bench> --mode job` with non-canonical axes (`--agent`/`--task-id`/`--model`) synthesizes a Kustomize overlay. Its rename patch is scoped by Job name, but the runner-env, pod-label, and gateway-env patches used an unscoped `target: { kind: Job }`.

Every benchmark renders exactly one Job except tau-bench, which ships a bespoke second Job (a user-simulation harness) in the same manifest. The unscoped patches strategic-merged a `runner` and a `gateway` container into the harness Job. Both are imageless, so the Job spec is invalid and `kubectl apply` is rejected.

Harness Job containers:
- Before: `['gateway', 'runner', 'harness']` ← 2 imageless → invalid
- After: `['harness']`

Canonical `--mode job` and all 100 single-Job benchmarks were unaffected; only non-canonical runs of tau-bench broke (silently: kustomize skips a non-matching target, so there was no error, just a bad manifest).

Fix
Scope all three patches to the canonical Job by name (the way the rename patch already was), and move the rename last so the name-scoped patches still match `<bench>-task-0` before the rename changes the name.

Verification: rendered the actual generated overlay (not a mock)
- `--task-id 5 --agent codex --model gpt-x`: harness Job → `['harness']` (leak gone); `tau-bench-task-5` → `[otelcol, gateway, runner]`, zero imageless containers.
- `mmlu-pro-task-5` Job, runner env `AGENT=codex TASK_ID=5`, all containers have images; a current-vs-fixed render diff is empty (no behavior change for the other 100 benchmarks).
- `cargo build` + `cargo test --no-run` pass; `cargo fmt --check` clean for `run.rs`.

Follow-up (separate)
No automated test covers overlay generation today; this multi-Job class of bug is exactly what the render/drift lint from the `gen-kustomization` discussion would catch. Tracking separately.

🤖 Generated with Claude Code