run: scope non-canonical --mode job overlay patches to the eval Job #5

Open
elronbandel wants to merge 1 commit into main from elron/fix-overlay-job-scope

Problem

eval-containers run <bench> --mode job with non-canonical axes (--agent / --task-id / --model) synthesizes a Kustomize overlay. Its rename patch is scoped by Job name, but the runner-env, pod-label, and gateway-env patches used an unscoped target: { kind: Job }.

Every benchmark renders exactly one Job except tau-bench, which ships a bespoke second Job (a user-simulation harness) in the same manifest. The unscoped patches strategic-merged runner and gateway containers into the harness Job as well. Those injected containers have no image, so the Job spec is invalid and kubectl apply rejects it.

| | harness Job containers | note |
|---|---|---|
| before | ['gateway', 'runner', 'harness'] | 2 imageless → invalid |
| after | ['harness'] | |

Canonical --mode job and all 100 single-Job benchmarks were unaffected; only non-canonical runs of tau-bench broke, and they broke silently: an unscoped target: { kind: Job } matches every Job in the manifest, and kustomize does not validate the patched result, so there was no error, just a bad manifest.
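To illustrate why the unscoped target reached the harness Job, here is a minimal sketch of target selection. It is a simplified model, not kustomize's implementation: only `kind` and an exact-match `name` are modelled, the `Target` and `Resource` types are hypothetical, and the harness Job is assumed to be named `harness`.

```rust
/// Simplified stand-in for a Kustomize patch target selector.
/// Assumption: only `kind` and an exact-match `name` are modelled.
#[derive(Debug, Clone)]
pub struct Target {
    pub kind: String,
    pub name: Option<String>,
}

/// Simplified stand-in for a rendered resource (hypothetical type).
#[derive(Debug, Clone)]
pub struct Resource {
    pub kind: String,
    pub name: String,
}

/// Returns the names of all resources the target selects.
/// A target without a `name` selects every resource of its kind.
pub fn matched_names(target: &Target, resources: &[Resource]) -> Vec<String> {
    resources
        .iter()
        .filter(|r| r.kind == target.kind)
        .filter(|r| target.name.as_deref().map_or(true, |n| n == r.name.as_str()))
        .map(|r| r.name.clone())
        .collect()
}
```

With a manifest holding `tau-bench-task-0` and `harness`, a target of `{ kind: Job }` selects both Jobs, while adding `name: tau-bench-task-0` selects only the eval Job.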

Fix

Scope all three patches to the canonical Job by name (the way the rename patch already was), and move the rename last so the name-scoped patches still match <bench>-task-0 before the rename changes the name.
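The ordering described above can be sketched as follows. This is not the code in run.rs: `PatchEntry`, `overlay_patches`, `render_patches_yaml`, and the patch file names are hypothetical, chosen only to show every patch targeting `<bench>-task-0` by name with the rename emitted last.

```rust
/// One entry of the overlay's `patches:` list, reduced to what matters here.
/// Hypothetical type, not taken from run.rs.
#[derive(Debug, Clone, PartialEq)]
pub struct PatchEntry {
    pub label: &'static str,
    pub target_name: String,
}

/// Builds the patch list for a benchmark. Every patch is scoped to the
/// canonical Job name, and the rename comes last so the earlier patches
/// are listed while the Job still carries its canonical name.
pub fn overlay_patches(bench: &str) -> Vec<PatchEntry> {
    let canonical = format!("{}-task-0", bench);
    ["runner-env", "pod-labels", "gateway-env", "rename"]
        .iter()
        .map(|&label| PatchEntry {
            label,
            target_name: canonical.clone(),
        })
        .collect()
}

/// Renders the entries as a `patches:` fragment of a kustomization file.
/// The `<label>.yaml` paths are illustrative assumptions.
pub fn render_patches_yaml(entries: &[PatchEntry]) -> String {
    let mut out = String::from("patches:\n");
    for e in entries {
        out.push_str(&format!(
            "  - path: {}.yaml\n    target:\n      kind: Job\n      name: {}\n",
            e.label, e.target_name
        ));
    }
    out
}
```

Keeping the rename last means no name-scoped patch has to know the post-rename name (for example `tau-bench-task-5`); all four entries share one selector.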

Verification — rendered the actual generated overlay (not a mock)

  • tau-bench --task-id 5 --agent codex --model gpt-x: harness Job → ['harness'] (leak gone); tau-bench-task-5 Job → ['otelcol', 'gateway', 'runner'], zero imageless containers.
  • mmlu-pro (single Job): one mmlu-pro-task-5 Job, runner env AGENT=codex TASK_ID=5, all containers have images — a current-vs-fixed render diff is empty (no behavior change for the other 100 benchmarks).
  • cargo build + cargo test --no-run pass; cargo fmt --check clean for run.rs.
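The "zero imageless containers" check from the list above could be automated along these lines. This is a sketch under assumptions, not the procedure used for this PR: the `Job` and `Container` types are hypothetical, hand-built stand-ins for a parsed render, and no YAML parsing is shown.

```rust
/// Minimal stand-in for a container in a rendered Job (hypothetical type).
#[derive(Debug, Clone)]
pub struct Container {
    pub name: String,
    pub image: Option<String>,
}

/// Minimal stand-in for a rendered Job (hypothetical type).
#[derive(Debug, Clone)]
pub struct Job {
    pub name: String,
    pub containers: Vec<Container>,
}

/// Returns (job name, container name) for every container whose image is
/// missing or blank. An empty result means the render passes this check.
pub fn imageless_containers(jobs: &[Job]) -> Vec<(String, String)> {
    jobs.iter()
        .flat_map(|j| {
            j.containers
                .iter()
                .filter(|c| c.image.as_deref().map_or(true, |i| i.trim().is_empty()))
                .map(move |c| (j.name.clone(), c.name.clone()))
        })
        .collect()
}
```

Run against the pre-fix tau-bench render, a check like this would report `gateway` and `runner` in the harness Job; against the fixed render it would return an empty list.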

Follow-up (separate)

No automated test covers overlay generation today; this multi-Job class of bug is exactly what the render/drift lint from the gen-kustomization discussion would catch. Tracking separately.

🤖 Generated with Claude Code

The synthesized Kustomize overlay for non-canonical `--mode job` runs
patched the runner/gateway env and pod-template labels with an unscoped
`target: { kind: Job }`. On a benchmark that ships a bespoke second Job in
the same manifest — only tau-bench today (a user-sim `harness` Job beside
the eval `tau-bench-task-0` Job) — those strategic-merge patches leaked
into the harness Job too, injecting imageless `runner` and `gateway`
containers. The result is an invalid Job spec, so
`run tau-bench --mode job --task-id N --agent X` produced a manifest that
fails admission.

Scope all three patches (runner env, pod labels, gateway env) to the
canonical Job by name, and move the rename patch LAST so the name-scoped
patches still match `<bench>-task-0` before the rename changes it.

Verified by rendering the actual generated overlay:
- tau-bench: harness Job now has only [harness] (was [gateway, runner,
  harness], two of them imageless); the task-N Job is unchanged.
- single-Job benchmarks (mmlu-pro, ...): render identically — no regression.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>