What
--mode compose and --mode container both mount a single flat named volume output at /output, and write-result writes fixed, flat paths:
- /output/task/result.json
- /output/agent/result.json, /output/agent/stdout.log, /output/agent/patch.diff
- /output/model/result.json, /output/model/gateway.log, /output/traces.jsonl
There is no <benchmark>/<task>/<agent> namespacing, so consecutive local runs overwrite each other — only the latest run's structured outputs (trajectory, gateway log, result.json) survive.
Repro
eval-containers run swe-bench --task-id sympy__sympy-24066 --agent claude-code --model ... --mode container
# writes /output/task/result.json, /output/traces.jsonl, /output/model/gateway.log
eval-containers run aime --task-id 0 --agent codex --model ... --mode container
# ^ silently clobbers the swe-bench run's /output — its trajectory + logs are gone
Hit directly while verifying SWE-bench locally: re-running the same task (to correct a model name) clobbered the prior run's /output.
Docs mismatch
tests/LOCAL.md documents a namespaced path that nothing produces:
cat output/aime/0/task/result.json # line ~189
cp output/aime/0/model/trajectory.jsonl ... # line ~141
The real output is the flat output volume at /output/task/result.json — output/aime/0/... doesn't exist on the host, so these commands don't work as written.
Impact
- Local dev: can't keep or compare runs; all but the last run's trajectory + logs are lost.
- Not a deployment issue: --mode job gives each pod its own per-pod results dir, so this is compose/container only.
Suggested fix
Namespace outputs by <benchmark>/<task>/<agent> to match the docs — e.g. have run / write-result write under /output/<benchmark>/<task>/... (and bind a host ./output), or mount a per-run volume. That makes the documented output/<bench>/<task>/... paths real and stops runs from clobbering each other.