Eval outputs not namespaced by benchmark/task — local runs overwrite each other (and LOCAL.md paths don't exist) #136

Description

@elronbandel

What

--mode compose and --mode container both mount a single flat named volume, output, at /output, and write-result writes to fixed, flat paths:

  • /output/task/result.json
  • /output/agent/result.json, /output/agent/stdout.log, /output/agent/patch.diff
  • /output/model/result.json, /output/model/gateway.log, /output/traces.jsonl

There is no <benchmark>/<task>/<agent> namespacing, so consecutive local runs overwrite each other — only the latest run's structured outputs (trajectory, gateway log, result.json) survive.

Repro

eval-containers run swe-bench --task-id sympy__sympy-24066 --agent claude-code --model ... --mode container
#   writes /output/task/result.json, /output/traces.jsonl, /output/model/gateway.log
eval-containers run aime --task-id 0 --agent codex --model ... --mode container
#   ^ silently clobbers the swe-bench run's /output — its trajectory + logs are gone

Hit directly while verifying SWE-bench locally: re-running the same task (to correct a model name) clobbered the prior run's /output.

Docs mismatch

tests/LOCAL.md documents a namespaced path that nothing produces:

cat output/aime/0/task/result.json            # line ~189
cp  output/aime/0/model/trajectory.jsonl ...  # line ~141

The real output is the flat output volume at /output/task/result.json. output/aime/0/... doesn't exist on the host, so these commands don't work as written.

Impact

  • Local dev: can't keep or compare runs; all but the last run's trajectory + logs are lost.
  • Not a deployment issue: --mode job gives each pod its own per-pod results dir, so this is compose/container only.

Suggested fix

Namespace outputs by <benchmark>/<task>/<agent> to match the docs — e.g. have run / write-result write under /output/<benchmark>/<task>/... (and bind a host ./output), or mount a per-run volume. That makes the documented output/<bench>/<task>/... paths real and stops runs from clobbering each other.
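A minimal sketch of the namespacing idea, assuming a Python helper is acceptable. The names namespaced_dir and write_result below are hypothetical stand-ins, not the project's actual functions, and the segment order <benchmark>/<task>/<agent> is taken from the suggestion above (the paths quoted from tests/LOCAL.md omit the agent segment, so the final layout is a decision for the maintainers).

```python
import json
import tempfile
from pathlib import Path


def namespaced_dir(root: Path, benchmark: str, task_id: str, agent: str) -> Path:
    """Return <root>/<benchmark>/<task_id>/<agent>, creating it if needed.

    Hypothetical helper: the real run / write-result code may differ.
    """
    for name in (benchmark, task_id, agent):
        # Reject empty segments and anything that could escape the root.
        if not name or name in {".", ".."} or "/" in name or "\\" in name:
            raise ValueError(f"unsafe path segment: {name!r}")
    out = root / benchmark / task_id / agent
    out.mkdir(parents=True, exist_ok=True)
    return out


def write_result(root: Path, benchmark: str, task_id: str, agent: str, result: dict) -> Path:
    """Hypothetical stand-in for write-result: writes task/result.json per run."""
    target = namespaced_dir(root, benchmark, task_id, agent) / "task" / "result.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(result))
    return target


def demo() -> list:
    """Write results for the two runs from the repro and list what survives."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "output"
        write_result(root, "swe-bench", "sympy__sympy-24066", "claude-code", {"run": "first"})
        write_result(root, "aime", "0", "codex", {"run": "second"})
        # Both result files exist because each run has its own directory.
        return sorted(p.relative_to(root).as_posix() for p in root.rglob("result.json"))


if __name__ == "__main__":
    for path in demo():
        print(path)
```

With this layout the two runs from the repro land in separate directories. Re-running the same benchmark, task, and agent (the model-name correction case) would still overwrite the earlier result, so keeping those runs apart would need an extra segment such as a run ID or timestamp, or the per-run volume option.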
