Eval outputs not namespaced by benchmark/task — local runs overwrite each other (and LOCAL.md paths don't exist)

## What

`--mode compose` and `--mode container` both mount a single **flat named volume** `output` at `/output`, and `write-result` writes fixed, flat paths:

- `/output/task/result.json`
- `/output/agent/result.json`, `/output/agent/stdout.log`, `/output/agent/patch.diff`
- `/output/model/result.json`, `/output/model/gateway.log`
- `/output/traces.jsonl`

There is **no `<benchmark>/<task>/<agent>` namespacing**, so **consecutive local runs overwrite each other** — only the latest run's structured outputs (trajectory, gateway log, result.json) survive.

## Repro

```bash
eval-containers run swe-bench --task-id sympy__sympy-24066 --agent claude-code --model ... --mode container
#   writes /output/task/result.json, /output/traces.jsonl, /output/model/gateway.log
eval-containers run aime --task-id 0 --agent codex --model ... --mode container
#   ^ silently clobbers the swe-bench run's /output — its trajectory + logs are gone
```

This came up while verifying SWE-bench locally: re-running the *same* task (to correct a model name) clobbered the prior run's `/output`.

## Docs mismatch

`tests/LOCAL.md` documents a namespaced path that nothing produces:

```bash
cat output/aime/0/task/result.json            # line ~189
cp  output/aime/0/model/trajectory.jsonl ...  # line ~141
```

The real output lives in the flat `output` named **volume**, mounted at `/output` inside the container (e.g. `/output/task/result.json`). `output/aime/0/...` doesn't exist on the host, so these commands fail as written.

## Impact

- **Local dev:** can't keep or compare runs; all but the last run's trajectory + logs are lost.
- **Not a deployment issue:** `--mode job` gives each pod its own results dir, so only `compose`/`container` are affected.
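
Until outputs are namespaced, one possible workaround is to copy the volume's contents into a per-run host directory after each run. This is a minimal sketch, assuming Docker is the container runtime and the volume is literally named `output` (Compose usually prefixes volume names with the project name, so check `docker volume ls` first). `snapshot_output` and the `runs/` layout are hypothetical and not part of `eval-containers`:

```bash
# Hypothetical helper: copy everything in a named volume into a host directory.
# Usage: snapshot_output <volume-name> <host-dir>
snapshot_output() {
  local volume="$1" dest="$2"
  mkdir -p "$dest"
  # Bind mounts need an absolute host path; a bare relative name would be
  # interpreted by Docker as a volume name instead.
  dest="$(cd "$dest" && pwd)"
  # DOCKER can be overridden (e.g. DOCKER=echo) to preview the command.
  "${DOCKER:-docker}" run --rm \
    -v "${volume}:/from:ro" \
    -v "${dest}:/to" \
    alpine cp -a /from/. /to/
}
```

For example, `snapshot_output output runs/swe-bench/sympy__sympy-24066/claude-code` right after the first run in the repro would preserve its files before the next run overwrites the volume.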

## Suggested fix

Namespace outputs by `<benchmark>/<task>/<agent>` to match the docs — e.g. have `run` / `write-result` write under `/output/<benchmark>/<task>/...` (and bind a host `./output`), or mount a per-run volume. That makes the documented `output/<bench>/<task>/...` paths real and stops runs from clobbering each other.
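
The namespaced layout could be sketched roughly as follows. This only illustrates the path scheme; `run_output_dir` is a hypothetical helper, and how `run` / `write-result` actually compute their paths is not shown in this issue:

```bash
# Hypothetical helper: build and create a per-run output directory.
# Usage: run_output_dir <root> <benchmark> <task-id> <agent>
run_output_dir() {
  local root="$1" part dir
  shift
  dir="$root"
  for part in "$@"; do
    # Replace "/" so an ID containing it stays a single directory level.
    part="${part//\//_}"
    dir="$dir/$part"
  done
  # Mirror the existing task/agent/model subdirectories under the run directory.
  mkdir -p "$dir/task" "$dir/agent" "$dir/model"
  printf '%s\n' "$dir"
}
```

With this scheme the two runs from the repro would land in `swe-bench/sympy__sympy-24066/claude-code` and `aime/0/codex` under the root instead of sharing one flat tree. Two caveats: the paths quoted from `tests/LOCAL.md` have no agent level, so either the docs or the layout would need adjusting, and re-running the same task with the same agent would still map to the same directory unless a run ID or timestamp level is added.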

