Agent task-id isolation (#80) is bypassed by custom-runnerArgs benchmarks (tau-bench)

## Context

#80 stops leaking the benchmark **task identity** to the agent by removing `TASK_ID` from the agent's `env -i` allow-list in `core/process-compose/process-compose.yaml` (rule 7 — a model can recall a memorized solution from an instance id). That covers every benchmark whose runner goes through `/usr/local/bin/run` → process-compose (all three modes).

## Problem

A benchmark whose chart preset (or `compose.yaml` / `container.Dockerfile`) sets a **custom `runnerArgs`/command** runs its agent *without* going through process-compose's `env -i`, so the agent inherits the runner container's environment, which still carries `EVAL_TASK_ID`/`TASK_ID`. The agent (and the model) can then see the task id, defeating #80 for that benchmark.
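The inheritance difference can be reproduced outside the repo. The following is a minimal sketch, not the project's runner: the `demo-123` value and the one-entry allow-list (`PATH` only) are made up for illustration, and the real allow-list in `core/process-compose/process-compose.yaml` may contain more entries.

```shell
#!/bin/sh
# Stand-in for the runner container env (values are made up for illustration).
export EVAL_TASK_ID=demo-123 TASK_ID=demo-123

# Direct invocation, as a custom runnerArgs command would do:
# the child process inherits the full parent environment.
direct=$(sh -c 'printf "%s" "${TASK_ID:-unset}"')

# Invocation under env -i with an allow-list that omits the task id:
# the child only sees the variables named on the env command line.
isolated=$(env -i PATH="$PATH" sh -c 'printf "%s" "${TASK_ID:-unset}"')

echo "direct:   $direct"
echo "isolated: $isolated"
```

Under these assumptions the direct child sees the task id and the `env -i` child does not, which is the gap described above.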

**Known instance: tau-bench** — `benchmarks/_chart/presets/tau-bench.yaml` sets `runnerArgs: python3 /app/agent.py …`, bypassing `/usr/local/bin/run`. (tau-bench is *shared-env*, so its task id is a dataset index — lower memorization risk than SWE-bench's instance ids — but the isolation guarantee still has a hole.)

## To do

- Audit all surfaces for custom agent invocations that bypass the process-compose `env -i` allow-list (chart presets' `runnerArgs`, per-benchmark `compose.yaml` command overrides, `container.Dockerfile`).
- Ensure each one strips the task identity from the agent's env (run the custom agent under the shared task-id-free allow-list, or `unset EVAL_TASK_ID TASK_ID` before invoking it).
- Extend the rule-7 conformance test added in #80 (`tests/sanity/check.rs::agent_env_excludes_the_task_id`) to cover custom-runnerArgs benchmarks.
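One possible shape for the `unset` option is a small wrapper around the custom command. This is a sketch under assumptions: `run_without_task_id` is a hypothetical name, not an existing helper in the repo, and the check at the end only illustrates the property the conformance test would assert. It is not the Rust test itself.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with the task identity removed from its env.
run_without_task_id() {
    # Subshell, so the unset does not affect the calling runner process.
    (
        unset EVAL_TASK_ID TASK_ID
        exec "$@"
    )
}

# Illustrative check: count task-id variables visible to the wrapped command.
# The demo-123 value is made up for this example.
export EVAL_TASK_ID=demo-123 TASK_ID=demo-123
leaked=$(run_without_task_id env | grep -c -E '^(EVAL_TASK_ID|TASK_ID)=' || true)
echo "leaked task-id variables: $leaked"
```

With a wrapper like this, a custom command such as tau-bench's `python3 /app/agent.py …` would be launched through the wrapper instead of directly. The runner keeps its own copy of the variables because the `unset` happens in a subshell. Routing the agent through the shared `env -i` allow-list remains the stricter option, since it also drops any other variable not explicitly allowed.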

## Notes

- Follow-up to #80. Low severity: tau-bench is the only known instance today, and it is shared-env. Relevant rules: 7 (agent env), 24 (triple-mode).

