Context
#80 stops leaking the benchmark task identity to the agent by removing TASK_ID from the agent's env -i allow-list in core/process-compose/process-compose.yaml (rule 7: a model can recall a memorized solution from an instance id). That covers every benchmark whose runner goes through /usr/local/bin/run → process-compose (all three modes).
Problem
A benchmark whose chart preset (or compose.yaml / container.Dockerfile) sets a custom runnerArgs/command runs its agent without going through process-compose's env -i — so it inherits the runner container env, which still carries EVAL_TASK_ID/TASK_ID. The agent (and model) can then see the task id, defeating #80 for that benchmark.
Known instance: tau-bench — benchmarks/_chart/presets/tau-bench.yaml sets runnerArgs: python3 /app/agent.py …, bypassing /usr/local/bin/run. (tau-bench is shared-env, so its task id is a dataset index — lower memorization risk than SWE-bench's instance ids — but the isolation guarantee still has a hole.)
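The difference between the two launch paths can be reproduced in isolation. The following is a minimal sketch, not the project's actual runner: the task id value, the helper names, and the one-variable allow-list (PATH only) are assumptions made for illustration.

```shell
#!/bin/sh
# Stand-in for the runner container env (the value is made up).
export TASK_ID="demo-task-42"
export EVAL_TASK_ID="demo-task-42"

# Hypothetical probe: what a child process sees for TASK_ID.
show_task_id='printf "%s\n" "${TASK_ID:-<unset>}"'

# Direct invocation, as with a custom runnerArgs: the env is inherited.
direct_view() {
    sh -c "$show_task_id"
}

# Invocation under env -i with an allow-list that omits the task id.
allowlisted_view() {
    env -i PATH="$PATH" sh -c "$show_task_id"
}

echo "direct:       $(direct_view)"
echo "allow-listed: $(allowlisted_view)"
```

The direct child reports the task id, while the child started under env -i reports it as unset, which is the gap a custom runnerArgs leaves open.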
To do
Audit all surfaces for custom agent invocations that bypass the process-compose env -i allow-list (chart presets' runnerArgs, per-benchmark compose.yaml command overrides, container.Dockerfile).
Ensure each one strips the task identity from the agent's env: either run the custom agent under the shared task-id-free allow-list, or unset EVAL_TASK_ID TASK_ID before launching it.
Extend the sanity check (tests/sanity/check.rs::agent_env_excludes_the_task_id) to cover custom-runnerArgs benchmarks.
Notes
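One possible shape for the to-do items, sketched in shell under stated assumptions: all three function names are hypothetical, the wrapper uses the unset variant rather than the shared allow-list, and the check only approximates what the Rust sanity test might assert. The preset directory is the one named in this issue.

```shell
#!/bin/sh
# Hypothetical wrapper: run a custom agent command with the task identity
# removed. The subshell keeps the caller's own environment untouched.
run_agent_without_task_id() {
    (
        unset EVAL_TASK_ID TASK_ID
        exec "$@"
    )
}

# Hypothetical check: succeeds only if the wrapped command's environment
# contains neither variable.
agent_env_is_task_id_free() {
    ! run_agent_without_task_id env | grep -Eq '^(EVAL_TASK_ID|TASK_ID)='
}

# Audit aid: list files under a directory that mention runnerArgs.
# Defaults to the preset directory named in the issue.
list_custom_runner_presets() {
    grep -rl 'runnerArgs' "${1:-benchmarks/_chart/presets}" 2>/dev/null
}
```

A preset could then invoke something like run_agent_without_task_id python3 /app/agent.py … in place of the bare command. The unset variant is a deny-list, so it only removes the two named variables; running the agent under the shared env -i allow-list is the stricter option because anything not listed is dropped.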