This document defines what a task is, how it runs, and what agents are allowed to touch.
A task is a closed-world environment with:
- a deterministic initial state
- a constrained action surface
- a single success condition
A task does not test intelligence. It tests whether an agent can operate correctly under constraints.
Each task is a self-contained directory:
tasks/
  <task_id>/
    task.toml    (preferred; task.yaml legacy)
    setup.py
    actions.py
    validate.py
    README.md    (optional)
Nothing outside this directory may influence task behavior.
task.toml is purely declarative. Legacy task.yaml files are still accepted but should be treated as read-only.
id = "filesystem_hidden_config"
suite = "filesystem"
version = 1
description = "Extract the correct configuration value from the filesystem."
deterministic = true
seed_behavior = "fixed"
[budgets]
steps = 200
tool_calls = 50
[action_surface]
source = "actions.py"
schema = "introspected"
[validator]
entrypoint = "validate.py:validate"
[sandbox]
filesystem_roots = ["/app"]
network_hosts = []
Rules:
- No logic
- No conditionals
- No imports
- Deterministic tasks must declare sandbox allowlists (filesystem_roots, network_hosts)
- Once released, this file is immutable
- Changing behavior requires a new version
setup.py creates the world.
Responsibilities:
- Create files, directories, logs, mock services
- Seed all randomness
- Initialize hidden state
Rules:
- Runs before the agent starts
- Agent cannot read or inspect setup code
- Must be deterministic given the seed
- No network access
- No wall-clock dependence
actions.py defines everything the agent can do. If an action is not here, it does not exist.
Rules:
- All actions are synchronous, logged, and budgeted
- No shell access
- No filesystem escape
- No reflection or inspection
Error handling:
- Actions return either a successful result or a structured error
- Actions never raise exceptions
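One way to satisfy the error-handling rule is to return a small result record from every action and convert any failure into an error code. In this sketch, `ActionResult`, its field names, the error codes, and the `read_file` action are all hypothetical; only the `/app` root is taken from the example manifest.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ActionResult:
    """Hypothetical result shape: either a value or a structured error."""
    ok: bool
    value: Optional[str] = None
    error_code: Optional[str] = None
    message: Optional[str] = None


SANDBOX_ROOT = Path("/app")  # mirrors filesystem_roots in the example manifest


def read_file(path: str) -> ActionResult:
    """Read a text file inside the sandbox root; never raises."""
    try:
        root = SANDBOX_ROOT.resolve()
        resolved = (SANDBOX_ROOT / path).resolve()
        # Reject absolute paths and ".." traversal that leave the root.
        if not resolved.is_relative_to(root):
            return ActionResult(False, error_code="path_escape",
                                message="path leaves the sandbox root")
        return ActionResult(True, value=resolved.read_text())
    except (OSError, ValueError) as exc:
        # Missing files, permission errors, undecodable bytes, bad path strings.
        return ActionResult(False, error_code="io_error",
                            message=type(exc).__name__)
```

Resolving the path before the containment check means that `..` segments and symlinks are evaluated first, so an escape attempt is reported as a structured error instead of being executed.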
The harness runs in discrete steps. Each step:
- Agent receives an observation
- Agent emits exactly one action
- Harness executes the action
- Result is recorded
- Budgets are decremented
No action batching. No background execution.
Every task enforces budgets. Budgets:
- steps
- tool_calls
- optional wall-clock timeout
A task ends when:
- validate() returns success
- Agent emits an invalid action
- Any budget is exhausted
- Harness encounters a fatal error
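The step loop and its termination conditions can be sketched as follows. This is not the harness's implementation: it assumes each executed action costs one step and one tool call, that validate() is checked after every step, and that actions are dicts with a "name" key. The source specifies none of these details.

```python
from dataclasses import dataclass


@dataclass
class Budgets:
    steps: int
    tool_calls: int


def run_episode(agent, execute, validate, budgets, action_names):
    """Run one episode and return (outcome, trace)."""
    trace = []
    observation = None
    while budgets.steps > 0 and budgets.tool_calls > 0:
        action = agent(observation)  # exactly one action per step
        name = action.get("name") if isinstance(action, dict) else None
        if name not in action_names:
            return "invalid_action", trace
        try:
            result = execute(action)  # synchronous; no background work
        except Exception:
            # Actions should never raise, so this is treated as a harness fault.
            return "fatal_error", trace
        trace.append((action, result))  # record the result
        budgets.steps -= 1              # then decrement budgets
        budgets.tool_calls -= 1
        if validate():
            return "success", trace
        observation = result
    return "budget_exhausted", trace
```

A toy run: with a counter that `execute` increments and a validator that succeeds when the counter reaches 3, `run_episode` returns `"success"` with a three-entry trace under a 10-step budget, and `"budget_exhausted"` under a 2-step budget.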
Agents can observe only what the task allows. Agents can see:
- Task description
- Action results
- Their own past actions
- Explicit visible state (if provided)
Agents cannot see:
- Setup code
- Validation logic
- Hidden files or state
- Ground truth answers
validate.py defines success.
Rules:
- Deterministic
- Final-state only
- No LLMs
- No partial credit
- No time-based logic
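A validator that respects these rules inspects only the final state and returns a plain boolean. The sketch below assumes a hypothetical signature `validate(root, expected)` and a hypothetical `answer.txt` output file; the manifest only names the entrypoint `validate.py:validate`, so the real arguments may differ.

```python
from pathlib import Path


# Hypothetical signature; the spec only fixes the entrypoint name.
def validate(root: Path, expected: str) -> bool:
    """Return True only if the final state matches exactly.

    No partial credit, no clock reads, no model calls: the result depends
    solely on what is on disk when the episode ends.
    """
    answer_file = root / "answer.txt"
    if not answer_file.is_file():
        return False
    return answer_file.read_text().strip() == expected
```

Because the function never looks at the trace, two agents that reach the same final state by different routes receive the same verdict.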
Given the same task id and version, random seed, and agent implementation, outcomes must be reproducible.
The harness enforces:
- Process isolation
- Read-only task metadata
- No filesystem escape (task access must stay within the manifest allowlist)
- No environment introspection
- No dynamic imports outside the task
- Per-step filesystem and network I/O is audited, logged in traces, and enforced in replay/strict modes
A good task:
- Fails brittle agents quickly
- Rewards conservative behavior
- Has exactly one right outcome
- Surfaces why the agent failed
A bad task:
- Requires guessing
- Encourages hacks
- Depends on timing
- Takes minutes to run
The task harness does not:
- Simulate the real world
- Test creativity
- Judge explanations
- Optimize for realism
- Every run produces a JSON artifact under .agent_bench/runs/ that captures metadata (run_id, trace_id, timestamps, harness version), per-step traces, and outcome metrics.
- Trace viewers (CLI + Web UI) must surface that artifact verbatim; no summarization that hides the raw steps.
- Baseline tables derive only from persisted artifacts to keep comparisons reproducible.
- When freezing a task version, capture the relevant run IDs and carry them forward as regression fixtures.
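As an illustration of the artifact described above, the sketch below writes one JSON file per run. The field names other than run_id and trace_id, the `<run_id>.json` file name, and the `write_run_artifact` helper are assumptions; the actual schema is not given in this document.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path


# Hypothetical writer; the real artifact schema may differ.
def write_run_artifact(base_dir: Path, steps: list, outcome: dict,
                       harness_version: str) -> Path:
    """Persist one run as JSON under <base_dir>/.agent_bench/runs/."""
    run_id = uuid.uuid4().hex
    artifact = {
        "run_id": run_id,
        "trace_id": uuid.uuid4().hex,
        "timestamps": {"written_at": datetime.now(timezone.utc).isoformat()},
        "harness_version": harness_version,
        "steps": steps,      # raw per-step traces, stored without summarization
        "outcome": outcome,  # outcome metrics
    }
    runs_dir = base_dir / ".agent_bench" / "runs"
    runs_dir.mkdir(parents=True, exist_ok=True)
    path = runs_dir / f"{run_id}.json"
    path.write_text(json.dumps(artifact, indent=2, sort_keys=True))
    return path
```

Timestamps belong in run metadata only. They record when a run happened and should not feed back into task behavior or validation, which must stay free of wall-clock dependence.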