| description | Task catalog and significance |
|---|---|
Use this catalog to understand what each bundled task measures, how it is wired, and why it matters to TraceCore. Every task entry links to its source directory for deeper implementation notes.

- `tasks/registry.json` is the manifest that keeps README/SPEC_FREEZE/docs in sync. When you add or bump a bundled task, update this file so downstream tooling can discover it.
- Each task directory includes a `task.toml` manifest (see `docs/task_manifest.md`) describing budgets, entrypoints, and deterministic behavior.
- External task packages can register via the `agent_bench.tasks` entry-point group. See `docs/task_plugin_template.md` for a starter layout, entry-point snippet, and the `register()` helper contract.
- The loader merges bundled manifest rows and plugin descriptors, so `agent-bench run --task your_plugin_task@1` works once the plugin package is installed.
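The merge described above can be pictured with a short sketch. This is not the actual TraceCore loader: the function names, the `id@version` index key, and the reject-on-duplicate policy are all assumptions made for illustration. Only the `agent_bench.tasks` group name and the idea that `register()` returns descriptor rows come from this document.

```python
from importlib.metadata import entry_points


def discover_plugin_rows(group="agent_bench.tasks"):
    """Collect descriptor rows from each installed plugin's register() hook."""
    rows = []
    # entry_points(group=...) needs Python 3.10 or newer.
    for ep in entry_points(group=group):
        register = ep.load()
        rows.extend(register())
    return rows


def merge_task_rows(bundled_rows, plugin_rows):
    """Index rows by 'id@version'; duplicates raise (assumed conflict policy)."""
    merged = {}
    for source, rows in (("bundled", bundled_rows), ("plugin", plugin_rows)):
        for row in rows:
            ref = f"{row['id']}@{row['version']}"
            if ref in merged:
                raise ValueError(f"duplicate task {ref} from {source} source")
            # Record where the row came from without mutating the input.
            merged[ref] = {**row, "source": source}
    return merged
```

With an index like this, a reference such as `your_plugin_task@1` resolves the same way whether the row came from `tasks/registry.json` or from an installed plugin.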
- Choose the packaging model
  - Use an in-repo bundled task when the scenario is becoming part of the maintained benchmark surface.
  - Use an external plugin package when you want to distribute tasks independently and expose them through the `agent_bench.tasks` entry-point group.
- Author the deterministic harness
  - Every task, bundled or external, should ship `task.toml`, `setup.py`, `actions.py`, and `validate.py`.
  - Keep filesystem and network access inside the manifest-declared sandbox, and use the guarded environment helpers instead of raw system calls.
  - Treat behavior as a versioned contract surface: once a task is frozen, behavior changes require a version bump.
- Register the task correctly
  - Bundled tasks belong in `tasks/registry.json`.
  - External plugins should expose a `register()` entry point in `pyproject.toml` under `[project.entry-points."agent_bench.tasks"]` and return descriptor rows with stable `id`, `suite`, `version`, and either `path` or `loader`.
  - Keep task IDs snake_case and make the installed task addressable as `task_id@version`.
- Validate before publishing
  - Run `tracecore tasks validate --path ...` against the task directory for focused checks.
  - Run `tracecore tasks validate --registry` when contributing bundled tasks or when verifying a local workspace that includes multiple maintained tasks.
  - Run `tracecore run --agent agents/toy_agent.py --task your_task@1 --seed 0 --strict-spec` to prove the task works under the deterministic runtime contract.
  - Run `python -m pytest` and `python -m ruff check agent_bench` before opening a PR or publishing a maintained plugin update.
- Document integrity and signing expectations
  - Include installation instructions, supported environment variables, and any trust/integrity workflow that operators must follow before enabling the plugin in CI.
  - If your distribution uses signed artifacts, document who signs releases, how signatures are verified, and where maintainers should record the evidence.
  - Keep release notes and onboarding docs aligned so reviewers can tell which task version, package version, and validation evidence belong together.
- Use the contributor checklist
  - Update `SPEC_FREEZE.md` when a task becomes part of the frozen benchmark surface.
  - Update `CHANGELOG.md` for maintained additions or behavioral changes.
  - Add or refresh regression tests so validators, action schemas, and plugin discovery stay covered.
  - Reference the external contributor onboarding guide for the broader PR checklist and review expectations.
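As an illustration of the "Register the task correctly" step, here is a minimal sketch of a plugin-side `register()` hook plus a row check. The row values, the helper name `check_row`, the snake_case regular expression, and the reading of "either `path` or `loader`" as "exactly one of the two" are assumptions, not the documented contract. See `task_plugin_template.md` for the authoritative layout.

```python
import re

# Assumed ID rule: lowercase snake_case that starts with a letter.
_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
_REQUIRED = ("id", "suite", "version")

# A pyproject.toml entry might look like this (hypothetical package name):
#   [project.entry-points."agent_bench.tasks"]
#   your_plugin = "your_plugin:register"


def register():
    """Hypothetical plugin hook: return descriptor rows for this package."""
    return [
        {
            "id": "your_plugin_task",
            "suite": "custom",  # hypothetical suite name
            "version": 1,
            "path": "your_plugin/tasks/your_plugin_task",  # hypothetical path
        }
    ]


def check_row(row):
    """Return the 'task_id@version' reference, or raise ValueError."""
    missing = [key for key in _REQUIRED if key not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Assumption: a row carries exactly one of 'path' or 'loader'.
    if ("path" in row) == ("loader" in row):
        raise ValueError("provide exactly one of 'path' or 'loader'")
    if not _SNAKE_CASE.match(row["id"]):
        raise ValueError(f"task id is not snake_case: {row['id']!r}")
    return f"{row['id']}@{row['version']}"
```

Running `check_row` over every row returned by `register()` in a unit test is one cheap way to catch a malformed ID before the validators see it.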

- Start here in `docs/tasks/tasks.md` for the catalog and the end-to-end onboarding map.
- Read `plugin_contribution_guide.md` for the detailed task authoring and validation checklist.
- Use `task_plugin_template.md` when packaging an external plugin with `agent_bench.tasks` entry points.
- Use `task_harness.md` and `task_manifest.md` as the contract references for harness behavior and manifest fields.
- Cross-check `../contributing/external_contributor_onboarding.md` before opening the PR.

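The commands listed under "Validate before publishing" can be chained in a small pre-publish script. The sketch below only assembles the command lines quoted in that step. The function names and the dry-run default are hypothetical, and nothing runs unless `dry_run=False` is passed.

```python
import shlex
import subprocess


def prepublish_commands(task_dir, task_ref, agent="agents/toy_agent.py"):
    """Build the validation commands in the order the checklist gives them."""
    return [
        ["tracecore", "tasks", "validate", "--path", task_dir],
        # Registry-wide check: relevant for bundled tasks and multi-task workspaces.
        ["tracecore", "tasks", "validate", "--registry"],
        ["tracecore", "run", "--agent", agent, "--task", task_ref,
         "--seed", "0", "--strict-spec"],
        ["python", "-m", "pytest"],
        ["python", "-m", "ruff", "check", "agent_bench"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for argv in commands:
        print("$", shlex.join(argv))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(argv, check=True)
```

For example, `run_all(prepublish_commands("tasks/your_task", "your_task@1"))` prints the five commands without running them.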
filesystem_hidden_config@1
- Suite: filesystem · Deterministic: ✅ · Path: `tasks/filesystem_hidden_config/`
- Core idea: forces agents to plan cautious filesystem exploration to recover `API_KEY` without brute-force traversal.
- Skills stressed:
  - Stateful search across nested directories.
  - Budget-aware exploration vs. repeated reads.
  - Validating when a clue (config file) resolves the goal.
- Why it matters: mirrors classic "find config secret" incidents where LLM agents must persist state, avoid loops, and stop once the secret is located.

- Suite: api · Deterministic: ✅ · Path: `tasks/rate_limited_api/`
- Core idea: single-endpoint API that enforces strict quotas and transient failures; agents must respect `retry_after` windows.
- Skills stressed:
  - Differentiating `rate_limited` vs. `temporary_failure` vs. fatal errors.
  - Implementing exponential/backoff-style waiting with the `wait` action.
  - Submitting the token through `set_output` only when confirmed.
- Why it matters: probes whether an agent can follow API etiquette under pressure: no handshake yet, but lots of budget management.
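The retry discipline this task probes can be sketched on the agent side as follows. The response shape (`ok`, `error`, and `retry_after` keys), the attempt cap, and the `give_up` action label are assumptions for illustration. Only the error names, the `wait` action, and `set_output` come from the task description.

```python
def plan_next_step(response, attempt, max_attempts=5, base_wait=1):
    """Return (action, wait_steps) for one API response.

    `response` is a hypothetical dict; the real payload shape may differ.
    `attempt` is the zero-based count of failed calls so far.
    """
    if response.get("ok"):
        # Only submit once the API has confirmed the token.
        return ("set_output", 0)
    if attempt >= max_attempts:
        return ("give_up", 0)
    error = response.get("error")
    if error == "rate_limited":
        # Honor the server-provided window instead of guessing.
        return ("wait", int(response.get("retry_after", base_wait)))
    if error == "temporary_failure":
        # Exponential backoff: base_wait, 2x, 4x, ...
        return ("wait", base_wait * 2 ** attempt)
    # Anything else is treated as fatal.
    return ("give_up", 0)
```

Keeping the decision in one pure function makes it easy to check that an agent never waits on a fatal error and never submits an unconfirmed token.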

- Suite: api · Deterministic: ✅ · Path: `tasks/rate_limited_chain/`
- Core idea: extends the previous API with a handshake template and chained endpoints that expire; combines instruction following with rate limits.
- Skills stressed:
  - Parsing README/templates to craft the handshake response.
  - Tracking `handshake_id` lifetimes and retry windows simultaneously.
  - Differentiating fatal vs. transient API responses to know when to restart.
- Why it matters: captures real-world auth flows (OAuth/device codes) where skipping handshake logic bricks the session.

- Suite: api · Deterministic: ✅ · Path: `tasks/deterministic_rate_service/`
- Core idea: deterministic yet unforgiving service combining handshake confirmation, required payload templates, rate limiting, and a guaranteed transient hiccup.
- Skills stressed:
  - Maintaining service state (virtual clock, retry budget, history).
  - Distinguishing `rate_limited`, `temporary_failure`, `bad_request`, `invalid_handshake`, and escalating appropriately.
  - Recovering from fatal payload errors by restarting the flow automatically.
- Why it matters: this is TraceCore’s "depth" scenario: agents must orchestrate multi-step APIs without over-spending limited tool calls, which is representative of production integration incidents.

- Suite: operations · Deterministic: ✅ · Path: `tasks/log_alert_triage/`
- Core idea: walk deterministic log artifacts and recover the final `ALERT_CODE` used for escalation.
- Skills stressed:
  - Parsing operational logs for actionable signals.
  - Following breadcrumbs across multiple files.
  - Avoiding unnecessary reads once the alert code is found.
- Why it matters: mirrors real-world log triage where the last error line controls escalation playbooks.

- Suite: operations · Deterministic: ✅ · Path: `tasks/config_drift_remediation/`
- Core idea: compare desired vs. live configuration and output the exact remediation patch line.
- Skills stressed:
  - Differencing structured configs.
  - Isolating the single drifted setting under budget pressure.
  - Emitting a precise corrective change without modifying files.
- Why it matters: captures high-signal config drift investigations that production agents must handle cleanly.
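A minimal sketch of the differencing step, assuming both configurations have already been parsed into flat dictionaries. The function names and the `key=value` patch-line format are assumptions. The task's README defines the real output contract.

```python
_MISSING = object()  # sentinel so a key present on only one side counts as drift


def find_drift(desired, live):
    """Return sorted keys whose live value differs from the desired value."""
    keys = set(desired) | set(live)
    return sorted(
        key for key in keys
        if desired.get(key, _MISSING) != live.get(key, _MISSING)
    )


def remediation_line(desired, live):
    """Return one 'key=value' line restoring the single drifted setting."""
    drifted = find_drift(desired, live)
    if len(drifted) != 1:
        raise ValueError(f"expected exactly one drifted key, found {drifted}")
    key = drifted[0]
    if key not in desired:
        raise ValueError(f"{key!r} exists only in the live config")
    # Nothing is written to disk; the line is only returned to the caller.
    return f"{key}={desired[key]}"
```

For instance, with `desired = {"timeout": 30, "retries": 3}` and `live = {"timeout": 30, "retries": 5}`, `remediation_line(desired, live)` returns `"retries=3"`.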

- Suite: operations · Deterministic: ✅ · Path: `tasks/incident_recovery_chain/`
- Core idea: follow a deterministic recovery handoff chain to extract the final `RECOVERY_TOKEN`.
- Skills stressed:
  - Tracking sequential handoffs across incident notes.
  - Preserving context across multi-step recovery procedures.
  - Stopping once the authoritative token is located.
- Why it matters: models recovery runbooks where skipping a step yields bad remediation.

- Suite: operations · Deterministic: ✅ · Path: `tasks/log_stream_monitor/`
- Core idea: poll a seeded, paginated log stream across multiple pages, filter out `INFO`/`WARN` noise, and emit the `STREAM_CODE` embedded in the first `CRITICAL` entry.
- Skills stressed:
  - Cursor-based pagination without over-fetching.
  - Signal/noise discrimination across a multi-page stream.
  - Stopping immediately once the trigger condition is met.
- Why it matters: mirrors production monitoring loops where agents must watch a live stream, ignore routine events, and fire exactly once on a critical signal, without exhausting tool-call budgets on noise.
- Quick start: `agent-bench run pairing log_stream_monitor`
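The polling loop this task rewards can be sketched as below. The page shape (`entries`, `next_cursor`), the entry fields (`level`, `code`), and the `fetch_page` callable are hypothetical stand-ins for whatever action the task actually exposes.

```python
def find_stream_code(fetch_page, max_pages=10):
    """Walk pages by cursor and return the code of the first CRITICAL entry.

    `fetch_page(cursor)` is a hypothetical callable returning a dict with
    'entries' and an optional 'next_cursor'.
    """
    cursor = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        for entry in page["entries"]:
            if entry["level"] == "CRITICAL":
                # Stop immediately: no further pages are fetched.
                return entry["code"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break  # stream exhausted without a critical entry
    return None
```

The `max_pages` cap is a stand-in for the tool-call budget: the loop cannot spend more fetches than the caller allows, even if the stream never ends.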

- Suite: operations · Deterministic: ✅ · Path: `tasks/runbook_verifier/`
- Core idea: verify that every incident runbook phase executed in order and emit the `RUNBOOK_CHECKSUM` combining phase codes + ACK + handoff token.
- Skills stressed:
  - Stitching multiple artifacts (README, index, per-phase files, timeline, handoff) into a single deterministic output.
  - Maintaining strict ordering under limited tool-call budgets.
  - Detecting incomplete phase data before emitting results.
- Why it matters: models the real-world audit workflow where operators must prove each mitigation phase ran before handoff, with zero tolerance for missing steps.
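To make the ordering and completeness checks concrete, here is a hedged sketch. The function name, its arguments, and especially the hyphen-joined checksum format are assumptions. The task's own README defines how `RUNBOOK_CHECKSUM` is actually assembled.

```python
def runbook_checksum(expected_phases, timeline, phase_codes, ack, handoff_token, sep="-"):
    """Join phase codes, ACK, and handoff token after verifying the run.

    The join format is hypothetical; only the three ingredients come from
    the task description.
    """
    # Every phase must appear in the timeline, in the indexed order.
    if list(timeline) != list(expected_phases):
        raise ValueError("phases missing or out of order in timeline")
    # Refuse to emit anything while phase data is incomplete.
    missing = [phase for phase in expected_phases if not phase_codes.get(phase)]
    if missing:
        raise ValueError(f"missing phase codes: {missing}")
    if not ack or not handoff_token:
        raise ValueError("ACK and handoff token are both required")
    ordered_codes = [phase_codes[phase] for phase in expected_phases]
    return sep.join(ordered_codes + [ack, handoff_token])
```

Raising before any partial value is returned mirrors the zero-tolerance rule: an incomplete run produces an error, never a truncated checksum.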

- Suite: operations · Deterministic: ✅ · Path: `tasks/sandboxed_code_auditor/`
- Core idea: audit a sandbox runtime sample to locate a legacy bypass `ISSUE_ID` and analyzer `AUDIT_CODE`, then emit `ISSUE_ID|AUDIT_CODE` via `SANDBOX_AUDIT_TOKEN`.
- Skills stressed:
  - Reading scoped documentation to learn the audit output contract.
  - Inspecting source code and analyzer logs under a strict filesystem sandbox.
  - Combining multiple findings into a structured output while respecting budgets.
- Why it matters: sandbox regressions are high-risk; this scenario trains agents to follow deterministic audit steps, avoid unauthorized filesystem access, and report compliance findings in a repeatable format.

- Suite: security · Deterministic: ✅ · Path: `tasks/security_incident_triage/`
- Core idea: correlate IDS logs, analyst findings, and CSIRT notes to emit the confirmed `BREACH_TOKEN` instead of a noisy intermediate indicator.
- Skills stressed:
  - Separating noisy indicators from confirmed breach evidence.
  - Following escalation narratives documented across multiple files.
  - Emitting the precise token only after the final validation step.
- Why it matters: security incidents often involve conflicting telemetry; agents must validate the canonical breach artifact before triggering expensive responses.

- Suite: operations · Deterministic: ✅ · Path: `tasks/customer_support_escalation/`
- Core idea: synthesize ticket metadata, manager transcripts, and policy docs to emit the manager-confirmed `ESCALATION_CODE` without skipping checkpoints.
- Skills stressed:
  - Parsing structured ticket JSON to understand severity and routing.
  - Scanning multi-channel transcripts for the canonical confirmation line.
  - Verifying policy compliance before emitting the final code.
- Why it matters: escalation errors are costly; this task forces agents to respect escalation ladders and only act on validated manager approvals.

- Suite: operations · Deterministic: ✅ · Path: `tasks/multi_role_escalation/`
- Core idea: coordinate recon + executor roles to collect `ANALYST_TOKEN` and `MANAGER_TOKEN`, then apply `FINAL_FORMAT` before emitting `ESCALATION_CODE`.
- Skills stressed:
  - Managing multiple signal files and respecting their order.
  - Tracking multiple intermediate tokens simultaneously.
  - Combining tokens using a format template without leaking partial outputs.
- Why it matters: reflects real incident workflows where operators collect evidence from multiple participants before issuing a final escalation token; it is the primary Phase 6 multi-agent harness showcase.

Next steps: for full implementation details, open each task's README (kept alongside the code) or read `task_harness.md` for the harness contract.