| description | TraceCore core primitive (deterministic episode runtime) |
|---|---|
TraceCore’s invariant core is a Deterministic Episode Runtime: a bounded runtime that executes agent-environment interaction with fixed inputs and emits replayable traces plus a structured verdict.
This is the stable nucleus that can power multiple futures (test framework, runtime platform, protocol/standard) without changing the primitive.
A Deterministic Episode Runtime executes:
Agent + Environment + Seed + Budgets (+ Harness version + Task version)
and produces:
Deterministic interaction trace + Structured termination outcome + Replayable artifact.
The core is:
- A controlled interaction container for agent behavior under constraints.
- A deterministic execution model with reproducible outcomes.
- An artifact-first diagnostic layer for CI and regressions.

The core is not:
- A leaderboard.
- An LLM-as-judge scoring framework.
- A broad intelligence benchmark.
- A hosted product requirement.
Those can be built on top. They are not the core.
An episode is the smallest valid execution unit. It requires the following inputs:
- Agent implementation
  - Must satisfy the reset/observe/act interface.
- Task/environment version
  - Closed-world, deterministic setup and validator.
- Seed
  - Explicit seed used for deterministic setup and execution.
- Budgets
  - `steps`
  - `tool_calls`
  - optional wall-clock timeout
- Runtime identity
  - Harness version and task version included in artifacts.
Reference contracts:
- Agent API: `docs/agent_interface.md`
- Task harness + determinism rules: `docs/task_harness.md`
- Artifact schema envelope: `docs/trace_artifacts.md`
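
As an illustration only, the episode inputs listed above could be modeled as immutable records. The method signatures on `Agent`, the `timeout_s` field name, and the `EpisodeInputs` and `Budgets` class names are assumptions made for this sketch; the authoritative contracts are the referenced documents.

```python
from dataclasses import dataclass
from typing import Any, Optional, Protocol


class Agent(Protocol):
    """Assumed shape of the reset/observe/act interface (signatures are guesses)."""

    def reset(self, task_spec: dict) -> None: ...
    def observe(self, observation: Any) -> None: ...
    def act(self) -> Any: ...


@dataclass(frozen=True)
class Budgets:
    steps: int
    tool_calls: int
    timeout_s: Optional[float] = None  # optional wall-clock timeout (name assumed)


@dataclass(frozen=True)
class EpisodeInputs:
    agent: Agent
    task_ref: str         # task id plus version
    seed: int             # explicit seed for setup and execution
    budgets: Budgets
    harness_version: str  # runtime identity, recorded in artifacts
```

Freezing the records is a design choice for the sketch: inputs that cannot be mutated mid-run are easier to reason about when the same episode is replayed later.
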
The runtime loop is discrete and bounded:
- Setup environment from task + seed.
- Reset agent with task spec.
- Repeat observe -> act -> execute -> validate while budgets remain.
- Terminate with structured reason.
- Persist run artifact for replay/comparison.
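
The loop above can be sketched in a few lines. This is a toy under stated assumptions, not the TraceCore runner: `CountdownEnv`, `DecrementAgent`, `run_episode`, and the environment method names are hypothetical, only the `steps` budget is modeled, and leaving `termination_reason` empty on success is a guess.

```python
import random


class CountdownEnv:
    """Toy closed-world environment: reach exactly zero from a seeded start."""

    def setup(self, seed):
        # A private, seeded RNG keeps setup repeatable across runs.
        self.value = random.Random(seed).randint(3, 6)

    def observation(self):
        return self.value

    def execute(self, action):
        self.value -= action

    def validate(self):
        if self.value == 0:
            return {"ok": True, "terminal": True}
        if self.value < 0:
            return {"ok": False, "terminal": True}
        return {"ok": False, "terminal": False}


class DecrementAgent:
    """Toy agent that always subtracts one."""

    def reset(self, task_spec):
        self.last = None

    def observe(self, observation):
        self.last = observation

    def act(self):
        return 1


def run_episode(agent, env, seed, max_steps):
    env.setup(seed)                     # setup environment from task + seed
    agent.reset({"task": "countdown"})  # reset agent with task spec
    trace, steps = [], 0
    success, reason = False, "steps_exhausted"
    while steps < max_steps:            # repeat while budgets remain
        obs = env.observation()
        agent.observe(obs)
        action = agent.act()
        env.execute(action)
        verdict = env.validate()
        steps += 1
        trace.append({"step": steps, "observation": obs, "action": action})
        if verdict["terminal"]:
            success = verdict["ok"]
            # Assumption: no termination_reason is recorded on success.
            reason = None if success else "logic_failure"
            break
    # Terminate with a structured reason; the caller persists this artifact.
    return {
        "seed": seed,
        "success": success,
        "termination_reason": reason,
        "steps_used": steps,
        "action_trace": trace,
    }
```

Calling `run_episode(DecrementAgent(), CountdownEnv(), seed=7, max_steps=2)` ends with `termination_reason` set to `steps_exhausted`, because the seeded start value is at least 3 and the agent removes one per step.
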
TraceCore separates the exact stop condition from the analysis bucket:
- `termination_reason`: the precise termination event from the runtime.
- `failure_type`: a normalized category for filtering, dashboards, and CI policy gates.

`failure_type` takes one of these values:
- `budget_exhausted`
- `invalid_action`
- `sandbox_violation`
- `logic_failure`
- `timeout`
- `non_termination` (reserved for future use; not emitted by the current runner)
Typical runtime termination reasons map as follows:
- `steps_exhausted` -> `budget_exhausted`
- `tool_calls_exhausted` -> `budget_exhausted`
- `invalid_action` -> `invalid_action`
- `action_exception` -> `invalid_action`
- `sandbox_violation` -> `sandbox_violation`
- `timeout` -> `timeout`
- `logic_failure` -> `logic_failure`
- `non_termination` -> `non_termination`

Terminal validator failures (`{"ok": false, "terminal": true}`) emit `termination_reason=logic_failure` unless an explicit override is provided.
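
The mapping and the validator rule above translate directly into a lookup table. The names `FAILURE_TYPE_BY_REASON`, `classify`, and `reason_for_validator_result` are hypothetical, and the `override` parameter is only one guess at how an explicit override might be passed.

```python
# Termination reason -> failure type, copied from the mapping above.
FAILURE_TYPE_BY_REASON = {
    "steps_exhausted": "budget_exhausted",
    "tool_calls_exhausted": "budget_exhausted",
    "invalid_action": "invalid_action",
    "action_exception": "invalid_action",
    "sandbox_violation": "sandbox_violation",
    "timeout": "timeout",
    "logic_failure": "logic_failure",
    "non_termination": "non_termination",  # reserved; not emitted today
}


def classify(termination_reason):
    """Return the normalized failure_type; raises KeyError for unknown reasons."""
    return FAILURE_TYPE_BY_REASON[termination_reason]


def reason_for_validator_result(result, override=None):
    """Return a termination reason for a terminal validator failure, else None."""
    if result.get("ok") is False and result.get("terminal") is True:
        return override or "logic_failure"
    return None
```

Raising on an unknown reason, rather than falling back to a default bucket, keeps a new runtime event from being silently misfiled in dashboards.
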
Replay is a first-class property, not a convenience feature.
Given the same:
- task id/version,
- agent implementation,
- seed,
- budgets,
- and compatible harness/task contracts,
the runtime must produce reproducible outcomes with a stable trace envelope, or a diff that is explicit and inspectable.
If an episode cannot be replayed deterministically, it is not a reliable infrastructure primitive; it is only a demo.
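
One way to check the replay property is to compare two run artifacts field by field and surface any difference explicitly. The sketch below assumes artifacts are JSON-serializable dicts and that `run_id` and `trace_id` legitimately differ between runs; both points, and the helper names `trace_digest` and `diff_artifacts`, are assumptions rather than TraceCore behavior.

```python
import hashlib
import json

# Assumption: these identity fields vary per run and are excluded from comparison.
PER_RUN_FIELDS = ("run_id", "trace_id")


def trace_digest(artifact, ignore=PER_RUN_FIELDS):
    """Hash the artifact in canonical JSON form, skipping per-run fields."""
    stable = {k: v for k, v in artifact.items() if k not in ignore}
    payload = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def diff_artifacts(a, b, ignore=PER_RUN_FIELDS):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    keys = sorted((set(a) | set(b)) - set(ignore))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Equal digests mean the two runs agree on every compared field; when they differ, `diff_artifacts` gives the explicit, inspectable diff the replay contract asks for.
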
Every episode must emit a machine-readable artifact suitable for automation and audit. Core fields include:
- identity (`run_id`, `trace_id`, `task_ref`, `agent`, `harness_version`)
- control inputs (`seed`)
- outcome (`success`, `termination_reason`, `failure_type`, `failure_reason`)
- bounded usage (`steps_used`, `tool_calls_used`)
- full `action_trace`
Additive schema evolution is acceptable; breaking schema changes require versioning and release notes.
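
For orientation, the core fields listed above could be carried by a record like the following. The field names come from the list; the types, the `RunArtifact` class name, and the JSON layout are assumptions, and the actual envelope is defined in `docs/trace_artifacts.md`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class RunArtifact:
    # identity
    run_id: str
    trace_id: str
    task_ref: str
    agent: str
    harness_version: str
    # control inputs
    seed: int
    # outcome (types assumed; failure fields left empty on success)
    success: bool
    termination_reason: Optional[str]
    failure_type: Optional[str]
    failure_reason: Optional[str]
    # bounded usage
    steps_used: int
    tool_calls_used: int
    # full trace, one entry per step
    action_trace: List[Dict[str, Any]] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so identical runs produce identical text."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)
```

Under this shape, an additive schema change would be a new field with a default value, which older readers can ignore.
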
Defining this primitive cleanly avoids early identity lock-in and preserves optionality:
- Want pytest-for-agents? Wrap episodes in test runners.
- Want runtime packaging? Package environments around episode contracts.
- Want a standard/protocol? Publish this spec as the interoperable core.
All three paths depend on the same deterministic episode runtime.
This core gives teams:
- Regression detection with stable seeds and baseline compare workflows.
- Actionable failures via structured taxonomy and full trace context.
- CI-native gating using deterministic pass/fail and policy thresholds.
- Auditable evidence through persisted run artifacts and replayability.
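
As a rough illustration of CI gating over run artifacts, the function below applies two example policies: a minimum success rate and a list of failure types that always block. The `gate` function, its parameter names, and the default thresholds are hypothetical and not part of TraceCore.

```python
from collections import Counter


def gate(artifacts, min_success_rate=1.0, blocked_failure_types=("sandbox_violation",)):
    """Return (passed, reasons) for a batch of run-artifact dicts."""
    total = len(artifacts)
    passed = sum(1 for a in artifacts if a["success"])
    failures = Counter(a["failure_type"] for a in artifacts if not a["success"])

    reasons = []
    if total == 0:
        reasons.append("no artifacts to evaluate")
    elif passed / total < min_success_rate:
        reasons.append(f"success rate {passed}/{total} is below {min_success_rate:.2f}")
    for failure_type in blocked_failure_types:
        if failures[failure_type]:
            reasons.append(f"{failures[failure_type]} run(s) failed with {failure_type}")
    return (not reasons, reasons)
```

A CI job could exit non-zero whenever the first element is `False` and print the reasons, so the verdict stays readable without opening the traces.
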
If pytest tests functions, TraceCore executes deterministic episodes.
If Docker packages containers, TraceCore packages bounded agent-environment interactions.