[Feature] Define Run/Step/Artifact projections as the canonical harness vocabulary #946

@shaun0927

Problem

Ouroboros already has strong internal primitives (Seed, AC tree, EventStore, checkpoint, orchestrator session, evaluation results), but the public harness vocabulary is still too fragmented for AgentOS-level debugging and plugin integration. A maintainer or plugin author cannot consistently answer:

  • What is the user-visible execution unit?
  • Which model/tool/subagent action happened as one step?
  • Which artifacts were produced by that step?
  • Which acceptance criterion or workflow phase does a step belong to?
  • Which event stream entries prove the final verdict?

This makes the harness harder to inspect, replay, evaluate, and extend through ouroboros-plugins.

Why now

Ouroboros is moving toward a “thin skill, fat harness” model and a separate UserLevel plugin ecosystem. Before adding more plugins or an AgentOS UI, the core needs a stable projection vocabulary over existing events. Letta's Run / Step model shows why this matters: long-running agent behavior becomes understandable when each invocation and each model/tool pass has a first-class record.

This issue should be implemented before Run Capsules, eval-suite replicas, and harness inspector work, because those features should consume the same projection contract instead of inventing separate schemas.

User / persona

  • Maintainers debugging orchestrator/runtime behavior.
  • Plugin authors who need a stable contract for reporting work.
  • CI/evaluation jobs that need reproducible evidence.
  • Users asking “why did Ouroboros do this?” after a long run.

Current behavior

  • Events and sessions exist, but there is no single public RunRecord / StepRecord / ArtifactRecord projection contract.
  • Different subsystems can describe work using different names: session, execution, generation, phase, task, event, AC result.
  • Plugins have no stable target object to attach produced artifacts, permission use, or verification evidence to.

Desired behavior

Introduce a minimal, event-sourced Run / Step / Artifact projection layer as the canonical harness vocabulary.

Definitions:

  • RunRecord: one user-goal or Seed execution envelope.
  • StageRecord: a named harness phase such as interview, seed, execute, evaluate, evolve, plugin.
  • StepRecord: one bounded unit of work such as model call, tool call, shell command, subagent dispatch, plugin command, or evaluation check.
  • ArtifactRecord: a file, structured result, patch, verdict, log excerpt, run capsule, or evidence object produced by a step.
  • VerdictRecord: run-level or AC-level result with evidence links.

These records must be projections over EventStore / existing state, not a replacement for the event journal.
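
As a rough illustration of the definitions above, the records could be modeled as small frozen dataclasses. This is only a sketch, not a proposed final schema: every field other than run_id, stage_id, step_id, artifact_id, and event_ids (for example kind, inferred, seed_ref, uri, schema_version) is an assumption made for illustration.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ArtifactRecord:
    artifact_id: str
    step_id: str            # the step that produced this artifact
    kind: str               # hypothetical values: "file", "patch", "verdict", ...
    uri: str | None = None  # hypothetical locator for the artifact


@dataclass(frozen=True)
class StepRecord:
    step_id: str
    stage_id: str
    kind: str               # hypothetical values: "model_call", "tool_call", ...
    event_ids: tuple[str, ...] = ()
    inferred: bool = False  # hypothetical flag for steps rebuilt from legacy events


@dataclass(frozen=True)
class StageRecord:
    stage_id: str
    run_id: str
    name: str               # e.g. "interview", "seed", "execute", "evaluate"


@dataclass(frozen=True)
class VerdictRecord:
    run_id: str
    passed: bool
    ac_id: str | None = None  # None is used here to mean a run-level verdict
    evidence_event_ids: tuple[str, ...] = ()


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    seed_ref: str | None = None
    stages: tuple[StageRecord, ...] = ()
    steps: tuple[StepRecord, ...] = ()
    artifacts: tuple[ArtifactRecord, ...] = ()
    verdicts: tuple[VerdictRecord, ...] = ()
    schema_version: int = 1


# Build one small run and round-trip it through JSON.
run = RunRecord(
    run_id="run-1",
    seed_ref="seed-abc",
    stages=(StageRecord("stage-1", "run-1", "execute"),),
    steps=(StepRecord("step-1", "stage-1", "tool_call", event_ids=("evt-7",)),),
    artifacts=(ArtifactRecord("art-1", "step-1", "patch", uri="out/change.diff"),),
    verdicts=(VerdictRecord("run-1", passed=True, evidence_event_ids=("evt-9",)),),
)
payload = json.loads(json.dumps(asdict(run)))
```

Frozen dataclasses are used here because a projection is a read-only view; the event journal stays the only place where state is written.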

Proposed solution

  1. Add Pydantic/dataclass models under a harness/projection namespace, for example:
    • src/ouroboros/harness/projection.py
    • or another name aligned with existing architecture terminology.
  2. Add a projection builder that can construct records from existing EventStore/session data.
  3. Include stable IDs:
    • run_id
    • stage_id
    • step_id
    • artifact_id
    • event_ids[]
  4. Add minimal CLI/MCP query surface:
    • ouroboros status run <run_id> --json, or
    • extend existing status/query handlers if that is the established path.
  5. Preserve compatibility with existing event names and docs.
  6. Document mapping from existing terms to the new projection vocabulary.
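
The builder in step 2 could be sketched as a pure fold over events. The event shape below (dicts with id, run_id, type, stage, step_id, and artifact keys) is hypothetical and does not reflect the actual EventStore schema; project_run and the inferred flag are likewise illustrative names. IDs are derived deterministically from event data, so projecting the same events twice yields the same records.

```python
from __future__ import annotations


def project_run(run_id: str, events: list[dict]) -> dict:
    """Fold a flat event list into stage/step/artifact records for one run."""
    stages: dict[str, dict] = {}
    steps: dict[str, dict] = {}
    artifacts: list[dict] = []

    for ev in events:
        if ev.get("run_id") != run_id:
            continue  # ignore events that belong to other runs

        stage_name = ev.get("stage", "unknown")
        stage_id = f"{run_id}:{stage_name}"
        stages.setdefault(stage_id, {"stage_id": stage_id, "name": stage_name})

        # Legacy events may lack a step_id; give them a one-event inferred step.
        raw_step = ev.get("step_id")
        inferred = raw_step is None
        step_id = f"{run_id}:{raw_step if not inferred else 'inferred-' + ev['id']}"

        step = steps.setdefault(
            step_id,
            {
                "step_id": step_id,
                "stage_id": stage_id,
                "kind": ev.get("type", "unknown"),
                "event_ids": [],
                "inferred": inferred,
            },
        )
        step["event_ids"].append(ev["id"])

        if "artifact" in ev:
            artifacts.append(
                {
                    "artifact_id": f"{ev['id']}:artifact",
                    "step_id": step_id,
                    "path": ev["artifact"],
                }
            )

    return {
        "run_id": run_id,
        "stages": list(stages.values()),
        "steps": list(steps.values()),
        "artifacts": artifacts,
    }


events = [
    {"id": "e1", "run_id": "r1", "type": "model_call", "stage": "execute", "step_id": "s1"},
    {"id": "e2", "run_id": "r1", "type": "model_call", "stage": "execute", "step_id": "s1",
     "artifact": "out/patch.diff"},
    {"id": "e3", "run_id": "r1", "type": "evaluation_check", "stage": "evaluate"},  # legacy
    {"id": "e4", "run_id": "r2", "type": "model_call", "stage": "execute", "step_id": "s9"},
]
projection = project_run("r1", events)
```

In this example the two events sharing step_id s1 collapse into one step with two source events, the legacy event e3 becomes an inferred step, and the r2 event is ignored.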

Repository direction fit

This is core harness work, not a new user workflow. It preserves the thin-skill model because skills can continue to be small entrypoints while the harness owns execution evidence, projection, and replay vocabulary. It also prepares ouroboros-plugins to attach work to stable steps instead of inventing plugin-local status formats.

Dependency / sequencing

This should be the first issue in the sequence. Run Capsule, isolated eval suites, plugin audit, context inspection, and Harness Inspector should consume this projection rather than defining competing models.

Constraints

  • Do not replace EventStore.
  • Do not require a new server process.
  • Do not make web UI or ADE a dependency.
  • Keep schemas small and append-only where possible.
  • Projection must tolerate missing legacy events.
  • Must preserve local-first operation.
  • Must not move domain/plugin workflows into core.

Non-goals

  • No full Letta-style agent server.
  • No full database migration unless absolutely needed.
  • No self-editing agent memory.
  • No plugin marketplace implementation.
  • No visual inspector UI in this issue.

Implementation decisions required before coding

  • Name and module boundary: harness, observability, persistence, or orchestrator projection namespace.
  • Whether RunRecord corresponds to current orchestrator session ID, execution ID, Seed ID, or a new generated ID with backreferences.
  • Minimal required fields for v1 records.
  • How to map old events that do not contain enough metadata.
  • Whether projection is computed on demand only or cached as a checkpoint.
  • JSON schema versioning strategy.
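
To make the versioning question concrete, one possible option (not a decision) is a schema_version string where readers accept any minor version of a major version they know and ignore unknown fields. The field name, the "major.minor" format, the load_projection helper, and the default for missing versions are all assumptions in this sketch.

```python
import json

SUPPORTED_MAJOR = 1  # hypothetical: the major version this reader understands


def load_projection(raw: str) -> dict:
    """Parse projection JSON, rejecting only unknown major versions."""
    data = json.loads(raw)
    # Assumption: output without a version field is treated as "1.0".
    version = str(data.get("schema_version", "1.0"))
    major = int(version.split(".")[0])
    if major != SUPPORTED_MAJOR:
        raise ValueError(f"unsupported projection schema version: {version}")
    return data  # unknown extra fields are kept, which suits append-only schemas


record = load_projection('{"schema_version": "1.2", "run_id": "r1", "future_field": 1}')
```

This pairs naturally with the "small and append-only" constraint: adding a field bumps the minor version, and only a breaking change needs a new major version.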

Acceptance criteria

  • A documented RunRecord / StageRecord / StepRecord / ArtifactRecord schema exists.
  • A projection builder can reconstruct these records from a normal Ouroboros run's persisted state/events.
  • Every projected StepRecord links to one or more source event IDs or explicitly marks itself as legacy/inferred.
  • Artifacts can be attached to steps without plugin-specific code paths.
  • Projection output is available through a machine-readable CLI or MCP query path.
  • Existing tests for event persistence and status continue passing.
  • At least one fixture demonstrates projection for an execution with mechanical evaluation.
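
The provenance and artifact-attachment criteria above lend themselves to a mechanical check. The following is a minimal sketch, assuming projection output is a dict with steps and artifacts lists; the inferred key and the provenance_violations helper are hypothetical names, not part of the existing codebase.

```python
from __future__ import annotations


def provenance_violations(projection: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks pass."""
    problems: list[str] = []
    step_ids: set[str] = set()

    for step in projection.get("steps", []):
        step_ids.add(step["step_id"])
        # A step needs source events unless it is explicitly marked inferred.
        if not step.get("event_ids") and not step.get("inferred", False):
            problems.append(
                f"step {step['step_id']} has no source events and is not marked inferred"
            )

    for artifact in projection.get("artifacts", []):
        # Every artifact must point at a step that exists in the projection.
        if artifact.get("step_id") not in step_ids:
            problems.append(
                f"artifact {artifact['artifact_id']} is not attached to a known step"
            )

    return problems


good = {
    "steps": [
        {"step_id": "s1", "event_ids": ["e1"]},
        {"step_id": "s2", "event_ids": [], "inferred": True},
    ],
    "artifacts": [{"artifact_id": "a1", "step_id": "s1"}],
}
bad = {
    "steps": [{"step_id": "s3", "event_ids": []}],
    "artifacts": [{"artifact_id": "a2", "step_id": "missing"}],
}
good_result = provenance_violations(good)
bad_result = provenance_violations(bad)
```

A check like this could run against the fixture projection in the test suite, so a regression in event linkage fails fast instead of surfacing during debugging.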

Ouroboros real-run verification items after implementation

Run these after code implementation and before merge:

uv run pytest tests/ -q
uv run pytest tests/test_*projection* tests/test_*event* tests/test_*status* -q

# Create or reuse a minimal seed fixture that performs a harmless local task.
uv run ouroboros run tests/fixtures/seeds/minimal-local.yaml --runtime codex

# The command name may follow the final CLI decision, but it must return JSON projection data.
uv run ouroboros status run <RUN_ID> --json | jq '.run_id, .stages, .steps, .artifacts'

Manual verification using Ouroboros itself:

  • Start a normal ooo run / ouroboros run flow.
  • Confirm the final status can answer: goal, Seed, stage sequence, step sequence, produced artifacts, final verdict, and source events.
  • Confirm no plugin or skill prompt has to parse raw logs to reconstruct these facts.

Checklist

  • I searched existing issues and discussions first.
  • I explained the problem, not just the solution.
  • I included clear scope boundaries and non-goals.
  • I listed concrete acceptance criteria.

Metadata

    Labels

    needs-design: Multi-PR epic or architectural change, needs human planning
    tier-2-unblocked: Post-wiring Tier 2 work (agentos-substrate-wiring is closed; actionable with #961 sequencing)
