Meta SSOT slice: L3 — Runtime acceptance substrate (minimal, headless_run only) #1176

@shaun0927

Meta SSOT slice: L3 — Runtime acceptance substrate (minimal)

Implementation cleanup (2026-05-23 KST). L3 is partially implemented: #1181 merged RuntimeEvidence + HeadlessRunProbe; #1190 merged AutoPipelineResult.runtime_probe_evidence plus completion-grade probe_runner. Remaining known cleanup: #1195 and any evidence-driven real probe binding/scenario additions. sim_trace, render_hash, and api_smoke remain deferred v2 ideas.

This issue is the L3 design slice of #1157. It opens the runtime-evidence substrate that makes PRODUCT_COMPLETE mean "the thing actually runs" rather than "tests pass".

Per the 2026-05-22 minimal-substrate audit (#1157 freshness sync), L3 ships headless_run only in v1. The earlier draft proposed a 4-probe-kind substrate (headless_run / sim_trace / render_hash / api_smoke); the latter three are deferred to evidence-driven follow-up PRs.

Why this is small

headless_run (subprocess invocation + stdout / exit_code / duration capture) is the broadest minimal probe — it can validate any artifact that runs as a command. The 4 canonical scenarios in #1157's acceptance matrix:

Scenario                | Needs headless_run?            | Needs other probe?
cli-todo                |                                |
webhook-receiver        | ✅ (starts server, runs smoke) | api_smoke optional
vertical-slice-refactor | ✅ (re-runs test suite)        |
2d-kart-racer           | ✅ (headless N-tick sim)       | sim_trace would refine

Three of four canonical scenarios are fully served by headless_run. The fourth (2d-kart-racer) benefits from sim_trace but can use headless_run as a v1 floor (does the binary start without crashing?). Shipping only headless_run in v1 is therefore sufficient for the SSOT acceptance gate.

Substrate honesty

L3 v1 introduces:

  • One new dataclass: RuntimeEvidence (text-based, JSON-serializable). Lives next to existing EvidenceRecord in src/ouroboros/orchestrator/evidence_schema.py or a sibling module.
  • One probe implementation: HeadlessRunProbe — invokes a documented command, captures stdout / stderr / exit_code / duration.
  • One binding lookup: given a TaskClass (L1-a, #1173), return the list of runtime_probe_kinds declared in TaskClassProfile. The catalog already declares which classes want which kinds.
  • One verifier integration: Track A fat-harness verifier (#920 / #978) receives RuntimeEvidence as a grade input alongside the existing unit-test evidence.
  • One envelope field: AutoPipelineResult.runtime_probe_evidence: tuple[RuntimeEvidence, ...].

No new EventStore event family, no new aggregate type, no projection migration. The change is purely additive to the existing evidence schema.
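The dataclass and probe described above could be sketched roughly as follows. This is a minimal illustration, not the merged implementation: the field names, the `passed` property (which assumes exit code 0 means PASS), and the `to_json` helper are all assumptions made for the example.

```python
import json
import subprocess
import sys
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RuntimeEvidence:
    """Text-only record of one probe run; every field is JSON-serializable."""

    probe_kind: str
    command: tuple[str, ...]
    exit_code: int
    stdout: str
    stderr: str
    duration_seconds: float

    @property
    def passed(self) -> bool:
        # Assumption: exit code 0 means PASS; the issue does not define this.
        return self.exit_code == 0

    def to_json(self) -> str:
        # asdict() keeps the command tuple; json.dumps emits it as a list.
        return json.dumps(asdict(self))


class HeadlessRunProbe:
    """Runs one documented command and captures its observable outcome."""

    kind = "headless_run"

    def __init__(self, command: list[str]) -> None:
        self.command = tuple(command)

    def run(self) -> RuntimeEvidence:
        started = time.monotonic()
        # check=False: a non-zero exit is evidence to record, not an exception.
        completed = subprocess.run(
            self.command, capture_output=True, text=True, check=False
        )
        return RuntimeEvidence(
            probe_kind=self.kind,
            command=self.command,
            exit_code=completed.returncode,
            stdout=completed.stdout,
            stderr=completed.stderr,
            duration_seconds=time.monotonic() - started,
        )


# Usage: probe the current interpreter with a trivial command.
evidence = HeadlessRunProbe([sys.executable, "-c", "print('ok')"]).run()
```

Keeping the record frozen and text-only means it can be attached to a result envelope or written to disk without any custom encoder.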

Sub-PR breakdown

  1. L3-1: src/ouroboros/orchestrator/runtime_evidence.py (new) or an extension of the existing evidence_schema.py. Ships the RuntimeEvidence dataclass, the HeadlessRunProbe runner, and the L1 binding helper. ~180 LoC + unit tests. Folds the original L3-a / L3-b / L3-c into one PR because the three artifacts share an import surface and splitting them creates dependency noise.
  2. L3-2: Track A verifier integration + envelope surface. The verifier reads RuntimeEvidence if present and includes the probe outcome in the grade input. AutoPipelineResult.runtime_probe_evidence is populated in _result(). ~120 LoC + integration tests. Touches src/ouroboros/orchestrator/. The Track A queue (#1165 / #1166 / #1168) has drained, so collision risk is now low.

Total target: ~300 LoC across 2 sub-PRs.
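The L1 binding helper that L3-1 ships might look something like the sketch below. The enum members, the catalog contents, and the function names are hypothetical; the real TaskClass and TaskClassProfile come from the L1-a catalog (#1173). Filtering declared kinds down to the ones v1 can run is also an assumption, not something the issue specifies.

```python
from dataclasses import dataclass
from enum import Enum


class TaskClass(Enum):
    # Hypothetical members; the real catalog is defined in L1-a (#1173).
    CLI = "cli"
    GAME = "game"


@dataclass(frozen=True)
class TaskClassProfile:
    task_class: TaskClass
    runtime_probe_kinds: tuple[str, ...] = ()


# Hypothetical catalog contents, for illustration only.
CATALOG = {
    TaskClass.CLI: TaskClassProfile(TaskClass.CLI, ("headless_run",)),
    TaskClass.GAME: TaskClassProfile(TaskClass.GAME, ("headless_run", "sim_trace")),
}

# v1 ships exactly one probe kind.
SUPPORTED_KINDS = frozenset({"headless_run"})


def declared_probe_kinds(task_class: TaskClass) -> tuple[str, ...]:
    """Return the probe kinds the catalog declares for a task class."""
    profile = CATALOG.get(task_class)
    return profile.runtime_probe_kinds if profile is not None else ()


def runnable_probe_kinds(task_class: TaskClass) -> tuple[str, ...]:
    """Assumed v1 behavior: skip declared kinds that have no probe yet."""
    return tuple(
        kind for kind in declared_probe_kinds(task_class) if kind in SUPPORTED_KINDS
    )
```

Separating "declared" from "runnable" would let the catalog keep listing deferred kinds such as sim_trace without the v1 runner failing on them.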

v2 expansion path (deferred — evidence-driven follow-ups)

Each item below opens as its own follow-up issue only when a canonical scenario demonstrates headless_run cannot cover it:

  • sim_trace probe kind: deterministic N-tick simulation + golden-state diff. Useful for game-class scenarios where exit_code is uninformative.
  • render_hash probe kind: screenshot / DOM hash / first-frame fingerprint. Useful for UI-class scenarios where the artifact has no command-line surface.
  • api_smoke probe kind: HTTP request → response shape match. Useful for web-service / webhook scenarios where stdout is empty by design.
  • Quality-gate JSON bundle: borrow OMX ultragoal's structured quality-gate.json shape ({aiSlopCleaner, verification, codeReview, runtimeProbe}) as a single bundled evidence container.

Open design questions (decide in L3-1 PR review)

  1. File placement: extend src/ouroboros/orchestrator/evidence_schema.py or create a sibling runtime_evidence.py? Recommendation: sibling module. It keeps the runtime-evidence surface independent of the in-flight Track A evidence schema and is easier to evolve.
  2. HeadlessRunProbe command source: does the probe read its command from TaskClassProfile, from a per-scenario hint, or from a separate config? Recommendation: per-TaskClassProfile default (e.g. library → "pytest", cli → user-supplied invocation in the scenario expected.yaml). Concrete shape pinned in L3-1 PR review.
  3. Subprocess sandbox: what sandbox/timeout policy applies to the headless run? Recommendation: reuse the existing state.runtime_backend invocation infrastructure if it exists, otherwise use subprocess.run with the scenario-declared wall_clock_budget_seconds as timeout=.

Acceptance criteria

  • L3-1: RuntimeEvidence dataclass + HeadlessRunProbe + L1 binding helper land with unit tests for each.
  • L3-1: A test exercises the full path TaskClass.CLI → runtime_probe_kinds lookup → HeadlessRunProbe.run() → RuntimeEvidence.
  • L3-2: Track A verifier produces a grade that includes RuntimeEvidence outcome (probe PASS strengthens grade, probe FAIL weakens grade — concrete weight tuned per existing verifier API).
  • L3-2: AutoPipelineResult.runtime_probe_evidence populated on every ooo auto run with at least one probe binding.
  • No regression to existing tests/unit/orchestrator/ evidence tests.
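The L3-2 grade criterion (PASS strengthens, FAIL weakens) can be illustrated with a small pure function. Everything here is hypothetical: the existing Track A verifier API is not shown in this issue, so the 0.0 to 1.0 grade scale, the `ProbeOutcome` type, and both weights are placeholders chosen only to make the direction of the adjustment concrete.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeOutcome:
    # Hypothetical stand-in for the fields of RuntimeEvidence the grader needs.
    probe_kind: str
    exit_code: int


def adjust_grade(
    base_grade: float,
    outcomes: Iterable[ProbeOutcome],
    pass_bonus: float = 0.1,    # placeholder weight, not a tuned value
    fail_penalty: float = 0.3,  # placeholder weight, not a tuned value
) -> float:
    """Nudge a grade up for each passing probe and down for each failing one."""
    grade = base_grade
    for outcome in outcomes:
        if outcome.exit_code == 0:
            grade += pass_bonus
        else:
            grade -= fail_penalty
    # Clamp to the assumed 0.0..1.0 grade scale.
    return max(0.0, min(1.0, grade))
```

With no probe outcomes the grade is returned unchanged, which matches the "reads RuntimeEvidence if present" behavior: runs without a probe binding grade exactly as before.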

Out of scope

  • Adding new probe kinds (sim_trace, render_hash, api_smoke). v2 expansion path above.
  • L0 canonical scenario fixtures consuming runtime evidence — those are fixture data, not substrate.
  • Bundle JSON shape (OMX ultragoal-style). Adopt only if v1 ships more than one probe kind.

Track A collision check

The original L3 design carried a Track A collision warning because verifier follow-ups #1165 / #1166 / #1168 were active. As of 2026-05-22, #1166 and #1168 are merged on main; #1165 is closed/superseded. Collision risk is now low. L3-2 should still rebase carefully on main before opening, to pick up any final Track A churn.
