Meta SSOT slice: L3 — Runtime acceptance substrate (minimal)
Implementation cleanup (2026-05-23 KST). L3 is partially implemented: #1181 merged RuntimeEvidence + HeadlessRunProbe; #1190 merged AutoPipelineResult.runtime_probe_evidence plus completion-grade probe_runner. Remaining known cleanup: #1195 and any evidence-driven real probe binding/scenario additions. sim_trace, render_hash, and api_smoke remain deferred v2 ideas.
This issue is the L3 design slice of #1157. It opens the runtime-evidence substrate that makes PRODUCT_COMPLETE mean "the thing actually runs" rather than "tests pass".
Per the 2026-05-22 minimal-substrate audit (#1157 freshness sync), L3 ships headless_run only in v1. The earlier draft proposed a 4-probe-kind substrate (headless_run / sim_trace / render_hash / api_smoke); the latter three are deferred to evidence-driven follow-up PRs.
Why this is small
headless_run (subprocess invocation + stdout / exit_code / duration capture) is the broadest minimal probe — it can validate any artifact that runs as a command. The 4 canonical scenarios in #1157's acceptance matrix:
| Scenario | Needs headless_run? | Needs other probe? |
| --- | --- | --- |
| cli-todo | ✅ | none |
| webhook-receiver | ✅ (starts server, runs smoke) | api_smoke optional |
| vertical-slice-refactor | ✅ (re-runs test suite) | none |
| 2d-kart-racer | ✅ (headless N-tick sim) | sim_trace would refine |
Three of four canonical scenarios are fully served by headless_run. The fourth (2d-kart-racer) benefits from sim_trace but can use headless_run as a v1 floor (does the binary start without crashing?). Therefore, shipping headless_run alone in v1 is sufficient for the SSOT acceptance gate.
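As a rough illustration of what a headless_run capture involves, the sketch below runs a command with the standard library and records stdout, stderr, exit code, and duration. The `HeadlessRunResult` record and the `headless_run` function are hypothetical names chosen for this example; they are not the merged HeadlessRunProbe API.

```python
from __future__ import annotations

import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadlessRunResult:
    """Hypothetical capture record; field names are illustrative only."""

    command: tuple[str, ...]
    exit_code: int
    stdout: str
    stderr: str
    duration_seconds: float


def headless_run(command: list[str]) -> HeadlessRunResult:
    """Run a command without a terminal and capture what it produced."""
    started = time.monotonic()
    # check=False: a non-zero exit is evidence to record, not an exception.
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return HeadlessRunResult(
        command=tuple(command),
        exit_code=completed.returncode,
        stdout=completed.stdout,
        stderr=completed.stderr,
        duration_seconds=time.monotonic() - started,
    )


# The v1 floor for any scenario: does the artifact start and exit cleanly?
result = headless_run([sys.executable, "-c", "print('ok')"])
```

For the 2d-kart-racer case, the same call would only answer whether the binary starts and exits without crashing, which is the floor described above.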
Substrate honesty
L3 v1 introduces:
One new dataclass: RuntimeEvidence (text-based, JSON-serializable). Lives next to existing EvidenceRecord in src/ouroboros/orchestrator/evidence_schema.py or a sibling module.
One probe implementation: HeadlessRunProbe — invokes a documented command, captures stdout / stderr / exit_code / duration.
One binding lookup: given a TaskClass (L1-a, "feat(auto): task-class catalog data (L1-a)" #1173), return the list of runtime_probe_kinds declared in TaskClassProfile. The catalog already declares which classes want which kinds.
One verifier integration: Track A fat-harness verifier (#920 / #978) receives RuntimeEvidence as a grade input alongside the existing unit-test evidence.
One envelope field: AutoPipelineResult.runtime_probe_evidence: tuple[RuntimeEvidence, ...].
No new EventStore event family, no new aggregate type, no projection migration. Purely additive to the existing evidence schema.
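A minimal sketch of how the dataclass and the binding lookup could fit together is shown below. Everything except the names RuntimeEvidence, TaskClass, and runtime_probe_kinds is an assumption: the field set, the `to_json` / `from_json` helpers, the `TaskClass.LIBRARY` member, and the `_PROBE_KINDS` table (a stand-in for TaskClassProfile) are hypothetical and may differ from what #1181 merged.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from enum import Enum


class TaskClass(Enum):
    # Hypothetical members; the real catalog lives in the L1-a data (#1173).
    CLI = "cli"
    LIBRARY = "library"


@dataclass(frozen=True)
class RuntimeEvidence:
    """Text-only evidence record, so it serializes to JSON without custom encoders."""

    probe_kind: str
    command: tuple[str, ...]
    exit_code: int
    stdout: str
    stderr: str
    duration_seconds: float

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "RuntimeEvidence":
        data = json.loads(payload)
        data["command"] = tuple(data["command"])  # JSON arrays decode as lists
        return cls(**data)


# Stand-in for TaskClassProfile.runtime_probe_kinds (assumed shape).
_PROBE_KINDS: dict[TaskClass, tuple[str, ...]] = {
    TaskClass.CLI: ("headless_run",),
    TaskClass.LIBRARY: ("headless_run",),
}


def runtime_probe_kinds_for(task_class: TaskClass) -> tuple[str, ...]:
    """Binding lookup: which probe kinds does this task class declare?"""
    return _PROBE_KINDS.get(task_class, ())


evidence = RuntimeEvidence(
    probe_kind="headless_run",
    command=("todo", "list"),
    exit_code=0,
    stdout="1. buy milk\n",
    stderr="",
    duration_seconds=0.12,
)
# Envelope field shape from the list above: a tuple of evidence records.
runtime_probe_evidence: tuple[RuntimeEvidence, ...] = (evidence,)
```

Keeping every field a string, number, or tuple of strings is what makes the record additive: it can be stored alongside existing evidence without a new event family or projection.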
Sub-PR breakdown
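L3-1: src/ouroboros/orchestrator/runtime_evidence.py (new) or extension of existing evidence_schema.py. Ships RuntimeEvidence dataclass + HeadlessRunProbe runner + L1 binding helper. ~180 LoC + unit tests. Folds the original L3-a / L3-b / L3-c into one PR because the three artifacts share an import surface and splitting them creates dependency noise.
L3-2: verifier integration. The Track A verifier takes RuntimeEvidence if present and includes the probe outcome in the grade input. AutoPipelineResult.runtime_probe_evidence is populated in _result(). ~120 LoC + integration tests. Touches src/ouroboros/orchestrator/; the Track A queue (#1165 / #1166 / #1168) has drained, so collision risk is now low.
Total target: ~300 LoC across 2 sub-PRs.
v2 expansion path (deferred; evidence-driven follow-ups)
Each item below opens as its own follow-up issue only when a canonical scenario demonstrates headless_run cannot cover it: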
sim_trace probe kind: deterministic N-tick simulation + golden-state diff. Useful for game-class scenarios where exit_code is uninformative.
render_hash probe kind: screenshot / DOM hash / first-frame fingerprint. Useful for UI-class scenarios where the artifact has no command-line surface.
api_smoke probe kind: HTTP request → response shape match. Useful for web-service / webhook scenarios where stdout is empty by design.
Quality-gate JSON bundle: borrow OMX ultragoal's structured quality-gate.json shape ({aiSlopCleaner, verification, codeReview, runtimeProbe}) as a single bundled evidence container.
Open design questions (decide in L3-1 PR review)
File placement: extend src/ouroboros/orchestrator/evidence_schema.py or create a sibling runtime_evidence.py? Recommendation: a sibling module, which keeps the runtime-evidence surface independent of the in-flight Track A evidence schema and is easier to evolve.
HeadlessRunProbe command source — does the probe read its command from TaskClassProfile, from a per-scenario hint, or from a separate config? Recommendation: per-TaskClassProfile default (e.g. library → "pytest", cli → user-supplied invocation in the scenario expected.yaml). Concrete shape pinned in L3-1 PR review.
Subprocess sandbox: what sandbox and timeout policy applies to the headless run? Recommendation: reuse the existing state.runtime_backend invocation infrastructure if it exists; otherwise use subprocess.run with the scenario-declared wall_clock_budget_seconds as timeout=.
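The command-source and timeout recommendations above can be sketched as follows, assuming the fallback path (plain subprocess.run rather than state.runtime_backend). The `DEFAULT_COMMANDS` table, `resolve_command`, and `run_with_budget` are hypothetical names, and the returned dict is only a placeholder for whatever shape the L3-1 review pins down.

```python
from __future__ import annotations

import subprocess
import sys

# Hypothetical per-class defaults; the issue only names library -> "pytest".
DEFAULT_COMMANDS: dict[str, list[str]] = {"library": ["pytest"]}


def resolve_command(task_class: str, scenario_command: list[str] | None) -> list[str]:
    """Prefer the scenario-supplied invocation, else the per-class default."""
    if scenario_command:
        return list(scenario_command)
    if task_class in DEFAULT_COMMANDS:
        return list(DEFAULT_COMMANDS[task_class])
    raise LookupError(f"no headless_run command for task class {task_class!r}")


def _as_text(stream: str | bytes | None) -> str:
    # On timeout the partial output may be None or bytes, so normalize it.
    if stream is None:
        return ""
    if isinstance(stream, bytes):
        return stream.decode(errors="replace")
    return stream


def run_with_budget(command: list[str], wall_clock_budget_seconds: float) -> dict:
    """Run a command, treating the wall-clock budget as a hard timeout."""
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=wall_clock_budget_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return {"timed_out": True, "exit_code": None, "stdout": _as_text(exc.stdout)}
    return {"timed_out": False, "exit_code": completed.returncode, "stdout": completed.stdout}


fast = run_with_budget([sys.executable, "-c", "print('hi')"], 30.0)
slow = run_with_budget([sys.executable, "-c", "import time; time.sleep(10)"], 0.5)
```

Recording a timeout as its own outcome, rather than folding it into a non-zero exit code, keeps "ran too long" distinguishable from "crashed" when the evidence is graded later.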
Acceptance criteria
L3-1: RuntimeEvidence dataclass + HeadlessRunProbe + L1 binding helper land with unit tests for each.
L3-1: A test exercises the full path TaskClass.CLI → runtime_probe_kinds lookup → HeadlessRunProbe.run() → RuntimeEvidence.
L3-2: Track A verifier produces a grade that includes RuntimeEvidence outcome (probe PASS strengthens grade, probe FAIL weakens grade — concrete weight tuned per existing verifier API).
L3-2: AutoPipelineResult.runtime_probe_evidence populated on every ooo auto run with at least one probe binding.
No regression to existing tests/unit/orchestrator/ evidence tests.
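To make the L3-2 grading criterion concrete, here is one possible way a probe outcome could move a grade. It assumes the grade is a float in [0, 1] and uses a fixed, made-up weight; the real Track A verifier API may represent grades quite differently, so `adjust_grade` is purely illustrative.

```python
from __future__ import annotations

from collections.abc import Sequence


def adjust_grade(
    base_grade: float,
    probe_exit_codes: Sequence[int],
    weight: float = 0.2,  # assumed value; the issue leaves the weight to be tuned
) -> float:
    """Strengthen the grade when all probes pass, weaken it when any fails."""
    if not probe_exit_codes:
        # No probe binding for this task class: leave the grade untouched.
        return base_grade
    all_passed = all(code == 0 for code in probe_exit_codes)
    delta = weight if all_passed else -weight
    # Clamp so the adjusted grade stays inside the assumed [0, 1] range.
    return min(1.0, max(0.0, base_grade + delta))
```

Treating an empty evidence tuple as neutral matches the criterion that runtime_probe_evidence is only populated when at least one probe binding exists.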
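Out of scope
Probe kinds other than headless_run (sim_trace, render_hash, api_smoke). These are deferred to the v2 follow-ups listed above.
L0 canonical scenario fixtures consuming runtime evidence: those are fixture data, not substrate.
Bundle JSON shape (OMX ultragoal-style). Adopt only if v1 ships more than one probe kind.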
Track A collision check
The original L3 design carried a Track A collision warning because verifier follow-ups #1165 / #1166 / #1168 were active. As of 2026-05-22, #1166 and #1168 are merged on main; #1165 is closed/superseded. Collision risk now low. L3-2 should still rebase carefully on main before opening to pick up any final Track A churn.
References
#1157: parent issue for ooo auto (L3 lane body, minimal redesign 2026-05-22).
#1173: L1-a task-class catalog data (TaskClassProfile.runtime_probe_kinds).
#920 ("… ooo run trustworthy with a fat harness execution path") / #978 ("Design spine: AgentOS evidence-gated delivery via TraceGuard"): Track A fat-harness verifier (extended by L3-2).