Meta SSOT slice: L3 — Runtime acceptance substrate (minimal, headless_run only)

# Meta SSOT slice: L3 — Runtime acceptance substrate (minimal)

> **Implementation cleanup (2026-05-23 KST).** L3 is partially implemented: #1181 merged `RuntimeEvidence` + `HeadlessRunProbe`; #1190 merged `AutoPipelineResult.runtime_probe_evidence` plus completion-grade `probe_runner`. Remaining known cleanup: #1195 and any evidence-driven real probe binding/scenario additions. `sim_trace`, `render_hash`, and `api_smoke` remain deferred v2 ideas.

This issue is the **L3 design slice** of #1157. It opens the runtime-evidence substrate that makes `PRODUCT_COMPLETE` mean *"the thing actually runs"* rather than *"tests pass"*.

Per the 2026-05-22 minimal-substrate audit (#1157 freshness sync), L3 ships **`headless_run` only** in v1. The earlier draft proposed a 4-probe-kind substrate (`headless_run` / `sim_trace` / `render_hash` / `api_smoke`); the latter three are deferred to evidence-driven follow-up PRs.

## Why this is small

`headless_run` (subprocess invocation + stdout / exit_code / duration capture) is the broadest minimal probe: it can validate any artifact that *runs as a command*. The four canonical scenarios in #1157's acceptance matrix map onto it as follows:

| Scenario | Needs headless_run? | Needs other probe? |
|---|---|---|
| `cli-todo` | ✅ | — |
| `webhook-receiver` | ✅ (starts server, runs smoke) | api_smoke optional |
| `vertical-slice-refactor` | ✅ (re-runs test suite) | — |
| `2d-kart-racer` | ✅ (headless N-tick sim) | sim_trace would refine |

Three of the four canonical scenarios are *fully served* by `headless_run`. The fourth (`2d-kart-racer`) benefits from `sim_trace` but can use `headless_run` as a v1 floor (does the binary start without crashing?). A `headless_run`-only v1 is therefore sufficient for the SSOT acceptance gate.
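As a rough illustration of what "subprocess invocation + stdout / exit_code / duration capture" amounts to, the sketch below runs a command and records those fields. The function name `run_headless` and the dictionary keys are assumptions made for this example, not the shipped `HeadlessRunProbe` API.

```python
import subprocess
import sys
import time


def run_headless(command: list[str]) -> dict:
    """Run a command without a UI and capture its observable outcome.

    Hypothetical helper: the name and the returned keys are illustrative.
    """
    started = time.monotonic()
    completed = subprocess.run(
        command,
        capture_output=True,  # collect stdout and stderr instead of inheriting them
        text=True,            # decode the captured bytes to str
        check=False,          # a non-zero exit is evidence, not an exception
    )
    return {
        "command": command,
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
        "duration_seconds": time.monotonic() - started,
    }


# v1 floor check: does the artifact start and exit cleanly?
result = run_headless([sys.executable, "-c", "print('started')"])
passed = result["exit_code"] == 0
```

The same call shape covers all four rows of the table, because each scenario only has to expose some command that exits with a meaningful status.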

## Substrate honesty

L3 v1 introduces:

- One new dataclass: `RuntimeEvidence` (text-based, JSON-serializable). Lives next to existing `EvidenceRecord` in `src/ouroboros/orchestrator/evidence_schema.py` or a sibling module.
- One probe implementation: `HeadlessRunProbe` — invokes a documented command, captures stdout / stderr / exit_code / duration.
- One binding lookup: given a `TaskClass` (L1-a / #1173), return the list of `runtime_probe_kinds` declared in `TaskClassProfile`. The catalog already declares which classes want which kinds.
- One verifier integration: Track A fat-harness verifier (`#920` / `#978`) receives `RuntimeEvidence` as a grade input alongside the existing unit-test evidence.
- One envelope field: `AutoPipelineResult.runtime_probe_evidence: tuple[RuntimeEvidence, ...]`.

No new EventStore event family, no new aggregate type, no projection migration. The change is purely additive to the existing evidence schema.
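A minimal sketch of the dataclass and the binding lookup described above, assuming field names that mirror the captured values. The `TaskClass` enum, the `_PROBE_KINDS` table, and `probe_kinds_for` are hypothetical stand-ins for the L1-a catalog and `TaskClassProfile.runtime_probe_kinds`; the real shapes are pinned in the L3-1 PR.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class TaskClass(Enum):
    """Hypothetical stand-in for the L1-a task-class catalog."""
    CLI = "cli"
    LIBRARY = "library"


@dataclass(frozen=True)
class RuntimeEvidence:
    """Text-based, JSON-serializable record of one probe run (assumed fields)."""
    probe_kind: str
    command: tuple[str, ...]
    exit_code: int
    stdout: str
    stderr: str
    duration_seconds: float

    def to_json(self) -> str:
        # asdict() yields only str / int / float / tuple values here,
        # all of which json.dumps can encode (tuples become arrays).
        return json.dumps(asdict(self), sort_keys=True)


# Hypothetical catalog: which probe kinds each task class declares.
_PROBE_KINDS = {
    TaskClass.CLI: ("headless_run",),
    TaskClass.LIBRARY: ("headless_run",),
}


def probe_kinds_for(task_class: TaskClass) -> tuple[str, ...]:
    """Return the declared probe kinds, or an empty tuple if none are bound."""
    return _PROBE_KINDS.get(task_class, ())


evidence = RuntimeEvidence(
    probe_kind="headless_run",
    command=("todo", "--help"),
    exit_code=0,
    stdout="usage: todo",
    stderr="",
    duration_seconds=0.12,
)
```

Freezing the dataclass keeps each record immutable, which fits an envelope field typed as `tuple[RuntimeEvidence, ...]`.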

## Sub-PR breakdown

1. **L3-1** — `src/ouroboros/orchestrator/runtime_evidence.py` (new) or extension of existing `evidence_schema.py`. Ships `RuntimeEvidence` dataclass + `HeadlessRunProbe` runner + L1 binding helper. ~180 LoC + unit tests. *Folds the original L3-a / L3-b / L3-c into one PR because the three artifacts share an import surface and splitting them creates dependency noise.*
2. **L3-2** — Track A verifier integration + envelope surface. Verifier reads `RuntimeEvidence` if present, includes the probe outcome in the grade input. `AutoPipelineResult.runtime_probe_evidence` populated in `_result()`. ~120 LoC + integration tests. *Touches `src/ouroboros/orchestrator/` — Track A queue (#1165 / #1166 / #1168) has drained, so collision risk is now low.*

Total target: ~300 LoC across 2 sub-PRs.

## v2 expansion path (deferred — evidence-driven follow-ups)

Each item below opens as its own follow-up issue *only when a canonical scenario demonstrates `headless_run` cannot cover it*:

- **`sim_trace` probe kind**: deterministic N-tick simulation + golden-state diff. Useful for game-class scenarios where exit_code is uninformative.
- **`render_hash` probe kind**: screenshot / DOM hash / first-frame fingerprint. Useful for UI-class scenarios where the artifact has no command-line surface.
- **`api_smoke` probe kind**: HTTP request → response shape match. Useful for web-service / webhook scenarios where stdout is empty by design.
- **Quality-gate JSON bundle**: borrow OMX ultragoal's structured `quality-gate.json` shape (`{aiSlopCleaner, verification, codeReview, runtimeProbe}`) as a single bundled evidence container.

## Open design questions (decide in L3-1 PR review)

1. **File placement**: extend `src/ouroboros/orchestrator/evidence_schema.py` *or* create a sibling `runtime_evidence.py`? *Recommendation:* a sibling module, which keeps the runtime-evidence surface independent of the in-flight Track A evidence schema and is easier to evolve.
2. **`HeadlessRunProbe` command source**: does the probe read its command from `TaskClassProfile`, from a per-scenario hint, or from a separate config? *Recommendation:* a per-`TaskClassProfile` default (e.g. `library` → `"pytest"`, `cli` → user-supplied invocation in the scenario `expected.yaml`). The concrete shape is pinned in L3-1 PR review.
3. **Subprocess sandbox**: what sandbox and timeout policy applies to the headless run? *Recommendation:* reuse the existing `state.runtime_backend` invocation infrastructure if it exists; otherwise use `subprocess.run` with the scenario-declared `wall_clock_budget_seconds` as `timeout=`.
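The fallback in question 3 could look roughly like the sketch below, assuming the budget arrives as a plain float. `run_with_budget` and its return keys are hypothetical; only `subprocess.run(..., timeout=...)` and `subprocess.TimeoutExpired` are standard-library behaviour.

```python
import subprocess
import sys


def run_with_budget(command: list[str], wall_clock_budget_seconds: float) -> dict:
    """Run a command, treating a blown wall-clock budget as a probe failure.

    Hypothetical helper: name and return shape are illustrative only.
    """
    try:
        completed = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=wall_clock_budget_seconds,  # child is killed when this elapses
            check=False,
        )
    except subprocess.TimeoutExpired:
        # No exit code exists for a killed run, so record the timeout explicitly.
        return {"timed_out": True, "exit_code": None}
    return {"timed_out": False, "exit_code": completed.returncode}


fast = run_with_budget([sys.executable, "-c", "pass"], 30.0)
slow = run_with_budget([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
```

Recording the timeout as its own field, rather than as a synthetic exit code, keeps "ran and failed" distinguishable from "never finished" when the evidence is graded later.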

## Acceptance criteria

- [ ] L3-1: `RuntimeEvidence` dataclass + `HeadlessRunProbe` + L1 binding helper land with unit tests for each.
- [ ] L3-1: A test exercises the full path `TaskClass.CLI → runtime_probe_kinds lookup → HeadlessRunProbe.run() → RuntimeEvidence`.
- [ ] L3-2: Track A verifier produces a grade that includes `RuntimeEvidence` outcome (probe PASS strengthens grade, probe FAIL weakens grade — concrete weight tuned per existing verifier API).
- [ ] L3-2: `AutoPipelineResult.runtime_probe_evidence` populated on every `ooo auto` run with at least one probe binding.
- [ ] No regression to existing `tests/unit/orchestrator/` evidence tests.
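The "probe PASS strengthens grade, probe FAIL weakens grade" criterion can be pictured with the sketch below. Everything in it is an assumption made for illustration: the numeric grade scale, the `PROBE_WEIGHT` placeholder, and the `ProbeOutcome` and `adjust_grade` names. The actual weighting is tuned against the existing verifier API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeOutcome:
    """Hypothetical reduced view of a RuntimeEvidence record."""
    probe_kind: str
    exit_code: int


# Placeholder weight; the real value depends on the verifier's grading scale.
PROBE_WEIGHT = 0.2


def adjust_grade(base_grade: float, outcomes: tuple[ProbeOutcome, ...]) -> float:
    """Nudge a grade in [0, 1] up for each passing probe and down for each failure."""
    grade = base_grade
    for outcome in outcomes:
        if outcome.exit_code == 0:
            grade += PROBE_WEIGHT
        else:
            grade -= PROBE_WEIGHT
    # Clamp so that many probes cannot push the grade outside the scale.
    return max(0.0, min(1.0, grade))


no_probe = adjust_grade(0.5, ())
with_pass = adjust_grade(0.5, (ProbeOutcome("headless_run", 0),))
with_fail = adjust_grade(0.5, (ProbeOutcome("headless_run", 1),))
```

With no probe bound, the grade is left untouched, which matches the verifier reading `RuntimeEvidence` only "if present".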

## Out of scope

- Adding new probe kinds (`sim_trace`, `render_hash`, `api_smoke`). v2 expansion path above.
- L0 canonical scenario fixtures consuming runtime evidence — those are *fixture data*, not substrate.
- Bundle JSON shape (OMX ultragoal-style). Adopt only if v1 ships more than one probe kind.

## Track A collision check

The original L3 design carried a Track A collision warning because verifier follow-ups #1165 / #1166 / #1168 were active. As of 2026-05-22, #1166 and #1168 are merged on `main`; #1165 is closed/superseded. **Collision risk now low.** L3-2 should still rebase carefully on `main` before opening to pick up any final Track A churn.

## References

- #1157 — Meta SSOT for `ooo auto` (L3 lane body, minimal redesign 2026-05-22).
- #1173 — L1-a task-class catalog (provides `TaskClassProfile.runtime_probe_kinds`).
- #1174 — L0-a canonical harness (consumes runtime evidence in the live-run path).
- #920 / #978 — Track A fat-harness verifier (extended by L3-2).
- #946 — Track C projection vocabulary (no projection v1 change needed for L3 v1).
- #1166 / #1168 — Track A verifier follow-ups (merged; collision risk cleared).

