Meta SSOT slice: L0 — Canonical Test Harness for ooo auto acceptance

# Meta SSOT slice: L0 — Canonical acceptance test (minimal)

> **Implementation cleanup (2026-05-23 KST).** L0 is partially implemented: #1174 merged the canonical harness skeleton/initial scenario, and #1191 merged opt-in live wiring plus L1 catalog cross-validation. Treat remaining scenario additions and #1195 evidence alignment as the open work; nightly/replay/cost-budget sections below are superseded v2 ideas unless new evidence reopens them.

This issue is the **L0 design slice** of #1157. It provides the smallest possible *acceptance gate* that the SSOT itself can point at to declare close-readiness.

## Why this is small on purpose

The earlier draft of L0 proposed a nightly CI workflow, a recorded-replay layer for hermetic mode, a `divergence_count` cross-check job, a monthly LLM-cost budget, and a refresh-rotation ownership policy. **All of that is operational sludge, not `ooo auto` behavior.** L0's *only* job is: *"give the SSOT something concrete to point at when claiming the canonical test matrix passes."*

Per Ouroboros's minimal-substrate principle (*"add substrate only when evidence demands it"*), L0 v1 ships the smallest thing that satisfies the SSOT acceptance gate:

- A `tests/canonical/` directory with one self-contained scenario per canonical goal.
- A `pytest` entry point that drives `ouroboros_auto` end-to-end against a scenario.
- A *manual* run by a maintainer when assessing SSOT readiness.

That's it. No nightly CI, no replay infrastructure, no cost budget, no refresh policy, no divergence detection. If any of those prove valuable later (evidence from real usage), they get added as their own follow-up issues — not pre-built.

## Scope

L0 builds:

1. `tests/canonical/` directory layout.
2. Per-scenario self-contained fixtures.
3. A `pytest`-driven runner (`tests/canonical/conftest.py`) that invokes `ouroboros_auto` against each scenario and asserts the documented terminal state.
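As a rough illustration of item 3, the core of such a runner might look like the sketch below. The `AutoRunner` signature and the `fake_auto` stand-in are hypothetical; how `ouroboros_auto` is actually invoked is not specified in this issue, so the entry point is injected as a plain callable.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical signature: (goal, workdir) -> terminal state string.
# The real `ouroboros_auto` entry point may look quite different.
AutoRunner = Callable[[str, Path], str]


def run_and_assert(goal: str, expected_terminal: str, run_auto: AutoRunner) -> str:
    """Run one goal in a throwaway workdir and assert its terminal state."""
    with tempfile.TemporaryDirectory(prefix="canonical-") as tmp:
        terminal = run_auto(goal, Path(tmp))
    assert terminal == expected_terminal, (
        f"expected {expected_terminal!r}, got {terminal!r}"
    )
    return terminal


def fake_auto(goal: str, workdir: Path) -> str:
    """Stand-in used only for this sketch; always reports success."""
    (workdir / "README.md").write_text(f"# {goal}\n", encoding="utf-8")
    return "PRODUCT_COMPLETE"


print(run_and_assert("build a habit-tracker CLI", "PRODUCT_COMPLETE", fake_auto))
```

Injecting the callable keeps the sketch runnable without a live `ooo auto` session; a real runner would pass the actual entry point instead of `fake_auto`.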

## Initial canonical scenarios (one per SSOT canonical goal)

| Slug | Goal one-liner | Domain class (anticipated by L1) | Expected terminal |
|---|---|---|---|
| `cli-todo` | `ooo auto "build a habit-tracker CLI"` | `cli` | `PRODUCT_COMPLETE` |
| `webhook-receiver` | `ooo auto "build a webhook receiver that logs payloads to a SQLite DB"` | `webhook` | `PRODUCT_COMPLETE` |
| `vertical-slice-refactor` | `ooo auto "refactor src/foo into vertical slices"` | `refactor-in-place` | `CODE_COMPLETE` |
| `2d-kart-racer` | `ooo auto "build a tiny 2D kart racing browser game"` | `game-2d` | `PRODUCT_COMPLETE` |

`cli-todo` is the smallest scenario and ships first. Each additional scenario is its own follow-up PR. The `2d-kart-racer` scenario waits on L3's runtime-probe substrate.

## Per-scenario artifact shape

```
tests/canonical/<slug>/
├── goal.txt          # the one-line goal string
├── env/              # fixture files seeded into the temp workdir (often empty)
└── expected.yaml     # domain_class, completion_mode, runtime_probe_kinds, optional wall_clock_budget
```
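One way to read this layout from the runner is sketched below, using only the standard library. The `Scenario` dataclass and the `load_scenario` / `seed_workdir` helpers are illustrative names, not part of the merged harness; parsing `expected.yaml` is left out because it would need a YAML library.

```python
from __future__ import annotations

import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Scenario:
    slug: str      # directory name, e.g. "cli-todo"
    goal: str      # contents of goal.txt, stripped of surrounding whitespace
    env_dir: Path  # fixture directory (may be empty or absent)


def load_scenario(scenario_dir: Path) -> Scenario:
    """Read goal.txt and locate env/ for one tests/canonical/<slug>/ directory."""
    goal = (scenario_dir / "goal.txt").read_text(encoding="utf-8").strip()
    return Scenario(slug=scenario_dir.name, goal=goal, env_dir=scenario_dir / "env")


def seed_workdir(scenario: Scenario, workdir: Path) -> list[str]:
    """Copy env/ fixtures into workdir; return the relative paths copied."""
    if not scenario.env_dir.is_dir():
        return []
    shutil.copytree(scenario.env_dir, workdir, dirs_exist_ok=True)
    return sorted(
        p.relative_to(scenario.env_dir).as_posix()
        for p in scenario.env_dir.rglob("*")
        if p.is_file()
    )
```

A runner would typically call `load_scenario(Path("tests/canonical/cli-todo"))`, create a temporary directory, and pass both to `seed_workdir` before starting the run.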

`expected.yaml`'s `domain_class` is the expected output of L1's `derive_domain_from_ledger()`. The L0 runner asserts the inference matches expectation — this is how L0 indirectly tests L1's catalog correctness without coupling to L1's pattern functions.
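The comparison itself can stay very small. The sketch below assumes the runner already holds two plain mappings: the parsed `expected.yaml` and the values observed after the run. The `find_mismatches` helper is hypothetical, and treating `completion_mode` as the holder of the expected terminal state is an assumption, not something this issue confirms.

```python
from __future__ import annotations

from typing import Mapping

# Keys taken from the expected.yaml description above. Assumption: the
# completion_mode value is the expected terminal state.
CHECKED_KEYS = ("domain_class", "completion_mode")


def find_mismatches(
    expected: Mapping[str, object], observed: Mapping[str, object]
) -> list[str]:
    """Return one message per expected key whose observed value differs."""
    problems = []
    for key in CHECKED_KEYS:
        if key in expected and observed.get(key) != expected[key]:
            problems.append(
                f"{key}: expected {expected[key]!r}, got {observed.get(key)!r}"
            )
    return problems


expected = {"domain_class": "cli", "completion_mode": "PRODUCT_COMPLETE"}
observed = {"domain_class": "webhook", "completion_mode": "PRODUCT_COMPLETE"}
print(find_mismatches(expected, observed))
# → ["domain_class: expected 'cli', got 'webhook'"]
```

Because only the inferred value is compared, the check never imports L1's pattern functions, which matches the decoupling described above.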

## Sub-PR breakdown

1. **L0-a** — `tests/canonical/` skeleton + `conftest.py` runner + **`cli-todo` scenario only**. Manual `pytest tests/canonical/cli-todo/ -v` to execute. ~150 LoC.
2. **L0-b** — add `webhook-receiver` scenario. ~50 LoC.
3. **L0-c** — add `vertical-slice-refactor` scenario. ~50 LoC.
4. **L0-d** — add `2d-kart-racer` scenario. Waits on L3-a/b for `headless_run` probe substrate. ~80 LoC.

Total target: ~330 LoC across 4 PRs.

## How L0 is used

The maintainer runs the canonical matrix *manually* when:

- Assessing SSOT close-readiness (per the *Acceptance gate* in #1157).
- After landing a significant L1/L2/L3/L5 PR, to check that integration still works.
- Before tagging a release.

There is no CI obligation. There is no schedule. The matrix is a *checklist artifact*, not a continuous regression engine.

## Acceptance criteria

- [ ] `tests/canonical/cli-todo/` passes when run manually against current `main` (`PRODUCT_COMPLETE` terminal, stdout/exit-code golden match).
- [ ] Adding a new scenario requires no infrastructure change — just a new `tests/canonical/<slug>/` directory and a one-line addition to `conftest.py` parametrization.
- [ ] The L0 runner produces a single human-readable summary line per scenario for the maintainer to copy into a PR comment when claiming SSOT progress.
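For the summary-line criterion, a formatter along these lines would be enough. The `[canonical]` prefix and the field order are purely illustrative; the issue does not prescribe a format.

```python
def summary_line(slug: str, expected: str, actual: str, seconds: float) -> str:
    """Format one copy-pasteable result line for a scenario (hypothetical format)."""
    status = "PASS" if actual == expected else "FAIL"
    return (
        f"[canonical] {slug}: {status} "
        f"(expected={expected}, actual={actual}, {seconds:.1f}s)"
    )


print(summary_line("cli-todo", "PRODUCT_COMPLETE", "PRODUCT_COMPLETE", 412.34))
# → [canonical] cli-todo: PASS (expected=PRODUCT_COMPLETE, actual=PRODUCT_COMPLETE, 412.3s)
```

Keeping the output to a single line per scenario means the maintainer can paste the matrix result into a PR comment without editing.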

## Out of scope (deliberately)

- Nightly CI workflow — *evidence-driven addition only*. Add when manual cadence proves insufficient.
- Recorded-replay layer for hermetic mode — *evidence-driven addition only*. Add when live cost proves prohibitive.
- Cross-mode `divergence_count` metric — depends on replay layer.
- Cost ceiling / budget tracking — not relevant without nightly CI.
- Refresh-rotation ownership policy — not relevant without replay layer.
- Per-PR fast-subset CI workflow — not relevant without hermetic mode.
- Multi-environment scenarios (Windows, alternative Python versions) — single platform v1.

If any of the above proves valuable (e.g. SSOT close repeatedly slips because manual cadence misses regressions), open a follow-up issue with the concrete evidence motivating it. Do not pre-build.

## Decisions awaiting maintainer triage

**None.** The earlier draft had two BLOCK questions (`L0-2` nightly cost ceiling, `L0-4` replay refresh ownership). Both are retired: nightly CI is not in scope, so cost budget is meaningless; replay layer is not in scope, so refresh ownership is meaningless.

## Self-audit note (2026-05-22)

The original L0 design was *operational over-engineering*: nightly CI, replay layer, hermetic-vs-live cross-check, cost budgeting, refresh ownership rotation. None of it was Ouroboros-substrate. None of it affected `ooo auto`'s runtime behavior. It was project-management infrastructure I added because *"test harnesses usually have CI"* — a habitual reflex from other projects.

Per Ouroboros's minimal-substrate principle, this issue now ships the smallest L0 that satisfies the SSOT acceptance gate (*"one canonical goal end-to-end, reproducible 2x"*): a manually runnable pytest harness with four scenario fixtures. Operational infrastructure can be added later if and only if evidence demands it.

## References

- #1157 — Meta SSOT for `ooo auto` (L0 lane body, Acceptance gate).
- #1171 — L1 DomainProfile Catalog (consumed by `expected.yaml`'s `domain_class`).
- #961 — AgentOS roadmap (this lane has no Track A/B/C parent; treat as peer follow-up).


Meta SSOT slice: L0 — Canonical Test Harness for ooo auto acceptance #1170
