
Meta SSOT slice: L0 — Canonical Test Harness for ooo auto acceptance #1170

@shaun0927

Meta SSOT slice: L0 — Canonical acceptance test (minimal)

Implementation cleanup (2026-05-23 KST). L0 is partially implemented: #1174 merged the canonical harness skeleton/initial scenario, and #1191 merged opt-in live wiring plus L1 catalog cross-validation. Treat remaining scenario additions and #1195 evidence alignment as the open work; nightly/replay/cost-budget sections below are superseded v2 ideas unless new evidence reopens them.

This issue is the L0 design slice of #1157. It provides the smallest possible acceptance gate that the SSOT itself can point at to declare close-readiness.

Why this is small on purpose

The earlier draft of L0 proposed a nightly CI workflow, a recorded-replay layer for hermetic mode, a divergence_count cross-check job, a monthly LLM-cost budget, and a refresh-rotation ownership policy. All of that is operational sludge, not ooo auto behavior. L0's only job is: "give the SSOT something concrete to point at when claiming the canonical test matrix passes."

Per Ouroboros's minimal-substrate principle ("add substrate only when evidence demands it"), L0 v1 ships the smallest thing that satisfies the SSOT acceptance gate:

  • A tests/canonical/ directory with one self-contained scenario per canonical goal.
  • A pytest entry point that drives ouroboros_auto end-to-end against a scenario.
  • A maintainer runs it manually when assessing SSOT readiness.

That's it. No nightly CI, no replay infrastructure, no cost budget, no refresh policy, no divergence detection. If any of those prove valuable later (evidence from real usage), they get added as their own follow-up issues — not pre-built.

Scope

L0 builds:

  1. tests/canonical/ directory layout.
  2. Per-scenario self-contained fixtures.
  3. A pytest-driven runner (tests/canonical/conftest.py) that invokes ouroboros_auto against each scenario and asserts the documented terminal state.
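
The runner in item 3 could be sketched roughly as follows. This is a minimal sketch, not the merged harness: `SCENARIOS`, `CANONICAL_ROOT`, `assert_terminal_state`, and the `run_auto` adapter are hypothetical names, and the way `ouroboros_auto` is actually invoked is not specified in this issue, so it is injected as a callable. `pytest_generate_tests` and `metafunc.parametrize` are standard pytest hooks.

```python
from pathlib import Path
from typing import Callable

# Scenario slugs enabled in the matrix. Adding a scenario is a one-line change here.
SCENARIOS = ["cli-todo"]

# Assumed location of the scenario directories, relative to the repo root.
CANONICAL_ROOT = Path("tests/canonical")


def pytest_generate_tests(metafunc):
    """pytest hook: parametrize any test that requests a `scenario_slug` argument."""
    if "scenario_slug" in metafunc.fixturenames:
        metafunc.parametrize("scenario_slug", SCENARIOS, ids=SCENARIOS)


def assert_terminal_state(
    slug: str,
    expected_terminal: str,
    run_auto: Callable[[str], str],
    root: Path = CANONICAL_ROOT,
) -> None:
    """Run one scenario and assert it ends in the documented terminal state.

    `run_auto` is a hypothetical adapter around ouroboros_auto: it takes the
    goal string and returns the name of the terminal state that was reached.
    """
    goal = (root / slug / "goal.txt").read_text(encoding="utf-8").strip()
    actual = run_auto(goal)
    assert actual == expected_terminal, (
        f"{slug}: expected {expected_terminal}, got {actual}"
    )
```

A test function that declares a `scenario_slug` parameter would then run once per entry in `SCENARIOS`, which keeps the "one-line addition" property described under the acceptance criteria.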

Initial canonical scenarios (one per SSOT canonical goal)

| Slug | Goal one-liner | Domain class (anticipated by L1) | Expected terminal |
| --- | --- | --- | --- |
| cli-todo | ooo auto "build a habit-tracker CLI" | cli | PRODUCT_COMPLETE |
| webhook-receiver | ooo auto "build a webhook receiver that logs payloads to a SQLite DB" | webhook | PRODUCT_COMPLETE |
| vertical-slice-refactor | ooo auto "refactor src/foo into vertical slices" | refactor-in-place | CODE_COMPLETE |
| 2d-kart-racer | ooo auto "build a tiny 2D kart racing browser game" | game-2d | PRODUCT_COMPLETE |

cli-todo is the smallest scenario and ships first. Each additional scenario is its own follow-up PR. The 2d-kart-racer scenario waits on L3's runtime-probe substrate.

Per-scenario artifact shape

tests/canonical/<slug>/
├── goal.txt          # the one-line goal string
├── env/              # fixture files seeded into the temp workdir (often empty)
└── expected.yaml     # domain_class, completion_mode, runtime_probe_kinds, optional wall_clock_budget
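
As an illustration of how this layout might be consumed, the sketch below reads `goal.txt` and seeds a fresh temporary workdir from `env/`. The function names `load_goal` and `seed_workdir` are hypothetical and not taken from the harness; only the file and directory names come from the layout above.

```python
import shutil
import tempfile
from pathlib import Path


def load_goal(scenario_dir: Path) -> str:
    """Return the one-line goal string stored in goal.txt."""
    return (scenario_dir / "goal.txt").read_text(encoding="utf-8").strip()


def seed_workdir(scenario_dir: Path) -> Path:
    """Create a fresh temp workdir and copy the scenario's env/ fixtures into it."""
    workdir = Path(tempfile.mkdtemp(prefix=f"canonical-{scenario_dir.name}-"))
    env_dir = scenario_dir / "env"
    # env/ is often empty, and an empty directory may not exist in a git checkout,
    # so a missing env/ is treated the same as an empty one.
    if env_dir.is_dir():
        shutil.copytree(env_dir, workdir, dirs_exist_ok=True)
    return workdir
```

Copying into a temp directory rather than running in place keeps each scenario self-contained, so a run cannot modify the checked-in fixtures.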

expected.yaml's domain_class is the expected output of L1's derive_domain_from_ledger(). The L0 runner asserts the inference matches expectation — this is how L0 indirectly tests L1's catalog correctness without coupling to L1's pattern functions.
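
The domain_class assertion can be sketched as below, assuming `expected.yaml` has already been parsed into a plain mapping by a YAML parser (not shown). `Expected`, `expected_from_mapping`, and `assert_domain_matches` are hypothetical names; the field names mirror the comment in the layout above, and the unit of `wall_clock_budget` is not stated in this issue. The inferred value is passed in as a string, so the check never imports L1's pattern functions.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Tuple


@dataclass(frozen=True)
class Expected:
    domain_class: str
    completion_mode: str
    runtime_probe_kinds: Tuple[str, ...] = ()
    wall_clock_budget: Optional[float] = None  # optional; unit is an open assumption


def expected_from_mapping(data: Mapping[str, Any]) -> Expected:
    """Build an Expected record from the parsed contents of expected.yaml."""
    return Expected(
        domain_class=data["domain_class"],
        completion_mode=data["completion_mode"],
        runtime_probe_kinds=tuple(data.get("runtime_probe_kinds") or ()),
        wall_clock_budget=data.get("wall_clock_budget"),
    )


def assert_domain_matches(expected: Expected, inferred_domain: str) -> None:
    """Fail if the domain inferred by L1 differs from the scenario's expectation.

    `inferred_domain` stands for the value returned by L1's
    derive_domain_from_ledger(); how that call is made is outside this sketch.
    """
    assert inferred_domain == expected.domain_class, (
        f"domain_class mismatch: expected {expected.domain_class!r}, "
        f"inferred {inferred_domain!r}"
    )
```

Because the check compares only the final inferred label, L1 can change how it derives the domain without any edit to the L0 scenarios.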

Sub-PR breakdown

  1. L0-a: tests/canonical/ skeleton + conftest.py runner + cli-todo scenario only. Run manually with pytest tests/canonical/cli-todo/ -v. ~150 LoC.
  2. L0-b: add webhook-receiver scenario. ~50 LoC.
  3. L0-c: add vertical-slice-refactor scenario. ~50 LoC.
  4. L0-d: add 2d-kart-racer scenario. Waits on L3-a/b for headless_run probe substrate. ~80 LoC.

Total target: ~330 LoC across 4 PRs.

How L0 is used

The maintainer runs the canonical matrix manually when assessing SSOT readiness.

There is no CI obligation. There is no schedule. The matrix is a checklist artifact, not a continuous regression engine.

Acceptance criteria

  • tests/canonical/cli-todo/ passes when run manually against current main (PRODUCT_COMPLETE terminal, stdout/exit-code golden match).
  • Adding a new scenario requires no infrastructure change — just a new tests/canonical/<slug>/ directory and a one-line addition to conftest.py parametrization.
  • The L0 runner produces a single human-readable summary line per scenario for the maintainer to copy into a PR comment when claiming SSOT progress.
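
The per-scenario summary line from the last criterion might look like the sketch below. The format, the `[canonical]` prefix, and the `summary_line` name are illustrative assumptions; the issue only requires that the line be single, human-readable, and easy to paste into a PR comment.

```python
def summary_line(
    slug: str,
    expected_terminal: str,
    actual_terminal: str,
    elapsed_s: float,
) -> str:
    """Format one copy-pasteable result line for a canonical scenario."""
    status = "PASS" if actual_terminal == expected_terminal else "FAIL"
    return (
        f"[canonical] {slug}: {status} "
        f"(expected={expected_terminal}, actual={actual_terminal}, {elapsed_s:.1f}s)"
    )


if __name__ == "__main__":
    # Illustrative values only, not a recorded run.
    print(summary_line("cli-todo", "PRODUCT_COMPLETE", "PRODUCT_COMPLETE", 12.34))
```

Keeping the expected and actual terminal states on the same line means a failing scenario is diagnosable from the PR comment alone.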

Out of scope (deliberately)

  • Nightly CI workflow — evidence-driven addition only. Add when manual cadence proves insufficient.
  • Recorded-replay layer for hermetic mode — evidence-driven addition only. Add when live cost proves prohibitive.
  • Cross-mode divergence_count metric — depends on replay layer.
  • Cost ceiling / budget tracking — not relevant without nightly CI.
  • Refresh-rotation ownership policy — not relevant without replay layer.
  • Per-PR fast-subset CI workflow — not relevant without hermetic mode.
  • Multi-environment scenarios (Windows, alternative Python versions) — single platform v1.

If any of the above proves valuable (e.g. SSOT close repeatedly slips because manual cadence misses regressions), open a follow-up issue with the concrete evidence motivating it. Do not pre-build.

Decisions awaiting maintainer triage

None. The earlier draft had two BLOCK questions (L0-2 nightly cost ceiling, L0-4 replay refresh ownership). Both are retired: nightly CI is not in scope, so cost budget is meaningless; replay layer is not in scope, so refresh ownership is meaningless.

Self-audit note (2026-05-22)

The original L0 design was operational over-engineering: nightly CI, replay layer, hermetic-vs-live cross-check, cost budgeting, refresh ownership rotation. None of it was Ouroboros-substrate. None of it affected ooo auto's runtime behavior. It was project-management infrastructure I added because "test harnesses usually have CI" — a habitual reflex from other projects.

Per Ouroboros's minimal-substrate principle, this issue now ships the smallest L0 that satisfies the SSOT acceptance gate ("one canonical goal end-to-end, reproducible 2x") — a manually-runnable pytest harness with four scenario fixtures. Operational infrastructure can be added later if and only if evidence demands it.

Labels

OS (Core engine, state machine, internal pipeline, and system-level behavior); enhancement (New feature or meaningful improvement); needs-design (Multi-PR epic or architectural change, needs human planning)
