Meta SSOT: ooo auto Vision — Autonomous Completion Engine #1157

@shaun0927

Living SSOT for ooo auto's direction and improvement plan. Stays OPEN
until the engine reliably completes the canonical end-to-end test
(e.g. ooo auto "make me a 2D kart racing game") without human
intervention. Sibling to #961 (AgentOS roadmap sequencing).

Scope note. This SSOT is not about redesigning Ouroboros's Socratic
interview / tacit-knowledge substrate — ooo auto inherits that as-is.
What this SSOT owns is everything around the interview: domain-aware
spec inflation, long-running resilience, runtime acceptance, and a
typed completion contract.

Substrate honesty. Four of the five implementation lanes (L0, L1, L3,
L5) extend existing substrate. One lane — L2 watchdog v1 — adds
exactly one new EventStore event family (runtime.watchdog.cancel)
as the minimum needed to record why a run was cancelled at minute X.
Earlier drafts proposed a richer 3-timer / 4-directive vocabulary; per
Ouroboros's minimal-substrate principle, those are deferred to a v2
expansion path triggered only by evidence of stalls that wall-clock
alone cannot catch.

SSOT cleanup note (2026-05-23 KST). This issue was refreshed after the #1173/#1174/#1175/#1178/#1181/#1188/#1189/#1190/#1191 merge train. Current implementation terms are TaskClass / TaskClassProfile and runtime.watchdog.cancel; older “DomainProfile Catalog”, runtime.watchdog.decision, and directive-vocabulary wording is historical unless a section explicitly says it is describing a rejected prior draft. Open cleanup PRs #1194/#1195/#1196 are the remaining known Track-B/ooo auto follow-ups; no new issue is needed for those.

TL;DR — Status at a glance

| Lane | Owner / Anchor | Status | Gate / Next |
|---|---|---|---|
| L0: Canonical acceptance test | #1170 umbrella | 🟡 partially implemented | #1174 merged the canonical harness skeleton; #1191 merged opt-in OUROBOROS_RUN_CANONICAL=1 live wiring + L1 catalog cross-validation. Remaining work is scenario/evidence cleanup only, especially #1195; no nightly/replay/cost substrate unless later evidence demands it. |
| L1: TaskClass Catalog | #1171 umbrella | 🟡 partially implemented | #1173 merged TaskClass / TaskClassProfile catalog data; #1188 merged Seed AC injection + active task class envelope. “DomainProfile Catalog” wording below is historical/stale unless explicitly referring to the old recovery-hint concept. Remaining MCP metadata cleanup is #1196. |
| L2: Watchdog v1 (new substrate, minimal) | #1172 umbrella; lifts #578 | 🟡 partially implemented | #1178 merged RuntimeControls + wall-clock Watchdog; #1189 merged production AutoPipeline/CLI/MCP consumption via runtime.watchdog.cancel and typed stop reason. Do not revive the old runtime.watchdog.decision / directive-vocabulary design. Remaining cleanup is #1194. |
| L3: Runtime Acceptance Substrate | #1176 umbrella | 🟡 partially implemented | #1181 merged RuntimeEvidence + HeadlessRunProbe; #1190 merged AutoPipelineResult.runtime_probe_evidence + completion-grade probe_runner gate. Remaining cleanup is evidence alignment (#1195) and any future real-probe binding/scenario expansion; sim_trace/render_hash/api_smoke stay deferred. |
| L4: Auto Envelope v2 | #1146/#1148/#1151/#1167/#1169 all merged | 🟢 complete | All five planned envelope fields landed (defaulted_sections, interview_closure_mode, stop_reason_code, assumption_sources, safe_default closure). Lane frozen. |
| L5: Long-running Resilience (minimal) | extends ooo unstuck + Ralph oscillation detection | 🟢 L5-a merged | #1175 merged the oscillation_detected → UNSTUCK_LATERAL routing slice. Further resilience substrate remains evidence-driven, not prebuilt. |

North-star

ooo auto "<one vague line>" — a single MCP invocation — drives the work to
one of three typed terminal states without further user input:

  1. CODE_COMPLETE — passing tests + lint + (for libraries) a usable
    API surface, evidence captured.
  2. PRODUCT_COMPLETE — code-complete plus the domain-class
    runtime-acceptance probe passes (headless run / sim / render hash).
  3. BLOCKED(reason_code) — resumable auto_session_id, classified
    stop reason from the 7-code taxonomy, no fabrication, no silent
    abandonment, no untyped failure path.

Branching between (1) and (2) is decided by DomainProfile
auto-classification (L1). The user never has to know which mode applied.
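
As a rough illustration of the three terminal states above, the sketch below models them as one typed value and refuses to construct a BLOCKED terminal that has no reason code. All names here (`TerminalKind`, `Terminal`, `classify_terminal`, and the `probe_failed` code in the usage note) are hypothetical and not taken from the Ouroboros codebase.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class TerminalKind(Enum):
    CODE_COMPLETE = "code_complete"
    PRODUCT_COMPLETE = "product_complete"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class Terminal:
    """Hypothetical terminal-state record; field names are assumptions."""

    kind: TerminalKind
    reason_code: Optional[str] = None       # required when kind is BLOCKED
    auto_session_id: Optional[str] = None   # lets a blocked run be resumed

    def __post_init__(self) -> None:
        # "No untyped failure path": BLOCKED must always carry a code.
        if self.kind is TerminalKind.BLOCKED and not self.reason_code:
            raise ValueError("BLOCKED requires a typed reason_code")


def classify_terminal(
    checks_pass: bool,
    needs_runtime_probe: bool,
    probe_pass: bool,
    session_id: str,
    reason_code: str = "",
) -> Terminal:
    """Pick a terminal state from test/lint results and the runtime probe."""
    if not checks_pass:
        return Terminal(TerminalKind.BLOCKED, reason_code, session_id)
    if not needs_runtime_probe:
        return Terminal(TerminalKind.CODE_COMPLETE, auto_session_id=session_id)
    if probe_pass:
        return Terminal(TerminalKind.PRODUCT_COMPLETE, auto_session_id=session_id)
    return Terminal(TerminalKind.BLOCKED, reason_code, session_id)
```

Under these assumptions, `classify_terminal(True, True, False, "s1", "probe_failed")` yields a BLOCKED terminal, while omitting the reason code on a failing run raises instead of returning an untyped failure.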

Why this is bigger than "Seed → Run"

skills/auto/SKILL.md currently optimizes:
Interview → A-grade Seed → Run handoff → (optional) Ralph.

That's the engineering ceiling, not the product. The product target
is: deliver a verifiably-working artifact from a vague one-line goal,
without the user pre-thinking the spec, the verification, or the
recovery.

Three invariants have to hold for that:

  • The system must infer the domain without being told.
  • The system must probe runtime behavior, not just unit tests.
  • The system must refuse to die quietly on long-running work.

L1–L5 below are exactly the missing pieces that turn the current
Seed→Run pipeline into a completion engine.

Success conditions (testable, frozen)

We close this SSOT when, against a fixed canonical test matrix
(ooo auto "CLI todo manager", "2D kart racer", "webhook receiver service", "refactor src/foo into vertical slices"), all of the
following hold without human intervention in at least one of the goals,
reproducible across two consecutive runs on a clean repo:

| # | Condition | Today | Target |
|---|---|---|---|
| 1 | Interview closes with closure_mode ∈ {mutual_agreement, ledger_only, safe_default} | partial | 100% (genuine_deadlock = 0) |
| 2 | Seed AC reflects the inferred domain class (e.g. game → game-loop / input / render-target / playable build) | none | 100% via L1 |
| 3 | Long-running run never stalls without a runtime.watchdog.decision Directive | ad-hoc | 100% via L2 |
| 4 | Evidence bundle contains runtime proof (sim trace / headless run / render hash), not only unit tests | unit-tests only | 100% via L3 |
| 5 | Result envelope exposes closure_mode, defaulted_sections, assumptions[source], stop_reason_code (7-code taxonomy) | partial | 100% via L4 |
| 6 | Sessions resumable across days; same goal lands on same lineage | shipped (#1138 et al.) | hold |
| 7 | On stall / regression, escalation ladder (unstuck → reframe → safe-default → BLOCKED) runs to a terminal state | partial | 100% via L5 |

Lanes

L0 — Canonical acceptance test (minimal)

The acceptance condition in this SSOT (one canonical goal end-to-end,
reproducible 2x) needs something concrete to point at. L0 provides
the smallest possible thing: a tests/canonical/ directory with one
fixture per canonical goal and a pytest entry point. The maintainer
runs it manually
when assessing SSOT close-readiness — no CI obligation.

Scope:

  • tests/canonical/<slug>/ per goal: goal.txt + optional env/ +
    expected.yaml (domain_class, completion_mode, optional
    wall_clock_budget_seconds).
  • tests/canonical/conftest.py — pytest runner that invokes the
    ouroboros_auto MCP tool against the scenario and asserts the
    documented terminal state.
  • 4 initial scenarios: cli-todo, webhook-receiver,
    vertical-slice-refactor, 2d-kart-racer (last requires L3).
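
A minimal sketch of how the scenario layout above could be discovered and loaded. It assumes flat `key: value` lines in expected.yaml so that it stays stdlib-only (a real harness would presumably use a YAML library); `Scenario`, `_read_flat_yaml`, and `load_scenarios` are hypothetical names, and the optional env/ directory is ignored.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    """One tests/canonical/<slug>/ directory (hypothetical shape)."""

    slug: str
    goal: str
    domain_class: str
    completion_mode: str
    wall_clock_budget_seconds: Optional[int] = None


def _read_flat_yaml(path: Path) -> dict[str, str]:
    """Parse flat `key: value` lines only; a stand-in for a YAML loader."""
    pairs: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs


def load_scenarios(root: Path) -> list[Scenario]:
    """Load every scenario directory under root, sorted by slug."""
    scenarios = []
    for directory in sorted(p for p in root.iterdir() if p.is_dir()):
        expected = _read_flat_yaml(directory / "expected.yaml")
        budget = expected.get("wall_clock_budget_seconds")
        scenarios.append(
            Scenario(
                slug=directory.name,
                goal=(directory / "goal.txt").read_text(encoding="utf-8").strip(),
                domain_class=expected["domain_class"],
                completion_mode=expected["completion_mode"],
                wall_clock_budget_seconds=int(budget) if budget else None,
            )
        )
    return scenarios
```

A pytest entry point could feed `load_scenarios(Path("tests/canonical"))` into `pytest.mark.parametrize`, with each case invoking the ouroboros_auto MCP tool and comparing the terminal state against the loaded expectations.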

Out of scope (deliberately): nightly CI workflow, recorded-replay
layer, hermetic-vs-live divergence detection, monthly cost budget,
refresh-rotation ownership policy, per-PR fast-subset CI. All of those
were operational sludge added by reflex; each gets opened as a
follow-up only if/when evidence demands it. See #1170 Self-audit note.

Dependency: none. L0-a can ship today.

Design tracked in #1170.

L1 — DomainProfile Catalog

Promote #849 Phase-3 DomainProfile from "typed recovery hint" to a
first-class domain taxonomy that drives default AC, completion
mode, and runtime-probe binding — without a separate LLM classifier.

Design tracked in #1171.

Frozen 7-class catalog for L1-a (deferred classes: game-3d,
desktop-app, notebook-analysis — each becomes its own ≤ 10-LoC
follow-up PR):

  • library, cli, web-service, webhook, data-pipeline,
    game-2d, refactor-in-place.

Each class declares default_completion_mode, default_ac_template
(plain tuple[str, ...] matching Seed.acceptance_criteria),
runtime_probe_kinds (bound by L3), and the existing #849
safe_defaults.
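
To make the per-class fields concrete, here is a small sketch of a catalog entry and default-AC injection. The field names come from the paragraph above; the class name `TaskClassProfileSketch`, the `dict[str, str]` type for safe_defaults, the helper `inject_default_ac`, and the AC strings are illustrative assumptions, not the merged catalog data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskClassProfileSketch:
    """Hypothetical per-class record mirroring the fields named in L1."""

    name: str
    default_completion_mode: str             # "CODE_COMPLETE" or "PRODUCT_COMPLETE"
    default_ac_template: tuple[str, ...]     # same shape as Seed.acceptance_criteria
    runtime_probe_kinds: tuple[str, ...] = ()
    safe_defaults: dict[str, str] = field(default_factory=dict)  # assumed type


# Two illustrative entries; the AC wording is invented for the example.
CATALOG = {
    profile.name: profile
    for profile in (
        TaskClassProfileSketch(
            "library",
            "CODE_COMPLETE",
            ("public API is importable", "tests and lint pass"),
        ),
        TaskClassProfileSketch(
            "game-2d",
            "PRODUCT_COMPLETE",
            ("game loop runs", "input is handled",
             "render target exists", "playable build starts"),
            runtime_probe_kinds=("headless_run",),
        ),
    )
}


def inject_default_ac(seed_ac: tuple[str, ...], class_name: str) -> tuple[str, ...]:
    """Append class defaults that the seed does not already contain."""
    template = CATALOG[class_name].default_ac_template
    return seed_ac + tuple(ac for ac in template if ac not in seed_ac)
```

Keeping the template a plain tuple of strings means injection is simple concatenation with de-duplication, which is the property the lane text relies on when it says the template matches Seed.acceptance_criteria.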

Domain inference is ledger-derived, not LLM-classified. The
Socratic interview already extracts structured SeedDraftLedger
entries (actors, inputs, outputs, runtime_context, …) and
standardizes them toward canonical vocabulary. L1-b is a pure-Python
derive_domain_from_ledger() function in
src/ouroboros/auto/domain_inference.py that pattern-matches those
entries against per-class predicates and returns one of:

  • single match — exactly one class predicate fires.
  • ambiguous — multiple classes fire (e.g. CLI + WEB-SERVICE on a
    CLI that also exposes HTTP). The next interview round gets a
    disambiguation-question candidate appended (small hook in
    interview_driver.py); the existing ambiguity-gate loop drives
    resolution. No new escalation system.
  • unmatched — no predicate fires. Falls to library (narrowest
    completion gate, lowest blast radius) and emits a domain_unmatched
    EventStore event for maintainer review.

Zero new LLM calls, zero new external API surface, zero new
substrate. Adding a new class to the catalog later is a ~10-LoC PR
(pattern function + unit test), not an eval-set re-curation.
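
The three outcomes can be sketched as a pure function over per-class predicates. This is only an illustration under stated assumptions: the ledger is modelled as a mapping from section name to a set of canonical terms, the predicates and terms are invented, and emitting the domain_unmatched event is left out. Only the function name `derive_domain_from_ledger` and the `library` fallback come from the text above.

```python
from dataclasses import dataclass
from typing import Callable, Mapping

# Assumed ledger shape: section name -> canonical vocabulary terms.
Ledger = Mapping[str, frozenset[str]]


@dataclass(frozen=True)
class DomainInference:
    outcome: str                  # "single" | "ambiguous" | "unmatched"
    classes: tuple[str, ...]      # matched classes, or the fallback class


def _has(section: str, term: str) -> Callable[[Ledger], bool]:
    """Build a predicate that fires when `term` appears in `section`."""
    return lambda ledger: term in ledger.get(section, frozenset())


# Illustrative predicates only; the real catalog covers seven classes.
PREDICATES: dict[str, Callable[[Ledger], bool]] = {
    "cli": _has("runtime_context", "terminal"),
    "web-service": _has("inputs", "http_request"),
    "game-2d": _has("actors", "player"),
}

FALLBACK_CLASS = "library"  # narrowest completion gate


def derive_domain_from_ledger(ledger: Ledger) -> DomainInference:
    fired = tuple(name for name, matches in PREDICATES.items() if matches(ledger))
    if len(fired) == 1:
        return DomainInference("single", fired)
    if fired:
        # Caller appends a disambiguation question to the next interview round.
        return DomainInference("ambiguous", fired)
    # Caller would also emit the domain_unmatched event here.
    return DomainInference("unmatched", (FALLBACK_CLASS,))
```

In this shape, adding a class is one new entry in `PREDICATES` plus a unit test, which is roughly the size of change the lane text describes.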

Why not a classifier. An earlier draft proposed a separate Sonnet
classifier with an eval set + accuracy floor + opt-in telemetry. That
duplicated the inference the interview is already doing and violated
this SSOT's own scope note ("ooo auto inherits Ouroboros's
substrate as-is"). The redesign is documented in the Freshness sync
section under the 2026-05-22 self-audit entry.

L2 — Watchdog v1 (minimal new substrate)

The smallest watchdog that satisfies "long-running run never stalls without a recorded reason": one timer per session (wall-clock), one event (runtime.watchdog.cancel), one new stop_reason_code (watchdog_wall_clock_exceeded). When now exceeds the session start time plus session_wall_clock_seconds, the watchdog fires, the EventStore records it, and the pipeline transitions to BLOCKED with the typed code. Timer state is implicit in AutoPipelineState.session_started_at, so resume semantics work without separate serialization.
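
A minimal sketch of the wall-clock check, assuming the watchdog is a pure function of the stored start time, the budget, and the current time. `WatchdogCancel` and `check_wall_clock` are hypothetical names; the event type and stop reason strings are the ones named in this lane.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class WatchdogCancel:
    """Hypothetical payload for the single watchdog event."""

    event_type: str
    stop_reason_code: str
    elapsed_seconds: float


def check_wall_clock(
    session_started_at: datetime,
    session_wall_clock_seconds: float,
    now: datetime,
) -> Optional[WatchdogCancel]:
    """Return a cancel record once elapsed time exceeds the budget."""
    elapsed = (now - session_started_at).total_seconds()
    if elapsed <= session_wall_clock_seconds:
        return None  # still within budget, nothing to record
    return WatchdogCancel(
        event_type="runtime.watchdog.cancel",
        stop_reason_code="watchdog_wall_clock_exceeded",
        elapsed_seconds=elapsed,
    )


# Usage: a one-hour budget checked at minute 59 and minute 61.
start = datetime(2026, 5, 23, tzinfo=timezone.utc)
within_budget = check_wall_clock(start, 3600, start + timedelta(minutes=59))
over_budget = check_wall_clock(start, 3600, start + timedelta(minutes=61))
```

Because the only input state is the start timestamp, a resumed session can re-run the same check without restoring any timer object, which matches the resume argument made above.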

Substrate addition: exactly one new EventStore event family
(runtime.watchdog.cancel) and one new aggregate_type = "runtime_control". Confirmed additive to projection v1 by reading src/ouroboros/persistence/schema.py (informational confirmation posted on #946). The other four lanes in this SSOT light up existing substrate; L2 adds this single family.

v2 expansion (deferred, evidence-driven): richer 3-timer config
(idle / no_progress / safety), 4-directive vocabulary
(WAIT / RETRY / UNSTUCK / CANCEL), material_progress_events vs
activity_events split, subscriber pattern for cooperative cancel,
ad-hoc-timeout deprecation across MCP / Ralph / evolve. Each opens as
its own slice only when a real-world stall slips past v1 wall-clock.

Design tracked in #1172 (this issue lifts #578 to v1 minimum + documents v2 expansion path).

L3 — Runtime Acceptance Substrate

Extend Track A fat-harness (#920 / #978) evidence schema to legally
accept non-test evidence:

  • headless run logs (stdout, exit_code, duration)
  • deterministic simulation traces (N-tick sim + golden-state diff)
  • screenshot / DOM-hash / render-hash for UI classes
  • API smoke probes (request → response shape match) for service classes

DomainProfile (L1) binds each class to one or more probes. This is the
substrate change that makes PRODUCT_COMPLETE mean "the thing
actually runs", not "tests pass".

Minimal-substrate audit pending. Per the L0/L1/L2/L5 self-audits
(2026-05-22), L3 has not yet been re-examined through the same
minimal-substrate lens. Before opening the L3 design issue, ask:
which evidence kinds does v1 actually need, and which are
speculative? Likely v1 collapses to headless_run only (capture
stdout/exit_code/duration); sim_trace, render_hash, and api_smoke
each open as their own follow-up only when a canonical scenario
demands them. This audit happens when L3's design-issue PR is drafted,
not in this lane body.
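
For the headless_run evidence kind, a probe could look roughly like the sketch below: run a command without a display, then capture stdout, exit_code, and duration. `HeadlessRunEvidence` and `headless_run` are hypothetical names and are not the merged RuntimeEvidence / HeadlessRunProbe types; the probed command is a stand-in for a real build entry point.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HeadlessRunEvidence:
    """Hypothetical evidence record for one headless run."""

    kind: str
    stdout: str
    exit_code: int
    duration_seconds: float


def headless_run(command: Sequence[str], timeout_seconds: float = 30.0) -> HeadlessRunEvidence:
    """Run the command, capturing output instead of failing on non-zero exit."""
    started = time.monotonic()
    completed = subprocess.run(
        list(command),
        capture_output=True,
        text=True,
        timeout=timeout_seconds,  # raises subprocess.TimeoutExpired if exceeded
        check=False,              # a non-zero exit is evidence, not an exception
    )
    return HeadlessRunEvidence(
        kind="headless_run",
        stdout=completed.stdout,
        exit_code=completed.returncode,
        duration_seconds=time.monotonic() - started,
    )


# Usage: probe a trivial child process that stands in for a game build.
evidence = headless_run([sys.executable, "-c", "print('frame 0 rendered')"])
```

The timeout path is deliberately not handled here; a fuller version would presumably turn `subprocess.TimeoutExpired` into its own evidence record or stop reason.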

Track A collision warning. L3's verifier-integration slice (L3-d)
modifies the same src/ouroboros/orchestrator/ evidence-handling
surface that Track A verifier follow-ups #1165 / #1166 / #1168 are
currently active in. Sequencing rule:

  1. Land fix(orchestrator): log AC verifier and dependency diagnostics #1165 / fix(orchestrator): credit transcript test commands for tests_passed claims #1166 / fix(orchestrator): match command claims wrapped in output redirection/pager pipes #1168 (or their successors) on main first.
  2. Open L3-a (evidence-kind taxonomy) only after Track A queue is
    drained — taxonomy is pure-additive and safe regardless, but it
    becomes the authoritative shape downstream verifier work conforms to.
  3. Open L3-d (verifier integration) last in the L3 sequence so
    it conflicts with at most a single fresh main.

L4 — Auto Envelope v2

A single frozen v2 contract consumed identically by CLI, MCP, and any
future UI. 🟢 Complete — all five planned envelope fields landed:

| Field | Source | Status |
|---|---|---|
| defaulted_sections[] | #1146 | 🟢 merged |
| interview_closure_mode (mutual_agreement / ledger_only / safe_default) | #1148 + #1167 | 🟢 merged |
| stop_reason_code (8-code) | #1151 + #1167 (interview_unsafe_gaps_remain) | 🟢 merged |
| safe-default finalization on ledger_done=False (PR-B2) | #1167 | 🟢 merged |
| assumption_sources[] with AssumptionRecord provenance (PR-C2) | #1169 | 🟢 merged |

Lane frozen. Becomes the canonical shape used by L1 / L2 / L3 / L5
downstream.
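
One way to picture the frozen envelope is as a single serializable record that every consumer reads the same way. The sketch below is an assumption-laden illustration: `EnvelopeSketch`, `AssumptionRecordSketch`, their exact field layout, and `to_payload` are hypothetical, while the closure modes and the three source names are the ones listed in this issue.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

CLOSURE_MODES = frozenset({"mutual_agreement", "ledger_only", "safe_default"})


@dataclass(frozen=True)
class AssumptionRecordSketch:
    """Hypothetical stand-in for AssumptionRecord."""

    section: str
    source: str  # "ASSUMPTION" | "INFERENCE" | "CONSERVATIVE_DEFAULT"


@dataclass(frozen=True)
class EnvelopeSketch:
    """Hypothetical subset of the v2 result envelope."""

    interview_closure_mode: str
    stop_reason_code: Optional[str] = None
    defaulted_sections: tuple[str, ...] = ()
    assumption_sources: tuple[AssumptionRecordSketch, ...] = ()

    def __post_init__(self) -> None:
        if self.interview_closure_mode not in CLOSURE_MODES:
            raise ValueError(f"unknown closure mode: {self.interview_closure_mode}")


def to_payload(envelope: EnvelopeSketch) -> str:
    """Serialize once so CLI, MCP, and UI consumers see identical JSON."""
    return json.dumps(asdict(envelope), sort_keys=True)


envelope = EnvelopeSketch(
    interview_closure_mode="safe_default",
    defaulted_sections=("runtime_context",),
    assumption_sources=(
        AssumptionRecordSketch("runtime_context", "CONSERVATIVE_DEFAULT"),
    ),
)
payload = to_payload(envelope)
```

Validating the closure mode at construction time keeps an unknown value from ever reaching a consumer, which is the practical meaning of a frozen contract in this sketch.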

L5 — Long-running Resilience (minimal, existing substrate only)

The "refuse to die quietly" invariant — but built on what already exists,
not as new substrate.

What exists today:

What L5 v1 actually adds (the only missing link):

  • L5-a — when Ralph emits oscillation_detected during a single
    ooo auto session, automatically invoke ooo unstuck once before
    bailing. ~50 LoC + integration test.
  • L5-b — when ooo unstuck exhausts its budget (default: 1 attempt),
    emit a typed stop_reason_code="unstuck_exhausted" (new 10th code) so
    the result envelope distinguishes "tried unstuck and failed" from
    "never tried". ~50 LoC.

Out of scope (v1): new escalation-ladder state machine, new
oscillation-detector substrate, budget unification with L2, reframe
(ontologist) as a separate stage. Each can be added later if/when
evidence shows the v1 plumbing is too thin.
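
The L5-a / L5-b plumbing described above amounts to a bounded retry with a typed outcome. A minimal sketch, assuming the unstuck step can be represented as a callable that returns True on recovery; `handle_oscillation` and `UNSTUCK_BUDGET` are hypothetical names, and only the `unstuck_exhausted` code and the default budget of one attempt come from the text.

```python
from typing import Callable, Optional

UNSTUCK_BUDGET = 1  # default from the lane text: one attempt


def handle_oscillation(
    run_unstuck: Callable[[], bool],
    attempts_used: int = 0,
    budget: int = UNSTUCK_BUDGET,
) -> tuple[int, Optional[str]]:
    """React to an oscillation_detected signal.

    Returns (attempts_used, stop_reason_code). A None code means the run
    recovered and should continue; "unstuck_exhausted" means the budget
    was spent without recovery.
    """
    while attempts_used < budget:
        attempts_used += 1
        if run_unstuck():
            return attempts_used, None
    return attempts_used, "unstuck_exhausted"
```

Returning the attempt count alongside the code lets the caller persist it in session state, so a resumed session does not get a fresh unstuck attempt by accident.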

Total: ~100 LoC across 2 sub-PRs. Earlier draft was ~600 LoC of new
state-machine substrate that duplicated existing detection signals.

Lane dependency graph

The lanes are not independent — three real dependencies and one
collision risk constrain the ordering. Reading this graph before
opening a lane PR avoids both paper completion (L0 invariant) and
the Track A collision (L3 warning above).

                  ┌──────────────────────────────────────┐
                  │  L0 — Canonical Test Harness         │
                  │  (meta-lane: gates every other lane) │
                  └──────────────┬───────────────────────┘
                                 │ acceptance invariant
              ┌──────────────────┼──────────────────┐
              ▼                  ▼                  ▼
       ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
       │  L1         │    │  L2         │    │  L4         │
       │ DomainProf. │    │ RuntimeCtl. │    │ Envelope v2 │
       │ Catalog     │    │ v1 (new sub)│    │ 🟢 complete │
       └──────┬──────┘    └──────┬──────┘    └─────────────┘
              │                  │
              │ probe binding    │ UNSTUCK/CANCEL
              │                  │ directive hooks
              ▼                  ▼
       ┌─────────────┐    ┌─────────────┐
       │  L3         │    │  L5         │
       │ Runtime Acc.│    │ Long-running│
       │ Substrate   │    │ Resilience  │
       └──────┬──────┘    └──────┬──────┘
              │                  │
              │ runtime evidence │ typed terminals
              └────────┬─────────┘
                       ▼
              ┌─────────────────────┐
              │ Canonical matrix    │
              │ 1+ goal end-to-end  │
              │ × 2 reproducible    │
              └─────────────────────┘
                       │
                       ▼
                 SSOT #1157 close

Hard dependencies (a PR opened upstream of an arrow cannot claim
done until the downstream consumer is at least design-locked):

  • L1 → L3. L3 evidence probes are bound per DomainProfile class.
    L1-a (catalog data) must land before L3-c (probe binding).
  • L2 → L5. L5 escalation ladder hooks into L2's
    UNSTUCK / CANCEL directives. L5-a (state machine) can start
    before L2 lands; L5-c (watchdog integration) cannot.
  • L0 → everyone. Every other lane's "complete" claim is
    validated by the canonical matrix runner. L0 doesn't block lane
    implementation — it blocks lane completion claims.
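
The hard dependencies can be turned into implementation waves with a topological sort. This sketch works at lane granularity only, which is coarser than the sub-slice rules above (for example, it does not model that one L5 slice may start before L2 lands), and it leaves L0 out because L0 gates completion claims rather than implementation. `HARD_DEPENDENCIES` and `implementation_waves` are hypothetical names.

```python
from graphlib import TopologicalSorter

# Each lane maps to the set of lanes it hard-depends on (L1 -> L3, L2 -> L5).
HARD_DEPENDENCIES = {
    "L1": set(),
    "L2": set(),
    "L4": set(),
    "L3": {"L1"},
    "L5": {"L2"},
}


def implementation_waves(dependencies: dict[str, set[str]]) -> list[list[str]]:
    """Group lanes into waves; every lane in a wave can start in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on a dependency cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = implementation_waves(HARD_DEPENDENCIES)
```

With the two hard edges above this produces a first wave of L1, L2, and L4, followed by L3 and L5.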

Soft dependency / collision risk:

Recommended parallelism:

| Wave | Run in parallel |
|---|---|
| 1 | L0 design issue + L1-a catalog data + L2-a #578 RFC promotion |
| 2 | L1-b ledger-derive inference + pattern unit tests + L2-b/c (controls + watchdog) + L5-a state machine |
| 3 | L3-a/b/c after L1-a lands and Track A queue drained + L5-b oscillation detector |
| 4 | L3-d (verifier integration) + L5-c (L2 watchdog integration), which must be sequential within their own queue |
| 5 | L0 canonical matrix run on main; if PASS count ≥ 1 reproducible → SSOT close |

Positioning vs OMX ultragoal

OMX ultragoal (in docs/ultragoal.md
and plugins/oh-my-codex/skills/ultragoal/SKILL.md)
is the closest in-class prior art — a durable, repo-native multi-goal
workflow layered over Codex CLI's goals feature. It is genuinely
excellent at executing a known plan to completion. Naming where it
ends is how we name where ooo auto must do something different — not
just "more".

What OMX ultragoal actually does (sharp summary)

| Aspect | Behavior |
|---|---|
| Input | Free-text brief (--brief, --brief-file, --from-stdin) |
| Decomposition | LLM decomposes brief into G001 / G002 / … stories stored in .omx/ultragoal/goals.json |
| Durable state | .omx/ultragoal/brief.md + goals.json (plan + status + attempts + evidence) + ledger.jsonl (append-only event log) |
| Codex coupling | Aggregate mode (default): one Codex goal covers the whole run; pointer-style objective references goals.json rather than enumerating ids, so steering can add/split stories without weakening the end goal |
| Execution loop | omx ultragoal complete-goals prints a handoff; the agent calls get_goal → create_goal (only if none active) → completes the OMX story → omx ultragoal checkpoint --status complete --evidence … --codex-goal-json <get_goal snapshot> |
| Steering | Explicit-only structured mutations: add_subgoal, split_subgoal, reorder_pending, revise_pending_wording, annotate_ledger, mark_blocked_superseded. Prose ("make it easier") is rejected. Every accept / reject appends an audit entry. |
| Final quality gate | Mandatory on the final story only: targeted verification → ai-slop-cleaner on changed files → re-verification → $code-review. Clean = recommendation:APPROVE + architectStatus:CLEAR. Non-clean → record-review-blockers appends a new pending blocker-resolution story and the run continues. |
| Terminal contract | Ledger event kinds: goal_completed, goal_failed, goal_blocked, goal_review_blocked, final_review_failed, aggregate_objective_migrated, … |

This is a strong substrate. The append-only ledger, pointer-style
aggregate objective, structured-steering-with-audit, and mandatory
final quality gate are patterns we should learn from — not redo from
scratch.

What OMX ultragoal does not do (the room for ooo auto)

OMX ultragoal assumes the brief is sufficient and the LLM
decomposition is correct. Every gap below comes from those two
assumptions.

| Capability | OMX ultragoal | ooo auto target | Lane |
|---|---|---|---|
| Spec elicitation under ambiguity | none; brief is accepted as-is | bounded Socratic + ledger + tacit-knowledge digging until ambiguity ≤ 0.2; refuses to proceed otherwise | inherits from Ouroboros |
| Tacit knowledge / mental model surfacing | none | Ouroboros Socratic interview + ontologist + contrarian personas already crystallize implicit domain assumptions | inherits from Ouroboros |
| Domain-aware default AC | none; stories are free-text decomposition | L1 DomainProfile catalog injects class-specific AC (game-2d → game-loop / input / playable build / render target) into the Seed before execution | L1 |
| Completion-mode auto-branching | one quality gate shape (review-centric) for all classes | DomainProfile picks CODE_COMPLETE (library) vs PRODUCT_COMPLETE (game / app); user never has to choose | L1 |
| Long-running watchdog contract | relies on Codex's native token/time accounting; no WAIT / RETRY / UNSTUCK / CANCEL directive at the OMX layer | single RuntimeControls used by MCP / evolve / auto; runtime.watchdog.decision events replay why a run paused or died | L2 |
| Runtime acceptance evidence | quality gate = ai-slop-cleaner + re-verification + code-review APPROVE/CLEAR; heavy on review, light on actual runtime | non-test evidence is first-class: headless run logs, deterministic N-tick sim traces, render/DOM hashes; DomainProfile (L1) binds the probe per class | L3 |
| Typed terminal taxonomy | ledger event kinds exist, but no typed reason_code on the run-level result | 7-code stop_reason_code taxonomy on AutoPipelineResult; every terminal carries one | L4 (merged) |
| Result-envelope provenance | evidence is free-text in goals.json[].evidence; no defaulted_sections or assumptions[source] surface | result envelope exposes closure_mode, defaulted_sections[], assumptions[].source, stop_reason_code | L4 (merged) |
| Resilience escalation ladder | one recovery path: record-review-blockers appends a new pending story, keep going | layered ladder: unstuck-persona → reframe (ontologist) → safe-default closure → BLOCKED(reason_code); each rung bounded; falling off the bottom is always a typed terminal | L5 |
| Auto self-correction | only explicit human/agent steering directives (omx ultragoal steer …) | explicit steering and automatic correction via Track-B oscillation detector / grade-regression / fingerprinted recovery (already shipped via #928, etc.) | inherited (Track B) |

Where ultragoal validates existing ooo auto substrate

Two ultragoal patterns line up with substrate Ouroboros already has or
already plans in this SSOT — listed here as confirmation, not as new
behavior commitments. Per the scope note at the top of this issue, this
SSOT does not propose changes to Ouroboros's interview / steering /
recovery substrate; it only owns the L1–L5 lanes.

  1. Append-only durable ledger as audit-of-record. ooo auto
    already has this via Track C EventStore ([Feature] Define Run/Step/Artifact projections as the canonical harness vocabulary #946 / Agent OS: introduce typed Workflow IR for fat-harness execution planning #956). Ultragoal's
    .omx/ultragoal/ledger.jsonl pattern reinforces the existing design
    decision: keep treating EventStore as SSOT for every lifecycle event
    — including L2 watchdog decisions and L5 escalation transitions when
    those lanes land — and never compute terminal status from anywhere
    else. No behavior change implied.
  2. Quality-gate as a single bundled evidence artifact. Ultragoal's
    --quality-gate-json (aiSlopCleaner + verification +
    codeReview keys) is a clean evidence container shape. L3 (Runtime
    Acceptance Substrate) — which is in this SSOT's scope — should emit
    a comparable structured payload for the runtime probe, so
    PRODUCT_COMPLETE carries one inspectable evidence object instead
    of scattered files. Lane-internal shape choice, within L3 scope.

Patterns explicitly not adopted

Ultragoal has two further patterns that could be ported, but doing so
would change Ouroboros substrate outside this SSOT's scope. They are
listed here only so future readers know they were considered and
rejected for this issue:

  • Pointer-style aggregate Ralph handoff (Ralph references Seed +
    ledger live rather than a snapshotted AC list). Whether to adopt is
    outside L1–L5 scope; if pursued, it requires a separate design issue
    against ooo:ralph and --complete-product chaining.
  • Structured-steering mutation vocabulary (a finite set of allowed
    plan-revision kinds with evidence + audit, à la ultragoal's
    add_subgoal / split_subgoal / etc.). ooo unstuck and the typed
    recovery plan (feat(auto): persist typed recovery plans after QA failure #928) already cover mid-run revision today; replacing
    them with a finite vocabulary would be a substrate redesign, which
    the scope note at the top of this issue forbids.

One-line positioning

OMX ultragoal owns "executing a known plan to completion" under
Codex's goal feature. ooo auto owns "deciding what the plan is,
defending it against ambiguity, running it past a runtime gate, and
never dying without a typed reason", i.e. everything upstream of
execution (spec / domain / watchdog) and everything downstream of
execution (runtime acceptance / typed termination), via DomainProfile
auto-branching between CODE_COMPLETE and PRODUCT_COMPLETE.

Where the two systems agree on a pattern, prefer ultragoal's
audit-first, append-only style. Where ultragoal accepts a brief as
truth, ooo auto refuses to accept ambiguity as truth.

Double-diamond mapping

Discover           Define                Develop                Deliver
─────────          ──────                ───────                ───────
ooo interview      ooo seed              ooo run / evolve       ooo qa + L3 probe
  │                  │                     │                      │
  Ouroboros          L1 DomainProfile      L2 watchdog +          L4 envelope +
  Socratic +         default AC +          L5 escalation          typed reason_code
  tacit ledger       completion mode       ladder

ooo auto is the single entrypoint that drives the entire 4-step
diamond without the user ever choosing which step they are in.

AgentOS substrate dependencies (from #961)

Mapping each lane to the #961 track that the warden uses for triage,
plus the routing the warden has actually applied to merged L4 PRs
(#1167 / #1169 were classified as "Track B follow-up outside Track C
tier gates"):

Four of the five implementation lanes (L0 / L1 / L3 / L5) light up
existing substrate. L2 adds one new substrate family,
runtime.watchdog.cancel events, which is the only deliberately new
building block in this SSOT. See the Substrate honesty note at the top
of this issue.

Anyone asking "what can I do right now"

  1. Review and merge L4 in-flight PRs feat(auto): surface defaulted_sections in AutoPipelineResult #1146 / fix(auto): close interview on ledger-only consensus at max_rounds #1148 / feat(auto): canonical stop_reason_code for interview-layer blockers #1151
    → merged 2026-05-21 (squash to main, in that order after
    sequential rebases on AutoPipelineResult / _result()).
  2. Merge adjacent unblocker fix(persistence): sanitize Windows-reserved chars in checkpoint seed_id (fixes #1155) #1156 (Windows checkpoint sanitize)
    → merged 2026-05-21; ooo auto run phase now reachable on
    Windows.
  3. Open PR-B2 for ledger_done=False safe-default finalization
    → merged 2026-05-22 as feat(auto): safe-default closure mode + partial-unsafe blocker code (PR-B2) #1167 (safe_default closure_mode +
    interview_unsafe_gaps_remain 8th stop_reason_code).
  4. Open PR-C2 for assumptions[].source provenance promotion
    → merged 2026-05-22 as feat(auto): additive assumption_sources provenance surface (PR-C2) #1169 (AssumptionRecord +
    AutoPipelineResult.assumption_sources, additive surface).
  5. Open design issues for L0 / L1 / L2 → opened 2026-05-22
    as Meta SSOT slice: L0 — Canonical Test Harness for ooo auto acceptance #1170 / Meta SSOT slice: L1 — TaskClass Catalog (ledger-derived domain inference + default AC injection) #1171 / Meta SSOT slice: L2 — Runtime watchdog v1 (minimal, lifts #578) #1172, then redesigned to minimal-substrate v1
    (see freshness sync entries below). All three are now ready for
    their respective *-a PR slices to start in parallel.
  6. Start L0-a — minimal tests/canonical/cli-todo/ scenario +
    pytest runner skeleton. See Meta SSOT slice: L0 — Canonical Test Harness for ooo auto acceptance #1170. No CI, no replay, no budget.
  7. Start L1-a — 7-class catalog data + DomainProfile dataclass
    per-class fields + unit test per class. See Meta SSOT slice: L1 — TaskClass Catalog (ledger-derived domain inference + default AC injection) #1171.
  8. Start L2-a: RFC promotion of #578 (Agent OS: define RuntimeControls watchdog contract) to v1 minimum (single
    wall-clock timer, single cancel event, single new stop_reason_code).
    v2 expansion path documented for evidence-driven future. See Meta SSOT slice: L2 — Runtime watchdog v1 (minimal, lifts #578) #1172.
  9. Open L3 design issue — but first apply the same minimal-substrate
    audit (likely outcome: v1 ships headless_run evidence kind only;
    sim_trace / render_hash / api_smoke each as their own follow-up
    when a canonical scenario demands them).
  10. Start L5-a — plumb existing oscillation_detected Ralph signal
    into existing ooo unstuck. ~50 LoC. Does not need its own design
    issue if scope stays this small.
  11. (deferred) When L3 opens, adopt OMX ultragoal's bundled
    quality-gate JSON shape (single evidence object with named sub-keys)
    for the runtime-probe payload — only if v1 ships more than one
    evidence kind.

Anyone asking "what's blocked"

Acceptance gate (when this SSOT closes)

This SSOT closes when all of the following are true:

  • L0, L1, L2, L3, L4, L5 are each either 🟢 merged or 🟢 explicitly
    superseded by a documented alternative.
  • The L0 canonical test matrix runs end-to-end on a clean repo
    with zero human intervention for at least one canonical goal
    (e.g. ooo auto "2D kart racer").
  • The result is reproducible across two consecutive runs of the
    L0 canonical harness (run manually by the maintainer; L0 defines
    no nightly job).
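
The reproducibility condition can be stated as a small check over recorded matrix runs. This is a sketch under assumptions: each run is represented as a mapping from goal slug to pass/fail, and `matrix_gate_passes` is a hypothetical helper, not part of the harness.

```python
def matrix_gate_passes(runs: list[dict[str, bool]]) -> bool:
    """True when at least one goal passed in both of the last two runs."""
    if len(runs) < 2:
        return False  # a single run cannot demonstrate reproducibility
    previous, latest = runs[-2], runs[-1]
    return any(
        passed and previous.get(goal, False)
        for goal, passed in latest.items()
    )
```

Note that the same goal must pass in both runs; two different goals each passing once would not count under this reading of the gate.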

Until then this issue stays OPEN and serves as the living
guideline. Warden-style freshness syncs append below as lanes progress.

Freshness sync

2026-05-21 (initial post). Issue opened as living SSOT.

2026-05-21 (L4 partial 🟢). L4 Envelope v2 lane advanced from 🟡 to
🟢 partial — squash-merged #1156 (Windows checkpoint sanitize, the
adjacent prerequisite), then #1151 (stop_reason_code 7-code
taxonomy), then #1148 (interview_closure_mode ledger-only closure;
required one src/ouroboros/auto/state.py rebase conflict
resolution — both PRs added a payload.setdefault line, kept both),
then #1146 (defaulted_sections surface). Three of the five planned
L4 envelope fields landed; PR-B2 (ledger_done=False safe-default
finalization) and PR-C2 (assumptions[].source provenance) remain
deferred and unblock the L4 freeze.

2026-05-21 (Positioning revised). Original "Positioning vs OMC
autopilot / ultragoal-style flows" section was inaccurate — it
conflated OMC autopilot with OMX ultragoal. Rewrote against the
actual Yeachan-Heo/oh-my-codex ultragoal contract
(durable .omx/ultragoal/ ledger, pointer-style aggregate Codex goal,
six structured-steering mutation kinds, mandatory final
ai-slop-cleaner + code-review APPROVE/CLEAR gate).

2026-05-21 (Positioning scope-tightened). First draft of the
absorb-these-patterns subsection proposed two ultragoal patterns
(pointer-style aggregate handoff for --complete-product Ralph, and a
finite structured-steering mutation vocabulary to replace ooo unstuck
+ feat(auto): persist typed recovery plans after QA failure #928 typed
recovery) that would change Ouroboros substrate outside the L1–L5
lanes this SSOT owns. That contradicted the scope note at the top of
the issue ("ooo auto inherits Ouroboros's substrate as-is"). Removed
both from the absorb list; moved them into a new "Patterns explicitly
not adopted" subsection that documents the consideration + rejection.
Remaining absorbed patterns (append-only ledger confirmation, L3
bundled quality-gate JSON shape) are scope-clean: the first restates
existing Track C behavior, the second is a lane-internal output schema
choice.

2026-05-22 (L4 lane 🟢 complete). Squash-merged the two deferred
L4 follow-ups: #1167 (PR-B2 — safe_default closure_mode + new
interview_unsafe_gaps_remain stop_reason_code, taxonomy now 8 codes)
and #1169 (PR-C2 — AssumptionRecord frozen dataclass + additive
AutoPipelineResult.assumption_sources surface broadening from
LedgerSource.ASSUMPTION only to all three assumption-class sources:
ASSUMPTION, INFERENCE, CONSERVATIVE_DEFAULT). All five planned
L4 envelope fields are now live on main. L4 lane status moves from
🟡 partial to 🟢 complete; envelope v2 is frozen.
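
As an illustration of the PR-C2 change described above, the sketch
below shows a frozen record type and a filter that surfaces all three
assumption-class sources instead of ASSUMPTION alone. It assumes more
than the entry states: the enum string values, the USER_ANSWER member,
the section/text fields, and the collect_assumptions helper are all
hypothetical. Only the names AssumptionRecord, LedgerSource, and the
three source members come from the entry.

```python
from dataclasses import dataclass
from enum import Enum


class LedgerSource(Enum):
    # The three assumption-class sources named in the entry above.
    ASSUMPTION = "assumption"
    INFERENCE = "inference"
    CONSERVATIVE_DEFAULT = "conservative_default"
    # Hypothetical non-assumption source, included only for contrast.
    USER_ANSWER = "user_answer"


ASSUMPTION_CLASS_SOURCES = frozenset(
    {LedgerSource.ASSUMPTION, LedgerSource.INFERENCE, LedgerSource.CONSERVATIVE_DEFAULT}
)


@dataclass(frozen=True)
class AssumptionRecord:
    section: str          # hypothetical field: spec section the entry fills
    text: str             # hypothetical field: the assumed content
    source: LedgerSource  # provenance of the entry


def collect_assumptions(entries: list[AssumptionRecord]) -> tuple[AssumptionRecord, ...]:
    """Keep every entry whose source is assumption-class, not just ASSUMPTION."""
    return tuple(e for e in entries if e.source in ASSUMPTION_CLASS_SOURCES)


entries = [
    AssumptionRecord("controls", "keyboard only", LedgerSource.ASSUMPTION),
    AssumptionRecord("platform", "desktop browser", LedgerSource.INFERENCE),
    AssumptionRecord("save_data", "none", LedgerSource.CONSERVATIVE_DEFAULT),
    AssumptionRecord("genre", "kart racing", LedgerSource.USER_ANSWER),
]
surfaced = collect_assumptions(entries)
```

Freezing the dataclass means a record cannot be mutated after it is
written, which fits an additive, append-only surface.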

2026-05-22 (SSOT self-audit corrections, 5 items). A scope-and-design
audit against #961 flagged five issues in the prior draft. Fixed in
this revision:

  1. L4 status updated to 🟢 complete and the L4 lane body table
    regenerated against landed PR numbers / shas.
  2. Substrate honesty: the original "This SSOT introduces no new
    substrate" sentence at the bottom of the AgentOS-dependencies
    section conflicted with L2 RuntimeControls v1, which is
    genuinely new substrate (new EventStore event family
    runtime.watchdog.decision, new directive vocabulary). Added a
    Substrate honesty paragraph to the issue header and rewrote the
    AgentOS-dependencies section to name L2 explicitly as the single
    new-substrate lane (the other four lanes light up existing
    substrate).
  3. New L0 — Canonical Test Harness lane added as the meta-lane that
    gates every other lane's completion claim against actual canonical
    matrix behavior. Without L0, lane PRs can paper-merge while the
    integration stays broken.
  4. L1 classifier acceptance gate — L1-b cannot claim done without a
    frozen eval set (≥ 5 examples × 10 classes), ≥ 90% top-1 accuracy,
    and 100% confidence-floor escalation behavior. Added to the L1
    lane body.
    Superseded by the 2026-05-22 ledger-derive redesign (entry
    below). The classifier acceptance gate is no longer in scope.
  5. Lane dependency graph + Track A collision warning — new section
    between L5 and Positioning showing hard dependencies
    (L1 → L3, L2 → L5, L0 → everyone) and the soft collision risk
    between L3 and the active Track A verifier follow-ups
    (#1165 / #1166 / #1168). The L3 lane body now carries the explicit
    sequencing rule.

2026-05-22 (Design issues opened: #1170 L0, #1171 L1, #1172 L2).
Three design slices opened on Q00/ouroboros for the design-stage
lanes. Each issue body locks the formerly-open design questions with
recommended defaults under a Decisions awaiting maintainer triage
section so a maintainer-only triage pass can answer the remaining
4 BLOCK questions in one round (L0-2 cost ceiling, L0-4 replay
refresh ownership, plus the L1-5/L1-10 questions later retired in
the redesign below; #1172 has no remaining BLOCK questions after
verifying the runtime_control aggregate is additive to projection
v1 — informational confirmation posted on #946).

2026-05-22 (L1 self-audit: classifier → ledger-derive redesign).
Reviewer feedback on the L1 design called out that introducing a
separate Sonnet classifier with an eval set, accuracy floor, and
opt-in telemetry pipeline duplicates the work the Socratic interview
already does (structured-spec extraction into the ledger) and
violates this SSOT's own scope note ("ooo auto inherits
Ouroboros's substrate as-is"). The audit was correct. L1's design
was rewritten:

  • Removed: classifier LLM, eval set, accuracy floor (≥ 90%),
    confidence threshold knob, model-routing decision (haiku vs
    sonnet vs opus), opt-in telemetry pipeline, Q00/ouroboros-eval-data
    private dataset proposal.
  • Kept: 7-class catalog, per-class default_completion_mode /
    default_ac_template / runtime_probe_kinds, the
    domain_unmatched audit event for catalog gaps.
  • Added: derive_domain_from_ledger() pure-Python pattern matcher
    in src/ouroboros/auto/domain_inference.py (~150 LoC), small
    hook in interview_driver.py to feed disambiguation question
    candidates back into the existing ambiguity-gate loop when the
    inference is ambiguous.
  • BLOCK questions retired: L1-5 (classifier model) and L1-10
    (opt-in telemetry) are both moot under the redesign. Maintainer
    triage now needs to answer 2 BLOCK questions total (L0-2 cost,
    L0-4 replay refresh ownership).

The TL;DR and L1 lane body in this SSOT have been rewritten to
reflect the redesign. #1171 carries the full design.
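
To make the "Added" item above concrete, here is a minimal sketch of
what a pure-Python ledger pattern matcher could look like. It is not
the code in src/ouroboros/auto/domain_inference.py. The class names,
regex patterns, the DomainInference return type, and the list-of-strings
ledger shape are all assumptions made for illustration; only the
function name derive_domain_from_ledger comes from this SSOT.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical catalog excerpt: class names and patterns are illustrative only.
PATTERNS = {
    "game": re.compile(r"\b(game|racer|racing|sprite|level)\b", re.I),
    "cli_tool": re.compile(r"\b(cli|command[- ]line|flag|subcommand)\b", re.I),
    "web_api": re.compile(r"\b(rest|endpoint|http|api)\b", re.I),
}


@dataclass(frozen=True)
class DomainInference:
    task_class: Optional[str]    # set only when exactly one class matches
    candidates: tuple[str, ...]  # every matching class, sorted by name

    @property
    def ambiguous(self) -> bool:
        return len(self.candidates) > 1


def derive_domain_from_ledger(ledger_entries: list[str]) -> DomainInference:
    """Match ledger text against the catalog patterns. No LLM call involved."""
    text = " ".join(ledger_entries)
    hits = tuple(sorted(name for name, pat in PATTERNS.items() if pat.search(text)))
    return DomainInference(hits[0] if len(hits) == 1 else None, hits)


unique = derive_domain_from_ledger(["Build a 2D kart racing game", "keyboard controls"])
mixed = derive_domain_from_ledger(["Expose a REST endpoint", "ship a CLI with a --verbose flag"])
nothing = derive_domain_from_ledger(["Write a haiku about autumn"])
```

In this sketch a unique hit yields a task class directly, multiple
hits would supply the disambiguation question candidates fed back to
the ambiguity gate, and an empty candidate tuple corresponds to the
case where a domain_unmatched audit event would be recorded.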

2026-05-22 (Minimal-substrate audit: L0 / L2 / L5 redesigned, L3 flagged).
Same pattern that produced the L1 classifier mistake was found in L0
(nightly CI + replay layer + cost budget + ownership policy) and L2
(3-timer config + 4-directive vocabulary + subscriber pattern), with
suspicions in L5 (new state-machine substrate) and L3 (4-probe-kind
substrate). All redesigned through the minimal-substrate lens
("add substrate only when evidence demands it"):

  • L0 (#1170): rewritten to a manual pytest harness with 4
    scenario fixtures and no CI/replay/budget/ownership infrastructure.
    ~330 LoC across 4 sub-PRs. The 2 BLOCK questions (L0-2 cost ceiling,
    L0-4 replay refresh ownership) are now both retired — they were
    decisions about substrate that no longer exists.
  • L2 (#1172): rewritten to a single wall-clock timer + a single
    runtime.watchdog.cancel event + a single new watchdog_wall_clock_exceeded
    stop_reason_code. ~150 LoC across 3 sub-PRs. v2 expansion path
    (3 timers, 4 directives, subscriber pattern, ad-hoc timeout
    deprecation) documented inside #1172 as evidence-driven follow-ups.
  • L5 (#1157 lane body): rewritten to plumb the existing
    oscillation_detected Ralph signal into the existing ooo unstuck,
    plus one new typed unstuck_exhausted stop_reason_code. ~100 LoC
    across 2 sub-PRs. No new state-machine, no new oscillation-detector
    substrate.
  • L3 (no design issue yet): a minimal-substrate audit note has
    been added to the L3 lane body so the same scope-tightening happens
    before the L3 design issue is drafted (likely outcome: v1 ships
    headless_run evidence kind only; sim/render/api each as their own
    follow-up when a canonical scenario demands them).
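
The L2 item in the list above (one wall-clock timer, one
runtime.watchdog.cancel event, one new stop_reason_code) is small
enough to sketch end to end. The following is a minimal illustration
under stated assumptions, not the design in #1172: the class name
WallClockWatchdog, the check() method, the injectable clock, and the
event payload keys are hypothetical. Only the event name and the
watchdog_wall_clock_exceeded code come from this SSOT.

```python
import time
from typing import Callable, Optional


class WallClockWatchdog:
    """Single wall-clock timer; emits at most a cancel event, no directives."""

    def __init__(self, limit_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.limit_seconds = limit_seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self._started_at = clock()

    def check(self) -> Optional[dict]:
        """Return a cancel event once the limit is reached, else None."""
        elapsed = self._clock() - self._started_at
        if elapsed < self.limit_seconds:
            return None
        return {
            "type": "runtime.watchdog.cancel",
            "stop_reason_code": "watchdog_wall_clock_exceeded",
            "elapsed_seconds": elapsed,  # hypothetical payload key
            "limit_seconds": self.limit_seconds,  # hypothetical payload key
        }


# Fake clock: start at t=0, first check at t=10, second check at t=61.
ticks = iter([0.0, 10.0, 61.0])
watchdog = WallClockWatchdog(limit_seconds=60.0, clock=lambda: next(ticks))
first = watchdog.check()
second = watchdog.check()
```

Recording the elapsed time in the event is what lets the EventStore
answer why a run was cancelled at minute X; time.monotonic is used
rather than time.time so system clock adjustments cannot trigger or
suppress the cancel.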

Net result: 0 BLOCK questions across all open design issues; L0-a,
L1-a, L2-a, L5-a all ready to start in parallel; total estimated
implementation across L0/L1/L2/L5 minimal v1 is ~730 LoC versus the
~2,000+ LoC pre-audit plan.
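
For the L5 item in the list above, the plumbing amounts to routing one
existing signal into one existing recovery path and returning one
typed code when that path runs out. A rough sketch follows; it assumes
a retry budget, which this SSOT does not specify, and the function
name route_ralph_signal and its signature are hypothetical. Only
oscillation_detected and unstuck_exhausted are names from the text.

```python
from typing import Callable, Optional


def route_ralph_signal(
    signal: str,
    run_unstuck: Callable[[], None],
    attempts_left: int,
) -> tuple[Optional[str], int]:
    """Return (stop_reason_code or None, remaining unstuck attempts)."""
    if signal != "oscillation_detected":
        return None, attempts_left       # unrelated signal: nothing to do
    if attempts_left <= 0:
        return "unstuck_exhausted", 0    # typed stop instead of looping forever
    run_unstuck()                        # stands in for invoking ooo unstuck
    return None, attempts_left - 1


calls: list[str] = []
code_a, left_a = route_ralph_signal("oscillation_detected", lambda: calls.append("unstuck"), 1)
code_b, left_b = route_ralph_signal("oscillation_detected", lambda: calls.append("unstuck"), left_a)
code_c, left_c = route_ralph_signal("progress", lambda: calls.append("unstuck"), 1)
```

Nothing here is a state machine or a new detector: the function only
forwards a signal that Ralph already produces and counts attempts.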

The meta-lesson: Ouroboros's minimal-substrate principle is "add
substrate only when evidence demands it." Twice in one session I
defaulted to standard-engineering patterns (ML classifier, CI
infrastructure, state-machine substrate) and twice the maintainer
caught it. The pattern to internalize: if a lane body lists
infrastructure that solves a class of problems we have not yet
observed, that infrastructure does not belong in v1.
