Meta SSOT: ooo auto Vision — Autonomous Completion Engine
Living SSOT for ooo auto's direction and improvement plan. Stays OPEN
until the engine reliably completes the canonical end-to-end test
(e.g. ooo auto "make me a 2D kart racing game") without human
intervention. Sibling to #961 (AgentOS roadmap sequencing).
Scope note. This SSOT is not about redesigning Ouroboros's Socratic
interview / tacit-knowledge substrate — ooo auto inherits that as-is.
What this SSOT owns is everything around the interview: domain-aware
spec inflation, long-running resilience, runtime acceptance, and a
typed completion contract.
Substrate honesty. Four of the five implementation lanes (L0, L1, L3,
L5) extend existing substrate. One lane — L2 watchdog v1 — adds
exactly one new EventStore event family (runtime.watchdog.cancel)
as the minimum needed to record why a run was cancelled at minute X.
Earlier drafts proposed a richer 3-timer / 4-directive vocabulary; per
Ouroboros's minimal-substrate principle, those are deferred to a v2
expansion path triggered only by evidence of stalls that wall-clock
alone cannot catch.
SSOT cleanup note (2026-05-23 KST). This issue was refreshed after the #1173/#1174/#1175/#1178/#1181/#1188/#1189/#1190/#1191 merge train. Current implementation terms are TaskClass / TaskClassProfile and runtime.watchdog.cancel; older “DomainProfile Catalog”, runtime.watchdog.decision, and directive-vocabulary wording is historical unless a section explicitly says it is describing a rejected prior draft. Open cleanup PRs #1194/#1195/#1196 are the remaining known Track-B/ooo auto follow-ups; no new issue is needed for those.
#1174 merged the canonical harness skeleton; #1191 merged opt-in OUROBOROS_RUN_CANONICAL=1 live wiring + L1 catalog cross-validation. Remaining work is scenario/evidence cleanup only, especially #1195; no nightly/replay/cost substrate unless later evidence demands it.
#1173 merged TaskClass / TaskClassProfile catalog data; #1188 merged Seed AC injection + active task class envelope. “DomainProfile Catalog” wording below is historical/stale unless explicitly referring to the old recovery-hint concept. Remaining MCP metadata cleanup is #1196.
#1178 merged RuntimeControls + wall-clock Watchdog; #1189 merged production AutoPipeline/CLI/MCP consumption via runtime.watchdog.cancel and typed stop reason. Do not revive the old runtime.watchdog.decision / directive-vocabulary design. Remaining cleanup is #1194.
All five planned envelope fields landed (defaulted_sections, interview_closure_mode, stop_reason_code, assumption_sources, safe_default closure). Lane frozen.
L5 — Long-running Resilience (minimal)
extends ooo unstuck + Ralph oscillation detection
🟢 L5-a merged
#1175 merged the oscillation_detected → UNSTUCK_LATERAL routing slice. Further resilience substrate remains evidence-driven, not prebuilt.
North-star
ooo auto "<one vague line>" — a single MCP invocation — drives the work to
one of three typed terminal states without further user input:
1. CODE_COMPLETE — passing tests + lint + (for libraries) a usable
API surface, evidence captured.
2. PRODUCT_COMPLETE — code-complete plus the domain-class
runtime-acceptance probe passes (headless run / sim / render hash).
3. BLOCKED(reason_code) — resumable auto_session_id, classified
stop reason from the 7-code taxonomy, no fabrication, no silent
abandonment, no untyped failure path.
Branching between (1) and (2) is decided by DomainProfile
auto-classification (L1). The user never has to know which mode applied.
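The three terminal states above can be pictured as a small typed result. This is a minimal sketch, not the real envelope: `TerminalState`, `AutoResultSketch`, and `resolve_terminal` are hypothetical names, and the only rule taken from the text is that BLOCKED always carries a typed reason and that PRODUCT_COMPLETE requires the runtime probe to pass on top of code completion.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class TerminalState(Enum):
    CODE_COMPLETE = "code_complete"
    PRODUCT_COMPLETE = "product_complete"
    BLOCKED = "blocked"


@dataclass(frozen=True)
class AutoResultSketch:
    """Hypothetical stand-in for the real result envelope."""

    state: TerminalState
    auto_session_id: str
    stop_reason_code: str | None = None

    def __post_init__(self) -> None:
        # No untyped failure path: BLOCKED without a reason is rejected.
        if self.state is TerminalState.BLOCKED and not self.stop_reason_code:
            raise ValueError("BLOCKED requires a stop_reason_code")


def resolve_terminal(completion_mode: str, code_ok: bool, probe_ok: bool) -> TerminalState:
    """Pick a terminal state; completion_mode is assumed to come from L1."""
    if not code_ok:
        return TerminalState.BLOCKED
    if completion_mode == "code_complete":
        return TerminalState.CODE_COMPLETE
    # Product mode: code completion alone is not enough.
    return TerminalState.PRODUCT_COMPLETE if probe_ok else TerminalState.BLOCKED
```

The caller never passes the mode by hand in this picture; it would be supplied by whatever L1 inferred.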
Why this is bigger than "Seed → Run"
skills/auto/SKILL.md currently optimizes: Interview → A-grade Seed → Run handoff → (optional) Ralph.
That's the engineering ceiling, not the product. The product target
is: deliver a verifiably-working artifact from a vague one-line goal,
without the user pre-thinking the spec, the verification, or the
recovery.
Three invariants have to hold for that:
The system must infer the domain without being told.
The system must probe runtime behavior, not just unit tests.
The system must refuse to die quietly on long-running work.
L1–L5 below are exactly the missing pieces that turn the current
Seed→Run pipeline into a completion engine.
Success conditions (testable, frozen)
We close this SSOT when, against a fixed canonical test matrix
(ooo auto "CLI todo manager", "2D kart racer", "webhook receiver service", "refactor src/foo into vertical slices"), all of the
following hold without human intervention in at least one of the goals,
reproducible across two consecutive runs on a clean repo:
| # | Condition | Today | Target |
|---|---|---|---|
| 1 | Interview closes with closure_mode ∈ {mutual_agreement, ledger_only, safe_default} | partial | 100% (genuine_deadlock = 0) |
| 2 | Seed AC reflects the inferred domain class (e.g. game → game-loop / input / render-target / playable build) | none | 100% via L1 |
| 3 | Long-running run never stalls without a runtime.watchdog.decision Directive | ad-hoc | 100% via L2 |
| 4 | Evidence bundle contains runtime proof (sim trace / headless run / render hash), not only unit tests | unit-tests only | 100% via L3 |
| 5 | Result envelope exposes closure_mode, defaulted_sections, assumptions[source], stop_reason_code (7-code taxonomy) | partial | 100% via L4 |
| 6 | Sessions resumable across days; same goal lands on same lineage | | |
| | On stall / regression, escalation ladder (unstuck → reframe → safe-default → BLOCKED) runs to a terminal state | partial | 100% via L5 |
Lanes
L0 — Canonical acceptance test (minimal)
The acceptance condition in this SSOT (one canonical goal end-to-end,
reproducible 2x) needs something concrete to point at. L0 provides
the smallest possible thing: a tests/canonical/ directory with one
fixture per canonical goal and a pytest entry point. The maintainer
runs it manually when assessing SSOT close-readiness — no CI obligation.
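A minimal sketch of the discovery and opt-in logic such a pytest entry point could build on. It assumes each scenario directory holds a `goal.txt` and that live runs are gated by `OUROBOROS_RUN_CANONICAL=1`; the helper names `canonical_enabled` and `discover_scenarios` are hypothetical, and the actual MCP invocation is deliberately left out.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Mapping


def canonical_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Live canonical runs are opt-in; anything other than "1" means skip."""
    return env.get("OUROBOROS_RUN_CANONICAL") == "1"


def discover_scenarios(root: Path) -> list[tuple[str, str]]:
    """Return (slug, goal) pairs for every scenario directory with a goal.txt."""
    scenarios: list[tuple[str, str]] = []
    if not root.is_dir():
        return scenarios
    for entry in sorted(root.iterdir()):
        goal_file = entry / "goal.txt"
        if entry.is_dir() and goal_file.is_file():
            goal = goal_file.read_text(encoding="utf-8").strip()
            scenarios.append((entry.name, goal))
    return scenarios
```

A test module could feed `discover_scenarios(Path("tests/canonical"))` into `pytest.mark.parametrize` and skip the whole module when `canonical_enabled()` is false, which keeps the harness manual and free of any CI obligation.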
Out of scope (deliberately): nightly CI workflow, recorded-replay
layer, hermetic-vs-live divergence detection, monthly cost budget,
refresh-rotation ownership policy, per-PR fast-subset CI. All of those
were operational sludge added by reflex; each gets opened as a
follow-up only if/when evidence demands it. See #1170 Self-audit note.
L1 — DomainProfile Catalog
Promote #849 Phase-3 DomainProfile from "typed recovery hint" to a
first-class domain taxonomy that drives default AC, completion
mode, and runtime-probe binding — without a separate LLM classifier.
Each class declares default_completion_mode, default_ac_template
(plain tuple[str, ...] matching Seed.acceptance_criteria), runtime_probe_kinds (bound by L3), and the existing #849 safe_defaults.
Domain inference is ledger-derived, not LLM-classified. The
Socratic interview already extracts structured SeedDraftLedger
entries (actors, inputs, outputs, runtime_context, …) and standardizes them toward canonical vocabulary. L1-b is a pure-Python derive_domain_from_ledger() function in src/ouroboros/auto/domain_inference.py that pattern-matches those
entries against per-class predicates and returns one of:
single match — exactly one class predicate fires.
ambiguous — multiple classes fire (e.g. CLI + WEB-SERVICE on a
CLI that also exposes HTTP). The next interview round gets a
disambiguation-question candidate appended (small hook in interview_driver.py); the existing ambiguity-gate loop drives
resolution. No new escalation system.
unmatched — no predicate fires. Falls to library (narrowest
completion gate, lowest blast radius) and emits a domain_unmatched
EventStore event for maintainer review.
Zero new LLM calls, zero new external API surface, zero new
substrate. Adding a new class to the catalog later is a ~10-LoC PR
(pattern function + unit test), not an eval-set re-curation.
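The three outcomes above can be sketched as a pure-Python matcher. This is an illustration under stated assumptions, not the contents of domain_inference.py: the ledger is modelled as a plain mapping of section name to strings, the predicates and their keywords are invented for the example, and `derive_domain_sketch` and `DomainInference` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Mapping, Sequence

# Assumption: a ledger is a mapping of section name -> free-text entries.
Ledger = Mapping[str, Sequence[str]]


def _mentions(ledger: Ledger, section: str, *needles: str) -> bool:
    text = " ".join(ledger.get(section, ())).lower()
    return any(needle in text for needle in needles)


# Hypothetical per-class predicates; adding a class means adding one entry.
PREDICATES: dict[str, Callable[[Ledger], bool]] = {
    "cli": lambda led: _mentions(led, "inputs", "argv", "command line"),
    "web-service": lambda led: _mentions(led, "inputs", "http", "webhook"),
    "game-2d": lambda led: _mentions(led, "runtime_context", "game loop", "render"),
}


@dataclass(frozen=True)
class DomainInference:
    outcome: str  # "single" | "ambiguous" | "unmatched"
    classes: tuple[str, ...]


def derive_domain_sketch(ledger: Ledger) -> DomainInference:
    fired = tuple(name for name, pred in PREDICATES.items() if pred(ledger))
    if len(fired) == 1:
        return DomainInference("single", fired)
    if len(fired) > 1:
        # The caller would append a disambiguation question here.
        return DomainInference("ambiguous", fired)
    # No predicate fired: fall back to the narrowest completion gate.
    return DomainInference("unmatched", ("library",))
```

No model call appears anywhere in the path, which is the point of the ledger-derived design: the function is deterministic and can be unit-tested per class.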
Why not a classifier. An earlier draft proposed a separate Sonnet
classifier with an eval set + accuracy floor + opt-in telemetry. That
duplicated the inference the interview is already doing and violated
this SSOT's own scope note ("ooo auto inherits Ouroboros's
substrate as-is"). The redesign is documented in the Freshness sync
section under the 2026-05-22 self-audit entry.
L2 — Watchdog v1 (minimal new substrate)
The smallest watchdog that satisfies "long-running run never stalls without a recorded reason": one timer per session (wall-clock), one event (runtime.watchdog.cancel), one new stop_reason_code (watchdog_wall_clock_exceeded). When now exceeds the session start time plus session_wall_clock_seconds, the watchdog fires, the EventStore records it, and the pipeline transitions to BLOCKED with the typed code. Timer state is implicit in AutoPipelineState.session_started_at, so resume semantics work without separate serialization.
Substrate addition: exactly one new EventStore event family
(runtime.watchdog.cancel) and one new aggregate_type = "runtime_control". Confirmed additive to projection v1 by reading src/ouroboros/persistence/schema.py (informational confirmation posted on #946). The other four lanes in this SSOT light up existing substrate; L2 adds this single family.
v2 expansion (deferred, evidence-driven): richer 3-timer config
(idle / no_progress / safety), 4-directive vocabulary
(WAIT / RETRY / UNSTUCK / CANCEL), material_progress_events vs activity_events split, subscriber pattern for cooperative cancel,
ad-hoc-timeout deprecation across MCP / Ralph / evolve. Each opens as
its own slice only when a real-world stall slips past v1 wall-clock.
Design tracked in #1172 (this issue lifts #578 to v1 minimum + documents v2 expansion path).
L3 — Runtime Acceptance Substrate
Extend Track A fat-harness (#920 / #978) evidence schema to legally
accept non-test evidence:
screenshot / DOM-hash / render-hash for UI classes
API smoke probes (request → response shape match) for service classes
DomainProfile (L1) binds each class to one or more probes. This is the
substrate change that makes PRODUCT_COMPLETE mean "the thing
actually runs", not "tests pass".
Minimal-substrate audit pending. Per the L0/L1/L2/L5 self-audits
(2026-05-22), L3 has not yet been re-examined through the same
minimal-substrate lens. Before opening the L3 design issue, ask:
which evidence kinds does v1 actually need, and which are
speculative? Likely v1 collapses to headless_run only (capture
stdout/exit_code/duration); sim_trace, render_hash, and api_smoke
each open as their own follow-up only when a canonical scenario
demands them. This audit happens when L3's design-issue PR is drafted,
not in this lane body.
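If v1 does collapse to headless_run only, the probe could be as small as the sketch below. It assumes evidence means exactly the three captured values named above (stdout, exit code, duration) plus a timeout flag; `HeadlessRunEvidence` and `run_headless` are hypothetical names, not the project's actual probe API.

```python
from __future__ import annotations

import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HeadlessRunEvidence:
    command: tuple[str, ...]
    exit_code: int | None  # None when the run timed out
    stdout: str
    duration_seconds: float
    timed_out: bool


def run_headless(command: Sequence[str], timeout_seconds: float = 30.0) -> HeadlessRunEvidence:
    """Run a command without a display or stdin and capture runtime evidence."""
    started = time.monotonic()
    try:
        completed = subprocess.run(
            list(command),
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
            check=False,  # a non-zero exit is evidence, not an exception
        )
        exit_code, stdout, timed_out = completed.returncode, completed.stdout, False
    except subprocess.TimeoutExpired as exc:
        raw = exc.stdout or ""
        # Partial output may arrive as bytes even when text=True was requested.
        stdout = raw.decode(errors="replace") if isinstance(raw, bytes) else raw
        exit_code, timed_out = None, True
    duration = time.monotonic() - started
    return HeadlessRunEvidence(tuple(command), exit_code, stdout, duration, timed_out)


if __name__ == "__main__":
    print(run_headless([sys.executable, "-c", "print('ok')"]))
```

A completion gate would then treat `exit_code == 0 and not timed_out` as the pass condition and attach the whole record to the evidence bundle.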
Track A collision warning. L3's verifier-integration slice (L3-d)
modifies the same src/ouroboros/orchestrator/ evidence-handling
surface that Track A verifier follow-ups #1165 / #1166 / #1168 are
currently active in. Sequencing rule:
Open L3-a (evidence-kind taxonomy) only after Track A queue is
drained — taxonomy is pure-additive and safe regardless, but it
becomes the authoritative shape downstream verifier work conforms to.
Open L3-d (verifier integration) last in the L3 sequence so
it conflicts with at most a single fresh main.
L4 — Auto Envelope v2
A single frozen v2 contract consumed identically by CLI, MCP, and any
future UI. 🟢 Complete — all five planned envelope fields landed:
Terminal taxonomy invariant already enforced by L4's 8-code envelope.
What L5 v1 actually adds (the only missing link):
L5-a — when Ralph emits oscillation_detected during a single ooo auto session, automatically invoke ooo unstuck once before
bailing. ~50 LoC + integration test.
L5-b — when ooo unstuck exhausts its budget (default: 1 attempt),
emit a typed stop_reason_code="unstuck_exhausted" (new 10th code) so
the result envelope distinguishes "tried unstuck and failed" from "never tried". ~50 LoC.
Out of scope (v1): new escalation-ladder state machine, new
oscillation-detector substrate, budget unification with L2, reframe
(ontologist) as a separate stage. Each can be added later if/when
evidence shows the v1 plumbing is too thin.
Total: ~100 LoC across 2 sub-PRs. Earlier draft was ~600 LoC of new
state-machine substrate that duplicated existing detection signals.
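The L5-a / L5-b plumbing amounts to one bounded retry and one typed fallthrough. A minimal sketch, assuming the unstuck step can be modelled as a callable that reports whether it recovered the run; `handle_oscillation` is a hypothetical name, and only the `unstuck_exhausted` code and the default budget of one attempt come from the text.

```python
from __future__ import annotations

from typing import Callable


def handle_oscillation(run_unstuck: Callable[[], bool], unstuck_budget: int = 1) -> str | None:
    """Invoke unstuck up to the budget after an oscillation signal.

    Returns None when an attempt recovered the run, otherwise the typed
    stop reason that distinguishes "tried and failed" from "never tried".
    """
    for _ in range(unstuck_budget):
        if run_unstuck():
            return None
    return "unstuck_exhausted"
```

There is no state machine here by design: the function is a straight-line bridge between two existing signals.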
Lane dependency graph
The lanes are not independent — three real dependencies and one
collision risk constrain the ordering. Reading this graph before
opening a lane PR avoids both paper completion (L0 invariant) and
the Track A collision (L3 warning above).
Hard dependencies (a PR opened upstream of an arrow cannot claim
done until the downstream consumer is at least design-locked):
L1 → L3. L3 evidence probes are bound per DomainProfile class.
L1-a (catalog data) must land before L3-c (probe binding).
L2 → L5. L5 escalation ladder hooks into L2's UNSTUCK / CANCEL directives. L5-a (state machine) can start
before L2 lands; L5-c (watchdog integration) cannot.
L0 → everyone. Every other lane's "complete" claim is
validated by the canonical matrix runner. L0 doesn't block lane implementation — it blocks lane completion claims.
L1-b ledger-derive inference + pattern unit tests + L2-b/c (controls + watchdog) + L5-a state machine
3
L3-a/b/c after L1-a lands and Track A queue drained + L5-b oscillation detector
4
L3-d (verifier integration) + L5-c (L2 watchdog integration) — must be sequential within their own queue
5
L0 canonical matrix run on main; if PASS count ≥ 1 reproducible → SSOT close
Positioning vs OMX ultragoal
OMX ultragoal (in docs/ultragoal.md
and plugins/oh-my-codex/skills/ultragoal/SKILL.md)
is the closest in-class prior art — a durable, repo-native multi-goal
workflow layered over Codex CLI's goals feature. It is genuinely
excellent at executing a known plan to completion. Naming where it
ends is how we name where ooo auto must do something different — not
just "more".
Aggregate mode (default): one Codex goal covers the whole run; pointer-style objective references goals.json rather than enumerating ids, so steering can add/split stories without weakening the end goal
Execution loop
omx ultragoal complete-goals prints a handoff; the agent calls get_goal → create_goal (only if none active) → completes the OMX story → omx ultragoal checkpoint --status complete --evidence … --codex-goal-json <get_goal snapshot>
Steering
Explicit-only structured mutations — add_subgoal, split_subgoal, reorder_pending, revise_pending_wording, annotate_ledger, mark_blocked_superseded. Prose ("make it easier") is rejected. Every accept / reject appends an audit entry.
Final quality gate
Mandatory on the final story only: targeted verification → ai-slop-cleaner on changed files → re-verification → $code-review. Clean = recommendation:APPROVE + architectStatus:CLEAR. Non-clean → record-review-blockers appends a new pending blocker-resolution story and the run continues.
This is a strong substrate. The append-only ledger, pointer-style
aggregate objective, structured-steering-with-audit, and mandatory
final quality gate are patterns we should learn from — not redo from
scratch.
What OMX ultragoal does not do (the room for ooo auto)
OMX ultragoal assumes the brief is sufficient and the LLM
decomposition is correct. Every gap below comes from those two
assumptions.
| Capability | OMX ultragoal | ooo auto target | Lane |
|---|---|---|---|
| Spec elicitation under ambiguity | none — brief is accepted as-is | bounded Socratic + ledger + tacit-knowledge digging until ambiguity ≤ 0.2; refuses to proceed otherwise | |
| | | L1 DomainProfile catalog injects class-specific AC (game-2d → game-loop / input / playable build / render target) into the Seed before execution | L1 |
| Completion-mode auto-branching | one quality gate shape (review-centric) for all classes | DomainProfile picks CODE_COMPLETE (library) vs PRODUCT_COMPLETE (game / app); user never has to choose | L1 |
| Long-running watchdog contract | relies on Codex's native token/time accounting; no WAIT / RETRY / UNSTUCK / CANCEL directive at the OMX layer | single RuntimeControls used by MCP / evolve / auto; runtime.watchdog.decision events replay why a run paused or died | L2 |
| Runtime acceptance evidence | quality gate = ai-slop-cleaner + re-verification + code-review APPROVE/CLEAR — heavy on review, light on actual runtime | non-test evidence is first-class: headless run logs, deterministic N-tick sim traces, render/DOM hashes; DomainProfile (L1) binds the probe per class | L3 |
| Typed terminal taxonomy | ledger event kinds exist, but no typed reason_code on the run-level result | 7-code stop_reason_code taxonomy on AutoPipelineResult; every terminal carries one | L4 (merged) |
| Result-envelope provenance | evidence is free-text in goals.json[].evidence; no defaulted_sections or assumptions[source] surface | result envelope exposes closure_mode, defaulted_sections[], assumptions[].source, stop_reason_code | L4 (merged) |
| Resilience escalation ladder | one recovery path: record-review-blockers appends a new pending story, keep going | layered ladder — unstuck-persona → reframe (ontologist) → safe-default closure → BLOCKED(reason_code); each rung bounded; falling off the bottom is always a typed terminal | L5 |
| Auto self-correction | only explicit human/agent steering directives (omx ultragoal steer …) | explicit steering and automatic correction via Track-B oscillation detector / grade-regression / fingerprinted recovery (already shipped via #928, etc.) | inherited (Track B) |
Where ultragoal validates existing ooo auto substrate
Two ultragoal patterns line up with substrate Ouroboros already has or
already plans in this SSOT — listed here as confirmation, not as new
behavior commitments. Per the scope note at the top of this issue, this
SSOT does not propose changes to Ouroboros's interview / steering /
recovery substrate; it only owns the L1–L5 lanes.
Quality-gate as a single bundled evidence artifact. Ultragoal's --quality-gate-json (aiSlopCleaner + verification + codeReview keys) is a clean evidence container shape. L3 (Runtime
Acceptance Substrate) — which is in this SSOT's scope — should emit
a comparable structured payload for the runtime probe, so PRODUCT_COMPLETE carries one inspectable evidence object instead
of scattered files. Lane-internal shape choice, within L3 scope.
Patterns explicitly not adopted
Ultragoal has two further patterns that could be ported, but doing so
would change Ouroboros substrate outside this SSOT's scope. They are
listed here only so future readers know they were considered and
rejected for this issue:
Pointer-style aggregate Ralph handoff (Ralph references Seed +
ledger live rather than a snapshotted AC list). Whether to adopt is
outside L1–L5 scope; if pursued, it requires a separate design issue
against ooo:ralph and --complete-product chaining.
Structured-steering mutation vocabulary (a finite set of allowed
plan-revision kinds with evidence + audit, à la ultragoal's add_subgoal / split_subgoal / etc.). ooo unstuck and the typed
recovery plan (#928) already cover mid-run revision today; replacing
them with a finite vocabulary would be a substrate redesign, which
the scope note at the top of this issue forbids.
One-line positioning
OMX ultragoal owns "executing a known plan to completion" under
Codex's goal feature. ooo auto owns "deciding what the plan is,
defending it against ambiguity, running it past a runtime gate, and
never dying without a typed reason" — i.e. everything upstream of
execution (spec / domain / watchdog) and everything downstream of
execution (runtime acceptance / typed termination), via DomainProfile
auto-branching between CODE_COMPLETE and PRODUCT_COMPLETE.
Where the two systems agree on a pattern, prefer ultragoal's
audit-first, append-only style. Where ultragoal accepts a brief as
truth, ooo auto refuses to accept ambiguity as truth.
Mapping each lane to the #961 track that the warden uses for triage,
plus the routing the warden has actually applied to merged L4 PRs
(#1167 / #1169 were classified as "Track B follow-up outside Track C
tier gates"):
Outside #961 tracks: L0 is a test-harness lane with no
Track parent; classify as a peer follow-up. L2 anchors on #578, which #961's "Scope of the Tier system" paragraph
explicitly excludes from Track C tier gating — #578 keeps its
own design-issue lifecycle. L1 extends #849, a merged
Track B Phase-3 artifact, so its follow-ups inherit Track B
routing.
Four of the five implementation lanes (L0 / L1 / L3 / L5) light up
existing substrate. L2 adds one new substrate family, the runtime.watchdog.cancel event family,
which is the only deliberately new building block in this SSOT. See
the Substrate honesty note at the top of this issue.
Open L3 design issue — but first apply the same minimal-substrate
audit (likely outcome: v1 ships headless_run evidence kind only;
sim_trace / render_hash / api_smoke each as their own follow-up
when a canonical scenario demands them).
Start L5-a — plumb existing oscillation_detected Ralph signal
into existing ooo unstuck. ~50 LoC. Does not need its own design
issue if scope stays this small.
(deferred) When L3 opens, adopt OMX ultragoal's bundled
quality-gate JSON shape (single evidence object with named sub-keys)
for the runtime-probe payload — only if v1 ships more than one
evidence kind.
This SSOT closes when all of the following are true:
L0, L1, L2, L3, L4, L5 are each either 🟢 merged or 🟢 explicitly
superseded by a documented alternative.
The L0 canonical test matrix runs end-to-end on a clean repo
with zero human intervention for at least one canonical goal
(e.g. ooo auto "2D kart racer").
The result is reproducible across two consecutive runs of the
L0 canonical harness on a clean repo.
Until then this issue stays OPEN and serves as the living
guideline. Warden-style freshness syncs append below as lanes progress.
Freshness sync
2026-05-21 (initial post). Issue opened as living SSOT.
2026-05-21 (L4 partial 🟢). L4 Envelope v2 lane advanced from 🟡 to
🟢 partial — squash-merged #1156 (Windows checkpoint sanitize, the
adjacent prerequisite), then #1151 (stop_reason_code 7-code
taxonomy), then #1148 (interview_closure_mode ledger-only closure;
required one src/ouroboros/auto/state.py rebase conflict
resolution — both PRs added a payload.setdefault line, kept both),
then #1146 (defaulted_sections surface). Three of the five planned
L4 envelope fields landed; PR-B2 (ledger_done=False safe-default
finalization) and PR-C2 (assumptions[].source provenance) remain
deferred and unblock the L4 freeze.
2026-05-21 (Positioning revised). Original "Positioning vs OMC
autopilot / ultragoal-style flows" section was inaccurate — it
conflated OMC autopilot with OMX ultragoal. Rewrote against the
actual Yeachan-Heo/oh-my-codex ultragoal contract
(durable .omx/ultragoal/ ledger, pointer-style aggregate Codex goal,
six structured-steering mutation kinds, mandatory final
ai-slop-cleaner + code-review APPROVE/CLEAR gate).
2026-05-21 (Positioning scope-tightened). First draft of the
absorb-these-patterns subsection proposed two ultragoal patterns —
pointer-style aggregate handoff for --complete-product Ralph, and a
finite structured-steering mutation vocabulary to replace ooo unstuck
and the #928 typed recovery — that would change Ouroboros substrate outside
the L1–L5 lanes this SSOT owns. That contradicted the scope note at
the top of the issue ("ooo auto inherits Ouroboros's substrate as-is").
Removed both from the absorb list; moved them into a new "Patterns
explicitly not adopted" subsection that documents the consideration and
rejection. Remaining absorbed patterns (append-only ledger
confirmation, L3 bundled quality-gate JSON shape) are scope-clean —
the first restates existing Track C behavior, the second is a
lane-internal output schema choice.
2026-05-22 (L4 lane 🟢 complete). Squash-merged the two deferred
L4 follow-ups: #1167 (PR-B2 — safe_default closure_mode + new interview_unsafe_gaps_remain stop_reason_code, taxonomy now 8 codes)
and #1169 (PR-C2 — AssumptionRecord frozen dataclass + additive AutoPipelineResult.assumption_sources surface broadening from LedgerSource.ASSUMPTION only to all three assumption-class sources: ASSUMPTION, INFERENCE, CONSERVATIVE_DEFAULT). All five planned
L4 envelope fields are now live on main. L4 lane status moves from
🟡 partial to 🟢 complete; envelope v2 is frozen.
2026-05-22 (SSOT self-audit corrections, 5 items). A scope-and-design
audit against #961 flagged five issues in the prior draft. Fixed in
this revision:
L4 status updated to 🟢 complete and the L4 lane body table
regenerated against landed PR numbers / shas.
Substrate honesty — the original "This SSOT introduces no new
substrate" sentence at the bottom of the AgentOS-dependencies
section conflicted with L2 RuntimeControls v1, which is
genuinely new substrate (new EventStore event family runtime.watchdog.decision, new directive vocabulary). Added a Substrate honesty paragraph to the issue header and rewrote the
AgentOS-dependencies section to name L2 explicitly as the single
new-substrate lane (the other four lanes light up existing
substrate).
New L0 — Canonical Test Harness lane added as the meta-lane that
gates every other lane's completion claim against actual canonical
matrix behavior. Without L0, lane PRs can paper-merge while the
integration stays broken.
L1 classifier acceptance gate — L1-b cannot claim done without a
frozen eval set (≥ 5 examples × 10 classes), ≥ 90% top-1 accuracy,
and 100% confidence-floor escalation behavior. Added to the L1
lane body. Superseded by the 2026-05-22 ledger-derive
redesign (entry below). The classifier acceptance gate is no
longer in scope.
2026-05-22 (Design issues opened: #1170 L0, #1171 L1, #1172 L2).
Three design slices opened on Q00/ouroboros for the design-stage
lanes. Each issue body locks the formerly-open design questions with
recommended defaults under a Decisions awaiting maintainer triage
section so a maintainer-only triage pass can answer the remaining
4 BLOCK questions in one round (L0-2 cost ceiling, L0-4 replay
refresh ownership, plus the L1-5/L1-10 questions later retired in
the redesign below; #1172 has no remaining BLOCK questions after
verifying the runtime_control aggregate is additive to projection
v1 — informational confirmation posted on #946).
2026-05-22 (L1 self-audit: classifier → ledger-derive redesign).
Reviewer feedback on the L1 design called out that introducing a
separate Sonnet classifier with an eval set, accuracy floor, and
opt-in telemetry pipeline duplicates the work the Socratic interview
already does (structured-spec extraction into the ledger) and
violates this SSOT's own scope note ("ooo auto inherits
Ouroboros's substrate as-is"). The audit was correct. L1's design
was rewritten:
Kept: 7-class catalog, per-class default_completion_mode / default_ac_template / runtime_probe_kinds, the domain_unmatched audit event for catalog gaps.
Added: derive_domain_from_ledger() pure-Python pattern matcher
in src/ouroboros/auto/domain_inference.py (~150 LoC), small
hook in interview_driver.py to feed disambiguation question
candidates back into the existing ambiguity-gate loop when the
inference is ambiguous.
BLOCK questions retired: L1-5 (classifier model) and L1-10
(opt-in telemetry) are both moot under the redesign. Maintainer
triage now needs to answer 2 BLOCK questions total (L0-2 cost,
L0-4 replay refresh ownership).
The TL;DR and L1 lane body in this SSOT have been rewritten to
reflect the redesign. #1171 carries the full design.
2026-05-22 (Minimal-substrate audit: L0 / L2 / L5 redesigned, L3 flagged).
Same pattern that produced the L1 classifier mistake was found in L0
(nightly CI + replay layer + cost budget + ownership policy) and L2
(3-timer config + 4-directive vocabulary + subscriber pattern), with
suspicions in L5 (new state-machine substrate) and L3 (4-probe-kind
substrate). All redesigned through the minimal-substrate lens
("add substrate only when evidence demands it"):
L0 (#1170): rewritten to a manual pytest harness with 4
scenario fixtures and no CI/replay/budget/ownership infrastructure.
~330 LoC across 4 sub-PRs. The 2 BLOCK questions (L0-2 cost ceiling,
L0-4 replay refresh ownership) are now both retired — they were
decisions about substrate that no longer exists.
L5 (#1157 lane body): rewritten to plumb the existing oscillation_detected Ralph signal into the existing ooo unstuck,
plus one new typed unstuck_exhausted stop_reason_code. ~100 LoC
across 2 sub-PRs. No new state-machine, no new oscillation-detector
substrate.
L3 (no design issue yet): a minimal-substrate audit note has
been added to the L3 lane body so the same scope-tightening happens before the L3 design issue is drafted (likely outcome: v1 ships headless_run evidence kind only; sim/render/api each as their own
follow-up when a canonical scenario demands them).
Net result: 0 BLOCK questions across all open design issues; L0-a,
L1-a, L2-a, L5-a all ready to start in parallel; total estimated
implementation across L0/L1/L2/L5 minimal v1 is ~730 LoC versus the
~2,000+ LoC pre-audit plan.
The meta-lesson: Ouroboros's minimal-substrate principle is "add
substrate only when evidence demands it." Twice in one session I
defaulted to standard-engineering patterns (ML classifier, CI
infrastructure, state-machine substrate) and twice the maintainer
caught it. The pattern to internalize: if a lane body lists
infrastructure that solves a class of problems we have not yet
observed, that infrastructure does not belong in v1.
Meta SSOT:
ooo autoVision — Autonomous Completion EngineTL;DR — Status at a glance
OUROBOROS_RUN_CANONICAL=1live wiring + L1 catalog cross-validation. Remaining work is scenario/evidence cleanup only, especially #1195; no nightly/replay/cost substrate unless later evidence demands it.TaskClass/TaskClassProfilecatalog data; #1188 merged Seed AC injection + active task class envelope. “DomainProfile Catalog” wording below is historical/stale unless explicitly referring to the old recovery-hint concept. Remaining MCP metadata cleanup is #1196.RuntimeControls+ wall-clockWatchdog; #1189 merged productionAutoPipeline/CLI/MCP consumption viaruntime.watchdog.canceland typed stop reason. Do not revive the oldruntime.watchdog.decision/ directive-vocabulary design. Remaining cleanup is #1194.RuntimeEvidence+HeadlessRunProbe; #1190 mergedAutoPipelineResult.runtime_probe_evidence+ completion-gradeprobe_runnergate. Remaining cleanup is evidence alignment (#1195) and any future real-probe binding/scenario expansion;sim_trace/render_hash/api_smokestay deferred.defaulted_sections,interview_closure_mode,stop_reason_code,assumption_sources,safe_defaultclosure). Lane frozen.ooo unstuck+ Ralph oscillation detectionoscillation_detected→UNSTUCK_LATERALrouting slice. Further resilience substrate remains evidence-driven, not prebuilt.North-star
ooo auto "<one vague line>"— a single MCP invocation — drives the work toone of three typed terminal states without further user input:
CODE_COMPLETE— passing tests + lint + (for libraries) a usableAPI surface, evidence captured.
PRODUCT_COMPLETE— code-complete plus the domain-classruntime-acceptance probe passes (headless run / sim / render hash).
BLOCKED(reason_code)— resumableauto_session_id, classifiedstop reason from the 7-code taxonomy, no fabrication, no silent
abandonment, no untyped failure path.
Branching between (1) and (2) is decided by DomainProfile
auto-classification (L1). The user never has to know which mode applied.
Why this is bigger than "Seed → Run"
skills/auto/SKILL.mdcurrently optimizes:Interview → A-grade Seed → Run handoff → (optional) Ralph.That's the engineering ceiling, not the product. The product target
is: deliver a verifiably-working artifact from a vague one-line goal,
without the user pre-thinking the spec, the verification, or the
recovery.
Three invariants have to hold for that:
L1–L5 below are exactly the missing pieces that turn the current
Seed→Run pipeline into a completion engine.
Success conditions (testable, frozen)
We close this SSOT when, against a fixed canonical test matrix
(ooo auto "CLI todo manager", "2D kart racer", "webhook receiver service", "refactor src/foo into vertical slices"), all of the
following hold without human intervention for at least one of the goals,
reproducible across two consecutive runs on a clean repo:
closure_mode ∈ {mutual_agreement, ledger_only, safe_default}
genuine_deadlock = 0)
runtime.watchdog.decision Directive
closure_mode, defaulted_sections, assumptions[source], stop_reason_code (7-code taxonomy)
unstuck → reframe → safe-default → BLOCKED) runs to a terminal state
Lanes
L0 — Canonical acceptance test (minimal)
The acceptance condition in this SSOT (one canonical goal end-to-end,
reproducible 2x) needs something concrete to point at. L0 provides
the smallest possible thing: a tests/canonical/ directory with one
fixture per canonical goal and a pytest entry point. The maintainer
runs it manually when assessing SSOT close-readiness; there is no CI obligation.
Scope:
tests/canonical/<slug>/ per goal: goal.txt + optional env/ + expected.yaml (domain_class, completion_mode, optional wall_clock_budget_seconds).
tests/canonical/conftest.py: pytest runner that invokes the ouroboros_auto MCP tool against the scenario and asserts the documented terminal state.
cli-todo, webhook-receiver, vertical-slice-refactor, 2d-kart-racer (last requires L3).
Out of scope (deliberately): nightly CI workflow, recorded-replay
layer, hermetic-vs-live divergence detection, monthly cost budget,
refresh-rotation ownership policy, per-PR fast-subset CI. All of those
were operational sludge added by reflex; each gets opened as a
follow-up only if/when evidence demands it. See #1170 Self-audit note.
Dependency: none. L0-a can ship today.
Design tracked in #1170.
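A minimal sketch of what loading one scenario fixture and checking a run against it could look like. The helper names (`load_scenario`, `check_terminal_state`) and the shape of the `result` dict are hypothetical. The flat `key: value` parser is an assumption made so that the sketch stays stdlib-only; the real harness may well use a YAML library and a richer result object.

```python
from pathlib import Path


def load_scenario(scenario_dir: Path) -> dict[str, str]:
    """Read goal.txt plus the flat `key: value` pairs in expected.yaml."""
    scenario = {"goal": (scenario_dir / "goal.txt").read_text().strip()}
    for raw in (scenario_dir / "expected.yaml").read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            key, _, value = line.partition(":")
            scenario[key.strip()] = value.strip()
    return scenario


def check_terminal_state(scenario: dict[str, str], result: dict[str, str]) -> list[str]:
    """Return readable mismatches; an empty list means the run matched."""
    mismatches = []
    for key in ("domain_class", "completion_mode"):
        if key in scenario and result.get(key) != scenario[key]:
            mismatches.append(
                f"{key}: expected {scenario[key]!r}, got {result.get(key)!r}"
            )
    return mismatches
```

A pytest entry point would then parametrize over the directories under tests/canonical/ and assert that `check_terminal_state(...)` is empty for each one.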
L1 — DomainProfile Catalog
Promote #849 Phase-3 DomainProfile from "typed recovery hint" to a
first-class domain taxonomy that drives default AC, completion
mode, and runtime-probe binding, without a separate LLM classifier.
Design tracked in #1171.
Frozen 7-class catalog for L1-a (deferred classes: game-3d, desktop-app, notebook-analysis; each becomes its own ≤ 10-LoC
follow-up PR):
library, cli, web-service, webhook, data-pipeline, game-2d, refactor-in-place.
Each class declares default_completion_mode, default_ac_template (plain
tuple[str, ...] matching Seed.acceptance_criteria), runtime_probe_kinds (bound by L3), and the existing #849 safe_defaults.
Domain inference is ledger-derived, not LLM-classified. The
Socratic interview already extracts structured SeedDraftLedger
entries (actors, inputs, outputs, runtime_context, …) and
standardizes them toward canonical vocabulary. L1-b is a pure-Python
derive_domain_from_ledger() function in src/ouroboros/auto/domain_inference.py that pattern-matches those
entries against per-class predicates and returns one of:
CLI that also exposes HTTP). The next interview round gets a
disambiguation-question candidate appended (small hook in
interview_driver.py); the existing ambiguity-gate loop drives
resolution. No new escalation system.
library (narrowest
completion gate, lowest blast radius) and emits a domain_unmatched
EventStore event for maintainer review.
Zero new LLM calls, zero new external API surface, zero new
substrate. Adding a new class to the catalog later is a ~10-LoC PR
(pattern function + unit test), not an eval-set re-curation.
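The ledger-derived inference described above can be illustrated with a short sketch. Everything here is an assumption made for illustration: the ledger is modelled as a plain mapping of section name to normalized strings, the `DomainInference` result type and the three predicates are hypothetical, and the substring patterns are placeholders rather than the catalog's real matching rules. Only the function name and the three outcomes (single match, ambiguous, fallback to library) come from the lane description.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Optional, Sequence

# Assumed ledger shape: section name -> normalized entries.
Ledger = Mapping[str, Sequence[str]]


def _mentions(ledger: Ledger, section: str, *terms: str) -> bool:
    text = " ".join(ledger.get(section, ())).lower()
    return any(term in text for term in terms)


# One predicate per class; the patterns are illustrative placeholders.
PREDICATES: dict[str, Callable[[Ledger], bool]] = {
    "cli": lambda ledger: _mentions(ledger, "runtime_context", "terminal", "command line"),
    "web-service": lambda ledger: _mentions(ledger, "runtime_context", "http"),
    "game-2d": lambda ledger: _mentions(ledger, "outputs", "2d", "sprite"),
}


@dataclass(frozen=True)
class DomainInference:
    status: str  # "matched" | "ambiguous" | "unmatched"
    domain_class: Optional[str]
    candidates: tuple[str, ...] = ()


def derive_domain_from_ledger(ledger: Ledger) -> DomainInference:
    hits = tuple(name for name, matches in PREDICATES.items() if matches(ledger))
    if len(hits) == 1:
        return DomainInference("matched", hits[0], hits)
    if len(hits) > 1:
        # Caller appends a disambiguation question to the next interview round.
        return DomainInference("ambiguous", None, hits)
    # No predicate fired: fall back to library; caller emits domain_unmatched.
    return DomainInference("unmatched", "library")
```

Adding a class in this shape is one more entry in `PREDICATES` plus a unit test, which is the property the lane relies on.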
Why not a classifier. An earlier draft proposed a separate Sonnet
classifier with an eval set + accuracy floor + opt-in telemetry. That
duplicated the inference the interview is already doing and violated
this SSOT's own scope note ("ooo auto inherits Ouroboros's
substrate as-is"). The redesign is documented in the Freshness sync
section under the 2026-05-22 self-audit entry.
L2 — Watchdog v1 (minimal new substrate)
The smallest watchdog that satisfies "long-running run never stalls without a recorded reason": one timer per session (wall-clock), one event (runtime.watchdog.cancel), one new stop_reason_code (watchdog_wall_clock_exceeded). When now exceeds the session start time plus session_wall_clock_seconds, the watchdog fires, the EventStore records it, and the pipeline transitions to BLOCKED with the typed code. Timer state is implicit in AutoPipelineState.session_started_at, so resume semantics work without separate serialization.
Substrate addition: exactly one new EventStore event family
(runtime.watchdog.cancel) and one new aggregate_type = "runtime_control". Confirmed additive to projection v1 by reading src/ouroboros/persistence/schema.py (informational confirmation posted on #946). The other four lanes in this SSOT light up existing substrate; L2 adds this single family.
v2 expansion (deferred, evidence-driven): richer 3-timer config
(idle / no_progress / safety), 4-directive vocabulary
(WAIT / RETRY / UNSTUCK / CANCEL), material_progress_events vs activity_events split, subscriber pattern for cooperative cancel,
ad-hoc-timeout deprecation across MCP / Ralph / evolve. Each opens as
its own slice only when a real-world stall slips past v1 wall-clock.
Design tracked in #1172 (this issue lifts #578 to v1 minimum + documents v2 expansion path).
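The wall-clock rule is small enough to sketch directly. The `WatchdogCancel` record and `check_wall_clock` function below are hypothetical names for illustration; only the event name, aggregate type, stop reason code, and the two inputs (`session_started_at`, `session_wall_clock_seconds`) come from the lane description. Passing `now` explicitly is an assumption that keeps the check pure and easy to test.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class WatchdogCancel:
    event_type: str
    aggregate_type: str
    stop_reason_code: str
    elapsed_seconds: float


def check_wall_clock(
    session_started_at: datetime,
    session_wall_clock_seconds: float,
    now: datetime,
) -> Optional[WatchdogCancel]:
    """Return a cancel record once the wall-clock budget is exceeded."""
    elapsed = (now - session_started_at).total_seconds()
    if elapsed <= session_wall_clock_seconds:
        return None  # still within budget, nothing to record
    return WatchdogCancel(
        event_type="runtime.watchdog.cancel",
        aggregate_type="runtime_control",
        stop_reason_code="watchdog_wall_clock_exceeded",
        elapsed_seconds=elapsed,
    )
```

Because the only state is the session start timestamp, a resumed session can re-run the same check without restoring any timer object.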
L3 — Runtime Acceptance Substrate
Extend Track A fat-harness (#920 / #978) evidence schema to legally
accept non-test evidence:
stdout, exit_code, duration)
DomainProfile (L1) binds each class to one or more probes. This is the
substrate change that makes PRODUCT_COMPLETE mean "the thing
actually runs", not "tests pass".
Minimal-substrate audit pending. Per the L0/L1/L2/L5 self-audits
(2026-05-22), L3 has not yet been re-examined through the same
minimal-substrate lens. Before opening the L3 design issue, ask:
which evidence kinds does v1 actually need, and which are
speculative? Likely v1 collapses to headless_run only (capture
stdout / exit_code / duration); sim_trace, render_hash, and api_smoke
each open as their own follow-up only when a canonical scenario
demands them. This audit happens when L3's design-issue PR is drafted,
not in this lane body.
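For orientation, a headless_run probe that captures stdout, exit_code, and duration could look roughly like the sketch below. `HeadlessRunEvidence` and `run_headless_probe` are hypothetical names chosen for this example and do not reproduce the repository's own evidence or probe types; the 30-second default timeout is likewise an assumption.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HeadlessRunEvidence:
    kind: str
    stdout: str
    exit_code: int
    duration: float  # seconds, measured with a monotonic clock


def run_headless_probe(
    command: Sequence[str], timeout_seconds: float = 30.0
) -> HeadlessRunEvidence:
    """Run the artifact once without a display and record what happened."""
    started = time.monotonic()
    # subprocess.run raises TimeoutExpired if the command overruns the timeout.
    completed = subprocess.run(
        list(command), capture_output=True, text=True, timeout=timeout_seconds
    )
    return HeadlessRunEvidence(
        kind="headless_run",
        stdout=completed.stdout,
        exit_code=completed.returncode,
        duration=time.monotonic() - started,
    )


if __name__ == "__main__":
    evidence = run_headless_probe([sys.executable, "-c", "print('ok')"])
    print(evidence.exit_code, evidence.stdout.strip())
```

Returning one frozen object keeps the evidence inspectable in a single place rather than spread over log files.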
Track A collision warning. L3's verifier-integration slice (L3-d)
modifies the same src/ouroboros/orchestrator/ evidence-handling
surface that Track A verifier follow-ups #1165 / #1166 / #1168 are
currently active in. Sequencing rule:
main first.
drained: taxonomy is pure-additive and safe regardless, but it
becomes the authoritative shape downstream verifier work conforms to.
it conflicts with at most a single fresh main.
L4 — Auto Envelope v2
A single frozen v2 contract consumed identically by CLI, MCP, and any
future UI. 🟢 Complete: all five planned envelope fields landed:
defaulted_sections[]
interview_closure_mode (mutual_agreement / ledger_only / safe_default)
stop_reason_code (8-code)
interview_unsafe_gaps_remain) ledger_done=False (PR-B2)
assumption_sources[] with AssumptionRecord provenance (PR-C2)
Lane frozen. Becomes the canonical shape used by L1 / L2 / L3 / L5
downstream.
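One way to picture a single envelope consumed identically by every surface is a frozen dataclass with one serialization method. This is a sketch under assumptions: `AutoEnvelope` and `AssumptionSource` are hypothetical stand-ins (the real types are AutoPipelineResult and AssumptionRecord, whose exact fields are not shown in this issue), and only the field names listed above are taken from the lane.

```python
from dataclasses import asdict, dataclass
from typing import Literal, Optional

ClosureMode = Literal["mutual_agreement", "ledger_only", "safe_default"]


@dataclass(frozen=True)
class AssumptionSource:  # hypothetical stand-in for AssumptionRecord
    section: str
    source: str  # e.g. "ASSUMPTION", "INFERENCE", "CONSERVATIVE_DEFAULT"


@dataclass(frozen=True)
class AutoEnvelope:  # hypothetical stand-in for the v2 result envelope
    interview_closure_mode: ClosureMode
    stop_reason_code: Optional[str] = None
    defaulted_sections: tuple[str, ...] = ()
    assumption_sources: tuple[AssumptionSource, ...] = ()

    def to_payload(self) -> dict:
        """Single serialization path shared by CLI, MCP, and any future UI."""
        return asdict(self)
```

With one `to_payload()` path, a CLI printer and an MCP response cannot drift apart in which fields they expose.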
L5 — Long-running Resilience (minimal, existing substrate only)
The "refuse to die quietly" invariant — but built on what already exists,
not as new substrate.
What exists today:
ooo unstuck (lateral persona swap) already runs when invoked.
oscillation_detected / grade_regressing as stop_reason_code values (feat(auto): canonical stop_reason_code for interview-layer blockers #1151).
What L5 v1 actually adds (the only missing link):
oscillation_detected during a single ooo auto session, automatically invoke ooo unstuck once before
bailing. ~50 LoC + integration test.
ooo unstuck exhausts its budget (default: 1 attempt),
emit a typed stop_reason_code="unstuck_exhausted" (new 10th code) so
the result envelope distinguishes "tried unstuck and failed" from
"never tried". ~50 LoC.
Out of scope (v1): new escalation-ladder state machine, new
oscillation-detector substrate, budget unification with L2, reframe
(ontologist) as a separate stage. Each can be added later if/when
evidence shows the v1 plumbing is too thin.
Total: ~100 LoC across 2 sub-PRs. Earlier draft was ~600 LoC of new
state-machine substrate that duplicated existing detection signals.
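The two pieces of v1 plumbing fit in one loop. The sketch below is illustrative: `run_with_unstuck`, the callable-based wiring, and the convention that a step returns "ok" or a signal string are all assumptions made for the example; only the `oscillation_detected` and `unstuck_exhausted` codes and the default budget of one attempt come from the lane text.

```python
from typing import Callable, Optional


def run_with_unstuck(
    run_step: Callable[[], str],
    unstuck: Callable[[], None],
    unstuck_budget: int = 1,
) -> Optional[str]:
    """Return None on success, otherwise a typed stop_reason_code."""
    attempts = 0
    while True:
        signal = run_step()  # assumed to return "ok" or a signal/stop code
        if signal == "ok":
            return None
        if signal != "oscillation_detected":
            return signal  # other typed codes pass through unchanged
        if attempts >= unstuck_budget:
            # "Tried unstuck and failed" is distinct from "never tried".
            return "unstuck_exhausted" if attempts > 0 else "oscillation_detected"
        unstuck()
        attempts += 1
```

Keeping the budget as a plain counter avoids any new state-machine substrate, in line with the out-of-scope list that follows in the lane body.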
Lane dependency graph
The lanes are not independent — three real dependencies and one
collision risk constrain the ordering. Reading this graph before
opening a lane PR avoids both paper completion (L0 invariant) and
the Track A collision (L3 warning above).
Hard dependencies (a PR opened upstream of an arrow cannot claim
done until the downstream consumer is at least design-locked):
L1-a (catalog data) must land before L3-c (probe binding).
UNSTUCK / CANCEL directives. L5-a (state machine) can start
before L2 lands; L5-c (watchdog integration) cannot.
validated by the canonical matrix runner. L0 doesn't block lane
implementation — it blocks lane completion claims.
Soft dependency / collision risk:
fix(orchestrator): match command claims wrapped in output redirection/pager pipes #1168). Same
src/ouroboros/orchestrator/ evidence-handling
surface. See the Track A collision warning in the L3 section.
Recommended parallelism:
#578RFC格상main; if PASS count ≥ 1 reproducible → SSOT closePositioning vs OMX
ultragoalOMX
ultragoal(indocs/ultragoal.mdand
plugins/oh-my-codex/skills/ultragoal/SKILL.md)is the closest in-class prior art — a durable, repo-native multi-goal
workflow layered over Codex CLI's
goalsfeature. It is genuinelyexcellent at executing a known plan to completion. Naming where it
ends is how we name where
ooo automust do something different — notjust "more".
What OMX ultragoal actually does (sharp summary)
--brief, --brief-file, --from-stdin)
G001 / G002 / … stories stored in .omx/ultragoal/goals.json
.omx/ultragoal/brief.md + goals.json (plan + status + attempts + evidence) + ledger.jsonl (append-only event log)
goals.json rather than enumerating ids, so steering can add/split stories without weakening the end goal
omx ultragoal complete-goals prints a handoff; the agent calls get_goal → create_goal (only if none active) → completes the OMX story → omx ultragoal checkpoint --status complete --evidence … --codex-goal-json <get_goal snapshot>
add_subgoal, split_subgoal, reorder_pending, revise_pending_wording, annotate_ledger, mark_blocked_superseded. Prose ("make it easier") is rejected. Every accept / reject appends an audit entry.
ai-slop-cleaner on changed files → re-verification → $code-review. Clean = recommendation:APPROVE + architectStatus:CLEAR. Non-clean → record-review-blockers appends a new pending blocker-resolution story and the run continues.
goal_completed, goal_failed, goal_blocked, goal_review_blocked, final_review_failed, aggregate_objective_migrated, …
This is a strong substrate. The append-only ledger, pointer-style
aggregate objective, structured-steering-with-audit, and mandatory
final quality gate are patterns we should learn from — not redo from
scratch.
What OMX ultragoal does not do (the room for ooo auto)
OMX ultragoal assumes the brief is sufficient and the LLM
decomposition is correct. Every gap below comes from those two
assumptions.
ultragoal | ooo auto target
ambiguity ≤ 0.2; refuses to proceed otherwise
game-2d → game-loop / input / playable build / render target) into the Seed before execution
CODE_COMPLETE (library) vs PRODUCT_COMPLETE (game / app); user never has to choose
WAIT / RETRY / UNSTUCK / CANCEL directive at the OMX layer
runtime.watchdog.decision events replay why a run paused or died
APPROVE / CLEAR: heavy on review, light on actual runtime
reason_code on the run-level result
stop_reason_code taxonomy on AutoPipelineResult; every terminal carries one
goals.json[].evidence; no defaulted_sections or assumptions[source] surface
closure_mode, defaulted_sections[], assumptions[].source, stop_reason_code
record-review-blockers appends a new pending story, keep going
unstuck-persona → reframe (ontologist) → safe-default closure → BLOCKED(reason_code); each rung bounded; falling off the bottom is always a typed terminal
omx ultragoal steer …)
Where ultragoal validates existing ooo auto substrate
Two ultragoal patterns line up with substrate Ouroboros already has or
already plans in this SSOT — listed here as confirmation, not as new
behavior commitments. Per the scope note at the top of this issue, this
SSOT does not propose changes to Ouroboros's interview / steering /
recovery substrate; it only owns the L1–L5 lanes.
ooo auto already has this via Track C EventStore ([Feature] Define Run/Step/Artifact projections as the canonical harness vocabulary #946 / Agent OS: introduce typed Workflow IR for fat-harness execution planning #956). Ultragoal's
.omx/ultragoal/ledger.jsonl pattern reinforces the existing design
decision: keep treating EventStore as SSOT for every lifecycle event
(including L2 watchdog decisions and L5 escalation transitions when
those lanes land) and never compute terminal status from anywhere
else. No behavior change implied.
--quality-gate-json (aiSlopCleaner + verification + codeReview keys) is a clean evidence container shape. L3 (Runtime
Acceptance Substrate), which is in this SSOT's scope, should emit
a comparable structured payload for the runtime probe, so
PRODUCT_COMPLETE carries one inspectable evidence object instead
of scattered files. Lane-internal shape choice, within L3 scope.
Patterns explicitly not adopted
Ultragoal has two further patterns that could be ported, but doing so
would change Ouroboros substrate outside this SSOT's scope. They are
listed here only so future readers know they were considered and
rejected for this issue:
ledger live rather than a snapshotted AC list). Whether to adopt is
outside L1–L5 scope; if pursued, it requires a separate design issue
against ooo:ralph and --complete-product chaining.
plan-revision kinds with evidence + audit, à la ultragoal's
add_subgoal / split_subgoal / etc.). ooo unstuck and the typed
recovery plan (feat(auto): persist typed recovery plans after QA failure #928) already cover mid-run revision today; replacing
them with a finite vocabulary would be a substrate redesign, which
the scope note at the top of this issue forbids.
One-line positioning
Double-diamond mapping
ooo auto is the single entrypoint that drives the entire 4-step
diamond without the user ever choosing which step they are in.
AgentOS substrate dependencies (from #961)
Mapping each lane to the #961 track that the warden uses for triage,
plus the routing the warden has actually applied to merged L4 PRs
(#1167 / #1169 were classified as "Track B follow-up outside Track C
tier gates"):
ooo run fat-harness, Agent OS roadmap: make ooo run trustworthy with a fat harness execution path #920 / Design spine: AgentOS evidence-gated delivery via TraceGuard #978): L3 extends
the evidence schema: additive, no backwards-incompatible change.
Warden classification likely: "Track A follow-up outside Track C
gates" (mirroring fix(orchestrator): log AC verifier and dependency diagnostics #1165 / fix(orchestrator): credit transcript test commands for tests_passed claims #1166 / fix(orchestrator): match command claims wrapped in output redirection/pager pipes #1168 precedent).
ooo auto self-healing, closed [EPIC] ooo auto end-to-end product completion #772 / RFC: ooo auto evolution to domain-agnostic self-healing E2E #809 / feat(auto): make ooo auto autonomous end-to-end — short goals must reach COMPLETE without human input #821):
L4 (🟢 complete via feat(auto): surface defaulted_sections in AutoPipelineResult #1146 / fix(auto): close interview on ledger-only consensus at max_rounds #1148 / feat(auto): canonical stop_reason_code for interview-layer blockers #1151 / feat(auto): safe-default closure mode + partial-unsafe blocker code (PR-B2) #1167 / feat(auto): additive assumption_sources provenance surface (PR-C2) #1169)
and L5 are natural continuations. Warden classification:
"Track B follow-up outside Track C tier gates" — same precedent
as the merged L4 PRs.
decisions ride the EventStore + projection vocabulary; L4 envelope
fields live on the lifecycle event family.
Track parent; classify as a peer follow-up. L2 anchors on
#578, which Meta SSOT: AgentOS roadmap sequencing (#920–#960) #961's "Scope of the Tier system" paragraph
explicitly excludes from Track C tier gating: #578 keeps its
own design-issue lifecycle. L1 extends #849, a merged
Track B Phase-3 artifact, so its follow-ups inherit Track B
routing.
Four of the five implementation lanes (L0 / L1 / L3 / L5) light up
existing substrate. L2 adds one new substrate family, the
runtime.watchdog.cancel event family, which is the only deliberately
new building block in this SSOT. See
the Substrate honesty note at the top of this issue.
Anyone asking "what can I do right now"
Review and merge L4 in-flight PRs feat(auto): surface defaulted_sections in AutoPipelineResult #1146 / fix(auto): close interview on ledger-only consensus at max_rounds #1148 / feat(auto): canonical stop_reason_code for interview-layer blockers #1151: merged 2026-05-21 (squash to main, in that order after
sequential rebases on AutoPipelineResult / _result()).
Merge adjacent unblocker fix(persistence): sanitize Windows-reserved chars in checkpoint seed_id (fixes #1155) #1156 (Windows checkpoint sanitize): merged 2026-05-21; ooo auto run phase now reachable on
Windows.
Open PR-B2 for ledger_done=False safe-default finalization: merged 2026-05-22 as feat(auto): safe-default closure mode + partial-unsafe blocker code (PR-B2) #1167 (safe_default closure_mode + interview_unsafe_gaps_remain 8th stop_reason_code).
Open PR-C2 for assumptions[].source provenance promotion: merged 2026-05-22 as feat(auto): additive assumption_sources provenance surface (PR-C2) #1169 (AssumptionRecord + AutoPipelineResult.assumption_sources, additive surface).
Open design issues for L0 / L1 / L2: opened 2026-05-22 as Meta SSOT slice: L0 — Canonical Test Harness for ooo auto acceptance #1170 / Meta SSOT slice: L1 — TaskClass Catalog (ledger-derived domain inference + default AC injection) #1171 / Meta SSOT slice: L2 — Runtime watchdog v1 (minimal, lifts #578) #1172, then redesigned to minimal-substrate v1
(see freshness sync entries below). All three are now ready for
their respective *-a PR slices to start in parallel.
tests/canonical/cli-todo/ scenario + pytest runner skeleton. See Meta SSOT slice: L0 — Canonical Test Harness for ooo auto acceptance #1170. No CI, no replay, no budget.
DomainProfile dataclass per-class fields + unit test per class. See Meta SSOT slice: L1 — TaskClass Catalog (ledger-derived domain inference + default AC injection) #1171.
wall-clock timer, single cancel event, single new stop_reason_code).
v2 expansion path documented for evidence-driven future. See Meta SSOT slice: L2 — Runtime watchdog v1 (minimal, lifts #578) #1172.
audit (likely outcome: v1 ships headless_run evidence kind only;
sim_trace / render_hash / api_smoke each as their own follow-up
when a canonical scenario demands them).
oscillation_detected Ralph signal
into existing ooo unstuck. ~50 LoC. Does not need its own design
issue if scope stays this small.
ultragoal's bundled
quality-gate JSON shape (single evidence object with named sub-keys)
for the runtime-probe payload, only if v1 ships more than one
evidence kind.
Anyone asking "what's blocked"
L1-a ready.
(Agent OS: define RuntimeControls watchdog contract #578 body promotion). No BLOCK questions; v2 expansion deferred to
evidence-driven follow-ups.
also waits on the active Track A verifier queue (fix(orchestrator): log AC verifier and dependency diagnostics #1165 / fix(orchestrator): credit transcript test commands for tests_passed claims #1166 /
fix(orchestrator): match command claims wrapped in output redirection/pager pipes #1168) draining.
oscillation_detected → ooo unstuck) ready. ~50 LoC.
Acceptance gate (when this SSOT closes)
This SSOT closes when all of the following are true:
superseded by a documented alternative.
with zero human intervention for at least one canonical goal
(e.g. ooo auto "2D kart racer").
L0 nightly job.
Until then this issue stays OPEN and serves as the living
guideline. Warden-style freshness syncs append below as lanes progress.
Freshness sync
2026-05-21 (initial post). Issue opened as living SSOT.
2026-05-21 (L4 partial 🟢). L4 Envelope v2 lane advanced from 🟡 to
🟢 partial: squash-merged #1156 (Windows checkpoint sanitize, the
adjacent prerequisite), then #1151 (stop_reason_code 7-code
taxonomy), then #1148 (interview_closure_mode ledger-only closure;
required one src/ouroboros/auto/state.py rebase conflict
resolution: both PRs added a payload.setdefault line, kept both),
then #1146 (defaulted_sections surface). Three of the five planned
L4 envelope fields landed; PR-B2 (ledger_done=False safe-default
finalization) and PR-C2 (assumptions[].source provenance) remain
deferred; landing them unblocks the L4 freeze.
2026-05-21 (Positioning revised). Original "Positioning vs OMC
autopilot / ultragoal-style flows" section was inaccurate: it
conflated OMC autopilot with OMX ultragoal. Rewrote against the
actual Yeachan-Heo/oh-my-codex ultragoal contract
(durable .omx/ultragoal/ ledger, pointer-style aggregate Codex goal,
six structured-steering mutation kinds, mandatory final
ai-slop-cleaner + code-review APPROVE/CLEAR gate).
2026-05-21 (Positioning scope-tightened). First draft of the
absorb-these-patterns subsection proposed two ultragoal patterns —
pointer-style aggregate handoff for --complete-product Ralph, and a
finite structured-steering mutation vocabulary to replace ooo unstuck
the L1–L5 lanes this SSOT owns. That contradicted the scope note at
the top of the issue ("ooo auto inherits Ouroboros's substrate as-is").
Removed both from the absorb list; moved them into a new "Patterns
explicitly not adopted" subsection that documents the consideration
confirmation, L3 bundled quality-gate JSON shape) are scope-clean —
the first restates existing Track C behavior, the second is a
lane-internal output schema choice.
2026-05-22 (L4 lane 🟢 complete). Squash-merged the two deferred
L4 follow-ups: #1167 (PR-B2: safe_default closure_mode + new
interview_unsafe_gaps_remain stop_reason_code, taxonomy now 8 codes)
and #1169 (PR-C2: AssumptionRecord frozen dataclass + additive
AutoPipelineResult.assumption_sources surface broadening from
LedgerSource.ASSUMPTION only to all three assumption-class sources:
ASSUMPTION, INFERENCE, CONSERVATIVE_DEFAULT). All five planned
L4 envelope fields are now live on main. L4 lane status moves from
🟡 partial to 🟢 complete; envelope v2 is frozen.
2026-05-22 (SSOT self-audit corrections, 5 items). A scope-and-design
audit against #961 flagged five issues in the prior draft. Fixed in
this revision:
regenerated against landed PR numbers / shas.
substrate" sentence at the bottom of the AgentOS-dependencies
section conflicted with L2 RuntimeControls v1, which is
genuinely new substrate (new EventStore event family
runtime.watchdog.decision, new directive vocabulary). Added a
Substrate honesty paragraph to the issue header and rewrote the
AgentOS-dependencies section to name L2 explicitly as the single
new-substrate lane (the other four lanes light up existing
substrate).
gates every other lane's completion claim against actual canonical
matrix behavior. Without L0, lane PRs can paper-merge while the
integration stays broken.
L1 classifier acceptance gate: L1-b cannot claim done without a
frozen eval set (≥ 5 examples × 10 classes), ≥ 90% top-1 accuracy,
and 100% confidence-floor escalation behavior. Added to the L1
lane body. Superseded by the 2026-05-22 ledger-derive
redesign (entry below). The classifier acceptance gate is no
longer in scope.
between L5 and Positioning showing hard dependencies
(L1 → L3, L2 → L5, L0 → everyone) and the soft collision risk
between L3 and the active Track A verifier follow-ups
(fix(orchestrator): log AC verifier and dependency diagnostics #1165 / fix(orchestrator): credit transcript test commands for tests_passed claims #1166 / fix(orchestrator): match command claims wrapped in output redirection/pager pipes #1168). The L3 lane body now carries the explicit
sequencing rule.
2026-05-22 (Design issues opened: #1170 L0, #1171 L1, #1172 L2).
Three design slices opened on Q00/ouroboros for the design-stage
lanes. Each issue body locks the formerly-open design questions with
recommended defaults under a Decisions awaiting maintainer triage
section so a maintainer-only triage pass can answer the remaining
4 BLOCK questions in one round (L0-2 cost ceiling, L0-4 replay
refresh ownership, plus the L1-5/L1-10 questions later retired in
the redesign below; #1172 has no remaining BLOCK questions after
verifying the runtime_control aggregate is additive to projection
v1; informational confirmation posted on #946).
2026-05-22 (L1 self-audit: classifier → ledger-derive redesign).
Reviewer feedback on the L1 design called out that introducing a
separate Sonnet classifier with an eval set, accuracy floor, and
opt-in telemetry pipeline duplicates the work the Socratic interview
already does (structured-spec extraction into the ledger) and
violates this SSOT's own scope note ("ooo auto inherits
Ouroboros's substrate as-is"). The audit was correct. L1's design
was rewritten:
≥ 90%), confidence threshold knob, model-routing decision
(haiku vs sonnet vs opus), opt-in telemetry pipeline, Q00/ouroboros-eval-data
private dataset proposal.
default_completion_mode / default_ac_template / runtime_probe_kinds, the domain_unmatched audit event for catalog gaps.
derive_domain_from_ledger() pure-Python pattern matcher
in src/ouroboros/auto/domain_inference.py (~150 LoC), small
hook in interview_driver.py to feed disambiguation question
candidates back into the existing ambiguity-gate loop when the
inference is ambiguous.
(opt-in telemetry) are both moot under the redesign. Maintainer
triage now needs to answer 2 BLOCK questions total (L0-2 cost,
L0-4 replay refresh ownership).
The TL;DR and L1 lane body in this SSOT have been rewritten to
reflect the redesign. #1171 carries the full design.
2026-05-22 (Minimal-substrate audit: L0 / L2 / L5 redesigned, L3 flagged).
Same pattern that produced the L1 classifier mistake was found in L0
(nightly CI + replay layer + cost budget + ownership policy) and L2
(3-timer config + 4-directive vocabulary + subscriber pattern), with
suspicions in L5 (new state-machine substrate) and L3 (4-probe-kind
substrate). All redesigned through the minimal-substrate lens
("add substrate only when evidence demands it"):
pytest harness with 4
scenario fixtures and no CI/replay/budget/ownership infrastructure.
~330 LoC across 4 sub-PRs. The 2 BLOCK questions (L0-2 cost ceiling,
L0-4 replay refresh ownership) are now both retired: they were
decisions about substrate that no longer exists.
runtime.watchdog.cancel event + a single new watchdog_wall_clock_exceeded
stop_reason_code. ~150 LoC across 3 sub-PRs. v2 expansion path
(3 timers, 4 directives, subscriber pattern, ad-hoc timeout
deprecation) documented inside Meta SSOT slice: L2 — Runtime watchdog v1 (minimal, lifts #578) #1172 as evidence-driven follow-ups.
oscillation_detected Ralph signal into the existing ooo unstuck,
plus one new typed unstuck_exhausted stop_reason_code. ~100 LoC
across 2 sub-PRs. No new state-machine, no new oscillation-detector
substrate.
been added to the L3 lane body so the same scope-tightening happens
before the L3 design issue is drafted (likely outcome: v1 ships
headless_run evidence kind only; sim/render/api each as their own
follow-up when a canonical scenario demands them).
Net result: 0 BLOCK questions across all open design issues; L0-a,
L1-a, L2-a, L5-a all ready to start in parallel; total estimated
implementation across L0/L1/L2/L5 minimal v1 is ~730 LoC versus the
~2,000+ LoC pre-audit plan.
The meta-lesson: Ouroboros's minimal-substrate principle is "add
substrate only when evidence demands it." Twice in one session I
defaulted to standard-engineering patterns (ML classifier, CI
infrastructure, state-machine substrate) and twice the maintainer
caught it. The pattern to internalize: if a lane body lists
infrastructure that solves a class of problems we have not yet
observed, that infrastructure does not belong in v1.