You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Meta SSOT slice: L2 — Runtime watchdog v1 (minimal)
Implementation cleanup (2026-05-23 KST). L2 is partially implemented: #1178 merged RuntimeControls + wall-clock Watchdog; #1189 merged CLI/MCP/AutoPipeline consumption through runtime.watchdog.cancel. Treat old runtime.watchdog.decision, directive vocabulary, idle/no-progress timers, and subscriber-framework wording as superseded v2 ideas. Remaining known cleanup: #1194.
This issue is the L2 implementation slice of #1157. It is the smallest possible watchdog substrate that satisfies the SSOT invariant "long-running run never stalls without a recorded reason".
Why this is small on purpose
The earlier draft of L2 proposed a three-timer config (idle_timeout / no_progress_timeout / safety_timeout), a four-directive vocabulary (WAIT / RETRY / UNSTUCK / CANCEL), a subscriber pattern for cooperative cancellation, an activity_events vs material_progress_events event-set distinction, and a deprecation roadmap for ad-hoc timeouts spread across MCP / auto / Ralph layers. That was 6× the substrate the actual problem requires.
The actual problem is small: ooo auto sessions sometimes hang silently, and operators cannot tell why. The minimal substrate that fixes this:
One timer per session (wall-clock).
One outcome: when it fires, the session is cancelled and the EventStore records why.
That's it.
Once v1 lands and we have evidence of which kinds of stalls slip through wall-clock-only (e.g. busy-loop without actual progress), the more refined timer types and directive vocabulary can be added as their own slices.
Substrate honesty
L2 v1 introduces one new EventStore event family: runtime.watchdog.cancel. This is the only new substrate in #1157. Per the Substrate honesty note at the top of #1157, every other lane lights up existing substrate; L2 adds exactly this.
The earlier draft also proposed introducing a runtime_control aggregate type on the projection. Confirmed additive via code reading of src/ouroboros/persistence/schema.py — aggregate_type is a free-form VARCHAR(100), no enum constraint, no Python allowlist. See the informational comment posted on #946.
Minimal contract
One config knob
runtime_controls:
session_wall_clock_seconds: 14400# 4h default; configurable per goal
That's the entire config. No idle_timeout, no no_progress_timeout, no safety_timeout — until evidence shows they're needed.
Adds watchdog_wall_clock_exceeded to the existing 8-code taxonomy (L4), bringing it to 9 codes. The auto pipeline catches the event, transitions to BLOCKED with this code, and returns a resumable auto_session_id per existing L4 semantics.
Resume semantics
The watchdog reads session_started_at from the auto session state on every tick. If now - session_started_at > budget, fire. This means timer state naturally persists across resume without a separate serialization step — the original session start time is already in AutoPipelineState.
Sub-PR breakdown
L2-a — #578 body promotion: write the implementation-ready RFC sections answering the irreducible 3 questions (when fires, what records, how resume behaves). Remove needs-design label. ~0 LoC code. Defers the broader directive vocabulary / idle vs progress distinction to a documented v2 expansion path inside Agent OS: define RuntimeControls watchdog contract #578.
L2-b — src/ouroboros/runtime/watchdog.py (new): single-timer watchdog. ~100 LoC + tests. Runs in the auto pipeline's main loop (no separate process). Reads session_started_at from AutoPipelineState, checks against session_wall_clock_seconds, fires via EventStore.append().
L2-c — auto pipeline integration: interview_driver / evolve / ralph_handoff subscribe to runtime.watchdog.cancel events on their session's aggregate; on receipt, they exit cleanly and let the pipeline transition to BLOCKED with stop_reason_code=watchdog_wall_clock_exceeded. ~50 LoC + integration test.
Total ~150 LoC. The earlier ~700 LoC plan included infrastructure for problems we have not yet observed.
Acceptance criteria
L2-a: #578 body lifted to RFC; the 3 irreducible questions answered; broader directive vocabulary documented as v2 expansion path.
L2-b: a session that exceeds session_wall_clock_seconds produces exactly one runtime.watchdog.cancel event and the pipeline transitions to BLOCKED.
L2-c: existing L0 canonical scenario cli-todo (when seeded with an absurdly tiny budget like 10 s) demonstrates the watchdog firing and the result envelope carrying stop_reason_code="watchdog_wall_clock_exceeded".
Resume after watchdog cancel succeeds: a new ooo auto --resume <session_id> continues from where the session left off, with a fresh wall-clock window starting at the resume moment.
v2 expansion path (deliberately deferred)
The following are documented as possible future slices, not v1 scope. Each gets opened as its own follow-up issue only when evidence demands it (i.e. a real-world stall slips past wall-clock-only):
idle_timeout vs no_progress_timeout distinction (catches busy-but-stuck LLM loops).
WAIT / RETRY / UNSTUCK directive vocabulary (richer responses than just cancel).
material_progress_events vs activity_events event-set config.
Per-layer ad-hoc timeout deprecation (MCP transport / Ralph generation / evolve step).
The discipline: open each only when you can point at a specific recorded stall that wall-clock-only failed to catch.
Decisions awaiting maintainer triage
None. v1 has no BLOCK questions. The only externally-relevant choice (runtime_control aggregate compatibility) was confirmed additive via code reading; informational comment posted on #946.
Self-audit note (2026-05-22)
The earlier draft was speculative substrate — building infrastructure for stall categories we have not actually observed in ooo auto runs. Per Ouroboros's minimal-substrate principle, this issue now ships the smallest thing that fixes the observed problem (silent hangs), and documents richer designs as v2 expansion to be triggered by evidence, not anticipation.
Known v1 limitations (documented, not blockers)
Wall-clock-only stall detection. A busy-but-stuck session (LLM looping without progress) burns through wall-clock budget naturally; v1 catches it eventually but not crisply. v2 idle/progress distinction would catch it earlier — open when evidence shows v1 is too crude.
No external-signal handling. SIGTERM / Ctrl+C does not produce a runtime.watchdog.cancel event in v1. Resume sees no record; treat as external kill without audit trail.
Meta SSOT slice: L2 — Runtime watchdog v1 (minimal)
This issue is the L2 implementation slice of #1157. It is the smallest possible watchdog substrate that satisfies the SSOT invariant "long-running run never stalls without a recorded reason".
Why this is small on purpose
The earlier draft of L2 proposed a three-timer config (
idle_timeout/no_progress_timeout/safety_timeout), a four-directive vocabulary (WAIT/RETRY/UNSTUCK/CANCEL), a subscriber pattern for cooperative cancellation, anactivity_eventsvsmaterial_progress_eventsevent-set distinction, and a deprecation roadmap for ad-hoc timeouts spread across MCP / auto / Ralph layers. That was 6× the substrate the actual problem requires.The actual problem is small:
ooo autosessions sometimes hang silently, and operators cannot tell why. The minimal substrate that fixes this:Once v1 lands and we have evidence of which kinds of stalls slip through wall-clock-only (e.g. busy-loop without actual progress), the more refined timer types and directive vocabulary can be added as their own slices.
Substrate honesty
L2 v1 introduces one new EventStore event family:
runtime.watchdog.cancel. This is the only new substrate in #1157. Per the Substrate honesty note at the top of #1157, every other lane lights up existing substrate; L2 adds exactly this.The earlier draft also proposed introducing a
runtime_controlaggregate type on the projection. Confirmed additive via code reading ofsrc/ouroboros/persistence/schema.py—aggregate_typeis a free-formVARCHAR(100), no enum constraint, no Python allowlist. See the informational comment posted on #946.Minimal contract
One config knob
That's the entire config. No
idle_timeout, nono_progress_timeout, nosafety_timeout— until evidence shows they're needed.One event
{ "aggregate_type": "runtime_control", "aggregate_id": "<auto_session_id>", "event_kind": "runtime.watchdog.cancel", "reason": "wall_clock_exceeded", # only reason for v1 "session_started_at": "<iso8601>", "fired_at": "<iso8601>", "elapsed_seconds": int, "configured_budget_seconds": int, }One stop_reason_code
Adds
watchdog_wall_clock_exceededto the existing 8-code taxonomy (L4), bringing it to 9 codes. The auto pipeline catches the event, transitions to BLOCKED with this code, and returns a resumableauto_session_idper existing L4 semantics.Resume semantics
The watchdog reads
session_started_atfrom the auto session state on every tick. Ifnow - session_started_at > budget, fire. This means timer state naturally persists across resume without a separate serialization step — the original session start time is already inAutoPipelineState.Sub-PR breakdown
#578body promotion: write the implementation-ready RFC sections answering the irreducible 3 questions (when fires, what records, how resume behaves). Removeneeds-designlabel. ~0 LoC code. Defers the broader directive vocabulary / idle vs progress distinction to a documented v2 expansion path inside Agent OS: define RuntimeControls watchdog contract #578.src/ouroboros/runtime/watchdog.py(new): single-timer watchdog. ~100 LoC + tests. Runs in the auto pipeline's main loop (no separate process). Readssession_started_atfromAutoPipelineState, checks againstsession_wall_clock_seconds, fires viaEventStore.append().interview_driver/evolve/ralph_handoffsubscribe toruntime.watchdog.cancelevents on their session's aggregate; on receipt, they exit cleanly and let the pipeline transition to BLOCKED withstop_reason_code=watchdog_wall_clock_exceeded. ~50 LoC + integration test.Total ~150 LoC. The earlier ~700 LoC plan included infrastructure for problems we have not yet observed.
Acceptance criteria
#578body lifted to RFC; the 3 irreducible questions answered; broader directive vocabulary documented as v2 expansion path.session_wall_clock_secondsproduces exactly oneruntime.watchdog.cancelevent and the pipeline transitions to BLOCKED.cli-todo(when seeded with an absurdly tiny budget like 10 s) demonstrates the watchdog firing and the result envelope carryingstop_reason_code="watchdog_wall_clock_exceeded".ooo auto --resume <session_id>continues from where the session left off, with a fresh wall-clock window starting at the resume moment.v2 expansion path (deliberately deferred)
The following are documented as possible future slices, not v1 scope. Each gets opened as its own follow-up issue only when evidence demands it (i.e. a real-world stall slips past wall-clock-only):
idle_timeoutvsno_progress_timeoutdistinction (catches busy-but-stuck LLM loops).WAIT/RETRY/UNSTUCKdirective vocabulary (richer responses than just cancel).material_progress_eventsvsactivity_eventsevent-set config.The discipline: open each only when you can point at a specific recorded stall that wall-clock-only failed to catch.
Decisions awaiting maintainer triage
None. v1 has no BLOCK questions. The only externally-relevant choice (
runtime_controlaggregate compatibility) was confirmed additive via code reading; informational comment posted on #946.Self-audit note (2026-05-22)
The earlier draft was speculative substrate — building infrastructure for stall categories we have not actually observed in
ooo autoruns. Per Ouroboros's minimal-substrate principle, this issue now ships the smallest thing that fixes the observed problem (silent hangs), and documents richer designs as v2 expansion to be triggered by evidence, not anticipation.Known v1 limitations (documented, not blockers)
runtime.watchdog.cancelevent in v1. Resume sees no record; treat as external kill without audit trail.References
ooo auto(L2 lane body, Substrate honesty paragraph).