Skip to content

Meta SSOT slice: L2 — Runtime watchdog v1 (minimal, lifts #578) #1172

@shaun0927

Description

@shaun0927

Meta SSOT slice: L2 — Runtime watchdog v1 (minimal)

Implementation cleanup (2026-05-23 KST). L2 is partially implemented: #1178 merged RuntimeControls + wall-clock Watchdog; #1189 merged CLI/MCP/AutoPipeline consumption through runtime.watchdog.cancel. Treat old runtime.watchdog.decision, directive vocabulary, idle/no-progress timers, and subscriber-framework wording as superseded v2 ideas. Remaining known cleanup: #1194.

This issue is the L2 implementation slice of #1157. It is the smallest possible watchdog substrate that satisfies the SSOT invariant "long-running run never stalls without a recorded reason".

Why this is small on purpose

The earlier draft of L2 proposed a three-timer config (idle_timeout / no_progress_timeout / safety_timeout), a four-directive vocabulary (WAIT / RETRY / UNSTUCK / CANCEL), a subscriber pattern for cooperative cancellation, an activity_events vs material_progress_events event-set distinction, and a deprecation roadmap for ad-hoc timeouts spread across MCP / auto / Ralph layers. That was 6× the substrate the actual problem requires.

The actual problem is small: ooo auto sessions sometimes hang silently, and operators cannot tell why. The minimal substrate that fixes this:

  • One timer per session (wall-clock).
  • One outcome: when it fires, the session is cancelled and the EventStore records why.
  • That's it.

Once v1 lands and we have evidence of which kinds of stalls slip through wall-clock-only (e.g. busy-loop without actual progress), the more refined timer types and directive vocabulary can be added as their own slices.

Substrate honesty

L2 v1 introduces one new EventStore event family: runtime.watchdog.cancel. This is the only new substrate in #1157. Per the Substrate honesty note at the top of #1157, every other lane lights up existing substrate; L2 adds exactly this.

The earlier draft also proposed introducing a runtime_control aggregate type on the projection. Confirmed additive via code reading of src/ouroboros/persistence/schema.pyaggregate_type is a free-form VARCHAR(100), no enum constraint, no Python allowlist. See the informational comment posted on #946.

Minimal contract

One config knob

runtime_controls:
  session_wall_clock_seconds: 14400   # 4h default; configurable per goal

That's the entire config. No idle_timeout, no no_progress_timeout, no safety_timeout — until evidence shows they're needed.

One event

{
  "aggregate_type": "runtime_control",
  "aggregate_id": "<auto_session_id>",
  "event_kind": "runtime.watchdog.cancel",
  "reason": "wall_clock_exceeded",       # only reason for v1
  "session_started_at": "<iso8601>",
  "fired_at": "<iso8601>",
  "elapsed_seconds": int,
  "configured_budget_seconds": int,
}

One stop_reason_code

Adds watchdog_wall_clock_exceeded to the existing 8-code taxonomy (L4), bringing it to 9 codes. The auto pipeline catches the event, transitions to BLOCKED with this code, and returns a resumable auto_session_id per existing L4 semantics.

Resume semantics

The watchdog reads session_started_at from the auto session state on every tick. If now - session_started_at > budget, fire. This means timer state naturally persists across resume without a separate serialization step — the original session start time is already in AutoPipelineState.

Sub-PR breakdown

  1. L2-a#578 body promotion: write the implementation-ready RFC sections answering the irreducible 3 questions (when fires, what records, how resume behaves). Remove needs-design label. ~0 LoC code. Defers the broader directive vocabulary / idle vs progress distinction to a documented v2 expansion path inside Agent OS: define RuntimeControls watchdog contract #578.
  2. L2-bsrc/ouroboros/runtime/watchdog.py (new): single-timer watchdog. ~100 LoC + tests. Runs in the auto pipeline's main loop (no separate process). Reads session_started_at from AutoPipelineState, checks against session_wall_clock_seconds, fires via EventStore.append().
  3. L2-c — auto pipeline integration: interview_driver / evolve / ralph_handoff subscribe to runtime.watchdog.cancel events on their session's aggregate; on receipt, they exit cleanly and let the pipeline transition to BLOCKED with stop_reason_code=watchdog_wall_clock_exceeded. ~50 LoC + integration test.

Total ~150 LoC. The earlier ~700 LoC plan included infrastructure for problems we have not yet observed.

Acceptance criteria

  • L2-a: #578 body lifted to RFC; the 3 irreducible questions answered; broader directive vocabulary documented as v2 expansion path.
  • L2-b: a session that exceeds session_wall_clock_seconds produces exactly one runtime.watchdog.cancel event and the pipeline transitions to BLOCKED.
  • L2-c: existing L0 canonical scenario cli-todo (when seeded with an absurdly tiny budget like 10 s) demonstrates the watchdog firing and the result envelope carrying stop_reason_code="watchdog_wall_clock_exceeded".
  • Resume after watchdog cancel succeeds: a new ooo auto --resume <session_id> continues from where the session left off, with a fresh wall-clock window starting at the resume moment.

v2 expansion path (deliberately deferred)

The following are documented as possible future slices, not v1 scope. Each gets opened as its own follow-up issue only when evidence demands it (i.e. a real-world stall slips past wall-clock-only):

  • idle_timeout vs no_progress_timeout distinction (catches busy-but-stuck LLM loops).
  • WAIT / RETRY / UNSTUCK directive vocabulary (richer responses than just cancel).
  • material_progress_events vs activity_events event-set config.
  • Subscriber pattern for cooperative cancellation (when AgentProcess M6 AgentProcess lifecycle (lite) — spawn/pause/resume/cancel/replay #518 lands).
  • Per-layer ad-hoc timeout deprecation (MCP transport / Ralph generation / evolve step).

The discipline: open each only when you can point at a specific recorded stall that wall-clock-only failed to catch.

Decisions awaiting maintainer triage

None. v1 has no BLOCK questions. The only externally-relevant choice (runtime_control aggregate compatibility) was confirmed additive via code reading; informational comment posted on #946.

Self-audit note (2026-05-22)

The earlier draft was speculative substrate — building infrastructure for stall categories we have not actually observed in ooo auto runs. Per Ouroboros's minimal-substrate principle, this issue now ships the smallest thing that fixes the observed problem (silent hangs), and documents richer designs as v2 expansion to be triggered by evidence, not anticipation.

Known v1 limitations (documented, not blockers)

  • Wall-clock-only stall detection. A busy-but-stuck session (LLM looping without progress) burns through wall-clock budget naturally; v1 catches it eventually but not crisply. v2 idle/progress distinction would catch it earlier — open when evidence shows v1 is too crude.
  • No external-signal handling. SIGTERM / Ctrl+C does not produce a runtime.watchdog.cancel event in v1. Resume sees no record; treat as external kill without audit trail.
  • No cooperative cancellation handshake. v1 emits the event and exits; downstream cleanup is whatever the existing pipeline does on BLOCKED. AgentProcess cooperative cancellation (M6 AgentProcess lifecycle (lite) — spawn/pause/resume/cancel/replay #518) is its own substrate.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    OSCore engine, state machine, internal pipeline, and system-level behaviorenhancementNew feature or meaningful improvementneeds-designMulti-PR epic or architectural change, needs human planning

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions