Skip to content

Agent OS: define RuntimeControls watchdog contract #578

@Q00

Description

@Q00

Agent OS: RuntimeControls watchdog v1 (minimum) + v2 expansion path

2026-05-22 status update. This issue is lifted from needs-design
to implementation-ready v1 RFC as part of the #1157 minimal-substrate
redesign. The v1 contract below is intentionally narrow; richer
designs originally proposed in this issue's earlier draft are
preserved as a v2 expansion path and become slices only when
evidence demands them.

Implementation roadmap tracked in #1172. Wave 1: ooo auto SSOT (#1157).


v1 contract (implement now)

One timer

runtime_controls:
  session_wall_clock_seconds: 14400   # 4h default; configurable per goal

Per-session wall-clock budget only. No idle_timeout, no
no_progress_timeout, no safety_timeout — until evidence demands them.

One event family

class RuntimeWatchdogCancelEvent:
    """Recorded when a session's wall-clock budget is exceeded."""

    aggregate_type: str = "runtime_control"
    aggregate_id: str   # the auto_session_id
    event_kind: str = "runtime.watchdog.cancel"
    reason: str = "wall_clock_exceeded"   # only reason for v1
    session_started_at: datetime
    fired_at: datetime
    elapsed_seconds: int
    configured_budget_seconds: int

The runtime_control aggregate type is confirmed additive to the
projection v1 vocabulary by reading src/ouroboros/persistence/schema.py
(aggregate_type is a free-form VARCHAR(100), no enum constraint, no
Python allowlist). Informational confirmation posted on #946.

One new stop_reason_code

Adds watchdog_wall_clock_exceeded to the existing 8-code taxonomy
(#1167 added the previous 8th), bringing it to 9 codes. The auto
pipeline catches the event, transitions to BLOCKED with this code, and
returns a resumable auto_session_id per existing L4 semantics.

Resume semantics

Timer state is implicit in AutoPipelineState.session_started_at.
On every watchdog tick: read session_started_at from the auto
session state, compare against now. No separate serialization.
On resume, session_started_at is preserved by the existing state
restore path — the watchdog naturally sees the real elapsed time
since the original start.

Cancellation behavior

When the watchdog fires:

  1. Append runtime.watchdog.cancel event to EventStore.
  2. The auto pipeline's main loop polls for this event on its own
    session aggregate at each phase boundary; on observing the event,
    it exits cleanly via the existing BLOCKED transition with
    stop_reason_code="watchdog_wall_clock_exceeded".
  3. No subscriber-pattern dispatcher; no cooperative-cancellation
    handshake with AgentProcess (M6 AgentProcess lifecycle (lite) — spawn/pause/resume/cancel/replay #518) — those are v2 expansion items.

Acceptance criteria

  • runtime.watchdog.cancel event appears in the EventStore when a
    session exceeds session_wall_clock_seconds.
  • The auto pipeline transitions to BLOCKED with
    stop_reason_code="watchdog_wall_clock_exceeded" after the event.
  • Resume after watchdog cancel starts a fresh wall-clock window
    from the resume moment (the original session_started_at is not
    carried forward — that would auto-cancel re-attached sessions).
  • The watchdog is implemented in the auto pipeline's main loop;
    no separate process or thread.

v2 expansion path (deferred — open as separate slices only when evidence demands)

The following were considered and deferred in v1. Each becomes its
own slice only when a real-world stall slips past v1 wall-clock:

  • Multi-timer config: idle_timeout (no activity at all for N
    seconds), no_progress_timeout (activity but no material progress
    for M seconds), safety_timeout (hard wall-clock cap). Catches
    busy-but-stuck LLM loops that wall-clock alone catches too late.
  • 4-directive vocabulary: WAIT (poll again later), RETRY
    (re-issue the last operation), UNSTUCK (hand off to L5 escalation),
    CANCEL (terminate). v1's single cancel outcome maps to CANCEL.
  • material_progress_events vs activity_events event-set
    config: explicit per-event-kind classification so timer resets are
    precise rather than implicit.
  • Subscriber-pattern dispatcher: the watchdog emits a directive
    to known subscribers; AgentProcess (M6 AgentProcess lifecycle (lite) — spawn/pause/resume/cancel/replay #518) becomes a subscriber
    when it lands.
  • Per-layer ad-hoc timeout deprecation: MCP transport
    (mcp_tool_timeout_seconds), Ralph generation cap, evolve_step
    timeout all become transport-level only, with logical run-duration
    enforcement moved to RuntimeControls.

Each item above was a real concern in the earlier draft. The reason
they are deferred: they solve classes of problems we have not yet
observed in ooo auto runs.
When such a class is observed and
documented (an actual stall that wall-clock-only failed to catch
crisply), open a follow-up issue referencing this one and the
observed stall.


Known v1 limitations (documented, not blockers)

  • Wall-clock-only stall detection. A busy-but-stuck session (LLM
    looping without progress) burns wall-clock naturally and is caught,
    but not until the budget expires. Earlier crisp detection moves to
    v2's idle/progress timer split.
  • No external-signal handling. SIGTERM / Ctrl+C does not produce
    a runtime.watchdog.cancel event in v1. Resume sees no record;
    treat external kill as audit-free for now.
  • No cooperative cancellation handshake. v1 emits the event and
    exits; downstream cleanup is whatever the existing pipeline does on
    BLOCKED. AgentProcess cooperative cancellation (M6 AgentProcess lifecycle (lite) — spawn/pause/resume/cancel/replay #518) is its own
    substrate.

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or meaningful improvement

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions