2026-05-22 status update. This issue is lifted from needs-design
to implementation-ready v1 RFC as part of the #1157 minimal-substrate
redesign. The v1 contract below is intentionally narrow; richer
designs originally proposed in this issue's earlier draft are
preserved as a v2 expansion path and become slices only when
evidence demands them.
Implementation roadmap tracked in #1172. Wave 1: ooo auto SSOT (#1157).
v1 contract (implement now)
One timer
runtime_controls:
  session_wall_clock_seconds: 14400  # 4h default; configurable per goal
Per-session wall-clock budget only. No idle_timeout, no no_progress_timeout, no safety_timeout — until evidence demands them.
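A minimal sketch of how the single setting might be read in Python, assuming the config above has already been parsed into a plain mapping. The `RuntimeControls` dataclass shape, the `from_config` helper, and the positive-integer validation are assumptions for illustration, not part of the v1 contract.

```python
from dataclasses import dataclass
from typing import Any, Mapping

DEFAULT_SESSION_WALL_CLOCK_SECONDS = 14400  # 4h default from the v1 contract


@dataclass(frozen=True)
class RuntimeControls:
    """Hypothetical holder for the single v1 timer setting."""

    session_wall_clock_seconds: int = DEFAULT_SESSION_WALL_CLOCK_SECONDS

    @classmethod
    def from_config(cls, config: Mapping[str, Any]) -> "RuntimeControls":
        """Build from a parsed config mapping; a missing key uses the default."""
        section = config.get("runtime_controls") or {}
        seconds = section.get(
            "session_wall_clock_seconds", DEFAULT_SESSION_WALL_CLOCK_SECONDS
        )
        # Assumption: a non-integer or non-positive budget is a config error.
        if isinstance(seconds, bool) or not isinstance(seconds, int) or seconds <= 0:
            raise ValueError("session_wall_clock_seconds must be a positive integer")
        return cls(session_wall_clock_seconds=seconds)
```

Keeping the default in one constant means a goal that omits the key and a goal that sets it explicitly go through the same validation path.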
One event family
class RuntimeWatchdogCancelEvent:
    """Recorded when a session's wall-clock budget is exceeded."""
    aggregate_type: str = "runtime_control"
    aggregate_id: str  # the auto_session_id
    event_kind: str = "runtime.watchdog.cancel"
    reason: str = "wall_clock_exceeded"  # only reason for v1
    session_started_at: datetime
    fired_at: datetime
    elapsed_seconds: int
    configured_budget_seconds: int
The runtime_control aggregate type is confirmed additive to the
projection v1 vocabulary by reading src/ouroboros/persistence/schema.py
(aggregate_type is a free-form VARCHAR(100), no enum constraint, no
Python allowlist). Informational confirmation posted on #946.
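As an illustration of the event fields listed above, the sketch below assembles them into a plain dict. The helper name `build_watchdog_cancel_event` and the dict representation are assumptions; the actual EventStore payload format is not specified in this issue.

```python
from datetime import datetime, timedelta, timezone
from typing import Any


def build_watchdog_cancel_event(
    auto_session_id: str,
    session_started_at: datetime,
    fired_at: datetime,
    configured_budget_seconds: int,
) -> dict[str, Any]:
    """Assemble the v1 event fields as a plain dict (hypothetical helper)."""
    # Both datetimes must be timezone-aware (or both naive) for the subtraction.
    elapsed = int((fired_at - session_started_at).total_seconds())
    return {
        "aggregate_type": "runtime_control",
        "aggregate_id": auto_session_id,  # the auto_session_id
        "event_kind": "runtime.watchdog.cancel",
        "reason": "wall_clock_exceeded",  # only reason for v1
        "session_started_at": session_started_at,
        "fired_at": fired_at,
        "elapsed_seconds": elapsed,
        "configured_budget_seconds": configured_budget_seconds,
    }


started = datetime(2026, 5, 22, 9, 0, tzinfo=timezone.utc)
event = build_watchdog_cancel_event(
    "session-123", started, started + timedelta(seconds=14405), 14400
)
```

Here `"session-123"` is a made-up identifier; `elapsed_seconds` is derived from the two timestamps rather than tracked separately.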
One new stop_reason_code
Adds watchdog_wall_clock_exceeded to the existing 8-code taxonomy
(#1167 added the previous 8th), bringing it to 9 codes. The auto
pipeline catches the event, transitions to BLOCKED with this code, and
returns a resumable auto_session_id per existing L4 semantics.
Resume semantics
Timer state is implicit in AutoPipelineState.session_started_at.
On every watchdog tick: read session_started_at from the auto
session state, compare against now. No separate serialization.
On resume, session_started_at is preserved by the existing state
restore path — the watchdog naturally sees the real elapsed time
since the original start.
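The tick described above reduces to one comparison. A minimal sketch, assuming `session_started_at` and `now` are both timezone-aware datetimes; the function name and the strict "greater than" reading of "exceeded" are assumptions.

```python
from datetime import datetime, timedelta, timezone


def wall_clock_exceeded(
    session_started_at: datetime, now: datetime, budget_seconds: int
) -> bool:
    """Return True once elapsed time since session start exceeds the budget."""
    elapsed_seconds = (now - session_started_at).total_seconds()
    # Assumption: landing exactly on the budget does not yet count as exceeded.
    return elapsed_seconds > budget_seconds
```

Because the check is a pure function of the stored start time and the current time, no timer state needs to be serialized between ticks.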
Cancellation behavior
When the watchdog fires:
Append a runtime.watchdog.cancel event to the EventStore.
The auto pipeline's main loop polls for this event on its own
session aggregate at each phase boundary; on observing the event,
it exits cleanly via the existing BLOCKED transition with stop_reason_code="watchdog_wall_clock_exceeded".
[…] handshake with AgentProcess (M6 AgentProcess lifecycle (lite) — spawn/pause/resume/cancel/replay #518); those are v2 expansion items.
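The append-then-poll flow above can be sketched with an in-memory stand-in for the EventStore. Everything here is hypothetical scaffolding: `InMemoryEventStore`, `run_phases`, the injected `clock`, and the `"COMPLETED"` status are assumptions made so the example is self-contained; only the event kind, the BLOCKED status, and the stop reason code come from the contract.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, Iterable

WATCHDOG_EVENT_KIND = "runtime.watchdog.cancel"
WATCHDOG_STOP_REASON = "watchdog_wall_clock_exceeded"


@dataclass
class InMemoryEventStore:
    """Hypothetical stand-in for the EventStore; keeps events in a list."""

    events: list[dict] = field(default_factory=list)

    def append(self, event: dict) -> None:
        self.events.append(event)

    def has_event(self, aggregate_id: str, event_kind: str) -> bool:
        return any(
            e["aggregate_id"] == aggregate_id and e["event_kind"] == event_kind
            for e in self.events
        )


def run_phases(
    auto_session_id: str,
    session_started_at: datetime,
    budget_seconds: int,
    phases: Iterable[Callable[[], None]],
    store: InMemoryEventStore,
    clock: Callable[[], datetime],
) -> dict:
    """Run phases in order, checking the watchdog at each phase boundary."""
    for phase in phases:
        now = clock()
        elapsed = int((now - session_started_at).total_seconds())
        already_fired = store.has_event(auto_session_id, WATCHDOG_EVENT_KIND)
        # Watchdog tick: record the cancel event once the budget is exceeded.
        if elapsed > budget_seconds and not already_fired:
            store.append(
                {
                    "aggregate_type": "runtime_control",
                    "aggregate_id": auto_session_id,
                    "event_kind": WATCHDOG_EVENT_KIND,
                    "reason": "wall_clock_exceeded",
                    "elapsed_seconds": elapsed,
                    "configured_budget_seconds": budget_seconds,
                }
            )
        # Phase-boundary poll: exit via BLOCKED if the event is present.
        if store.has_event(auto_session_id, WATCHDOG_EVENT_KIND):
            return {
                "status": "BLOCKED",
                "stop_reason_code": WATCHDOG_STOP_REASON,
                "auto_session_id": auto_session_id,
            }
        phase()
    return {"status": "COMPLETED", "stop_reason_code": None,
            "auto_session_id": auto_session_id}
```

Injecting `clock` keeps the loop deterministic in tests. Because the check only runs at phase boundaries, a long phase can overrun the budget before the cancel is observed.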
Acceptance criteria
runtime.watchdog.cancel event appears in the EventStore when a
session exceeds session_wall_clock_seconds.
The auto pipeline transitions to BLOCKED with stop_reason_code="watchdog_wall_clock_exceeded" after the event.
Resume after watchdog cancel starts a fresh wall-clock window
from the resume moment (the original session_started_at is not
carried forward — that would auto-cancel re-attached sessions).
The watchdog is implemented in the auto pipeline's main loop;
no separate process or thread.
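The resume criterion above could be exercised with a sketch like the following. It takes one possible reading: a resume after a watchdog cancel resets `session_started_at` to the resume moment, while any other resume keeps the original start as described under Resume semantics. The trimmed-down `AutoPipelineState` fields and the `resume_state` helper are assumptions for illustration; the real state class has more to it.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Optional

WATCHDOG_STOP_REASON = "watchdog_wall_clock_exceeded"


@dataclass(frozen=True)
class AutoPipelineState:
    """Hypothetical minimal slice of the auto session state."""

    auto_session_id: str
    session_started_at: datetime
    stop_reason_code: Optional[str] = None


def resume_state(state: AutoPipelineState, now: datetime) -> AutoPipelineState:
    """Return the state to continue from when a session is resumed at `now`."""
    if state.stop_reason_code == WATCHDOG_STOP_REASON:
        # Fresh wall-clock window: otherwise the next tick would cancel again.
        return replace(state, session_started_at=now, stop_reason_code=None)
    # Assumption: other resumes keep the original start time untouched.
    return state
```

Without the reset, a session resumed hours after a watchdog cancel would already be over budget on its first tick.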
v2 expansion path (deferred — open as separate slices only when evidence demands)
The following were considered and deferred in v1. Each becomes its
own slice only when a real-world stall slips past v1 wall-clock:
Multi-timer config: idle_timeout (no activity at all for N
seconds), no_progress_timeout (activity but no material progress
for M seconds), safety_timeout (hard wall-clock cap). Catches
busy-but-stuck LLM loops that wall-clock alone catches too late.
4-directive vocabulary: WAIT (poll again later), RETRY
(re-issue the last operation), UNSTUCK (hand off to L5 escalation), CANCEL (terminate). v1's single cancel outcome maps to CANCEL.
material_progress_events vs activity_events event-set
config: explicit per-event-kind classification so timer resets are
precise rather than implicit.
[…] to known subscribers; AgentProcess (M6 AgentProcess lifecycle (lite) — spawn/pause/resume/cancel/replay #518) becomes a subscriber
when it lands.
Per-layer ad-hoc timeout deprecation: MCP transport
(mcp_tool_timeout_seconds), Ralph generation cap, evolve_step
timeout all become transport-level only, with logical run-duration
enforcement moved to RuntimeControls.
Each item above was a real concern in the earlier draft. The reason
they are deferred: they solve classes of problems we have not yet
observed in ooo auto runs. When such a class is observed and
documented (an actual stall that wall-clock-only failed to catch
crisply), open a follow-up issue referencing this one and the
observed stall.
Known v1 limitations (documented, not blockers)
Wall-clock-only stall detection. A busy-but-stuck session (LLM
looping without progress) burns wall-clock naturally and is caught,
but not until the budget expires. Earlier crisp detection moves to
v2's idle/progress timer split.
No external-signal handling. SIGTERM / Ctrl+C does not produce
a runtime.watchdog.cancel event in v1. Resume sees no record;
treat external kill as audit-free for now.
[…] exits; downstream cleanup is whatever the existing pipeline does on
BLOCKED. AgentProcess cooperative cancellation (M6 AgentProcess lifecycle (lite) — spawn/pause/resume/cancel/replay #518) is its own
substrate.
Agent OS: RuntimeControls watchdog v1 (minimum) + v2 expansion path
References
[…] ooo auto (L2 lane body).
[…] than replaces.
[…] runtime_control is additive posted on the thread.
[…] deferred to v2 expansion).
[…] per the "Scope of the Tier system" paragraph).