Skip to content

fix: make watchdog controls replay safe#1207

Open
Q00 wants to merge 3 commits into
mainfrom
fix/watchdog-runtime-controls
Open

fix: make watchdog controls replay safe#1207
Q00 wants to merge 3 commits into
mainfrom
fix/watchdog-runtime-controls

Conversation

@Q00
Copy link
Copy Markdown
Owner

@Q00 Q00 commented May 24, 2026

Summary

  • suppress duplicate runtime.watchdog.cancel events after process restart by consulting EventStore replay when available
  • allow production CLI/MCP composition using load_runtime_controls(None) to honor OUROBOROS_SESSION_WALL_CLOCK_SECONDS
  • keep load_runtime_controls compatible with existing runtime_controls config keys owned by the main config loader

Original PR blocker coverage

Addresses watchdog/runtime-control blockers reported on #1178 and #1189.

Verification

  • uv run pytest tests/unit/runtime/test_controls.py tests/unit/runtime/test_watchdog.py tests/unit/auto/test_pipeline_watchdog_integration.py -q
  • uv run ruff check src/ouroboros/runtime/controls.py src/ouroboros/runtime/watchdog.py tests/unit/runtime/test_controls.py tests/unit/runtime/test_watchdog.py

Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Metadata

| Field | Value |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|---|
| PR | #1207 |
| HEAD checked | 3cd853a87c31b10876fce52f462a76cf0df62a85 |
| Request ID | req_retry_exhausted_new_timeout_1779693693_1207 |
| Review record | 1f2e9ed3-41ad-4b02-9ac7-056a73433b50 |

What Improved

  • Adds an environment override for the auto session wall-clock budget.
  • Makes the watchdog consult persisted cancel events to avoid duplicate restart emissions.
  • Keeps the runtime controls loader compatible with existing config-owned keys.

Issue Requirements

Requirement Status
No linked issue / PR body requirement captured N/A

Prior Findings Status

No prior human or inline review comments were present in the provided artifacts.

Blockers

# File:Line Severity Finding
1 src/ouroboros/runtime/watchdog.py:203 BLOCKING The restart idempotency path suppresses the watchdog decision when a persisted cancel event already exists, which lets an over-budget session continue if the previous process appended runtime.watchdog.cancel but crashed or was cancelled before AutoPipeline._check_watchdog marked and saved the state as BLOCKED. The pipeline treats decision is None as “proceed normally” at src/ouroboros/auto/pipeline.py:1190, so replaying an existing cancel event must still surface a cancellation decision or otherwise cause the pipeline to block.

Follow-up Findings

# File:Line Priority Confidence Suggestion
None None None None None.

Non-blocking Suggestions

| None | None | None | None. |

Test Coverage Notes

  • Reviewed changed unit tests for runtime controls and watchdog behavior, plus existing auto pipeline watchdog integration coverage.
  • Could not execute tests in this environment: pytest is not installed, uv is unavailable, and /usr/bin/python3 -m pytest reports No module named pytest.
  • Existing tests cover duplicate-event suppression, but not the crash/restart boundary where the cancel event exists while auto state was not yet transitioned to BLOCKED.

Design Notes

The direction is reasonable, but restart idempotency crosses the EventStore-to-AutoPipeline state boundary. A persisted cancel event is a terminal control signal, not just a duplicate-append guard.

Design / Roadmap Gate

Affected boundary: Watchdog.check() appends to EventStore, then AutoPipeline._check_watchdog() separately mutates and saves auto state. Because those are not atomic, restart replay must handle the intermediate state where the EventStore has the cancel event but the state file is not BLOCKED. Current code treats that replay as no-op, violating the runtime control contract.

Directional Notes

Maintainer memory emphasized runtime reality over optimistic docs. That shaped focus on whether the claimed replay/idempotency behavior actually preserves cancellation semantics after partial failure.

Test Coverage

  • Reviewed changed unit tests for runtime controls and watchdog behavior, plus existing auto pipeline watchdog integration coverage.
  • Could not execute tests in this environment: pytest is not installed, uv is unavailable, and /usr/bin/python3 -m pytest reports No module named pytest.
  • Existing tests cover duplicate-event suppression, but not the crash/restart boundary where the cancel event exists while auto state was not yet transitioned to BLOCKED.

Merge Recommendation

Do not merge until existing persisted watchdog cancel events cause the resumed pipeline to block instead of proceeding. Add a regression test that pre-populates a cancel event without marking state BLOCKED, then resumes/checks and verifies the session transitions to BLOCKED without appending a duplicate event.

Review-Metadata:
verdict: REQUEST_CHANGES
head_sha: 3cd853a
request_id: req_retry_exhausted_new_timeout_1779693693_1207
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true


Reviewed by ouroboros-agent[bot] via Codex deep analysis

Return a WatchdogDecision when a persisted cancel event already exists so resumed auto sessions still transition to BLOCKED without appending duplicate events. This preserves cancellation semantics across the EventStore/state-save crash window.\n\nServices: shared\nAffected files:\n- src/ouroboros/runtime/watchdog.py\n- tests/unit/auto/test_pipeline_watchdog_integration.py\n- tests/unit/runtime/test_watchdog.py
Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Metadata

| Field | Value |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|---|
| PR | #1207 |
| HEAD checked | 5915c27a758c1932f986f3a679053af4caee74f6 |
| Request ID | req_1779704190_9 |
| Review record | 2684eaa8-fdeb-46c7-a4bc-5121b2560ecc |

What Improved

  • Adds an environment override for the auto session wall-clock budget.
  • Keeps the runtime controls loader compatible with existing config-owned runtime control keys.
  • Adds replay handling and regression coverage for persisted watchdog cancel events when the session state was not saved as BLOCKED.

Issue Requirements

Requirement Status
No linked issue requirement captured N/A

Prior Findings Status

The prior bot blocker is partially addressed: same-budget replay now returns a WatchdogDecision and the pipeline blocks without appending a duplicate event. The concern is maintained in modified form because replay still happens too late in Watchdog.check(); current controls can suppress an already-persisted cancel event before it is read.

Blockers

# File:Line Severity Finding
1 src/ouroboros/runtime/watchdog.py:195 BLOCKING Persisted watchdog cancel events are only replayed after the current process decides the session is still over the current budget. If the first process ran with a lower budget, appended runtime.watchdog.cancel, and crashed before AutoPipeline._check_watchdog() saved BLOCKED, a later resume with the env override removed or raised falls through at elapsed_seconds <= budget and proceeds normally despite the terminal cancel event already in EventStore. The same bypass happens at line 188 if the operator restarts with OUROBOROS_SESSION_WALL_CLOCK_SECONDS=0. Replay needs to check existing cancel events before applying the current disabled/under-budget gates, or otherwise explicitly preserve prior cancellation semantics.

Follow-up Findings

# File:Line Priority Confidence Suggestion
None.

Non-blocking Suggestions

| None. |

Test Coverage Notes

  • Reviewed changed unit tests for runtime controls, watchdog replay, and auto pipeline watchdog integration.
  • Could not execute tests: /usr/bin/python3 -m pytest reports No module named pytest.
  • Coverage now includes the prior same-budget replay gap, but misses the critical restart path where the persisted cancel event exists and current controls are disabled or raised above the current elapsed time.

Design Notes

The direction is sound, but the EventStore replay path must be the source of truth for prior cancellation before evaluating current runtime-control knobs.

Design / Roadmap Gate

Affected boundary: Watchdog.check() persists cancel intent in EventStore, while AutoPipeline._check_watchdog() separately marks state BLOCKED. Because those writes are not atomic, replay must handle stale or changed runtime controls on resume. Current HEAD only replays after current budget checks pass, so an existing cancel event can be ignored when the budget is raised or disabled.

Directional Notes

Maintainer memory emphasized runtime reality over optimistic docs. That shaped review focus on whether persisted cancellation remains terminal across restart, env drift, and partial failure.

Test Coverage

  • Reviewed changed unit tests for runtime controls, watchdog replay, and auto pipeline watchdog integration.
  • Could not execute tests: /usr/bin/python3 -m pytest reports No module named pytest.
  • Coverage now includes the prior same-budget replay gap, but misses the critical restart path where the persisted cancel event exists and current controls are disabled or raised above the current elapsed time.

Merge Recommendation

Do not merge until existing watchdog cancel events are honored before current disabled/under-budget gates. Add a regression test that pre-populates a cancel event created with a lower budget, resumes with a higher or disabled budget, and verifies the pipeline still transitions to BLOCKED without appending a duplicate event.

Review-Metadata:
verdict: REQUEST_CHANGES
head_sha: 5915c27
request_id: req_1779704190_9
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true


Reviewed by ouroboros-agent[bot] via Codex deep analysis

Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: APPROVE

Metadata

| Field | Value |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|---|
| PR | #1207 |
| HEAD checked | 807bdd96057a8878fb85b78301a9e5913cc3b0da |
| Request ID | req_1779710219_24 |
| Review record | acea355e-08d0-457e-b34c-7d25a6e070b1 |

What Improved

  • Adds OUROBOROS_SESSION_WALL_CLOCK_SECONDS as an Auto watchdog wall-clock override.
  • Preserves compatibility with existing config-owned runtime_controls keys.
  • Replays persisted watchdog cancel events before current budget/disable gates, preserving terminal cancellation across restart/control drift.

Issue Requirements

Requirement Status
No linked issue / PR body requirement captured N/A

Prior Findings Status

Prior bot blockers are withdrawn. Current HEAD replays existing runtime.watchdog.cancel events before checking watchdog_enabled or elapsed_seconds <= budget at src/ouroboros/runtime/watchdog.py:193, and the regression tests cover resumes with budget 0 and 10_000 at both watchdog and pipeline boundaries.

Blockers

No in-scope blocking findings remained after policy filtering.

Follow-up Findings

# File:Line Priority Confidence Suggestion
None.

Non-blocking Suggestions

| 1 | src/ouroboros/runtime/controls.py:79 | Documentation | The new public env override is only documented in the loader docstring/tests. Consider adding it to the CLI/config env-var reference so operators can discover it. |

Test Coverage Notes

  • Reviewed changed unit coverage in tests/unit/runtime/test_controls.py, tests/unit/runtime/test_watchdog.py, and tests/unit/auto/test_pipeline_watchdog_integration.py.
  • New coverage includes env override parsing, duplicate event suppression, same-budget replay, and replay before disabled/raised-budget gates.
  • Could not execute tests: python is unavailable and python3 -m pytest reports No module named pytest.

Design Notes

The design now treats the EventStore cancel event as the durable source of truth before evaluating process-local runtime controls, which is the right boundary for restart recovery.

Design / Roadmap Gate

Affected boundary: Watchdog.check() persists cancel intent in EventStore while AutoPipeline._check_watchdog() separately saves BLOCKED state. Current HEAD handles the non-atomic gap by querying existing cancel events first, returning a WatchdogDecision without appending a duplicate, and then letting the pipeline persist BLOCKED. Production EventStore.query_events() matches the replay protocol.

Directional Notes

Maintainer memory emphasized runtime reality over optimistic docs, so review focused on whether persisted cancellation remains terminal across restart, env drift, and partial failure. No blocker remains on that boundary.

Test Coverage

  • Reviewed changed unit coverage in tests/unit/runtime/test_controls.py, tests/unit/runtime/test_watchdog.py, and tests/unit/auto/test_pipeline_watchdog_integration.py.
  • New coverage includes env override parsing, duplicate event suppression, same-budget replay, and replay before disabled/raised-budget gates.
  • Could not execute tests: python is unavailable and python3 -m pytest reports No module named pytest.

Merge Recommendation

Approve. I found no blocking runtime, persistence, or API contract issue in the changed boundary. The only remaining item is discoverability documentation for the new env var.

Review-Metadata:
verdict: APPROVE
head_sha: 807bdd9
request_id: req_1779710219_24
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927
Copy link
Copy Markdown
Collaborator

Merge-readiness rationale (English)

This PR is ready to merge. It is the watchdog/runtime-controls hardening blocker called out on #1178 and #1189, and it took two REQUEST_CHANGES iterations to land the right semantics.

What it does

Three coordinated changes inside the Track B L2 watchdog/runtime-controls surface:

  1. EventStore is the durable source of truth for terminal cancellation. Watchdog.check() now consults EventStore.query_events() for an existing runtime.watchdog.cancel for the session before evaluating any current control knob. If a prior process already persisted a cancel event, the resumed pipeline reconstructs the original WatchdogDecision from the persisted payload and blocks — without appending a duplicate event.
  2. load_runtime_controls(None) honours OUROBOROS_SESSION_WALL_CLOCK_SECONDS. This unblocks production CLI/MCP composition paths that don't pass a controls path, while keeping int() parsing strict (non-integer values raise ValueError).
  3. Compatibility with the existing runtime_controls config block. The loader now accepts (and ignores) the older keys owned by ouroboros.config.loader (mcp_tool_timeout_seconds, generation_*_timeout_seconds, watchdog_poll_seconds) instead of rejecting them, so callers don't have to maintain two parallel YAML schemas.

A pair of runtime_checkable Protocols (_EventAppender, _WatchdogEventReader) keeps the dependency on the EventStore minimal and testable — the watchdog only requires the replay reader when the appender happens to expose query_events.

Why it aligns with the SSOT direction

Why it is not over-engineered

  • 5 files changed, +291/-12. Of those 291 lines, 161 are tests across three modules covering (a) env-override parsing, (b) duplicate-suppression, (c) same-budget replay, (d) restart with budget raised to 10000, and (e) restart with watchdog disabled (budget=0).
  • Two new small helpers (_coerce_datetime, _coerce_int) reconstruct the persisted WatchdogDecision defensively without trusting the on-disk payload type — these are eight lines each and have explicit happy-path / fallback tests.
  • The two runtime_checkable Protocols replace any need for adapter classes; the existing EventStore already implements both shapes.
  • No new public API surface; no new env vars beyond the one stated; no new config keys.

Why the bot's review trajectory matters

The first review caught the right bug: an over-budget process could append runtime.watchdog.cancel and then crash before the pipeline saved BLOCKED, so replaying with the current code would silently proceed. The fix moved replay into the "before any current-knob gate" path. The second review caught a tighter version of the same bug: even with replay in place, an operator who restarted with the budget raised (or the watchdog disabled) would bypass replay because the current-knob gate ran first. The fix at src/ouroboros/runtime/watchdog.py:193 consults the persisted cancel event before watchdog_enabled or elapsed_seconds <= budget. Both regressions are now pinned with explicit tests.

Why it is mergeable

  • reviewDecision: APPROVED on commit 807bdd9.
  • Latest ouroboros-agent verdict: APPROVE — both prior blockers (replay too late; replay bypassed by raised/disabled budget) are explicitly withdrawn against current HEAD.
  • The only non-blocking remark is a discoverability suggestion to document the env var in the CLI/config reference. That's a docs-only follow-up, not a merge gate.
  • All 8 required checks green (Ruff, MyPy, Bridge TS, enforce-envelope, enforce-boundary, Tests 3.12/3.13/3.14).
  • mergeStateStatus: CLEAN, mergeable: MERGEABLE.

Risk assessment

  • Boundary: durable cancel-event ↔ in-memory _fired_sessions set ↔ in-process control knobs. Current HEAD orders them correctly: persisted event wins, then in-memory guard, then current knobs.
  • Replay: explicitly tested. Existing sessions with prior cancel events transition to BLOCKED on resume regardless of the resumed process's budget configuration.
  • Wire format: cancel event payload schema is unchanged; replay reads existing fields with defaults.
  • Backwards compatibility: appenders that do not implement query_events (the runtime_checkable Protocol check fails) skip the replay branch entirely and behave exactly as before. No test infrastructure has to change.

Recommending merge.

@shaun0927
Copy link
Copy Markdown
Collaborator

PR Review Summary

Posted via /oh-my-claudecode:pr-review skill — evidence-backed verdict.

Verdict

Approve

Scope Reviewed

  • PR intent: make the auto-session watchdog restart-safe by treating the persisted runtime.watchdog.cancel event as the durable source of truth for terminal cancellation, regardless of how the resumed process's runtime-control knobs are configured; expose OUROBOROS_SESSION_WALL_CLOCK_SECONDS for load_runtime_controls(None); keep the runtime-controls YAML loader compatible with older keys owned by ouroboros.config.loader.
  • Main changed areas:
    • src/ouroboros/runtime/controls.py (+23/-3) — env override parsing + key-allow-list expansion.
    • src/ouroboros/runtime/watchdog.py (+78/-3) — _WatchdogEventReader Protocol, _replay_existing_decision() helper, and reordered check() so replay runs before any current-knob gate.
    • tests/unit/runtime/test_controls.py (+29/-6), tests/unit/runtime/test_watchdog.py (+82), tests/unit/auto/test_pipeline_watchdog_integration.py (+79).
  • Tests reviewed: env override valid/invalid parsing; controls loader compatibility for additional keys; watchdog duplicate suppression; same-budget replay; replay with budget raised to 10000; replay with OUROBOROS_SESSION_WALL_CLOCK_SECONDS=0; pipeline integration covering both watchdog-disabled and raised-budget restart paths.
  • Checks considered: Ruff Lint, MyPy, Bridge TypeScript, enforce-envelope, enforce-boundary, Tests Python 3.12 / 3.13 / 3.14, ouroboros-agent verdict on 807bdd9.

Blocking Issues

None.

Warnings

None.

Mutation-Test Thinking

  • Likely mutants the tests would kill:
    • Moving the _replay_existing_decision(...) call after the watchdog_enabled check or after the elapsed_seconds <= budget check — would re-introduce both prior blockers and be killed by the disabled-budget and raised-budget restart tests.
    • Removing the isinstance(self._appender, _WatchdogEventReader) guard — would AttributeError for legacy in-memory appenders that don't implement query_events; the existing-appender integration test would fail.
    • Returning None from _replay_existing_decision() when existing is non-empty — same restart tests fail.
    • Forgetting to add the session to _fired_sessions after replay — duplicate-suppression tests would fail because a second check() call in the same process would re-emit a cancel event.
    • Strict-typing event.data.get("fired_at") so a non-ISO string blows up — _coerce_datetime's fallback test covers this.
    • Changing the env-var name or removing the integer cast — test_load_runtime_controls_applies_env_session_budget and the ValueError test kill these.
    • Removing one of the new allowed keys from controls.py — the compatibility test that uses an existing config block fails.
  • Mutants the tests may not catch:
    • A regression where query_events(limit=1) is changed to a larger limit but only the latest event is read — semantically identical, low concern.
    • A persistent cancel event whose event.data is None or missing — event.data.get(...) on a None would raise, but the BaseEvent contract guarantees a dict; the helpers default-coerce on string/int parse failures. Acceptable.
  • Additional tests recommended: none for merge gating. Optional: a test where event.data["fired_at"] is a malformed ISO string to pin _coerce_datetime's None branch explicitly (the production code already falls back to fired_at_default, but the explicit assertion would be cheap).

Complexity / CRAP-style Risk

  • High-risk functions/modules: Watchdog.check() gained one helper call and reorders the guard sequence. _replay_existing_decision() is straight-line: guard on isinstance, await query, coerce-and-return. No nested loops, no shared mutable state beyond _fired_sessions (already in-class).
  • Complexity increase: bounded. The reorder is the most semantically loaded part of the change; it is two lines of moved code and is the whole point of the fix.
  • Test coverage concern: branch coverage on the new replay path is comprehensive — the three "should-block" replay shapes (same budget, raised budget, disabled budget) are all asserted at both the watchdog unit boundary and the pipeline integration boundary.
  • Refactoring recommendation: none. _coerce_datetime and _coerce_int are intentionally tiny defensive helpers; collapsing them would obscure their fallback semantics.

Test Quality Assessment

  • Strong tests: the pipeline-level integration tests (test_pipeline_watchdog_integration.py:+79) assert end-to-end that a session restarts into BLOCKED without appending a duplicate runtime.watchdog.cancel event. This is the contract the original bug violated, and it is now pinned.
  • Weak tests: none observed.
  • Missing edge cases: a multi-session appender where session A is cancelled and session B is over budget in the same process — not directly asserted, but the _fired_sessions set is keyed per session_id and the replay query filters by aggregate_id=session_id, so this is structurally safe.
  • Mocking concerns: tests use small in-test appender stubs (append + query_events); they match the production EventStore's signature exactly. The bot's review confirmed the replay protocol matches EventStore.query_events().

Security / Operational Risk

  • Auth/authz: untouched.
  • Data: read-only against the EventStore for replay; no schema migration.
  • Persistence: cancel event payload schema unchanged. Replay reads fired_at, elapsed_seconds, configured_budget_seconds defensively with fallbacks.
  • Replay: now correct across restart, env drift, and partial-failure. This is the central improvement.
  • Failure mode: if the EventStore itself fails on query_events(), the exception propagates and the watchdog does not silently bypass — fail-closed behavior is preserved at the pipeline level.
  • Observability: cancel events stay singular per session; operators do not see spurious duplicates on resume.

Looks Good

  • The fix correctly identifies the EventStore-vs-in-process-state split as the root cause and chooses durable persistence as the source of truth, which is the right boundary.
  • The runtime_checkable Protocol pair lets the watchdog cooperate with appenders that don't expose query_events (production has it; some tests don't), without runtime type-check explosions.
  • Test coverage spans unit and integration boundaries; both prior bot blockers have explicit regression tests.
  • Env-var integer parsing is strict (raises on non-int), so silent misconfiguration is impossible.
  • The YAML-key compatibility list documents itself in a comment about the dual-loader boundary.
  • All CI green; clean against main.

Final Recommendation

APPROVE. The PR closes both prior REQUEST_CHANGES blockers with explicit regression tests, and the resulting semantics (persisted cancel event is durable, env override is strict, dual loaders coexist) match the SSOT Track B L2 watchdog contract. No blocking findings, no warnings.

Review-Metadata:
verdict: APPROVE
skill: oh-my-claudecode:pr-review
head_sha: 807bdd9

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants