fix: make watchdog controls replay safe #1207
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Metadata
| Field | Value |
|---|---|
| PR | #1207 |
| HEAD checked | 3cd853a87c31b10876fce52f462a76cf0df62a85 |
| Request ID | req_retry_exhausted_new_timeout_1779693693_1207 |
| Review record | 1f2e9ed3-41ad-4b02-9ac7-056a73433b50 |
### Recovery Notes
First recoverable review artifact generated from codex analysis log.
What Improved
- Adds an environment override for the auto session wall-clock budget.
- Makes the watchdog consult persisted cancel events to avoid duplicate restart emissions.
- Keeps the runtime controls loader compatible with existing config-owned keys.
Issue Requirements
| Requirement | Status |
|---|---|
| No linked issue / PR body requirement captured | N/A |
Prior Findings Status
No prior human or inline review comments were present in the provided artifacts.
Blockers
| # | File:Line | Severity | Finding |
|---|---|---|---|
| 1 | src/ouroboros/runtime/watchdog.py:203 | BLOCKING | The restart idempotency path suppresses the watchdog decision when a persisted cancel event already exists, which lets an over-budget session continue if the previous process appended runtime.watchdog.cancel but crashed or was cancelled before AutoPipeline._check_watchdog marked and saved the state as BLOCKED. The pipeline treats decision is None as “proceed normally” at src/ouroboros/auto/pipeline.py:1190, so replaying an existing cancel event must still surface a cancellation decision or otherwise cause the pipeline to block. |
Follow-up Findings
| # | File:Line | Priority | Confidence | Suggestion |
|---|---|---|---|---|
| None | None | None | None | None. |
Non-blocking Suggestions
| None | None | None | None. |
Test Coverage Notes
- Reviewed changed unit tests for runtime controls and watchdog behavior, plus existing auto pipeline watchdog integration coverage.
- Could not execute tests in this environment: `pytest` is not installed, `uv` is unavailable, and `/usr/bin/python3 -m pytest` reports `No module named pytest`.
- Existing tests cover duplicate-event suppression, but not the crash/restart boundary where the cancel event exists while auto state was not yet transitioned to BLOCKED.
Design Notes
The direction is reasonable, but restart idempotency crosses the EventStore-to-AutoPipeline state boundary. A persisted cancel event is a terminal control signal, not just a duplicate-append guard.
Design / Roadmap Gate
Affected boundary: Watchdog.check() appends to EventStore, then AutoPipeline._check_watchdog() separately mutates and saves auto state. Because those are not atomic, restart replay must handle the intermediate state where the EventStore has the cancel event but the state file is not BLOCKED. Current code treats that replay as no-op, violating the runtime control contract.
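The intermediate state described above can be modeled with a toy sketch. Everything in it (`FakeEventStore`, `check_noop_replay`, `check_with_replay`, the tuple-shaped events, the string `"cancel"` standing in for a decision object) is hypothetical and is not the project's code; it only illustrates why a replay that returns `None` lets an over-budget session proceed.

```python
from dataclasses import dataclass, field

CANCEL = "runtime.watchdog.cancel"


@dataclass
class FakeEventStore:
    """Hypothetical in-memory stand-in for the EventStore."""

    events: list = field(default_factory=list)

    def append(self, session_id: str, event_type: str) -> None:
        self.events.append((session_id, event_type))

    def has_cancel(self, session_id: str) -> bool:
        return (session_id, CANCEL) in self.events


def check_noop_replay(store, session_id, elapsed, budget):
    """Flagged behavior: an existing cancel event suppresses the decision."""
    if elapsed <= budget:
        return None
    if store.has_cancel(session_id):
        return None  # the pipeline reads None as "proceed normally"
    store.append(session_id, CANCEL)
    return "cancel"


def check_with_replay(store, session_id, elapsed, budget):
    """Requested behavior: surface the decision again, but append only once."""
    if elapsed <= budget:
        return None
    if not store.has_cancel(session_id):
        store.append(session_id, CANCEL)
    return "cancel"


# Simulate a crash after the append but before state was saved as BLOCKED.
store = FakeEventStore()
store.append("s1", CANCEL)

noop = check_noop_replay(store, "s1", elapsed=120, budget=60)
replayed = check_with_replay(store, "s1", elapsed=120, budget=60)
```

In this model the first variant yields `None` on resume, while the second yields a cancellation without growing the event list.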
Directional Notes
Maintainer memory emphasized runtime reality over optimistic docs. That shaped focus on whether the claimed replay/idempotency behavior actually preserves cancellation semantics after partial failure.
Merge Recommendation
Do not merge until existing persisted watchdog cancel events cause the resumed pipeline to block instead of proceeding. Add a regression test that pre-populates a cancel event without marking state BLOCKED, then resumes/checks and verifies the session transitions to BLOCKED without appending a duplicate event.
Review-Metadata:
verdict: REQUEST_CHANGES
head_sha: 3cd853a
request_id: req_retry_exhausted_new_timeout_1779693693_1207
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Return a WatchdogDecision when a persisted cancel event already exists so resumed auto sessions still transition to BLOCKED without appending duplicate events. This preserves cancellation semantics across the EventStore/state-save crash window.
Services: shared
Affected files:
- src/ouroboros/runtime/watchdog.py
- tests/unit/auto/test_pipeline_watchdog_integration.py
- tests/unit/runtime/test_watchdog.py
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Metadata
| Field | Value |
|---|---|
| PR | #1207 |
| HEAD checked | 5915c27a758c1932f986f3a679053af4caee74f6 |
| Request ID | req_1779704190_9 |
| Review record | 2684eaa8-fdeb-46c7-a4bc-5121b2560ecc |
### Recovery Notes
First recoverable review artifact generated from codex analysis log.
What Improved
- Adds an environment override for the auto session wall-clock budget.
- Keeps the runtime controls loader compatible with existing config-owned runtime control keys.
- Adds replay handling and regression coverage for persisted watchdog cancel events when the session state was not saved as BLOCKED.
Issue Requirements
| Requirement | Status |
|---|---|
| No linked issue requirement captured | N/A |
Prior Findings Status
The prior bot blocker is partially addressed: same-budget replay now returns a WatchdogDecision and the pipeline blocks without appending a duplicate event. The concern is maintained in modified form because replay still happens too late in Watchdog.check(); current controls can suppress an already-persisted cancel event before it is read.
Blockers
| # | File:Line | Severity | Finding |
|---|---|---|---|
| 1 | src/ouroboros/runtime/watchdog.py:195 | BLOCKING | Persisted watchdog cancel events are only replayed after the current process decides the session is still over the current budget. If the first process ran with a lower budget, appended runtime.watchdog.cancel, and crashed before AutoPipeline._check_watchdog() saved BLOCKED, a later resume with the env override removed or raised falls through at elapsed_seconds <= budget and proceeds normally despite the terminal cancel event already in EventStore. The same bypass happens at line 188 if the operator restarts with OUROBOROS_SESSION_WALL_CLOCK_SECONDS=0. Replay needs to check existing cancel events before applying the current disabled/under-budget gates, or otherwise explicitly preserve prior cancellation semantics. |
Follow-up Findings
| # | File:Line | Priority | Confidence | Suggestion |
|---|---|---|---|---|
| None. |
Non-blocking Suggestions
| None. |
Test Coverage Notes
- Reviewed changed unit tests for runtime controls, watchdog replay, and auto pipeline watchdog integration.
- Could not execute tests: `/usr/bin/python3 -m pytest` reports `No module named pytest`.
- Coverage now includes the prior same-budget replay gap, but misses the critical restart path where the persisted cancel event exists and current controls are disabled or raised above the current elapsed time.
Design Notes
The direction is sound, but the EventStore replay path must be the source of truth for prior cancellation before evaluating current runtime-control knobs.
Design / Roadmap Gate
Affected boundary: Watchdog.check() persists cancel intent in EventStore, while AutoPipeline._check_watchdog() separately marks state BLOCKED. Because those writes are not atomic, replay must handle stale or changed runtime controls on resume. Current HEAD only replays after current budget checks pass, so an existing cancel event can be ignored when the budget is raised or disabled.
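The ordering problem can be illustrated with a small sketch in which replay runs ahead of the process-local gates. The function name, its parameters, the plain list used as an event log, and the `"cancel"` return value are all assumptions made for illustration; the real `Watchdog.check()` signature is not shown in this review.

```python
from typing import Optional

CANCEL = "runtime.watchdog.cancel"


def check(events: list, enabled: bool, elapsed: float, budget: float) -> Optional[str]:
    """Hypothetical check that consults durable state before current controls."""
    # 1. Durable state first: a previously persisted cancel stays terminal.
    if CANCEL in events:
        return "cancel"  # replayed, nothing appended
    # 2. Only then apply the current, possibly changed, runtime controls.
    if not enabled:
        return None
    if elapsed <= budget:
        return None
    events.append(CANCEL)
    return "cancel"


# A prior process persisted the cancel, then crashed before BLOCKED was saved.
events = [CANCEL]
after_disable = check(events, enabled=False, elapsed=30, budget=0)
after_raise = check(events, enabled=True, elapsed=30, budget=10_000)

# A fresh session with no prior cancel still honors the current budget.
fresh: list = []
under_budget = check(fresh, enabled=True, elapsed=30, budget=10_000)
```

Swapping steps 1 and 2 reproduces the bypass described in the blocker: the disabled or under-budget gate returns `None` before the persisted event is ever read.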
Directional Notes
Maintainer memory emphasized runtime reality over optimistic docs. That shaped review focus on whether persisted cancellation remains terminal across restart, env drift, and partial failure.
Merge Recommendation
Do not merge until existing watchdog cancel events are honored before current disabled/under-budget gates. Add a regression test that pre-populates a cancel event created with a lower budget, resumes with a higher or disabled budget, and verifies the pipeline still transitions to BLOCKED without appending a duplicate event.
Review-Metadata:
verdict: REQUEST_CHANGES
head_sha: 5915c27
request_id: req_1779704190_9
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Review — ouroboros-agent[bot]
Verdict: APPROVE
Metadata
| Field | Value |
|---|---|
| PR | #1207 |
| HEAD checked | 807bdd96057a8878fb85b78301a9e5913cc3b0da |
| Request ID | req_1779710219_24 |
| Review record | acea355e-08d0-457e-b34c-7d25a6e070b1 |
### Recovery Notes
First recoverable review artifact generated from codex analysis log.
What Improved
- Adds `OUROBOROS_SESSION_WALL_CLOCK_SECONDS` as an Auto watchdog wall-clock override.
- Preserves compatibility with existing config-owned `runtime_controls` keys.
- Replays persisted watchdog cancel events before current budget/disable gates, preserving terminal cancellation across restart/control drift.
Issue Requirements
| Requirement | Status |
|---|---|
| No linked issue / PR body requirement captured | N/A |
Prior Findings Status
Prior bot blockers are withdrawn. Current HEAD replays existing runtime.watchdog.cancel events before checking watchdog_enabled or elapsed_seconds <= budget at src/ouroboros/runtime/watchdog.py:193, and the regression tests cover resumes with budget 0 and 10_000 at both watchdog and pipeline boundaries.
Blockers
No in-scope blocking findings remained after policy filtering.
Follow-up Findings
| # | File:Line | Priority | Confidence | Suggestion |
|---|---|---|---|---|
| None. |
Non-blocking Suggestions
| 1 | src/ouroboros/runtime/controls.py:79 | Documentation | The new public env override is only documented in the loader docstring/tests. Consider adding it to the CLI/config env-var reference so operators can discover it. |
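For readers looking for the shape of such an override, a minimal sketch of strict environment parsing follows. Only the variable name comes from this PR; the helper name `wall_clock_override`, the integer-only rule, the rejection of negative values, and the treatment of an unset or empty variable as "no override" are assumptions, and the actual loader in `controls.py` may differ.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "OUROBOROS_SESSION_WALL_CLOCK_SECONDS"


def wall_clock_override(environ: Mapping[str, str] = os.environ) -> Optional[int]:
    """Hypothetical strict parser: None when unset, ValueError when malformed."""
    raw = environ.get(ENV_VAR)
    if raw is None or raw.strip() == "":
        return None  # no override; fall back to configured controls
    try:
        seconds = int(raw.strip())
    except ValueError:
        raise ValueError(f"{ENV_VAR} must be an integer, got {raw!r}") from None
    if seconds < 0:
        raise ValueError(f"{ENV_VAR} must be >= 0, got {seconds}")
    return seconds


unset = wall_clock_override({})
zero = wall_clock_override({ENV_VAR: "0"})
fifteen_minutes = wall_clock_override({ENV_VAR: "900"})
```

Failing loudly on a malformed value, rather than silently ignoring it, keeps an operator typo from quietly changing the session budget.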
Test Coverage Notes
- Reviewed changed unit coverage in `tests/unit/runtime/test_controls.py`, `tests/unit/runtime/test_watchdog.py`, and `tests/unit/auto/test_pipeline_watchdog_integration.py`.
- New coverage includes env override parsing, duplicate event suppression, same-budget replay, and replay before disabled/raised-budget gates.
- Could not execute tests: `python` is unavailable and `python3 -m pytest` reports `No module named pytest`.
Design Notes
The design now treats the EventStore cancel event as the durable source of truth before evaluating process-local runtime controls, which is the right boundary for restart recovery.
Design / Roadmap Gate
Affected boundary: Watchdog.check() persists cancel intent in EventStore while AutoPipeline._check_watchdog() separately saves BLOCKED state. Current HEAD handles the non-atomic gap by querying existing cancel events first, returning a WatchdogDecision without appending a duplicate, and then letting the pipeline persist BLOCKED. Production EventStore.query_events() matches the replay protocol.
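The pipeline side of that handoff might look roughly like the sketch below. `Decision` is a hypothetical stand-in for `WatchdogDecision` with invented fields, and `AutoState`, `watchdog_check`, and `pipeline_check_watchdog` are illustrative names; treating a budget of 0 as "disabled" is an assumption drawn from the earlier blocker text.

```python
from dataclasses import dataclass
from typing import Optional

CANCEL = "runtime.watchdog.cancel"


@dataclass
class Decision:
    """Hypothetical stand-in for WatchdogDecision; fields are invented."""

    reason: str
    replayed: bool


@dataclass
class AutoState:
    status: str = "RUNNING"


def watchdog_check(events: list, elapsed: float, budget: float) -> Optional[Decision]:
    if CANCEL in events:  # query existing cancel events first
        return Decision("wall_clock_exceeded", replayed=True)
    if budget <= 0 or elapsed <= budget:  # assumed: 0 disables the watchdog
        return None
    events.append(CANCEL)
    return Decision("wall_clock_exceeded", replayed=False)


def pipeline_check_watchdog(state: AutoState, events: list, elapsed: float, budget: float) -> bool:
    """Return True to proceed; otherwise mark the state BLOCKED."""
    decision = watchdog_check(events, elapsed, budget)
    if decision is None:
        return True
    state.status = "BLOCKED"  # the real pipeline would also save the state here
    return False


# Resume after a crash: the event exists, the state was never saved as BLOCKED.
state = AutoState()
events = [CANCEL]
first_resume = pipeline_check_watchdog(state, events, elapsed=30, budget=10_000)
second_resume = pipeline_check_watchdog(state, events, elapsed=30, budget=0)
```

Both resumes block, and the event list keeps a single cancel entry, which is the idempotency property the regression tests are described as checking.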
Directional Notes
Maintainer memory emphasized runtime reality over optimistic docs, so review focused on whether persisted cancellation remains terminal across restart, env drift, and partial failure. No blocker remains on that boundary.
Merge Recommendation
Approve. I found no blocking runtime, persistence, or API contract issue in the changed boundary. The only remaining item is discoverability documentation for the new env var.
Review-Metadata:
verdict: APPROVE
head_sha: 807bdd9
request_id: req_1779710219_24
review_profile: memory-aware-zero-trust-v2
advisory_memory_only: true
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Merge-readiness rationale (English)
This PR is ready to merge. It is the watchdog/runtime-controls hardening blocker called out on #1178 and #1189, and it took two REQUEST_CHANGES iterations to land the right semantics.
What it does
Three coordinated changes inside the Track B L2 watchdog/runtime-controls surface:
A pair of
Why it aligns with the SSOT direction
Why it is not over-engineered
Why the bot's review trajectory matters
The first review caught the right bug: an over-budget process could append
Why it is mergeable
Risk assessment
Recommending merge.
PR Review Summary
Posted via
Verdict
Approve
Scope Reviewed
Blocking Issues
None.
Warnings
None.
Mutation-Test Thinking
Complexity / CRAP-style Risk
Test Quality Assessment
Security / Operational Risk
Looks Good
Final Recommendation
APPROVE. The PR closes both prior REQUEST_CHANGES blockers with explicit regression tests, and the resulting semantics (persisted cancel event is durable, env override is strict, dual loaders coexist) match the SSOT Track B L2 watchdog contract. No blocking findings, no warnings.
Review-Metadata:
Summary
Original PR blocker coverage
Addresses watchdog/runtime-control blockers reported on #1178 and #1189.
Verification