Skip to content

fix(watcher): lossless check wakes via watcher-side suppression + enqueue-before-suppress#34

Open
e-jung wants to merge 2 commits into
kunchenguid:mainfrom
e-jung:fix/check-suppress-enqueue-before-suppress
Open

fix(watcher): lossless check wakes via watcher-side suppression + enqueue-before-suppress#34
e-jung wants to merge 2 commits into
kunchenguid:mainfrom
e-jung:fix/check-suppress-enqueue-before-suppress

Conversation

@e-jung

@e-jung e-jung commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Intent

Fix the check-suppression bug that made firstmate miss PR #3095's merge. The fm-git-events-s8 scout report root-caused it: per-PR state/*.check.sh babysits were self-baselining — they advanced their own .babysit-*.seen suppression marker inside the script before the wake could be delivered. If that one run's stdout never reached the watcher (race / crash / timeout-before-flush), the baseline was already advanced and no subsequent poll could re-emit. This violates #29's enqueue-before-suppress invariant that makes signal/stale wakes lossless. Checks were the only wake source whose suppression lived inside an opaque script the watcher couldn't protect.

What changed

1. Watcher-side check suppression with enqueue-before-suppress (bin/fm-watch.sh):

  • The check always prints its current state (idempotent); the watcher dedups against .seen-check-<name> and only wakes on a delta.
  • fm_wake_append (durable queue) runs before the .seen-check marker advances — exactly the pattern scan_signals already uses. A crash between detect and suppress leaves the wake in the queue (recovered next turn) and the marker un-advanced (re-detected next cycle). A lost check wake is now impossible.
  • Backward-compatible with old edge-triggered checks: empty stdout never produces a wake, so they keep their quiet behavior. No existing check breaks.

2. Catch-all backstop (bin/fm-watch.sh): force-escalates any .babysit-*.seen sidecar showing a terminal state (MERGED/CLOSED) the watcher never delivered a wake for — catches a swallowed transition within one sweep (300s). Deduped via .escalated-<sidecar> so each terminal state fires at most once. This is the belt-and-suspenders safety net for checks that haven't migrated to the lossless contract.

3. Tests (tests/fm-wake-queue.test.sh):

  • test_check_wake_survives_lost_delivery — the #3095 regression: a check wake survives a simulated crash between enqueue and suppress (queue recovery + re-detection).
  • test_check_dedup_suppresses_repeats — identical repeated output wakes exactly once.
  • test_catch_all_escalates_swallowed_transition — a self-suppressed check whose sidecar shows MERGED is force-escalated within one sweep.

4. Docs (AGENTS.md): the check contract now documents the stateless "always print current state" form as preferred; .seen-check-*/.escalated-* join the enqueue-before-suppress marker list.

Motivating incident

PR mvanhorn/cli-printing-press#3095 merged at 08:07Z; the .babysit-…-3095.seen sidecar advanced to MERGED at 08:08:17 (49s later), yet zero check: wake for this PR exists anywhere in the 32-hour daemon log. The watcher ran cleanly the entire window (no crashes, no errors). Full root-cause analysis and reproduction in data/fm-git-events-s8/report.md.

Deliberate decisions

  • Backward-compatible over migrate-everything: the watcher handles both old edge-triggered checks (empty stdout = no wake) and new stateless checks (always prints → watcher dedups). No existing check breaks; old checks gain the catch-all backstop.
  • Catch-all on the check-sweep cadence (300s), not the backoff-prone heartbeat (600s→7200s): ties the backstop to the only fixed-300s scan in the watcher, giving bounded catch-all latency regardless of fleet idle.
  • Test hook FM_WATCH_BREAK_AFTER_CHECK_ENQUEUE: env-gated, test-only, proves the enqueue-before-suppress ordering by simulating a crash between enqueue and suppress. Never set outside the test suite.

Verification

  • bash tests/fm-wake-queue.test.sh — all 14 tests pass (11 existing + 3 new, no regressions).
  • shellcheck bin/*.sh tests/*.sh — clean.
  • no-mistakes pipeline: review, test, document, lint all passed.

Pipeline

Updates from git push no-mistakes

Note: the no-mistakes pipeline ran the full review/test/document/lint gates on this change (all green; one review-round auto-fix applied — a comment inaccuracy in the test file, commit ae2d59d). The no-mistakes push/pr steps could not complete in this environment because no-mistakes is hard-bound to its recorded remote kunchenguid/firstmate (403 for the e-jung contributor) and cannot drive a fork→upstream cross-repo PR, so the branch was pushed to the e-jung/firstmate fork and this PR opened manually. The change is no-mistakes-validated; the CI gates below (Require no-mistakes, CI lint/tests/invariants) re-verify it on the PR.


AI disclosure: Human-reviewed. The implementation was generated by an AI agent (crewmate) and reviewed by a human (firstmate/captain) before merge.

🤖 Generated with AI. Reviewed by a human.

e-jung added 2 commits June 21, 2026 21:54
Checks were the only wake source whose suppression lived inside an opaque
script: an edge-triggered check advanced its own .babysit-*.seen marker
before the print could become a wake, so a lost stdout (timeout / concurrent
run / crash) permanently swallowed the transition. This is the root cause of
the missed PR #3095 merge (see data/fm-git-events-s8/report.md): the
.cli-printing-press-3095.seen sidecar advanced to MERGED but no check wake
was ever emitted.

Port the kunchenguid#29 enqueue-before-suppress invariant to checks:

1. Watcher-side suppression (bin/fm-watch.sh). The check always prints its
   current state (idempotent); the watcher dedups against .seen-check-<name>
   and calls fm_wake_append (durable queue) BEFORE advancing the marker. A
   crash between detect and suppress leaves the wake in the queue (recovered
   next turn) and the marker un-advanced (re-detected next cycle). A lost
   check wake is now impossible - exactly the guarantee signals enjoy.

2. Backward-compatible with old edge-triggered checks: empty stdout never
   produces a wake, so they keep their quiet behavior.

3. Catch-all backstop: force-escalate any .babysit-*.seen sidecar showing a
   terminal state (MERGED/CLOSED) the watcher never delivered a wake for.
   Catches a swallowed transition within one sweep; deduped via
   .escalated-<sidecar> so each terminal state fires at most once.

Tests (tests/fm-wake-queue.test.sh):
- test_check_wake_survives_lost_delivery: the #3095 regression - a check wake
  survives a simulated crash between enqueue and suppress (queue recovery +
  re-detection).
- test_check_dedup_suppresses_repeats: identical repeated output wakes once.
- test_catch_all_escalates_swallowed_transition: a self-suppressed check whose
  sidecar shows MERGED is force-escalated within one sweep.

AGENTS.md: the check contract now documents the stateless "always print
current state" form as preferred, and .seen-check-*/.escalated-* join the
enqueue-before-suppress marker list.

shellcheck clean; all 14 wake-queue tests + 5 spawn-batch tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant