Skip to content

feat(smithy): detect dispatch daemon poll-loop staleness (internal)#5

Open
komoreka wants to merge 13 commits into
masterfrom
feat/daemon-poll-staleness
Open

feat(smithy): detect dispatch daemon poll-loop staleness (internal)#5
komoreka wants to merge 13 commits into
masterfrom
feat/daemon-poll-staleness

Conversation

@komoreka
Copy link
Copy Markdown
Owner

@komoreka komoreka commented May 8, 2026

Internal fork PR — not for upstream yet. Builds on the still-open feat/dispatch-stuck-warning (upstream PR stoneforge-ai#98).

Symptom

In dogfooding today the dispatch daemon entered a state where:

  • Node process alive, HTTP server responsive
  • runPollCycle had stopped firing — no new dispatch, scheduling, or recovery work for hours
  • kill <pid>; sf serve smithy restart unblocked everything

Most likely path (per the existing re-entrancy guard at dispatch-daemon.ts): the setInterval handler keeps firing, but every cycle bails at if (this.polling) return because the previous cycle is wedged on an await somewhere inside the try block. The finally { this.polling = false } never runs.

Fix (detection only — no auto-recovery)

Three new fields on DispatchHealth (extending the blob added in stoneforge-ai#98):

  • lastPollStartedAt — set inside runPollCycle immediately after this.polling = true, before any await
  • lastPollCompletedAt — set in the finally block, before clearing polling
  • pollStale — derived flag, true when a cycle is in flight past the threshold OR the last completion is older than the threshold

Threshold defaults to max(60_000, 10 × pollIntervalMs), configurable via a new pollStaleThresholdMs config field. Generously above any healthy poll cycle's expected duration.

The DispatchHealthBanner in smithy-web renders a distinct red wedge-daemon banner when pollStale === true, advising the operator to restart. Stuck-queue banner remains amber. When both conditions hold, wedge banner takes priority (stuck queue unresolvable while daemon is dead).

Why detection-only

Force-clearing this.polling = false while a real async op is mid-flight creates concurrent runs on the same tasks (the existing comment at dispatch-daemon.ts:2199-2201 explicitly warns about this). Without a reproducer for the underlying hung-await, auto-recovery would paper over a real bug. This PR makes the next occurrence diagnosable.

Test plan

  • bun test src/services/dispatch-daemon.bun.test.ts -t 'pollStale' — 4/4 pass (fresh-after-tick, never-polled, wedged-mid-cycle, custom-threshold)
  • bun test packages/smithy/src — 1581 pass / 0 fail / 19 skip (full smithy suite)
  • turbo run typecheck — 16/16 pass
  • pnpm build — 13/13 pass
  • Live UI smoke deferred to merge-into-personal-main; confirm banner renders distinct copy when monkey-patching pollWorkerAvailability to hang.

Out of scope

  • Auto-recovery / force-clearing the polling flag — hazardous; needs reproducer
  • CLI footer surfacing in sf task list / sf agent list — banner alone covers v1
  • Stack-dump-on-stale (e.g., kill -USR1) — useful future diagnostic
  • Upstream PR — held until detection surfaces a real wedge in daily use, then promote

komoreka and others added 12 commits May 5, 2026 08:55
… detection

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds DispatchHealth interface to api/types.ts, extends DaemonStatusResponse
in useDaemon.ts with an optional health field (also tightens poll to 5s),
and creates the DispatchHealthBanner component that renders an amber warning
when hasStuckQueue is true.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pages

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…pper

Banner now takes an optional className applied to its outer element so
each mount site controls its own page-specific padding. The workspaces
route previously wrapped the banner in a div with px-6 pt-4 which left
16px of empty padding when the banner self-hides (queue healthy or
dismissed). The wrapper is gone; the banner mounts directly with
className="mx-6 mt-4" passed in.
Replaces the per-tick rate-limited warn with state-transition logging:
- Healthy → Stuck: single STUCK warn line with the count and a clear hint.
- Stuck → Healthy: single RESUMED info line confirming dispatch is flowing.
- No periodic reminders. A long-running stuck state does not spam the log.

Drops the stuckWarnTickInterval config option (was 20 ticks, 100s spacing
in production). Distinctive STUCK/RESUMED prefixes make the lines findable
in a chatty log stream.

Tests:
- Verifies STUCK warn fires on first stuck tick AND does not fire again
  on subsequent stuck ticks (was the periodic-spam regression risk).
- Verifies RESUMED info fires on the next healthy tick after stuck.
…ch daemon

A dispatch daemon whose runPollCycle is wedged mid-await leaves
this.polling = true and never reaches the finally block, so the
re-entrancy guard at the top of runPollCycle silently bails on every
subsequent setInterval tick. The HTTP server stays responsive but the
loop is effectively dead. We observed this in production today.

Track both start and complete timestamps:
- lastPollStartedAt is set immediately after this.polling = true,
  before any await. A wedged cycle leaves this advancing only on the
  current (stuck) cycle's first entry.
- lastPollCompletedAt is set in the finally block. A wedged cycle
  never reaches this, so it stays old.

Together they form the foundation for a pollStale flag in
getDispatchHealth (added in the next commit).

This commit only adds tracking — no behavior change yet visible to
operators or callers.

Refs: detection-only response to observed daemon wedge in dogfooding.
Extend the existing DispatchHealth blob (added in PR stoneforge-ai#98) with three new
fields for daemon poll-loop health observability:

- lastPollStartedAt: when the most recent cycle entered the try block
- lastPollCompletedAt: when it reached the finally block (undefined if
  no cycle has completed yet)
- pollStale: derived flag — true when either (a) a cycle is in flight
  past the threshold, or (b) the last completed cycle is older than the
  threshold

A new pollStaleThresholdMs config field defaults to max(60_000,
10 × pollIntervalMs). Generously above any healthy cycle's expected
duration to avoid false-firing on long but legitimate orphan recovery,
inbox triage, or steward trigger work.

Surfaced automatically via /api/daemon/status (the existing route just
spreads the health blob).

Tests cover: fresh-after-tick, never-polled, wedged-mid-cycle (api.ready
stubbed to never resolve, sleep past threshold, observe stale=true and
completedAt undefined), and custom-threshold honored.

Refs: stoneforge-ai#98 (parent feature), detection-only response to observed daemon
wedge in dogfooding.
DispatchHealthBanner previously rendered only the amber stuck-queue
warning. Extend it with a red wedge-daemon variant that fires when
data.health.pollStale === true.

Why two variants:
- Stuck queue (amber): waiting on operator action — register/enable a
  worker. Daemon is healthy.
- Wedged daemon (red): the daemon itself is broken. Loop has stopped;
  HTTP responds but dispatch/scheduling/recovery don't. Operator must
  restart `sf serve smithy`.

When both conditions are true, the wedged-daemon banner takes priority
since stuck-queue is unresolvable while the daemon is dead.

The body copy quantifies how long the poll has been stuck using
lastPollCompletedAt (rounded to minutes), falling back to "for a while"
when the timestamp is missing.

Mirrors the new fields in apps/smithy-web/src/api/types.ts so the
banner can typecheck against the API response shape.
Three small fixes from self-review:

1. DispatchHealthBanner.tsx: Math.round → Math.floor in the stuck-duration
   formatter. With round, a 90-second wedge displayed "for over 2 minutes"
   which overstates the time. Floor matches the "for over N minutes"
   wording literally.

2. dispatch-daemon.ts: clarify the comment on lastPollCompletedAt. The
   prior wording "(it doesn't, by design)" parsed ambiguously. Replaced
   with explicit text: during a wedge both timestamps stay frozen because
   the re-entrancy guard at runPollCycle's top bails on every subsequent
   setInterval tick.

3. dispatch-daemon.bun.test.ts: replace the `as any` stub of api.ready
   with a typed cast. Same runtime behavior, eliminates the eslint-disable
   for no-explicit-any. Uses `Parameters<typeof realReady>[0]` and
   `typeof harness.api.ready` to preserve full type fidelity.

No behavior change. 4/4 pollStale tests pass; typecheck clean.
komoreka added a commit that referenced this pull request May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant