10 changes: 10 additions & 0 deletions .changeset/daemon-poll-staleness.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
---
"@stoneforge/smithy": minor
"@stoneforge/smithy-web": patch
---

feat(smithy): detect and surface dispatch daemon poll-loop staleness

A wedged dispatch daemon — process alive, HTTP responsive, but `runPollCycle` hung — would quietly stop dispatching, scheduling, and recovering with no signal to the operator. The daemon now tracks `lastPollStartedAt` and `lastPollCompletedAt`, and exposes a `pollStale` flag in `getDispatchHealth()` (true when either a cycle is in flight past the threshold, or the last completion is older than the threshold). Default threshold is `max(60_000, 10 × pollIntervalMs)`, configurable via the new `pollStaleThresholdMs` config field.
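
The staleness rule described above can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the daemon's actual code: the names `PollTimestamps`, `defaultPollStaleThresholdMs`, and `isPollStale` are hypothetical, timestamps are assumed to be epoch milliseconds, and a daemon that has never started a cycle is assumed to be not stale.

```typescript
// Hypothetical shape: the daemon is assumed to record these as epoch milliseconds.
interface PollTimestamps {
  lastPollStartedAt?: number;
  lastPollCompletedAt?: number;
}

// Default threshold from the changeset: max(60_000, 10 x pollIntervalMs).
function defaultPollStaleThresholdMs(pollIntervalMs: number): number {
  return Math.max(60_000, 10 * pollIntervalMs);
}

// True when a cycle has been in flight longer than the threshold, or when
// the last completed cycle is older than the threshold.
function isPollStale(ts: PollTimestamps, thresholdMs: number, nowMs: number): boolean {
  const started = ts.lastPollStartedAt;
  const completed = ts.lastPollCompletedAt;

  if (started !== undefined) {
    // Assumption: a cycle is "in flight" when it started after the last completion.
    const inFlight = completed === undefined || completed < started;
    if (inFlight && nowMs - started > thresholdMs) return true;
  }
  if (completed !== undefined && nowMs - completed > thresholdMs) return true;

  // Assumption: no timestamps at all means the loop has not begun, not that it is wedged.
  return false;
}
```

Passing `nowMs` explicitly rather than calling `Date.now()` inside keeps the check deterministic, which makes threshold edge cases easy to test.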

The `DispatchHealthBanner` in smithy-web renders a distinct red wedged-daemon banner when `pollStale` is true (vs. the existing amber stuck-queue banner), advising the operator to restart `sf serve smithy`. When both conditions are true, the wedged-daemon banner takes priority, since a stuck queue cannot be resolved while the daemon's poll loop is wedged.
12 changes: 12 additions & 0 deletions .changeset/dispatch-stuck-warning.md
@@ -0,0 +1,12 @@
---
"@stoneforge/smithy": minor
---

Surface a dispatch-stuck warning when ready unassigned tasks have no available workers to take them.

- New `getDispatchHealth()` method on `DispatchDaemon` returning `{ readyUnassignedTasks, availableWorkers, stuck, hasStuckQueue, computedAt }`. A worker is "available" when it is registered, not disabled, and not terminated. At-capacity workers still count as available, so a queue waiting on busy workers is not reported as stuck (the queue is busy, not stuck).
- Per-tick CLI warning (rate-limited to once per 20 ticks, configurable via `DispatchDaemonConfig.stuckWarnTickInterval`): `[dispatch] N task(s) ready, no available workers...`. Warns again immediately when the queue clears and then becomes stuck again.
- `GET /api/daemon/status` includes a `health` field with the snapshot.
- New smithy-web `DispatchHealthBanner` shown on the agents and workspaces pages when the queue is stuck. Dismissible per page-load.
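
The availability rule and the rate-limited warning from the list above can be sketched as follows. This is an illustrative sketch, not the daemon's implementation: `WorkerInfo`, `isAvailable`, `hasStuckQueue`, and `StuckWarnLimiter` are hypothetical names, and defining a stuck queue as "at least one ready task and zero available workers" is an assumption drawn from the warning text.

```typescript
// Hypothetical worker record; the real daemon's worker model may differ.
interface WorkerInfo {
  registered: boolean;
  disabled: boolean;
  terminated: boolean;
  atCapacity: boolean;
}

// Available = registered, not disabled, not terminated.
// atCapacity is deliberately ignored: a busy worker still counts as available.
function isAvailable(w: WorkerInfo): boolean {
  return w.registered && !w.disabled && !w.terminated;
}

// Assumption: the queue is stuck when tasks are ready and no worker is available.
function hasStuckQueue(readyUnassignedTasks: number, workers: WorkerInfo[]): boolean {
  return readyUnassignedTasks > 0 && workers.filter(isAvailable).length === 0;
}

// Emits a warning on the first stuck tick, then once every `interval` ticks.
// Any non-stuck tick resets the counter, so the next stuck tick warns immediately.
class StuckWarnLimiter {
  private stuckTicks = 0;
  private readonly interval: number;

  constructor(interval: number = 20) {
    this.interval = interval;
  }

  /** Returns true when this tick should emit the warning. */
  tick(isStuck: boolean): boolean {
    if (!isStuck) {
      this.stuckTicks = 0;
      return false;
    }
    const shouldWarn = this.stuckTicks % this.interval === 0;
    this.stuckTicks += 1;
    return shouldWarn;
  }
}
```

Resetting the counter on a clear tick is what gives the "warn again immediately" behavior without a separate state flag.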

Closes #59. The pool-routing observation in #59 is filed separately.
4 changes: 3 additions & 1 deletion apps/smithy-web/src/api/hooks/useDaemon.ts
@@ -5,6 +5,7 @@
*/

import { useQuery, useMutation, useQueryClient } from '@tanstack/react-query';
import type { DispatchHealth } from '../types';

// ============================================================================
// Types
@@ -28,6 +29,7 @@ export interface DaemonStatusResponse {
    limits: Array<{ executable: string; resetsAt: string }>;
    soonestReset?: string;
  };
  health?: DispatchHealth;
}

export interface DaemonStartResponse {
@@ -85,7 +87,7 @@ export function useDaemonStatus() {
  return useQuery<DaemonStatusResponse, Error>({
    queryKey: ['daemon-status'],
    queryFn: () => fetchApi<DaemonStatusResponse>('/daemon/status'),
-    refetchInterval: 10000, // Poll every 10 seconds
+    refetchInterval: 5000, // Poll every 5 seconds
  });
}
18 changes: 18 additions & 0 deletions apps/smithy-web/src/api/types.ts
@@ -622,6 +622,24 @@ export interface ApprovalRequestResponse {
  request: ApprovalRequest;
}

// ============================================================================
// Daemon / Dispatch Types
// ============================================================================

export interface DispatchHealth {
  readyUnassignedTasks: number;
  availableWorkers: number;
  stuck: boolean;
  hasStuckQueue: boolean;
  computedAt: string;
  /** ISO timestamp when the most recent poll cycle started. Undefined if no cycle has ever started. */
  lastPollStartedAt?: string;
  /** ISO timestamp when the most recent poll cycle completed. Undefined if no cycle has completed. */
  lastPollCompletedAt?: string;
  /** True when the daemon is responsive but its poll loop is wedged or has not advanced past the staleness threshold. */
  pollStale: boolean;
}

// ============================================================================
// Provider Metrics Types
// ============================================================================
76 changes: 76 additions & 0 deletions apps/smithy-web/src/components/dispatch/DispatchHealthBanner.tsx
@@ -0,0 +1,76 @@
import { useState } from 'react';
import { AlertTriangle, X } from 'lucide-react';
import { useDaemonStatus } from '../../api/hooks/useDaemon';

interface DispatchHealthBannerProps {
  /** Optional layout classes applied to the outer banner element. Lets each mount site control its own page-specific padding without leaving an empty wrapper when the banner self-hides. */
  className?: string;
}

export function DispatchHealthBanner({ className }: DispatchHealthBannerProps = {}) {
  const { data } = useDaemonStatus();
  const [dismissed, setDismissed] = useState(false);

  if (dismissed) return null;

  const health = data?.health;
  // pollStale is the more critical signal: the daemon's poll loop is wedged
  // and dispatch/scheduling/recovery have all stopped. Render this banner
  // first when both conditions hold; users can't act on a stuck queue if
  // the daemon itself is broken.
  const pollStale = health?.pollStale === true;
  const hasStuckQueue = health?.hasStuckQueue === true;

  if (!pollStale && !hasStuckQueue) return null;

  // Red for poll-stale (daemon wedged); amber for stuck-queue (waiting on
  // operator action). Different urgency, different colors.
  const baseClasses = pollStale
    ? 'mb-4 flex items-start gap-3 px-4 py-3 rounded-md bg-red-50 dark:bg-red-950/30 border border-red-300 dark:border-red-700 text-red-900 dark:text-red-100'
    : 'mb-4 flex items-start gap-3 px-4 py-3 rounded-md bg-amber-50 dark:bg-amber-950/30 border border-amber-300 dark:border-amber-700 text-amber-900 dark:text-amber-100';

  const hoverClasses = pollStale
    ? 'p-1 rounded hover:bg-red-100 dark:hover:bg-red-900/50 transition-colors'
    : 'p-1 rounded hover:bg-amber-100 dark:hover:bg-amber-900/50 transition-colors';

  const headline = pollStale ? 'Dispatch daemon is wedged.' : 'Dispatch is stuck.';

  // Compute approximate stuck duration for the message body. Falls back to
  // "for a while" when timestamps are missing or the wedge is under a minute.
  // Math.floor matches the "for over N minutes" wording: a 90-second wedge
  // reads "for over 1 minute", not "for over 2 minutes".
  const lastCompletedAt = health?.lastPollCompletedAt;
  let stuckForCopy = 'for a while';
  if (lastCompletedAt) {
    const ageMin = Math.floor((Date.now() - new Date(lastCompletedAt).getTime()) / 60000);
    if (ageMin >= 1) stuckForCopy = `for over ${ageMin} minute${ageMin === 1 ? '' : 's'}`;
  }

  const body = pollStale
    ? `Poll loop has not completed a cycle ${stuckForCopy}. The HTTP server is responsive but dispatch, scheduling, and recovery have stopped. Restart with \`sf serve smithy\` to recover.`
    : `${health?.readyUnassignedTasks ?? 0} task(s) ready, no available workers. Register or enable a worker to start dispatching.`;

  const testId = pollStale ? 'dispatch-health-banner-poll-stale' : 'dispatch-health-banner';

  return (
    <div
      className={className ? `${className} ${baseClasses}` : baseClasses}
      data-testid={testId}
      role="alert"
    >
      <AlertTriangle className="w-5 h-5 mt-0.5 shrink-0" />
      <div className="flex-1 min-w-0 text-sm">
        <div className="font-medium">{headline}</div>
        <div className="mt-1">{body}</div>
      </div>
      <button
        type="button"
        onClick={() => setDismissed(true)}
        className={hoverClasses}
        aria-label="Dismiss"
      >
        <X className="w-4 h-4" />
      </button>
    </div>
  );
}
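
The banner's two decisions (which variant to show, and how to word the stuck duration) can be pulled out as pure functions, which makes them testable without rendering React. The sketch below is one possible extraction, not part of this PR: `HealthLike`, `pickBannerVariant`, and `formatStuckFor` are hypothetical names, and the current time is passed in as a parameter instead of read from `Date.now()`.

```typescript
// Hypothetical subset of DispatchHealth needed for the banner decision.
interface HealthLike {
  pollStale?: boolean;
  hasStuckQueue?: boolean;
}

type BannerVariant = 'poll-stale' | 'stuck-queue' | null;

// Poll staleness wins when both conditions hold, mirroring the component.
function pickBannerVariant(health: HealthLike | undefined): BannerVariant {
  if (health?.pollStale === true) return 'poll-stale';
  if (health?.hasStuckQueue === true) return 'stuck-queue';
  return null;
}

// Same wording rule as the component: floor to whole minutes, and fall back
// to "for a while" when the timestamp is missing, unparseable, or under a minute old.
function formatStuckFor(lastPollCompletedAt: string | undefined, nowMs: number): string {
  if (!lastPollCompletedAt) return 'for a while';
  const ageMin = Math.floor((nowMs - new Date(lastPollCompletedAt).getTime()) / 60_000);
  // An unparseable date yields NaN, and NaN >= 1 is false, so the fallback applies.
  if (ageMin >= 1) return `for over ${ageMin} minute${ageMin === 1 ? '' : 's'}`;
  return 'for a while';
}
```

With the clock injected, the 90-second case mentioned in the component's comment can be asserted directly.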
4 changes: 4 additions & 0 deletions apps/smithy-web/src/routes/agents/index.tsx
@@ -17,6 +17,7 @@ import { AgentCard, CreateAgentDialog, DeleteAgentDialog, RenameAgentDialog, Sta
import { PoolCard, CreatePoolDialog, EditPoolDialog } from '../../components/pool';
import { AgentWorkspaceGraph } from '../../components/agent-graph';
import type { Agent, SessionStatus, AgentRole, StewardFocus } from '../../api/types';
import { DispatchHealthBanner } from '../../components/dispatch/DispatchHealthBanner';

type TabValue = 'agents' | 'stewards' | 'pools' | 'graph';

@@ -429,6 +430,9 @@ export function AgentsPage() {
</nav>
</div>

{/* Dispatch health banner */}
<DispatchHealthBanner />

{/* Content */}
{currentTab === 'graph' ? (
// Graph tab handles its own loading/error states
4 changes: 4 additions & 0 deletions apps/smithy-web/src/routes/workspaces/index.tsx
@@ -28,6 +28,7 @@ import {
type LayoutPreset,
} from '../../components/workspace';
import { useAgent, useResumeAgentSession } from '../../api/hooks/useAgents';
import { DispatchHealthBanner } from '../../components/dispatch/DispatchHealthBanner';

/** Layout preset configuration */
const layoutPresets: { id: LayoutPreset; icon: typeof Square; label: string }[] = [
@@ -331,6 +332,9 @@ export function WorkspacesPage() {
</div>
</div>

{/* Dispatch health banner */}
<DispatchHealthBanner className="mx-6 mt-4" />

{/* Main content area */}
<div className="flex-1 min-h-0 p-4 overflow-hidden">
{hasPanes ? (
78 changes: 78 additions & 0 deletions packages/smithy/src/server/routes/daemon.test.ts
@@ -0,0 +1,78 @@
/**
 * Daemon Routes Tests — GET /api/daemon/status health field
 *
 * Tests that the status endpoint includes dispatch health when available,
 * omits it when the daemon is not configured, and degrades gracefully when
 * getDispatchHealth() throws.
 */

import { describe, it, expect, vi } from 'vitest';
import type { Services } from '../services.js';
import { createDaemonRoutes } from './daemon.js';

// ============================================================================
// Test Fixtures
// ============================================================================

function createMockServices(opts: { withDaemon?: boolean; healthThrows?: boolean } = {}) {
  const dispatchDaemon =
    opts.withDaemon === false
      ? undefined
      : {
          isRunning: vi.fn().mockReturnValue(true),
          getConfig: vi.fn().mockReturnValue({ pollIntervalMs: 500 }),
          getRateLimitStatus: vi.fn().mockReturnValue({ active: false }),
          getDispatchHealth: opts.healthThrows
            ? vi.fn().mockRejectedValue(new Error('db unreachable'))
            : vi.fn().mockResolvedValue({
                readyUnassignedTasks: 3,
                availableWorkers: 0,
                stuck: true,
                hasStuckQueue: true,
                computedAt: '2026-05-03T00:00:00.000Z',
              }),
        };

  const services = { dispatchDaemon } as unknown as Services;
  return { services, dispatchDaemon };
}

// ============================================================================
// Tests
// ============================================================================

describe('GET /api/daemon/status — health', () => {
  it('includes the health snapshot when the daemon is available', async () => {
    const { services } = createMockServices();
    const app = createDaemonRoutes(services);
    const res = await app.request('/api/daemon/status');
    const body = await res.json();

    expect(res.status).toBe(200);
    expect(body.health.hasStuckQueue).toBe(true);
    expect(body.health.readyUnassignedTasks).toBe(3);
    expect(body.health.availableWorkers).toBe(0);
  });

  it('omits the health field when the daemon is unavailable', async () => {
    const { services } = createMockServices({ withDaemon: false });
    const app = createDaemonRoutes(services);
    const res = await app.request('/api/daemon/status');
    const body = await res.json();

    expect(res.status).toBe(200);
    expect(body.available).toBe(false);
    expect(body.health).toBeUndefined();
  });

  it('still returns 200 with isRunning when getDispatchHealth throws', async () => {
    const { services } = createMockServices({ healthThrows: true });
    const app = createDaemonRoutes(services);
    const res = await app.request('/api/daemon/status');
    const body = await res.json();

    expect(res.status).toBe(200);
    expect(body.isRunning).toBe(true);
    expect(body.health).toBeUndefined();
  });
});
11 changes: 10 additions & 1 deletion packages/smithy/src/server/routes/daemon.ts
@@ -30,7 +30,7 @@ export function createDaemonRoutes(services: Services) {
  const app = new Hono();

  // GET /api/daemon/status
-  app.get('/api/daemon/status', (c) => {
+  app.get('/api/daemon/status', async (c) => {
    if (!dispatchDaemon) {
      return c.json({
        isRunning: false,
@@ -42,6 +42,14 @@ export function createDaemonRoutes(services: Services) {

    const config = dispatchDaemon.getConfig();
    const rateLimitStatus = dispatchDaemon.getRateLimitStatus();

    let health: import('../../services/dispatch-daemon.js').DispatchHealth | undefined;
    try {
      health = await dispatchDaemon.getDispatchHealth();
    } catch (err) {
      logger.warn('Failed to compute dispatch health for /api/daemon/status:', err);
    }

    return c.json({
      isRunning: dispatchDaemon.isRunning(),
      available: true,
@@ -55,6 +63,7 @@ export function createDaemonRoutes(services: Services) {
        directorInboxForwardingEnabled: config.directorInboxForwardingEnabled,
      },
      rateLimit: rateLimitStatus,
      health,
    });
  });
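
The degradation pattern in this route (log the failure, leave `health` undefined, still answer 200) can be isolated in a small sketch. The helper names `tryCompute` and `buildStatusBody` are hypothetical and not part of this PR. The sketch also assumes the response body is serialized with `JSON.stringify`, which drops properties whose value is `undefined`; that is why a failed probe produces a body with no `health` key, matching the `toBeUndefined()` assertion in the tests.

```typescript
// Hypothetical helper: run an async probe, log and swallow any failure.
async function tryCompute<T>(
  compute: () => Promise<T>,
  warn: (message: string, err: unknown) => void,
): Promise<T | undefined> {
  try {
    return await compute();
  } catch (err) {
    warn('Failed to compute dispatch health:', err);
    return undefined;
  }
}

// JSON.stringify omits properties whose value is undefined, so a failed
// probe yields a body without a `health` key rather than `"health": null`.
function buildStatusBody(isRunning: boolean, health: object | undefined): string {
  return JSON.stringify({ isRunning, available: true, health });
}
```

Keeping the probe failure out of the status code means monitoring that only checks `isRunning` keeps working when the health computation itself is the thing that breaks.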
