feat(health): restore /health endpoint (port from v1) #2619

Open
johnmathews wants to merge 1 commit into nanocoai:main from johnmathews:feat/health-endpoint-port

Summary

Restores v1's /health endpoint that was dropped in the v2 rewrite. Loopback-only HTTP probe — no public surface — composing channel/queue/task/cursor-age status from existing runtime state. Production-hardened on my fork (multi-hour uptime). Closer to a regression fix than a new feature.

What's in the PR

New host modules:

  • src/health.ts — pure-function snapshot composer (collectHealth) plus a text formatter (formatHealthText) that's useful for any host-side status command. Takes injected dependencies (HealthDeps) so the pure-assembly half is trivially unit-testable.
  • src/health-snapshot.ts — supplies the runtime I/O (channel registry, delivery/sweep loop state, central DB, per-session inbound DBs) and hands it to collectHealth. v2 has no central task table, so it walks each active session's inbound.db once per request, counting kind='task' rows for active/paused/recent-failures and minimum process_after for the next scheduled run. Trivially fast at single-digit session counts.
  • src/health-server.ts: http.createServer bound to 127.0.0.1. Returns 200/503 from snapshot.healthy, no caching. Port from HEALTH_PORT env var, default 3002.
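
For reviewers who want a feel for the split between pure assembly and I/O, here is a minimal sketch of the injected-dependency pattern. Only the names HealthDeps, collectHealth and snapshot.healthy come from this PR; every field name, the SessionTaskStats type, and the health rule are assumptions made for illustration and may not match src/health.ts.

```typescript
// Hypothetical shapes: the PR names HealthDeps, collectHealth and
// snapshot.healthy, but the fields below are assumed for illustration.
interface SessionTaskStats {
  active: number;
  paused: number;
  recentFailures: number;
  nextProcessAfter: number | null; // earliest process_after (ms epoch), if any
}

interface HealthDeps {
  channelsConnected: () => number;
  messageLoopRunning: () => boolean;
  sessionTasks: () => SessionTaskStats[]; // one entry per active session
}

interface HealthSnapshot {
  healthy: boolean;
  channelsConnected: number;
  messageLoopRunning: boolean;
  tasks: SessionTaskStats; // totals across all sessions
}

// Pure assembly: no I/O happens here, everything arrives through deps.
function collectHealth(deps: HealthDeps): HealthSnapshot {
  const tasks: SessionTaskStats = {
    active: 0,
    paused: 0,
    recentFailures: 0,
    nextProcessAfter: null,
  };
  for (const s of deps.sessionTasks()) {
    tasks.active += s.active;
    tasks.paused += s.paused;
    tasks.recentFailures += s.recentFailures;
    // Keep the minimum process_after seen so far as the next scheduled run.
    if (
      s.nextProcessAfter !== null &&
      (tasks.nextProcessAfter === null || s.nextProcessAfter < tasks.nextProcessAfter)
    ) {
      tasks.nextProcessAfter = s.nextProcessAfter;
    }
  }
  const channelsConnected = deps.channelsConnected();
  const messageLoopRunning = deps.messageLoopRunning();
  // Assumed health rule; the PR does not state the real one.
  const healthy = channelsConnected > 0 && messageLoopRunning;
  return { healthy, channelsConnected, messageLoopRunning, tasks };
}
```

Because the dependencies are plain functions, a unit test can pass in literals and never touch a database or the channel registry.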

Modified host modules (additive only):

  • src/index.ts — wires startHealthServer(port, snapshotHealth) after the CLI socket server in main() and tears it down at the top of shutdown(). No other lifecycle changes.
  • src/host-sweep.ts — adds isHostSweepRunning(): boolean (5 LOC). Pure accessor for the existing running module flag, read by the snapshot so it can report messageLoopRunning honestly.
  • src/delivery.ts — adds getDeliveryPollsRunning(): boolean (5 LOC). Pure accessor for the existing activePolling && sweepPolling flags.
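
The two accessors are small enough to sketch in full. The version below is illustrative only: the flag names running, activePolling and sweepPolling and the accessor names come from this PR, while the start/stop helpers are hypothetical stand-ins for whatever the real modules do to flip those flags, and both modules are collapsed into one file for brevity.

```typescript
// Stand-in for the module-level flag in host-sweep.ts.
let running = false;

// Hypothetical helpers; the real module sets the flag inside its loop lifecycle.
function startHostSweep(): void {
  running = true;
}
function stopHostSweep(): void {
  running = false;
}

// Read-only view of the flag, so callers cannot mutate loop state.
function isHostSweepRunning(): boolean {
  return running;
}

// Stand-ins for the two module-level flags in delivery.ts.
let activePolling = false;
let sweepPolling = false;

// Hypothetical helper that marks both delivery polls as started.
function startDeliveryPolls(): void {
  activePolling = true;
  sweepPolling = true;
}

// True only when both polls are running.
function getDeliveryPollsRunning(): boolean {
  return activePolling && sweepPolling;
}
```

Exposing a function rather than the variable keeps the flags private to their modules while still letting the snapshot read them.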

Why loopback-only

The webhook server on 0.0.0.0:3000 is the only externally reachable surface; /health is for local probes (process supervisor, systemd, future host-side status commands) and intentionally never public. Binding 127.0.0.1 makes that explicit at the kernel level rather than relying on documentation.
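
A minimal sketch of what a loopback-only probe server along these lines could look like. The 127.0.0.1 bind, the 200/503/404/500 behaviour, the startHealthServer(port, snapshot) call shape, and the default port 3002 are taken from this PR; the respond and parsePort helpers, the Snapshot type, and the response bodies are assumptions and may differ from src/health-server.ts.

```typescript
import * as http from "node:http";

// Hypothetical minimal snapshot shape; the real one carries more fields.
interface Snapshot {
  healthy: boolean;
}

// Pure routing step (hypothetical helper), kept separate so it can be
// tested without opening a socket.
function respond(
  url: string | undefined,
  snapshot: () => Snapshot,
): { status: number; body: string } {
  if (url !== "/health") return { status: 404, body: "not found" };
  try {
    const snap = snapshot(); // computed per request, no caching
    return { status: snap.healthy ? 200 : 503, body: JSON.stringify(snap) };
  } catch {
    return { status: 500, body: "snapshot failed" };
  }
}

// Hypothetical helper: parse a HEALTH_PORT-style value, defaulting to 3002.
function parsePort(raw: string | undefined, fallback = 3002): number {
  const n = Number(raw);
  const valid = raw !== undefined && Number.isInteger(n) && n > 0 && n < 65536;
  return valid ? n : fallback;
}

function startHealthServer(port: number, snapshot: () => Snapshot): http.Server {
  const server = http.createServer((req, res) => {
    const { status, body } = respond(req.url, snapshot);
    const isJson = status === 200 || status === 503;
    res.writeHead(status, {
      "Content-Type": isJson ? "application/json" : "text/plain",
    });
    res.end(body);
  });
  // Passing a host restricts the listener to the loopback interface, so
  // the socket is not reachable from other machines.
  server.listen(port, "127.0.0.1");
  return server;
}
```

With a server like this running, `curl -i http://127.0.0.1:3002/health` from the same host would return 200 or 503, while a connection attempt from another machine has nothing to connect to.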

Stats

 src/delivery.ts           |   5 ++
 src/health-server.test.ts | 100 +++++++++++++++++++++++++++
 src/health-server.ts      |  40 +++++++++++
 src/health-snapshot.ts    | 102 +++++++++++++++++++++++++++
 src/health.test.ts        | 173 ++++++++++++++++++++++++++++++++++++++++++++++
 src/health.ts             | 170 +++++++++++++++++++++++++++++++++++++++++++++
 src/host-sweep.ts         |   5 ++
 src/index.ts              |  14 ++++
 8 files changed, 609 insertions(+)

Tests

  • 21 tests in src/health.test.ts (snapshot composition, age formatter, text formatter, edge cases)
  • 4 tests in src/health-server.test.ts (200 OK, 503 unhealthy, 404 non-/health, 500 on snapshot throw)
  • Full host suite: 34 test files, 357 tests, all passing on this branch
  • pnpm exec tsc --noEmit: clean

Paired-but-not-included: systemd watchdog

On my fork this commit was bundled with a src/watchdog.ts module that sends sd_notify READY=1 / WATCHDOG=1 / STOPPING=1, and a setup/service.ts change adding Type=notify + WatchdogSec=30s to the systemd unit. I'm deliberately not upstreaming that half:

  • The watchdog is a no-op without the unit-file change.
  • The unit-file change would push every installer onto systemd-notify semantics whether or not they're on systemd in the first place.

Whether to add sd_notify support is a separate design decision — it should land (or not) as its own PR with the unit-file change attached.

Restores v1's /health surface that was dropped in the v2 rewrite.

Three modules:
- src/health.ts — pure-function snapshot composer producing channel/queue/
  task/cursor-age status. Takes injected dependencies (HealthDeps) so the
  pure-assembly half is trivially testable; the I/O-bound side lives in
  health-snapshot.ts. Exposes formatHealthText() for chat-side reuse.
- src/health-snapshot.ts — supplies the runtime I/O (channel registry,
  delivery loop state, DB, per-session inbound DBs) and hands it to
  collectHealth(). Walks active session inbound DBs once per call to derive
  task counts; v2 has no central task table, so kind='task' rows in
  messages_in are the source of truth.
- src/health-server.ts — loopback HTTP server on 127.0.0.1, 200/503 from
  snapshot.healthy, no caching. Port from HEALTH_PORT env var, default 3002.

Wiring in src/index.ts: startHealthServer(port, snapshotHealth) on startup,
server.close() in shutdown(). delivery.ts and host-sweep.ts expose
getDeliveryPollsRunning() / isHostSweepRunning() — tiny additive accessors
the snapshot uses to report messageLoopRunning honestly.

The endpoint is loopback-only by design: the webhook server on 0.0.0.0:3000
is what's externally reachable; /health is for local probes and host-side
status commands, never public.

Tests: 19 in src/health.test.ts + 4 in src/health-server.test.ts.