Skip to content

fix(api): structured 404 envelope for unknown routes + docs drift guard#1406

Open
suisuss wants to merge 6 commits into
stagingfrom
feature/KEEP-476-api-docs-drift
Open

fix(api): structured 404 envelope for unknown routes + docs drift guard#1406
suisuss wants to merge 6 commits into
stagingfrom
feature/KEEP-476-api-docs-drift

Conversation

@suisuss
Copy link
Copy Markdown

@suisuss suisuss commented May 29, 2026

Problem

Hackathon builders reported a pattern where endpoints in docs/api/ either 404 in production or return a shape that diverges from the docs. Compounding that, Next.js's default 404 returns an HTML page, which makes an unknown URL look identical to a 401 to anyone reading status codes. The result: real time lost probing assumed-by-REST-convention paths that the docs never advertised, plus opaque debugging when docs and code disagreed.

What this changes

Structured JSON 404 for unknown /api/* paths

app/api/[...slug]/route.ts is a catch-all that exports every verb and returns the canonical {error, detail, request_id} envelope from lib/errors/api-envelope.ts. Cache-Control: no-store so a transient prod misconfig does not get cached as a permanent 404. More specific routes take precedence by App Router rules, so the existing app/api/auth/[...all] and app/api/execute/[...slug] keep their own behavior.

Reconcile documented response shapes

  • docs/api/chains.md: response is a bare array, not {data: [...]}. Field list aligned with what the route returns; documented why defaultPrimaryRpc/defaultFallbackRpc are intentionally server-only (provider API keys).
  • docs/api/user.md: GET /api/user/rpc-preferences returns {preferences, resolved}, not {data: [...]}. Removed phantom POST /api/user/rpc-preferences entry (no handler exists - writes go through the per-chain PUT, which is now properly documented with the correct body shape {primaryRpcUrl, fallbackRpcUrl?}).
  • docs/api/chains.md: removed phantom GET /api/web3/fetch-abi ("Alternative ABI Fetch" - the route is POST with a JSON body; the canonical /api/chains/{chainId}/abi is already documented above).
  • docs/api/workflows.md: removed phantom GET /api/workflows/taxonomy (no route file). If we want it, build it - I did not silently expand scope.

PR-time docs-vs-code guard

scripts/check-api-docs-routes.ts walks every (METHOD /api/path) line inside fenced http blocks under docs/api/*.md and asserts the corresponding app/api/<path>/route.ts exports that method. Handles all three export styles observed in the repo: export async function, export const = ..., and export { POST } from "...". Path-parameter names are allowed to differ between docs ({executionId}) and the filesystem ([id]) - only the URL pattern shape matters. Routes that exist in code but appear in no docs file are surfaced as warnings, never failures (/api/internal, /api/cron, /api/og, /api/auth, /api/oauth, etc. are on an explicit allowlist).

Wired into .github/workflows/pr-checks.yml as pnpm check:api-docs, with a follow-up step that fails CI if specs/api-coverage.json (committed) needs regenerating.

Post-deploy live HEAD probe

.github/workflows/deploy-keeperhub.yaml gets a step after the existing /api/health probe that walks specs/api-coverage.json and HEADs every documented GET endpoint on the deployed env. Reuses TEST_API_KEY (already fetched from SSM) and CF Access secrets. 200/401/403 count as reachable; 404/5xx are flagged. Initial mode is continue-on-error: true so we can observe what fires before it gates a release; flip after one clean run.

Verification

Local trifecta (worktree):

pnpm check                                       # green
NODE_OPTIONS=--max-old-space-size=8192 pnpm type-check  # green (needs pnpm discover-plugins first)
pnpm test tests/unit/api-not-found.test.ts       # 9/9
pnpm check:api-docs                              # 0 drift, 64 undocumented routes (warn-only)

Manual smoke:

docker compose --profile dev up
curl -s http://localhost:3000/api/does-not-exist | jq
# { "error": "not_found", "detail": "Route GET /api/does-not-exist not found", "request_id": "..." }

Post-merge staging probe (in the deploy job logs): every endpoint in specs/api-coverage.json should return 200/401/403. Any 404 is real drift and surfaces as a warn line during the first deploy.

Tradeoffs

  • No did-you-mean alias map. The structured 404 is already a step-change; alias maintenance cost is not worth the one-time builder confusion.
  • Not wrapping /api/chains in {data: [...]} to "fix" the drift. That would break every external consumer built against the current shape. /api/workflows is also bare-array; the house style is bare arrays.
  • No new ts-morph dep. Regex over the three observed export styles is sufficient; the failure mode (false-negative on unusual formatting) is easy to spot.
  • Live probe is warn-only initially. Flipping it to required is a one-line change once we see a clean run.

Follow-ups (separate tickets)

  • 64 routes are in code but undocumented. Decide which should become public docs vs. expand the allowlist in scripts/check-api-docs-routes.ts.
  • GET /api/workflows/taxonomy - if the "distinct categories and protocols from public workflows" endpoint is desired, build it (it would be useful for filter UIs).
  • OpenAPI spec at /openapi.json covering the full REST surface. The current app/api/openapi/route.ts only covers listed public workflows; the wider spec is its own piece of work.

Catch-all app/api/[...slug]/route.ts returns the canonical {error, detail, request_id} envelope (lib/errors/api-envelope.ts) instead of Next.js HTML 404, so unknown /api/* paths stop being mistakable for 401s. Specific routes take precedence by App Router rules; existing /api/auth/[...all] and /api/execute/[...slug] catch-alls keep their own behavior. Cache-Control no-store prevents edge caching of transient misconfigs.

Reconcile docs/api/chains.md with the bare-array shape /api/chains actually returns; align field list with what the route returns and document why defaultPrimaryRpc/defaultFallbackRpc are server-only (provider API keys).

Fix docs/api/user.md rpc-preferences: GET returns {preferences, resolved}; remove phantom POST /api/user/rpc-preferences entry (no handler exists); document per-chain PUT shape.

Remove phantom GET /api/web3/fetch-abi and GET /api/workflows/taxonomy entries from docs (no routes implement them).

scripts/check-api-docs-routes.ts parses every (METHOD, /api/path) line in docs/api/*.md inside fenced http blocks and asserts the corresponding route file exports that method, allowing param-name differences ({id} vs [executionId]); supports both export-function, export-const, and re-export styles. Emits specs/api-coverage.json. Wired into pr-checks.yml as check:api-docs plus a drift check on the artifact.

deploy-keeperhub.yaml gets a warn-only post-deploy HEAD probe that walks specs/api-coverage.json and reports any documented GET endpoint returning 404/5xx; reuses TEST_API_KEY from SSM and CF Access secrets already used by the health probe.
Self-review of #1406 surfaced three actionable bugs:

1. Live HEAD probe in deploy-keeperhub.yaml mixed -X HEAD with curl -L, which silently re-issues as GET after a redirect, and would mis-report 405 as drift for any GET-only handler whose middleware branches on request.method. Switched to a plain GET; HEAD probes were not buying us anything since the route bodies are already discarded via -o /dev/null.

2. The probe classified HTTP 400 as drift, but every path-param route returns 400 when the fixture string "probe-fixture" fails its address/UUID validator. That is the route working correctly. Added 400 to the reachable set.

3. The catch-all 404 in app/api/[...slug]/route.ts was silent, removing the diagnostic signal that motivated the ticket: we would never see in logs or Sentry that a real client kept polling a phantom path. Added a structured logSystemWarn with path and method labels; the comment notes that bots probe noisily and sampling can be added later if volume becomes a problem.

Also added a schema-sanity guard to the probe step: if zero endpoints get probed at all (artifact field renamed, jq filter silently producing nothing, file replaced with empty {endpoints: []}), the step now fails instead of greenlighting a deploy with no signal. The missing-file branch also now fails rather than silently exiting 0.
suisuss added 4 commits June 1, 2026 14:16
The catch-all logged every unmatched /api/* hit via logSystemWarn, which
unconditionally calls captureException and allocates a new Error per
request. As a public surface that bots probe with random paths, this
floods Sentry with warning events and wastes quota.

Switch to logUserError with no error argument: it still emits the
structured Loki line and a bounded Prometheus counter but skips Sentry
entirely. The request path moves into the log message (not a metric
label) so Prometheus cardinality stays flat. A 404 on an unknown route
is a client error, so the user-level log is also the correct severity.
The HEAD handler returned the full JSON envelope. RFC 9110 requires HEAD
responses to carry no body. Reuse the GET 404 to keep the status,
correlation id, and cache headers identical, then return a body-less
NextResponse. Split HEAD out of the envelope-asserting test loop into a
dedicated test that asserts an empty body with the same headers.
…c routes

Two issues in the post-deploy endpoint probe:

1. curl -L resends custom -H headers (X-API-Key, CF Access secret) across
   redirects, including cross-host, which would leak credentials if
   BASE_URL ever 30x'd off-origin. The probe does not need to follow
   redirects, so drop -L.

2. Probing a path-param route (/api/workflows/{workflowId}) with a fixture
   id legitimately returns 404 'resource not found' even though the route
   exists - indistinguishable from a missing route file. Treating that as
   drift would block releases on healthy routes once continue-on-error is
   flipped off. Now 404 is drift only for static (parameter-free) paths;
   param routes count any 2xx/4xx as reachable and gate only on 5xx/000.
writeCoverageArtifact sorted endpoints with localeCompare, whose default
collation depends on the runtime locale/ICU. Because CI byte-compares the
committed specs/api-coverage.json against a fresh run, a contributor whose
locale collates differently from CI could produce a 'stale' artifact that
fails the build. Switch to a raw code-point comparator and regenerate the
artifact (param segments now sort after sibling static/letters, matching
byte order). Also remove docsPathToRouteFile, which was dead - validate()
inlines the same shapeIndex lookup.
@suisuss
Copy link
Copy Markdown
Author

suisuss commented Jun 1, 2026

Pushed 4 commits addressing review findings (each scoped to one concern):

Catch-all 404 no longer floods Sentry (94b80e2a)
The handler logged every unmatched /api/* hit via logSystemWarn, which unconditionally calls captureException and allocates a new Error per request. On a public surface that bots probe with random paths, that floods Sentry and wastes quota. Switched to logUserError with no error argument: still emits the structured Loki line and a bounded Prometheus counter, but skips Sentry entirely. The request path moved into the log message (not a metric label) so Prometheus cardinality stays flat. A 404 on an unknown route is a client error, so user-level is also the correct severity.

Catch-all HEAD returns no body (ce1cd717)
HEAD returned the full JSON envelope; RFC 9110 requires HEAD responses to carry no body. It now reuses the GET 404 (same status, correlation id, and cache headers) and returns a body-less response. Test split out accordingly.

Deploy probe hardened (bd2982cd)

  • Dropped curl -L: curl resends custom -H headers (including X-API-Key and the CF Access secret) across redirects, even cross-host, which would leak credentials if BASE_URL ever 30x-ed off-origin. The probe does not need to follow redirects.
  • 404 is now treated as drift only for static (parameter-free) paths. A path-param route (e.g. /api/workflows/{workflowId}) probed with a fixture id legitimately returns 404 resource-not-found while the route itself exists, which is indistinguishable from a missing route file. Treating that as drift would block releases on healthy routes once continue-on-error is flipped off. Param routes now count any 2xx/4xx as reachable and gate only on 5xx/000.

Deterministic coverage sort + dead code removal (69036012)
writeCoverageArtifact sorted with localeCompare, whose default collation depends on runtime locale/ICU. Because CI byte-compares the committed specs/api-coverage.json against a fresh run, a contributor whose locale collates differently from CI could produce a stale artifact that fails the build. Switched to a raw code-point comparator and regenerated the artifact (69 entries, reordered only). Also removed docsPathToRouteFile, which was dead, validate() inlines the same lookup.

Not changed, with rationale:

  • The guard validates route existence and method export, not response shape. Catching field-level drift needs the OpenAPI spec already deferred to a follow-up, so the manual shape reconciliation in this PR stands but is not auto-guarded going forward.
  • The method-export regex already excludes line-commented exports; only a block-commented export at column 0 would false-match, and the failure mode is a benign false-pass. A real fix needs AST parsing, which this PR deliberately avoided.

Local verification: type-check clean, check:api-docs 0 drift, catch-all unit test 9/9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant