fix(api): structured 404 envelope for unknown routes + docs drift guard #1406
suisuss wants to merge 6 commits
Catch-all app/api/[...slug]/route.ts returns the canonical {error, detail, request_id} envelope (lib/errors/api-envelope.ts) instead of Next.js HTML 404, so unknown /api/* paths stop being mistakable for 401s. Specific routes take precedence by App Router rules; existing /api/auth/[...all] and /api/execute/[...slug] catch-alls keep their own behavior. Cache-Control no-store prevents edge caching of transient misconfigs.
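A minimal sketch of what such a catch-all handler could look like, written against the standard Web `Request`/`Response` types instead of Next.js helpers so it runs on its own. The names `buildNotFoundEnvelope` and `notFound`, the `not_found` error string, and the `X-Request-Id` header are assumptions for illustration; the real envelope lives in lib/errors/api-envelope.ts and may differ.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical type mirroring the {error, detail, request_id} fields named above.
type ApiErrorEnvelope = {
  error: string;
  detail: string;
  request_id: string;
};

// Build the JSON body for an unknown /api/* path (wording is illustrative).
function buildNotFoundEnvelope(method: string, pathname: string, requestId: string): ApiErrorEnvelope {
  return {
    error: "not_found",
    detail: `No route matches ${method} ${pathname}`,
    request_id: requestId,
  };
}

// Return a structured 404 that caches are told not to store.
function notFound(request: Request): Response {
  const requestId = randomUUID();
  const { pathname } = new URL(request.url);
  const body = buildNotFoundEnvelope(request.method, pathname, requestId);
  return new Response(JSON.stringify(body), {
    status: 404,
    headers: {
      "Content-Type": "application/json",
      "Cache-Control": "no-store",
      "X-Request-Id": requestId, // assumed header name for the correlation id
    },
  });
}

// In an App Router catch-all, each exported verb would delegate to the same helper.
const handlers = { GET: notFound, POST: notFound, PUT: notFound, PATCH: notFound, DELETE: notFound };
```

Because every verb shares one helper, the status, headers, and envelope shape cannot drift between methods.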
Reconcile docs/api/chains.md with the bare-array shape /api/chains actually returns; align field list with what the route returns and document why defaultPrimaryRpc/defaultFallbackRpc are server-only (provider API keys).
Fix docs/api/user.md rpc-preferences: GET returns {preferences, resolved}; remove phantom POST /api/user/rpc-preferences entry (no handler exists); document per-chain PUT shape.
Remove phantom GET /api/web3/fetch-abi and GET /api/workflows/taxonomy entries from docs (no routes implement them).
scripts/check-api-docs-routes.ts parses every (METHOD, /api/path) line in docs/api/*.md inside fenced http blocks and asserts the corresponding route file exports that method, allowing param-name differences ({id} vs [executionId]); supports export-function, export-const, and re-export styles. Emits specs/api-coverage.json. Wired into pr-checks.yml as check:api-docs plus a drift check on the artifact.
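The parsing and param-name matching described above could be sketched roughly as follows. This is not the script itself: `extractEndpoints`, `toShape`, and the `:param` placeholder are hypothetical names, and the sketch assumes one `METHOD /api/path` pair per line inside a fence tagged `http`.

```typescript
type DocumentedEndpoint = { method: string; path: string };

// Built at runtime so this sketch never contains a literal fence line.
const FENCE = "`".repeat(3);
const ENDPOINT_LINE = /^(GET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS)\s+(\/api\/\S*)/;

// Collect "METHOD /api/path" lines, but only inside fenced http blocks.
function extractEndpoints(markdown: string): DocumentedEndpoint[] {
  const endpoints: DocumentedEndpoint[] = [];
  let fence: "none" | "http" | "other" = "none";
  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    if (line.startsWith(FENCE)) {
      if (fence !== "none") {
        fence = "none"; // closing fence
      } else {
        fence = line === FENCE + "http" ? "http" : "other"; // opening fence
      }
      continue;
    }
    if (fence !== "http") continue;
    const match = ENDPOINT_LINE.exec(line);
    if (match) {
      // Drop any query string so only the route path is compared.
      endpoints.push({ method: match[1], path: match[2].split("?")[0] });
    }
  }
  return endpoints;
}

// Replace {param} (docs) and [param] (filesystem) segments with one placeholder,
// so only the URL pattern shape matters, not the parameter name.
function toShape(path: string): string {
  return path
    .split("/")
    .map((segment) => (/^\{[^}]+\}$/.test(segment) || /^\[[^\]]+\]$/.test(segment) ? ":param" : segment))
    .join("/");
}
```

With this normalisation, `/api/executions/{executionId}` in the docs and `/api/executions/[id]` on disk reduce to the same shape, which is what lets the names differ.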
deploy-keeperhub.yaml gets a warn-only post-deploy HEAD probe that walks specs/api-coverage.json and reports any documented GET endpoint returning 404/5xx; reuses TEST_API_KEY from SSM and CF Access secrets already used by the health probe.
Self-review of #1406 surfaced three actionable bugs:
1. Live HEAD probe in deploy-keeperhub.yaml mixed -X HEAD with curl -L, which silently re-issues as GET after a redirect, and would mis-report 405 as drift for any GET-only handler whose middleware branches on request.method. Switched to a plain GET; HEAD probes were not buying us anything since the route bodies are already discarded via -o /dev/null.
2. The probe classified HTTP 400 as drift, but every path-param route returns 400 when the fixture string "probe-fixture" fails its address/UUID validator. That is the route working correctly. Added 400 to the reachable set.
3. The catch-all 404 in app/api/[...slug]/route.ts was silent, removing the diagnostic signal that motivated the ticket: we would never see in logs or Sentry that a real client kept polling a phantom path. Added a structured logSystemWarn with path and method labels; the comment notes that bots probe noisily and sampling can be added later if volume becomes a problem.
Also added a schema-sanity guard to the probe step: if zero endpoints get probed at all (artifact field renamed, jq filter silently producing nothing, file replaced with empty {endpoints: []}), the step now fails instead of greenlighting a deploy with no signal. The missing-file branch also now fails rather than silently exiting 0.
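The schema-sanity guard can be illustrated in TypeScript, although the actual step runs in the workflow with jq. The artifact layout below (an `endpoints` array of `{method, path}` entries) and the `selectProbeTargets` name are assumptions, not the committed schema.

```typescript
// Assumed artifact shape; every field is optional so a renamed field is detectable.
type CoverageArtifact = { endpoints?: Array<{ method?: string; path?: string }> };

// Pick the documented GET paths to probe, and fail loudly if there are none.
function selectProbeTargets(artifactJson: string): string[] {
  const artifact = JSON.parse(artifactJson) as CoverageArtifact;
  const targets = (artifact.endpoints ?? [])
    .filter((entry) => entry.method === "GET" && typeof entry.path === "string")
    .map((entry) => entry.path as string);
  if (targets.length === 0) {
    // Zero targets means the probe would report "clean" without checking anything.
    throw new Error("api-coverage artifact yielded zero GET endpoints; refusing to pass the probe");
  }
  return targets;
}
```

Throwing on an empty selection covers all three failure modes listed above with one check, since a renamed field, a broken filter, and an empty array all end in the same zero-length result.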
The catch-all logged every unmatched /api/* hit via logSystemWarn, which unconditionally calls captureException and allocates a new Error per request. As a public surface that bots probe with random paths, this floods Sentry with warning events and wastes quota. Switch to logUserError with no error argument: it still emits the structured Loki line and a bounded Prometheus counter but skips Sentry entirely. The request path moves into the log message (not a metric label) so Prometheus cardinality stays flat. A 404 on an unknown route is a client error, so the user-level log is also the correct severity.
The HEAD handler returned the full JSON envelope. RFC 9110 requires HEAD responses to carry no body. Reuse the GET 404 to keep the status, correlation id, and cache headers identical, then return a body-less NextResponse. Split HEAD out of the envelope-asserting test loop into a dedicated test that asserts an empty body with the same headers.
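A sketch of the HEAD fix using the standard Web `Response` type in place of NextResponse. `handleGet` is a hypothetical stand-in for the real GET 404 handler, and the header names are assumptions.

```typescript
// Stand-in for the GET handler: a 404 with a JSON envelope (illustrative values).
function handleGet(requestId: string): Response {
  return new Response(JSON.stringify({ error: "not_found", detail: "unknown route", request_id: requestId }), {
    status: 404,
    headers: {
      "Content-Type": "application/json",
      "Cache-Control": "no-store",
      "X-Request-Id": requestId,
    },
  });
}

// HEAD reuses the GET response for status and headers, then drops the body.
function handleHead(requestId: string): Response {
  const getResponse = handleGet(requestId);
  return new Response(null, {
    status: getResponse.status,
    headers: getResponse.headers,
  });
}
```

Deriving HEAD from GET, instead of building it separately, keeps the two responses identical apart from the body.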
…c routes
Two issues in the post-deploy endpoint probe:
1. curl -L resends custom -H headers (X-API-Key, CF Access secret) across
redirects, including cross-host, which would leak credentials if
BASE_URL ever 30x'd off-origin. The probe does not need to follow
redirects, so drop -L.
2. Probing a path-param route (/api/workflows/{workflowId}) with a fixture
id legitimately returns 404 'resource not found' even though the route
exists - indistinguishable from a missing route file. Treating that as
drift would block releases on healthy routes once continue-on-error is
flipped off. Now 404 is drift only for static (parameter-free) paths;
param routes count any 2xx/4xx as reachable and gate only on 5xx/000.
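The classification rule in point 2 can be written down as a small function. This is a TypeScript sketch of logic that lives in the workflow's shell step; `classifyProbe` is a hypothetical name, status 0 stands for curl's 000, and treating 3xx as reachable is an assumption the text above does not cover.

```typescript
type ProbeVerdict = "reachable" | "drift";

// A documented path is a param route if any segment is a {placeholder}.
function isParamRoute(path: string): boolean {
  return /\{[^}]+\}/.test(path);
}

// Decide whether one probed status counts as drift.
function classifyProbe(path: string, status: number): ProbeVerdict {
  // No HTTP response at all (curl prints 000) or a server error is always drift.
  if (status === 0 || status >= 500) return "drift";
  // A 404 only signals a missing route when the path has no parameters;
  // on a param route it may just mean the fixture id does not exist.
  if (status === 404 && !isParamRoute(path)) return "drift";
  // Everything else (2xx, other 4xx, and by assumption 3xx) means the route answered.
  return "reachable";
}
```

For example, `classifyProbe("/api/chains", 404)` yields `"drift"`, while `classifyProbe("/api/workflows/{workflowId}", 404)` yields `"reachable"`.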
writeCoverageArtifact sorted endpoints with localeCompare, whose default collation depends on the runtime locale/ICU. Because CI byte-compares the committed specs/api-coverage.json against a fresh run, a contributor whose locale collates differently from CI could produce a 'stale' artifact that fails the build. Switch to a raw code-point comparator and regenerate the artifact (param segments now sort after sibling static/letters, matching byte order). Also remove docsPathToRouteFile, which was dead - validate() inlines the same shapeIndex lookup.
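A locale-independent comparator of the kind described could look like the sketch below. `compareRaw` and `sortEndpoints` are hypothetical names, and sorting by path and then method is an assumption about the artifact's ordering. JavaScript's `<` on strings compares UTF-16 code units, which coincides with code-point and byte order for the ASCII paths involved here.

```typescript
type Endpoint = { method: string; path: string };

// Compare without consulting locale or ICU data, unlike String.prototype.localeCompare.
function compareRaw(a: string, b: string): number {
  if (a === b) return 0;
  return a < b ? -1 : 1;
}

// Sort a copy by path, then by method, so the output is stable across machines.
function sortEndpoints(endpoints: Endpoint[]): Endpoint[] {
  return [...endpoints].sort(
    (left, right) => compareRaw(left.path, right.path) || compareRaw(left.method, right.method),
  );
}

const sorted = sortEndpoints([
  { method: "GET", path: "/api/workflows/{workflowId}" },
  { method: "POST", path: "/api/workflows" },
  { method: "GET", path: "/api/workflows/public" },
  { method: "GET", path: "/api/workflows" },
]);
// "{" (0x7B) sorts after every ASCII letter, so the param route comes last.
console.log(sorted.map((endpoint) => `${endpoint.method} ${endpoint.path}`));
```

Since `{` has a higher code unit than any lowercase letter, param segments land after their static siblings, which is the ordering change the regenerated artifact shows.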
Pushed 4 commits addressing review findings (each scoped to one concern):
Catch-all 404 no longer floods Sentry
Catch-all HEAD returns no body
Deploy probe hardened
Deterministic coverage sort + dead code removal
Not changed, with rationale:
Local verification:
Problem
Hackathon builders reported a pattern where endpoints in docs/api/ either 404 in production or return a shape that diverges from the docs. Compounding that, Next.js's default 404 returns an HTML page, which makes an unknown URL look identical to a 401 to anyone reading status codes. The result: real time lost probing assumed-by-REST-convention paths that the docs never advertised, plus opaque debugging when docs and code disagreed.

What this changes
Structured JSON 404 for unknown /api/* paths
app/api/[...slug]/route.ts is a catch-all that exports every verb and returns the canonical {error, detail, request_id} envelope from lib/errors/api-envelope.ts. It sets Cache-Control: no-store so a transient prod misconfig does not get cached as a permanent 404. More specific routes take precedence by App Router rules, so the existing app/api/auth/[...all] and app/api/execute/[...slug] keep their own behavior.

Reconcile documented response shapes
docs/api/chains.md: response is a bare array, not {data: [...]}. Field list aligned with what the route returns; documented why defaultPrimaryRpc/defaultFallbackRpc are intentionally server-only (provider API keys).
docs/api/user.md: GET /api/user/rpc-preferences returns {preferences, resolved}, not {data: [...]}. Removed phantom POST /api/user/rpc-preferences entry (no handler exists - writes go through the per-chain PUT, which is now properly documented with the correct body shape {primaryRpcUrl, fallbackRpcUrl?}).
docs/api/chains.md: removed phantom GET /api/web3/fetch-abi ("Alternative ABI Fetch" - the route is POST with a JSON body; the canonical /api/chains/{chainId}/abi is already documented above).
docs/api/workflows.md: removed phantom GET /api/workflows/taxonomy (no route file). If we want it, build it - I did not silently expand scope.

PR-time docs-vs-code guard
scripts/check-api-docs-routes.ts walks every (METHOD /api/path) line inside fenced http blocks under docs/api/*.md and asserts the corresponding app/api/<path>/route.ts exports that method. Handles all three export styles observed in the repo: export async function, export const = ..., and export { POST } from "...". Path-parameter names are allowed to differ between docs ({executionId}) and the filesystem ([id]) - only the URL pattern shape matters. Routes that exist in code but appear in no docs file are surfaced as warnings, never failures (/api/internal, /api/cron, /api/og, /api/auth, /api/oauth, etc. are on an explicit allowlist).
Wired into .github/workflows/pr-checks.yml as pnpm check:api-docs, with a follow-up step that fails CI if specs/api-coverage.json (committed) needs regenerating.

Post-deploy live HEAD probe
.github/workflows/deploy-keeperhub.yaml gets a step after the existing /api/health probe that walks specs/api-coverage.json and HEADs every documented GET endpoint on the deployed env. Reuses TEST_API_KEY (already fetched from SSM) and CF Access secrets. 200/401/403 count as reachable; 404/5xx are flagged. Initial mode is continue-on-error: true so we can observe what fires before it gates a release; flip after one clean run.

Verification
Local trifecta (worktree):
Manual smoke:
Post-merge staging probe (in the deploy job logs): every endpoint in specs/api-coverage.json should return 200/401/403. Any 404 is real drift and surfaces as a warn line during the first deploy.

Tradeoffs
Did not wrap /api/chains in {data: [...]} to "fix" the drift. That would break every external consumer built against the current shape. /api/workflows is also bare-array; the house style is bare arrays.

Follow-ups (separate tickets)
scripts/check-api-docs-routes.ts.
GET /api/workflows/taxonomy - if the "distinct categories and protocols from public workflows" endpoint is desired, build it (it would be useful for filter UIs).
/openapi.json covering the full REST surface. The current app/api/openapi/route.ts only covers listed public workflows; the wider spec is its own piece of work.