hotfix(security): harden internal service auth with HMAC#1405
Conversation
Replaces the bearer-token-only X-Service-Key scheme with HMAC-SHA256
request authentication bound to method, path, caller, body digest, and
unix timestamp. 300-second replay window matches the agentic-wallet
precedent; no server-side nonce cache for the same latency-cost reason.
Signing secrets move into a new versioned internal_service_hmac_secrets
table modelled on agentic_wallet_hmac_secrets: AES-256-GCM at rest,
AAD-bound to (caller, key_version), 24h rotation grace via expires_at.
v1 reuses AGENTIC_WALLET_HMAC_KMS_KEY (a dedicated key is a documented
follow-up) and stores one shared secret under caller='*shared*' (per-
producer split is a one-row follow-up).
Verifier accepts both schemes during the dual-accept rollout window:
- presence of any X-KH-* header commits to the HMAC path with NO
fallback (prevents an attacker who knows the legacy bearer from
appending a junk signature to slip through the looser path),
- INTERNAL_AUTH_REQUIRE_HMAC=true rejects all legacy bearers and is
the per-environment kill-switch,
- X-Service-Key or X-Internal-Token falls back to the existing
timingSafeEqual compare against the four *_SERVICE_API_KEY env vars
plus KEEPERHUB_API_KEY.
Consolidates the /api/workflows/events bespoke X-Internal-Token check
into authenticateInternalService. The 500-on-misconfigured-env path is
intentionally dropped because it leaks operational state to clients.
All 12 internal-service consumer routes migrated to a raw-body pattern
so the HMAC verifier hashes the same bytes that arrive over the wire.
The workflow execute route was the highest-risk migration because it
consumes the body twice; raw body is captured once at the top and
parsed from the captured string thereafter.
Audit log via logInternalAuthEvent fires once per request with caller,
scheme, route, ip, key_version, latency. Caller field sanitized to the
InternalCaller enum or 'unknown' to bound log cardinality. Logs-only
for v1; Prometheus counter and Sentry breadcrumb deferred.
Drizzle migration 0094 hand-written since drizzle-kit generate is
broken in this repo; journal idx 94 appended with when=1779843900001
(monotonic).
Verified locally:
- pnpm exec biome check: clean on touched files
- pnpm type-check: clean across the whole codebase
- pnpm vitest run tests/unit tests/integration: 6132/6132 pass
- pnpm db:migrate against the dev DB: applies cleanly, table exists
with PK on (caller, key_version)
Resolve drizzle/meta/_journal.json conflict: staging owns 0094_chain_status and 0095_keep_669_reconcile_incident_indexes; renumber the internal-service HMAC migration to 0096 (when=1779843900003) to preserve monotonic ordering. events/route.ts auto-merge verified clean (non-overlapping auth vs chains.status). type-check passes on the merged tree.
PR Environment DeployedYour PR environment has been deployed! Environment Details:
Components:
The environment will be automatically cleaned up when this PR is closed or merged. |
|
Tooling follow-up: the kh CLI cannot exercise the hardened internal-service endpoints While validating this change in a PR environment, a tooling gap surfaced that is worth tracking separately: the kh CLI cannot be used to test or operate the endpoints this PR protects. kh authenticates against the user-facing API (user API key / better-auth session). The routes hardened here - /api/internal/, /api/workflow/[id]/execute, /api/workflows/events, /api/billing/overage, /api/billing/debt-scan, /api/hub/featured - accept only the legacy service key (X-Service-Key) or the new HMAC scheme (X-KH-). kh sends neither, so it returns 401 on all of them. Confirmed against the PR env: user-layer auth is rejected; only service-key / HMAC requests pass the verifier. If we want CLI-driven testing or ops of internal-service endpoints (sending a service key, or signing an X-KH-* request with the shared HMAC secret), kh needs a follow-up - e.g. an internal command group that attaches the service key or computes the X-KH-* signature. Not a blocker for this PR; raising as a separate tooling task. |
The internal-service HMAC store decrypts seeded secrets with AGENTIC_WALLET_HMAC_KMS_KEY (shared with agentic-wallet for v1). Staging and prod app values set it; the PR-environment app template did not, so a seeded internal_service_hmac_secrets row could never be decrypted and every HMAC request fell through to "No active signing secret". Mirror the staging entry so PR envs can validate the HMAC accept-path once INTERNAL_AUTH_REQUIRE_HMAC is flipped.
PR Environment DeployedYour PR environment has been deployed! Environment Details:
Components:
The environment will be automatically cleaned up when this PR is closed or merged. |
|
@OleksandrUA @joelorzet please review |
Staging advanced 40 commits and added 0096_keep_683_workflow_gas_columns, colliding with our hand-numbered migration. Renumber the internal-service HMAC migration to 0097 (when=1779843900004) and keep staging's 0096. type-check passes on the merged tree.
🧹 PR Environment Cleaned UpThe PR environment has been successfully deleted. Deleted Resources:
All resources have been cleaned up and will no longer incur costs. |
PR Environment DeployedYour PR environment has been deployed! Environment Details:
Components:
The environment will be automatically cleaned up when this PR is closed or merged. |
Summary
Replaces the bearer-token-only
X-Service-Keyscheme on every internal service endpoint with HMAC-SHA256 request authentication bound to method, path, caller identity, body digest, and a 300-second timestamp window. A capturedX-Service-Keyheader alone can no longer trigger any state-changing internal endpoint once the rollout completes.Closes the post-mortemed shared-key replay surface (rotation alone left the design unchanged; this changes the design).
Design
New
internal_service_hmac_secretstable, versioned, AES-256-GCM at rest, AAD-bound to(caller, key_version), 24h rotation grace viaexpires_at. Modelled directly onagentic_wallet_hmac_secretsand reuses its encrypt/decrypt helpers (sharesAGENTIC_WALLET_HMAC_KMS_KEYfor v1; a dedicated key is a documented follow-up).Dual-accept verifier: presence of any
X-KH-*header commits to the HMAC path with no fallback to the legacy bearer (prevents an attacker who knows the legacy bearer from appending a junk signature to slip through the looser path).INTERNAL_AUTH_REQUIRE_HMAC=trueis the per-environment kill-switch for the legacy path.Consolidates the
/api/workflows/eventsbespokeX-Internal-Tokencheck into the shared verifier. Drops the 500-on-misconfigured-env behaviour (leaks operational state to clients).12 internal-service consumer routes migrated to a raw-body pattern so the verifier hashes the same bytes that arrive over the wire. The workflow execute route was the highest-risk migration because it consumes the body twice; the raw body is captured once at the top.
logInternalAuthEventfires once per request with caller, scheme, route, ip, key_version, latency. Caller field sanitized to the InternalCaller enum or"unknown"to bound log cardinality. Logs-only for v1 (Prometheus counter and Sentry breadcrumb deferred).Verification
pnpm exec biome checkclean on the 21 touched filespnpm type-checkclean across the whole codebasepnpm vitest run tests/unit tests/integration: 6132/6132 pass (15 new tests on the verifier + 31 updated integration mocks)pnpm db:migrateagainst the dev DB applies cleanly; the table exists with the expected PK on(caller, key_version)Rollout (not in this PR)
This PR lands the server side only. To actually close the hole:
internal_service_hmac_secretsrow in staging viascripts/seed-internal-service-hmac.ts.keeperhub-executor,keeperhub-scheduler,keeperhub-events.INTERNAL_AUTH_REQUIRE_HMAC=truein staging once all three clients ship; watch the audit-log reject counter for a day.Deliberate scope choices to flag
drizzle-kit generatebreakage). The_journal.jsonentry useswhen=1779843900001to preserve monotonicity. No snapshot file - the repo already operates without snapshots beyond 0089.caller="*shared*"for v1 instead of per-producer rows. Per-producer split is a one-INSERT-plus-env-update follow-up; the audit log already distinguishes callers because the identity is signed into the HMAC.