Skip to content

hotfix(security): harden internal service auth with HMAC#1405

Merged
suisuss merged 4 commits into
stagingfrom
hotfix/KEEP-665-harden-internal-service-auth
Jun 3, 2026
Merged

hotfix(security): harden internal service auth with HMAC#1405
suisuss merged 4 commits into
stagingfrom
hotfix/KEEP-665-harden-internal-service-auth

Conversation

@suisuss
Copy link
Copy Markdown

@suisuss suisuss commented May 29, 2026

Summary

Replaces the bearer-token-only X-Service-Key scheme on every internal service endpoint with HMAC-SHA256 request authentication bound to method, path, caller identity, body digest, and a 300-second timestamp window. A captured X-Service-Key header alone can no longer trigger any state-changing internal endpoint once the rollout completes.

Closes the post-mortemed shared-key replay surface (rotation alone left the design unchanged; this changes the design).

Design

  • New internal_service_hmac_secrets table, versioned, AES-256-GCM at rest, AAD-bound to (caller, key_version), 24h rotation grace via expires_at. Modelled directly on agentic_wallet_hmac_secrets and reuses its encrypt/decrypt helpers (shares AGENTIC_WALLET_HMAC_KMS_KEY for v1; a dedicated key is a documented follow-up).

  • Dual-accept verifier: presence of any X-KH-* header commits to the HMAC path with no fallback to the legacy bearer (prevents an attacker who knows the legacy bearer from appending a junk signature to slip through the looser path). INTERNAL_AUTH_REQUIRE_HMAC=true is the per-environment kill-switch for the legacy path.

  • Consolidates the /api/workflows/events bespoke X-Internal-Token check into the shared verifier. Drops the 500-on-misconfigured-env behaviour (leaks operational state to clients).

  • 12 internal-service consumer routes migrated to a raw-body pattern so the verifier hashes the same bytes that arrive over the wire. The workflow execute route was the highest-risk migration because it consumes the body twice; the raw body is captured once at the top.

  • logInternalAuthEvent fires once per request with caller, scheme, route, ip, key_version, latency. Caller field sanitized to the InternalCaller enum or "unknown" to bound log cardinality. Logs-only for v1 (Prometheus counter and Sentry breadcrumb deferred).

Verification

  • pnpm exec biome check clean on the 21 touched files
  • pnpm type-check clean across the whole codebase
  • pnpm vitest run tests/unit tests/integration: 6132/6132 pass (15 new tests on the verifier + 31 updated integration mocks)
  • pnpm db:migrate against the dev DB applies cleanly; the table exists with the expected PK on (caller, key_version)

Rollout (not in this PR)

This PR lands the server side only. To actually close the hole:

  1. Seed internal_service_hmac_secrets row in staging via scripts/seed-internal-service-hmac.ts.
  2. Migrate the three client repos to send HMAC headers: keeperhub-executor, keeperhub-scheduler, keeperhub-events.
  3. Flip INTERNAL_AUTH_REQUIRE_HMAC=true in staging once all three clients ship; watch the audit-log reject counter for a day.
  4. Repeat in prod.
  5. Follow-up cleanup PR removes the legacy bearer code at least 7 days after the prod flip.

Deliberate scope choices to flag

  • Drizzle migration 0094 is hand-written (per the repo's known drizzle-kit generate breakage). The _journal.json entry uses when=1779843900001 to preserve monotonicity. No snapshot file - the repo already operates without snapshots beyond 0089.
  • Shared signing secret under caller="*shared*" for v1 instead of per-producer rows. Per-producer split is a one-INSERT-plus-env-update follow-up; the audit log already distinguishes callers because the identity is signed into the HMAC.
  • No nonce cache; matches the agentic-wallet precedent.
  • No DB-backed audit table; logs-only via Loki for v1.

suisuss added 2 commits May 29, 2026 13:36
Replaces the bearer-token-only X-Service-Key scheme with HMAC-SHA256
request authentication bound to method, path, caller, body digest, and
unix timestamp. 300-second replay window matches the agentic-wallet
precedent; no server-side nonce cache for the same latency-cost reason.

Signing secrets move into a new versioned internal_service_hmac_secrets
table modelled on agentic_wallet_hmac_secrets: AES-256-GCM at rest,
AAD-bound to (caller, key_version), 24h rotation grace via expires_at.
v1 reuses AGENTIC_WALLET_HMAC_KMS_KEY (a dedicated key is a documented
follow-up) and stores one shared secret under caller='*shared*' (per-
producer split is a one-row follow-up).

Verifier accepts both schemes during the dual-accept rollout window:

  - presence of any X-KH-* header commits to the HMAC path with NO
    fallback (prevents an attacker who knows the legacy bearer from
    appending a junk signature to slip through the looser path),
  - INTERNAL_AUTH_REQUIRE_HMAC=true rejects all legacy bearers and is
    the per-environment kill-switch,
  - X-Service-Key or X-Internal-Token falls back to the existing
    timingSafeEqual compare against the four *_SERVICE_API_KEY env vars
    plus KEEPERHUB_API_KEY.

Consolidates the /api/workflows/events bespoke X-Internal-Token check
into authenticateInternalService. The 500-on-misconfigured-env path is
intentionally dropped because it leaks operational state to clients.

All 12 internal-service consumer routes migrated to a raw-body pattern
so the HMAC verifier hashes the same bytes that arrive over the wire.
The workflow execute route was the highest-risk migration because it
consumes the body twice; raw body is captured once at the top and
parsed from the captured string thereafter.

Audit log via logInternalAuthEvent fires once per request with caller,
scheme, route, ip, key_version, latency. Caller field sanitized to the
InternalCaller enum or 'unknown' to bound log cardinality. Logs-only
for v1; Prometheus counter and Sentry breadcrumb deferred.

Drizzle migration 0094 hand-written since drizzle-kit generate is
broken in this repo; journal idx 94 appended with when=1779843900001
(monotonic).

Verified locally:
  - pnpm exec biome check: clean on touched files
  - pnpm type-check: clean across the whole codebase
  - pnpm vitest run tests/unit tests/integration: 6132/6132 pass
  - pnpm db:migrate against the dev DB: applies cleanly, table exists
    with PK on (caller, key_version)
Resolve drizzle/meta/_journal.json conflict: staging owns 0094_chain_status
and 0095_keep_669_reconcile_incident_indexes; renumber the internal-service
HMAC migration to 0096 (when=1779843900003) to preserve monotonic ordering.
events/route.ts auto-merge verified clean (non-overlapping auth vs chains.status).
type-check passes on the merged tree.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 2, 2026

PR Environment Deployed

Your PR environment has been deployed!

Environment Details:

Components:

  • Keeperhub Application
  • PostgreSQL Database (isolated instance)
  • LocalStack (SQS emulation)
  • Redis (isolated instance)
  • Schedule Dispatcher (staging image)
  • Block Dispatcher (staging image)
  • Event Tracker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

@suisuss
Copy link
Copy Markdown
Author

suisuss commented Jun 2, 2026

Tooling follow-up: the kh CLI cannot exercise the hardened internal-service endpoints

While validating this change in a PR environment, a tooling gap surfaced that is worth tracking separately: the kh CLI cannot be used to test or operate the endpoints this PR protects.

kh authenticates against the user-facing API (user API key / better-auth session). The routes hardened here - /api/internal/, /api/workflow/[id]/execute, /api/workflows/events, /api/billing/overage, /api/billing/debt-scan, /api/hub/featured - accept only the legacy service key (X-Service-Key) or the new HMAC scheme (X-KH-). kh sends neither, so it returns 401 on all of them. Confirmed against the PR env: user-layer auth is rejected; only service-key / HMAC requests pass the verifier.

If we want CLI-driven testing or ops of internal-service endpoints (sending a service key, or signing an X-KH-* request with the shared HMAC secret), kh needs a follow-up - e.g. an internal command group that attaches the service key or computes the X-KH-* signature. Not a blocker for this PR; raising as a separate tooling task.

The internal-service HMAC store decrypts seeded secrets with
AGENTIC_WALLET_HMAC_KMS_KEY (shared with agentic-wallet for v1). Staging and
prod app values set it; the PR-environment app template did not, so a seeded
internal_service_hmac_secrets row could never be decrypted and every HMAC
request fell through to "No active signing secret". Mirror the staging entry
so PR envs can validate the HMAC accept-path once INTERNAL_AUTH_REQUIRE_HMAC
is flipped.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 2, 2026

PR Environment Deployed

Your PR environment has been deployed!

Environment Details:

Components:

  • Keeperhub Application
  • PostgreSQL Database (isolated instance)
  • LocalStack (SQS emulation)
  • Redis (isolated instance)
  • Schedule Dispatcher (staging image)
  • Block Dispatcher (staging image)
  • Event Tracker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

@eskp
Copy link
Copy Markdown

eskp commented Jun 2, 2026

@OleksandrUA @joelorzet please review

Staging advanced 40 commits and added 0096_keep_683_workflow_gas_columns,
colliding with our hand-numbered migration. Renumber the internal-service HMAC
migration to 0097 (when=1779843900004) and keep staging's 0096. type-check
passes on the merged tree.
@suisuss suisuss merged commit 5fb5b3e into staging Jun 3, 2026
39 of 40 checks passed
@suisuss suisuss deleted the hotfix/KEEP-665-harden-internal-service-auth branch June 3, 2026 03:14
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1405
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

PR Environment Deployed

Your PR environment has been deployed!

Environment Details:

Components:

  • Keeperhub Application
  • PostgreSQL Database (isolated instance)
  • LocalStack (SQS emulation)
  • Redis (isolated instance)
  • Schedule Dispatcher (staging image)
  • Block Dispatcher (staging image)
  • Event Tracker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants