Skip to content

release: promote staging to prod#1435

Merged
suisuss merged 35 commits into
prodfrom
staging
Jun 2, 2026
Merged

release: promote staging to prod#1435
suisuss merged 35 commits into
prodfrom
staging

Conversation

@suisuss
Copy link
Copy Markdown

@suisuss suisuss commented Jun 2, 2026

Promotes staging to prod (35 commits since the last sync).

KEEP-683 (wave 1 of 2) - the piece needing operator follow-up

Includes the gas-denormalization write side only: migration 0096 (adds nullable workflow_executions.gas_used_wei), the finalize writer, and the backfill script. The read-switch (#1434) is intentionally NOT in this promotion - it must not go live on prod until the prod backfill has run and the equivalence check passes, or the dashboard would read ~0 gas for historical rows.

Migration 0096 is the only schema change in this batch and is safe to run inline: nullable ADD COLUMN, no default, catalog-only in PG11+ (no rewrite, no @requires-db-prep).

Required operator steps AFTER this deploys to prod

  1. Run the backfill against prod (large - millions of rows; resumable via --after-id):
    # via the prod migrator image, DATABASE_URL from the prod db-url secret
    pnpm tsx scripts/backfill-workflow-gas.ts --dry-run
    pnpm tsx scripts/backfill-workflow-gas.ts --yes
    
  2. Run the equivalence gate on prod (per-org + grand-total SUM(gas_used_wei) == old JSON sum). Validated on staging: 0 mismatches.
  3. Only then promote perf: switch analytics gas reads to the denormalized column #1434 (the read-switch) in a later wave.

During the window between this deploy and the backfill completing, prod reads are unchanged (still JSON), so there is no user-visible regression - the column is write-only until #1434.

Also included (accumulated staging changes)

KEEP-493 (workflow editor UX), KEEP-612 (detection follow-ups), KEEP-679 (metrics collector serialization + timeouts), TECH-6480 (metrics scrape cache), and a batch of security hardening fixes. These are independent of KEEP-683 and were already merged + deployed on staging.

Verification done on staging

  • app-7b8ffcd rolled out, migration 0096 applied via Helm hook.
  • Backfill ran clean (43 gas-bearing rows, cursor exhausted).
  • Equivalence gate: 0 per-org mismatches, exact grand-total match.

chong-techops and others added 30 commits May 29, 2026 18:42
Hardens the KEEP-612 content scanner against attacker-controlled
triggerInput (webhook bodies reach scanValue via request.json()).

- scanAndReport is now total: wraps the scan+emit in try/catch so a
  throw can never escape into executeWorkflow. The call site at
  executor.workflow.ts:1598 is outside any try/catch, so an unhandled
  throw there aborted the whole run before any node executed --
  violating the documented alert-only/never-blocks contract.
- scanValue takes a depth arg and stops at MAX_SCAN_DEPTH (64); a
  deeply nested payload previously stack-overflowed (RangeError)
  through the executor. Also caps total hits at MAX_HITS (1000).
- database_url pattern is now case-insensitive; env templates use
  lowercase database_url, which the case-sensitive form missed.

Tests: deep-nesting no-throw (5000 levels), depth-cap bounds the walk,
lowercase database_url match.

Found during KEEP-612 staging verification.
…r inserts

The executor pre-creates the workflow_executions row for every
SQS-dispatched run (schedule / block / event) and downstream dispatch
reuses it, so the app-route attribution code never ran for these.
Staging verification found 255/255 post-deploy rows with NULL
attribution -- i.e. 100% of automated traffic was unattributed, the
exact paths the behavioral detection cares about.

Both executor inserts now:
- set trigger_source via buildAttribution({ source: triggerType })
  (schedule|block|event). ip/country/api-key correctly stay null --
  SQS dispatch has no inbound client request to attribute.
- are wrapped in withBackstopCapture so a 42501 reject from the 0082
  block_executions trigger (e.g. owner deactivated in the
  check->insert race) emits security.backstop_execution_blocked
  instead of vanishing. This closes the original-incident shape:
  a deactivated owner's scheduled workflow hitting the DB backstop
  with zero alerting.

Coverage: composes buildAttribution + withBackstopCapture, both unit-
tested. processExecutorMessage has no unit harness (heavy db/SQS deps);
re-verified against the staging DB after deploy (trigger_source
populates on new scheduled rows).

Found during KEEP-612 staging verification.
Two uncovered detection surfaces found during KEEP-612 verification.

MCP OAuth deactivated-login (lib/mcp/oauth-auth.ts): a deactivated user
with a still-valid 1h MCP OAuth JWT was rejected silently. This is the
4th deactivated-login surface (session, account, api-key were already
covered). Now emits security.deactivated_login_attempt with
surface=mcp_oauth (Sentry + structured stdout), best-effort. The
existing Loki rule keys on the event string, so no infra rule change.

KH001 sessions backstop (lib/auth.ts onAPIError): migration 0090's
sessions trigger raises SQLSTATE KH001 when a session insert is
attempted for a deactivated user. Better Auth owns the session insert,
so the central onAPIError handler is the only seam to observe it. The
session.create.before hook is the primary gate; KH001 only fires on an
app-path bypass (future OAuth provider, direct insert, hook
regression) -- a fired backstop = bug or attack, page-worthy. Emits
security.backstop_session_blocked (needs a new Loki rule in infra).

Tests: oauth-auth-deactivated asserts the mcp_oauth capture fires with
the right tags.

Found during KEEP-612 staging verification.
…ivergence

Resolves the schedule/scheduled split flagged in KEEP-612 verification.

The execute route's internal fallback emitted trigger_type 'scheduled'
while the executor emits 'schedule', fragmenting the
keeperhub_workflow_executions_started_total series -- papered over by
the Grafana rule's =~"schedule|scheduled" regex. Converge the route
fallback on 'schedule' (the executor's canonical value); the Grafana
regex already matches it, and internal callers always send an explicit
x-trigger-type header so this fallback is defensive anyway.

Also documents that trigger_source intentionally keeps more precision
than the metric's trigger_type (it records the auth path: 'internal',
'mcp'), so the trigger_source='internal' vs trigger_type='schedule'
divergence for the header-less internal case is by design, not drift.

No infra change: the Grafana regex stays tolerant of historical
'scheduled' series.

Found during KEEP-612 staging verification.
… fallback)

Self-review of the follow-up fixes flagged that the KH001 detection in
onAPIError assumed the postgres SQLSTATE sits on the top-level error.
Better Auth wraps adapter errors, so the SQLSTATE typically lands on a
nested cause -- the original top-level-only check would silently never
fire.

Extract the detection into lib/security/session-backstop.ts
(dependency-free, unit-testable without building the Better Auth
config) and harden it:
- walk the error.cause chain, bounded to 5 levels (terminates on a
  self-referential cycle)
- fall back to the trigger's fixed RAISE message fragment in case a
  wrapper normalised the SQLSTATE away entirely

9 unit tests: top-level code, nested cause (1 and 2 deep), message
fallback at top and nested, unrelated code/message negatives,
null/primitive safety, cycle termination.

Still needs one staging confirmation that a real KH001 reject reaches
onAPIError in a shape this matches (documented in the review file).

Found during KEEP-612 follow-up self-review.
Codex flagged no critical/high in the diff; this resolves the medium +
low findings it did raise.

content-scanner (low): MAX_HITS is now a strict cap enforced at the
push site, and on reaching it a single observable 'scan_truncated'
marker is appended (the comment had claimed a marker that never
existed). Saturation test added; the duplicate lowercase-database_url
test removed.

execute route (medium): the trigger-label resolution moves into a pure,
exported resolveTriggerLabels(headerTrigger, isInternal) in
request-attribution, unit-tested for every branch -- so a value typo
can't silently re-fragment the started_total metric again. The route
just destructures it.

KH001 session backstop (medium): the onAPIError emit wiring is extracted
into reportSessionBackstop() in lib/security/session-backstop.ts and
unit-tested directly (fires on wrapped KH001, no-ops on unrelated
errors, survives a Sentry throw) -- the integration seam Codex noted
was previously only predicate-tested.

mcp oauth (low): tests now assert the structured console.warn signal and
resilience when captureMessage throws, not just the Sentry call.

98 security tests pass; lint + typecheck clean.

Found during KEEP-612 follow-up Codex review.
The Loki security rules match pod=~".*keeperhub.*". The k8s-job
workflow runner pod was named "workflow-<id>-<ts>" -- no "keeperhub"
substring -- so it fell outside the matcher. executeWorkflow (and its
security.content_scanner_hit emit) runs inside that pod for web3-write,
k8s-job-dispatched workflows, so scanner hits on those executions were
emitted into a log stream no alert rule watched.

Prefix the job name with "keeperhub-" so the runner pod is captured by
the existing rules (and is clearly attributable in logs). The main app
and the executor pods already contain "keeperhub"; this closes the
last execution mode. namespace is already "keeperhub" (matches
namespace=~"keeperhub.*").

Found during KEEP-612 two-PR alignment review.
…d' header

Two medium findings from the Codex alignment review.

content-scanner: scanNodes/scanTriggerInput now share one hits
accumulator, so config + trigger-input scans draw from a single
MAX_HITS budget. Previously each capped independently and scanAndReport
concatenated, so a saturating execution could emit ~2x the intended
payload. Regression test saturates both sources together.

request-attribution: resolveTriggerLabels now normalizes an explicit
'scheduled' header to canonical 'schedule'. TriggerType still accepts
'scheduled' for back-compat, so isTriggerType passed it straight
through -- re-fragmenting the metric/audit series the header-less path
already converged. Regression test covers the legacy header.

112 security tests pass; lint + typecheck clean.

Found during KEEP-612 two-PR alignment Codex review.
…dedicated pool

The /api/metrics/db scrape issues ~38 aggregations (13 functions via
Promise.all, several fanning out further) over workflow_executions and
workflow_execution_logs. Up to ~10 stacking concurrently - the app pool's
max - is what saturated DB CPU in the 2026-05-29 incident.

Route every db-metrics query through a dedicated postgres pool (metricsDb,
max:2) so at most two run on the DB at once. Measured end-to-end, the refresh
cost is dominated by 5 index-only heavy queries (~1.2s at staging, ~5-7s
extrapolated at prod scale); capping at 2 lets them pair up (~3-4s prod)
rather than fully serialise, staying under the 12s scrape timeout while
bounding concurrency far below the incident regime. Composes with the
existing updateDbMetrics cache (in-flight dedup + TTL): the cache gates
whether a refresh runs, the pool bounds the queries inside it.
A cache-miss /api/metrics/db scrape blocks on the metrics refresh, which is
~3-4s at prod scale with the max:2 metrics pool. The ServiceMonitor
scrapeTimeout was 10s; bump db-metrics to 12s in prod and staging (still well
under the 30s interval) for headroom. KEEP-682 adds a refresh-duration metric
to confirm and to alert before it ever threatens this timeout.
Two reviews/*.md Codex scratch files were committed by mistake -- they
carry absolute local paths (and a local username) and are not something
we want in history. Untrack both and add reviews/ to .gitignore so the
review-loop scratch output can't be re-committed.

Addresses PR review (must-fix).
The outer try/catch keeps the scanner from ever throwing into the
executor, but a bare swallow meant a scanner bug would drop detection to
zero with no signal -- the exact 'controls run but nobody is watching'
gap this layer exists to close. Emit one structured
security.content_scanner_error line (self-guarded so the failure signal
can't escape either) so a probe that quietly stopped working is itself
detectable. Test forces an internal throw and asserts the line.

Addresses PR review.
session-backstop.ts walks the cause chain for KH001_SQLSTATE and falls
back to matching KH001_MESSAGE_FRAGMENT in the error text; both are owned
by migration 0090's RAISE. The existing assertions pinned the literal
SQL; this one pins the runtime constants against the migration, so
editing the trigger text without updating the constants (or vice versa)
fails the test instead of silently rotting the fallback path.

Addresses PR review.
Drop the non-author requirement from the metrics-db-review-gate: the
`metrics-db-reviewed` label now only needs to be present, and the PR author
may apply it. This reverts the second-person enforcement added by KEEP-680 at
the repo owner's request, making the label an attestation rather than an
enforced review. Removes the timeline-based labeler resolution and the
author-vs-labeler check; comments updated to state the weaker policy.
Found by live-firing the backstop in the pr-1403 deployment: the
block_executions trigger rejected an executor insert with ERRCODE 42501,
but security.backstop_execution_blocked never emitted. Drizzle wraps the
driver PostgresError in a DrizzleQueryError, so the 42501 SQLSTATE sits
on error.cause.code, not the top-level error.code. The bare top-level
check missed every real reject -- bullet 4's headline alert would never
have fired in production.

Walk the cause chain (bounded, depth 5) exactly like isKh001SessionBackstop
already does for the sessions backstop. The existing unit test only passed
a top-level .code error, so it never exercised the wrapped shape; add
regression tests for a Drizzle-wrapped reject, a reject nested two levels
deep, and a wrapped non-42501 (must not capture).

11 backstop-capture tests pass.
…cs-collector-queries

refactor: cap metrics collector queries at 2 concurrent via dedicated pool
Set statement_timeout once on every metricsDb connection via the pool's
connection options, so each /api/metrics/db aggregation is bounded without
per-query wrapping. A runaway statement (e.g. a query-plan regression as
workflow_executions grows) is cancelled with 57014 and caught per section,
degrading that section's gauges to defaults rather than hanging the scrape
past its timeout. Configurable via METRICS_STATEMENT_TIMEOUT_MS (default
8000), kept below the db-metrics scrapeTimeout.

This is the slimmed remainder of TECH-6480. The cache (KEEP-669), the
dedicated metrics pool (KEEP-679), and the index-only-scan migration
(0095) all landed on staging independently and address the 2026-05-29
incident; statement_timeout is the complementary always-on backstop they
do not provide, now expressed as a single pool-connection setting on top
of the KEEP-679 pool.
…ape-cache

perf(metrics): add a statement_timeout backstop on the metrics DB pool
Add gas_used_wei and network columns to workflow_executions, populated at
terminal finalize alongside transaction_hashes, plus a batched backfill for
historical rows. This lets the /analytics summary and spend-cap reads
aggregate a first-class column instead of full-scanning and detoasting the
per-step logs JSONB - the path behind the 2026-05-29 RDS saturation.

Gas is resolved on error finalizes too, since the current gas total counts
gas-bearing steps from runs that later errored. Extraction is centralized in
a shared helper so the writer, backfill, and analytics reads cannot drift.

No read paths change here; reads switch to the column in a follow-up PR.
…-ux-polish-reapply

feat: KEEP-493 re-apply workflow editor UX polish (reapply PR #1370)
suisuss and others added 5 commits June 2, 2026 14:03
Network is a per-step property: each web3 node targets its own chain, so a run
can span multiple networks and no single run-level value is correct (the writer
was picking an arbitrary MIN). The column was also read by nothing - the
per-network breakdown needs step-level data. Keep only gas_used_wei, which is a
legitimate network-agnostic run total.

Also corrects the schema/migration/writer comments that said gas is populated
at the success finalize - it is resolved on error finalizes too - and documents
why gas is sourced from the DB extraction rather than the in-memory step tracker
(one shared extraction path with the backfill and reads, avoiding drift).
…e runs

Add a we.gas_used_wei IS NULL guard to the backfill UPDATE so it only fills
empty rows, never overwrites the live finalize writer's value. This keeps it off
the hot recent rows the writer owns (no clobber race), makes re-runs cheap, and
avoids redundant write churn on the multi-GB table. The dry-run count uses the
same predicate so it matches what a live run would write.

Also require --yes for a LIVE run against a non-local DB so a staging/prod
backfill cannot be started by reflex; dry-runs and local DBs are unguarded.
The writer's resolveGasTotal issues db.select(), which the workflow-logging mock
did not provide, so the new code path threw (caught and logged) and tripped an
existing 'logSystemError not called' assertion. Mock db.select and add coverage:
gas is set on success, on error finalizes too, and null when the run produced no
gas-bearing step.
…llowups

fix(security): close KEEP-612 detection gaps found in staging verification
…workflow-gas

feat: denormalize workflow gas onto workflow_executions
@suisuss suisuss added the metrics-db-reviewed Reviewer sign-off: metrics aggregate queries optimised + tables indexed (KEEP-680) label Jun 2, 2026
@suisuss
Copy link
Copy Markdown
Author

suisuss commented Jun 2, 2026

Post-deploy operator runbook (prod gas backfill + verification)

Run these AFTER this promotion deploys to prod. This is the write side only; the read-switch (#1434) stays out until the gate below passes. Until then prod reads are unchanged (still JSON), so the column being partially populated is not user-visible.

Setup

cd <TechOps workspace>
export KUBECONFIG=<path to the prod kubeconfig>
REG=020732533267.dkr.ecr.us-east-2.amazonaws.com/keeperhub-prod   # note: us-east-2, not us-east-1
# Derive the deployed SHA from the running prod image (no guessing):
SHA=$(kubectl get deploy keeperhub-common -n keeperhub \
  -o jsonpath='{.spec.template.spec.containers[0].image}' | sed 's/.*:app-//')
IMG=$REG:migrator-$SHA

DB secret: keeperhub-common-db-url (key keeperhub-common-db-url) -> DATABASE_URL. Namespace: keeperhub.

1. Dry-run (read-only; confirms column exists + total scope)

kubectl run kh-backfill-dryrun -n keeperhub --image="$IMG" --restart=Never \
  --overrides='{"spec":{"restartPolicy":"Never","containers":[{"name":"backfill","image":"'"$IMG"'","command":["sh","-c","cd /app && pnpm tsx scripts/backfill-workflow-gas.ts --dry-run"],"env":[{"name":"DATABASE_URL","valueFrom":{"secretKeyRef":{"name":"keeperhub-common-db-url","key":"keeperhub-common-db-url"}}}]}]}}'
kubectl logs kh-backfill-dryrun -n keeperhub | tail
kubectl delete pod kh-backfill-dryrun -n keeperhub

2. Controlled probe, then full backfill (writes; --yes required on non-local DB)

# 10-batch probe first to confirm writes + measure per-batch time
... scripts/backfill-workflow-gas.ts --yes --max-batches 10
# full run
... scripts/backfill-workflow-gas.ts --yes
# resume after interruption (cursor is logged each batch):
... scripts/backfill-workflow-gas.ts --yes --after-id <last_cursor>

Same pod pattern as step 1 with the command swapped. Resume is idempotent: the gas_used_wei IS NULL guard means a re-run never re-touches populated rows and never fights the live finalize writer.

3. Equivalence gate (read-only) - run only after the backfill completes

kubectl run kh-psql-verify -n keeperhub --image=postgres:16-alpine --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"psql","image":"postgres:16-alpine","command":["sleep","900"],"env":[{"name":"DATABASE_URL","valueFrom":{"secretKeyRef":{"name":"keeperhub-common-db-url","key":"keeperhub-common-db-url"}}}]}]}}'
kubectl wait --for=condition=Ready pod/kh-psql-verify -n keeperhub --timeout=60s
WITH col AS (
  SELECT w.organization_id AS org, COALESCE(SUM(we.gas_used_wei),0) AS col_total
  FROM workflow_executions we JOIN workflows w ON we.workflow_id = w.id
  GROUP BY w.organization_id
), js AS (
  SELECT w.organization_id AS org,
    COALESCE(SUM(CAST(CASE WHEN jsonb_typeof(wel.output)='string'
      THEN (wel.output #>> '{}')::jsonb->>'gasUsed' ELSE wel.output->>'gasUsed' END AS NUMERIC)),0) AS json_total
  FROM workflow_execution_logs wel
  JOIN workflow_executions we ON wel.execution_id = we.id
  JOIN workflows w ON we.workflow_id = w.id
  WHERE (CASE WHEN jsonb_typeof(wel.output)='string'
      THEN (wel.output #>> '{}')::jsonb->>'gasUsed' ELSE wel.output->>'gasUsed' END) IS NOT NULL
  GROUP BY w.organization_id
)
SELECT COALESCE(col.org,js.org) AS org, COALESCE(col.col_total,0) col_total,
       COALESCE(js.json_total,0) json_total,
       COALESCE(col.col_total,0)-COALESCE(js.json_total,0) AS diff
FROM col FULL OUTER JOIN js ON col.org=js.org
WHERE COALESCE(col.col_total,0) <> COALESCE(js.json_total,0)
ORDER BY abs(COALESCE(col.col_total,0)-COALESCE(js.json_total,0)) DESC;

Expect 0 rows. Validated on staging: 0 per-org mismatches, exact grand-total match.


Two notes that matter

  1. The verification query reproduces the heavy scan this ticket exists to kill. The js (JSON) side is the same unindexed per-row JSONB aggregation that touched ~33 GB on Sky and drove the 2026-05-29 incident. Running the gate on prod briefly reproduces that load, so run it off-peak. It is a one-time verification and the last time we ever pay that cost - once perf: switch analytics gas reads to the denormalized column #1434 ships, nothing reads gas that way again.

  2. Run the gate only after the backfill has fully completed (cursor exhausted). If run mid-backfill it will report false mismatches for rows not yet filled (column < JSON), which looks like a correctness failure but is just incomplete progress. Confirm the full run logged done ... last_cursor=<end> with the final batch smaller than the batch size before trusting the gate. Do not promote perf: switch analytics gas reads to the denormalized column #1434 until this passes.

@suisuss suisuss merged commit 3021283 into prod Jun 2, 2026
37 of 38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

metrics-db-reviewed Reviewer sign-off: metrics aggregate queries optimised + tables indexed (KEEP-680)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants