release: promote staging to prod by suisuss · Pull Request #1435 · KeeperHub/keeperhub

suisuss · 2026-06-02T04:58:23Z

Promotes staging to prod (35 commits since the last sync).

KEEP-683 (wave 1 of 2) - the piece needing operator follow-up

Includes the gas-denormalization write side only: migration 0096 (adds nullable workflow_executions.gas_used_wei), the finalize writer, and the backfill script. The read-switch (#1434) is intentionally NOT in this promotion - it must not go live on prod until the prod backfill has run and the equivalence check passes, or the dashboard would read ~0 gas for historical rows.

Migration 0096 is the only schema change in this batch and is safe to run inline: nullable ADD COLUMN, no default, catalog-only in PG11+ (no rewrite, no @requires-db-prep).

Required operator steps AFTER this deploys to prod

Run the backfill against prod (large - millions of rows; resumable via --after-id):

# via the prod migrator image, DATABASE_URL from the prod db-url secret
pnpm tsx scripts/backfill-workflow-gas.ts --dry-run
pnpm tsx scripts/backfill-workflow-gas.ts --yes

Run the equivalence gate on prod (per-org + grand-total SUM(gas_used_wei) == old JSON sum). Validated on staging: 0 mismatches.
Only then promote perf: switch analytics gas reads to the denormalized column #1434 (the read-switch) in a later wave.

During the window between this deploy and the backfill completing, prod reads are unchanged (still JSON), so there is no user-visible regression - the column is write-only until #1434.

Also included (accumulated staging changes)

KEEP-493 (workflow editor UX), KEEP-612 (detection follow-ups), KEEP-679 (metrics collector serialization + timeouts), TECH-6480 (metrics scrape cache), and a batch of security hardening fixes. These are independent of KEEP-683 and were already merged + deployed on staging.

Verification done on staging

app-7b8ffcd rolled out, migration 0096 applied via Helm hook.
Backfill ran clean (43 gas-bearing rows, cursor exhausted).
Equivalence gate: 0 per-org mismatches, exact grand-total match.

Hardens the KEEP-612 content scanner against attacker-controlled triggerInput (webhook bodies reach scanValue via request.json()). - scanAndReport is now total: wraps the scan+emit in try/catch so a throw can never escape into executeWorkflow. The call site at executor.workflow.ts:1598 is outside any try/catch, so an unhandled throw there aborted the whole run before any node executed -- violating the documented alert-only/never-blocks contract. - scanValue takes a depth arg and stops at MAX_SCAN_DEPTH (64); a deeply nested payload previously stack-overflowed (RangeError) through the executor. Also caps total hits at MAX_HITS (1000). - database_url pattern is now case-insensitive; env templates use lowercase database_url, which the case-sensitive form missed. Tests: deep-nesting no-throw (5000 levels), depth-cap bounds the walk, lowercase database_url match. Found during KEEP-612 staging verification.

…r inserts The executor pre-creates the workflow_executions row for every SQS-dispatched run (schedule / block / event) and downstream dispatch reuses it, so the app-route attribution code never ran for these. Staging verification found 255/255 post-deploy rows with NULL attribution -- i.e. 100% of automated traffic was unattributed, the exact paths the behavioral detection cares about. Both executor inserts now: - set trigger_source via buildAttribution({ source: triggerType }) (schedule|block|event). ip/country/api-key correctly stay null -- SQS dispatch has no inbound client request to attribute. - are wrapped in withBackstopCapture so a 42501 reject from the 0082 block_executions trigger (e.g. owner deactivated in the check->insert race) emits security.backstop_execution_blocked instead of vanishing. This closes the original-incident shape: a deactivated owner's scheduled workflow hitting the DB backstop with zero alerting. Coverage: composes buildAttribution + withBackstopCapture, both unit- tested. processExecutorMessage has no unit harness (heavy db/SQS deps); re-verified against the staging DB after deploy (trigger_source populates on new scheduled rows). Found during KEEP-612 staging verification.

Two uncovered detection surfaces found during KEEP-612 verification. MCP OAuth deactivated-login (lib/mcp/oauth-auth.ts): a deactivated user with a still-valid 1h MCP OAuth JWT was rejected silently. This is the 4th deactivated-login surface (session, account, api-key were already covered). Now emits security.deactivated_login_attempt with surface=mcp_oauth (Sentry + structured stdout), best-effort. The existing Loki rule keys on the event string, so no infra rule change. KH001 sessions backstop (lib/auth.ts onAPIError): migration 0090's sessions trigger raises SQLSTATE KH001 when a session insert is attempted for a deactivated user. Better Auth owns the session insert, so the central onAPIError handler is the only seam to observe it. The session.create.before hook is the primary gate; KH001 only fires on an app-path bypass (future OAuth provider, direct insert, hook regression) -- a fired backstop = bug or attack, page-worthy. Emits security.backstop_session_blocked (needs a new Loki rule in infra). Tests: oauth-auth-deactivated asserts the mcp_oauth capture fires with the right tags. Found during KEEP-612 staging verification.

…ivergence Resolves the schedule/scheduled split flagged in KEEP-612 verification. The execute route's internal fallback emitted trigger_type 'scheduled' while the executor emits 'schedule', fragmenting the keeperhub_workflow_executions_started_total series -- papered over by the Grafana rule's =~"schedule|scheduled" regex. Converge the route fallback on 'schedule' (the executor's canonical value); the Grafana regex already matches it, and internal callers always send an explicit x-trigger-type header so this fallback is defensive anyway. Also documents that trigger_source intentionally keeps more precision than the metric's trigger_type (it records the auth path: 'internal', 'mcp'), so the trigger_source='internal' vs trigger_type='schedule' divergence for the header-less internal case is by design, not drift. No infra change: the Grafana regex stays tolerant of historical 'scheduled' series. Found during KEEP-612 staging verification.

… fallback) Self-review of the follow-up fixes flagged that the KH001 detection in onAPIError assumed the postgres SQLSTATE sits on the top-level error. Better Auth wraps adapter errors, so the SQLSTATE typically lands on a nested cause -- the original top-level-only check would silently never fire. Extract the detection into lib/security/session-backstop.ts (dependency-free, unit-testable without building the Better Auth config) and harden it: - walk the error.cause chain, bounded to 5 levels (terminates on a self-referential cycle) - fall back to the trigger's fixed RAISE message fragment in case a wrapper normalised the SQLSTATE away entirely 9 unit tests: top-level code, nested cause (1 and 2 deep), message fallback at top and nested, unrelated code/message negatives, null/primitive safety, cycle termination. Still needs one staging confirmation that a real KH001 reject reaches onAPIError in a shape this matches (documented in the review file). Found during KEEP-612 follow-up self-review.

Codex flagged no critical/high in the diff; this resolves the medium + low findings it did raise. content-scanner (low): MAX_HITS is now a strict cap enforced at the push site, and on reaching it a single observable 'scan_truncated' marker is appended (the comment had claimed a marker that never existed). Saturation test added; the duplicate lowercase-database_url test removed. execute route (medium): the trigger-label resolution moves into a pure, exported resolveTriggerLabels(headerTrigger, isInternal) in request-attribution, unit-tested for every branch -- so a value typo can't silently re-fragment the started_total metric again. The route just destructures it. KH001 session backstop (medium): the onAPIError emit wiring is extracted into reportSessionBackstop() in lib/security/session-backstop.ts and unit-tested directly (fires on wrapped KH001, no-ops on unrelated errors, survives a Sentry throw) -- the integration seam Codex noted was previously only predicate-tested. mcp oauth (low): tests now assert the structured console.warn signal and resilience when captureMessage throws, not just the Sentry call. 98 security tests pass; lint + typecheck clean. Found during KEEP-612 follow-up Codex review.

The Loki security rules match pod=~".*keeperhub.*". The k8s-job workflow runner pod was named "workflow-<id>-<ts>" -- no "keeperhub" substring -- so it fell outside the matcher. executeWorkflow (and its security.content_scanner_hit emit) runs inside that pod for web3-write, k8s-job-dispatched workflows, so scanner hits on those executions were emitted into a log stream no alert rule watched. Prefix the job name with "keeperhub-" so the runner pod is captured by the existing rules (and is clearly attributable in logs). The main app and the executor pods already contain "keeperhub"; this closes the last execution mode. namespace is already "keeperhub" (matches namespace=~"keeperhub.*"). Found during KEEP-612 two-PR alignment review.

…d' header Two medium findings from the Codex alignment review. content-scanner: scanNodes/scanTriggerInput now share one hits accumulator, so config + trigger-input scans draw from a single MAX_HITS budget. Previously each capped independently and scanAndReport concatenated, so a saturating execution could emit ~2x the intended payload. Regression test saturates both sources together. request-attribution: resolveTriggerLabels now normalizes an explicit 'scheduled' header to canonical 'schedule'. TriggerType still accepts 'scheduled' for back-compat, so isTriggerType passed it straight through -- re-fragmenting the metric/audit series the header-less path already converged. Regression test covers the legacy header. 112 security tests pass; lint + typecheck clean. Found during KEEP-612 two-PR alignment Codex review.

…isting

…o-flip

…dedicated pool The /api/metrics/db scrape issues ~38 aggregations (13 functions via Promise.all, several fanning out further) over workflow_executions and workflow_execution_logs. Up to ~10 stacking concurrently - the app pool's max - is what saturated DB CPU in the 2026-05-29 incident. Route every db-metrics query through a dedicated postgres pool (metricsDb, max:2) so at most two run on the DB at once. Measured end-to-end, the refresh cost is dominated by 5 index-only heavy queries (~1.2s at staging, ~5-7s extrapolated at prod scale); capping at 2 lets them pair up (~3-4s prod) rather than fully serialise, staying under the 12s scrape timeout while bounding concurrency far below the incident regime. Composes with the existing updateDbMetrics cache (in-flight dedup + TTL): the cache gates whether a refresh runs, the pool bounds the queries inside it.

A cache-miss /api/metrics/db scrape blocks on the metrics refresh, which is ~3-4s at prod scale with the max:2 metrics pool. The ServiceMonitor scrapeTimeout was 10s; bump db-metrics to 12s in prod and staging (still well under the 30s interval) for headroom. KEEP-682 adds a refresh-duration metric to confirm and to alert before it ever threatens this timeout.

Two reviews/*.md Codex scratch files were committed by mistake -- they carry absolute local paths (and a local username) and are not something we want in history. Untrack both and add reviews/ to .gitignore so the review-loop scratch output can't be re-committed. Addresses PR review (must-fix).

The outer try/catch keeps the scanner from ever throwing into the executor, but a bare swallow meant a scanner bug would drop detection to zero with no signal -- the exact 'controls run but nobody is watching' gap this layer exists to close. Emit one structured security.content_scanner_error line (self-guarded so the failure signal can't escape either) so a probe that quietly stopped working is itself detectable. Test forces an internal throw and asserts the line. Addresses PR review.

session-backstop.ts walks the cause chain for KH001_SQLSTATE and falls back to matching KH001_MESSAGE_FRAGMENT in the error text; both are owned by migration 0090's RAISE. The existing assertions pinned the literal SQL; this one pins the runtime constants against the migration, so editing the trigger text without updating the constants (or vice versa) fails the test instead of silently rotting the fallback path. Addresses PR review.

Drop the non-author requirement from the metrics-db-review-gate: the `metrics-db-reviewed` label now only needs to be present, and the PR author may apply it. This reverts the second-person enforcement added by KEEP-680 at the repo owner's request, making the label an attestation rather than an enforced review. Removes the timeline-based labeler resolution and the author-vs-labeler check; comments updated to state the weaker policy.

Found by live-firing the backstop in the pr-1403 deployment: the block_executions trigger rejected an executor insert with ERRCODE 42501, but security.backstop_execution_blocked never emitted. Drizzle wraps the driver PostgresError in a DrizzleQueryError, so the 42501 SQLSTATE sits on error.cause.code, not the top-level error.code. The bare top-level check missed every real reject -- bullet 4's headline alert would never have fired in production. Walk the cause chain (bounded, depth 5) exactly like isKh001SessionBackstop already does for the sessions backstop. The existing unit test only passed a top-level .code error, so it never exercised the wrapped shape; add regression tests for a Drizzle-wrapped reject, a reject nested two levels deep, and a wrapped non-42501 (must not capture). 11 backstop-capture tests pass.

…cs-collector-queries refactor: cap metrics collector queries at 2 concurrent via dedicated pool

Set statement_timeout once on every metricsDb connection via the pool's connection options, so each /api/metrics/db aggregation is bounded without per-query wrapping. A runaway statement (e.g. a query-plan regression as workflow_executions grows) is cancelled with 57014 and caught per section, degrading that section's gauges to defaults rather than hanging the scrape past its timeout. Configurable via METRICS_STATEMENT_TIMEOUT_MS (default 8000), kept below the db-metrics scrapeTimeout. This is the slimmed remainder of TECH-6480. The cache (KEEP-669), the dedicated metrics pool (KEEP-679), and the index-only-scan migration (0095) all landed on staging independently and address the 2026-05-29 incident; statement_timeout is the complementary always-on backstop they do not provide, now expressed as a single pool-connection setting on top of the KEEP-679 pool.

…er literally

…ape-cache perf(metrics): add a statement_timeout backstop on the metrics DB pool

Add gas_used_wei and network columns to workflow_executions, populated at terminal finalize alongside transaction_hashes, plus a batched backfill for historical rows. This lets the /analytics summary and spend-cap reads aggregate a first-class column instead of full-scanning and detoasting the per-step logs JSONB - the path behind the 2026-05-29 RDS saturation. Gas is resolved on error finalizes too, since the current gas total counts gas-bearing steps from runs that later errored. Extraction is centralized in a shared helper so the writer, backfill, and analytics reads cannot drift. No read paths change here; reads switch to the column in a follow-up PR.

…-ux-polish-reapply feat: KEEP-493 re-apply workflow editor UX polish (reapply PR #1370)

Network is a per-step property: each web3 node targets its own chain, so a run can span multiple networks and no single run-level value is correct (the writer was picking an arbitrary MIN). The column was also read by nothing - the per-network breakdown needs step-level data. Keep only gas_used_wei, which is a legitimate network-agnostic run total. Also corrects the schema/migration/writer comments that said gas is populated at the success finalize - it is resolved on error finalizes too - and documents why gas is sourced from the DB extraction rather than the in-memory step tracker (one shared extraction path with the backfill and reads, avoiding drift).

…e runs Add a we.gas_used_wei IS NULL guard to the backfill UPDATE so it only fills empty rows, never overwrites the live finalize writer's value. This keeps it off the hot recent rows the writer owns (no clobber race), makes re-runs cheap, and avoids redundant write churn on the multi-GB table. The dry-run count uses the same predicate so it matches what a live run would write. Also require --yes for a LIVE run against a non-local DB so a staging/prod backfill cannot be started by reflex; dry-runs and local DBs are unguarded.

The writer's resolveGasTotal issues db.select(), which the workflow-logging mock did not provide, so the new code path threw (caught and logged) and tripped an existing 'logSystemError not called' assertion. Mock db.select and add coverage: gas is set on success, on error finalizes too, and null when the run produced no gas-bearing step.

…llowups fix(security): close KEEP-612 detection gaps found in staging verification

…workflow-gas feat: denormalize workflow gas onto workflow_executions

suisuss · 2026-06-02T05:19:11Z

Post-deploy operator runbook (prod gas backfill + verification)

Run these AFTER this promotion deploys to prod. This is the write side only; the read-switch (#1434) stays out until the gate below passes. Until then prod reads are unchanged (still JSON), so the column being partially populated is not user-visible.

Setup

cd <TechOps workspace>
export KUBECONFIG=<path to the prod kubeconfig>
REG=020732533267.dkr.ecr.us-east-2.amazonaws.com/keeperhub-prod   # note: us-east-2, not us-east-1
# Derive the deployed SHA from the running prod image (no guessing):
SHA=$(kubectl get deploy keeperhub-common -n keeperhub \
  -o jsonpath='{.spec.template.spec.containers[0].image}' | sed 's/.*:app-//')
IMG=$REG:migrator-$SHA

DB secret: keeperhub-common-db-url (key keeperhub-common-db-url) -> DATABASE_URL. Namespace: keeperhub.

1. Dry-run (read-only; confirms column exists + total scope)

kubectl run kh-backfill-dryrun -n keeperhub --image="$IMG" --restart=Never \
  --overrides='{"spec":{"restartPolicy":"Never","containers":[{"name":"backfill","image":"'"$IMG"'","command":["sh","-c","cd /app && pnpm tsx scripts/backfill-workflow-gas.ts --dry-run"],"env":[{"name":"DATABASE_URL","valueFrom":{"secretKeyRef":{"name":"keeperhub-common-db-url","key":"keeperhub-common-db-url"}}}]}]}}'
kubectl logs kh-backfill-dryrun -n keeperhub | tail
kubectl delete pod kh-backfill-dryrun -n keeperhub

2. Controlled probe, then full backfill (writes; `--yes` required on non-local DB)

# 10-batch probe first to confirm writes + measure per-batch time
... scripts/backfill-workflow-gas.ts --yes --max-batches 10
# full run
... scripts/backfill-workflow-gas.ts --yes
# resume after interruption (cursor is logged each batch):
... scripts/backfill-workflow-gas.ts --yes --after-id <last_cursor>

Same pod pattern as step 1 with the command swapped. Resume is idempotent: the gas_used_wei IS NULL guard means a re-run never re-touches populated rows and never fights the live finalize writer.

3. Equivalence gate (read-only) - run only after the backfill completes

kubectl run kh-psql-verify -n keeperhub --image=postgres:16-alpine --restart=Never \
  --overrides='{"spec":{"containers":[{"name":"psql","image":"postgres:16-alpine","command":["sleep","900"],"env":[{"name":"DATABASE_URL","valueFrom":{"secretKeyRef":{"name":"keeperhub-common-db-url","key":"keeperhub-common-db-url"}}}]}]}}'
kubectl wait --for=condition=Ready pod/kh-psql-verify -n keeperhub --timeout=60s

WITH col AS (
  SELECT w.organization_id AS org, COALESCE(SUM(we.gas_used_wei),0) AS col_total
  FROM workflow_executions we JOIN workflows w ON we.workflow_id = w.id
  GROUP BY w.organization_id
), js AS (
  SELECT w.organization_id AS org,
    COALESCE(SUM(CAST(CASE WHEN jsonb_typeof(wel.output)='string'
      THEN (wel.output #>> '{}')::jsonb->>'gasUsed' ELSE wel.output->>'gasUsed' END AS NUMERIC)),0) AS json_total
  FROM workflow_execution_logs wel
  JOIN workflow_executions we ON wel.execution_id = we.id
  JOIN workflows w ON we.workflow_id = w.id
  WHERE (CASE WHEN jsonb_typeof(wel.output)='string'
      THEN (wel.output #>> '{}')::jsonb->>'gasUsed' ELSE wel.output->>'gasUsed' END) IS NOT NULL
  GROUP BY w.organization_id
)
SELECT COALESCE(col.org,js.org) AS org, COALESCE(col.col_total,0) col_total,
       COALESCE(js.json_total,0) json_total,
       COALESCE(col.col_total,0)-COALESCE(js.json_total,0) AS diff
FROM col FULL OUTER JOIN js ON col.org=js.org
WHERE COALESCE(col.col_total,0) <> COALESCE(js.json_total,0)
ORDER BY abs(COALESCE(col.col_total,0)-COALESCE(js.json_total,0)) DESC;

Expect 0 rows. Validated on staging: 0 per-org mismatches, exact grand-total match.

Two notes that matter

The verification query reproduces the heavy scan this ticket exists to kill. The js (JSON) side is the same unindexed per-row JSONB aggregation that touched ~33 GB on Sky and drove the 2026-05-29 incident. Running the gate on prod briefly reproduces that load, so run it off-peak. It is a one-time verification and the last time we ever pay that cost - once perf: switch analytics gas reads to the denormalized column #1434 ships, nothing reads gas that way again.
Run the gate only after the backfill has fully completed (cursor exhausted). If run mid-backfill it will report false mismatches for rows not yet filled (column < JSON), which looks like a correctness failure but is just incomplete progress. Confirm the full run logged done ... last_cursor=<end> with the final batch smaller than the batch size before trusting the gate. Do not promote perf: switch analytics gas reads to the denormalized column #1434 until this passes.

chong-techops and others added 30 commits May 29, 2026 18:42

feat: KEEP-493 add editable workflow name field to editor toolbar

59c4b3a

feat: KEEP-493 add status tooltips to workflow run panel

b87dcda

feat: KEEP-493 add fund-this-wallet faucet panel on wallet creation

45fbac8

fix: KEEP-493 auto-derive workflowType from write-action content on l…

523be05

…isting

fix: KEEP-493 extend workflowType auto-flip into workflows PATCH route

b62ca27

refactor: KEEP-493 extract deriveWorkflowType helper, deduplicate aut…

fea948b

…o-flip

Merge pull request #1424 from KeeperHub/feat/KEEP-679-serialise-metri…

461359e

…cs-collector-queries refactor: cap metrics collector queries at 2 concurrent via dedicated pool

chore: KEEP-493 remove fund-this-wallet faucet panel from wallet section

a2507b5

refactor: KEEP-493 make workflow toolbar name read-only

382c1c4

fix: KEEP-493 hide web3 wallets from connections settings panel

27492e9

fix: KEEP-493 disable mono font ligatures so truncated addresses rend…

7986fa2

…er literally

Merge pull request #1393 from KeeperHub/feature/TECH-6480-metrics-scr…

ce34a63

…ape-cache perf(metrics): add a statement_timeout backstop on the metrics DB pool

Merge pull request #1423 from KeeperHub/feat/KEEP-493-workflow-editor…

1d9499a

…-ux-polish-reapply feat: KEEP-493 re-apply workflow editor UX polish (reapply PR #1370)

suisuss and others added 5 commits June 2, 2026 14:03

Merge pull request #1403 from KeeperHub/feature/KEEP-612-detection-fo…

436d020

…llowups fix(security): close KEEP-612 detection gaps found in staging verification

Merge pull request #1433 from KeeperHub/feature/KEEP-683-denormalize-…

7b8ffcd

…workflow-gas feat: denormalize workflow gas onto workflow_executions

suisuss temporarily deployed to staging June 2, 2026 04:58 — with GitHub Actions Inactive

suisuss added the metrics-db-reviewed Reviewer sign-off: metrics aggregate queries optimised + tables indexed (KEEP-680) label Jun 2, 2026

suisuss temporarily deployed to staging June 2, 2026 05:01 — with GitHub Actions Inactive

suisuss merged commit 3021283 into prod Jun 2, 2026
37 of 38 checks passed

suisuss mentioned this pull request Jun 2, 2026

release: promote staging to prod #1436

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release: promote staging to prod#1435

release: promote staging to prod#1435
suisuss merged 35 commits into
prodfrom
staging

suisuss commented Jun 2, 2026

Uh oh!

suisuss commented Jun 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

suisuss commented Jun 2, 2026

KEEP-683 (wave 1 of 2) - the piece needing operator follow-up

Required operator steps AFTER this deploys to prod

Also included (accumulated staging changes)

Verification done on staging

Uh oh!

suisuss commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Post-deploy operator runbook (prod gas backfill + verification)

Setup

1. Dry-run (read-only; confirms column exists + total scope)

2. Controlled probe, then full backfill (writes; --yes required on non-local DB)

3. Equivalence gate (read-only) - run only after the backfill completes

Two notes that matter

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

suisuss commented Jun 2, 2026 •

edited

Loading

2. Controlled probe, then full backfill (writes; `--yes` required on non-local DB)