Conversation
Hardens the KEEP-612 content scanner against attacker-controlled triggerInput (webhook bodies reach scanValue via request.json()). - scanAndReport is now total: wraps the scan+emit in try/catch so a throw can never escape into executeWorkflow. The call site at executor.workflow.ts:1598 is outside any try/catch, so an unhandled throw there aborted the whole run before any node executed -- violating the documented alert-only/never-blocks contract. - scanValue takes a depth arg and stops at MAX_SCAN_DEPTH (64); a deeply nested payload previously stack-overflowed (RangeError) through the executor. Also caps total hits at MAX_HITS (1000). - database_url pattern is now case-insensitive; env templates use lowercase database_url, which the case-sensitive form missed. Tests: deep-nesting no-throw (5000 levels), depth-cap bounds the walk, lowercase database_url match. Found during KEEP-612 staging verification.
…r inserts
The executor pre-creates the workflow_executions row for every
SQS-dispatched run (schedule / block / event) and downstream dispatch
reuses it, so the app-route attribution code never ran for these.
Staging verification found 255/255 post-deploy rows with NULL
attribution -- i.e. 100% of automated traffic was unattributed, the
exact paths the behavioral detection cares about.
Both executor inserts now:
- set trigger_source via buildAttribution({ source: triggerType })
(schedule|block|event). ip/country/api-key correctly stay null --
SQS dispatch has no inbound client request to attribute.
- are wrapped in withBackstopCapture so a 42501 reject from the 0082
block_executions trigger (e.g. owner deactivated in the
check->insert race) emits security.backstop_execution_blocked
instead of vanishing. This closes the original-incident shape:
a deactivated owner's scheduled workflow hitting the DB backstop
with zero alerting.
Coverage: composes buildAttribution + withBackstopCapture, both unit-
tested. processExecutorMessage has no unit harness (heavy db/SQS deps);
re-verified against the staging DB after deploy (trigger_source
populates on new scheduled rows).
Found during KEEP-612 staging verification.
Two uncovered detection surfaces found during KEEP-612 verification. MCP OAuth deactivated-login (lib/mcp/oauth-auth.ts): a deactivated user with a still-valid 1h MCP OAuth JWT was rejected silently. This is the 4th deactivated-login surface (session, account, api-key were already covered). Now emits security.deactivated_login_attempt with surface=mcp_oauth (Sentry + structured stdout), best-effort. The existing Loki rule keys on the event string, so no infra rule change. KH001 sessions backstop (lib/auth.ts onAPIError): migration 0090's sessions trigger raises SQLSTATE KH001 when a session insert is attempted for a deactivated user. Better Auth owns the session insert, so the central onAPIError handler is the only seam to observe it. The session.create.before hook is the primary gate; KH001 only fires on an app-path bypass (future OAuth provider, direct insert, hook regression) -- a fired backstop = bug or attack, page-worthy. Emits security.backstop_session_blocked (needs a new Loki rule in infra). Tests: oauth-auth-deactivated asserts the mcp_oauth capture fires with the right tags. Found during KEEP-612 staging verification.
…ivergence Resolves the schedule/scheduled split flagged in KEEP-612 verification. The execute route's internal fallback emitted trigger_type 'scheduled' while the executor emits 'schedule', fragmenting the keeperhub_workflow_executions_started_total series -- papered over by the Grafana rule's =~"schedule|scheduled" regex. Converge the route fallback on 'schedule' (the executor's canonical value); the Grafana regex already matches it, and internal callers always send an explicit x-trigger-type header so this fallback is defensive anyway. Also documents that trigger_source intentionally keeps more precision than the metric's trigger_type (it records the auth path: 'internal', 'mcp'), so the trigger_source='internal' vs trigger_type='schedule' divergence for the header-less internal case is by design, not drift. No infra change: the Grafana regex stays tolerant of historical 'scheduled' series. Found during KEEP-612 staging verification.
… fallback) Self-review of the follow-up fixes flagged that the KH001 detection in onAPIError assumed the postgres SQLSTATE sits on the top-level error. Better Auth wraps adapter errors, so the SQLSTATE typically lands on a nested cause -- the original top-level-only check would silently never fire. Extract the detection into lib/security/session-backstop.ts (dependency-free, unit-testable without building the Better Auth config) and harden it: - walk the error.cause chain, bounded to 5 levels (terminates on a self-referential cycle) - fall back to the trigger's fixed RAISE message fragment in case a wrapper normalised the SQLSTATE away entirely 9 unit tests: top-level code, nested cause (1 and 2 deep), message fallback at top and nested, unrelated code/message negatives, null/primitive safety, cycle termination. Still needs one staging confirmation that a real KH001 reject reaches onAPIError in a shape this matches (documented in the review file). Found during KEEP-612 follow-up self-review.
Codex flagged no critical/high in the diff; this resolves the medium + low findings it did raise. content-scanner (low): MAX_HITS is now a strict cap enforced at the push site, and on reaching it a single observable 'scan_truncated' marker is appended (the comment had claimed a marker that never existed). Saturation test added; the duplicate lowercase-database_url test removed. execute route (medium): the trigger-label resolution moves into a pure, exported resolveTriggerLabels(headerTrigger, isInternal) in request-attribution, unit-tested for every branch -- so a value typo can't silently re-fragment the started_total metric again. The route just destructures it. KH001 session backstop (medium): the onAPIError emit wiring is extracted into reportSessionBackstop() in lib/security/session-backstop.ts and unit-tested directly (fires on wrapped KH001, no-ops on unrelated errors, survives a Sentry throw) -- the integration seam Codex noted was previously only predicate-tested. mcp oauth (low): tests now assert the structured console.warn signal and resilience when captureMessage throws, not just the Sentry call. 98 security tests pass; lint + typecheck clean. Found during KEEP-612 follow-up Codex review.
The Loki security rules match pod=~".*keeperhub.*". The k8s-job workflow runner pod was named "workflow-<id>-<ts>" -- no "keeperhub" substring -- so it fell outside the matcher. executeWorkflow (and its security.content_scanner_hit emit) runs inside that pod for web3-write, k8s-job-dispatched workflows, so scanner hits on those executions were emitted into a log stream no alert rule watched. Prefix the job name with "keeperhub-" so the runner pod is captured by the existing rules (and is clearly attributable in logs). The main app and the executor pods already contain "keeperhub"; this closes the last execution mode. namespace is already "keeperhub" (matches namespace=~"keeperhub.*"). Found during KEEP-612 two-PR alignment review.
…d' header Two medium findings from the Codex alignment review. content-scanner: scanNodes/scanTriggerInput now share one hits accumulator, so config + trigger-input scans draw from a single MAX_HITS budget. Previously each capped independently and scanAndReport concatenated, so a saturating execution could emit ~2x the intended payload. Regression test saturates both sources together. request-attribution: resolveTriggerLabels now normalizes an explicit 'scheduled' header to canonical 'schedule'. TriggerType still accepts 'scheduled' for back-compat, so isTriggerType passed it straight through -- re-fragmenting the metric/audit series the header-less path already converged. Regression test covers the legacy header. 112 security tests pass; lint + typecheck clean. Found during KEEP-612 two-PR alignment Codex review.
…dedicated pool The /api/metrics/db scrape issues ~38 aggregations (13 functions via Promise.all, several fanning out further) over workflow_executions and workflow_execution_logs. Up to ~10 stacking concurrently - the app pool's max - is what saturated DB CPU in the 2026-05-29 incident. Route every db-metrics query through a dedicated postgres pool (metricsDb, max:2) so at most two run on the DB at once. Measured end-to-end, the refresh cost is dominated by 5 index-only heavy queries (~1.2s at staging, ~5-7s extrapolated at prod scale); capping at 2 lets them pair up (~3-4s prod) rather than fully serialise, staying under the 12s scrape timeout while bounding concurrency far below the incident regime. Composes with the existing updateDbMetrics cache (in-flight dedup + TTL): the cache gates whether a refresh runs, the pool bounds the queries inside it.
A cache-miss /api/metrics/db scrape blocks on the metrics refresh, which is ~3-4s at prod scale with the max:2 metrics pool. The ServiceMonitor scrapeTimeout was 10s; bump db-metrics to 12s in prod and staging (still well under the 30s interval) for headroom. KEEP-682 adds a refresh-duration metric to confirm and to alert before it ever threatens this timeout.
Two reviews/*.md Codex scratch files were committed by mistake -- they carry absolute local paths (and a local username) and are not something we want in history. Untrack both and add reviews/ to .gitignore so the review-loop scratch output can't be re-committed. Addresses PR review (must-fix).
The outer try/catch keeps the scanner from ever throwing into the executor, but a bare swallow meant a scanner bug would drop detection to zero with no signal -- the exact 'controls run but nobody is watching' gap this layer exists to close. Emit one structured security.content_scanner_error line (self-guarded so the failure signal can't escape either) so a probe that quietly stopped working is itself detectable. Test forces an internal throw and asserts the line. Addresses PR review.
session-backstop.ts walks the cause chain for KH001_SQLSTATE and falls back to matching KH001_MESSAGE_FRAGMENT in the error text; both are owned by migration 0090's RAISE. The existing assertions pinned the literal SQL; this one pins the runtime constants against the migration, so editing the trigger text without updating the constants (or vice versa) fails the test instead of silently rotting the fallback path. Addresses PR review.
Drop the non-author requirement from the metrics-db-review-gate: the `metrics-db-reviewed` label now only needs to be present, and the PR author may apply it. This reverts the second-person enforcement added by KEEP-680 at the repo owner's request, making the label an attestation rather than an enforced review. Removes the timeline-based labeler resolution and the author-vs-labeler check; comments updated to state the weaker policy.
Found by live-firing the backstop in the pr-1403 deployment: the block_executions trigger rejected an executor insert with ERRCODE 42501, but security.backstop_execution_blocked never emitted. Drizzle wraps the driver PostgresError in a DrizzleQueryError, so the 42501 SQLSTATE sits on error.cause.code, not the top-level error.code. The bare top-level check missed every real reject -- bullet 4's headline alert would never have fired in production. Walk the cause chain (bounded, depth 5) exactly like isKh001SessionBackstop already does for the sessions backstop. The existing unit test only passed a top-level .code error, so it never exercised the wrapped shape; add regression tests for a Drizzle-wrapped reject, a reject nested two levels deep, and a wrapped non-42501 (must not capture). 11 backstop-capture tests pass.
…cs-collector-queries refactor: cap metrics collector queries at 2 concurrent via dedicated pool
Set statement_timeout once on every metricsDb connection via the pool's connection options, so each /api/metrics/db aggregation is bounded without per-query wrapping. A runaway statement (e.g. a query-plan regression as workflow_executions grows) is cancelled with 57014 and caught per section, degrading that section's gauges to defaults rather than hanging the scrape past its timeout. Configurable via METRICS_STATEMENT_TIMEOUT_MS (default 8000), kept below the db-metrics scrapeTimeout. This is the slimmed remainder of TECH-6480. The cache (KEEP-669), the dedicated metrics pool (KEEP-679), and the index-only-scan migration (0095) all landed on staging independently and address the 2026-05-29 incident; statement_timeout is the complementary always-on backstop they do not provide, now expressed as a single pool-connection setting on top of the KEEP-679 pool.
…ape-cache perf(metrics): add a statement_timeout backstop on the metrics DB pool
Add gas_used_wei and network columns to workflow_executions, populated at terminal finalize alongside transaction_hashes, plus a batched backfill for historical rows. This lets the /analytics summary and spend-cap reads aggregate a first-class column instead of full-scanning and detoasting the per-step logs JSONB - the path behind the 2026-05-29 RDS saturation. Gas is resolved on error finalizes too, since the current gas total counts gas-bearing steps from runs that later errored. Extraction is centralized in a shared helper so the writer, backfill, and analytics reads cannot drift. No read paths change here; reads switch to the column in a follow-up PR.
…-ux-polish-reapply feat: KEEP-493 re-apply workflow editor UX polish (reapply PR #1370)
Network is a per-step property: each web3 node targets its own chain, so a run can span multiple networks and no single run-level value is correct (the writer was picking an arbitrary MIN). The column was also read by nothing - the per-network breakdown needs step-level data. Keep only gas_used_wei, which is a legitimate network-agnostic run total. Also corrects the schema/migration/writer comments that said gas is populated at the success finalize - it is resolved on error finalizes too - and documents why gas is sourced from the DB extraction rather than the in-memory step tracker (one shared extraction path with the backfill and reads, avoiding drift).
…e runs Add a we.gas_used_wei IS NULL guard to the backfill UPDATE so it only fills empty rows, never overwrites the live finalize writer's value. This keeps it off the hot recent rows the writer owns (no clobber race), makes re-runs cheap, and avoids redundant write churn on the multi-GB table. The dry-run count uses the same predicate so it matches what a live run would write. Also require --yes for a LIVE run against a non-local DB so a staging/prod backfill cannot be started by reflex; dry-runs and local DBs are unguarded.
The writer's resolveGasTotal issues db.select(), which the workflow-logging mock did not provide, so the new code path threw (caught and logged) and tripped an existing 'logSystemError not called' assertion. Mock db.select and add coverage: gas is set on success, on error finalizes too, and null when the run produced no gas-bearing step.
…llowups fix(security): close KEEP-612 detection gaps found in staging verification
…workflow-gas feat: denormalize workflow gas onto workflow_executions
Post-deploy operator runbook (prod gas backfill + verification)Run these AFTER this promotion deploys to prod. This is the write side only; the read-switch (#1434) stays out until the gate below passes. Until then prod reads are unchanged (still JSON), so the column being partially populated is not user-visible. Setupcd <TechOps workspace>
export KUBECONFIG=<path to the prod kubeconfig>
REG=020732533267.dkr.ecr.us-east-2.amazonaws.com/keeperhub-prod # note: us-east-2, not us-east-1
# Derive the deployed SHA from the running prod image (no guessing):
SHA=$(kubectl get deploy keeperhub-common -n keeperhub \
-o jsonpath='{.spec.template.spec.containers[0].image}' | sed 's/.*:app-//')
IMG=$REG:migrator-$SHADB secret: 1. Dry-run (read-only; confirms column exists + total scope)kubectl run kh-backfill-dryrun -n keeperhub --image="$IMG" --restart=Never \
--overrides='{"spec":{"restartPolicy":"Never","containers":[{"name":"backfill","image":"'"$IMG"'","command":["sh","-c","cd /app && pnpm tsx scripts/backfill-workflow-gas.ts --dry-run"],"env":[{"name":"DATABASE_URL","valueFrom":{"secretKeyRef":{"name":"keeperhub-common-db-url","key":"keeperhub-common-db-url"}}}]}]}}'
kubectl logs kh-backfill-dryrun -n keeperhub | tail
kubectl delete pod kh-backfill-dryrun -n keeperhub2. Controlled probe, then full backfill (writes;
|
Promotes
stagingtoprod(35 commits since the last sync).KEEP-683 (wave 1 of 2) - the piece needing operator follow-up
Includes the gas-denormalization write side only: migration
0096(adds nullableworkflow_executions.gas_used_wei), the finalize writer, and the backfill script. The read-switch (#1434) is intentionally NOT in this promotion - it must not go live on prod until the prod backfill has run and the equivalence check passes, or the dashboard would read ~0 gas for historical rows.Migration
0096is the only schema change in this batch and is safe to run inline: nullableADD COLUMN, no default, catalog-only in PG11+ (no rewrite, no@requires-db-prep).Required operator steps AFTER this deploys to prod
--after-id):SUM(gas_used_wei)== old JSON sum). Validated on staging: 0 mismatches.During the window between this deploy and the backfill completing, prod reads are unchanged (still JSON), so there is no user-visible regression - the column is write-only until #1434.
Also included (accumulated staging changes)
KEEP-493 (workflow editor UX), KEEP-612 (detection follow-ups), KEEP-679 (metrics collector serialization + timeouts), TECH-6480 (metrics scrape cache), and a batch of security hardening fixes. These are independent of KEEP-683 and were already merged + deployed on staging.
Verification done on staging
app-7b8ffcdrolled out, migration0096applied via Helm hook.