feat(metrics): dedicated keeperhub-metrics-collector service (TECH-6484) by chong-techops · Pull Request #1439 · KeeperHub/keeperhub

chong-techops · 2026-06-02T07:40:45Z

Infra note: the keeperhub-metrics-collector-{staging,prod} ECR repos are already created/applied (techops-services/infrastructure#241 records the terraform on main). No blocker for this PR — the deploy workflow can push to the repos once merged.

What

Adds keeperhub-metrics-collector (TECH-6484): a dedicated single-replica service that serves the DB-sourced Prometheus gauges (the former /api/metrics/db scrape) off the request-serving pods, plus its image and deploy wiring. Consolidates the previously-split #1437 (service+image) and #1438 (deploy).

Why

The db-metrics ServiceMonitor scrapes every app pod, so the heavy aggregate scan runs once per pod per window (2x today). A dedicated single replica makes the scan deterministic (one pod) and decouples it from app replica count — and keeps it off the latency-sensitive request pods. Composes with the cache (KEEP-669), the max:2 metrics pool (KEEP-679), and index-only scans (migration 0095) already on staging.
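The scan-count argument above can be written out as a small calculation. This is only an illustrative sketch: the `ScrapeTopology` type and `scansPerWindow` function are hypothetical names, not part of the PR, and it assumes each scraped pod runs exactly one aggregate scan per scrape window.

```typescript
// Hypothetical model of the two scrape topologies discussed in this PR.
type ScrapeTopology =
  | { kind: "per-app-pod"; appReplicas: number } // ServiceMonitor scrapes every app pod
  | { kind: "dedicated-collector" }; // ServiceMonitor scrapes the single collector pod

// Number of heavy aggregate scans per scrape window, assuming one scan per scraped pod.
export function scansPerWindow(topology: ScrapeTopology): number {
  return topology.kind === "per-app-pod" ? topology.appReplicas : 1;
}

// Today: two app replicas, so the scan runs twice per window.
const today = scansPerWindow({ kind: "per-app-pod", appReplicas: 2 }); // 2

// With the collector: one scan per window, regardless of app replica count.
const withCollector = scansPerWindow({ kind: "dedicated-collector" }); // 1
```

The point of the dedicated replica is that the second value stays at 1 when the app scales out, whereas the first grows linearly with `appReplicas`.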

How

Executor-style: build-context = repo root, reuses lib/metrics + lib/db verbatim via tsx (zero query duplication / schema drift).

keeperhub-metrics-collector/ — node:http server (PORT 9090) serving GET /metrics (updateDbMetrics -> getDbMetrics) and GET /health; server.test.ts covers the wiring.
Dockerfile metrics-collector stage — source-only (no Next builder): the metrics graph references no builder-generated file at runtime (lib/db/schema uses lib/types/integration only via import type, erased by tsx); server-only shimmed.
docker-bake.hcl target/group + METRICS_COLLECTOR_ECR_REPO.
deploy/metrics-collector/{staging,prod}/values.yaml — replicaCount: 1, ServiceMonitor scraping /metrics deterministically from one pod; minimal env (only DATABASE_URL).
.github/workflows/deploy-metrics-collector.yaml — standalone build + helm-deploy via techops-services/common.
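For reviewers who want a picture of the HTTP wiring described above, here is a minimal sketch of a node:http server with the two routes. It is not the code in this PR: `createCollectorServer` and `MetricsDeps` are hypothetical names, the signatures of `updateDbMetrics` and `getDbMetrics` are assumed (the PR only names them), and the response bodies and content types are illustrative choices.

```typescript
import { createServer, type Server } from "node:http";

// Assumed shapes for the lib/metrics functions named in the PR;
// the real signatures are not shown in the source.
type MetricsDeps = {
  updateDbMetrics: () => Promise<void>; // refreshes the DB-sourced gauges
  getDbMetrics: () => Promise<string>; // returns the Prometheus text exposition
};

// Builds the server without listening, so tests can bind an ephemeral port.
export function createCollectorServer(deps: MetricsDeps): Server {
  return createServer((req, res) => {
    const path = (req.url ?? "/").split("?")[0];

    if (req.method === "GET" && path === "/health") {
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end("ok");
      return;
    }

    if (req.method === "GET" && path === "/metrics") {
      deps
        .updateDbMetrics()
        .then(() => deps.getDbMetrics())
        .then((body) => {
          res.writeHead(200, { "Content-Type": "text/plain; version=0.0.4; charset=utf-8" });
          res.end(body);
        })
        .catch(() => {
          // A failed refresh surfaces as a 500 so the scrape is marked as failed.
          res.writeHead(500, { "Content-Type": "text/plain" });
          res.end("metrics refresh failed");
        });
      return;
    }

    res.writeHead(404, { "Content-Type": "text/plain" });
    res.end("not found");
  });
}
```

Injecting the metrics functions keeps the HTTP layer testable without a database, which matches the way server.test.ts is described as mocking the metrics module. In the real service the entry point would call something like `createCollectorServer(...).listen(Number(process.env.PORT ?? 9090))`.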

Validation

pnpm type-check clean, pnpm check (biome) clean, collector tests pass (4).
Built the image (docker build --target metrics-collector): no Next build runs; prometheus-api + db-metrics import cleanly in the trimmed container; /health -> 200; /metrics constructs and runs the real aggregate queries (only the DB connection fails under a dummy URL — no missing modules).

No cutover

The app's db-metrics ServiceMonitor and /api/metrics/db route are intentionally untouched. Cutover (remove the app monitor + env-gate the route to 404) is a follow-up, after the collector is verified scraping in staging.
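One possible shape for the follow-up env gate is sketched below. Nothing here is decided by this PR: the flag name `DB_METRICS_ROUTE_DISABLED`, the `gateByEnv` helper, and the `RouteHandler` type are all hypothetical, and the sketch assumes the route handler returns a standard web `Response`.

```typescript
// Hypothetical flag name; the PR does not say how the route will be gated.
const DISABLE_FLAG = "DB_METRICS_ROUTE_DISABLED";

type RouteHandler = () => Promise<Response>;

// Wraps a route handler so it answers 404 when the flag is set to "true".
// The env is read on every call, so flipping the flag needs no code change.
export function gateByEnv(
  env: Record<string, string | undefined>,
  handler: RouteHandler,
): RouteHandler {
  return async () => {
    if (env[DISABLE_FLAG] === "true") {
      return new Response("Not Found", { status: 404 });
    }
    return handler();
  };
}
```

With a gate like this, the app route would keep serving until the flag is set in the app's environment, which would let the cutover happen per environment after the collector is verified.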

Standalone single-replica service that serves the DB-sourced Prometheus gauges (the former /api/metrics/db scrape) off the request-serving pods. Executor-style: build-context = repo root, imports lib/metrics verbatim via tsx, so the exposed gauge families are identical to today's endpoint.

- index.ts boots a node:http server (PORT, default 9090) with graceful shutdown
- server.ts serves GET /metrics (updateDbMetrics -> getDbMetrics, TTL- and pool-gated in lib/) and GET /health
- server.test.ts mocks the metrics module and exercises the HTTP wiring on an ephemeral port (200 + gauges, /health, 500 on refresh failure, 404)

Containerization, CI, and deploy wiring follow in later commits (TECH-6484).
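The commit message mentions that the refresh is TTL-gated in lib/. The real gating code is not shown in this PR, so the following is only a sketch of the general technique under stated assumptions: `createTtlGate` is a hypothetical helper, the injected `now` clock exists purely to make the behaviour testable, and coalescing of concurrent refreshes is an assumption rather than something the source confirms.

```typescript
// Hypothetical TTL gate: runs `refresh` at most once per `ttlMs`,
// and shares a single in-flight refresh between concurrent callers.
export function createTtlGate(
  refresh: () => Promise<void>,
  ttlMs: number,
  now: () => number = Date.now,
): () => Promise<void> {
  let lastRefreshAt: number | null = null;
  let inFlight: Promise<void> | null = null;

  return function refreshIfStale(): Promise<void> {
    // Fresh enough: skip the expensive refresh entirely.
    if (lastRefreshAt !== null && now() - lastRefreshAt < ttlMs) {
      return Promise.resolve();
    }
    // A refresh is already running: reuse it instead of starting another.
    if (inFlight !== null) {
      return inFlight;
    }
    inFlight = refresh()
      .then(() => {
        lastRefreshAt = now();
      })
      .finally(() => {
        inFlight = null;
      });
    return inFlight;
  };
}
```

A gate of this kind composes with the single-replica design: the replica count bounds how many pods can scan, and the TTL bounds how often the one remaining pod does.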

Dockerfile metrics-collector stage (executor-style: reuses lib/ + root node_modules via tsx, shims server-only, serves on :9090) and a docker-bake.hcl target/group + METRICS_COLLECTOR_ECR_REPO var. Bake config validated with 'buildx bake --print'. Image build/push runs once the ECR repo (keeperhub-metrics-collector-{env}) exists and a deploy workflow wires the bake target -- following commits.

…(Stage 3)

- deploy/metrics-collector/{staging,prod}/values.yaml: replicaCount 1, serves /metrics on :9090, ServiceMonitor scrapes the DB gauges from this one pod (deterministic, no hashmod). Minimal env -- only DATABASE_URL; no SQS/Turnkey.
- .github/workflows/deploy-metrics-collector.yaml: standalone (events-style) trigger on staging/prod + dispatch; bakes the metrics-collector target and helm-deploys via the shared techops-services/common chart, release name keeperhub-metrics-collector, namespace keeperhub.

Does NOT cut over: the app's db-metrics ServiceMonitor and /api/metrics/db route stay in place until the collector is verified in staging (Stage 4).

Requires the ECR repo + TFC workspace (terraform drafted in techops_infrastructure, applied by infra) before the first deploy.

Validated by building the image: the metrics-collector stage now (1) is included in the Docker source stage (COPY keeperhub-metrics-collector/) so the build doesn't fail, and (2) drops the COPY --from=builder generated-file lines -- the metrics import graph references no builder-generated file at runtime (lib/db/schema only uses lib/types/integration via import type, which tsx erases). The stage now depends on source only, so the image builds without the expensive Next builder stage. Confirmed: image builds, prometheus-api + db-metrics import cleanly, /health 200, /metrics runs the real queries.

chong-techops added 4 commits June 2, 2026 16:25

chong-techops temporarily deployed to staging June 2, 2026 07:40 — with GitHub Actions Inactive

This was referenced Jun 2, 2026

feat(metrics): add keeperhub-metrics-collector service + image #1437

Closed

ci(metrics): deploy metrics-collector single-replica to staging/prod #1438

Closed

chong-techops requested review from a team, OleksandrUA, eskp, joelorzet and suisuss and removed request for a team June 2, 2026 07:43

chong-techops wants to merge 4 commits into staging from feature/TECH-6484-metrics-collector-service
