feat(metrics): dedicated keeperhub-metrics-collector service (TECH-6484)#1439

Open
chong-techops wants to merge 4 commits into staging from feature/TECH-6484-metrics-collector-service

chong-techops commented Jun 2, 2026

Infra note: the keeperhub-metrics-collector-{staging,prod} ECR repos are already created/applied (techops-services/infrastructure#241 records the terraform on main). No blocker for this PR — the deploy workflow can push to the repos once merged.

What

Adds keeperhub-metrics-collector (TECH-6484): a dedicated single-replica service that serves the DB-sourced Prometheus gauges (the former /api/metrics/db scrape) off the request-serving pods, plus its image and deploy wiring. Consolidates the previously-split #1437 (service+image) and #1438 (deploy).

Why

The db-metrics ServiceMonitor scrapes every app pod, so the heavy aggregate scan runs once per pod per window (2x today). A dedicated single replica makes the scan deterministic (one pod) and decouples it from app replica count — and keeps it off the latency-sensitive request pods. Composes with the cache (KEEP-669), the max:2 metrics pool (KEEP-679), and index-only scans (migration 0095) already on staging.
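The per-pod multiplication can be illustrated with a minimal sketch, assuming each pod keeps its own in-process TTL cache in front of the aggregate scan. `TtlCachedScan`, `scansPerWindow`, and the 60 s TTL are hypothetical names and values chosen for illustration; the actual KEEP-669 cache is not shown in this PR and may behave differently.

```typescript
// Hypothetical model: every pod has its own TTL cache guarding the heavy scan.
class TtlCachedScan {
  private lastRunMs: number | null = null;
  private readonly ttlMs: number;
  scans = 0;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Called on every scrape; runs the scan only when the cached value is stale.
  scrape(nowMs: number): void {
    if (this.lastRunMs === null || nowMs - this.lastRunMs >= this.ttlMs) {
      this.scans += 1; // stands in for the real aggregate query
      this.lastRunMs = nowMs;
    }
  }
}

// Counts scans that reach the database within one TTL window when every pod
// is scraped several times inside that window.
function scansPerWindow(podCount: number, scrapesPerPod: number, ttlMs: number): number {
  const pods = Array.from({ length: podCount }, () => new TtlCachedScan(ttlMs));
  for (const pod of pods) {
    for (let i = 0; i < scrapesPerPod; i++) {
      // Spread the scrapes evenly inside the single window [0, ttlMs).
      pod.scrape(Math.floor((i * ttlMs) / scrapesPerPod));
    }
  }
  return pods.reduce((total, pod) => total + pod.scans, 0);
}

console.log(scansPerWindow(2, 4, 60_000)); // two app pods: prints 2
console.log(scansPerWindow(1, 4, 60_000)); // one dedicated collector: prints 1
```

Under this model the cache bounds scans per pod, not per cluster, so the scan count tracks the replica count until scraping moves to a single pod.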

How

Executor-style: build-context = repo root, reuses lib/metrics + lib/db verbatim via tsx (zero query duplication / schema drift).

  • keeperhub-metrics-collector/: node:http server (PORT 9090) serving GET /metrics (updateDbMetrics -> getDbMetrics) and GET /health; server.test.ts covers the wiring.
  • Dockerfile metrics-collector stage — source-only (no Next builder): the metrics graph references no builder-generated file at runtime (lib/db/schema uses lib/types/integration only via import type, erased by tsx); server-only shimmed.
  • docker-bake.hcl target/group + METRICS_COLLECTOR_ECR_REPO.
  • deploy/metrics-collector/{staging,prod}/values.yaml: replicaCount: 1, ServiceMonitor scraping /metrics deterministically from one pod; minimal env (only DATABASE_URL).
  • .github/workflows/deploy-metrics-collector.yaml — standalone build + helm-deploy via techops-services/common.
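The HTTP wiring described in the first bullet could look roughly like the sketch below. It is not the PR's server.ts: the `MetricsDeps` interface, `resolveRoute`, and `createCollectorServer` are hypothetical names, and the signatures of `updateDbMetrics` and `getDbMetrics` are assumed, since lib/metrics is not shown here.

```typescript
import { createServer } from "node:http";
import type { Server } from "node:http";

// Assumed dependency shape; the real functions live in lib/metrics.
interface MetricsDeps {
  updateDbMetrics: () => Promise<void>;
  getDbMetrics: () => Promise<string>;
}

type Route = "metrics" | "health" | "not_found";

// Pure routing decision, so it can be checked without opening a socket.
function resolveRoute(method: string | undefined, url: string | undefined): Route {
  if (method !== "GET") return "not_found";
  const path = (url ?? "/").split("?")[0];
  if (path === "/metrics") return "metrics";
  if (path === "/health") return "health";
  return "not_found";
}

function createCollectorServer(deps: MetricsDeps): Server {
  return createServer(async (req, res) => {
    const route = resolveRoute(req.method, req.url);
    if (route === "health") {
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end("ok");
      return;
    }
    if (route === "not_found") {
      res.writeHead(404, { "Content-Type": "text/plain" });
      res.end("not found");
      return;
    }
    try {
      await deps.updateDbMetrics(); // refresh the gauges
      const body = await deps.getDbMetrics(); // render the exposition text
      res.writeHead(200, { "Content-Type": "text/plain; version=0.0.4; charset=utf-8" });
      res.end(body);
    } catch {
      // A failed refresh surfaces as a 500 so the scrape is marked as failed.
      res.writeHead(500, { "Content-Type": "text/plain" });
      res.end("metrics refresh failed");
    }
  });
}
```

Injecting the metrics functions keeps the handler testable with mocks, which matches how server.test.ts is described as exercising the wiring.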

Validation

  • pnpm type-check clean, pnpm check (biome) clean, collector tests pass (4).
  • Built the image (docker build --target metrics-collector): no Next build runs; prometheus-api + db-metrics import cleanly in the trimmed container; /health -> 200; /metrics constructs and runs the real aggregate queries (only the DB connection fails under a dummy URL — no missing modules).

No cutover

The app's db-metrics ServiceMonitor and /api/metrics/db route are intentionally untouched. Cutover (remove the app monitor + env-gate the route to 404) is a follow-up, after the collector is verified scraping in staging.

Standalone single-replica service that serves the DB-sourced Prometheus
gauges (the former /api/metrics/db scrape) off the request-serving pods.
Executor-style: build-context = repo root, imports lib/metrics verbatim via
tsx, so the exposed gauge families are identical to today's endpoint.

- index.ts boots a node:http server (PORT, default 9090) with graceful
  shutdown
- server.ts serves GET /metrics (updateDbMetrics -> getDbMetrics, TTL- and
  pool-gated in lib/) and GET /health
- server.test.ts mocks the metrics module and exercises the HTTP wiring on an
  ephemeral port (200 + gauges, /health, 500 on refresh failure, 404)
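A minimal sketch of the boot logic in the first bullet, assuming PORT is read from the environment with a 9090 fallback and that shutdown is triggered by SIGTERM or SIGINT. `parsePort` and `installGracefulShutdown` are hypothetical helpers, and the handling of invalid PORT values is an assumption, not something the PR specifies.

```typescript
import type { Server } from "node:http";

const DEFAULT_PORT = 9090;

// Parses PORT, falling back to the default for missing or invalid values.
// Port 0 is allowed so tests can bind an ephemeral port.
function parsePort(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_PORT;
  const port = Number(raw);
  if (!Number.isInteger(port) || port < 0 || port > 65535) return DEFAULT_PORT;
  return port;
}

// Stops accepting new connections on the first signal, then calls onClosed
// once in-flight requests have finished.
function installGracefulShutdown(server: Server, onClosed: () => void): void {
  let closing = false;
  const shutdown = (): void => {
    if (closing) return; // ignore a second signal while draining
    closing = true;
    server.close(() => onClosed());
  };
  process.once("SIGTERM", shutdown);
  process.once("SIGINT", shutdown);
}

// Usage (not executed here):
//   const server = createServer(handler);
//   installGracefulShutdown(server, () => process.exit(0));
//   server.listen(parsePort(process.env.PORT));
```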

Containerization, CI, and deploy wiring follow in later commits (TECH-6484).

Dockerfile metrics-collector stage (executor-style: reuses lib/ + root
node_modules via tsx, shims server-only, serves on :9090) and a
docker-bake.hcl target/group + METRICS_COLLECTOR_ECR_REPO var. Bake config
validated with 'buildx bake --print'.

Image build/push runs once the ECR repo (keeperhub-metrics-collector-{env})
exists and a deploy workflow wires the bake target -- following commits.

…(Stage 3)

- deploy/metrics-collector/{staging,prod}/values.yaml: replicaCount 1, serves
  /metrics on :9090, ServiceMonitor scrapes the DB gauges from this one pod
  (deterministic, no hashmod). Minimal env -- only DATABASE_URL; no SQS/Turnkey.
- .github/workflows/deploy-metrics-collector.yaml: standalone (events-style)
  trigger on staging/prod + dispatch; bakes the metrics-collector target and
  helm-deploys via the shared techops-services/common chart, release name
  keeperhub-metrics-collector, namespace keeperhub.

Does NOT cut over: the app's db-metrics ServiceMonitor and /api/metrics/db
route stay in place until the collector is verified in staging (Stage 4).
Requires the ECR repo + TFC workspace (terraform drafted in
techops_infrastructure, applied by infra) before the first deploy.

Validated by building the image: the metrics-collector stage now (1) is
included in the Docker source stage (COPY keeperhub-metrics-collector/) so
the build doesn't fail, and (2) drops the COPY --from=builder generated-file
lines -- the metrics import graph references no builder-generated file at
runtime (lib/db/schema only uses lib/types/integration via import type, which
tsx erases). The stage now depends on source only, so the image builds without
the expensive Next builder stage. Confirmed: image builds, prometheus-api +
db-metrics import cleanly, /health 200, /metrics runs the real queries.
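The `import type` erasure that lets the stage drop the builder-generated files can be shown with a small stand-in. This sketch uses `node:http` instead of lib/types/integration, which is not shown in this PR; `describeRequest` is a hypothetical helper.

```typescript
// A type-only import: it is removed entirely when the TypeScript is compiled
// or stripped, so no module is loaded for it at runtime.
import type { IncomingMessage } from "node:http";

// Only the *type* IncomingMessage is referenced; nothing from the module is
// needed when this function runs.
function describeRequest(req: Pick<IncomingMessage, "method" | "url">): string {
  return `${req.method ?? "UNKNOWN"} ${req.url ?? "/"}`;
}

console.log(describeRequest({ method: "GET", url: "/health" })); // prints "GET /health"
```

By the same mechanism, a file that is only ever imported with `import type` does not have to exist in the runtime image, which is the property the commit message relies on for lib/db/schema.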