feat(metrics): dedicated keeperhub-metrics-collector service (TECH-6484) #1439
Open
chong-techops wants to merge 4 commits into
Standalone single-replica service that serves the DB-sourced Prometheus gauges
(the former /api/metrics/db scrape) off the request-serving pods. Executor-style:
build-context = repo root, imports lib/metrics verbatim via tsx, so the exposed
gauge families are identical to today's endpoint.
- index.ts boots a node:http server (PORT, default 9090) with graceful shutdown
- server.ts serves GET /metrics (updateDbMetrics -> getDbMetrics, TTL- and
  pool-gated in lib/) and GET /health
- server.test.ts mocks the metrics module and exercises the HTTP wiring on an
  ephemeral port (200 + gauges, /health, 500 on refresh failure, 404)
Containerization, CI, and deploy wiring follow in later commits (TECH-6484).
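The HTTP wiring this commit describes can be sketched roughly as follows. This is a minimal illustration, not the code in server.ts: the MetricsSource type, handleRequest, createCollectorServer, start, and the response bodies are assumptions. Only the two routes, the updateDbMetrics -> getDbMetrics order, the 500 on refresh failure, the 404, and the default port come from the commit message.

```typescript
import { createServer, type Server } from "node:http";

// Hypothetical shape of the metrics module; the real lib/metrics signatures may differ.
export type MetricsSource = {
  updateDbMetrics: () => Promise<void>;
  getDbMetrics: () => Promise<string>;
};

export type Reply = { status: number; contentType: string; body: string };

const TEXT = "text/plain; charset=utf-8";

// Pure request handler: maps (method, url) to a reply, so it can be tested without a socket.
export async function handleRequest(
  method: string | undefined,
  url: string | undefined,
  source: MetricsSource,
): Promise<Reply> {
  const path = (url ?? "/").split("?")[0];
  if (method === "GET" && path === "/health") {
    return { status: 200, contentType: TEXT, body: "ok\n" };
  }
  if (method === "GET" && path === "/metrics") {
    try {
      // Refresh first, then render (the TTL and pool gating live inside lib/).
      await source.updateDbMetrics();
      return { status: 200, contentType: TEXT, body: await source.getDbMetrics() };
    } catch {
      return { status: 500, contentType: TEXT, body: "metrics refresh failed\n" };
    }
  }
  return { status: 404, contentType: TEXT, body: "not found\n" };
}

// Thin node:http adapter around the pure handler.
export function createCollectorServer(source: MetricsSource): Server {
  return createServer((req, res) => {
    void handleRequest(req.method, req.url, source).then((reply) => {
      res.writeHead(reply.status, { "Content-Type": reply.contentType });
      res.end(reply.body);
    });
  });
}

// Boot with graceful shutdown: on SIGTERM stop accepting connections, then exit.
export function start(source: MetricsSource): Server {
  const port = Number(process.env.PORT ?? 9090);
  const server = createCollectorServer(source).listen(port);
  process.once("SIGTERM", () => server.close(() => process.exit(0)));
  return server;
}
```

Keeping the routing in a pure function is one possible design choice: a test can call handleRequest directly with a fake MetricsSource, or call createCollectorServer(fake).listen(0) to let the OS pick an ephemeral port.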
Dockerfile metrics-collector stage (executor-style: reuses lib/ + root
node_modules via tsx, shims server-only, serves on :9090) and a
docker-bake.hcl target/group + METRICS_COLLECTOR_ECR_REPO var. Bake config
validated with 'buildx bake --print'.
Image build/push runs once the ECR repo (keeperhub-metrics-collector-{env})
exists and a deploy workflow wires the bake target -- following commits.
…(Stage 3)
- deploy/metrics-collector/{staging,prod}/values.yaml: replicaCount 1, serves
/metrics on :9090, ServiceMonitor scrapes the DB gauges from this one pod
(deterministic, no hashmod). Minimal env -- only DATABASE_URL; no SQS/Turnkey.
- .github/workflows/deploy-metrics-collector.yaml: standalone (events-style)
trigger on staging/prod + dispatch; bakes the metrics-collector target and
helm-deploys via the shared techops-services/common chart, release name
keeperhub-metrics-collector, namespace keeperhub.
Does NOT cut over: the app's db-metrics ServiceMonitor and /api/metrics/db
route stay in place until the collector is verified in staging (Stage 4).
Requires the ECR repo + TFC workspace (terraform drafted in
techops_infrastructure, applied by infra) before the first deploy.
Validated by building the image: the metrics-collector stage now (1) is included in the Docker source stage (COPY keeperhub-metrics-collector/) so the build doesn't fail, and (2) drops the COPY --from=builder generated-file lines -- the metrics import graph references no builder-generated file at runtime (lib/db/schema only uses lib/types/integration via import type, which tsx erases). The stage now depends on source only, so the image builds without the expensive Next builder stage. Confirmed: image builds, prometheus-api + db-metrics import cleanly, /health 200, /metrics runs the real queries.
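The import type point above can be illustrated with a small sketch. A type-only import is removed when the TypeScript is stripped or compiled, so that import line adds no runtime dependency on the module it names. Here node:http stands in for lib/types/integration, and acceptHeader is an illustrative name, not code from this PR.

```typescript
// Type-only import: erased from the emitted JavaScript, so this line by itself
// does not cause the named module to be loaded at runtime.
import type { IncomingHttpHeaders } from "node:http";

// The type is still used for checking; only the runtime import disappears.
export function acceptHeader(headers: IncomingHttpHeaders): string {
  return headers.accept ?? "*/*";
}
```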
This was referenced Jun 2, 2026
What
Adds keeperhub-metrics-collector (TECH-6484): a dedicated single-replica service that serves the DB-sourced Prometheus gauges (the former /api/metrics/db scrape) off the request-serving pods, plus its image and deploy wiring. Consolidates the previously-split #1437 (service+image) and #1438 (deploy).

Why
The db-metrics ServiceMonitor scrapes every app pod, so the heavy aggregate scan runs once per pod per window (2x today). A dedicated single replica makes the scan deterministic (one pod), decouples it from the app replica count, and keeps it off the latency-sensitive request pods. It composes with the cache (KEEP-669), the max:2 metrics pool (KEEP-679), and index-only scans (migration 0095) already on staging.

How
Executor-style: build-context = repo root, reuses lib/metrics + lib/db verbatim via tsx (zero query duplication / schema drift).
- keeperhub-metrics-collector/: node:http server (PORT 9090) serving GET /metrics (updateDbMetrics -> getDbMetrics) and GET /health; server.test.ts covers the wiring.
- Dockerfile metrics-collector stage: source-only (no Next builder), because the metrics graph references no builder-generated file at runtime (lib/db/schema uses lib/types/integration only via import type, erased by tsx); server-only shimmed.
- docker-bake.hcl target/group + METRICS_COLLECTOR_ECR_REPO.
- deploy/metrics-collector/{staging,prod}/values.yaml: replicaCount: 1, ServiceMonitor scraping /metrics deterministically from one pod; minimal env (only DATABASE_URL).
- .github/workflows/deploy-metrics-collector.yaml: standalone build + helm-deploy via techops-services/common.

Validation
- pnpm type-check clean, pnpm check (biome) clean, collector tests pass (4).
- docker build --target metrics-collector: no Next build runs; prometheus-api + db-metrics import cleanly in the trimmed container; /health -> 200; /metrics constructs and runs the real aggregate queries (only the DB connection fails under a dummy URL; no missing modules).

No cutover
The app's db-metrics ServiceMonitor and /api/metrics/db route are intentionally untouched. Cutover (remove the app monitor + env-gate the route to 404) is a follow-up, after the collector is verified scraping in staging.
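
The follow-up env gate could look roughly like the sketch below. The variable name DB_METRICS_ROUTE_ENABLED, its default, and the function names are hypothetical, since this PR does not specify them. The sketch only shows the idea of answering 404 when the gate is off.

```typescript
type Env = Record<string, string | undefined>;
type RouteReply = { status: number; body: string };

// Hypothetical flag name. Enabled unless explicitly set to "false", so existing
// deployments keep serving the route until the cutover flips it.
export function dbMetricsRouteEnabled(env: Env): boolean {
  return env["DB_METRICS_ROUTE_ENABLED"] !== "false";
}

// Wraps the existing route logic; when the gate is off the route answers 404
// without calling render, so no aggregate query runs on the app pod.
export async function gatedDbMetrics(
  env: Env,
  render: () => Promise<string>,
): Promise<RouteReply> {
  if (!dbMetricsRouteEnabled(env)) {
    return { status: 404, body: "not found\n" };
  }
  return { status: 200, body: await render() };
}
```

Defaulting to enabled is an assumption that matches the "no cutover" stance: nothing changes until the flag is set explicitly.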