feat(metrics): cut over /api/metrics/db to the collector + PR-env wiring (TECH-6484)#1442
chong-techops wants to merge 3 commits
…ing (TECH-6484)
Stage 4 (cutover):
- Gate the app's /api/metrics/db route to 404 via METRICS_DB_OFFLOADED so the
heavy aggregate scan never runs on the request-serving pods.
- Remove the db-metrics ServiceMonitor from deploy/keeperhub/{staging,prod} and
set METRICS_DB_OFFLOADED=true. /api/metrics/api is unchanged.
Stage 5 (PR-env wiring + docs):
- deploy/pr-environment/metrics-collector.template.yaml (single replica, PR DB,
ServiceMonitor off).
- deploy-pr-environment.yaml: opt-in deploy-pr-metrics label -> build-collector-image
job + a gated deploy step. Default off, so existing PR envs are unaffected.
- METRICS_REFERENCE.md note on the collector + offload.
Depends on the collector being live + verified in staging (PR #1439). Cutover
must merge only after that; otherwise there will be a gap in DB metrics.
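The Stage 4 gate described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual handler: only the METRICS_DB_OFFLOADED variable and the 404 behaviour come from the description, while the function names, the result shape, and the rule that only the exact string "true" enables the offload are assumptions.

```typescript
// Hypothetical sketch: only METRICS_DB_OFFLOADED and the 404 response come from the PR.
type Env = Record<string, string | undefined>;

interface GateResult {
  status: number;
  body: string;
}

// Returns true when the DB-metrics route should be disabled on app pods.
// Assumption: only the exact string "true" switches the offload on.
function isDbMetricsOffloaded(env: Env): boolean {
  return env.METRICS_DB_OFFLOADED === "true";
}

// Runs the expensive aggregate scan only when the route has not been offloaded.
function handleDbMetrics(env: Env, runAggregateScan: () => string): GateResult {
  if (isDbMetricsOffloaded(env)) {
    // Short-circuit before the scan callback is ever invoked.
    return { status: 404, body: "Not Found" };
  }
  return { status: 200, body: runAggregateScan() };
}
```

Passing the environment and the scan in as arguments keeps the check easy to test, and flipping the variable back restores the route without a code change, which matches the "reversible via config" intent.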
PR-env wiring validated (with one gap found)
Exercised the
Gap to fix: the incremental path - adding |
…or lands
Found during B-now validation: adding deploy-pr-metrics to an already-deployed PR built the collector image but did not deploy it - the collector deploy step sits inside the should-deploy-gated deploy job. Set should-deploy=true on the metrics-only path (mirroring deploy-pr-executor) so the deploy re-runs and the collector step executes. The both-labels path was already correct.
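The gating problem described in this commit can be modelled with a small sketch. It is a hypothetical model only: the real logic lives in deploy-pr-environment.yaml, the ENV_LABEL name is invented for illustration, and the metricsOnlyRedeploys flag exists purely to contrast the behaviour before and after the fix.

```typescript
// Hypothetical model of the label gating; not the workflow's actual expressions.
const ENV_LABEL = "deploy-pr-environment"; // assumed name of the main PR-env label
const METRICS_LABEL = "deploy-pr-metrics";
const EXECUTOR_LABEL = "deploy-pr-executor";

interface Plan {
  buildCollectorImage: boolean;
  shouldDeploy: boolean;
  deployCollector: boolean;
}

// addedLabel: the label whose addition triggered this run.
// labels: every label on the PR after the addition.
function planRun(addedLabel: string, labels: string[], metricsOnlyRedeploys = true): Plan {
  const envEnabled = labels.includes(ENV_LABEL);
  const wantsMetrics = labels.includes(METRICS_LABEL);
  const redeployTriggers = [ENV_LABEL, EXECUTOR_LABEL];
  if (metricsOnlyRedeploys) {
    // The fix: the metrics-only path also sets should-deploy.
    redeployTriggers.push(METRICS_LABEL);
  }
  const shouldDeploy = envEnabled && redeployTriggers.includes(addedLabel);
  return {
    buildCollectorImage: wantsMetrics,
    shouldDeploy,
    // The collector step sits inside the deploy job, so it needs should-deploy too.
    deployCollector: shouldDeploy && wantsMetrics,
  };
}
```

With metricsOnlyRedeploys set to false, adding deploy-pr-metrics to an already-deployed PR builds the image but leaves deployCollector false, which is the gap the commit describes. With the default, the deploy re-runs and the collector step executes.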
This PR turns off the app's DB-metrics scrape. It is safe only after the dedicated collector (PR #1439) is live and confirmed scraping. Merging out of order creates a metrics gap.
Required order:
- … keeperhub-metrics-collector-…-db-metrics Prometheus target is Up with gauges matching /api/metrics/db.
- … Up in prod), then promote this cutover. Never ship both to prod in one promotion (same gap risk).
Principle: always overlap (two sources briefly, max() dedupes), never gap.
What
Cutover (Stage 4):
- App /api/metrics/db route → 404 when METRICS_DB_OFFLOADED=true (reversible via config).
- Remove the db-metrics ServiceMonitor from deploy/keeperhub/{staging,prod} and set METRICS_DB_OFFLOADED=true. /api/metrics/api unchanged.
PR-env wiring (Stage 5):
- deploy/pr-environment/metrics-collector.template.yaml + deploy-pr-environment.yaml deploy-pr-metrics opt-in label (build + deploy a PR collector). Default off, so existing PR envs are unaffected.
- METRICS_REFERENCE.md note.
Validation
- deploy-pr-metrics path was exercised end-to-end (throwaway combined PR): built a PR-sha collector image and deployed a real pod (/metrics 200). The deploy-pr-metrics label was created in the repo.
- … a9076653): the metrics-only label (added to an already-deployed PR) now re-runs the deploy so the collector actually lands, mirroring deploy-pr-executor.
Depends on #1439 (image stage + deploy config).
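The overlap principle from the required order (two sources briefly, max() dedupes) can be illustrated with a small sketch. It only mimics in plain code what a max aggregation over the two scrape sources would do on the query side; the Sample shape, the source names, and the metric names are hypothetical.

```typescript
// Hypothetical sample shape; metric and source names are for illustration only.
interface Sample {
  metric: string;
  source: string;
  value: number;
}

// Collapses samples reported by several sources into one value per metric
// by keeping the maximum, so a brief overlap never double-counts a gauge.
function dedupeByMax(samples: Sample[]): Map<string, number> {
  const out = new Map<string, number>();
  for (const s of samples) {
    const prev = out.get(s.metric);
    if (prev === undefined || s.value > prev) {
      out.set(s.metric, s.value);
    }
  }
  return out;
}

// During the overlap window both the app and the collector report the same gauge.
const overlap = dedupeByMax([
  { metric: "example_rows_total", source: "app", value: 42 },
  { metric: "example_rows_total", source: "collector", value: 42 },
]);

// During a gap neither source reports, so the metric is simply absent.
const gap = dedupeByMax([]);
```

An overlap yields a single value per metric, whereas a gap yields no value at all. That is why the order above keeps both sources live for a short period instead of switching in one step.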