Skip to content

feat(metrics): cut over /api/metrics/db to the collector + PR-env wiring (TECH-6484)#1442

Open
chong-techops wants to merge 3 commits into
stagingfrom
feature/TECH-6484-collector-cutover
Open

feat(metrics): cut over /api/metrics/db to the collector + PR-env wiring (TECH-6484)#1442
chong-techops wants to merge 3 commits into
stagingfrom
feature/TECH-6484-collector-cutover

Conversation

@chong-techops
Copy link
Copy Markdown

@chong-techops chong-techops commented Jun 3, 2026

⚠️ Deployment sequence (read before merging)

This PR turns off the app's DB-metrics scrape. It is safe only after the dedicated collector (PR #1439) is live and confirmed scraping. Merging out of order creates a metrics gap.

Required order:

  1. Merge feat(metrics): dedicated keeperhub-metrics-collector service (TECH-6484) #1439 → collector deploys to staging (app still serving db-metrics too — harmless overlap).
  2. Verify in staging: the keeperhub-metrics-collector-…-db-metrics Prometheus target is Up with gauges matching /api/metrics/db.
  3. Then merge this PR → app stops serving db-metrics; collector is the sole source → no gap.
  4. Prod: promote the collector first (verify target Up in prod), then promote this cutover. Never ship both to prod in one promotion (same gap risk).

Principle: always overlap (two sources briefly, max() dedupes), never gap.

What

Cutover (Stage 4):

  • app/api/metrics/db route → 404 when METRICS_DB_OFFLOADED=true (reversible via config).
  • Remove the db-metrics ServiceMonitor from deploy/keeperhub/{staging,prod} and set METRICS_DB_OFFLOADED=true. /api/metrics/api unchanged.

PR-env wiring (Stage 5):

  • deploy/pr-environment/metrics-collector.template.yaml + deploy-pr-environment.yaml deploy-pr-metrics opt-in label (build + deploy a PR collector). Default off — existing PR envs unaffected.
  • METRICS_REFERENCE.md note.

Validation

  • Type-check / biome clean; all changed YAML parses.
  • The deploy-pr-metrics path was exercised end-to-end (throwaway combined PR): built a PR-sha collector image and deployed a real pod (/metrics 200). The deploy-pr-metrics label was created in the repo.
  • Fixed a gap found during that run (a9076653): the metrics-only label (added to an already-deployed PR) now re-runs the deploy so the collector actually lands, mirroring deploy-pr-executor.

Depends on #1439 (image stage + deploy config).

…ing (TECH-6484)

Stage 4 (cutover):
- Gate the app's /api/metrics/db route to 404 via METRICS_DB_OFFLOADED so the
  heavy aggregate scan never runs on the request-serving pods.
- Remove the db-metrics ServiceMonitor from deploy/keeperhub/{staging,prod} and
  set METRICS_DB_OFFLOADED=true. /api/metrics/api is unchanged.

Stage 5 (PR-env wiring + docs):
- deploy/pr-environment/metrics-collector.template.yaml (single replica, PR DB,
  ServiceMonitor off).
- deploy-pr-environment.yaml: opt-in deploy-pr-metrics label -> build-collector-image
  job + a gated deploy step. Default off, so existing PR envs are unaffected.
- METRICS_REFERENCE.md note on the collector + offload.

Depends on the collector being live + verified in staging (PR #1439). Cutover
must merge only after that, else a DB-metrics gap.
Consistency with the #1439 review (#3): rely on the Dockerfile CMD, no helm
command/args override. Matches deploy/metrics-collector/{staging,prod}.
@chong-techops
Copy link
Copy Markdown
Author

PR-env wiring validated (with one gap found)

Exercised the deploy-pr-metrics path here via a throwaway combined branch (this PR's wiring + #1439's image stage), labeled deploy-pr-environment + deploy-pr-metrics:

  • check-labels set should-deploy-collector=true, build-collector-image built + pushed a PR-sha image, and the deploy step rolled out a real metrics-pr-N-common collector pod (Running 1/1, /metrics 200). So the primary both-labels path works.
  • Also created the deploy-pr-metrics label in the repo (it didn't exist - required for this to be usable).

Gap to fix: the incremental path - adding deploy-pr-metrics to an already-deployed PR (without re-adding deploy-pr-environment) - builds the collector image but does not deploy it, because the collector deploy step lives inside the should-deploy-gated deploy job (the metrics-only branch sets should-deploy-collector but not should-deploy). The fix is to make the collector deploy its own step gated only on should-deploy-collector (or have the metrics-only branch also set should-deploy). Low impact (the both-labels path is the common case), but the incremental label currently doesn't do what it implies.

…or lands

Found during B-now validation: adding deploy-pr-metrics to an already-deployed
PR built the collector image but did not deploy it - the collector deploy step
sits inside the should-deploy-gated deploy job. Set should-deploy=true on the
metrics-only path (mirroring deploy-pr-executor) so the deploy re-runs and the
collector step executes. The both-labels path was already correct.
@chong-techops chong-techops marked this pull request as ready for review June 3, 2026 03:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant