DO NOT MERGE: TECH-6484 collector PR-env validation (throwaway)#1443

Closed
chong-techops wants to merge 8 commits into staging from tmp/TECH-6484-collector-validate

@chong-techops

Throwaway PR to validate the real metrics-collector pod + the deploy-pr-metrics PR-env wiring before #1442 merges. Combines #1439 (service+image) and #1442 (PR-env wiring). Will be closed + torn down after validation. Do not review/merge.

Standalone single-replica service that serves the DB-sourced Prometheus
gauges (the former /api/metrics/db scrape) off the request-serving pods.
Executor-style: build-context = repo root, imports lib/metrics verbatim via
tsx, so the exposed gauge families are identical to today's endpoint.

- index.ts boots a node:http server (PORT, default 9090) with graceful
  shutdown
- server.ts serves GET /metrics (updateDbMetrics -> getDbMetrics, TTL- and
  pool-gated in lib/) and GET /health
- server.test.ts mocks the metrics module and exercises the HTTP wiring on an
  ephemeral port (200 + gauges, /health, 500 on refresh failure, 404)
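
The HTTP wiring in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in server.ts: `MetricsSource`, `resolveRoute`, `createMetricsServer`, and `installShutdown` are hypothetical names, the response bodies are placeholders, and the real `updateDbMetrics`/`getDbMetrics` signatures in lib/metrics may differ.

```typescript
import { createServer, type IncomingMessage, type Server, type ServerResponse } from "node:http";

// Hypothetical shape of the metrics module; the real lib/metrics exports may differ.
interface MetricsSource {
  updateDbMetrics(): Promise<void>;
  getDbMetrics(): Promise<string>;
}

type Route = "metrics" | "health" | "not_found";

// Pure routing decision, kept separate so it can be tested without opening a socket.
function resolveRoute(method: string | undefined, url: string | undefined): Route {
  if (method !== "GET") return "not_found";
  const path = (url ?? "").split("?")[0];
  if (path === "/metrics") return "metrics";
  if (path === "/health") return "health";
  return "not_found";
}

function createMetricsServer(source: MetricsSource): Server {
  return createServer(async (req: IncomingMessage, res: ServerResponse) => {
    const route = resolveRoute(req.method, req.url);
    if (route === "health") {
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end("ok");
      return;
    }
    if (route === "not_found") {
      res.writeHead(404, { "Content-Type": "text/plain" });
      res.end("not found");
      return;
    }
    try {
      // Refresh first, then render; any TTL or pool gating is assumed to live in the source.
      await source.updateDbMetrics();
      const body = await source.getDbMetrics();
      res.writeHead(200, { "Content-Type": "text/plain; version=0.0.4; charset=utf-8" });
      res.end(body);
    } catch {
      // A failed refresh surfaces as a 500 so the scrape is marked as failed.
      res.writeHead(500, { "Content-Type": "text/plain" });
      res.end("metrics refresh failed");
    }
  });
}

// Graceful shutdown: stop accepting connections, exit once in-flight requests finish.
function installShutdown(server: Server): void {
  for (const signal of ["SIGTERM", "SIGINT"] as const) {
    process.once(signal, () => server.close(() => process.exit(0)));
  }
}
```

Under these assumptions, an entry point would call `createMetricsServer(source).listen(Number(process.env.PORT ?? 9090))` and pass the returned server to `installShutdown`. Keeping the source behind an interface is also what would let a test substitute a mock for the metrics module.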

Containerization, CI, and deploy wiring follow in later commits (TECH-6484).

Dockerfile metrics-collector stage (executor-style: reuses lib/ + root
node_modules via tsx, shims server-only, serves on :9090) and a
docker-bake.hcl target/group + METRICS_COLLECTOR_ECR_REPO var. Bake config
validated with 'buildx bake --print'.

Image build/push runs once the ECR repo (keeperhub-metrics-collector-{env})
exists and a deploy workflow wires the bake target -- following commits.
…(Stage 3)

- deploy/metrics-collector/{staging,prod}/values.yaml: replicaCount 1, serves
  /metrics on :9090, ServiceMonitor scrapes the DB gauges from this one pod
  (deterministic, no hashmod). Minimal env -- only DATABASE_URL; no SQS/Turnkey.
- .github/workflows/deploy-metrics-collector.yaml: standalone (events-style)
  trigger on staging/prod + dispatch; bakes the metrics-collector target and
  helm-deploys via the shared techops-services/common chart, release name
  keeperhub-metrics-collector, namespace keeperhub.

Does NOT cut over: the app's db-metrics ServiceMonitor and /api/metrics/db
route stay in place until the collector is verified in staging (Stage 4).
Requires the ECR repo + TFC workspace (terraform drafted in
techops_infrastructure, applied by infra) before the first deploy.

Validated by building the image: the metrics-collector stage now (1) is
included in the Docker source stage (COPY keeperhub-metrics-collector/) so
the build doesn't fail, and (2) drops the COPY --from=builder generated-file
lines -- the metrics import graph references no builder-generated file at
runtime (lib/db/schema only uses lib/types/integration via import type, which
tsx erases). The stage now depends on source only, so the image builds without
the expensive Next builder stage. Confirmed: image builds, prometheus-api +
db-metrics import cleanly, /health 200, /metrics runs the real queries.
…ing (TECH-6484)

Stage 4 (cutover):
- Gate the app's /api/metrics/db route to 404 via METRICS_DB_OFFLOADED so the
  heavy aggregate scan never runs on the request-serving pods.
- Remove the db-metrics ServiceMonitor from deploy/keeperhub/{staging,prod} and
  set METRICS_DB_OFFLOADED=true. /api/metrics/api is unchanged.
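
A minimal sketch of the Stage 4 gate, for illustration only. It assumes the flag counts as enabled only when it equals the string "true"; `isDbMetricsOffloaded`, `handleDbMetrics`, and the `collect` callback are hypothetical names, not the app's actual route code.

```typescript
type Env = Record<string, string | undefined>;

// Assumption: only the exact string "true" enables the offload.
function isDbMetricsOffloaded(env: Env): boolean {
  return env["METRICS_DB_OFFLOADED"] === "true";
}

// Hypothetical handler shape: `collect` stands in for the existing DB-metrics refresh.
async function handleDbMetrics(
  env: Env,
  collect: () => Promise<string>,
): Promise<{ status: number; body: string }> {
  if (isDbMetricsOffloaded(env)) {
    // Return before touching the database so the aggregate scan never starts.
    return { status: 404, body: "Not Found" };
  }
  return { status: 200, body: await collect() };
}
```

Checking the flag before calling `collect` is the point of the gate: with the flag set, the request-serving pod answers 404 without issuing any query.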

Stage 5 (PR-env wiring + docs):
- deploy/pr-environment/metrics-collector.template.yaml (single replica, PR DB,
  ServiceMonitor off).
- deploy-pr-environment.yaml: opt-in deploy-pr-metrics label -> build-collector-image
  job + a gated deploy step. Default off, so existing PR envs are unaffected.
- METRICS_REFERENCE.md note on the collector + offload.

Depends on the collector being live + verified in staging (PR #1439). Cutover
must merge only after that; otherwise DB metrics will have a gap.
…uild-check CI

Review feedback on PR #1439 (suisuss):

- #3 Startup defined twice: drop the helm command/args override from
  deploy/metrics-collector/{staging,prod}/values.yaml; rely on the Dockerfile
  CMD (tsx keeperhub-metrics-collector/index.ts) as the single source of truth.
- #4 Nothing builds this stage in CI: add .github/workflows/collector-build-check.yml
  -- a PR smoke job that builds the metrics-collector target and runs
  keeperhub-metrics-collector/import-check.ts inside the image, asserting the
  runtime import graph resolves. Catches a future value-import of a
  builder-generated module (server.test.ts mocks the module, so it can't).
  Validated locally: image builds, import-check prints IMPORTS_OK, exit 0.

- #1 ServiceMonitor port name: matches the executor pattern; will verify the
  actual Prometheus target in staging (replied on the PR).

Consistency with the #1439 review (#3): rely on the Dockerfile CMD, no helm
command/args override. Matches deploy/metrics-collector/{staging,prod}.
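
The import check from review item #4 above could look roughly like this. It is a sketch under assumptions, not the contents of import-check.ts: `findBrokenImports` and `runImportCheck` are hypothetical names, the `IMPORTS_FAILED` line is invented for illustration, and the module specifiers to check are left to the caller.

```typescript
// Try to load each module and collect the specifiers that fail to resolve or evaluate.
async function findBrokenImports(specifiers: string[]): Promise<string[]> {
  const broken: string[] = [];
  for (const specifier of specifiers) {
    try {
      await import(specifier);
    } catch {
      broken.push(specifier);
    }
  }
  return broken;
}

// Returns the exit code the script would use: 0 when every import resolves.
async function runImportCheck(specifiers: string[]): Promise<number> {
  const broken = await findBrokenImports(specifiers);
  if (broken.length > 0) {
    console.error(`IMPORTS_FAILED ${broken.join(", ")}`);
    return 1;
  }
  console.log("IMPORTS_OK");
  return 0;
}
```

Because the modules are really loaded rather than mocked, a value import of a file that is missing from the image fails here, which a unit test with a mocked metrics module cannot detect.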
@github-actions

github-actions Bot commented Jun 3, 2026

PR Environment Deployed

Your PR environment has been deployed!

Environment Details:

Components:

  • Keeperhub Application
  • PostgreSQL Database (isolated instance)
  • LocalStack (SQS emulation)
  • Redis (isolated instance)
  • Schedule Dispatcher (staging image)
  • Block Dispatcher (staging image)
  • Event Tracker (staging image)

The environment will be automatically cleaned up when this PR is closed or merged.

@chong-techops
Author

Validation complete (B-now): collector image built from PR sha, deployed as a real pod (Running 1/1), /health 200 and /metrics 200 with gauge families against the PR DB. Combined with the chart-render port-binding check (Jacob #1) and validation A (48 families on the populated staging DB), the collector is verified end-to-end. Throwaway PR - closing + tearing down.
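
One way to count gauge families in a /metrics response, sketched for illustration: it relies on the `# TYPE <name> <type>` lines of the Prometheus text exposition format. `listMetricFamilies` is a hypothetical helper, not necessarily how the counts above were obtained.

```typescript
// Collect the distinct metric family names declared by "# TYPE" lines.
function listMetricFamilies(exposition: string): string[] {
  const families = new Set<string>();
  for (const line of exposition.split("\n")) {
    const match = /^# TYPE (\S+) (\S+)\s*$/.exec(line);
    if (match) families.add(match[1]);
  }
  return Array.from(families).sort();
}
```

Applied to the body of a scrape, `listMetricFamilies(body).length` gives the number of families exposed.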

@chong-techops chong-techops deleted the tmp/TECH-6484-collector-validate branch June 3, 2026 03:29
@github-actions

github-actions Bot commented Jun 3, 2026

🧹 PR Environment Cleaned Up

The PR environment has been successfully deleted.

Deleted Resources:

  • Namespace: pr-1443
  • All Helm releases (Keeperhub, Scheduler, Event services)
  • PostgreSQL Database (including data)
  • LocalStack, Redis
  • All associated secrets and configs

All resources have been cleaned up and will no longer incur costs.

Labels

deploy-pr-environment
deploy-pr-metrics: Deploy the metrics-collector as a satellite in the PR environment (TECH-6484)
