feat(addons): make the infra addon dashboards reach AMP — pod scrape annotations #61

Merged: stxkxs merged 1 commit into main from addon-scrape-annotations on Jun 24, 2026

@stxkxs stxkxs commented Jun 24, 2026

The hub's grafana-agent (Alloy) is annotation-gated — it scrapes only pods with prometheus.io/scrape="true" (plus the static KSM + cAdvisor targets); ServiceMonitor is inert in prod. None of these addon deltas carried a scrape annotation, so their community id-import dashboards were hollow in prod (metrics never reached AMP).
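
As a rough illustration of what the annotation gate looks for, pod metadata along these lines would be kept by the scrape. The pod name, container name, and image are hypothetical, and the `/metrics` path is an assumed value; only the annotation keys and the example port come from this PR.

```yaml
# Hypothetical rendered Pod, trimmed to the fields relevant to the scrape gate.
apiVersion: v1
kind: Pod
metadata:
  name: example-addon-controller      # hypothetical name
  annotations:
    prometheus.io/scrape: "true"      # the gate: pods without this are dropped
    prometheus.io/port: "8080"        # should match the container's metrics port
    prometheus.io/path: "/metrics"    # assumed value; shown only to illustrate the key
spec:
  containers:
    - name: controller                # hypothetical
      image: example.invalid/addon:latest   # placeholder image
      ports:
        - name: metrics
          containerPort: 8080
```

Annotation values are strings in Kubernetes, which is why `"true"` and the port are quoted.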

Adds the scrape config to every always-on infra addon that ships a dashboard. Each chart's mechanism + port was helm-template-verified (a wrong key/port silently re-hollows the board), catching real gotchas:

| addon | key | port |
| --- | --- | --- |
| karpenter | podAnnotations | 8080 |
| external-dns | podAnnotations | 7979 |
| external-secrets | podAnnotations | 8080 |
| tempo | podAnnotations | 3200 (self) |
| loki | singleBinary.podAnnotations | 3100 (self) |
| opencost | opencost.podAnnotations | 9003 |
| argo-rollouts | controller.podAnnotations | 8090 (not 8080=healthz) |
| argo-events | controller.podAnnotations | 7777 (not 8082=svc) |
| argo-workflows | controller.metricsConfig.enabled + podAnnotations | 9090 |
| cilium | prometheus.enabled + operator.prometheus.enabled + hubble.relay.prometheus.enabled (auto-stamps pods) | 9962/9963/9966 |
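
A sketch of what two of these values deltas might look like, using the keys and ports listed above. The two-document layout is only for illustration; each fragment would presumably live in that addon's own values file, and the actual diff may differ.

```yaml
# karpenter values: podAnnotations sits at the top level of the chart values
podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8080"
---
# argo-rollouts values: the key is nested under controller,
# and the metrics port is 8090 (8080 is healthz)
controller:
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8090"
```

Rendering each chart with `helm template` and checking that the annotations land on the pod template, as this PR did, is what catches a misplaced key or a wrong port.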

No change needed: cert-manager + aws-load-balancer-controller already auto-emit the annotations by default — never the gap.

Not covered (tracked separately): the hubble L7 board reads hubble_http_* on :9965, which cilium only annotates on a headless Service — unreachable by pod-annotation scrape (a pod carries one prometheus.io/port, already :9962). Needs an Alloy service-discovery rule.

All 10 parse with correct nesting + yamllint clean.

…annotations

The hub's grafana-agent (Alloy) is annotation-gated: it keeps only pods with
prometheus.io/scrape="true" (honoring prometheus.io/port + /path), plus the static
kube-state-metrics + cAdvisor targets. ServiceMonitor is inert in prod (no
prometheus-operator). None of these addon deltas carried a scrape annotation, so
their community id-import dashboards were hollow in prod — the metrics never reached
Amazon Managed Prometheus.

Adds the scrape config to every always-on infra addon that ships a dashboard. Each
chart's exact mechanism + metrics port was helm-template-verified (a wrong key or
port silently re-hollows the board):

  - karpenter         podAnnotations            :8080  (controller)
  - external-dns      podAnnotations            :7979
  - external-secrets  podAnnotations            :8080  (main controller)
  - tempo             podAnnotations            :3200  (self-metrics)
  - loki              singleBinary.podAnnotations :3100 (self-metrics)
  - opencost          opencost.podAnnotations   :9003  (subchart alias)
  - argo-rollouts     controller.podAnnotations :8090  (NOT 8080 — that's healthz)
  - argo-events       controller.podAnnotations :7777  (NOT the 8082 service port)
  - argo-workflows    controller.metricsConfig.enabled + podAnnotations :9090
                      (the controller serves no metrics until metricsConfig.enabled)
  - cilium            prometheus.enabled + operator.prometheus.enabled +
                      hubble.relay.prometheus.enabled — the chart auto-stamps the
                      agent/operator/relay pods (:9962/:9963/:9966) when these are on
                      and serviceMonitor is off (which it is in prod)
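
The last two entries depend on chart flags rather than annotations alone. A minimal sketch of those values, assuming the keys named in the list are set at the top level of each chart's values:

```yaml
# argo-workflows values: metrics have to be switched on before there is
# anything to scrape, then the pod is annotated as usual
controller:
  metricsConfig:
    enabled: true
  podAnnotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
---
# cilium values: no podAnnotations here; per the commit message the chart
# stamps the annotations itself when these flags are on and serviceMonitor is off
prometheus:
  enabled: true        # agent, :9962
operator:
  prometheus:
    enabled: true      # operator, :9963
hubble:
  relay:
    prometheus:
      enabled: true    # relay, :9966
```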

Two addons need no change — cert-manager and aws-load-balancer-controller already
auto-emit the prometheus.io annotations by default (prometheus.enabled, serviceMonitor
off), so their boards were never the gap.

Not covered here: the hubble L7 overview reads hubble_http_* on :9965, which the
cilium chart only annotates on a headless Service — unreachable by the pod-annotation
scrape (a pod can carry only one prometheus.io/port, already used by the agent's
:9962). That needs an Alloy service-discovery scrape rule, tracked separately.
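
To make the limitation concrete, the following sketch contrasts the two objects. The Service name is hypothetical and both manifests are trimmed; only the ports and the annotation keys are taken from the text above.

```yaml
# Cilium agent pod (trimmed): annotations are a string map, so
# prometheus.io/port can appear only once and already points at :9962.
apiVersion: v1
kind: Pod
metadata:
  name: cilium-agent-example          # hypothetical name
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9962"        # no room for a second port such as 9965
---
# Headless Service carrying the hubble metrics annotation (trimmed).
# A pod-annotation scrape never looks at Service annotations.
apiVersion: v1
kind: Service
metadata:
  name: hubble-metrics-example        # hypothetical name
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9965"
spec:
  clusterIP: None                     # headless
  ports:
    - name: hubble-metrics            # hypothetical port name
      port: 9965
```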
@github-actions

CI Results

Checks: YAML Lint
Kustomize Build environments: dev, staging, production, hub

All validations passed.

@stxkxs stxkxs merged commit 9a9e915 into main Jun 24, 2026
8 checks passed
@stxkxs stxkxs deleted the addon-scrape-annotations branch June 24, 2026 02:11