feat(addons): make the infra addon dashboards reach AMP - pod scrape annotations (#61)
Merged
The hub's grafana-agent (Alloy) is annotation-gated: it keeps only pods with
prometheus.io/scrape="true" (honoring prometheus.io/port + /path), plus the static
kube-state-metrics + cAdvisor targets. ServiceMonitor is inert in prod (no
prometheus-operator). None of these addon deltas carried a scrape annotation, so
their community id-import dashboards were hollow in prod — the metrics never reached
Amazon Managed Prometheus.
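The gating rule above can be sketched in a few lines of Python. This illustrates annotation-gated discovery in general and is not the hub's actual Alloy config: the function name `scrape_target`, the `/metrics` default path, and dropping pods that carry no port annotation are all assumptions made for the sketch.

```python
from typing import Mapping, Optional

SCRAPE = "prometheus.io/scrape"
PORT = "prometheus.io/port"
PATH = "prometheus.io/path"


def scrape_target(annotations: Mapping[str, str], pod_ip: str) -> Optional[str]:
    """Return the URL a scraper would hit for this pod, or None if the pod is dropped."""
    # Keep rule: only pods explicitly annotated scrape="true" survive.
    if annotations.get(SCRAPE) != "true":
        return None
    port = annotations.get(PORT)
    if port is None:
        # Assumption: this sketch drops pods with no port annotation.
        return None
    # Assumption: fall back to the conventional /metrics path.
    path = annotations.get(PATH, "/metrics")
    return f"http://{pod_ip}:{port}{path}"
```

An unannotated addon pod returns `None` here, which is the "hollow dashboard" case: the pod is running and serving metrics, but nothing ever scrapes it.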
Adds the scrape config to every always-on infra addon that ships a dashboard. Each
chart's exact mechanism + metrics port was helm-template-verified (a wrong key or
port silently re-hollows the board):
- karpenter podAnnotations :8080 (controller)
- external-dns podAnnotations :7979
- external-secrets podAnnotations :8080 (main controller)
- tempo podAnnotations :3200 (self-metrics)
- loki singleBinary.podAnnotations :3100 (self-metrics)
- opencost opencost.podAnnotations :9003 (subchart alias)
- argo-rollouts controller.podAnnotations :8090 (NOT 8080 — that's healthz)
- argo-events controller.podAnnotations :7777 (NOT the 8082 service port)
- argo-workflows controller.metricsConfig.enabled + podAnnotations :9090
(the controller serves no metrics until metricsConfig.enabled)
- cilium prometheus.enabled + operator.prometheus.enabled +
hubble.relay.prometheus.enabled — the chart auto-stamps the
agent/operator/relay pods (:9962/:9963/:9966) when these are on
and serviceMonitor is off (which it is in prod)
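The per-chart verification could be automated along these lines. This is a hedged sketch, not the check used in this PR: `EXPECTED_PORTS` copies the ports from the list above (cilium is left out because the chart stamps those pods itself), `annotation_problems` is a hypothetical helper, and the pod annotations are assumed to have been extracted from `helm template` output already.

```python
from typing import Dict, List, Mapping

# Metrics ports as listed above; cilium is omitted (chart-stamped pods).
EXPECTED_PORTS: Dict[str, str] = {
    "karpenter": "8080",
    "external-dns": "7979",
    "external-secrets": "8080",
    "tempo": "3200",
    "loki": "3100",
    "opencost": "9003",
    "argo-rollouts": "8090",
    "argo-events": "7777",
    "argo-workflows": "9090",
}


def annotation_problems(addon: str, pod_annotations: Mapping[str, str]) -> List[str]:
    """List what is wrong with a rendered pod's scrape annotations (empty if fine)."""
    problems: List[str] = []
    if pod_annotations.get("prometheus.io/scrape") != "true":
        problems.append('missing prometheus.io/scrape="true"')
    want = EXPECTED_PORTS[addon]
    got = pod_annotations.get("prometheus.io/port")
    if got != want:
        problems.append(f"prometheus.io/port is {got!r}, expected {want!r}")
    return problems
```

A check like this would flag the argo-rollouts trap directly: annotating `8080` (healthz) instead of `8090` yields a port mismatch rather than a silently empty board.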
Two addons need no change — cert-manager and aws-load-balancer-controller already
auto-emit the prometheus.io annotations by default (prometheus.enabled, serviceMonitor
off), so their boards were never the gap.
Not covered here: the hubble L7 overview reads hubble_http_* on :9965, which the
cilium chart only annotates on a headless Service — unreachable by the pod-annotation
scrape (a pod can carry only one prometheus.io/port, already used by the agent's
:9962). That needs an Alloy service-discovery scrape rule, tracked separately.
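The one-port limit follows from annotations being a string-to-string map with unique keys. A minimal Python illustration, assuming a hypothetical helper `stamp_port` that mimics adding the annotation to a pod:

```python
from typing import Dict


def stamp_port(annotations: Dict[str, str], port: str) -> Dict[str, str]:
    """Return a copy of the annotations with prometheus.io/port set to `port`."""
    updated = dict(annotations)
    # A map holds one value per key, so this replaces any earlier port.
    updated["prometheus.io/port"] = port
    return updated


agent = stamp_port({"prometheus.io/scrape": "true"}, "9962")  # cilium agent metrics
both = stamp_port(agent, "9965")  # trying to add the hubble port as well
```

After the second call only `9965` remains, so the agent's `:9962` would be lost; a pod annotation cannot advertise both ports at once.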
CI Results
All validations passed.
The hub's grafana-agent (Alloy) is annotation-gated: it scrapes only pods with
prometheus.io/scrape="true" (plus the static KSM + cAdvisor targets); ServiceMonitor
is inert in prod. None of these addon deltas carried a scrape annotation, so their
community id-import dashboards were hollow in prod (metrics never reached AMP).
Adds the scrape config to every always-on infra addon that ships a dashboard. Each
chart's mechanism + port was helm-template-verified (a wrong key/port silently
re-hollows the board), catching real gotchas:
- karpenter: podAnnotations
- external-dns: podAnnotations
- external-secrets: podAnnotations
- tempo: podAnnotations
- loki: singleBinary.podAnnotations
- opencost: opencost.podAnnotations
- argo-rollouts: controller.podAnnotations
- argo-events: controller.podAnnotations
- argo-workflows: controller.metricsConfig.enabled + podAnnotations
- cilium: prometheus.enabled + operator.prometheus.enabled + hubble.relay.prometheus.enabled (auto-stamps pods)
No change needed: cert-manager + aws-load-balancer-controller already auto-emit the
annotations by default, so they were never the gap.
Not covered (tracked separately): the hubble L7 board reads hubble_http_* on :9965,
which cilium only annotates on a headless Service, unreachable by pod-annotation
scrape (a pod carries one prometheus.io/port, already :9962). Needs an Alloy
service-discovery rule.
All 10 parse with correct nesting + yamllint clean.