
feat(operators): scrape the operator's reconcile metrics into AMP in prod #46

Open: stxkxs wants to merge 1 commit into main from operator-prod-scrape



@stxkxs stxkxs commented Jun 23, 2026

Member

What

The operator's controller-runtime metrics (reconcile rate/errors/latency, workqueue depth/latency) were only reachable via the ServiceMonitor, which is consumed by kube-prometheus-stack (present on the local kx cluster, not on the EKS clusters). In prod there is no prometheus-operator; the Grafana Agent scrapes annotated pods and remote-writes to Amazon Managed Prometheus (AMP). So the reconcile RED (rate, errors, duration) metrics never reached the metrics backend in prod and could not be charted; they were alert-only, and only on kx.

Change

Adds the prometheus.io/scrape, prometheus.io/port, and prometheus.io/path pod annotations to the operator Deployment, gated behind metrics.enabled && metrics.podScrapeAnnotations (a new value, default true).
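
For illustration, the gating could look roughly like the excerpt below. This is a sketch, not the actual diff: the template file path and the /metrics path value are assumptions (/metrics is the controller-runtime default), while the value names and port 8080 come from this description.

```yaml
# charts/operator/templates/deployment.yaml (hypothetical path, excerpt only)
spec:
  template:
    metadata:
      # Rendered only when both values are true; the block is absent otherwise.
      {{- if and .Values.metrics.enabled .Values.metrics.podScrapeAnnotations }}
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"      # plain-HTTP metrics port named in this PR
        prometheus.io/path: "/metrics"  # assumed; controller-runtime default
      {{- end }}
```

Annotation values are quoted because Kubernetes requires annotation values to be strings.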

  • The operator NetworkPolicy already allows ingress to the metrics port from the monitoring namespace, where the Grafana Agent runs — no policy change needed.
  • The agent stamps the namespace label that the existing recording rules (charts/operator/files/slo/prometheusrule.yaml) and the new operator dashboard filter on (namespace="eks-agent-platform").
  • The metrics server serves plain HTTP on :8080 (no secure-serving), so annotation scraping works without auth.
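
For context, annotation-based discovery on the agent side typically follows the standard Prometheus pod relabeling pattern sketched below. The agent's real configuration is not part of this PR, so the job name and the exact rules are assumptions; the sketch only shows how the three annotations are consumed and how the namespace label gets attached.

```yaml
# Hypothetical scrape job in Prometheus scrape_config syntax (not from this PR)
scrape_configs:
  - job_name: kubernetes-pods          # assumed name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path when it is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: '(.+)'
        target_label: __metrics_path__
      # Rewrite the scrape address to use the prometheus.io/port value
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      # Stamp the pod's namespace as the namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
```

With rules of this shape, series scraped from the operator pod would carry namespace="eks-agent-platform", which is what the recording rules and dashboard filter on.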

The ServiceMonitor stays for kube-prometheus-stack clusters; the two scrape paths are complementary.

Verification

helm template confirms the three annotations render when enabled and are absent when podScrapeAnnotations=false; helm lint is clean. The scrape path (Cilium NetworkPolicy → monitoring namespace → agent → AMP) and the namespace label injection were independently verified end-to-end.

Pairs with the agent-operator dashboard + Grafana-managed alert group in eks-gitops (reconcile RED + latency SLO/error-budget).

…prod

The operator's controller-runtime metrics (reconcile rate/errors/latency,
workqueue depth/latency) were only reachable via the ServiceMonitor, which is
consumed by kube-prometheus-stack — present on the local kx cluster but not on
the EKS clusters, where the Grafana Agent scrapes pod annotations and remote-
writes to Amazon Managed Prometheus. So in prod the reconcile RED never reached
the metrics backend and could not be charted (alert-only, and only on kx).

Adds prometheus.io/scrape + /port + /path pod annotations to the operator
Deployment, gated behind metrics.enabled && metrics.podScrapeAnnotations (new
value, default true). The operator NetworkPolicy already allows ingress to the
metrics port from the monitoring namespace, where the Grafana Agent runs, so no
policy change is needed; the agent stamps the namespace label the recording
rules and the new operator dashboard filter on.

The ServiceMonitor stays for kube-prometheus-stack clusters; the two scrape
paths are complementary. Pairs with the agent-operator dashboard + Grafana-
managed alert group in eks-gitops.