feat(operators): scrape the operator's reconcile metrics into AMP in prod by stxkxs · Pull Request #46 · nanohype/eks-agent-platform

stxkxs · 2026-06-23T02:15:40Z

What

The operator's controller-runtime metrics (reconcile rate/errors/latency, workqueue depth/latency) were only reachable via the ServiceMonitor, which is consumed by kube-prometheus-stack (present on the local kx cluster, not on the EKS clusters). In prod there is no prometheus-operator; the Grafana Agent scrapes annotated pods and remote-writes to Amazon Managed Prometheus (AMP). So the reconcile RED metrics (rate, errors, duration) never reached the metrics backend in prod and could not be charted; they were alert-only, and only on kx.

Change

Adds prometheus.io/scrape + /port + /path pod annotations to the operator Deployment, gated behind metrics.enabled && metrics.podScrapeAnnotations (new value, default true).
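As a rough sketch of what this gating could look like in the Deployment template (not the actual diff): the values keys metrics.enabled and metrics.podScrapeAnnotations come from this PR and port 8080 comes from the description of the metrics server, while the file path, the placeholder label, the /metrics path, and the hard-coded port are assumptions made for illustration.

```yaml
# charts/operator/templates/deployment.yaml (hypothetical excerpt, not the real diff)
spec:
  template:
    metadata:
      labels:
        app.kubernetes.io/name: operator   # placeholder label, assumed
      {{- if and .Values.metrics.enabled .Values.metrics.podScrapeAnnotations }}
      annotations:
        # Opt the pod in to annotation-based scraping by the Grafana Agent.
        prometheus.io/scrape: "true"
        # 8080 is the plain-HTTP metrics port named in this PR; the chart
        # may well template this from a value instead of hard-coding it.
        prometheus.io/port: "8080"
        # Assumed: controller-runtime's default metrics path.
        prometheus.io/path: "/metrics"
      {{- end }}
```

With a template of this shape, rendering with `--set metrics.podScrapeAnnotations=false` would drop the whole annotations block, which is the behaviour the helm template check in this PR looks for.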

The operator NetworkPolicy already allows ingress to the metrics port from the monitoring namespace, where the Grafana Agent runs — no policy change needed.
The agent stamps the namespace label that the existing recording rules (charts/operator/files/slo/prometheusrule.yaml) and the new operator dashboard filter on (namespace="eks-agent-platform").
The metrics server serves plain HTTP on :8080 (no secure-serving), so annotation scraping works without auth.
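For context, the three points above line up with the conventional annotation-driven pod scrape job that Prometheus-compatible agents use. The agent's real configuration is not part of this PR, so the following is only a sketch of the common pattern; the job name is hypothetical, and the relabel rules are the widely used ones built on Kubernetes service discovery meta labels.

```yaml
# Sketch of an annotation-based pod scrape job (assumed; the agent's
# actual config is not shown in this PR).
scrape_configs:
  - job_name: kubernetes-pods          # hypothetical job name
    scheme: http                       # metrics server is plain HTTP, no auth
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path when it is set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
      # Rewrite the scrape address to use the prometheus.io/port value.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Stamp the pod's namespace as the `namespace` label that the
      # recording rules and dashboard filter on.
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
```

Under a job like this, the operator pod is picked up purely because of its annotations, and every series it exposes arrives with namespace="eks-agent-platform" attached.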

The ServiceMonitor stays for kube-prometheus-stack clusters; the two scrape paths are complementary.
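Because both scrape paths attach the same namespace label, rules that select on namespace="eks-agent-platform" work against either backend. The contents of charts/operator/files/slo/prometheusrule.yaml are not shown in this PR, so the rule group below is only an illustration of that kind of filter: the group and record names are hypothetical, while the metric names are the standard controller-runtime reconcile metrics.

```yaml
# Illustrative rule group (hypothetical names; not the chart's actual rules).
groups:
  - name: operator-reconcile-red       # hypothetical group name
    rules:
      # Fraction of reconciles that returned an error over the last 5 minutes.
      - record: operator:reconcile_errors:ratio_rate5m
        expr: |
          sum(rate(controller_runtime_reconcile_errors_total{namespace="eks-agent-platform"}[5m]))
          /
          sum(rate(controller_runtime_reconcile_total{namespace="eks-agent-platform"}[5m]))
      # p99 reconcile duration over the last 5 minutes.
      - record: operator:reconcile_time_seconds:p99_5m
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (
              rate(controller_runtime_reconcile_time_seconds_bucket{namespace="eks-agent-platform"}[5m])
            )
          )
```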

Verification

helm template confirms the three annotations render when enabled and are absent when podScrapeAnnotations=false; helm lint clean. Independently quality-checked: the scrape path (cilium NP → monitoring namespace → agent → AMP) and the namespace label injection were verified end-to-end.

Pairs with the agent-operator dashboard + Grafana-managed alert group in eks-gitops (reconcile RED + latency SLO/error-budget).

…prod

The operator's controller-runtime metrics (reconcile rate/errors/latency, workqueue depth/latency) were only reachable via the ServiceMonitor, which is consumed by kube-prometheus-stack (present on the local kx cluster but not on the EKS clusters, where the Grafana Agent scrapes pod annotations and remote-writes to Amazon Managed Prometheus). So in prod the reconcile RED never reached the metrics backend and could not be charted (alert-only, and only on kx).

Adds prometheus.io/scrape + /port + /path pod annotations to the operator Deployment, gated behind metrics.enabled && metrics.podScrapeAnnotations (new value, default true). The operator NetworkPolicy already allows ingress to the metrics port from the monitoring namespace, where the Grafana Agent runs, so no policy change is needed; the agent stamps the namespace label the recording rules and the new operator dashboard filter on.

The ServiceMonitor stays for kube-prometheus-stack clusters; the two scrape paths are complementary. Pairs with the agent-operator dashboard + Grafana-managed alert group in eks-gitops.

stxkxs wants to merge 1 commit into main from operator-prod-scrape
