feat(operators): scrape the operator's reconcile metrics into AMP in prod #46
Open
stxkxs wants to merge 1 commit into
…prod

The operator's controller-runtime metrics (reconcile rate/errors/latency, workqueue depth/latency) were only reachable via the ServiceMonitor, which is consumed by kube-prometheus-stack. That stack is present on the local kx cluster but not on the EKS clusters, where the Grafana Agent scrapes pod annotations and remote-writes to Amazon Managed Prometheus. So in prod the reconcile RED never reached the metrics backend and could not be charted (alert-only, and only on kx).

Adds prometheus.io/scrape + /port + /path pod annotations to the operator Deployment, gated behind metrics.enabled && metrics.podScrapeAnnotations (new value, default true). The operator NetworkPolicy already allows ingress to the metrics port from the monitoring namespace, where the Grafana Agent runs, so no policy change is needed; the agent stamps the namespace label the recording rules and the new operator dashboard filter on.

The ServiceMonitor stays for kube-prometheus-stack clusters; the two scrape paths are complementary. Pairs with the agent-operator dashboard + Grafana-managed alert group in eks-gitops.
What
The operator's controller-runtime metrics (reconcile rate/errors/latency, workqueue depth/latency) were only reachable via the ServiceMonitor, which is consumed by kube-prometheus-stack (present on the local `kx` cluster, not on the EKS clusters). In prod there is no prometheus-operator: the Grafana Agent scrapes pod annotations and remote-writes to Amazon Managed Prometheus. So the reconcile RED never reached the metrics backend in prod; it was alert-only, and only on `kx`.

Change
Adds `prometheus.io/scrape` + `/port` + `/path` pod annotations to the operator Deployment, gated behind `metrics.enabled && metrics.podScrapeAnnotations` (new value, default `true`).
- The operator NetworkPolicy already allows ingress to the metrics port from the `monitoring` namespace, where the Grafana Agent runs, so no policy change is needed.
- The agent stamps the `namespace` label that the existing recording rules (`charts/operator/files/slo/prometheusrule.yaml`) and the new operator dashboard filter on (`namespace="eks-agent-platform"`).
- Metrics are served on `:8080` (no secure-serving), so annotation scraping works without auth.

The ServiceMonitor stays for kube-prometheus-stack clusters; the two scrape paths are complementary.
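
For reviewers who want to picture the gating, here is a minimal sketch of how the values and the Deployment template could fit together. It is not copied from the chart: the file layout, the hardcoded port string, and the `/metrics` path (controller-runtime's usual default) are assumptions, and the real template may derive the port and path from other values.

```yaml
# values.yaml (excerpt)
metrics:
  enabled: true
  # New value from this PR; defaults to true.
  podScrapeAnnotations: true

# templates/deployment.yaml (excerpt, hypothetical layout)
spec:
  template:
    metadata:
      {{- if and .Values.metrics.enabled .Values.metrics.podScrapeAnnotations }}
      annotations:
        # Annotation values must be strings, hence the quotes.
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"      # the PR states metrics are served on :8080
        prometheus.io/path: "/metrics"  # assumed path, not stated in the PR
      {{- end }}
```

With either flag set to `false` the whole `annotations` block is omitted, which matches the behaviour described under Verification.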
Verification
`helm template` confirms the three annotations render when enabled and are absent when `podScrapeAnnotations=false`; `helm lint` is clean. Independently quality-checked: the scrape path (cilium NP → `monitoring` namespace → agent → AMP) and the `namespace` label injection were verified end-to-end.

Pairs with the agent-operator dashboard + Grafana-managed alert group in eks-gitops (reconcile RED + latency SLO/error-budget).
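
For context on the verified scrape path, annotation-based discovery is conventionally expressed with Prometheus-style relabeling like the sketch below. The Grafana Agent's actual configuration is not part of this PR, so the job name and the exact rules are assumptions; only the annotation names and the `namespace` label come from the description above.

```yaml
scrape_configs:
  - job_name: kubernetes-pods  # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path when the annotation is set.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
      # Point the scrape address at the port from prometheus.io/port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Stamp the pod's namespace as the `namespace` label used by the
      # recording rules and the dashboard filter.
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
```

Under rules of this shape, a pod without the `prometheus.io/scrape` annotation is dropped at the `keep` step, which is why the operator was invisible to the agent before this change.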