feat(dashboards): operator reconcile RED + latency SLO dashboard & alerts#52
Merged
…erts

Charts the eks-agent-platform operator's own control loop and gives its latency SLO teeth in prod. The audit found the operator's reconcile RED was alert-only (never visualized) with no SLO/error-budget board; this adds both, self-contained over the controller-runtime metrics that now reach AMP via the operator pod's scrape annotation (eks-agent-platform#46).

dashboards/base/platform/agent-operator.yaml: GrafanaDashboard CR, four rows:
- Reconcile SLO & error budget (99% of reconciles <1s / 30d): inline fraction-under-1s SLI, budget remaining, and fast/slow burn (1h/6h) stats.
- Reconcile RED per controller: rate, error ratio, latency p50/p95/p99, and p99 by controller with the 1s SLO line.
- Work queue & workers: depth, add rate, queue-wait p95, active workers.
- Reconciled fleet: Platforms Ready ratio + CRs by kind & phase (kube_customresource).

dashboards/base/alerting/agent-operator.yaml: GrafanaAlertRuleGroup (Grafana-managed, evaluated by AMG) with dual-window latency burn (14.4x fast / 6x slow, page), reconcile error rate >5% (page), and operator-metrics-absent (page). Each links its runbook. This is the prod path; the operator chart's PrometheusRule is the kube-prometheus-stack mirror for kx. The header documents the split.

Registered in kustomization; .yamllint embedded-JSON ignore generalized to the agent-* glob + the new alert file. kustomize build green.
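For reviewers less familiar with the SLO row, here is a minimal Python sketch of the arithmetic behind the "fraction-under-1s SLI" and "budget remaining" stats. It is not the dashboard's actual query: the function names are hypothetical, the inputs stand in for a cumulative histogram bucket count at 1s and the total reconcile count over the window, and treating a window with no reconciles as fully compliant is an assumption.

```python
SLO_TARGET = 0.99  # from the PR: 99% of reconciles complete in under 1s


def sli_under_threshold(count_le_1s: float, count_total: float) -> float:
    """Fraction of reconciles in the window that finished within 1s."""
    if count_total == 0:
        # Assumption: no reconciles in the window counts as compliant.
        return 1.0
    return count_le_1s / count_total


def budget_remaining(sli: float, target: float = SLO_TARGET) -> float:
    """Share of the error budget still unspent (1.0 = untouched, <0 = overspent)."""
    error_budget = 1.0 - target   # allowed fraction of slow reconciles
    slow_fraction = 1.0 - sli     # observed fraction of slow reconciles
    return 1.0 - slow_fraction / error_budget


if __name__ == "__main__":
    # Illustrative numbers only: 9,950 of 10,000 reconciles under 1s.
    sli = sli_under_threshold(9_950, 10_000)
    print(f"SLI={sli:.4f} budget_remaining={budget_remaining(sli):.2%}")
```

With these illustrative inputs the SLI sits above the 99% target but roughly half of the budget is already spent, which is the situation the budget-remaining stat is meant to surface.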
CI Results
All validations passed.
Recreated against main after #48 merged (GitHub closed the original #49 when the stacked base branch portal-sre-dashboard was deleted). Same reviewed change, rebased to a single clean commit.

What
Charts the eks-agent-platform operator's own control loop and gives its latency SLO teeth in prod — self-contained over the controller-runtime metrics that reach AMP via the operator pod's scrape annotation (merged in eks-agent-platform).
- dashboards/base/platform/agent-operator.yaml: GrafanaDashboard CR with reconcile RED (rate / errors / latency p50/p95/p99 by controller), work-queue depth/wait/workers, a latency SLO row (99% of reconciles <1s / 30d), and a reconciled-fleet (kube_customresource) row.
- dashboards/base/alerting/agent-operator.yaml: Grafana-managed alerts for dual-window latency burn, reconcile error rate, and operator-metrics-absent, each runbook-linked. The operator chart's PrometheusRule remains the kube-prometheus-stack mirror for kx.

Quality-checked A/A− (not hollow; metrics verified to reach AMP).
kustomize build + yamllint green.
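As a rough illustration of why the burn thresholds are 14.4x and 6x, the Python sketch below works through the burn-rate arithmetic for a 99% / 30d SLO. The function names are hypothetical, and the paging condition (either window exceeding its threshold pages on its own) is an assumption; the actual rule expressions live in the alert file and are not reproduced here.

```python
SLO_TARGET = 0.99        # 99% of reconciles under 1s
SLO_WINDOW_HOURS = 30 * 24

FAST_WINDOW_HOURS, FAST_THRESHOLD = 1, 14.4   # fast burn, from the PR
SLOW_WINDOW_HOURS, SLOW_THRESHOLD = 6, 6.0    # slow burn, from the PR


def burn_rate(slow_fraction: float, target: float = SLO_TARGET) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return slow_fraction / (1.0 - target)


def budget_consumed(rate: float, window_hours: float) -> float:
    """Share of the 30d budget spent if `rate` is sustained for `window_hours`."""
    return rate * window_hours / SLO_WINDOW_HOURS


def should_page(slow_fraction_1h: float, slow_fraction_6h: float) -> bool:
    # Assumption: each window pages independently when over its threshold.
    return (burn_rate(slow_fraction_1h) > FAST_THRESHOLD
            or burn_rate(slow_fraction_6h) > SLOW_THRESHOLD)


if __name__ == "__main__":
    print(f"fast: {budget_consumed(FAST_THRESHOLD, FAST_WINDOW_HOURS):.1%} of budget in 1h")
    print(f"slow: {budget_consumed(SLOW_THRESHOLD, SLOW_WINDOW_HOURS):.1%} of budget in 6h")
    # Illustrative: 20% of reconciles slower than 1s over the last hour.
    print("page:", should_page(slow_fraction_1h=0.20, slow_fraction_6h=0.04))
```

Sustained at threshold, the fast window spends about 2% of the 30-day budget in one hour and the slow window about 5% in six hours, which is the usual rationale for pairing a high short-window threshold with a lower long-window one.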