feat(dashboards): operator reconcile RED + latency SLO dashboard & alerts #49
Closed
stxkxs wants to merge 1 commit into
…erts

Charts the eks-agent-platform operator's own control loop and gives its latency SLO teeth in prod. The audit found the operator's reconcile RED was alert-only (never visualized) with no SLO/error-budget board; this adds both, self-contained over the controller-runtime metrics that now reach AMP via the operator pod's scrape annotation (eks-agent-platform#46).

dashboards/base/platform/agent-operator.yaml: GrafanaDashboard CR, four rows:
- Reconcile SLO & error budget (99% of reconciles <1s / 30d): inline fraction-under-1s SLI, budget remaining, and fast/slow burn (1h/6h) stats.
- Reconcile RED per controller: rate, error ratio, latency p50/p95/p99, and p99 by controller with the 1s SLO line.
- Work queue & workers: depth, add rate, queue-wait p95, active workers.
- Reconciled fleet: Platforms Ready ratio + CRs by kind & phase (kube_customresource).

dashboards/base/alerting/agent-operator.yaml: GrafanaAlertRuleGroup (Grafana-managed, evaluated by AMG): dual-window latency burn (14.4x fast / 6x slow, page), reconcile error rate >5% (page), and operator-metrics-absent (page). Each links its runbook. This is the prod path; the operator chart's PrometheusRule is the kube-prometheus-stack mirror for kx, and the header documents the split.

Registered in kustomization; .yamllint embedded-JSON ignore generalized to the agent-* glob + the new alert file. kustomize build green.
What
Charts the eks-agent-platform operator's own control loop and gives its latency SLO teeth in prod. The audit found the operator's reconcile RED was alert-only (never visualized) and there was no SLO/error-budget board. This adds both — self-contained over the controller-runtime metrics that now reach AMP via the operator pod's scrape annotation (eks-agent-platform#46).
Dashboard: dashboards/base/platform/agent-operator.yaml
GrafanaDashboard CR, four rows:
- Reconcile SLO & error budget (99% of reconciles <1s / 30d): inline fraction-under-1s SLI, budget remaining, and fast/slow burn (1h/6h) stats.
- Reconcile RED per controller: rate, error ratio, latency p50/p95/p99, and p99 by controller with the 1s SLO line.
- Work queue & workers: depth, add rate, queue-wait p95, active workers.
- Reconciled fleet: Platforms Ready ratio + CRs by kind & phase (kube_customresource_*).

Alerting: dashboards/base/alerting/agent-operator.yaml
Grafana-managed (evaluated by AMG): dual-window latency burn (14.4× fast / 6× slow, page), reconcile error rate >5% (page), operator-metrics-absent (page). Each links its runbook (reconcile-latency, reconcile-errors, operator-down). This is the prod path; the operator chart's PrometheusRule is the kube-prometheus-stack mirror for kx, and the header documents the split.

Verification
Independently quality-checked (graded A/A−): not hollow. Every metric (controller_runtime_*, workqueue_*, kube_customresource_*) was confirmed real and confirmed to reach AMP via the cilium-NP-allowed monitoring-namespace scrape; the le="1" bucket and the workqueue name label were verified present in the pinned controller-runtime version; the SLO/burn math is correct. kustomize build + yamllint green. Runbook links and the dual-path note were added in response to the quality-check.
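For reviewers who want to sanity-check the SLO/burn math themselves, here is a minimal Python sketch of the arithmetic behind the SLO row and the burn alerts. It is not taken from the dashboard or alert YAML. The function names are hypothetical, and treating the two burn conditions as independent page triggers is an assumption about how the rule group combines them. Only the 99% target, the 30d window, the 1h/6h windows and the 14.4x/6x thresholds come from this PR.

```python
"""Sketch of the SLO arithmetic; names are illustrative, not from the PR's YAML."""

SLO_TARGET = 0.99              # 99% of reconciles under 1s (from the PR)
WINDOW_HOURS = 30 * 24         # 30-day SLO window (from the PR)
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of reconciles allowed to be slow

FAST_BURN = 14.4               # 1h threshold (from the PR)
SLOW_BURN = 6.0                # 6h threshold (from the PR)


def sli(fast: float, total: float) -> float:
    """Fraction of reconciles that finished under 1s.

    `fast` stands in for the le="1" histogram bucket count and `total`
    for the histogram count. No traffic is treated as meeting the SLO
    (an assumption).
    """
    return 1.0 if total == 0 else fast / total


def burn_rate(fast: float, total: float) -> float:
    """Budget spend speed relative to spending it exactly over the window."""
    return (1 - sli(fast, total)) / ERROR_BUDGET


def budget_remaining(fast: float, total: float) -> float:
    """Unspent share of the error budget; negative once it is overspent.

    Only meaningful when the counts cover the whole SLO window.
    """
    return 1 - burn_rate(fast, total)


def budget_spent_by(burn: float, hours: float) -> float:
    """Share of the 30-day budget consumed if `burn` holds for `hours`."""
    return burn * hours / WINDOW_HOURS


def should_page(burn_1h: float, burn_6h: float) -> bool:
    """Assumed combination: either window over its threshold pages."""
    return burn_1h > FAST_BURN or burn_6h > SLOW_BURN
```

Under these definitions, a 14.4x burn sustained for 1h spends about 2% of the 30-day budget, and a 6x burn sustained for 6h spends about 5%. That is the usual reasoning for pairing those thresholds with those windows.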