feat(dashboards): operator reconcile RED + latency SLO dashboard & alerts by stxkxs · Pull Request #49 · nanohype/eks-gitops

stxkxs · 2026-06-23T02:16:14Z

What

Charts the eks-agent-platform operator's own control loop and gives its latency SLO teeth in prod. The audit found the operator's reconcile RED was alert-only (never visualized) and there was no SLO/error-budget board. This adds both — self-contained over the controller-runtime metrics that now reach AMP via the operator pod's scrape annotation (eks-agent-platform#46).

Stacked on portal-sre-dashboard (base), which introduces the shared slo-alerts folder + pinned datasource UID. Retarget to main once that merges.

Dashboard — `dashboards/base/platform/agent-operator.yaml`

GrafanaDashboard CR, four rows:

- Reconcile SLO & error budget (99% of reconciles <1s / 30d): inline fraction-under-1s SLI, budget remaining, fast/slow burn (1h/6h).
- Reconcile RED per controller: rate, error ratio, latency p50/p95/p99, p99-by-controller with the 1s SLO line.
- Work queue & workers: depth, add rate, queue-wait p95, active workers.
- Reconciled fleet: Platforms Ready ratio + CRs by kind & phase (kube_customresource_*).
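For reviewers who want the arithmetic behind the first row, here is a minimal sketch of the SLI, budget-remaining, and burn-rate calculations. It assumes the SLI is the increase of the le="1" bucket divided by the increase of the histogram count over the same window; `ReconcileCounts` and the function names are hypothetical and are not taken from the dashboard JSON.

```python
from dataclasses import dataclass

SLO_TARGET = 0.99  # 99% of reconciles under 1s, per the row title


@dataclass
class ReconcileCounts:
    """Reconcile counts over one window (hypothetical container)."""
    under_1s: float  # increase of the le="1" bucket over the window
    total: float     # increase of the histogram _count over the window


def sli(c: ReconcileCounts) -> float:
    """Fraction of reconciles that finished in under 1s."""
    if c.total == 0:
        return 1.0  # no reconciles: treated as meeting the SLO (an assumption)
    return c.under_1s / c.total


def burn_rate(c: ReconcileCounts, target: float = SLO_TARGET) -> float:
    """Observed slow fraction divided by the slow fraction the SLO allows."""
    return (1.0 - sli(c)) / (1.0 - target)


def budget_remaining(c: ReconcileCounts, target: float = SLO_TARGET) -> float:
    """Share of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    return 1.0 - burn_rate(c, target)
```

With 9,950 of 10,000 reconciles under 1s, the SLI is 99.5%, the burn rate is 0.5, and half of the budget remains.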

Alerting — `dashboards/base/alerting/agent-operator.yaml`

Grafana-managed (evaluated by AMG): dual-window latency burn (14.4× fast / 6× slow, page), reconcile error rate >5% (page), operator-metrics-absent (page). Each links its runbook (reconcile-latency, reconcile-errors, operator-down). This is the prod path; the operator chart's PrometheusRule is the kube-prometheus-stack mirror for kx — the header documents the split.
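A rough sketch of what the two burn thresholds mean, assuming the fast rule looks at a 1h window, the slow rule at a 6h window, and either one pages on its own. The actual rule expressions live in the alert file and may combine windows differently; all names below are hypothetical.

```python
SLO_TARGET = 0.99           # 99% of reconciles under 1s
SLO_WINDOW_HOURS = 30 * 24  # 30d SLO window

FAST_BURN = 14.4  # threshold for the 1h window
SLOW_BURN = 6.0   # threshold for the 6h window


def burn_rate(slow_fraction: float, target: float = SLO_TARGET) -> float:
    """Observed fraction of slow reconciles over the fraction the SLO allows."""
    return slow_fraction / (1.0 - target)


def budget_consumed(rate: float, window_hours: float) -> float:
    """Share of the 30d error budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / SLO_WINDOW_HOURS


def should_page(slow_fraction_1h: float, slow_fraction_6h: float) -> bool:
    """Page when either window burns faster than its threshold (assumed OR)."""
    return (
        burn_rate(slow_fraction_1h) > FAST_BURN
        or burn_rate(slow_fraction_6h) > SLOW_BURN
    )
```

Under these assumptions, 14.4× sustained for 1h spends 14.4 × 1 / 720 = 2% of the 30d budget, and 6× sustained for 6h spends 6 × 6 / 720 = 5%.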

Verification

Independently quality-checked (graded A/A−): not hollow. Every metric family (controller_runtime_*, workqueue_*, kube_customresource_*) was confirmed to exist and to reach AMP via the Cilium-NP-allowed monitoring-namespace scrape; the le="1" bucket and the workqueue name label were verified present in the pinned controller-runtime version; the SLO/burn math is correct. kustomize build + yamllint green. Runbook links and the dual-path note were added in response to the quality check.
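The le="1" check matters because the fraction-under-1s SLI can only be read directly from a bucket whose upper bound is exactly 1s. A minimal sketch of that check against scraped exposition text follows. The sample values and the `platform` controller name are invented for illustration, and the metric name is assumed to be controller-runtime's `controller_runtime_reconcile_time_seconds` histogram.

```python
import re

# Illustrative scrape output; the values and controller name are made up.
SAMPLE = """\
controller_runtime_reconcile_time_seconds_bucket{controller="platform",le="0.5"} 90
controller_runtime_reconcile_time_seconds_bucket{controller="platform",le="1"} 97
controller_runtime_reconcile_time_seconds_bucket{controller="platform",le="+Inf"} 100
controller_runtime_reconcile_time_seconds_count{controller="platform"} 100
"""

BUCKET_RE = re.compile(
    r"^controller_runtime_reconcile_time_seconds_bucket"
    r"\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def buckets_by_controller(text: str) -> dict[str, dict[str, float]]:
    """Map controller -> {le: cumulative count} from exposition-format text."""
    out: dict[str, dict[str, float]] = {}
    for line in text.splitlines():
        m = BUCKET_RE.match(line.strip())
        if not m:
            continue  # skips _count, _sum, comments, and other metrics
        labels = dict(LABEL_RE.findall(m.group("labels")))
        controller = labels.get("controller", "")
        out.setdefault(controller, {})[labels["le"]] = float(m.group("value"))
    return out


def fraction_under_1s(buckets: dict[str, float]) -> float:
    """SLI for one controller; fails loudly if the 1s bucket is missing."""
    if "1" not in buckets:
        raise KeyError('no le="1" bucket; the 1s SLI cannot be read directly')
    total = buckets["+Inf"]
    return buckets["1"] / total if total else 1.0
```

Calling `fraction_under_1s(buckets_by_controller(SAMPLE)["platform"])` gives the sample's SLI; a histogram without a 1s boundary raises instead of silently interpolating.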

…erts

Charts the eks-agent-platform operator's own control loop and gives its latency SLO teeth in prod. The audit found the operator's reconcile RED was alert-only (never visualized) with no SLO/error-budget board; this adds both, self-contained over the controller-runtime metrics that now reach AMP via the operator pod's scrape annotation (eks-agent-platform#46).

dashboards/base/platform/agent-operator.yaml: GrafanaDashboard CR, four rows:

- Reconcile SLO & error budget (99% of reconciles <1s / 30d): inline fraction-under-1s SLI, budget remaining, and fast/slow burn (1h/6h) stats.
- Reconcile RED per controller: rate, error ratio, latency p50/p95/p99, and p99 by controller with the 1s SLO line.
- Work queue & workers: depth, add rate, queue-wait p95, active workers.
- Reconciled fleet: Platforms Ready ratio + CRs by kind & phase (kube_customresource).

dashboards/base/alerting/agent-operator.yaml: GrafanaAlertRuleGroup (Grafana-managed, evaluated by AMG): dual-window latency burn (14.4x fast / 6x slow, page), reconcile error rate >5% (page), and operator-metrics-absent (page). Each links its runbook. This is the prod path; the operator chart's PrometheusRule is the kube-prometheus-stack mirror for kx; the header documents the split.

Registered in kustomization; .yamllint embedded-JSON ignore generalized to the agent-* glob + the new alert file. kustomize build green.

stxkxs deleted the branch portal-sre-dashboard June 23, 2026 17:55

stxkxs closed this Jun 23, 2026

stxkxs mentioned this pull request Jun 23, 2026

feat(dashboards): operator reconcile RED + latency SLO dashboard & alerts #52

Merged

stxkxs wants to merge 1 commit into portal-sre-dashboard from agent-operator-dashboard
