Skip to content

feat(dashboards): operator reconcile RED + latency SLO dashboard & alerts#49

Closed
stxkxs wants to merge 1 commit into
portal-sre-dashboardfrom
agent-operator-dashboard
Closed

feat(dashboards): operator reconcile RED + latency SLO dashboard & alerts#49
stxkxs wants to merge 1 commit into
portal-sre-dashboardfrom
agent-operator-dashboard

Conversation

@stxkxs

@stxkxs stxkxs commented Jun 23, 2026

Copy link
Copy Markdown
Member

What

Charts the eks-agent-platform operator's own control loop and gives its latency SLO teeth in prod. The audit found the operator's reconcile RED was alert-only (never visualized) and there was no SLO/error-budget board. This adds both — self-contained over the controller-runtime metrics that now reach AMP via the operator pod's scrape annotation (eks-agent-platform#46).

Stacked on portal-sre-dashboard (base), which introduces the shared slo-alerts folder + pinned datasource UID. Retarget to main once that merges.

Dashboard — dashboards/base/platform/agent-operator.yaml

GrafanaDashboard CR, four rows:

  • Reconcile SLO & error budget (99% of reconciles <1s / 30d) — inline fraction-under-1s SLI, budget remaining, fast/slow burn (1h/6h).
  • Reconcile RED per controller — rate, error ratio, latency p50/p95/p99, p99-by-controller with the 1s SLO line.
  • Work queue & workers — depth, add rate, queue-wait p95, active workers.
  • Reconciled fleet — Platforms Ready ratio + CRs by kind & phase (kube_customresource_*).

Alerting — dashboards/base/alerting/agent-operator.yaml

Grafana-managed (evaluated by AMG): dual-window latency burn (14.4× fast / 6× slow, page), reconcile error rate >5% (page), operator-metrics-absent (page). Each links its runbook (reconcile-latency, reconcile-errors, operator-down). This is the prod path; the operator chart's PrometheusRule is the kube-prometheus-stack mirror for kx — the header documents the split.

Verification

Independently quality-checked (graded A/A−): not hollow — every metric (controller_runtime_, workqueue_, kube_customresource_*) confirmed real and confirmed to reach AMP via the cilium-NP-allowed monitoring-namespace scrape; le="1" bucket and workqueue name label verified present in the pinned controller-runtime version; SLO/burn math correct. kustomize build + yamllint green. Runbook links and the dual-path note were added in response to the quality-check.

…erts

Charts the eks-agent-platform operator's own control loop and gives its latency
SLO teeth in prod. The audit found the operator's reconcile RED was alert-only
(never visualized) with no SLO/error-budget board; this adds both, self-contained
over the controller-runtime metrics that now reach AMP via the operator pod's
scrape annotation (eks-agent-platform#46).

dashboards/base/platform/agent-operator.yaml — GrafanaDashboard CR, four rows:
- Reconcile SLO & error budget (99% of reconciles <1s / 30d): inline fraction-
  under-1s SLI, budget remaining, and fast/slow burn (1h/6h) stats.
- Reconcile RED per controller: rate, error ratio, latency p50/p95/p99, and p99
  by controller with the 1s SLO line.
- Work queue & workers: depth, add rate, queue-wait p95, active workers.
- Reconciled fleet: Platforms Ready ratio + CRs by kind & phase (kube_customresource).

dashboards/base/alerting/agent-operator.yaml — GrafanaAlertRuleGroup (Grafana-
managed, evaluated by AMG): dual-window latency burn (14.4x fast / 6x slow,
page), reconcile error rate >5% (page), and operator-metrics-absent (page). Each
links its runbook. This is the prod path; the operator chart's PrometheusRule is
the kube-prometheus-stack mirror for kx — the header documents the split.

Registered in kustomization; .yamllint embedded-JSON ignore generalized to the
agent-* glob + the new alert file. kustomize build green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant