Skip to content

feat(dashboards): operator reconcile RED + latency SLO dashboard & alerts#52

Merged
stxkxs merged 1 commit into
mainfrom
agent-operator-dashboard
Jun 23, 2026
Merged

feat(dashboards): operator reconcile RED + latency SLO dashboard & alerts#52
stxkxs merged 1 commit into
mainfrom
agent-operator-dashboard

Conversation

@stxkxs

@stxkxs stxkxs commented Jun 23, 2026

Copy link
Copy Markdown
Member

Recreated against main after #48 merged (GitHub closed the original #49 when the stacked base branch portal-sre-dashboard was deleted). Same reviewed change, rebased to a single clean commit.

What

Charts the eks-agent-platform operator's own control loop and gives its latency SLO teeth in prod — self-contained over the controller-runtime metrics that reach AMP via the operator pod's scrape annotation (merged in eks-agent-platform).

  • dashboards/base/platform/agent-operator.yaml — GrafanaDashboard CR: reconcile RED (rate / errors / latency p50/p95/p99 by controller), work-queue depth/wait/workers, a latency SLO row (99% of reconciles <1s / 30d), and a reconciled-fleet (kube_customresource) row.
  • dashboards/base/alerting/agent-operator.yaml — Grafana-managed dual-window latency burn, reconcile error rate, and operator-metrics-absent — each runbook-linked. The operator chart's PrometheusRule remains the kube-prometheus-stack mirror for kx.

Quality-checked A/A− (not hollow; metrics verified to reach AMP). kustomize build + yamllint green.

…erts

Charts the eks-agent-platform operator's own control loop and gives its latency
SLO teeth in prod. The audit found the operator's reconcile RED was alert-only
(never visualized) with no SLO/error-budget board; this adds both, self-contained
over the controller-runtime metrics that now reach AMP via the operator pod's
scrape annotation (eks-agent-platform#46).

dashboards/base/platform/agent-operator.yaml — GrafanaDashboard CR, four rows:
- Reconcile SLO & error budget (99% of reconciles <1s / 30d): inline fraction-
  under-1s SLI, budget remaining, and fast/slow burn (1h/6h) stats.
- Reconcile RED per controller: rate, error ratio, latency p50/p95/p99, and p99
  by controller with the 1s SLO line.
- Work queue & workers: depth, add rate, queue-wait p95, active workers.
- Reconciled fleet: Platforms Ready ratio + CRs by kind & phase (kube_customresource).

dashboards/base/alerting/agent-operator.yaml — GrafanaAlertRuleGroup (Grafana-
managed, evaluated by AMG): dual-window latency burn (14.4x fast / 6x slow,
page), reconcile error rate >5% (page), and operator-metrics-absent (page). Each
links its runbook. This is the prod path; the operator chart's PrometheusRule is
the kube-prometheus-stack mirror for kx — the header documents the split.

Registered in kustomization; .yamllint embedded-JSON ignore generalized to the
agent-* glob + the new alert file. kustomize build green.
@github-actions

Copy link
Copy Markdown

CI Results

Check Status
YAML Lint
Environment Kustomize Build
dev
staging
production

All validations passed.

@stxkxs stxkxs merged commit 3c89842 into main Jun 23, 2026
7 checks passed
@stxkxs stxkxs deleted the agent-operator-dashboard branch June 23, 2026 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant