feat(dashboards): operator reconcile RED + latency SLO dashboard & alerts by stxkxs · Pull Request #52 · nanohype/eks-gitops

stxkxs · 2026-06-23T17:56:56Z

Recreated against main after #48 merged (GitHub closed the original #49 when the stacked base branch portal-sre-dashboard was deleted). Same reviewed change, rebased to a single clean commit.

What

Charts the eks-agent-platform operator's own control loop and gives its latency SLO teeth in prod — self-contained over the controller-runtime metrics that reach AMP via the operator pod's scrape annotation (merged in eks-agent-platform).

dashboards/base/platform/agent-operator.yaml — GrafanaDashboard CR: reconcile RED (rate / errors / latency p50/p95/p99 by controller), work-queue depth/wait/workers, a latency SLO row (99% of reconciles <1s / 30d), and a reconciled-fleet (kube_customresource) row.
dashboards/base/alerting/agent-operator.yaml — Grafana-managed dual-window latency burn, reconcile error rate, and operator-metrics-absent — each runbook-linked. The operator chart's PrometheusRule remains the kube-prometheus-stack mirror for kx.
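
For readers unfamiliar with the SLO row, the arithmetic behind a "fraction under 1s" SLI and its remaining error budget can be sketched as below. This is a minimal illustration, not the dashboard's actual queries: the `WindowCounts` type and function names are hypothetical, and it assumes the counts come from the increase of a reconcile-latency histogram's 1-second bucket and its total count over the SLO window.

```python
from dataclasses import dataclass

SLO_TARGET = 0.99               # from the PR: 99% of reconciles under 1s
ERROR_BUDGET = 1 - SLO_TARGET   # allowed fraction of slow reconciles (1%)


@dataclass
class WindowCounts:
    """Hypothetical container for histogram increases over one window."""
    under_1s: float  # reconciles that finished in <= 1s (assumed le="1" bucket)
    total: float     # all reconciles observed in the same window


def sli(counts: WindowCounts) -> float:
    """Fraction of reconciles under 1s; an empty window counts as compliant."""
    if counts.total == 0:
        return 1.0
    return counts.under_1s / counts.total


def budget_remaining(counts: WindowCounts) -> float:
    """Share of the error budget still unspent (1.0 = untouched, < 0 = overspent)."""
    bad_fraction = 1 - sli(counts)
    return 1 - bad_fraction / ERROR_BUDGET


if __name__ == "__main__":
    window = WindowCounts(under_1s=9950, total=10000)
    print(f"SLI: {sli(window):.4f}")
    print(f"Budget remaining: {budget_remaining(window):.2%}")
```

With 9,950 of 10,000 reconciles under 1s, the SLI is 99.5%, so half of the 1% budget is spent and about half remains.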

Quality-checked A/A− (not hollow; metrics verified to reach AMP). kustomize build + yamllint green.

…erts

Charts the eks-agent-platform operator's own control loop and gives its latency SLO teeth in prod. The audit found the operator's reconcile RED was alert-only (never visualized) with no SLO/error-budget board; this adds both, self-contained over the controller-runtime metrics that now reach AMP via the operator pod's scrape annotation (eks-agent-platform#46).

dashboards/base/platform/agent-operator.yaml — GrafanaDashboard CR, four rows:
- Reconcile SLO & error budget (99% of reconciles <1s / 30d): inline fraction-under-1s SLI, budget remaining, and fast/slow burn (1h/6h) stats.
- Reconcile RED per controller: rate, error ratio, latency p50/p95/p99, and p99 by controller with the 1s SLO line.
- Work queue & workers: depth, add rate, queue-wait p95, active workers.
- Reconciled fleet: Platforms Ready ratio + CRs by kind & phase (kube_customresource).

dashboards/base/alerting/agent-operator.yaml — GrafanaAlertRuleGroup (Grafana-managed, evaluated by AMG): dual-window latency burn (14.4x fast / 6x slow, page), reconcile error rate >5% (page), and operator-metrics-absent (page). Each links its runbook. This is the prod path; the operator chart's PrometheusRule is the kube-prometheus-stack mirror for kx — the header documents the split.

Registered in kustomization; .yamllint embedded-JSON ignore generalized to the agent-* glob + the new alert file. kustomize build green.
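
The 14.4x and 6x burn-rate thresholds follow from simple arithmetic on a 30-day window, sketched here under stated assumptions. The function names are hypothetical, and combining the two windows with a plain OR is an assumption for illustration; the PR does not show how the alert rules evaluate them.

```python
SLO_TARGET = 0.99            # from the PR: 99% of reconciles under 1s
WINDOW_HOURS = 30 * 24       # 30-day SLO window = 720 hours
FAST_BURN = 14.4             # threshold paired with the 1h window
SLOW_BURN = 6.0              # threshold paired with the 6h window


def burn_rate(bad_fraction: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return bad_fraction / (1 - SLO_TARGET)


def budget_consumed(burn: float, hours: float) -> float:
    """Fraction of the 30-day budget spent by sustaining `burn` for `hours`."""
    return burn * hours / WINDOW_HOURS


def should_page(bad_1h: float, bad_6h: float) -> bool:
    """Assumed combination: page when either window exceeds its threshold."""
    return burn_rate(bad_1h) > FAST_BURN or burn_rate(bad_6h) > SLOW_BURN


if __name__ == "__main__":
    print(f"1h at 14.4x spends {budget_consumed(FAST_BURN, 1):.1%} of the budget")
    print(f"6h at 6x spends {budget_consumed(SLOW_BURN, 6):.1%} of the budget")
    print("page:", should_page(bad_1h=0.20, bad_6h=0.03))
```

Under these definitions, an hour at 14.4x spends about 2% of the monthly budget and six hours at 6x spends about 5%. In terms of the SLI, the fast threshold corresponds to more than 14.4% of reconciles exceeding 1s over the last hour, and the slow threshold to more than 6% over the last six hours.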

github-actions · 2026-06-23T17:57:31Z

CI Results

Check	Status
YAML Lint	✅

Environment	Kustomize Build
dev	✅
staging	✅
production	✅

All validations passed.

stxkxs merged commit 3c89842 into main Jun 23, 2026
7 checks passed

stxkxs deleted the agent-operator-dashboard branch June 23, 2026 17:58

stxkxs merged 1 commit into main from agent-operator-dashboard
