feat(dashboards): eks-fleet vend pipeline dashboard + SLO alerts by stxkxs · Pull Request #53 · nanohype/eks-gitops

stxkxs · 2026-06-23T18:00:38Z

SRE dashboard + Grafana-managed SLO alerts for the eks-fleet vend pipeline — the
Crossplane Cluster XR → provider-opentofu Workspace → tofu path that manufactures
clusters. Observes the hub's vend execution signal so a failing vend is caught
before someone notices a cluster never came up.

Dashboard (dashboards/base/platform/fleet-vend.yaml)

GrafanaDashboard CR, self-contained inline PromQL over the provider's
controller-runtime metrics (reconcile rate / errors / latency for the Workspace
controller). RED on the vend reconcile loop + a 99%-success SLO panel with live
error-budget burn — no recording-rule dependency, so it renders against AMP with
no ruler.
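The arithmetic behind the SLO panel can be sketched in a few lines. This is only an illustration of the 99%-success objective described above, not the PromQL in fleet-vend.yaml, which this page does not show. The function names and the zero-traffic guard are assumptions made for the example.

```python
# Illustrative sketch only: names here are hypothetical, not taken from the PR's YAML.
OBJECTIVE = 0.99  # 99% reconcile success, as stated in the PR description


def success_ratio(total: float, errors: float) -> float:
    """Fraction of reconciles that succeeded; 1.0 when there was no traffic."""
    # max(total, 1.0) is an assumed guard so zero reconciles do not divide by zero.
    return 1.0 - errors / max(total, 1.0)


def budget_remaining(total: float, errors: float, objective: float = OBJECTIVE) -> float:
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    error_ratio = errors / max(total, 1.0)
    allowed_error_ratio = 1.0 - objective  # 1% of reconciles may fail
    return 1.0 - error_ratio / allowed_error_ratio


if __name__ == "__main__":
    # 40 failed reconciles out of 10,000 is a 0.4% error ratio against a 1% budget.
    print(success_ratio(10_000, 40))
    print(budget_remaining(10_000, 40))
```

With 40 failures in 10,000 reconciles the success ratio is about 99.6% and roughly 60% of the budget is left; at 200 failures the budget is overspent and the remaining share goes negative.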

Alerts (dashboards/base/alerting/fleet-vend.yaml)

GrafanaAlertRuleGroup, multi-window multi-burn-rate against the 99%/30d objective:
fast burn (14.4x over 1h ∧ 5m) and slow burn (6x over 6h ∧ 30m), both page, plus a
FleetVendProviderAbsent canary that pages when the provider's reconcile metrics
vanish (provider down or scrape broken). Dual-window AND is encoded as a > bool
product; clamp_min(denominator, 1) guards div-by-zero.
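The burn-rate thresholds and the dual-window AND can be worked through as follows. The factors (14.4x, 6x), window pairs, and the 99%/30d objective come from the description above; the function names and the (errors, total) input shape are hypothetical, and the code mirrors the logic rather than reproducing the rule group's actual PromQL.

```python
# Illustrative sketch only: helper names and input shapes are assumptions.
OBJECTIVE = 0.99           # 99% success objective from the PR description
SLO_WINDOW_HOURS = 30 * 24  # 30-day objective window


def clamp_min(value: float, floor: float) -> float:
    """Same idea as PromQL's clamp_min: never let value drop below floor."""
    return max(value, floor)


def error_ratio(errors: float, total: float) -> float:
    """Error ratio with the denominator clamped to 1 to avoid division by zero."""
    return errors / clamp_min(total, 1.0)


def burn_threshold(factor: float, objective: float = OBJECTIVE) -> float:
    """Error ratio at which the budget burns `factor` times faster than allowed."""
    return factor * (1.0 - objective)


def budget_spent(factor: float, window_hours: float) -> float:
    """Fraction of the 30d budget consumed by burning at `factor` for the window."""
    return factor * window_hours / SLO_WINDOW_HOURS


def fires(long_window: tuple, short_window: tuple, factor: float) -> int:
    """Dual-window AND as a product of two 0/1 comparisons (like `> bool`)."""
    threshold = burn_threshold(factor)
    long_hit = int(error_ratio(*long_window) > threshold)
    short_hit = int(error_ratio(*short_window) > threshold)
    return long_hit * short_hit  # 1 only when both windows exceed the threshold


if __name__ == "__main__":
    print(burn_threshold(14.4), budget_spent(14.4, 1))  # fast burn: 1h long window
    print(burn_threshold(6), budget_spent(6, 6))        # slow burn: 6h long window
    print(fires((200, 1000), (20, 100), 14.4))          # both windows at 20% errors
```

Under these numbers the fast-burn rule trips at an error ratio of about 14.4% and corresponds to spending about 2% of the monthly budget in one hour; the slow-burn rule trips at about 6% and corresponds to about 5% in six hours. The short window is what lets the alert clear quickly once errors stop, since the product falls back to 0 as soon as either comparison is false.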

Wiring (verified end-to-end)

provider-opentofu's pod template carries prometheus.io/scrape on :8080 /metrics
(eks-fleet), the hub's Grafana Agent honors those annotations and remote-writes to
AMP, and the provider runs in crossplane-system — the namespace the queries
target. Delivered via the hub environment (this is the hub's pipeline).

Per-cluster vend inventory (Cluster + Workspace CR state via kube_customresource_*)
is a follow-up — the kube-state-metrics fleet CRDs need validating on a live hub.

github-actions · 2026-06-23T18:01:11Z

CI Results

Check	Status
YAML Lint	✅

Environment	Kustomize Build
dev	✅
staging	✅
production	✅

All validations passed.

… on hub observability)

Vend-pipeline observability for eks-fleet, addressing the audit's F grade (no dashboard, no instrumentation). Self-contained over the provider-opentofu controller-runtime metrics that the eks-fleet vend-provider-scrape PR exposes.

dashboards/base/platform/fleet-vend.yaml: GrafanaDashboard CR, three rows:
- Vend reconcile SLO & error budget (99% reconcile success / 30d): inline success ratio, budget remaining, fast/slow burn.
- Vend provider RED: provider-opentofu reconcile rate/error-ratio/latency p50/p95/p99 by controller (each reconcile = a tofu plan/apply cycle).
- Work queue: depth, add rate, queue-wait p95, active workers (vend backpressure).

dashboards/base/alerting/fleet-vend.yaml: GrafanaAlertRuleGroup with dual-window reconcile-error burn (fast/slow, page) + provider-absent (page).

DRAFT / BLOCKED: the eks-fleet hub registers as environment=hub, for which eks-gitops has no env (no overlays/hub, no values-hub.yaml), and the fleet account has no AMP/AMG, so the hub runs neither grafana-agent nor grafana-operator and these would render nothing there. Blocked on #50 (wire the hub into the observability fabric). Per-cluster vend inventory/readiness (Cluster + Workspace CR state via KSM) is a follow-on in that issue.

github-actions · 2026-06-24T00:50:44Z

CI Results

Check	Status
YAML Lint	✅

Environment	Kustomize Build
dev	✅
staging	✅
production	✅
hub	✅

All validations passed.

This was referenced Jun 23, 2026

Wire the eks-fleet hub into the observability fabric (blocks the fleet-vend dashboard) #50

Closed

Add a vend-failure runbook + backfill alert runbook_url nanohype/eks-fleet#3

Closed

stxkxs mentioned this pull request Jun 23, 2026

feat(hub): full observability for the eks-fleet hub (metrics + logs + traces + dashboards) #54

Merged

stxkxs marked this pull request as ready for review June 24, 2026 00:31

stxkxs force-pushed the fleet-vend-dashboard branch from 3c5a733 to dc3ee05 Compare June 24, 2026 00:50

stxkxs changed the title ~~feat(dashboards): eks-fleet vend pipeline dashboard + alerts [BLOCKED on #50]~~ feat(dashboards): eks-fleet vend pipeline dashboard + SLO alerts Jun 24, 2026

stxkxs merged commit fd9281c into main Jun 24, 2026
8 checks passed

stxkxs deleted the fleet-vend-dashboard branch June 24, 2026 00:51

stxkxs merged 1 commit into main from fleet-vend-dashboard
