feat(dashboards): eks-fleet vend pipeline dashboard + SLO alerts#53

Merged: stxkxs merged 1 commit into main from fleet-vend-dashboard on Jun 24, 2026

@stxkxs stxkxs commented Jun 23, 2026

Member

SRE dashboard + Grafana-managed SLO alerts for the eks-fleet vend pipeline — the
Crossplane Cluster XR → provider-opentofu Workspace → tofu path that manufactures
clusters. Observes the hub's vend execution signal so a failing vend is caught
before someone notices a cluster never came up.

Dashboard (dashboards/base/platform/fleet-vend.yaml)

GrafanaDashboard CR, self-contained inline PromQL over the provider's
controller-runtime metrics (reconcile rate / errors / latency for the Workspace
controller). RED on the vend reconcile loop + a 99%-success SLO panel with live
error-budget burn — no recording-rule dependency, so it renders against AMP with
no ruler.
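The arithmetic behind a success-ratio and error-budget panel can be sketched in a few lines of Python. This is only an illustration of the 99% objective described above, not the panel's actual PromQL; the function names and the choice to treat an idle window as fully successful are assumptions made for the example.

```python
# Hypothetical helpers; names and the idle-window convention are assumptions.

def success_ratio(total: float, errors: float) -> float:
    """Fraction of reconciles that succeeded in a window."""
    if total <= 0:
        return 1.0  # assumption: no reconciles means nothing failed
    return (total - errors) / total


def budget_remaining(total: float, errors: float, slo: float = 0.99) -> float:
    """Share of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    if total <= 0:
        return 1.0
    allowed_error_ratio = 1.0 - slo
    return 1.0 - (errors / total) / allowed_error_ratio


if __name__ == "__main__":
    # 40 failed reconciles out of 10,000 over the SLO period
    print(round(success_ratio(10_000, 40), 4))
    print(round(budget_remaining(10_000, 40), 2))
```

With a 99% objective the allowed error ratio is 1%, so an observed error ratio of 0.4% consumes 40% of the budget and leaves 60%.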

Alerts (dashboards/base/alerting/fleet-vend.yaml)

GrafanaAlertRuleGroup, multi-window multi-burn-rate against the 99%/30d objective:
fast burn (14.4x over 1h ∧ 5m) and slow burn (6x over 6h ∧ 30m), both page, plus a
FleetVendProviderAbsent canary that pages when the provider's reconcile metrics
vanish (provider down or scrape broken). Dual-window AND is encoded as a > bool
product; clamp_min(denominator, 1) guards div-by-zero.
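As a rough model of the alert logic described above, the Python sketch below mirrors the dual-window AND as a product of 0/1 comparisons and the `clamp_min(denominator, 1)` guard as `max(total, 1)`. It assumes each window is summarized as `(errors, total)` reconcile counts, which may not match how the real rules build their denominators; all function names are hypothetical.

```python
SLO = 0.99
BUDGET = 1.0 - SLO          # allowed error ratio
PERIOD_HOURS = 30 * 24      # 30-day objective


def burn_rate(errors: float, total: float) -> float:
    """Error ratio divided by the budget; max(total, 1) mirrors clamp_min(denominator, 1)."""
    return (errors / max(total, 1.0)) / BUDGET


def over(value: float, threshold: float) -> int:
    """Mirror PromQL's `> bool`: return 1 or 0 instead of filtering."""
    return 1 if value > threshold else 0


def fires(long_window: tuple, short_window: tuple, factor: float) -> bool:
    """Dual-window AND: the product is 1 only when both windows exceed the factor."""
    return over(burn_rate(*long_window), factor) * over(burn_rate(*short_window), factor) == 1


def budget_spent(factor: float, window_hours: float) -> float:
    """Fraction of the 30d budget consumed if `factor` is sustained for the window."""
    return factor * window_hours / PERIOD_HOURS


if __name__ == "__main__":
    # Fast burn: 1h long window and 5m short window, both well above 14.4x
    print(fires((30, 100), (5, 10), 14.4))
    # Long window is hot but the short window has recovered, so no page
    print(fires((30, 100), (0, 10), 14.4))
    print(round(budget_spent(14.4, 1), 4), round(budget_spent(6, 6), 4))
```

The factors follow from the budget arithmetic: 14.4x sustained for 1h spends 14.4 / 720 = 2% of a 30-day budget, and 6x sustained for 6h spends 36 / 720 = 5%. The short window keeps the alert from continuing to fire after the error rate has already dropped.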

Wiring (verified end-to-end)

provider-opentofu's pod template carries prometheus.io/scrape on :8080 /metrics
(eks-fleet), the hub's Grafana Agent honors those annotations and remote-writes to
AMP, and the provider runs in crossplane-system — the namespace the queries
target. Delivered via the hub environment (this is the hub's pipeline).

Per-cluster vend inventory (Cluster + Workspace CR state via kube_customresource_*)
is a follow-up — the kube-state-metrics fleet CRDs need validating on a live hub.

@github-actions

CI Results

Check Status
YAML Lint
Environment Kustomize Build
dev
staging
production

All validations passed.

… on hub observability)

Vend-pipeline observability for eks-fleet, addressing the audit's F grade (no
dashboard, no instrumentation). Self-contained over the provider-opentofu
controller-runtime metrics that the eks-fleet vend-provider-scrape PR exposes.

dashboards/base/platform/fleet-vend.yaml — GrafanaDashboard CR, three rows:
- Vend reconcile SLO & error budget (99% reconcile success / 30d): inline
  success ratio, budget remaining, fast/slow burn.
- Vend provider RED: provider-opentofu reconcile rate/error-ratio/latency
  p50/p95/p99 by controller (each reconcile = a tofu plan/apply cycle).
- Work queue: depth, add rate, queue-wait p95, active workers (vend backpressure).
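For readers unfamiliar with how p50/p95/p99 are derived from bucketed latency histograms such as the reconcile-time and queue-wait metrics, here is a simplified Python model of quantile estimation over cumulative buckets. It is a sketch of the general technique (linear interpolation inside the matching bucket), not a faithful reimplementation of PromQL's `histogram_quantile`, and the bucket boundaries and counts are made-up sample data.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound, contain at least two entries,
    and end with an +Inf bucket. The lowest bucket is assumed to start at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No finite upper bound to interpolate to: report the last finite bound
                return buckets[-2][0]
            # Linear interpolation within the bucket that contains the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


if __name__ == "__main__":
    # Made-up reconcile latency buckets in seconds
    sample = [(0.5, 50), (1.0, 80), (5.0, 95), (math.inf, 100)]
    for q in (0.5, 0.9, 0.99):
        print(q, round(estimate_quantile(q, sample), 3))
```

Because the estimate interpolates within a bucket, its accuracy depends on bucket boundaries: a p99 that lands in the +Inf bucket can only be reported as the largest finite bound.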

dashboards/base/alerting/fleet-vend.yaml — GrafanaAlertRuleGroup: dual-window
reconcile-error burn (fast/slow, page) + provider-absent (page).

DRAFT / BLOCKED: the eks-fleet hub registers as environment=hub, for which
eks-gitops has no env (no overlays/hub, no values-hub.yaml) and the fleet account
has no AMP/AMG — so the hub runs neither grafana-agent nor grafana-operator and
these would render nothing there. Blocked on #50 (wire the hub into the
observability fabric). Per-cluster vend inventory/readiness (Cluster + Workspace
CR state via KSM) is a follow-on in that issue.
@stxkxs stxkxs force-pushed the fleet-vend-dashboard branch from 3c5a733 to dc3ee05 on June 24, 2026 00:50
@stxkxs stxkxs changed the title from "feat(dashboards): eks-fleet vend pipeline dashboard + alerts [BLOCKED on #50]" to "feat(dashboards): eks-fleet vend pipeline dashboard + SLO alerts" on Jun 24, 2026
@github-actions

CI Results

Check Status
YAML Lint
Environment Kustomize Build
dev
staging
production
hub

All validations passed.

@stxkxs stxkxs merged commit fd9281c into main Jun 24, 2026
8 checks passed
@stxkxs stxkxs deleted the fleet-vend-dashboard branch June 24, 2026 00:51