feat(dashboards): eks-fleet vend pipeline dashboard + SLO alerts#53
Merged
This was referenced Jun 23, 2026
CI Results
All validations passed.
… on hub observability)

Vend-pipeline observability for eks-fleet, addressing the audit's F grade (no dashboard, no instrumentation). Self-contained over the provider-opentofu controller-runtime metrics that the eks-fleet vend-provider-scrape PR exposes.

dashboards/base/platform/fleet-vend.yaml (GrafanaDashboard CR), three rows:
- Vend reconcile SLO & error budget (99% reconcile success / 30d): inline success ratio, budget remaining, fast/slow burn.
- Vend provider RED: provider-opentofu reconcile rate/error-ratio/latency p50/p95/p99 by controller (each reconcile = a tofu plan/apply cycle).
- Work queue: depth, add rate, queue-wait p95, active workers (vend backpressure).

dashboards/base/alerting/fleet-vend.yaml (GrafanaAlertRuleGroup): dual-window reconcile-error burn (fast/slow, page) + provider-absent (page).

DRAFT / BLOCKED: the eks-fleet hub registers as environment=hub, for which eks-gitops has no env (no overlays/hub, no values-hub.yaml) and the fleet account has no AMP/AMG, so the hub runs neither grafana-agent nor grafana-operator and these would render nothing there. Blocked on #50 (wire the hub into the observability fabric). Per-cluster vend inventory/readiness (Cluster + Workspace CR state via KSM) is a follow-on in that issue.
3c5a733 to
dc3ee05
SRE dashboard + Grafana-managed SLO alerts for the eks-fleet vend pipeline — the
Crossplane Cluster XR → provider-opentofu Workspace → tofu path that manufactures
clusters. Observes the hub's vend execution signal so a failing vend is caught
before someone notices a cluster never came up.
Dashboard (dashboards/base/platform/fleet-vend.yaml)
GrafanaDashboard CR, self-contained inline PromQL over the provider's
controller-runtime metrics (reconcile rate / errors / latency for the Workspace
controller). RED on the vend reconcile loop + a 99%-success SLO panel with live
error-budget burn — no recording-rule dependency, so it renders against AMP with
no ruler.
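The arithmetic behind the SLO panel can be sketched in a few lines of Python. This is an illustration only: the function names and the sample counts are hypothetical, and the PromQL the dashboard actually uses is not reproduced here. It assumes the panel derives a success ratio and a remaining error budget from reconcile totals and reconcile errors over the SLO window.

```python
SLO_TARGET = 0.99  # 99% reconcile success, the objective stated in this PR


def success_ratio(total: float, errors: float) -> float:
    """Fraction of reconciles that succeeded over the window."""
    # Clamp the denominator so an idle window reads as healthy instead of
    # dividing by zero (same idea as clamp_min(denominator, 1) in PromQL).
    return 1.0 - errors / max(total, 1.0)


def budget_remaining(total: float, errors: float, target: float = SLO_TARGET) -> float:
    """Unspent share of the error budget: 1.0 is untouched, below 0 is overspent."""
    allowed_error_ratio = 1.0 - target
    observed_error_ratio = errors / max(total, 1.0)
    return 1.0 - observed_error_ratio / allowed_error_ratio


if __name__ == "__main__":
    # Hypothetical window: 1000 reconciles, 5 of them errored.
    print(f"success ratio:    {success_ratio(1000, 5):.4f}")
    print(f"budget remaining: {budget_remaining(1000, 5):.2%}")
```

With a 99% target the allowed error ratio is 1%, so a window with a 0.5% error ratio has spent roughly half of its budget.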
Alerts (dashboards/base/alerting/fleet-vend.yaml)
GrafanaAlertRuleGroup, multi-window multi-burn-rate against the 99%/30d objective:
fast burn (14.4x over 1h ∧ 5m) and slow burn (6x over 6h ∧ 30m), both page, plus a
FleetVendProviderAbsent canary that pages when the provider's reconcile metrics
vanish (provider down or scrape broken). Dual-window AND is encoded as a > bool product;
clamp_min(denominator, 1) guards against division by zero.
Wiring (verified end-to-end)
provider-opentofu's pod template carries prometheus.io/scrape on :8080 /metrics
(eks-fleet), the hub's Grafana Agent honors those annotations and remote-writes to
AMP, and the provider runs in crossplane-system (the namespace the queries
target). Delivered via the hub environment (this is the hub's pipeline).
Per-cluster vend inventory (Cluster + Workspace CR state via kube_customresource_*)
is a follow-up — the kube-state-metrics fleet CRDs need validating on a live hub.
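The burn-rate multipliers quoted in the Alerts section can be sanity-checked with a small Python sketch. The helper names and sample error ratios are hypothetical, and the code models the logic rather than reproducing the alert rules. It assumes the usual multi-window, multi-burn-rate convention: the threshold error ratio is the burn rate times the allowed error ratio, and an alert fires only when both the long and the short window exceed it.

```python
SLO_TARGET = 0.99            # 99% success objective from this PR
SLO_WINDOW_HOURS = 30 * 24   # 30d objective window


def error_ratio_threshold(burn_rate: float, target: float = SLO_TARGET) -> float:
    """Error ratio at which the budget is being spent `burn_rate` times too fast."""
    return burn_rate * (1.0 - target)


def budget_consumed(burn_rate: float, window_hours: float) -> float:
    """Share of the 30d budget spent if `burn_rate` is sustained for `window_hours`."""
    return burn_rate * window_hours / SLO_WINDOW_HOURS


def burn_alert_fires(long_ratio: float, short_ratio: float, burn_rate: float) -> bool:
    """Dual-window AND, written as a product of 0/1 comparisons."""
    threshold = error_ratio_threshold(burn_rate)
    # Each comparison yields 0 or 1, like a PromQL `> bool` comparison;
    # the product is 1 only when both windows are over the threshold.
    return int(long_ratio > threshold) * int(short_ratio > threshold) == 1


if __name__ == "__main__":
    for name, rate, hours in (("fast", 14.4, 1), ("slow", 6.0, 6)):
        print(name, f"threshold={error_ratio_threshold(rate):.3f}",
              f"budget spent={budget_consumed(rate, hours):.0%}")
```

Under these assumptions the fast-burn pair trips at a 14.4% error ratio, which spends about 2% of the 30d budget in one hour, and the slow-burn pair trips at 6%, which spends about 5% in six hours. The short window in each pair lets the alert clear quickly once errors stop.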