feat(dashboards): eks-fleet vend pipeline dashboard + alerts [BLOCKED on #50] #51
Draft
stxkxs wants to merge 1 commit into
… on hub observability)

Vend-pipeline observability for eks-fleet, addressing the audit's F grade (no dashboard, no instrumentation). Self-contained over the provider-opentofu controller-runtime metrics that the eks-fleet vend-provider-scrape PR exposes.

dashboards/base/platform/fleet-vend.yaml: GrafanaDashboard CR, three rows:
- Vend reconcile SLO & error budget (99% reconcile success / 30d): inline success ratio, budget remaining, fast/slow burn.
- Vend provider RED: provider-opentofu reconcile rate/error-ratio/latency p50/p95/p99 by controller (each reconcile = a tofu plan/apply cycle).
- Work queue: depth, add rate, queue-wait p95, active workers (vend backpressure).

dashboards/base/alerting/fleet-vend.yaml: GrafanaAlertRuleGroup with dual-window reconcile-error burn (fast/slow, page) + provider-absent (page).

DRAFT / BLOCKED: the eks-fleet hub registers as environment=hub, for which eks-gitops has no env (no overlays/hub, no values-hub.yaml) and the fleet account has no AMP/AMG, so the hub runs neither grafana-agent nor grafana-operator and these would render nothing there. Blocked on #50 (wire the hub into the observability fabric). Per-cluster vend inventory/readiness (Cluster + Workspace CR state via KSM) is a follow-on in that issue.
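The arithmetic behind the SLO row (success ratio and budget remaining against the 99% / 30d target) can be sketched as below. This is a minimal illustration, not the dashboard's actual queries: the function names are hypothetical, the counts stand in for reconcile totals summed over the 30d window, and treating a window with no reconciles as "budget untouched" is an assumption.

```python
SLO_TARGET = 0.99  # from the commit message: 99% reconcile success over 30d


def success_ratio(total: float, errors: float) -> float:
    """Fraction of reconciles in the window that did not error.

    Assumption: a window with no reconciles counts as fully successful.
    """
    if total <= 0:
        return 1.0
    return 1.0 - errors / total


def budget_remaining(total: float, errors: float, slo: float = SLO_TARGET) -> float:
    """Share of the error budget still unspent.

    1.0 means untouched, 0.0 means exactly spent, negative means overspent.
    """
    if total <= 0:
        return 1.0
    allowed_errors = (1.0 - slo) * total  # errors the SLO tolerates in the window
    return 1.0 - errors / allowed_errors


if __name__ == "__main__":
    # Hypothetical window: 100,000 reconciles, 250 of them errored.
    print(f"success ratio:    {success_ratio(100_000, 250):.4f}")
    print(f"budget remaining: {budget_remaining(100_000, 250):.2%}")
```

With a 99% target, the budget is 1% of reconciles, so 250 errors out of 100,000 spends roughly a quarter of it.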
What
Vend-pipeline observability for eks-fleet (the audit graded it F — no dashboard, no instrumentation). Self-contained over the provider-opentofu controller-runtime metrics exposed by eks-fleet#2.
- dashboards/base/platform/fleet-vend.yaml: GrafanaDashboard CR with vend reconcile-success SLO + provider-opentofu reconcile RED (rate / error-ratio / latency p50/p95/p99, each reconcile = a tofu plan/apply cycle) + work-queue (vend backpressure).
- dashboards/base/alerting/fleet-vend.yaml: GrafanaAlertRuleGroup with dual-window reconcile-error burn (fast/slow, page) + provider-absent (page).

Why it's blocked (#50)
A quality-check caught the make-or-break gap: the hub registers as environment: hub, but eks-gitops has no hub env (no overlays/hub, no values-hub.yaml) and the fleet account has no AMP/AMG. So the hub runs neither grafana-agent (nothing scrapes the provider) nor grafana-operator (no CR reconciler). The dashboard and alerts would render nothing there, and they'd be empty on every workload cluster too (the vend pipeline runs only on the hub). #50 tracks wiring the hub in; merge this after.

Known limitation (also in #50)
The SLI is provider reconcile-error ratio: a provider-RED proxy, not a true per-vend success rate (provider-opentofu polls every 1m, so the denominator is dominated by drift polls and transient retries count as errors). True per-cluster vend success/readiness needs the KSM customResourceState extended for the Cluster XR + Workspace CR, which is the follow-on in #50.

Validation
kustomize build dashboards/base green; yamllint clean; dashboard JSON valid; alert conditions resolve; consistent with the portal + agent-operator patterns (shared slo-alerts folder, pinned datasource UID, > bool dual-window burn). Stacked on portal-sre-dashboard for the shared alerting infra.
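For reviewers unfamiliar with the > bool dual-window burn pattern, here is a rough sketch of the logic in Python. It is not the rule definition in this PR. The burn factors (14.4 fast, 6 slow) and the idea of pairing a long window with a short one are assumptions borrowed from the commonly cited multi-window burn-rate scheme; the PR text does not state which factors or windows the rules use. All function names are hypothetical.

```python
SLO_TARGET = 0.99  # 99% reconcile success, per the PR description

# Assumed burn factors; the PR does not state the values used in the rules.
FAST_BURN = 14.4
SLOW_BURN = 6.0


def burn_rate(error_ratio: float, slo: float = SLO_TARGET) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    return error_ratio / (1.0 - slo)


def gt_bool(value: float, threshold: float) -> int:
    """Mimic PromQL `value > bool threshold`: yields 1 or 0 instead of filtering."""
    return 1 if value > threshold else 0


def dual_window_fires(long_ratio: float, short_ratio: float,
                      factor: float, slo: float = SLO_TARGET) -> bool:
    """Fire only when both the long and the short window exceed the threshold.

    Multiplying the two 0/1 results acts as a logical AND, which is the
    reason for using the bool form of the comparison.
    """
    threshold = factor * (1.0 - slo)
    return gt_bool(long_ratio, threshold) * gt_bool(short_ratio, threshold) == 1


if __name__ == "__main__":
    # Sustained 20% error ratio, still elevated in the short window: pages.
    print(dual_window_fires(0.20, 0.25, FAST_BURN))  # True
    # Long window still elevated but the short window has recovered: no page.
    print(dual_window_fires(0.20, 0.05, FAST_BURN))  # False
```

Requiring both windows keeps the alert from firing on a brief spike and lets it clear soon after the error ratio recovers.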