feat(dashboards): eks-fleet vend pipeline dashboard + alerts [BLOCKED on #50]#51

Draft
stxkxs wants to merge 1 commit into portal-sre-dashboard from fleet-vend-dashboard

stxkxs (Member) commented Jun 23, 2026

Draft / blocked on #50. The artifacts are authored, validated, and quality-checked (A/A−/A on consistency/docs/patterns), but they render nothing until the eks-fleet hub joins the observability fabric — so this should not merge yet.

What

Vend-pipeline observability for eks-fleet (the audit graded it F — no dashboard, no instrumentation). Self-contained over the provider-opentofu controller-runtime metrics exposed by eks-fleet#2.

  • dashboards/base/platform/fleet-vend.yaml — GrafanaDashboard CR: vend reconcile-success SLO + provider-opentofu reconcile RED (rate / error-ratio / latency p50/p95/p99, each reconcile = a tofu plan/apply cycle) + work-queue (vend backpressure).
  • dashboards/base/alerting/fleet-vend.yaml — GrafanaAlertRuleGroup: dual-window reconcile-error burn (fast/slow, page) + provider-absent (page).
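For reviewers less familiar with dual-window burn alerts, here is a minimal Python sketch of the decision the fast/slow rules encode. It is an illustration, not the rule definitions in this PR. The 99% target is the SLO stated for the dashboard's SLO row; the thresholds (14.4 and 6.0) and the function names are assumptions borrowed from common SLO-alerting practice, and the actual values in fleet-vend.yaml may differ.

```python
# Sketch of dual-window burn-rate alerting. Thresholds and names are
# assumptions for illustration, not values read from fleet-vend.yaml.

SLO_TARGET = 0.99                 # 99% reconcile success over 30d
ERROR_BUDGET = 1.0 - SLO_TARGET   # fraction of reconciles allowed to fail

FAST_BURN_THRESHOLD = 14.4        # assumed: typical "fast" page threshold
SLOW_BURN_THRESHOLD = 6.0         # assumed: typical "slow" page threshold


def burn_rate(errors: float, total: float) -> float:
    """Error ratio divided by the error budget.

    A value of 1.0 means the budget is being spent at exactly the pace
    that exhausts it at the end of the SLO window.
    """
    if total <= 0:
        return 0.0  # no reconciles observed: treat as no burn
    return (errors / total) / ERROR_BUDGET


def dual_window_fires(long_burn: float, short_burn: float, threshold: float) -> bool:
    """Fire only if both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert resets quickly after recovery.
    """
    return long_burn > threshold and short_burn > threshold


if __name__ == "__main__":
    long_burn = burn_rate(errors=20, total=100)   # sustained 20% errors
    short_burn = burn_rate(errors=2, total=100)   # recent window has recovered
    print(dual_window_fires(long_burn, short_burn, FAST_BURN_THRESHOLD))
```

In PromQL the same conjunction is usually written as two `> bool` comparisons multiplied together, which is the pattern the Validation section refers to.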

Why it's blocked (#50)

A quality check caught the make-or-break gap: the hub registers as environment: hub, but eks-gitops has no hub env (no overlays/hub, no values-hub.yaml) and the fleet account has no AMP/AMG. So the hub runs neither grafana-agent (nothing scrapes the provider) nor grafana-operator (nothing reconciles the CRs). The dashboard and alert rules would render nothing on the hub, and they would be empty on every workload cluster too, because the vend pipeline runs only on the hub. #50 tracks wiring the hub in; merge this after.

Known limitation (also in #50)

The SLI is provider reconcile-error ratio — a provider-RED proxy, not a true per-vend success rate (provider-opentofu polls every 1m, so the denominator is dominated by drift polls and transient retries count as errors). True per-cluster vend success/readiness needs the KSM customResourceState extended for the Cluster XR + Workspace CR — the follow-on in #50.
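To make the dilution concrete, here is a rough worked example in Python. All counts are hypothetical and chosen only to show the effect. It assumes a single Workspace, that each 1m poll counts as one reconcile, and that failed reconciles are included in that total.

```python
# Hypothetical arithmetic: how 1m drift polls dilute the reconcile-error ratio.
# None of these counts are measurements from eks-fleet.

POLLS_PER_DAY = 24 * 60  # one reconcile per minute per Workspace (1m poll interval)


def proxy_error_ratio(workspaces: int, failed_reconciles: int, days: int = 30) -> float:
    """Reconcile-error ratio as the proxy SLI sees it over the window."""
    total_reconciles = workspaces * POLLS_PER_DAY * days
    return failed_reconciles / total_reconciles


def vend_failure_rate(vends_attempted: int, vends_failed: int) -> float:
    """What a true per-vend SLI would report."""
    return vends_failed / vends_attempted


if __name__ == "__main__":
    # Hypothetical month: 10 vends attempted, 3 failed,
    # each failure surfacing as 5 errored reconciles (15 in total).
    proxy = proxy_error_ratio(workspaces=1, failed_reconciles=15)
    true_rate = vend_failure_rate(vends_attempted=10, vends_failed=3)
    print(f"proxy error ratio: {proxy:.5f}")
    print(f"true vend failure rate: {true_rate:.2f}")
```

Under these assumptions the proxy stays far below the 1% budget while nearly a third of vends fail. The reverse can also happen, since a burst of transient retries raises the proxy without any vend ultimately failing.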

Validation

kustomize build dashboards/base green; yamllint clean; dashboard JSON valid; alert conditions resolve; consistent with the portal + agent-operator patterns (shared slo-alerts folder, pinned datasource UID, > bool dual-window burn). Stacked on portal-sre-dashboard for the shared alerting infra.

… on hub observability)

Vend-pipeline observability for eks-fleet, addressing the audit's F grade (no
dashboard, no instrumentation). Self-contained over the provider-opentofu
controller-runtime metrics that the eks-fleet vend-provider-scrape PR exposes.

dashboards/base/platform/fleet-vend.yaml — GrafanaDashboard CR, three rows:
- Vend reconcile SLO & error budget (99% reconcile success / 30d): inline
  success ratio, budget remaining, fast/slow burn.
- Vend provider RED: provider-opentofu reconcile rate/error-ratio/latency
  p50/p95/p99 by controller (each reconcile = a tofu plan/apply cycle).
- Work queue: depth, add rate, queue-wait p95, active workers (vend backpressure).

dashboards/base/alerting/fleet-vend.yaml — GrafanaAlertRuleGroup: dual-window
reconcile-error burn (fast/slow, page) + provider-absent (page).

DRAFT / BLOCKED: the eks-fleet hub registers as environment=hub, for which
eks-gitops has no env (no overlays/hub, no values-hub.yaml) and the fleet account
has no AMP/AMG — so the hub runs neither grafana-agent nor grafana-operator and
these would render nothing there. Blocked on #50 (wire the hub into the
observability fabric). Per-cluster vend inventory/readiness (Cluster + Workspace
CR state via KSM) is a follow-on in that issue.