feat(dashboards): eks-fleet vend pipeline dashboard + alerts [BLOCKED on #50] by stxkxs · Pull Request #51 · nanohype/eks-gitops

stxkxs · 2026-06-23T02:36:54Z

Draft / blocked on #50. The artifacts are authored, validated, and quality-checked (A/A−/A on consistency/docs/patterns), but they render nothing until the eks-fleet hub joins the observability fabric — so this should not merge yet.

What

Vend-pipeline observability for eks-fleet (the audit graded it F — no dashboard, no instrumentation). Self-contained over the provider-opentofu controller-runtime metrics exposed by eks-fleet#2.

dashboards/base/platform/fleet-vend.yaml — GrafanaDashboard CR: vend reconcile-success SLO + provider-opentofu reconcile RED (rate / error-ratio / latency p50/p95/p99, each reconcile = a tofu plan/apply cycle) + work-queue (vend backpressure).
dashboards/base/alerting/fleet-vend.yaml — GrafanaAlertRuleGroup: dual-window reconcile-error burn (fast/slow, page) + provider-absent (page).
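To make the dual-window burn condition concrete, here is a minimal Python sketch of the arithmetic, not the alert rule itself. The 99% / 30d target comes from the commit message. The names `burn_rate`, `Window`, and `dual_window_fires`, the 14.4x threshold, and the pairing of one long window with one short window are assumptions for illustration (a common SRE convention), and may not match what the rule group actually uses.

```python
from dataclasses import dataclass

SLO_TARGET = 0.99               # 99% reconcile success over 30d (from the PR)
ERROR_BUDGET = 1 - SLO_TARGET   # fraction of reconciles allowed to fail


def burn_rate(errors: float, total: float) -> float:
    """Error ratio divided by the error budget.

    1.0 means the budget is being spent exactly on schedule;
    higher values exhaust the 30d budget proportionally sooner.
    """
    if total <= 0:
        return 0.0  # no reconciles observed, so nothing is burning
    return (errors / total) / ERROR_BUDGET


@dataclass(frozen=True)
class Window:
    """Reconcile counts observed over one lookback window (hypothetical type)."""
    errors: float
    total: float


def dual_window_fires(long: Window, short: Window, threshold: float) -> bool:
    """Fire only if BOTH windows burn faster than the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, so the alert clears soon after recovery.
    """
    return (
        burn_rate(long.errors, long.total) > threshold
        and burn_rate(short.errors, short.total) > threshold
    )


if __name__ == "__main__":
    # Assumed threshold: 14.4x is a conventional fast-burn value, not taken from the PR.
    fast = dual_window_fires(Window(20, 100), Window(3, 10), threshold=14.4)
    print("fast burn page:", fast)
```

Requiring both windows is what keeps a short spike of transient retries from paging on its own while still catching a sustained failure quickly.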

Why it's blocked (#50)

A quality-check caught the make-or-break gap: the hub registers as environment: hub, but eks-gitops has no hub env (no overlays/hub, no values-hub.yaml) and the fleet account has no AMP/AMG. So the hub runs neither grafana-agent (nothing scrapes the provider) nor grafana-operator (no CR reconciler) — these would render nothing there, and they'd be empty on every workload cluster too (the vend pipeline runs only on the hub). #50 tracks wiring the hub in; merge this after.

Known limitation (also in #50)

The SLI is provider reconcile-error ratio — a provider-RED proxy, not a true per-vend success rate (provider-opentofu polls every 1m, so the denominator is dominated by drift polls and transient retries count as errors). True per-cluster vend success/readiness needs the KSM customResourceState extended for the Cluster XR + Workspace CR — the follow-on in #50.
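A rough worked example of why the ratio is only a proxy. The 1m poll interval is from the PR; the fleet size, the number of errored reconciles, and the assumption that each workspace reconciles exactly once per poll are hypothetical numbers chosen for illustration.

```python
POLLS_PER_DAY = 24 * 60  # provider-opentofu polls every 1m (from the PR)


def reconcile_error_ratio(workspaces: int, failed_reconciles: int, days: int = 1) -> float:
    """Provider-level error ratio, assuming one reconcile per workspace per poll."""
    total = workspaces * POLLS_PER_DAY * days
    if total == 0:
        return 0.0
    return failed_reconciles / total


if __name__ == "__main__":
    # Hypothetical day: 50 existing workspaces, one new vend that fails
    # outright after 10 errored reconciles.
    ratio = reconcile_error_ratio(workspaces=50, failed_reconciles=10)
    print(f"reconcile error ratio: {ratio:.6f}")  # far below the 1% budget
```

Under these assumptions a vend that never succeeds barely moves the SLI, because tens of thousands of healthy drift polls dilute it. The reverse distortion also applies: a vend that succeeds after a few transient retries still adds errors to the numerator.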

Validation

kustomize build dashboards/base green; yamllint clean; dashboard JSON valid; alert conditions resolve; consistent with the portal + agent-operator patterns (shared slo-alerts folder, pinned datasource UID, > bool dual-window burn). Stacked on portal-sre-dashboard for the shared alerting infra.

… on hub observability)

Vend-pipeline observability for eks-fleet, addressing the audit's F grade (no dashboard, no instrumentation). Self-contained over the provider-opentofu controller-runtime metrics that the eks-fleet vend-provider-scrape PR exposes.

dashboards/base/platform/fleet-vend.yaml: GrafanaDashboard CR, three rows:
- Vend reconcile SLO & error budget (99% reconcile success / 30d): inline success ratio, budget remaining, fast/slow burn.
- Vend provider RED: provider-opentofu reconcile rate/error-ratio/latency p50/p95/p99 by controller (each reconcile = a tofu plan/apply cycle).
- Work queue: depth, add rate, queue-wait p95, active workers (vend backpressure).

dashboards/base/alerting/fleet-vend.yaml: GrafanaAlertRuleGroup with dual-window reconcile-error burn (fast/slow, page) + provider-absent (page).

DRAFT / BLOCKED: the eks-fleet hub registers as environment=hub, for which eks-gitops has no env (no overlays/hub, no values-hub.yaml), and the fleet account has no AMP/AMG, so the hub runs neither grafana-agent nor grafana-operator and these would render nothing there. Blocked on #50 (wire the hub into the observability fabric). Per-cluster vend inventory/readiness (Cluster + Workspace CR state via KSM) is a follow-on in that issue.

stxkxs wants to merge 1 commit into portal-sre-dashboard from fleet-vend-dashboard
