Context
The fleet-vend Grafana alert group (nanohype/eks-gitops PR #51) ships three page-severity alerts — FleetVendReconcileFastBurn, FleetVendReconcileSlowBurn, FleetVendProviderAbsent — but without runbook_url annotations, because eks-fleet has no docs/runbooks/ directory and no vend-failure runbook to link. (The sibling eks-agent-platform operator alerts all carry runbook_url.)
The alert description fields are actionable in the meantime (they name the triage step — kubectl get workspace for the Synced condition, the provider pod logs, the provider Healthy condition).
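As a rough illustration of the first triage step, the sketch below pulls the failure message out of a Workspace's Synced condition. It assumes the object has been dumped as JSON (for example with kubectl get workspace NAME -o json) and that conditions carry the usual type, status, and message keys. The helper name synced_failure_messages, the sample object, and its error text are all hypothetical.

```python
import json


def synced_failure_messages(workspace: dict) -> list[str]:
    """Return the message of every Synced condition whose status is "False"."""
    conditions = workspace.get("status", {}).get("conditions", [])
    return [
        cond.get("message", "")
        for cond in conditions
        if cond.get("type") == "Synced" and cond.get("status") == "False"
    ]


# Hypothetical sample shaped like a Workspace dumped as JSON.
# The name and the error text are made up for illustration.
sample = json.loads("""
{
  "metadata": {"name": "example-vend"},
  "status": {
    "conditions": [
      {"type": "Ready", "status": "False", "reason": "Creating"},
      {"type": "Synced", "status": "False", "reason": "ReconcileError",
       "message": "hypothetical tofu/AWS error text"}
    ]
  }
}
""")

for message in synced_failure_messages(sample):
    print(message)
```

A Workspace that is syncing normally yields an empty list, so the same helper can serve as a quick "is anything stuck" check across several dumped objects.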
Fix
- Add eks-fleet/docs/runbooks/vend-failure.md covering: a stuck/failed vend (Workspace Synced=False → read conditions[].message for the tofu/AWS error), provider-opentofu down/crashlooping, and the budget-burn response. Cross-link the teardown/orphan-sweep steps already in docs/stand-up-the-hub.md.
- Backfill runbook_url: https://github.com/nanohype/eks-fleet/blob/main/docs/runbooks/vend-failure.md onto the three rules in dashboards/base/alerting/fleet-vend.yaml.
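The backfill might look roughly like the fragment below. It is a sketch that assumes the rules follow Grafana's file-provisioning layout (groups, rules, title, annotations); the actual schema of fleet-vend.yaml may differ, and every field other than the rule titles and the runbook_url value is a placeholder.

```yaml
# Sketch only: assumes Grafana's file-provisioning layout for alert rules.
# Queries, conditions, labels, and existing annotations are omitted here.
groups:
  - name: fleet-vend
    rules:
      - title: FleetVendReconcileFastBurn
        annotations:
          runbook_url: https://github.com/nanohype/eks-fleet/blob/main/docs/runbooks/vend-failure.md
      - title: FleetVendReconcileSlowBurn
        annotations:
          runbook_url: https://github.com/nanohype/eks-fleet/blob/main/docs/runbooks/vend-failure.md
      - title: FleetVendProviderAbsent
        annotations:
          runbook_url: https://github.com/nanohype/eks-fleet/blob/main/docs/runbooks/vend-failure.md
```

All three rules point at the same runbook, since the proposed vend-failure.md covers both the reconcile-burn and provider-absent cases.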
Low priority — depends on nanohype/eks-gitops#50 (the dashboard isn't live until the hub joins the observability fabric).