
Add a vend-failure runbook + backfill alert runbook_url #3

Description

@stxkxs

Context

The fleet-vend Grafana alert group (nanohype/eks-gitops PR #51) ships three page-severity alerts — FleetVendReconcileFastBurn, FleetVendReconcileSlowBurn, FleetVendProviderAbsent — but without runbook_url annotations, because eks-fleet has no docs/runbooks/ directory and no vend-failure runbook to link. (The sibling eks-agent-platform operator alerts all carry runbook_url.)

The alert description fields are actionable in the meantime: they name the triage steps (kubectl get workspace for the Synced condition, the provider pod logs, and the provider Healthy condition).
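As a rough illustration of the Synced-condition triage step, the helper below pulls the failure messages out of a Workspace object. It is a sketch under assumptions: the Workspace is taken to follow the usual Kubernetes `status.conditions[]` shape (`type`, `status`, `reason`, `message`), the JSON is assumed to come from something like `kubectl get workspace <name> -o json`, and the function name and sample payload are hypothetical.

```python
import json


def failed_conditions(workspace, cond_type="Synced"):
    """Return (reason, message) pairs for conditions of cond_type with status "False"."""
    conditions = workspace.get("status", {}).get("conditions", [])
    return [
        (c.get("reason", ""), c.get("message", ""))
        for c in conditions
        if c.get("type") == cond_type and c.get("status") == "False"
    ]


# Hypothetical payload, shaped like `kubectl get workspace <name> -o json` output.
SAMPLE = json.loads("""
{
  "kind": "Workspace",
  "metadata": {"name": "example-vend"},
  "status": {
    "conditions": [
      {"type": "Ready", "status": "False", "reason": "Unavailable"},
      {"type": "Synced", "status": "False", "reason": "ReconcileError",
       "message": "example tofu/AWS error text"}
    ]
  }
}
""")

if __name__ == "__main__":
    for reason, message in failed_conditions(SAMPLE):
        print(f"{reason}: {message}")
```

Passing `cond_type="Healthy"` would apply the same check to a provider object, assuming it exposes its conditions the same way.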

Fix

  • Add eks-fleet/docs/runbooks/vend-failure.md covering: a stuck/failed vend (Workspace Synced=False → read conditions[].message for the tofu/AWS error), provider-opentofu down/crashlooping, and the budget-burn response. Cross-link the teardown/orphan-sweep steps already in docs/stand-up-the-hub.md.
  • Backfill runbook_url: https://github.com/nanohype/eks-fleet/blob/main/docs/runbooks/vend-failure.md onto the three rules in dashboards/base/alerting/fleet-vend.yaml.
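A check along these lines could confirm the backfill once it lands. It is a minimal sketch, not the repository's tooling: it assumes the alert file parses into a Grafana file-provisioning style layout (`groups[].rules[]`, each rule with a `title` and an `annotations` map), which may not match the real structure of fleet-vend.yaml. The function name and sample document are hypothetical.

```python
RUNBOOK_URL = (
    "https://github.com/nanohype/eks-fleet/blob/main/docs/runbooks/vend-failure.md"
)
PAGE_ALERTS = (
    "FleetVendReconcileFastBurn",
    "FleetVendReconcileSlowBurn",
    "FleetVendProviderAbsent",
)


def missing_runbook(doc, titles=PAGE_ALERTS, url=RUNBOOK_URL):
    """Return the alert titles that are absent or lack the expected runbook_url."""
    found = {}
    # Assumed layout: groups[].rules[], each rule with `title` and `annotations`.
    for group in doc.get("groups", []):
        for rule in group.get("rules", []):
            annotations = rule.get("annotations") or {}
            found[rule.get("title")] = annotations.get("runbook_url")
    return [t for t in titles if found.get(t) != url]


# Hypothetical parsed document: two rules backfilled, one still missing the annotation.
SAMPLE_DOC = {
    "groups": [
        {
            "name": "fleet-vend",
            "rules": [
                {"title": "FleetVendReconcileFastBurn",
                 "annotations": {"runbook_url": RUNBOOK_URL}},
                {"title": "FleetVendReconcileSlowBurn",
                 "annotations": {"runbook_url": RUNBOOK_URL}},
                {"title": "FleetVendProviderAbsent",
                 "annotations": {"description": "provider is absent"}},
            ],
        }
    ]
}

if __name__ == "__main__":
    print(missing_runbook(SAMPLE_DOC))
```

The function works on an already-parsed mapping so that it stays independent of whichever YAML loader the repository uses.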

Low priority — depends on nanohype/eks-gitops#50 (the dashboard isn't live until the hub joins the observability fabric).
