
docs(runbooks): add the vend-failure runbook for the fleet-vend alerts #4

Merged
stxkxs merged 1 commit into main from vend-failure-runbook
Jun 24, 2026

@stxkxs stxkxs commented Jun 24, 2026

Member

The fleet-vend Grafana alerts (FleetVendReconcileFastBurn/SlowBurn/ProviderAbsent) shipped without a runbook_url because eks-fleet had no docs/runbooks/. Adds vend-failure.md:

  • ProviderAbsent → provider-opentofu down/crashlooping vs healthy-but-unscraped.
  • Burn alerts → the failing-Workspace drill: find Workspaces with Synced=False, read the condition message for the real tofu/AWS error (the root cause behind the burn rate), scope the provider logs, and make the budget-burn blast-radius call.
  • external-create-pending deadlock recovery (drop the finalizer + direct AWS deletion; never cycle the provider mid-apply), cross-linked to the hub teardown.
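
The first step of the failing-Workspace drill (find Workspaces with Synced=False and read the condition message) can be sketched roughly as below. This is a minimal illustration, not the runbook's own tooling: it assumes the Workspace list is already available as parsed JSON in the `items` shape that `kubectl get -o json` returns for a list, and that each Workspace carries the usual Crossplane `status.conditions` entries with `type`, `status`, `reason`, and `message`. The function name `failing_workspaces`, the `SAMPLE` data, the Workspace names, and the error text are all hypothetical.

```python
def failing_workspaces(workspace_list):
    """Return (name, reason, message) for each Workspace whose Synced condition is False."""
    failing = []
    for item in workspace_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unnamed>")
        for cond in item.get("status", {}).get("conditions", []):
            # Only the Synced condition matters here; Ready is ignored on purpose.
            if cond.get("type") == "Synced" and cond.get("status") == "False":
                failing.append((name, cond.get("reason", ""), cond.get("message", "")))
    return failing


# Hypothetical sample data: names and the error message are placeholders.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "vend-team-a"},
            "status": {
                "conditions": [
                    {
                        "type": "Synced",
                        "status": "False",
                        "reason": "ReconcileError",
                        "message": "apply failed: placeholder tofu/AWS error",
                    },
                    {"type": "Ready", "status": "False", "reason": "Unavailable"},
                ]
            },
        },
        {
            "metadata": {"name": "vend-team-b"},
            "status": {
                "conditions": [
                    {"type": "Synced", "status": "True", "reason": "ReconcileSuccess"},
                ]
            },
        },
    ]
}

if __name__ == "__main__":
    for name, reason, message in failing_workspaces(SAMPLE):
        print(f"{name}: [{reason}] {message}")
```

With real data, the dictionary would come from parsing the cluster's JSON output (for example with `json.load`) rather than from `SAMPLE`. The condition message is the part worth reading closely, since the description above identifies it as carrying the real tofu/AWS error.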

The runbook_url backfill onto the three rules in eks-gitops dashboards/base/alerting/fleet-vend.yaml lands in a companion eks-gitops PR.

Closes #3.

The fleet-vend Grafana alerts (FleetVendReconcileFastBurn/SlowBurn/ProviderAbsent)
had no runbook to link — eks-fleet had no docs/runbooks/. Adds vend-failure.md with
triage per alert: provider-down/unscraped (ProviderAbsent), and the failing-Workspace
drill for the burn alerts — find Workspaces with Synced=False, read the condition
message for the real tofu/AWS error (the root cause behind the burn rate), provider
log scoping, the budget-burn blast-radius call, and the external-create-pending
deadlock recovery (drop the finalizer + direct AWS deletion, never cycle the provider
mid-apply), cross-linked to the hub teardown procedure.

Closes #3.
@github-actions


CI

yamllint + crossplane render passed.

@stxkxs stxkxs merged commit 256144d into main Jun 24, 2026
5 checks passed
@stxkxs stxkxs deleted the vend-failure-runbook branch June 24, 2026 04:07

Development

Successfully merging this pull request may close these issues.

Add a vend-failure runbook + backfill alert runbook_url
