docs(pir): AWX pod stuck Pending — Calico RBAC gap + dqlite write storm (PGM-181)#6

Merged: pgmac merged 1 commit into main from pgm-181-pir-awx-pod-pending-calico-rbac on May 14, 2026


@pgmac pgmac commented May 14, 2026

Summary

  • Adds PIR for 2026-05-15 incident where AWX job pods were repeatedly stuck Pending due to a Calico RBAC gap amplifying dqlite write pressure
  • Updates incident index with new entry

What's in the PIR

  • Full AEST timeline from pod creation through monthly maintenance completion
  • Infinite Hows root-cause chain (8 levels deep) from "AWX job failed" to "RBAC gap existed since cluster creation with no validation"
  • Impact table covering 4 prior failing AWX jobs (1859, 1861, 1863, 1865)
  • 8 action items covering alerting gaps, GitOps fix commit, microk8s upgrade, and Calico IPAM GC
  • Key surprise findings including why rolling-restart maintenance cannot compact a fragmented dqlite SQLite database

Related

  • Linear: PGM-181
  • Related PIRs: 2026-04-02-dqlite-snapshot-crash-loop-watch-stream-failure, 2026-04-12-pvek8s-dqlite-quorum-loss-complete-cluster-outage

🤖 Generated with Claude Code

Calico RBAC gap (missing list/watch on pods, absent workloadendpoints)
caused kube-controllers to crash-loop every 5s, amplifying dqlite write
pressure to the point that 225MB snapshot locks disrupted the scheduler
informer watch stream. Four AWX jobs failed before investigation.
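
The RBAC gap described above (no list/watch on pods, no workloadendpoints rule) is the kind of thing a small pre-flight check could flag. The following is a minimal sketch, not the fix from this PR: `REQUIRED`, `granted_verbs`, and `missing_permissions` are hypothetical names, the verbs assumed for `workloadendpoints` are a guess, and the rules are assumed to be already parsed from a ClusterRole manifest into plain dicts.

```python
# Hypothetical requirement set, for illustration only. The commit message names
# list/watch on pods and the workloadendpoints resource; the verbs listed for
# workloadendpoints below are an assumption, not taken from the PIR.
REQUIRED = {
    ("", "pods"): {"list", "watch"},
    ("crd.projectcalico.org", "workloadendpoints"): {"list", "watch"},
}


def granted_verbs(rules, api_group, resource):
    """Collect the verbs that `rules` grant on one (apiGroup, resource) pair."""
    verbs = set()
    for rule in rules:
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        group_ok = api_group in groups or "*" in groups
        resource_ok = resource in resources or "*" in resources
        if group_ok and resource_ok:
            verbs.update(rule.get("verbs", []))
    return verbs


def missing_permissions(rules, required=REQUIRED):
    """Return {(apiGroup, resource): missing verbs} for every unmet requirement."""
    missing = {}
    for (group, resource), needed in required.items():
        have = granted_verbs(rules, group, resource)
        if "*" in have:
            continue  # a wildcard verb covers everything we ask for
        gap = needed - have
        if gap:
            missing[(group, resource)] = gap
    return missing


if __name__ == "__main__":
    # Example rules shaped like the `rules:` list of a ClusterRole manifest.
    example_rules = [
        {"apiGroups": [""], "resources": ["pods"], "verbs": ["get"]},
    ]
    for (group, resource), verbs in missing_permissions(example_rules).items():
        print(f"{group or 'core'}/{resource}: missing {sorted(verbs)}")
```

A check like this could run in CI against the GitOps manifests, which would address the "no validation" step at the end of the root-cause chain. The exact permissions kube-controllers needs should be taken from the Calico manifests for the deployed version rather than from this sketch.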

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pgmac pgmac merged commit 0ccf520 into main May 14, 2026
2 checks passed
@pgmac pgmac deleted the pgm-181-pir-awx-pod-pending-calico-rbac branch May 14, 2026 22:51