Stale MongoDB state causes ghost reconciliation after InputManifest deletion #2009

@samuelstolicny

Problem

When an InputManifest fails during Terraform apply (e.g., due to Azure VM quota errors), the subsequent destroy also fails. Deleting the InputManifest from Kubernetes does not clean up the corresponding MongoDB documents, causing Claudie to continue reconciling (creating/destroying infrastructure) from stale state even though the InputManifest no longer exists.

Observed behavior

  1. Applied an InputManifest with a machine type that exceeded the Azure quota; Terraform apply failed
  2. Terraform destroy also failed
  3. Deleted the InputManifest from Kubernetes; the deletion hung on the v1beta1.claudie.io/finalizer
  4. Removed the finalizer to force deletion, but the operator re-added it before the object was garbage collected, so a second patch was required
  5. After the InputManifest was fully removed from Kubernetes, Claudie continued to provision/destroy infrastructure on Azure because MongoDB still held in-flight task documents
  6. Editing and re-applying the InputManifest with a corrected machine type resulted in both the old and new manifests being processed simultaneously

Root cause

The MongoDB inputManifests collection retained documents with active inFlight tasks (CREATE and DELETE events) even after the corresponding Kubernetes InputManifest resources were deleted. The Claudie services continued processing these stale documents.
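To illustrate the mismatch, here is a minimal Go sketch of how stale documents could be detected by comparing the names of InputManifests that exist in the cluster against the names stored in the database. The function name `staleDocuments`, the example names, and the idea that documents can be matched to manifests by a plain name string are all assumptions made for illustration; this is not Claudie's actual code or schema.

```go
package main

import (
	"fmt"
	"sort"
)

// staleDocuments is a hypothetical helper: it returns every name present in
// the database that has no matching InputManifest in the cluster.
// Matching by a plain name string is an assumption, not Claudie's schema.
func staleDocuments(inCluster, inDB []string) []string {
	// Build a set of the manifests that still exist in Kubernetes.
	live := make(map[string]struct{}, len(inCluster))
	for _, name := range inCluster {
		live[name] = struct{}{}
	}

	// Anything in the database that is not in the set is an orphan.
	var stale []string
	for _, name := range inDB {
		if _, ok := live[name]; !ok {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale) // deterministic order for logging and comparison
	return stale
}

func main() {
	// Illustrative names only.
	inCluster := []string{"prod-manifest"}
	inDB := []string{"prod-manifest", "azure-quota-test"}
	fmt.Println(staleDocuments(inCluster, inDB))
}
```

A check of this kind would let a cleanup step target only orphaned documents rather than the whole collection.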

Expected behavior

  • Deleting an InputManifest from Kubernetes should clean up (or mark as cancelled) the corresponding MongoDB documents
  • The operator should not re-add a finalizer to a resource that already has a deletionTimestamp
  • Editing an InputManifest spec should update in-place, not result in duplicate processing

Workaround

  1. Remove the finalizer from the InputManifest (may need to be run twice if the operator re-adds it):

    kubectl patch inputmanifest <name> -n <namespace> --type=json -p='[{"op": "remove", "path": "/metadata/finalizers"}]'

  2. Clean up stale documents from MongoDB. Note that deleteMany({}) with an empty filter removes every document in the inputManifests collection, including documents for InputManifests that still exist:

    kubectl exec -n claudie <mongodb-pod> -- mongosh "mongodb://<user>:<pass>@localhost:27017/claudie?authSource=admin" --quiet --eval 'db.inputManifests.deleteMany({})'

  3. Optionally restart the Claudie deployments to flush in-memory state.

Impact

Claudie silently continues to provision/destroy cloud resources after InputManifest deletion, leading to unexpected costs and resource creation that the user believes they have stopped.

Environment

  • Cloud provider: Azure
  • Failure trigger: VM quota exceeded for specified machine type
