Skip to content

[BUG] SandboxUpdateOps permanently stalls when any sandbox upgrade fails — failed replicas consume maxUnavailable budget forever #489

Description

@rakshaak29

What happened:

When one or more sandboxes fail during a SandboxUpdateOps rolling upgrade, the entire
update operation enters a permanent deadlock: no further sandboxes are ever upgraded,
and the operation remains stuck in the Updating phase indefinitely — even though many
candidate sandboxes have not yet been touched.

Root Cause:

The budget calculation in pkg/controller/sandboxupdateops/sandboxupdateops_controller.go
subtracts both updating (sandboxes actively in-flight) and failed (sandboxes
in a terminal failure state) from the maxUnavailable slot budget:

// line 192
toUpgrade := int(maxConcurrent) - int(updating) - int(failed)

The semantic intent of MaxUnavailable is to cap the number of sandboxes that are concurrently in-flight. Failed sandboxes are already in a terminal, non-recovering state — they are no longer being processed. Subtracting them from the budget causes the available upgrade slots to shrink permanently as failures accumulate.

For example, with maxUnavailable: 1 and 1 failed sandbox:

maxConcurrent = 1
updating      = 0   (nothing actively in-flight)
failed        = 1   (one terminal failure)
toUpgrade     = 1 - 0 - 1 = 0  ← no new upgrades ever dispatched

The inner loop (for i := 0; i < toUpgrade; i++) never executes. No new sandboxes are patched. The reconciler returns without error, so controller-runtime does not requeue with backoff. The operation is silently stuck.

Compound effect — completion condition is also never satisfied:
The transition to SandboxUpdateOpsFailed (or SandboxUpdateOpsCompleted) requires:

case updated+failed == total && len(candidates) == 0:

Because candidates are never drained (nothing is patched after the first failure), len(candidates) > 0 forever, and the operation never transitions out of Updating phase. The object persists indefinitely with no event or condition emitted to indicate the stall.

What you expected to happen:

A SandboxUpdateOps should continue upgrading remaining candidate sandboxes even after some sandboxes fail. The MaxUnavailable budget should govern only the number of upgrades simultaneously in-flight, not the cumulative count of terminal failures. Once all candidates have been processed (either upgraded or failed), the operation should transition to Failed (because failed > 0), reflecting the partial failure clearly.

How to reproduce it:

  1. Create 5 standalone Sandbox resources (not controlled by a SandboxSet) labeled app: my-app:
apiVersion: agents.kruise.io/v1alpha1
kind: Sandbox
metadata:
  name: sandbox-1        # repeat for sandbox-2 through sandbox-5
  labels:
    app: my-app
spec:
  template:
    spec:
      containers:
      - name: main
        image: busybox:1.35
  1. Create a SandboxUpdateOps with maxUnavailable: 1 and a lifecycle hook that always fails:
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxUpdateOps
metadata:
  name: test-upgrade
spec:
  selector:
    matchLabels:
      app: my-app
  updateStrategy:
    maxUnavailable: 1
  patch:
    spec:
      containers:
      - name: main
        image: busybox:1.36
  lifecycle:
    postUpgrade:
      exec:
        command: ["/bin/sh", "-c", "exit 1"]   # always fails

3.Wait for the first sandbox to be patched and fail — observe status.failedReplicas: 1, status.updatingReplicas: 0.
4.Observe that no further sandboxes are ever patched. status.updatedReplicas never increases beyond 0. The operation stays in Updating phase indefinitely.
5.Confirm via controller logs that toUpgrade = 0 is computed on every reconcile cycle with no "Applying patch to sandbox" log line ever appearing for the remaining sandboxes.

Anything else we need to know?:

Suggested Fix:

Remove failed from the budget subtraction — terminal failures must not hold back in-flight budget slots:

// BEFORE (line 192):
toUpgrade := int(maxConcurrent) - int(updating) - int(failed)

// AFTER:
toUpgrade := int(maxConcurrent) - int(updating)

This matches the semantics of MaxUnavailable in Kubernetes RollingUpdate strategies, where failed (terminated) pods do not consume the unavailable budget slot.

If a "stop on first failure" mode is desired in the future, it should be implemented as an explicit FailurePolicy: Abort field on SandboxUpdateOpsStrategy, not via silent budget exhaustion.

Environment:

  • OpenKruise Agents version: Latest (main branch)
  • Kubernetes version: 1.28+
  • Component: agent-sandbox-controller (SandboxUpdateOps controller)

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions