What happened:
When one or more sandboxes fail during a SandboxUpdateOps rolling upgrade, the entire
update operation enters a permanent deadlock: no further sandboxes are ever upgraded,
and the operation remains stuck in the Updating phase indefinitely — even though many
candidate sandboxes have not yet been touched.
Root Cause:
The budget calculation in pkg/controller/sandboxupdateops/sandboxupdateops_controller.go
subtracts both updating (sandboxes actively in-flight) and failed (sandboxes
in a terminal failure state) from the maxUnavailable slot budget:
// line 192
toUpgrade := int(maxConcurrent) - int(updating) - int(failed)
The semantic intent of MaxUnavailable is to cap the number of sandboxes that are concurrently in-flight. Failed sandboxes are already in a terminal, non-recovering state — they are no longer being processed. Subtracting them from the budget causes the available upgrade slots to shrink permanently as failures accumulate.
For example, with maxUnavailable: 1 and 1 failed sandbox:
maxConcurrent = 1
updating = 0 (nothing actively in-flight)
failed = 1 (one terminal failure)
toUpgrade = 1 - 0 - 1 = 0 ← no new upgrades ever dispatched
The inner loop (for i := 0; i < toUpgrade; i++) never executes. No new sandboxes are patched. The reconciler returns without error, so controller-runtime does not requeue with backoff. The operation is silently stuck.
Compound effect — completion condition is also never satisfied:
The transition to SandboxUpdateOpsFailed (or SandboxUpdateOpsCompleted) requires:
case updated+failed == total && len(candidates) == 0:
Because candidates are never drained (nothing is patched after the first failure), len(candidates) > 0 forever, and the operation never transitions out of Updating phase. The object persists indefinitely with no event or condition emitted to indicate the stall.
What you expected to happen:
A SandboxUpdateOps should continue upgrading remaining candidate sandboxes even after some sandboxes fail. The MaxUnavailable budget should govern only the number of upgrades simultaneously in-flight, not the cumulative count of terminal failures. Once all candidates have been processed (either upgraded or failed), the operation should transition to Failed (because failed > 0), reflecting the partial failure clearly.
How to reproduce it:
- Create 5 standalone Sandbox resources (not controlled by a SandboxSet) labeled app: my-app:
apiVersion: agents.kruise.io/v1alpha1
kind: Sandbox
metadata:
name: sandbox-1 # repeat for sandbox-2 through sandbox-5
labels:
app: my-app
spec:
template:
spec:
containers:
- name: main
image: busybox:1.35
- Create a SandboxUpdateOps with maxUnavailable: 1 and a lifecycle hook that always fails:
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxUpdateOps
metadata:
name: test-upgrade
spec:
selector:
matchLabels:
app: my-app
updateStrategy:
maxUnavailable: 1
patch:
spec:
containers:
- name: main
image: busybox:1.36
lifecycle:
postUpgrade:
exec:
command: ["/bin/sh", "-c", "exit 1"] # always fails
3.Wait for the first sandbox to be patched and fail — observe status.failedReplicas: 1, status.updatingReplicas: 0.
4.Observe that no further sandboxes are ever patched. status.updatedReplicas never increases beyond 0. The operation stays in Updating phase indefinitely.
5.Confirm via controller logs that toUpgrade = 0 is computed on every reconcile cycle with no "Applying patch to sandbox" log line ever appearing for the remaining sandboxes.
Anything else we need to know?:
Suggested Fix:
Remove failed from the budget subtraction — terminal failures must not hold back in-flight budget slots:
// BEFORE (line 192):
toUpgrade := int(maxConcurrent) - int(updating) - int(failed)
// AFTER:
toUpgrade := int(maxConcurrent) - int(updating)
This matches the semantics of MaxUnavailable in Kubernetes RollingUpdate strategies, where failed (terminated) pods do not consume the unavailable budget slot.
If a "stop on first failure" mode is desired in the future, it should be implemented as an explicit FailurePolicy: Abort field on SandboxUpdateOpsStrategy, not via silent budget exhaustion.
Environment:
- OpenKruise Agents version: Latest (main branch)
- Kubernetes version: 1.28+
- Component: agent-sandbox-controller (SandboxUpdateOps controller)
What happened:
When one or more sandboxes fail during a
SandboxUpdateOpsrolling upgrade, the entireupdate operation enters a permanent deadlock: no further sandboxes are ever upgraded,
and the operation remains stuck in the
Updatingphase indefinitely — even though manycandidate sandboxes have not yet been touched.
Root Cause:
The budget calculation in
pkg/controller/sandboxupdateops/sandboxupdateops_controller.gosubtracts both
updating(sandboxes actively in-flight) andfailed(sandboxesin a terminal failure state) from the
maxUnavailableslot budget:The semantic intent of MaxUnavailable is to cap the number of sandboxes that are concurrently in-flight. Failed sandboxes are already in a terminal, non-recovering state — they are no longer being processed. Subtracting them from the budget causes the available upgrade slots to shrink permanently as failures accumulate.
For example, with maxUnavailable: 1 and 1 failed sandbox:
The inner loop (for i := 0; i < toUpgrade; i++) never executes. No new sandboxes are patched. The reconciler returns without error, so controller-runtime does not requeue with backoff. The operation is silently stuck.
Compound effect — completion condition is also never satisfied:
The transition to SandboxUpdateOpsFailed (or SandboxUpdateOpsCompleted) requires:
Because candidates are never drained (nothing is patched after the first failure), len(candidates) > 0 forever, and the operation never transitions out of Updating phase. The object persists indefinitely with no event or condition emitted to indicate the stall.
What you expected to happen:
A SandboxUpdateOps should continue upgrading remaining candidate sandboxes even after some sandboxes fail. The MaxUnavailable budget should govern only the number of upgrades simultaneously in-flight, not the cumulative count of terminal failures. Once all candidates have been processed (either upgraded or failed), the operation should transition to Failed (because failed > 0), reflecting the partial failure clearly.
How to reproduce it:
3.Wait for the first sandbox to be patched and fail — observe status.failedReplicas: 1, status.updatingReplicas: 0.
4.Observe that no further sandboxes are ever patched. status.updatedReplicas never increases beyond 0. The operation stays in Updating phase indefinitely.
5.Confirm via controller logs that toUpgrade = 0 is computed on every reconcile cycle with no "Applying patch to sandbox" log line ever appearing for the remaining sandboxes.
Anything else we need to know?:
Suggested Fix:
Remove failed from the budget subtraction — terminal failures must not hold back in-flight budget slots:
This matches the semantics of MaxUnavailable in Kubernetes RollingUpdate strategies, where failed (terminated) pods do not consume the unavailable budget slot.
If a "stop on first failure" mode is desired in the future, it should be implemented as an explicit FailurePolicy: Abort field on SandboxUpdateOpsStrategy, not via silent budget exhaustion.
Environment: