[BUG] SandboxUpdateOps permanently stalls when any sandbox upgrade fails — failed replicas consume maxUnavailable budget forever

**What happened**:

When one or more sandboxes fail during a `SandboxUpdateOps` rolling upgrade, the entire
update operation enters a **permanent deadlock**: no further sandboxes are ever upgraded,
and the operation remains stuck in the `Updating` phase indefinitely — even though many
candidate sandboxes have not yet been touched.

**Root Cause**:

The budget calculation in `pkg/controller/sandboxupdateops/sandboxupdateops_controller.go`
subtracts **both** `updating` (sandboxes actively in-flight) **and** `failed` (sandboxes
in a terminal failure state) from the `maxUnavailable` slot budget:

```go
// line 192
toUpgrade := int(maxConcurrent) - int(updating) - int(failed)
```
The semantic intent of MaxUnavailable is to cap the number of sandboxes that are concurrently in-flight. Failed sandboxes are already in a terminal, non-recovering state — they are no longer being processed. Subtracting them from the budget causes the available upgrade slots to shrink permanently as failures accumulate.

For example, with maxUnavailable: 1 and 1 failed sandbox:
```
maxConcurrent = 1
updating      = 0   (nothing actively in-flight)
failed        = 1   (one terminal failure)
toUpgrade     = 1 - 0 - 1 = 0  ← no new upgrades ever dispatched
```
The inner loop (for i := 0; i < toUpgrade; i++) never executes. No new sandboxes are patched. The reconciler returns without error, so controller-runtime does not requeue with backoff. The operation is silently stuck.

**Compound effect — completion condition is also never satisfied:**
The transition to SandboxUpdateOpsFailed (or SandboxUpdateOpsCompleted) requires:
```go
case updated+failed == total && len(candidates) == 0:
```
Because candidates are never drained (nothing is patched after the first failure), len(candidates) > 0 forever, and the operation never transitions out of Updating phase. The object persists indefinitely with no event or condition emitted to indicate the stall.

**What you expected to happen:**

A SandboxUpdateOps should continue upgrading remaining candidate sandboxes even after some sandboxes fail. The MaxUnavailable budget should govern only the number of upgrades simultaneously in-flight, not the cumulative count of terminal failures. Once all candidates have been processed (either upgraded or failed), the operation should transition to Failed (because failed > 0), reflecting the partial failure clearly.

How to reproduce it:

1. Create 5 standalone Sandbox resources (not controlled by a SandboxSet) labeled app: my-app:
```yaml
apiVersion: agents.kruise.io/v1alpha1
kind: Sandbox
metadata:
  name: sandbox-1        # repeat for sandbox-2 through sandbox-5
  labels:
    app: my-app
spec:
  template:
    spec:
      containers:
      - name: main
        image: busybox:1.35
```
2. Create a SandboxUpdateOps with maxUnavailable: 1 and a lifecycle hook that always fails:
```yaml
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxUpdateOps
metadata:
  name: test-upgrade
spec:
  selector:
    matchLabels:
      app: my-app
  updateStrategy:
    maxUnavailable: 1
  patch:
    spec:
      containers:
      - name: main
        image: busybox:1.36
  lifecycle:
    postUpgrade:
      exec:
        command: ["/bin/sh", "-c", "exit 1"]   # always fails
```
3.Wait for the first sandbox to be patched and fail — observe status.failedReplicas: 1, status.updatingReplicas: 0.
4.Observe that no further sandboxes are ever patched. status.updatedReplicas never increases beyond 0. The operation stays in Updating phase indefinitely.
5.Confirm via controller logs that toUpgrade = 0 is computed on every reconcile cycle with no "Applying patch to sandbox" log line ever appearing for the remaining sandboxes.

**Anything else we need to know?:**

**Suggested Fix:**

Remove failed from the budget subtraction — terminal failures must not hold back in-flight budget slots:
```go
// BEFORE (line 192):
toUpgrade := int(maxConcurrent) - int(updating) - int(failed)

// AFTER:
toUpgrade := int(maxConcurrent) - int(updating)
```
This matches the semantics of MaxUnavailable in Kubernetes RollingUpdate strategies, where failed (terminated) pods do not consume the unavailable budget slot.

If a "stop on first failure" mode is desired in the future, it should be implemented as an explicit FailurePolicy: Abort field on SandboxUpdateOpsStrategy, not via silent budget exhaustion.

**Environment:**

- OpenKruise Agents version: Latest (main branch)
- Kubernetes version: 1.28+
- Component: agent-sandbox-controller (SandboxUpdateOps controller)







Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] SandboxUpdateOps permanently stalls when any sandbox upgrade fails — failed replicas consume maxUnavailable budget forever #489

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[BUG] SandboxUpdateOps permanently stalls when any sandbox upgrade fails — failed replicas consume maxUnavailable budget forever #489

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions