[BUG] Gateway and manager disagree on a paused sandbox's route, and a deleted route can silently come back to life

# [BUG] Gateway and manager disagree on a paused sandbox's route, and a deleted route can silently come back to life

**What's going on**

Both the `sandbox-gateway` and the `sandbox-manager` keep an in-memory route table that maps a sandbox ID (`namespace--name`) to its pod IP and current state. Lifecycle events (claim / pause / resume / delete) are pushed between replicas over the memberlist `POST /refresh` endpoint, and on the gateway side a second writer — a CR-watch controller — also updates the same table from the Sandbox informer. The Envoy Go filter then reads that table and only routes traffic when the entry's `State == running`.

Reading through this path, three things stand out, and together they let the table drift away from reality.

**1. The two `/refresh` handlers do opposite things with the same message.**

The push payload is identical (a `proxy.Route` with a `state` field), but the two receivers disagree on what a non-running, non-dead state means:

```go
// pkg/proxy/server.go (manager) — deletes ONLY on "dead", keeps everything else
if route.State == v1alpha1.SandboxStateDead {
    s.DeleteRoute(route.ID)
} else {
    s.SetRoute(ctx, route)            // a {paused} push KEEPS the entry
}
```

```go
// pkg/sandbox-gateway/server/server.go — keeps ONLY "running", deletes everything else
if route.State == v1alpha1.SandboxStateRunning {
    registry.GetRegistry().Update(route.ID, route)
} else {
    registry.GetRegistry().Delete(route.ID)   // a {paused} push DROPS the entry
}
```

Pause pushes `{state: "paused"}` to every peer (`pkg/sandbox-manager/api.go`), so for the exact same event the manager's table ends up holding a `paused` entry while the gateway's table has nothing at all. The two data planes are supposed to be mirror images of each other and they aren't. Today the filter happens to reject both "paused" and "missing" with a 502, so it's latent — but the moment either side does anything more nuanced with a non-running route (wake-on-traffic resume, draining, surfacing state to a client), the two planes will behave differently for the same sandbox.

**2. Deleting a route throws away the resourceVersion ordering that updates carefully maintain.**

`Update` is written to be safe against out-of-order events — it compares `resourceVersion` and refuses to let an older event overwrite a newer one:

```go
// pkg/sandbox-gateway/registry/registry.go
func (r *Registry) Update(id string, route proxy.Route) bool {
    // LoadOrStore + CompareAndSwap, guarded by IsResourceVersionNewer(...)
}

func (r *Registry) Delete(id string) {
    r.entries.Delete(id)              // unconditional — no version check at all
}
```

`Delete` has no such guard. Once it empties an entry, the *next* write — no matter how stale — lands as a brand-new first write and wins unconditionally, because there's no longer an older value to compare against. So the careful ordering invariant only holds until the first delete.

That's a problem because the gateway has **two** writers racing on the same key with no ordering between them: the `/refresh` peer push and the CR-watch controller (both launched in `cmd/sandbox-gateway/main.go`). Picture a sandbox that gets deleted:

1. A `{dead}` push arrives → the entry is deleted.
2. The CR-watch controller reconciles from an informer cache that hasn't seen the delete yet → it reads the still-`running` Sandbox and calls `Update`. Because the entry was just deleted, this is a *first write*, so it's accepted unconditionally and the route is **resurrected as `running`**, pointing at the old pod IP.

A stale or reordered peer push can do the same thing. The entry self-heals once the informer catches up, but in the meantime the filter will happily route live client traffic to a pod IP that's been freed and possibly handed to a different sandbox. The whole point of the `resourceVersion` check is to prevent exactly this, and the delete path quietly opts out of it.

**3. (Smaller, same neighbourhood) the manager looks routes up by the wrong key when refreshing them.**

The manager stores routes under the full sandbox ID (`namespace--name`), but its periodic refresh loads them back by name only:

```go
// pkg/sandbox-manager/infra/sandboxcr/infra.go
oldRoute, exists := i.Proxy.LoadRoute(sbx.GetName())   // name only
newRoute := proxyutils.DefaultGetRouteFunc(sbx)        // stored under namespace--name
if !exists || newRoute.State != oldRoute.State || newRoute.IP != oldRoute.IP {
    i.Proxy.SetRoute(logs.NewContext(), newRoute)
}
```

For any non-empty namespace `exists` is always `false`, so the "skip if unchanged" shortcut never fires and the compare against `oldRoute` is meaningless. It's harmless today (it just always writes), but it's the same class of mistake — inconsistent keying of the same table — so it's worth fixing alongside.

**Why it actually matters**

The serious one is #2: routing real traffic to a dead/recycled pod IP, even briefly, is a correctness and isolation problem (a request meant for sandbox A can reach a pod now belonging to sandbox B). #1 is a latent consistency bug between two data planes that will turn into a real one as soon as non-running routes start being used for anything. Both get worse exactly under churn — high claim/pause/resume/delete rates and multiple replicas — which is when informer caches lag and peer pushes reorder.

**How to reproduce**

The divergence (#1) is deterministic:

1. Run a sandbox-manager and a sandbox-gateway in the same memberlist cluster.
2. Pause a running sandbox.
3. Inspect both route tables: the manager keeps a `paused` entry, the gateway has no entry. Same event, different state.

The resurrection (#2) is a race, but a realistic one:

1. Have a sandbox `running` in the gateway registry.
2. Delete it (a `{dead}` push removes the entry) while the gateway's Sandbox informer cache still shows it `running`.
3. Trigger a reconcile of that Sandbox before the cache catches up → the controller re-adds it as `running`.
4. The entry is back, pointing at the now-freed pod IP, and the filter routes to it until the cache converges.

**Suggested fix**

Make the route table the single arbiter and stop the two writers from being able to contradict each other:

- Give "delete" the same `resourceVersion` discipline as "update" — keep a short-lived tombstone carrying the resourceVersion at deletion time so a stale write can't resurrect a removed route as a fresh first write. The tombstone expires after a short TTL (by then the cluster has converged) and is swept inline, so there's no new background goroutine to leak.
- Pull the keep-vs-delete decision into one shared predicate (`delete iff state == dead`) and use it in *both* `/refresh` handlers, so an identical push produces an identical table on both planes. Storing a `paused` entry on the gateway is harmless — the filter already gates on `running` — and it lines the two planes back up.
- Apply the same versioned-delete model to the manager's route store too, since it's the same table and the same bug.
- Fix the manager's refresh lookup to use the full sandbox ID.

This stays inside the routing/registry code (`pkg/proxy`, `pkg/sandbox-gateway`) and the manager's route refresh; it doesn't touch the CRDs or the claim/scale paths.

**Scope / is this a duplicate?**

This is a standalone, pre-existing correctness bug in the route-table consumer/registry code. It's not tied to any roadmap or feature-project deliverable and isn't driver- or workload-specific.

It's distinct from the already-filed route work: #413/#414 is about *how* `/refresh` pushes are sent (connection pooling, retry budget, fan-out on `SyncRouteWithPeers`) — this is about how the receivers *consume* those pushes (versioned deletes, the second CR-watch writer, and the divergent keep/delete semantics). #427/#428 is about authenticating the `/refresh` endpoint, a separate concern in the same handler. I checked the open and closed issues/PRs and found nothing covering the registry delete ordering, the dual-writer resurrection, or the keep/delete divergence. Happy to send a fix.

**Environment**

- OpenKruiseAgent version: `master`
- Kubernetes version: N/A — code-level defect
- Affects: any multi-replica sandbox-manager + sandbox-gateway deployment; the resurrection window widens with claim/lifecycle churn and replica count


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] Gateway and manager disagree on a paused sandbox's route, and a deleted route can silently come back to life #492

[BUG] Gateway and manager disagree on a paused sandbox's route, and a deleted route can silently come back to life

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[BUG] Gateway and manager disagree on a paused sandbox's route, and a deleted route can silently come back to life #492

Description

[BUG] Gateway and manager disagree on a paused sandbox's route, and a deleted route can silently come back to life

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions