Skip to content

[BUG] Gateway and manager disagree on a paused sandbox's route, and a deleted route can silently come back to life #492

Description

@PRAteek-singHWY

[BUG] Gateway and manager disagree on a paused sandbox's route, and a deleted route can silently come back to life

What's going on

Both the sandbox-gateway and the sandbox-manager keep an in-memory route table that maps a sandbox ID (namespace--name) to its pod IP and current state. Lifecycle events (claim / pause / resume / delete) are pushed between replicas over the memberlist POST /refresh endpoint, and on the gateway side a second writer — a CR-watch controller — also updates the same table from the Sandbox informer. The Envoy Go filter then reads that table and only routes traffic when the entry's State == running.

Reading through this path, three things stand out, and together they let the table drift away from reality.

1. The two /refresh handlers do opposite things with the same message.

The push payload is identical (a proxy.Route with a state field), but the two receivers disagree on what a non-running, non-dead state means:

// pkg/proxy/server.go (manager) — deletes ONLY on "dead", keeps everything else
if route.State == v1alpha1.SandboxStateDead {
    s.DeleteRoute(route.ID)
} else {
    s.SetRoute(ctx, route)            // a {paused} push KEEPS the entry
}
// pkg/sandbox-gateway/server/server.go — keeps ONLY "running", deletes everything else
if route.State == v1alpha1.SandboxStateRunning {
    registry.GetRegistry().Update(route.ID, route)
} else {
    registry.GetRegistry().Delete(route.ID)   // a {paused} push DROPS the entry
}

Pause pushes {state: "paused"} to every peer (pkg/sandbox-manager/api.go), so for the exact same event the manager's table ends up holding a paused entry while the gateway's table has nothing at all. The two data planes are supposed to be mirror images of each other and they aren't. Today the filter happens to reject both "paused" and "missing" with a 502, so it's latent — but the moment either side does anything more nuanced with a non-running route (wake-on-traffic resume, draining, surfacing state to a client), the two planes will behave differently for the same sandbox.

2. Deleting a route throws away the resourceVersion ordering that updates carefully maintain.

Update is written to be safe against out-of-order events — it compares resourceVersion and refuses to let an older event overwrite a newer one:

// pkg/sandbox-gateway/registry/registry.go
func (r *Registry) Update(id string, route proxy.Route) bool {
    // LoadOrStore + CompareAndSwap, guarded by IsResourceVersionNewer(...)
}

func (r *Registry) Delete(id string) {
    r.entries.Delete(id)              // unconditional — no version check at all
}

Delete has no such guard. Once it empties an entry, the next write — no matter how stale — lands as a brand-new first write and wins unconditionally, because there's no longer an older value to compare against. So the careful ordering invariant only holds until the first delete.

That's a problem because the gateway has two writers racing on the same key with no ordering between them: the /refresh peer push and the CR-watch controller (both launched in cmd/sandbox-gateway/main.go). Picture a sandbox that gets deleted:

  1. A {dead} push arrives → the entry is deleted.
  2. The CR-watch controller reconciles from an informer cache that hasn't seen the delete yet → it reads the still-running Sandbox and calls Update. Because the entry was just deleted, this is a first write, so it's accepted unconditionally and the route is resurrected as running, pointing at the old pod IP.

A stale or reordered peer push can do the same thing. The entry self-heals once the informer catches up, but in the meantime the filter will happily route live client traffic to a pod IP that's been freed and possibly handed to a different sandbox. The whole point of the resourceVersion check is to prevent exactly this, and the delete path quietly opts out of it.

3. (Smaller, same neighbourhood) the manager looks routes up by the wrong key when refreshing them.

The manager stores routes under the full sandbox ID (namespace--name), but its periodic refresh loads them back by name only:

// pkg/sandbox-manager/infra/sandboxcr/infra.go
oldRoute, exists := i.Proxy.LoadRoute(sbx.GetName())   // name only
newRoute := proxyutils.DefaultGetRouteFunc(sbx)        // stored under namespace--name
if !exists || newRoute.State != oldRoute.State || newRoute.IP != oldRoute.IP {
    i.Proxy.SetRoute(logs.NewContext(), newRoute)
}

For any non-empty namespace exists is always false, so the "skip if unchanged" shortcut never fires and the compare against oldRoute is meaningless. It's harmless today (it just always writes), but it's the same class of mistake — inconsistent keying of the same table — so it's worth fixing alongside.

Why it actually matters

The serious one is #2: routing real traffic to a dead/recycled pod IP, even briefly, is a correctness and isolation problem (a request meant for sandbox A can reach a pod now belonging to sandbox B). #1 is a latent consistency bug between two data planes that will turn into a real one as soon as non-running routes start being used for anything. Both get worse exactly under churn — high claim/pause/resume/delete rates and multiple replicas — which is when informer caches lag and peer pushes reorder.

How to reproduce

The divergence (#1) is deterministic:

  1. Run a sandbox-manager and a sandbox-gateway in the same memberlist cluster.
  2. Pause a running sandbox.
  3. Inspect both route tables: the manager keeps a paused entry, the gateway has no entry. Same event, different state.

The resurrection (#2) is a race, but a realistic one:

  1. Have a sandbox running in the gateway registry.
  2. Delete it (a {dead} push removes the entry) while the gateway's Sandbox informer cache still shows it running.
  3. Trigger a reconcile of that Sandbox before the cache catches up → the controller re-adds it as running.
  4. The entry is back, pointing at the now-freed pod IP, and the filter routes to it until the cache converges.

Suggested fix

Make the route table the single arbiter and stop the two writers from being able to contradict each other:

  • Give "delete" the same resourceVersion discipline as "update" — keep a short-lived tombstone carrying the resourceVersion at deletion time so a stale write can't resurrect a removed route as a fresh first write. The tombstone expires after a short TTL (by then the cluster has converged) and is swept inline, so there's no new background goroutine to leak.
  • Pull the keep-vs-delete decision into one shared predicate (delete iff state == dead) and use it in both /refresh handlers, so an identical push produces an identical table on both planes. Storing a paused entry on the gateway is harmless — the filter already gates on running — and it lines the two planes back up.
  • Apply the same versioned-delete model to the manager's route store too, since it's the same table and the same bug.
  • Fix the manager's refresh lookup to use the full sandbox ID.

This stays inside the routing/registry code (pkg/proxy, pkg/sandbox-gateway) and the manager's route refresh; it doesn't touch the CRDs or the claim/scale paths.

Scope / is this a duplicate?

This is a standalone, pre-existing correctness bug in the route-table consumer/registry code. It's not tied to any roadmap or feature-project deliverable and isn't driver- or workload-specific.

It's distinct from the already-filed route work: #413/#414 is about how /refresh pushes are sent (connection pooling, retry budget, fan-out on SyncRouteWithPeers) — this is about how the receivers consume those pushes (versioned deletes, the second CR-watch writer, and the divergent keep/delete semantics). #427/#428 is about authenticating the /refresh endpoint, a separate concern in the same handler. I checked the open and closed issues/PRs and found nothing covering the registry delete ordering, the dual-writer resurrection, or the keep/delete divergence. Happy to send a fix.

Environment

  • OpenKruiseAgent version: master
  • Kubernetes version: N/A — code-level defect
  • Affects: any multi-replica sandbox-manager + sandbox-gateway deployment; the resurrection window widens with claim/lifecycle churn and replica count

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions