Problem
SandboxBackend.pause_by_id() defaults to calling stop_by_id(), which deletes the pod entirely. resume_by_id() is a no-op (returns None). The KubernetesExecutorBackend inherits both defaults without overriding either.
Meanwhile, get_or_spawn() in agent.py:846-873 treats state='suspended' as resumable: it calls resume_by_id() (no-op), checks status (pod is gone), and raises RuntimeError("failed to resume suspended sandbox: ...") — surfaced to the end user as "Failed to start the agent runtime."
Only the CRD-based KubernetesAgentSandboxBackend actually supports pause/resume (replicas 0→1), but there is no guard preventing the suspend codepath from running on backends that don't support it.
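The failure mode can be reproduced with a minimal stand-in for the class hierarchy. This is only a sketch: the `pods` set, the `status_by_id` helper, and the `"sbx-1"` id are hypothetical, and the real signatures in base.py may differ. Only the default behavior of `pause_by_id` and `resume_by_id` is taken from the description above.

```python
class SandboxBackend:
    """Simplified stand-in for the base class (assumed shape, not the real code)."""

    def __init__(self):
        # Hypothetical in-memory replacement for the cluster: ids of live pods.
        self.pods = set()

    def stop_by_id(self, sandbox_id):
        # Deletes the pod outright.
        self.pods.discard(sandbox_id)

    def pause_by_id(self, sandbox_id):
        # Default behavior: "pause" is really a stop.
        self.stop_by_id(sandbox_id)

    def resume_by_id(self, sandbox_id):
        # Default behavior: no-op.
        return None

    def status_by_id(self, sandbox_id):
        # Hypothetical helper: None means the pod no longer exists.
        return "running" if sandbox_id in self.pods else None


class KubernetesExecutorBackend(SandboxBackend):
    """Inherits both defaults without overriding either."""


backend = KubernetesExecutorBackend()
backend.pods.add("sbx-1")

backend.pause_by_id("sbx-1")   # the caller would now record state='suspended'
backend.resume_by_id("sbx-1")  # does nothing
status_after_resume = backend.status_by_id("sbx-1")  # pod is gone
```

After the pause call the pod no longer exists, so any later resume attempt has nothing to bring back.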
Impact
Every deployment using the default bare-pod backend accumulates suspended sessions in the DB (we had 2,206) whose pods are already deleted. Every user message to one of those threads fails with that error until the DB row is manually cleaned up.
Suggested fix
Either:
- Have get_or_spawn() fall through to spawn a new sandbox when resume fails (instead of raising), clearing agent_thread_id since the session state died with the pod
- Or have KubernetesExecutorBackend override pause_by_id to set DB state to gone instead of suspended, so resume is never attempted
Option 1 is more defensive and handles edge cases in the CRD backend too. We're running it as a hotfix now.
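A rough sketch of the Option 1 fallback, for illustration only. It is not the actual hotfix: `SessionRow`, `BarePodBackend`, `spawn`, `is_running`, and the simplified `get_or_spawn(row, backend)` signature are all hypothetical stand-ins, and the real code in agent.py will look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionRow:
    # Hypothetical stand-in for the DB row tracking a sandbox session.
    sandbox_id: str
    state: str
    agent_thread_id: Optional[str] = None


class BarePodBackend:
    """Hypothetical backend with the default (no-op) resume behavior."""

    def __init__(self):
        self.pods = set()
        self._counter = 0

    def spawn(self):
        self._counter += 1
        sandbox_id = f"sbx-{self._counter}"
        self.pods.add(sandbox_id)
        return sandbox_id

    def resume_by_id(self, sandbox_id):
        return None  # no-op, as in the base class default

    def is_running(self, sandbox_id):
        return sandbox_id in self.pods


def get_or_spawn(row, backend):
    if row.state == "suspended":
        backend.resume_by_id(row.sandbox_id)
        if backend.is_running(row.sandbox_id):
            row.state = "running"
            return row
        # Resume failed: fall through to a fresh spawn instead of raising.
        # The old session state died with the pod, so drop the thread id.
        row.agent_thread_id = None
    elif row.state == "running" and backend.is_running(row.sandbox_id):
        return row
    row.sandbox_id = backend.spawn()
    row.state = "running"
    return row


backend = BarePodBackend()
row = SessionRow(sandbox_id="sbx-old", state="suspended", agent_thread_id="thread-42")
row = get_or_spawn(row, backend)
```

Because the fallback keys off the post-resume status check rather than the backend type, it also covers a CRD-backed sandbox whose resume fails for some other reason.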
Relevant code
services/api/api/sandbox/base.py:135-141 — base class defaults
services/api/api/agent.py:846-873 — resume logic that assumes pod exists
services/api/api/sandbox/kubernetes.py — bare-pod backend, no pause/resume override