
Bare-pod backend: pause_by_id deletes pod, resume_by_id is a no-op — suspended sessions are unrecoverable #211


@j-s

Problem

SandboxBackend.pause_by_id() defaults to calling stop_by_id(), which deletes the pod entirely. resume_by_id() is a no-op (returns None). The KubernetesExecutorBackend inherits both defaults without overriding either.
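To make the failure concrete, here is a minimal, self-contained sketch of the inherited defaults. It is not the code from base.py: the in-memory `pods` set and the `spawn()` and `exists()` helpers are hypothetical stand-ins for the real Kubernetes calls, and only the `pause_by_id` / `stop_by_id` / `resume_by_id` relationship mirrors what is described above.

```python
from __future__ import annotations


class SandboxBackend:
    """Hypothetical stand-in for the base class, not the real base.py."""

    def __init__(self) -> None:
        # Assumption: an in-memory set stands in for pods in the cluster.
        self.pods: set[str] = set()

    def spawn(self, sandbox_id: str) -> None:
        # Hypothetical helper: pretend a pod was created.
        self.pods.add(sandbox_id)

    def exists(self, sandbox_id: str) -> bool:
        # Hypothetical helper: pretend to query pod status.
        return sandbox_id in self.pods

    def stop_by_id(self, sandbox_id: str) -> None:
        # Deletes the pod entirely.
        self.pods.discard(sandbox_id)

    def pause_by_id(self, sandbox_id: str) -> None:
        # Default behaviour described in this issue: pause is just stop.
        self.stop_by_id(sandbox_id)

    def resume_by_id(self, sandbox_id: str) -> None:
        # Default behaviour described in this issue: no-op.
        return None


class KubernetesExecutorBackend(SandboxBackend):
    """Bare-pod backend: inherits both defaults, overrides neither."""


backend = KubernetesExecutorBackend()
backend.spawn("sbx-1")
backend.pause_by_id("sbx-1")   # pod is deleted here
backend.resume_by_id("sbx-1")  # nothing happens
pod_survived = backend.exists("sbx-1")
```

After the pause/resume round trip `pod_survived` is `False`, which is the state `get_or_spawn()` then trips over.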

Meanwhile, get_or_spawn() in agent.py:846-873 treats state='suspended' as resumable: it calls resume_by_id() (no-op), checks status (pod is gone), and raises RuntimeError("failed to resume suspended sandbox: ...") — surfaced to the end user as "Failed to start the agent runtime."

Only the CRD-based KubernetesAgentSandboxBackend actually supports pause/resume (replicas 0→1), but there is no guard preventing the suspend codepath from running on backends that don't support it.
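One possible shape for the missing guard, as a sketch only: a capability flag on the backend that the suspend codepath checks before writing `suspended` to the DB. The `supports_pause` attribute and the `state_after_idle()` helper are hypothetical names that do not exist in the codebase; the class names are reused from above purely for illustration.

```python
class SandboxBackend:
    # Hypothetical capability flag; backends that cannot resume leave it False.
    supports_pause = False


class KubernetesExecutorBackend(SandboxBackend):
    """Bare-pod backend: pause deletes the pod, so it cannot resume."""


class KubernetesAgentSandboxBackend(SandboxBackend):
    """CRD-based backend: pause/resume scales replicas between 0 and 1."""

    supports_pause = True


def state_after_idle(backend: SandboxBackend) -> str:
    """Pick the DB state to record when an idle sandbox is paused.

    Hypothetical helper: only backends that can really resume are
    allowed to leave a session in the 'suspended' state.
    """
    return "suspended" if backend.supports_pause else "gone"
```

With a check like this, the bare-pod backend would record `gone` and the resume branch would never run for it.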

Impact

Every deployment using the default bare-pod backend accumulates suspended sessions in the DB (we had 2,206) whose pods have already been deleted. Any user message to one of those threads fails on every attempt until the DB row is manually cleaned up.

Suggested fix

Either:

  1. Have get_or_spawn() fall through to spawn a new sandbox when resume fails (instead of raising), clearing agent_thread_id since the session state died with the pod
  2. Have KubernetesExecutorBackend override pause_by_id to set the DB state to gone instead of suspended, so resume is never attempted

Option 1 is more defensive and handles edge cases in the CRD backend too. We're running it as a hotfix now.
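A rough sketch of what option 1 could look like. This is not the actual hotfix or the real `get_or_spawn()` from agent.py: the `Session` dataclass, the `BarePodBackend` stand-in, and its `is_running()` and `spawn()` methods are all hypothetical, chosen only to show the fall-through instead of the `RuntimeError`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    # Hypothetical shape of the DB row; only the fields needed here.
    sandbox_id: str
    state: str
    agent_thread_id: Optional[str]


class BarePodBackend:
    """Hypothetical stand-in for a backend whose resume is a no-op."""

    def __init__(self) -> None:
        self.running = set()
        self._counter = 0

    def resume_by_id(self, sandbox_id: str) -> None:
        return None  # no-op, as in the inherited default

    def is_running(self, sandbox_id: str) -> bool:
        return sandbox_id in self.running

    def spawn(self) -> str:
        self._counter += 1
        sandbox_id = f"sbx-{self._counter}"
        self.running.add(sandbox_id)
        return sandbox_id


def get_or_spawn(session: Session, backend: BarePodBackend) -> Session:
    if session.state == "suspended":
        backend.resume_by_id(session.sandbox_id)
        if backend.is_running(session.sandbox_id):
            session.state = "running"
            return session
        # Resume failed: fall through to a fresh spawn instead of raising.
        # The agent session state died with the pod, so drop the thread id.
        session.agent_thread_id = None
    elif session.state == "running" and backend.is_running(session.sandbox_id):
        return session

    session.sandbox_id = backend.spawn()
    session.state = "running"
    return session


stale = Session(sandbox_id="sbx-old", state="suspended", agent_thread_id="thread-42")
recovered = get_or_spawn(stale, BarePodBackend())
```

Here `recovered` ends up in state `running` on a new sandbox with `agent_thread_id` cleared, so the user gets a working (if fresh) session rather than a permanent error.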

Relevant code

  • services/api/api/sandbox/base.py:135-141 — base class defaults
  • services/api/api/agent.py:846-873 — resume logic that assumes pod exists
  • services/api/api/sandbox/kubernetes.py — bare-pod backend, no pause/resume override
