Bare-pod backend: pause_by_id deletes pod, resume_by_id is a no-op — suspended sessions are unrecoverable

## Problem

`SandboxBackend.pause_by_id()` defaults to calling `stop_by_id()`, which deletes the pod entirely. `resume_by_id()` is a no-op (returns `None`). The `KubernetesExecutorBackend` inherits both defaults without overriding either.

Meanwhile, `get_or_spawn()` in `agent.py:846-873` treats `state='suspended'` as resumable. It calls `resume_by_id()` (a no-op), checks the sandbox status (the pod is gone), and raises `RuntimeError("failed to resume suspended sandbox: ...")`. The end user sees this as "Failed to start the agent runtime."

Only the CRD-based `KubernetesAgentSandboxBackend` actually supports pause/resume (replicas 0→1), but there is no guard preventing the suspend codepath from running on backends that don't support it.
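The failure mode can be reproduced with a minimal model of the two defaults. This is a sketch, not the real code: the class bodies, the `spawn()` and `is_running()` helpers, and the `resume_suspended()` function are hypothetical stand-ins; only the `pause_by_id` → `stop_by_id` delegation, the no-op `resume_by_id`, and the error message come from the description above.

```python
class SandboxBackend:
    """Hypothetical stand-in for the base class, modelling only its defaults."""

    def __init__(self) -> None:
        self.pods: set[str] = set()  # stand-in for pods that exist in the cluster

    def spawn(self, sandbox_id: str) -> None:
        self.pods.add(sandbox_id)

    def stop_by_id(self, sandbox_id: str) -> None:
        # Deletes the pod entirely.
        self.pods.discard(sandbox_id)

    def pause_by_id(self, sandbox_id: str) -> None:
        # Default: pausing is the same as stopping.
        self.stop_by_id(sandbox_id)

    def resume_by_id(self, sandbox_id: str) -> None:
        # Default: no-op.
        return None

    def is_running(self, sandbox_id: str) -> bool:
        return sandbox_id in self.pods


class KubernetesExecutorBackend(SandboxBackend):
    """Bare-pod backend: inherits both defaults without overriding either."""


def resume_suspended(backend: SandboxBackend, sandbox_id: str) -> None:
    # Simplified version of the resume branch in get_or_spawn().
    backend.resume_by_id(sandbox_id)
    if not backend.is_running(sandbox_id):
        raise RuntimeError(f"failed to resume suspended sandbox: {sandbox_id}")
```

With this model, `spawn("sb-1")` followed by `pause_by_id("sb-1")` leaves no pod behind, so `resume_suspended(backend, "sb-1")` always raises.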

## Impact

Every deployment using the default bare-pod backend accumulates `suspended` sessions in the DB (we had 2,206) whose pods are already deleted. Every user message to one of those threads fails until the DB row is manually cleaned up.

## Suggested fix

Either:

1. Have `get_or_spawn()` fall through to spawn a new sandbox when resume fails (instead of raising), clearing `agent_thread_id` since the session state died with the pod.
2. Or have `KubernetesExecutorBackend` override `pause_by_id` to set DB state to `gone` instead of `suspended`, so resume is never attempted.
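Option 2 could look roughly like the sketch below. It assumes the backend can write session state itself; the `session_states` dict is a hypothetical stand-in for the DB, and the constructor and `pods` set are illustrative only.

```python
from typing import Dict, Set


class KubernetesExecutorBackend:
    """Hypothetical sketch of option 2: pausing a bare pod records it as gone."""

    def __init__(self, session_states: Dict[str, str]) -> None:
        self.session_states = session_states  # stand-in for the sessions table
        self.pods: Set[str] = set()           # stand-in for live pods

    def stop_by_id(self, sandbox_id: str) -> None:
        self.pods.discard(sandbox_id)

    def pause_by_id(self, sandbox_id: str) -> None:
        # A bare pod cannot be paused, so delete it and record that honestly.
        self.stop_by_id(sandbox_id)
        self.session_states[sandbox_id] = "gone"
```

Because the row never reaches `suspended`, the resume branch in `get_or_spawn()` is not entered for this backend.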

Option 1 is more defensive and handles edge cases in the CRD backend too. We're running it as a hotfix now.
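A minimal sketch of option 1, under stated assumptions: the `Session` dataclass and the `resume`, `is_running`, and `spawn` callables are hypothetical placeholders for whatever `get_or_spawn()` actually uses, and this is not the hotfix as deployed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Session:
    # Hypothetical shape of the DB row; field names other than
    # agent_thread_id and state are assumptions.
    sandbox_id: str
    state: str
    agent_thread_id: Optional[str] = None


def get_or_spawn(
    session: Session,
    resume: Callable[[str], None],
    is_running: Callable[[str], bool],
    spawn: Callable[[], str],
) -> str:
    """Return a usable sandbox id, spawning a fresh one if resume fails."""
    if session.state == "running" and is_running(session.sandbox_id):
        return session.sandbox_id

    if session.state == "suspended":
        resume(session.sandbox_id)
        if is_running(session.sandbox_id):
            session.state = "running"
            return session.sandbox_id
        # Resume failed: the session state died with the pod, so the old
        # agent thread cannot be continued.
        session.agent_thread_id = None

    # Fall through: start a new sandbox instead of raising.
    session.sandbox_id = spawn()
    session.state = "running"
    return session.sandbox_id
```

The fall-through also covers a CRD-backed sandbox whose resume fails for any other reason, which is why this option handles more edge cases than option 2.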

## Relevant code

- `services/api/api/sandbox/base.py:135-141` — base class defaults
- `services/api/api/agent.py:846-873` — resume logic that assumes pod exists
- `services/api/api/sandbox/kubernetes.py` — bare-pod backend, no pause/resume override
