browser-session pod stays Running indefinitely after node process exits — entrypoint keeps container alive forever

## Summary

`vexa/services/vexa-bot/core/entrypoint.sh` (browser-session-mode branch) intentionally keeps the container alive forever after the `node dist/docker.js` process exits, with no time limit. K8s sees the pod as `Running` indefinitely. From a user perspective, "stop session" appears to do nothing because the pod never terminates.

## Reproduce

1. Dispatch a browser-session via runtime-api
2. Stop the session via the dashboard (or any other path that signals graceful stop)
3. The bot's `node` process exits cleanly with code 0
4. Observe: pod remains `Running` in K8s indefinitely

In a recent dogfood run, **two browser-session pods were `Running` for 6+ hours** after their `node` processes exited at `07:13:39`. From the user's perspective the stop button never worked.

## Root cause

`vexa/services/vexa-bot/core/entrypoint.sh` (last section of the browser-session-mode branch):

```bash
# Run node and keep container alive even if node crashes
echo "[entrypoint] Starting browser session node process..."
node dist/docker.js
EXIT_CODE=$?
echo "[entrypoint] node dist/docker.js exited with code $EXIT_CODE — keeping container alive for VNC access"
# Keep alive so VNC + websockify remain accessible
wait
```

`wait` (no args) blocks on **all** background children. Earlier in the script, browser-session mode starts:

- `Xvfb` (background)
- `x11vnc` (background, runs forever)
- `websockify` (background, runs forever)
- `socat` CDP proxy (background)
- `xdotool` listener (background)
- `sshd` (background)

So `wait` blocks indefinitely on processes that never exit. The container never exits. The pod stays `Running` until something external kills it.
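
The blocking behavior is easy to reproduce outside the container. The sketch below is not the real entrypoint: `sleep 2 &` stands in for a long-lived helper such as `x11vnc` (which would never exit), `true` stands in for `node dist/docker.js`, and the names `HELPER_PID` and `WAITED` are illustrative only.

```shell
#!/usr/bin/env bash
# Stand-in for a background helper such as x11vnc.
# It lives for 2 seconds here; the real helpers run forever.
sleep 2 &
HELPER_PID=$!

# Stand-in for `node dist/docker.js`: a foreground process that exits cleanly.
true
EXIT_CODE=$?

START=$(date +%s)
wait    # no arguments: returns only once every background child has exited
WAITED=$(( $(date +%s) - START ))

echo "foreground exited with code $EXIT_CODE, but wait held the script for ~${WAITED}s (helper PID $HELPER_PID)"
```

With a helper that never exits, the same `wait` never returns, which matches the behavior seen in the pod.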

## Pod log (from the affected session)

```
25/04/2026 07:13:39  TOTALS              :     73 |       722/      722 (  0.0%)
25/04/2026 07:13:39 destroyed xdamage object: 0x200040
[browser-session] Saving browser data...
[s3-sync] S3 save (auth-essential files only)...
[s3-sync] Uploaded 13 auth-essential items
[browser-session] Save complete
[entrypoint] node dist/docker.js exited with code 0 — keeping container alive for VNC access
```

After this last line: pod stays Running for 6+ hours. Memory keeps being held; node capacity keeps being consumed; the user sees their session as "still active" even though the bot has finished.

## Impact

1. **Lifecycle correctness** — "stop session" doesn't actually stop the pod. Users see ghost sessions in any UI that lists active runtime-managed pods.
2. **Cost** — pods consume node resources for hours after their work is done. On managed-K8s with paid nodes, this is real money.
3. **Capacity exhaustion** — capacity-reserve / autoscaler accounting treats orphan pods as active. Cluster scales up to handle "load" that isn't real.
4. **SoC blast-radius** — orphan pods on the wrong node pool stay orphan forever. Hard to clean up without manual intervention.

## Proposed fixes (pick one)

### Option A — Time-bound the keep-alive (minimal change)

```bash
node dist/docker.js
EXIT_CODE=$?
KEEPALIVE="${VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS:-0}"
echo "[entrypoint] node exited with code $EXIT_CODE; keep-alive for ${KEEPALIVE}s"
if [ "$KEEPALIVE" -gt 0 ]; then
    # `wait` is a shell builtin, so `timeout N wait` cannot block on this
    # shell's children. Sleep for the bounded window instead; the background
    # VNC helpers keep running until the entrypoint exits.
    sleep "$KEEPALIVE" || true
fi
exit "$EXIT_CODE"
```

- Default `0` = exit immediately when node exits (correct lifecycle)
- Operators who want VNC-after-exit set the env var to a finite N seconds
- No configuration change required: existing deployments pick up the corrected default, and anyone relying on VNC-after-exit opts in with a finite value
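
The intended semantics of Option A (bounded window, node's exit code propagated) can be checked locally with stand-ins. This is a simulation under assumptions, not the actual entrypoint: `simulate_entrypoint_tail` and `FAKE_NODE_EXIT` are hypothetical names, `sleep 60 &` stands in for the VNC helpers, and `sh -c "exit N"` stands in for `node dist/docker.js`.

```shell
#!/usr/bin/env bash
# Simulates the tail of the entrypoint. The function body runs in a
# subshell (note the parentheses), so `exit` ends only the simulation.
simulate_entrypoint_tail() (
    keepalive="${VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS:-0}"

    sleep 60 &                           # stand-in for x11vnc / websockify
    helper=$!

    sh -c "exit ${FAKE_NODE_EXIT:-0}"    # stand-in for `node dist/docker.js`
    exit_code=$?

    if [ "$keepalive" -gt 0 ]; then
        sleep "$keepalive" || true       # bounded keep-alive window
    fi

    # In a real container the helpers die when PID 1 exits; the simulation
    # has to stop its stand-in helper explicitly.
    kill "$helper" 2>/dev/null
    exit "$exit_code"
)
```

Calling it with `VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS=1 FAKE_NODE_EXIT=3 simulate_entrypoint_tail` should return after roughly one second with status 3; with the variable unset it should return immediately.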

### Option B — Drop the keep-alive entirely (simplest)

```bash
node dist/docker.js
exit $?
```

VNC-after-exit was always a debug convenience; removing it forces the user to grab a VNC session BEFORE the bot exits. Loses a feature, gains correctness.

### Option C — runtime-api drives termination (more invasive)

Have runtime-api explicitly delete the pod when it observes "node exited" via the watch channel. Fixes the lifecycle ownership question (currently nothing owns the "kill the pod when work is done" responsibility).

## Recommendation

**Option A.** Preserves the debug feature for those who use it, defaults to correct behavior, single env var.

## Related

- Surfaced via vexa-platform release-005 dogfood 2026-04-25 (release-006 harness work).
- Platform-side DoD that would catch this class of bug: in [vexa-platform's `feature-bot-lifecycle/dods.yaml`](https://github.com/Vexa-ai/vexa-platform/blob/main/release/capabilities/feature-bot-lifecycle/dods.yaml), add a DoD `browser-session-pod-terminates-when-stopped` that queries K8s for any pod with label `runtime.profile=browser-session` and age > 2× max-session-length, and asserts that no such pods exist.
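
One way the age check behind such a DoD could be sketched in shell is shown below. Everything here is an assumption for illustration: `stale_pods` is a hypothetical helper, `7200` is a placeholder for 2× max-session-length, GNU `date` is required for `date -d`, and the `kubectl` invocation in the comment assumes the caller has read access to the relevant namespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<pod-name><TAB><creationTimestamp>" lines on
# stdin and prints the names of pods older than the given number of seconds.
stale_pods() {
    local max_age="$1" now name created age tab
    tab=$(printf '\t')
    now=$(date -u +%s)
    while IFS="$tab" read -r name created; do
        [ -n "$name" ] || continue
        # GNU date parses the RFC 3339 timestamps that Kubernetes emits.
        age=$(( now - $(date -u -d "$created" +%s) ))
        if [ "$age" -gt "$max_age" ]; then
            echo "$name"
        fi
    done
}

# Possible usage against a cluster (7200 is a placeholder threshold):
#   kubectl get pods -l runtime.profile=browser-session \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.creationTimestamp}{"\n"}{end}' \
#     | stale_pods 7200
# The DoD would pass only if the output is empty.
```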
