browser-session pod stays Running indefinitely after node process exits — entrypoint keeps container alive forever #259

@DmitriyG228

Summary

vexa/services/vexa-bot/core/entrypoint.sh (browser-session-mode branch) intentionally keeps the container alive forever after the node dist/docker.js process exits, with no time limit. K8s sees the pod as Running indefinitely. From a user perspective, "stop session" appears to do nothing because the pod never terminates.

Reproduce

  1. Dispatch a browser-session via runtime-api
  2. Stop the session via the dashboard (or any other path that signals graceful stop)
  3. The bot's node process exits cleanly with code 0
  4. Observe: pod remains Running in K8s indefinitely

In a recent dogfood run, two browser-session pods were Running for 6+ hours after their node processes exited at 07:13:39. From the user's perspective the stop button never worked.

Root cause

vexa/services/vexa-bot/core/entrypoint.sh (last section of the browser-session-mode branch):

# Run node and keep container alive even if node crashes
echo "[entrypoint] Starting browser session node process..."
node dist/docker.js
EXIT_CODE=$?
echo "[entrypoint] node dist/docker.js exited with code $EXIT_CODE — keeping container alive for VNC access"
# Keep alive so VNC + websockify remain accessible
wait

wait (no args) blocks on all background children. Earlier in the script, browser-session mode starts:

  • Xvfb (background)
  • x11vnc (background, runs forever)
  • websockify (background, runs forever)
  • socat CDP proxy (background)
  • xdotool listener (background)
  • sshd (background)

So wait blocks indefinitely on processes that never exit. The container never exits. The pod stays Running until something external kills it.
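The blocking behaviour can be reproduced outside the container. This is a minimal sketch, assuming a POSIX shell: `sleep 2` is a stand-in for a long-lived helper such as x11vnc or websockify, and `true` is a stand-in for node dist/docker.js. Neither stand-in comes from the real entrypoint.

```shell
#!/bin/sh
# Stand-in for a long-lived background helper (x11vnc, websockify, ...).
# Here it lives for 2 seconds; in the real entrypoint it never exits.
sleep 2 &

start=$(date +%s)

# Stand-in for `node dist/docker.js`: a foreground process that exits at once.
true
EXIT_CODE=$?

# With no arguments, wait blocks until every background child has exited.
wait

end=$(date +%s)
elapsed=$((end - start))
echo "foreground exit=$EXIT_CODE, wait held the script for ${elapsed}s"
```

The foreground command finishes immediately, yet the script only continues once the background child is gone. With helpers that run forever, the script never gets past `wait`.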

Pod log (from the affected session)

25/04/2026 07:13:39  TOTALS              :     73 |       722/      722 (  0.0%)
25/04/2026 07:13:39 destroyed xdamage object: 0x200040
[browser-session] Saving browser data...
[s3-sync] S3 save (auth-essential files only)...
[s3-sync] Uploaded 13 auth-essential items
[browser-session] Save complete
[entrypoint] node dist/docker.js exited with code 0 — keeping container alive for VNC access

After this last line: pod stays Running for 6+ hours. Memory keeps being held; node capacity keeps being consumed; the user sees their session as "still active" even though the bot has finished.

Impact

  1. Lifecycle correctness — "stop session" doesn't actually stop the pod. Users see ghost sessions in any UI that lists active runtime-managed pods.
  2. Cost — pods consume node resources for hours after their work is done. On managed-K8s with paid nodes, this is real money.
  3. Capacity exhaustion — capacity-reserve / autoscaler accounting treats orphan pods as active. Cluster scales up to handle "load" that isn't real.
  4. SoC blast-radius — orphan pods on the wrong node pool stay orphan forever. Hard to clean up without manual intervention.

Proposed fixes (pick one)

Option A — Time-bound the keep-alive (minimal change)

node dist/docker.js
EXIT_CODE=$?
echo "[entrypoint] node exited code $EXIT_CODE — keep-alive for ${VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS:-0}s"
if [ "${VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS:-0}" -gt 0 ]; then
    # sleep rather than `timeout N wait`: wait is a shell builtin, so timeout
    # (an external command) cannot run it against this shell's children
    sleep "${VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS}"
fi
exit $EXIT_CODE

  • Default 0 = exit immediately when node exits (correct lifecycle)
  • Operators who want VNC-after-exit set the env var to a finite N seconds
  • No configuration change needed: existing deployments get the correct behavior by default
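One caveat when bounding the keep-alive: `wait` is a shell builtin, so an external wrapper such as `timeout` cannot bound it, and a bounded `sleep` is the simpler tool. A second consideration is that a shell running as PID 1 typically ignores SIGTERM unless it installs a trap, so a pod deleted during the keep-alive window could sit until the grace period expires. The sketch below addresses both. The `keepalive` function is a hypothetical helper, not code from the repository, and `true` again stands in for node dist/docker.js.

```shell
#!/bin/sh
# Hypothetical helper (not from the repository): hold the container open for
# at most $1 seconds, returning early if SIGTERM or SIGINT arrives.
keepalive() {
    seconds="$1"
    # Zero, negative, empty, or non-numeric values all mean "no keep-alive".
    [ "$seconds" -gt 0 ] 2>/dev/null || return 0

    # Sleep in the background and wait on it: a trapped signal interrupts
    # `wait` immediately, whereas a foreground sleep would delay the trap.
    sleep "$seconds" &
    sleep_pid=$!
    trap 'kill "$sleep_pid" 2>/dev/null' TERM INT
    wait "$sleep_pid"
    trap - TERM INT
    return 0
}

# Stand-in for `node dist/docker.js`; the real entrypoint would run node here.
true
EXIT_CODE=$?

keepalive "${VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS:-0}"
echo "[entrypoint] node exited code $EXIT_CODE, keep-alive finished"
# The real entrypoint would end with: exit "$EXIT_CODE"
```

With the variable unset, the helper returns at once and the entrypoint exits with node's code. With a positive value, VNC stays reachable for that many seconds at most.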

Option B — Drop the keep-alive entirely (simplest)

node dist/docker.js
exit $?

VNC-after-exit was always a debug convenience; removing it forces the user to grab a VNC session BEFORE the bot exits. Loses a feature, gains correctness.

Option C — runtime-api drives termination (more invasive)

Have runtime-api explicitly delete the pod when it observes "node exited" via the watch channel. Fixes the lifecycle ownership question (currently nothing owns the "kill the pod when work is done" responsibility).

Recommendation

Option A. Preserves the debug feature for those who use it, defaults to correct behavior, single env var.

Related

  • Surfaced via vexa-platform release-005 dogfood 2026-04-25 (release-006 harness work).
  • Platform-side DoD that would catch this class of bug: in vexa-platform's feature-bot-lifecycle/dods.yaml, add a DoD browser-session-pod-terminates-when-stopped that queries K8s for any pod with label runtime.profile=browser-session and age > 2× max-session-length, and asserts that there are none.
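
The age check in that DoD could be sketched as a small filter. This is an illustration only: the `stale_pods` helper, the sample pod names, and the cutoff value are all hypothetical, and how the DoD harness would actually invoke such a check is not specified here. It relies on the fact that UTC timestamps in one fixed RFC 3339 format, as Kubernetes reports for creationTimestamp, compare correctly as plain strings.

```shell
#!/bin/sh
# Hypothetical helper: read "name creationTimestamp" lines on stdin and print
# the names of pods created before the cutoff given as $1.
stale_pods() {
    awk -v cutoff="$1" 'NF >= 2 && $2 < cutoff { print $1 }'
}

# Sample input standing in for kubectl output (pod names are made up).
sample='session-a 2026-04-25T01:00:00Z
session-b 2026-04-25T09:00:00Z'

# Cutoff = now minus 2x max-session-length; fixed here for illustration.
stale=$(printf '%s\n' "$sample" | stale_pods "2026-04-25T07:00:00Z")
echo "stale pods: ${stale:-none}"

# Against a real cluster the input could come from:
#   kubectl get pods -l runtime.profile=browser-session \
#     -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.metadata.creationTimestamp}{"\n"}{end}'
```

The DoD would pass when the filter prints nothing, meaning no browser-session pod has outlived the threshold.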
