Summary
`vexa/services/vexa-bot/core/entrypoint.sh` (browser-session-mode branch) intentionally keeps the container alive forever after the `node dist/docker.js` process exits, with no time limit. K8s sees the pod as `Running` indefinitely. From the user's perspective, "stop session" appears to do nothing because the pod never terminates.
Reproduce
- Dispatch a browser-session via runtime-api
- Stop the session via the dashboard (or any other path that signals graceful stop)
- The bot's `node` process exits cleanly with code 0
- Observe: pod remains `Running` in K8s indefinitely
In a recent dogfood run, two browser-session pods were `Running` for 6+ hours after their `node` processes exited at 07:13:39. From the user's perspective the stop button never worked.
Root cause
`vexa/services/vexa-bot/core/entrypoint.sh` (last section of the browser-session-mode branch):

```shell
# Run node and keep container alive even if node crashes
echo "[entrypoint] Starting browser session node process..."
node dist/docker.js
EXIT_CODE=$?
echo "[entrypoint] node dist/docker.js exited with code $EXIT_CODE — keeping container alive for VNC access"
# Keep alive so VNC + websockify remain accessible
wait
```
`wait` (no args) blocks on all background children. Earlier in the script, browser-session mode starts:

- `Xvfb` (background)
- `x11vnc` (background, runs forever)
- `websockify` (background, runs forever)
- `socat` CDP proxy (background)
- `xdotool` listener (background)
- `sshd` (background)

So `wait` blocks indefinitely on processes that never exit. The container never exits. The pod stays `Running` until something external kills it.
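The blocking behavior is easy to reproduce outside the container. A minimal sketch (the sleep durations are illustrative stand-ins for the real processes):

```shell
#!/bin/sh
# Plain `wait` returns only after ALL background children have exited,
# so the longest-lived child determines when the script continues.
start=$(date +%s)
sleep 1 &   # short-lived child (stands in for node dist/docker.js)
sleep 3 &   # longer-lived child (stands in for x11vnc/websockify)
wait        # blocks ~3s here, not ~1s
end=$(date +%s)
elapsed=$((end - start))
echo "wait returned after ${elapsed}s"
```

In the real entrypoint, `x11vnc` and `websockify` never exit, so `wait` never returns at all.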
Pod log (from the affected session)
```
25/04/2026 07:13:39 TOTALS : 73 | 722/ 722 ( 0.0%)
25/04/2026 07:13:39 destroyed xdamage object: 0x200040
[browser-session] Saving browser data...
[s3-sync] S3 save (auth-essential files only)...
[s3-sync] Uploaded 13 auth-essential items
[browser-session] Save complete
[entrypoint] node dist/docker.js exited with code 0 — keeping container alive for VNC access
```
After this last line: pod stays Running for 6+ hours. Memory keeps being held; node capacity keeps being consumed; the user sees their session as "still active" even though the bot has finished.
Impact
- Lifecycle correctness — "stop session" doesn't actually stop the pod. Users see ghost sessions in any UI that lists active runtime-managed pods.
- Cost — pods consume node resources for hours after their work is done. On managed-K8s with paid nodes, this is real money.
- Capacity exhaustion — capacity-reserve / autoscaler accounting treats orphan pods as active. Cluster scales up to handle "load" that isn't real.
- SoC blast-radius — orphan pods on the wrong node pool stay orphaned forever; hard to clean up without manual intervention.
Proposed fixes (pick one)
Option A — Time-bound the keep-alive (minimal change)
```shell
node dist/docker.js
EXIT_CODE=$?
echo "[entrypoint] node exited code $EXIT_CODE — keep-alive for ${VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS:-0}s"
if [ "${VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS:-0}" -gt 0 ]; then
  # Note: `timeout N wait` would not work here: wait is a shell builtin,
  # not an executable, and the background children never exit anyway.
  # A bounded sleep gives the same keep-alive window.
  sleep "${VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS}"
fi
exit $EXIT_CODE
```
- Default `0` = exit immediately when node exits (correct lifecycle)
- Operators who want VNC-after-exit set the env var to a finite N seconds
- Backwards-compatible: existing deployments default to correct behavior
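For operators who do want the debug window, the opt-in could look like this in the pod spec (illustrative fragment; only the variable name comes from the proposal above):

```yaml
# Opt back in to 5 minutes of VNC access after node exits
env:
  - name: VEXA_BROWSER_SESSION_VNC_KEEPALIVE_SECONDS
    value: "300"
```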
Option B — Drop the keep-alive entirely (simplest)
```shell
node dist/docker.js
exit $?
```
VNC-after-exit was always a debug convenience; removing it forces the user to grab a VNC session BEFORE the bot exits. Loses a feature, gains correctness.
Option C — runtime-api drives termination (more invasive)
Have runtime-api explicitly delete the pod when it observes "node exited" via the watch channel. Fixes the lifecycle ownership question (currently nothing owns the "kill the pod when work is done" responsibility).
Recommendation
Option A. Preserves the debug feature for those who use it, defaults to correct behavior, single env var.
Related
- Surfaced via vexa-platform release-005 dogfood 2026-04-25 (release-006 harness work).
- Platform-side DoD that catches this class of bug: in vexa-platform's `feature-bot-lifecycle/dods.yaml`, add a DoD `browser-session-pod-terminates-when-stopped` that queries K8s for any pod with label `runtime.profile=browser-session` AND age > 2× max-session-length and asserts zero such pods.
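A sketch of that DoD check as a shell predicate. The 1-hour max-session-length and the kubectl query shape are assumptions; the label comes from the proposal above:

```shell
#!/bin/sh
# Proposed DoD: no browser-session pod should outlive 2x max-session-length.
MAX_SESSION_SECONDS=3600                 # assumed max-session-length: 1h
THRESHOLD=$((2 * MAX_SESSION_SECONDS))

# is_stale START_EPOCH NOW_EPOCH -> true if the pod exceeds the threshold
is_stale() {
  [ $(( $2 - $1 )) -gt "$THRESHOLD" ]
}

# In CI this would feed in real pod creation timestamps, e.g.:
#   kubectl get pods -l runtime.profile=browser-session \
#     -o jsonpath='{range .items[*]}{.metadata.creationTimestamp}{"\n"}{end}'
# and fail the DoD if is_stale holds for any pod.
```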