SDK: detached/one-shot box has no completion signal (status stays Running; remove needs force) #784

Description

@DorianZheng

Summary

A detached box has no completion signal for the container's main process. When a box is started detached and its PID 1 exits (workload finished, batch job done), there is no host-side way to observe that: list_info().state.status stays Running (it tracks the microVM, which is still up), and there is no wait()/exit-code API at the box level. The only way to detect completion today is to exec into the box and pattern-match the resulting error. Tearing the box down then requires force=true because the box still counts as active.

This makes one-shot / batch workloads (run a command to completion, observe it finished, clean up) awkward to express through the SDK.

Where (as of 07fa30f9)

  • Box status has no "container process exited" state — BoxStatus::{Running, Stopped, …} tracks the VM lifecycle; start() on a Running box is a no-op (src/boxlite/src/litebox/box_impl.rs:202).
  • The guest does know init exited, but only surfaces it as an exec-time error string: src/guest/src/service/container.rs:267 and src/guest/src/service/exec/mod.rs:435 ("Container init process exited …").
  • remove() rejects an active box without force: src/boxlite/src/runtime/rt_impl.rs:947 → "cannot remove active box {id} (status: …). Use force=true to stop first."
  • No box-level wait() / exit-code accessor exists (the exit_code field is on exec results only, e.g. sdks/python/boxlite/exec.py:27).

Repro

  1. Start a box detached whose command runs to completion and exits (e.g. sh -c 'echo done').
  2. Poll list_info() → status stays Running indefinitely.
  3. The only signal that the workload finished is box.exec("true") failing with "Container init process exited".
  4. runtime.remove(name) → InvalidState: cannot remove active box … Use force=true.

Expected

Either:

  • A queryable completion signal — e.g. a distinct status (Exited) or a wait() returning the init exit code for detached boxes; and/or
  • remove() on a box whose workload has exited treated as a clean (non-forced) removal.
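
As a rough illustration of the two options above, here is a toy state model. It is not boxlite code: `Status.EXITED`, `BoxState.exit_code`, `on_init_exit`, and `can_remove` are all hypothetical names, made up to show how a distinct exited state would let `remove()` succeed without force.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    RUNNING = "running"
    EXITED = "exited"    # hypothetical: init exited, microVM may still be up
    STOPPED = "stopped"


@dataclass(frozen=True)
class BoxState:
    status: Status
    exit_code: Optional[int] = None  # hypothetical: init exit code once EXITED


def on_init_exit(state: BoxState, code: int) -> BoxState:
    """Move a running box to EXITED and record the init exit code."""
    if state.status is not Status.RUNNING:
        return state  # already exited or stopped: nothing to record
    return BoxState(Status.EXITED, code)


def can_remove(state: BoxState, force: bool = False) -> bool:
    """Only a box whose workload is still running would need force."""
    return force or state.status is not Status.RUNNING
```

Under this model a host-side `wait()` could simply block until the status leaves `RUNNING` and then return `exit_code`.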

Related — no typed error for lifecycle conflicts (was "A4")

A neighbouring rough edge: lifecycle-conflict conditions surface only as generic BoxliteError::InvalidState(String) (src/shared/src/errors.rs:56) with no stable code, so SDK consumers must substring-match messages (e.g. "already running" on a redundant start()). A typed/coded variant for these would remove the string-matching. Listed here because it shares the same root: lifecycle state isn't exposed in a structured way.
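
A minimal sketch of what a coded variant could look like from the consumer side. The names `LifecycleCode`, `LifecycleConflict`, and `start_idempotent` are hypothetical and do not exist in the SDK today; the point is only that the caller branches on a stable code instead of message text.

```python
from enum import Enum
from typing import Callable


class LifecycleCode(Enum):
    # Hypothetical stable codes for lifecycle conflicts.
    ALREADY_RUNNING = "already_running"
    BOX_ACTIVE = "box_active"


class LifecycleConflict(Exception):
    """Hypothetical typed error carrying a machine-readable code."""

    def __init__(self, code: LifecycleCode, message: str) -> None:
        super().__init__(message)
        self.code = code


def start_idempotent(start: Callable[[], None]) -> None:
    """Call start(); treat a redundant start as success by code, not by message."""
    try:
        start()
    except LifecycleConflict as exc:
        if exc.code is not LifecycleCode.ALREADY_RUNNING:
            raise  # any other conflict is a real error for the caller
```

With a code on the error, the message wording can change between releases without breaking callers.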

Current workaround (downstream)

apps/infra-local exec-probes for init exit and force-removes the box:

  • apps/infra-local/boxlite_local/orchestrator.py: _wait_one_shot_exit (exec-probe loop), followed by runtime.remove(name, force=True).

These are documented in that package's README "SDK gotchas" table. We'd happily drop them once a first-class completion signal exists.
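
For readers without access to that package, the exec-probe pattern looks roughly like the sketch below. This is not the actual `_wait_one_shot_exit` implementation: the function signature, the `probe` callable, and the timeout/interval parameters are assumptions. Only the marker string comes from the error message quoted in this issue.

```python
import time
from typing import Callable

# Error text quoted in this issue; matching on it is exactly the fragile part.
INIT_EXITED_MARKER = "Container init process exited"


def wait_one_shot_exit(
    probe: Callable[[], None],
    timeout_s: float = 30.0,
    interval_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Return True once the probe fails with the init-exited message, False on timeout."""
    deadline = clock() + timeout_s
    while True:
        try:
            probe()  # assumed to wrap something like box.exec("true")
        except Exception as exc:
            if INIT_EXITED_MARKER in str(exc):
                return True
            raise  # unrelated failure: do not mistake it for completion
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Once this returns True, the caller still has to follow up with `runtime.remove(name, force=True)`, because the box status remains Running.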

Labels: enhancement
