Skip to content

feat(sdk): add local/cloud backend abstraction#754

Draft
appcypher wants to merge 12 commits into
mainfrom
appcypher/sdk-local-cloud-backend
Draft

feat(sdk): add local/cloud backend abstraction#754
appcypher wants to merge 12 commits into
mainfrom
appcypher/sdk-local-cloud-backend

Conversation

@appcypher
Copy link
Copy Markdown
Member

@appcypher appcypher commented May 18, 2026

No description provided.

appcypher added 12 commits May 18, 2026 19:24
Introduce a Backend trait + ambient default that lets SDK callers route
sandbox operations to either the local libkrun runtime or a remote
msb-cloud control plane over HTTP. Same `Sandbox::builder(...)` API,
two transports. Tracks D6 of msb-cloud's sdk-cloud-parity-plan.md.

Backend module (`crates/microsandbox/lib/backend/`):

- `Backend` trait with `kind()` + `sandboxes()` accessor; object-safe so
  handles can hold `Arc<dyn Backend>`. `SandboxBackend` sub-trait covers
  create/list/get/start/stop/destroy and bridges local `Sandbox` handles
  with cloud `CloudSandbox` records via `BackendSandbox` /
  `BackendSandboxHandle` enums (variants boxed to keep enum size bounded).
- `LocalBackend` skeleton — placeholder for migrating today's
  `GlobalConfig` / `init_global` state in follow-ups.
- `CloudBackend` over `reqwest` with `new` / `from_env` / `from_profile` /
  builder constructors. HTTP errors map to typed variants
  (`SandboxNotFound`, `SandboxAlreadyExists`, `InvalidConfig`, `Runtime`,
  `CloudHttp`).
- Cloud wire types mirror msb-cloud's `msb-models` byte-for-byte,
  including the nine D13 expansion fields (workdir, shell, entrypoint,
  hostname, user, log_level, scripts, max_duration_secs,
  idle_timeout_secs). `cmd` and `rlimits` deliberately omitted.
- `SdkConfig` + `Profile` JSON schema; `resolve_default_backend`
  implements the Q1 precedence ladder (MSB_BACKEND env → direct
  MSB_API_URL/KEY → MSB_PROFILE → `active_profile` → local fallback).
  `api_key_ref` accepts `env:` / `inline:` (warn-on-load) /
  `keyring:` (stubbed).
- Ambient default via `OnceLock<RwLock<Arc<dyn Backend>>>` plus
  `with_backend(...)` task-local scope for per-future overrides.

Process-level libkrunfw (D6.6):

- `set_sdk_libkrunfw_path` mirrors `set_sdk_msb_path`; new
  `resolve_libkrunfw_path` ladder consults `MSB_LIBKRUNFW_PATH` env →
  SDK static → `config.paths.libkrunfw` → filesystem fallbacks.
- Removed per-sandbox `libkrunfw_path` from `SandboxBuilder` /
  `SandboxConfig` / Python+Node-TS builder bindings — one dylib per
  process address space, so per-sandbox makes no sense.

D12 cross-repo name validator:

- `validate_sandbox_name` enforces 1..=128 chars, alphanumeric / `.` /
  `-` / `_`, must start alphanumeric. Mirrors msb-cloud's
  `MAX_SANDBOX_NAME_LEN` rule; if it changes here it must change there.

Config + errors:

- `config_path()` honours `MSB_CONFIG_PATH` so legacy `GlobalConfig`
  and new `SdkConfig` always read/write the same JSON document.
- New `MicrosandboxError` variants: `Http(reqwest::Error)`,
  `CloudHttp { status, code, message }`, `Unsupported { feature,
  available_in_phase }`.

CLI:

- New `msb profile` command (D11): `list` / `show` / `use`. Active
  profile setter does a raw `serde_json::Value` round-trip so other
  config keys are preserved. `inline:` API keys are redacted in
  `profile show`.

FFI:

- Python: `set_runtime_libkrunfw_path` (exported as
  `set_libkrunfw_path`), `set_default_backend(kind, *, url, api_key,
  profile)`, `default_backend_kind`. New `CloudHttpError` and
  `UnsupportedError` mapped from typed Rust variants.
- Node-TS: `setRuntimeLibkrunfwPath`, `setDefaultBackend` (with
  `DefaultBackend` typed union wrapper in `runtime.ts`),
  `defaultBackendKind`. New `CloudHttpError`,
  `SandboxAlreadyExistsError`, `UnsupportedError` classes wired through
  the error-mapping layer.

Out of scope for this commit (next chunks): migrating `Sandbox` /
`Volume` / `ExecHandle` to hold `Arc<dyn Backend>` (fetch-live policy
per D6.4); `VolumeBackend` and `SnapshotBackend` sub-traits; exec /
logs / fs / metrics over cloud — those wait on msb-cloud Phase 16/B.1.
Plan-IDs (D6.6, D7, D12, etc.) belong in commit messages and PR
descriptions, not in code comments. They don't age well: once the
plan is folded into the design, the IDs become noise that future
readers have to chase. Substantive cross-repo warnings (e.g. "stay
in sync with msb-cloud's validate_sandbox_name") remain.

Also rename MicrosandboxError::Unsupported's `available_in_phase`
field to `available_when` and update its callers to pass
human-readable strings ("when cloud volumes ship", "when rlimits
land on the cloud API") instead of phase numbers.
Per the plan's D6.4: Sandbox and SandboxHandle now hold Arc<dyn Backend>
+ a backend-private SandboxInner / SandboxHandleInner enum. Methods
dispatch through the trait so Sandbox::builder("x").create() routes
to whichever backend the active profile points to.

- New SandboxBackend lifecycle surface (create / create_detached /
  start / start_detached / get / list / remove / stop / kill / drain).
- LocalBackend impl wraps today's libkrun + agentd path.
- CloudBackend impl uses the existing HTTP layer for lifecycle ops.
- Deleted BackendSandbox / BackendSandboxHandle / BackendSandboxList
  user-facing variant enums (violated D6.4 "no variants on handles").
- Sandbox::status() / last_error() fetch-live from the backend.
- Sandbox::backend_kind() / local() / cloud() variant accessors per D6.4.

Volume / ExecHandle / SandboxFs / VolumeFs and the globals-into-LocalBackend
absorption come in follow-up commits on this branch.
Per the plan's D6.4 + D8/D9: Volume, VolumeHandle, VolumeFs now hold
Arc<dyn Backend> + private VolumeInner / VolumeHandleInner enum.
Volume::{create,get,list,remove} dispatch through default_backend().

- New VolumeBackend trait on the Backend dispatch surface.
- LocalBackend impl wraps today's volume_dir + sqlite path.
- CloudBackend returns Unsupported for every method — cloud volumes
  ship in Phase 6 (D14 table).
- VolumeFs<'a> borrows the Arc<dyn Backend> + name, dispatches fs_*
  through the trait. Streaming (read_stream/write_stream) stays
  local-only via a require_local_path guard + LocalBackend::volume_path
  helper; cloud streaming returns Unsupported.
- Python/Node FFI rewired: PyVolumeFs/JsVolumeFs now hold (backend, name)
  and reconstruct VolumeFs per call via the new pub
  VolumeFs::with_backend constructor.
- Volume::path() is now Result-returning (errors with Unsupported on
  cloud where bytes live in S3, not on the host).
Per the plan's D6.7 Layer 2a: the GlobalConfig singleton and the
process-wide DB pool now live as fields on LocalBackend instead of
behind crate::config::config() / crate::db::init_global() statics.

- LocalBackend gains db() / config() accessors + path helpers
  (sandboxes_dir / volumes_dir / etc.) read off owned LocalConfig.
- LocalBackend::builder() lands for programmatic config: home,
  *_dir, max_connections, default_cpus / memory / log_level.
- LocalBackend::lazy() lazily initialises the DB pool on first access
  via tokio::sync::OnceCell; config is read eagerly from
  ~/.microsandbox/config.json. Preserves no-explicit-setup ergonomics.
- LocalBackend::ambient() returns the process-wide instance for code
  paths that cannot thread a backend ref (serde defaults, FFI shims,
  CLI helpers).
- Backend trait gains as_local() so dyn-Backend call sites that
  need local-only state can downcast cleanly.
- Threaded &LocalBackend through every local-code helper. Free
  functions like create_local / start_local / get_local_handle_state
  / list_local_handle_state / remove_local / stop_local / kill_local /
  drain_local / pull_oci_image / validate_start_state /
  Volume::create_local / get_local / list_local / remove_local /
  VolumeHandle::from_local_model take &LocalBackend.
- Snapshot module helpers + image module helpers consult
  LocalBackend::ambient() (they are local-only).
- GlobalConfig renamed to LocalConfig. config()/CONFIG static and
  init_global()/GLOBAL_POOL static deleted from crate::config /
  crate::db. resolve_msb_path/resolve_libkrunfw_path take &LocalConfig.
- CLI + FFI external callers (volume/snapshot/image commands, python
  snapshot.rs, helpers.rs, lib.rs, node-ts snapshot.rs, go lib.rs)
  reach the resolved paths via microsandbox::LocalBackend::ambient().

Process-level Layer 1 plumbing (SDK_LIBKRUNFW_PATH / SDK_MSB_PATH /
keyring) stays in crate::config — those are documented process-wide
state, not LocalBackend-instance state.
Per code review on commit f8f750a:

- cloud_create_request_from_config now errors with Unsupported for
  replace_existing, init, pull_policy, registry_auth, insecure,
  ca_certs, and network (ports/secrets/custom DNS/host-CA trust).
  Previously these were silently dropped — user calls
  .port(8080, 80).create() got a cloud sandbox with no port
  forwarding silently.

- SandboxStatus gains Created and Starting variants. The cloud
  status mapping no longer collapses these to Stopped, so
  sb.status() on a booting cloud sandbox reports the truth.
… method

Per code review: SandboxBackend::{remove,stop,kill,drain} and
VolumeBackend::remove previously took &self only, forcing the
LocalBackend impls to re-resolve crate::backend::default_backend()
to satisfy downstream needs. That re-resolution silently picked the
process-wide default even when the dispatcher had been called on a
non-default instance — corrupting the multi-backend testing path
D6.7 explicitly enables.

All lifecycle methods now take backend: Arc<dyn Backend> as the
first arg. CloudBackend impls accept the arg but ignore it (cloud
ops don't need the wrapping Arc). LocalBackend impls use self
directly and never call default_backend() from inside.
Per code review on commit 505df8c: D6.7's "bound to one backend
instance" promise was broken because spawn_sandbox / snapshot::* /
image::* / volume::fs::local::* / read_logs / all_sandbox_metrics
all reached into LocalBackend::ambient() instead of using the
&LocalBackend the trait dispatch already had in scope. Result:
under set_default_backend(custom_local) or with_backend(custom_local,
...), the DB row got written to custom but the spawn looked up
paths against ambient — broken.

- Thread &LocalBackend through every local-only helper. CLI / FFI
  resolve default_backend().as_local() once at the entry point and
  pass it down via new resolve_local_backend / local_backend_ref
  helpers in cli/commands/common.rs.
- Eliminate the LocalBackend::ambient() static AND ambient() method
  per Defect 2 Approach A. The DEFAULT static in backend/mod.rs is
  the single process-wide cell; default_backend().as_local() is the
  only path to a process-wide LocalBackend now.
- Replace SandboxConfig serde defaults with compile-time constants
  (DEFAULT_CPUS, DEFAULT_MEMORY_MIB, default_metrics_sample_interval)
  so DB-row deserialization and sandbox_config_from_cloud don't
  trigger local-disk I/O. Defaults mirror SandboxDefaults::default()
  exactly; explicit-builder configs still flow through SandboxBuilder.
- reap_stale_sandboxes now consults default_backend() and no-ops when
  the active backend is cloud (best-effort startup task).

Defect 3 (Sandbox::stop/kill/drain routing through the trait) is
NOT in this commit. Equivalence check between MessageType::Shutdown
and external SIGTERM-to-libkrun-PID found they are NOT semantically
equivalent: the Shutdown path runs libc::sync() then
libc::reboot(RB_POWER_OFF) inside the guest (or signals the handoff
init in handoff mode), flushing ext4 cleanly. External SIGTERM
relies on libkrun's signal handler with no in-guest sync — risk of
journal replay on next boot. Reported back for a decision on
whether to keep the optimisation with a different trait shape.

Adds test confirming with_backend(custom_local, async { ... })
correctly scopes volume FS ops to custom_local's volumes_dir.
Test count: 223 (was 222).
…n shutdown

Per code review on commit f8f750a: Sandbox::stop on local reached
self.local_state().client directly, bypassing the SandboxBackend
trait — violating D6.4's "methods route through Arc<dyn Backend>"
invariant.

Naive fix would have downgraded Sandbox::stop to SandboxHandle::stop's
SIGTERM-via-PID behavior, losing the agent-channel Shutdown that
runs in-guest sync() + clean reboot (prevents ext4 journal replay).

This commit routes through the trait while preserving semantics:
LocalBackend::SandboxBackend::stop now connects to the agent UDS at
sandboxes_dir/<name>/runtime/agent.sock and sends MessageType::Shutdown,
falling back to SIGTERM via PID if the socket isn't reachable. Same
trait dispatch path now serves both Sandbox::stop and
SandboxHandle::stop — both get clean shutdown.

Side effect: SandboxHandle::stop is upgraded from raw SIGTERM (could
risk journal replay) to clean Shutdown when the agent is reachable.
- Propagate serde error in SandboxHandle::from_cloud instead of
  silently emitting empty config_json. (P2)
- sandbox_config_from_cloud now maps log_level back from cloud, and
  comments which SandboxConfig fields are synthesised defaults. (P2)
- Document SandboxBackend::create's start asymmetry per D6.4 — cloud
  honours it, local always boots immediately. (P2)
- SandboxHandle::updated_at returns a best-effort timestamp on cloud
  (stopped_at.or(started_at).or(created_at)). (P3)
- Document Sandbox::status / last_error round-trip cost and point at
  Sandbox::get for batched reads. (P2)
- Remove SandboxHandle::status alias; status_snapshot is the only
  spelling, making the snapshot semantic explicit vs. Sandbox::status
  which is async + fetch-live. Updated CLI + node/python/go FFI
  callers. (P2/P3)
- Sandbox::remove_persisted takes &self instead of self so cloud
  callers don't lose ownership on the Unsupported branch. (P3)
- Unify VolumeBackend::fs_remove + fs_remove_dir into a single
  fs_remove(name, path, recursive) per plan Q2. VolumeFs's public
  remove / remove_dir surface stays the same.
- VolumeHandle::from_local_model takes only Arc<dyn Backend> and
  derives the &LocalBackend internally — kills the "passing two
  inconsistent things" footgun the review flagged.
- Standardise cloud Unsupported "available_when" strings on
  "when cloud volumes ship" for every volume op (was inconsistent
  with streaming's "when cloud streaming routes ship").
- LocalBackendBuilder gains setters for shell / workdir /
  metrics_sample_interval_ms / disable_metrics_sample / ca_certs /
  registry_hosts.
- LocalBackendBuilder::build().await now merges with
  ~/.microsandbox/config.json — builder overrides win, persisted
  values fill the rest. Adds a unit test exercising the merge.
Completes Phase 2 of the SDK parity refactor — every per-sandbox
method on Sandbox now dispatches through SandboxBackend. Resolves
two P2 findings from code review:

- A7: Sandbox::fs() / client() / client_arc() are infallible again.
  fs() returns SandboxFs<'_> directly; cloud sandboxes get a handle
  whose individual methods return Unsupported rather than the
  Result-at-construction surface that broke the public API contract.
  client() / client_arc() are local-only by their nature; they now
  panic with a documented assertion on cloud rather than returning
  Result, matching Rust convention for asserted-local accessors.
- B1: VolumeFs::read_stream / write_stream now dispatch through
  VolumeBackend::fs_read_stream / fs_write_stream. Local does the
  work; cloud returns Unsupported per-method instead of short-
  circuiting at handle construction.

The SandboxBackend trait gains exec / exec_stream / attach / logs
/ metrics / metrics_stream plus the full fs_* surface. LocalBackend's
trait impl opens a per-call agent UDS connection (option A in the
plan). The per-sandbox AgentClient stored on SandboxLocalState
stays available for callers that explicitly want it via sb.local(),
but the canonical path goes through the trait.

CloudBackend's exec / fs / logs / metrics all return Unsupported
per D14 until msb-cloud Phase 16 / B.1 ships.

Sandbox::exec / exec_stream / attach / shell / shell_stream / logs
/ metrics / metrics_stream now dispatch through the trait. attach
(host TTY raw mode + PTY bridging) moved to sandbox::attach::local;
exec moved to sandbox::exec::local; fs moved to sandbox::fs::local;
metrics gained local_metrics / local_metrics_stream helpers.

ExecHandle / FsReadStream / FsWriteSink stay single-type with
local-only internals. FsReadStream gained an internal Arc<AgentClient>
to pin the per-call agent connection alive for the stream's duration.

FFI bindings updated: python's PySandboxFs now holds backend + name
instead of Arc<AgentClient>; node-ts strips .map_err on fs();
go drops the ? after sb.fs() and awaits sb.logs().
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant