feat(sdk): add local/cloud backend abstraction#754
Draft
appcypher wants to merge 12 commits into
Draft
Conversation
Introduce a Backend trait + ambient default that lets SDK callers route
sandbox operations to either the local libkrun runtime or a remote
msb-cloud control plane over HTTP. Same `Sandbox::builder(...)` API,
two transports. Tracks D6 of msb-cloud's sdk-cloud-parity-plan.md.
Backend module (`crates/microsandbox/lib/backend/`):
- `Backend` trait with `kind()` + `sandboxes()` accessor; object-safe so
handles can hold `Arc<dyn Backend>`. `SandboxBackend` sub-trait covers
create/list/get/start/stop/destroy and bridges local `Sandbox` handles
with cloud `CloudSandbox` records via `BackendSandbox` /
`BackendSandboxHandle` enums (variants boxed to keep enum size bounded).
- `LocalBackend` skeleton — placeholder for migrating today's
`GlobalConfig` / `init_global` state in follow-ups.
- `CloudBackend` over `reqwest` with `new` / `from_env` / `from_profile` /
builder constructors. HTTP errors map to typed variants
(`SandboxNotFound`, `SandboxAlreadyExists`, `InvalidConfig`, `Runtime`,
`CloudHttp`).
- Cloud wire types mirror msb-cloud's `msb-models` byte-for-byte,
including the nine D13 expansion fields (workdir, shell, entrypoint,
hostname, user, log_level, scripts, max_duration_secs,
idle_timeout_secs). `cmd` and `rlimits` deliberately omitted.
- `SdkConfig` + `Profile` JSON schema; `resolve_default_backend`
implements the Q1 precedence ladder (MSB_BACKEND env → direct
MSB_API_URL/KEY → MSB_PROFILE → `active_profile` → local fallback).
`api_key_ref` accepts `env:` / `inline:` (warn-on-load) /
`keyring:` (stubbed).
- Ambient default via `OnceLock<RwLock<Arc<dyn Backend>>>` plus
`with_backend(...)` task-local scope for per-future overrides.
Process-level libkrunfw (D6.6):
- `set_sdk_libkrunfw_path` mirrors `set_sdk_msb_path`; new
`resolve_libkrunfw_path` ladder consults `MSB_LIBKRUNFW_PATH` env →
SDK static → `config.paths.libkrunfw` → filesystem fallbacks.
- Removed per-sandbox `libkrunfw_path` from `SandboxBuilder` /
`SandboxConfig` / Python+Node-TS builder bindings — one dylib per
process address space, so per-sandbox makes no sense.
D12 cross-repo name validator:
- `validate_sandbox_name` enforces 1..=128 chars, alphanumeric / `.` /
`-` / `_`, must start alphanumeric. Mirrors msb-cloud's
`MAX_SANDBOX_NAME_LEN` rule; if it changes here it must change there.
Config + errors:
- `config_path()` honours `MSB_CONFIG_PATH` so legacy `GlobalConfig`
and new `SdkConfig` always read/write the same JSON document.
- New `MicrosandboxError` variants: `Http(reqwest::Error)`,
`CloudHttp { status, code, message }`, `Unsupported { feature,
available_in_phase }`.
CLI:
- New `msb profile` command (D11): `list` / `show` / `use`. Active
profile setter does a raw `serde_json::Value` round-trip so other
config keys are preserved. `inline:` API keys are redacted in
`profile show`.
FFI:
- Python: `set_runtime_libkrunfw_path` (exported as
`set_libkrunfw_path`), `set_default_backend(kind, *, url, api_key,
profile)`, `default_backend_kind`. New `CloudHttpError` and
`UnsupportedError` mapped from typed Rust variants.
- Node-TS: `setRuntimeLibkrunfwPath`, `setDefaultBackend` (with
`DefaultBackend` typed union wrapper in `runtime.ts`),
`defaultBackendKind`. New `CloudHttpError`,
`SandboxAlreadyExistsError`, `UnsupportedError` classes wired through
the error-mapping layer.
Out of scope for this commit (next chunks): migrating `Sandbox` /
`Volume` / `ExecHandle` to hold `Arc<dyn Backend>` (fetch-live policy
per D6.4); `VolumeBackend` and `SnapshotBackend` sub-traits; exec /
logs / fs / metrics over cloud — those wait on msb-cloud Phase 16/B.1.
Plan-IDs (D6.6, D7, D12, etc.) belong in commit messages and PR
descriptions, not in code comments. They don't age well: once the
plan is folded into the design, the IDs become noise that future
readers have to chase. Substantive cross-repo warnings (e.g. "stay
in sync with msb-cloud's validate_sandbox_name") remain.
Also rename MicrosandboxError::Unsupported's `available_in_phase`
field to `available_when` and update its callers to pass
human-readable strings ("when cloud volumes ship", "when rlimits
land on the cloud API") instead of phase numbers.
Per the plan's D6.4: Sandbox and SandboxHandle now hold Arc<dyn Backend>
+ a backend-private SandboxInner / SandboxHandleInner enum. Methods
dispatch through the trait so Sandbox::builder("x").create() routes
to whichever backend the active profile points to.
- New SandboxBackend lifecycle surface (create / create_detached /
start / start_detached / get / list / remove / stop / kill / drain).
- LocalBackend impl wraps today's libkrun + agentd path.
- CloudBackend impl uses the existing HTTP layer for lifecycle ops.
- Deleted BackendSandbox / BackendSandboxHandle / BackendSandboxList
user-facing variant enums (violated D6.4 "no variants on handles").
- Sandbox::status() / last_error() fetch-live from the backend.
- Sandbox::backend_kind() / local() / cloud() variant accessors per D6.4.
Volume / ExecHandle / SandboxFs / VolumeFs and the globals-into-LocalBackend
absorption come in follow-up commits on this branch.
Per the plan's D6.4 + D8/D9: Volume, VolumeHandle, VolumeFs now hold
Arc<dyn Backend> + private VolumeInner / VolumeHandleInner enum.
Volume::{create,get,list,remove} dispatch through default_backend().
- New VolumeBackend trait on the Backend dispatch surface.
- LocalBackend impl wraps today's volume_dir + sqlite path.
- CloudBackend returns Unsupported for every method — cloud volumes
ship in Phase 6 (D14 table).
- VolumeFs<'a> borrows the Arc<dyn Backend> + name, dispatches fs_*
through the trait. Streaming (read_stream/write_stream) stays
local-only via a require_local_path guard + LocalBackend::volume_path
helper; cloud streaming returns Unsupported.
- Python/Node FFI rewired: PyVolumeFs/JsVolumeFs now hold (backend, name)
and reconstruct VolumeFs per call via the new pub
VolumeFs::with_backend constructor.
- Volume::path() is now Result-returning (errors with Unsupported on
cloud where bytes live in S3, not on the host).
Per the plan's D6.7 Layer 2a: the GlobalConfig singleton and the process-wide DB pool now live as fields on LocalBackend instead of behind crate::config::config() / crate::db::init_global() statics. - LocalBackend gains db() / config() accessors + path helpers (sandboxes_dir / volumes_dir / etc.) read off owned LocalConfig. - LocalBackend::builder() lands for programmatic config: home, *_dir, max_connections, default_cpus / memory / log_level. - LocalBackend::lazy() lazily initialises the DB pool on first access via tokio::sync::OnceCell; config is read eagerly from ~/.microsandbox/config.json. Preserves no-explicit-setup ergonomics. - LocalBackend::ambient() returns the process-wide instance for code paths that cannot thread a backend ref (serde defaults, FFI shims, CLI helpers). - Backend trait gains as_local() so dyn-Backend call sites that need local-only state can downcast cleanly. - Threaded &LocalBackend through every local-code helper. Free functions like create_local / start_local / get_local_handle_state / list_local_handle_state / remove_local / stop_local / kill_local / drain_local / pull_oci_image / validate_start_state / Volume::create_local / get_local / list_local / remove_local / VolumeHandle::from_local_model take &LocalBackend. - Snapshot module helpers + image module helpers consult LocalBackend::ambient() (they are local-only). - GlobalConfig renamed to LocalConfig. config()/CONFIG static and init_global()/GLOBAL_POOL static deleted from crate::config / crate::db. resolve_msb_path/resolve_libkrunfw_path take &LocalConfig. - CLI + FFI external callers (volume/snapshot/image commands, python snapshot.rs, helpers.rs, lib.rs, node-ts snapshot.rs, go lib.rs) reach the resolved paths via microsandbox::LocalBackend::ambient(). Process-level Layer 1 plumbing (SDK_LIBKRUNFW_PATH / SDK_MSB_PATH / keyring) stays in crate::config — those are documented process-wide state, not LocalBackend-instance state.
Per code review on commit f8f750a: - cloud_create_request_from_config now errors with Unsupported for replace_existing, init, pull_policy, registry_auth, insecure, ca_certs, and network (ports/secrets/custom DNS/host-CA trust). Previously these were silently dropped — user calls .port(8080, 80).create() got a cloud sandbox with no port forwarding silently. - SandboxStatus gains Created and Starting variants. The cloud status mapping no longer collapses these to Stopped, so sb.status() on a booting cloud sandbox reports the truth.
… method
Per code review: SandboxBackend::{remove,stop,kill,drain} and
VolumeBackend::remove previously took &self only, forcing the
LocalBackend impls to re-resolve crate::backend::default_backend()
to satisfy downstream needs. That re-resolution silently picked the
process-wide default even when the dispatcher had been called on a
non-default instance — corrupting the multi-backend testing path
D6.7 explicitly enables.
All lifecycle methods now take backend: Arc<dyn Backend> as the
first arg. CloudBackend impls accept the arg but ignore it (cloud
ops don't need the wrapping Arc). LocalBackend impls use self
directly and never call default_backend() from inside.
Per code review on commit 505df8c: D6.7's "bound to one backend instance" promise was broken because spawn_sandbox / snapshot::* / image::* / volume::fs::local::* / read_logs / all_sandbox_metrics all reached into LocalBackend::ambient() instead of using the &LocalBackend the trait dispatch already had in scope. Result: under set_default_backend(custom_local) or with_backend(custom_local, ...), the DB row got written to custom but the spawn looked up paths against ambient — broken. - Thread &LocalBackend through every local-only helper. CLI / FFI resolve default_backend().as_local() once at the entry point and pass it down via new resolve_local_backend / local_backend_ref helpers in cli/commands/common.rs. - Eliminate the LocalBackend::ambient() static AND ambient() method per Defect 2 Approach A. The DEFAULT static in backend/mod.rs is the single process-wide cell; default_backend().as_local() is the only path to a process-wide LocalBackend now. - Replace SandboxConfig serde defaults with compile-time constants (DEFAULT_CPUS, DEFAULT_MEMORY_MIB, default_metrics_sample_interval) so DB-row deserialization and sandbox_config_from_cloud don't trigger local-disk I/O. Defaults mirror SandboxDefaults::default() exactly; explicit-builder configs still flow through SandboxBuilder. - reap_stale_sandboxes now consults default_backend() and no-ops when the active backend is cloud (best-effort startup task). Defect 3 (Sandbox::stop/kill/drain routing through the trait) is NOT in this commit. Equivalence check between MessageType::Shutdown and external SIGTERM-to-libkrun-PID found they are NOT semantically equivalent: the Shutdown path runs libc::sync() then libc::reboot(RB_POWER_OFF) inside the guest (or signals the handoff init in handoff mode), flushing ext4 cleanly. External SIGTERM relies on libkrun's signal handler with no in-guest sync — risk of journal replay on next boot. Reported back for a decision on whether to keep the optimisation with a different trait shape. Adds test confirming with_backend(custom_local, async { ... }) correctly scopes volume FS ops to custom_local's volumes_dir. Test count: 223 (was 222).
…n shutdown Per code review on commit f8f750a: Sandbox::stop on local reached self.local_state().client directly, bypassing the SandboxBackend trait — violating D6.4's "methods route through Arc<dyn Backend>" invariant. Naive fix would have downgraded Sandbox::stop to SandboxHandle::stop's SIGTERM-via-PID behavior, losing the agent-channel Shutdown that runs in-guest sync() + clean reboot (prevents ext4 journal replay). This commit routes through the trait while preserving semantics: LocalBackend::SandboxBackend::stop now connects to the agent UDS at sandboxes_dir/<name>/runtime/agent.sock and sends MessageType::Shutdown, falling back to SIGTERM via PID if the socket isn't reachable. Same trait dispatch path now serves both Sandbox::stop and SandboxHandle::stop — both get clean shutdown. Side effect: SandboxHandle::stop is upgraded from raw SIGTERM (could risk journal replay) to clean Shutdown when the agent is reachable.
- Propagate serde error in SandboxHandle::from_cloud instead of silently emitting empty config_json. (P2) - sandbox_config_from_cloud now maps log_level back from cloud, and comments which SandboxConfig fields are synthesised defaults. (P2) - Document SandboxBackend::create's start asymmetry per D6.4 — cloud honours it, local always boots immediately. (P2) - SandboxHandle::updated_at returns a best-effort timestamp on cloud (stopped_at.or(started_at).or(created_at)). (P3) - Document Sandbox::status / last_error round-trip cost and point at Sandbox::get for batched reads. (P2) - Remove SandboxHandle::status alias; status_snapshot is the only spelling, making the snapshot semantic explicit vs. Sandbox::status which is async + fetch-live. Updated CLI + node/python/go FFI callers. (P2/P3) - Sandbox::remove_persisted takes &self instead of self so cloud callers don't lose ownership on the Unsupported branch. (P3)
- Unify VolumeBackend::fs_remove + fs_remove_dir into a single fs_remove(name, path, recursive) per plan Q2. VolumeFs's public remove / remove_dir surface stays the same. - VolumeHandle::from_local_model takes only Arc<dyn Backend> and derives the &LocalBackend internally — kills the "passing two inconsistent things" footgun the review flagged. - Standardise cloud Unsupported "available_when" strings on "when cloud volumes ship" for every volume op (was inconsistent with streaming's "when cloud streaming routes ship"). - LocalBackendBuilder gains setters for shell / workdir / metrics_sample_interval_ms / disable_metrics_sample / ca_certs / registry_hosts. - LocalBackendBuilder::build().await now merges with ~/.microsandbox/config.json — builder overrides win, persisted values fill the rest. Adds a unit test exercising the merge.
Completes Phase 2 of the SDK parity refactor — every per-sandbox method on Sandbox now dispatches through SandboxBackend. Resolves two P2 findings from code review: - A7: Sandbox::fs() / client() / client_arc() are infallible again. fs() returns SandboxFs<'_> directly; cloud sandboxes get a handle whose individual methods return Unsupported rather than the Result-at-construction surface that broke the public API contract. client() / client_arc() are local-only by their nature; they now panic with a documented assertion on cloud rather than returning Result, matching Rust convention for asserted-local accessors. - B1: VolumeFs::read_stream / write_stream now dispatch through VolumeBackend::fs_read_stream / fs_write_stream. Local does the work; cloud returns Unsupported per-method instead of short- circuiting at handle construction. The SandboxBackend trait gains exec / exec_stream / attach / logs / metrics / metrics_stream plus the full fs_* surface. LocalBackend's trait impl opens a per-call agent UDS connection (option A in the plan). The per-sandbox AgentClient stored on SandboxLocalState stays available for callers that explicitly want it via sb.local(), but the canonical path goes through the trait. CloudBackend's exec / fs / logs / metrics all return Unsupported per D14 until msb-cloud Phase 16 / B.1 ships. Sandbox::exec / exec_stream / attach / shell / shell_stream / logs / metrics / metrics_stream now dispatch through the trait. attach (host TTY raw mode + PTY bridging) moved to sandbox::attach::local; exec moved to sandbox::exec::local; fs moved to sandbox::fs::local; metrics gained local_metrics / local_metrics_stream helpers. ExecHandle / FsReadStream / FsWriteSink stay single-type with local-only internals. FsReadStream gained an internal Arc<AgentClient> to pin the per-call agent connection alive for the stream's duration. FFI bindings updated: python's PySandboxFs now holds backend + name instead of Arc<AgentClient>; node-ts strips .map_err on fs(); go drops the ? after sb.fs() and awaits sb.logs().
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.