feat(sandbox): add experimental StrandsShellSandbox (Python + TS)#2839
feat(sandbox): add experimental StrandsShellSandbox (Python + TS)#2839mkmeral wants to merge 1 commit into
Conversation
Adds StrandsShellSandbox, an experimental Sandbox backed by the optional Strands Shell package (an in-process, declaratively-mediated shell). The native shell is thread-pinned, so the Python implementation confines it to a dedicated worker thread (_ShellWorker). File ops use the native VFS API; execute_code stages source to a temp VFS file. Vends sandbox_bash/sandbox_file_editor via get_tools() using the canonical vended_tools factories, auto-registered by the agent. Both SDKs place the backend under experimental and name it for the product: - Python: strands.experimental.sandbox.strands_shell - TS: @strands-agents/sdk/experimental/sandbox/strands-shell strands-shell is an optional dependency (Python extra / TS optional peer dep, lazy-imported). Experimental: API subject to change.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
agent-of-mkmeral
left a comment
There was a problem hiding this comment.
Review — feat(sandbox): add experimental StrandsShellSandbox (Python + TS)
Reviewed head 6e0f615. Cloned the PR, read all 14 files end-to-end, ran the dependency-free worker tests (test_worker.py → 6/6 pass), and traced the implementation against the base Sandbox contract and the existing vended-tool factories in both SDKs. The real-shell tests are correctly skipif-gated on the optional package, so they don't run in this environment.
Overall: strong, well-documented PR. The thread-pinning design for the unsendable native shell is genuinely careful — the _ShellWorker + weakref.finalize + del factory + the "don't capture self" comment, backed by test_dropping_sandbox_terminates_worker_thread, is exactly the right shape and the regression test is excellent. The VFS-native file-ops detour around the missing base64 command is a sensible deviation from PosixShellSandbox, and staging code to a temp VFS file (instead of shell-escaping) is the right call.
A handful of things I'd want resolved before this ships — the first two are the important ones:
🔴 1. Python and TS vend tools under different names (and TS contradicts the PR description)
The PR description states: "Both vend sandbox_bash / sandbox_file_editor." That's true for Python (name="sandbox_bash" / "sandbox_file_editor", asserted in test_get_tools_returns_prefixed_bash_and_file_editor), but the TS side vends bash / fileEditor — it calls makeFileEditor(this, {description}) / makeBash(this, {description}) without a name, so it falls through to the factory defaults. The TS unit test even bakes this in: expect(names).toStrictEqual(['fileEditor', 'bash']).
Tool names are part of the agent-facing contract. A skill/prompt that references sandbox_bash works on Python and silently misses on TS. Pick one (the description says sandbox_*) and make both SDKs agree. Inline thread on the TS line.
🔴 2. Bind/Cred native objects are built on the main thread, then moved to the worker thread
The entire reason _ShellWorker exists is that strands_shell.Shell is PyO3-unsendable — it panics if touched off its creating thread. But in __init__, bind_objs and cred_objs (strands_shell.Bind(...) / strands_shell.Cred(...)) are constructed on the calling thread and then captured into _build, which runs on the worker thread and passes them into Shell(...). If Bind/Cred are also unsendable (very common for PyO3 types that aren't explicitly Send), this is a latent interpreter panic that CI won't catch because the integ tests are skipped here. The plain dicts (self._binds, self._credentials) are safe to capture — so the fix is to move the Bind/Cred construction inside _build. Inline thread with detail.
🟡 3. The vended sandbox_bash tool still exposes a timeout parameter that is a silent no-op
The class docstring correctly says per-call timeout is ignored, but the tool schema the model sees (make_bash → timeout: int = 120; TS sandboxBashInputSchema → timeout) still advertises it. The model will set timeout to bound a long command and be silently surprised when it has no effect. Consider surfacing "per-call timeout is ignored; the sandbox enforces a fixed Ns wall-clock" in the dynamic description suffix (you already build one), so the model isn't misled.
🟡 4. SANDBOX_BASH_DESCRIPTION tells the model state does not persist — but this sandbox is stateful
The shared SANDBOX_BASH_DESCRIPTION reads: "Each call runs in a fresh shell; state such as variables and the working directory does not persist across calls." For Strands Shell that's false — test_session_state_persists_across_calls proves export FOO=bar survives into the next call (session env/cwd persist; only the explicit cwd=/env= overrides are scoped via the subshell wrap). The model is being told the opposite of the truth, which will affect how it sequences commands. Since you already append a dynamic suffix, consider also correcting the persistence claim for this backend.
🟡 5. timeout validation is asymmetric across SDKs
test_missing_timeout_validation expects StrandsShellSandbox(timeout=0) → ValueError(match="positive, finite"), but there's no validation in Python __init__ — it's delegated to the native Shell constructor (surfaced through _ShellWorker.__init__). That works for Python, but TS creates the shell lazily (getShell() on first use), so new StrandsShellSandbox({timeout: 0}) constructs fine and only blows up on the first execute(). Different failure timing for the same misuse. Worth a note or a cheap eager check in TS.
Nits / things I liked
- ✅
execute_codewrite-failure-as-ExecutionResultpath (test_execute_code_reports_write_failure_as_result) + best-effort temp cleanup is solid, and symmetric across PY/TS. - ✅
list_filesmappingFileNotFoundError/ENOENT→SandboxPathNotFoundErrorkeeps the file-editor contract intact in both SDKs. - ✅ Credentials never leak into descriptions (
test_tool_descriptions_surface_urls_and_credentialsasserts"secret" not in description) — good defensive test. - Minor: the dynamic suffix says "Writes outside mounted paths are in-memory only" but doesn't note that writes inside a
direct(non-readonly) bind do reach the host. Could be worth one clause so the model understands the host-write surface.
None of these are merge-blockers on the architecture — #1 and #2 are correctness/consistency issues I'd fix first; #3–#5 are model-context/devx polish. Nice work overall.
agent-of-mkmeral
left a comment
There was a problem hiding this comment.
Two inline threads pinpointing the 🔴 findings from my summary.
| const suffix = this._dynamicInfo() | ||
| return [ | ||
| makeFileEditor(this, { description: `${DEFAULT_FILE_EDITOR_DESCRIPTION}${suffix}` }), | ||
| makeBash(this, { description: `${SANDBOX_BASH_DESCRIPTION}${suffix}` }), |
There was a problem hiding this comment.
🔴 Cross-SDK tool-name divergence. These calls omit name, so they fall through to the makeBash/makeFileEditor defaults bash / fileEditor. But the Python side uses name="sandbox_bash" / "sandbox_file_editor", and the PR description says both vend sandbox_*. The TS unit test bakes the mismatch in (expect(names).toStrictEqual(['fileEditor', 'bash'])). Tool names are an agent-facing contract — a skill referencing sandbox_bash works on PY and silently no-ops on TS. Pass { name: 'sandbox_bash', ... } / { name: 'sandbox_file_editor', ... } here (and update the test) so the two SDKs agree.
| ) | ||
| for b in self._binds | ||
| ] | ||
| cred_objs = [ |
There was a problem hiding this comment.
🔴 Unsendable objects built off the worker thread. bind_objs/cred_objs (strands_shell.Bind(...) / Cred(...)) are constructed here on the calling thread, then captured into _build, which runs on the dedicated worker thread and feeds them to Shell(...). The whole point of _ShellWorker is that the native PyO3 types are unsendable and panic when touched off their creating thread — if Bind/Cred are unsendable too (typical unless explicitly Send), this is a latent interpreter panic that CI won't catch because the integ tests are skipif-gated here. The plain dicts (self._binds, self._credentials) are safe to capture, so move the Bind/Cred construction inside _build so every native object is created on the worker thread.
| def test_construction_failure_does_not_leak(tmp_path): | ||
| # A bad bind source should fail construction; the worker thread is cleaned up. | ||
| with pytest.raises(Exception): # noqa: B017 - native error type is opaque | ||
| StrandsShellSandbox(binds=[{"source": "/nonexistent/path/xyz", "destination": "/w", "mode": "copy"}]) |
There was a problem hiding this comment.
Issue: The test name and comment claim "the worker thread is cleaned up," but the body only asserts that construction raises — it never verifies the worker thread actually terminated. If a future change leaked the thread on construction failure, this test would still pass.
Suggestion: Either assert no lingering strands-shell thread remains after the failure (as test_dropping_sandbox_terminates_worker_thread does), or trim the comment to match what's actually verified.
| *, | ||
| binds: Sequence[dict[str, Any]] | None = None, | ||
| credentials: Sequence[dict[str, str]] | None = None, | ||
| allowed_urls: Sequence[str] | None = None, |
There was a problem hiding this comment.
Issue: binds/credentials are typed as Sequence[dict[str, Any]], then accessed by string keys (b["source"], c["url"], etc.). This loses type safety on a public constructor — typos and missing required keys surface only as runtime KeyErrors, and the Any value type works against the repo guidance to avoid Any. The TS side already models these as proper interfaces (ShellBindConfig, ShellCredConfig).
Suggestion: Define TypedDicts (e.g. BindSpec, CredentialSpec) for these, mirroring the TS interfaces. That documents the required/optional keys, gives editor autocomplete, and keeps the two SDKs symmetric — at no runtime cost.
| entries = await sandbox.list_files("/tmp") | ||
| sized = next(f for f in entries if f.name == "sized.txt") | ||
| assert sized.is_dir is False | ||
| assert sized.size == 5 |
There was a problem hiding this comment.
Issue: This asserts is_dir and size field-by-field, which silently ignores any other (unexpected or regressed) fields on FileInfo. The TS equivalent already asserts the whole shape with toStrictEqual([{ name: 'a.txt', isDir: false, size: 3 }]).
Suggestion: Assert the full object in one equality check, e.g. assert sized == FileInfo(name="sized.txt", is_dir=False, size=5). Same applies to the streaming test (test_streaming_yields_chunks_then_result), which uses a loose any(...) rather than asserting the exact chunk sequence the TS test pins down.
| self._binds = [dict(b) for b in (binds or [])] | ||
| self._allowed_urls = list(allowed_urls or []) | ||
| self._credentials = [dict(c) for c in (credentials or [])] | ||
| self._timeout = timeout |
There was a problem hiding this comment.
Issue: The docstring documents Raises: ValueError: If timeout is not a positive, finite number, but nothing in this constructor validates timeout — the check is delegated entirely to the native strands_shell.Shell. The matching unit test (test_missing_timeout_validation) asserts on the substring "positive, finite", which is the external package's error text, so the SDK's documented contract is coupled to a dependency's message wording. It's also asymmetric with TS, which constructs the shell lazily and so wouldn't reject a bad timeout until first use.
Suggestion: Validate timeout here in the SDK (e.g. raise ValueError when not None and not a positive finite float) so the SDK owns the contract it documents, the error is raised eagerly and consistently across both SDKs, and the test no longer depends on a third-party message.
|
Assessment: Comment Solid, well-documented experimental feature. The thread-pinning design is clearly explained, the optional-dependency handling is clean in both SDKs, and the cross-SDK mirroring is faithful. My feedback is mostly about tightening the public contract and a few test-rigor points — none are blocking for an Review themes
Nicely done overall — the design notes in the PR description made this an easy review. |
Description
Strands Shell is a Bourne-compatible shell that runs entirely in-process — no
fork, noexec, no syscalls — and mediates access declaratively: the agent only reaches the host paths you bind, the URLs you allowlist, andcredentials it never sees. That makes it a natural execution backend for agents that need a real shell (run tests, grep, curl, edit files)
without handing over the host.
The SDK already has a
Sandboxabstraction with Docker and SSH backends. This addsStrandsShellSandbox, aSandboxbacked by StrandsShell, in both the Python and TypeScript SDKs. It ships under
experimentalbecause the configuration surface is still settling and wewant to iterate with community feedback before committing to it.
Because Strands Shell is an optional, separately-installed package, it is an optional dependency in both SDKs — the SDK imports cleanly
without it, and the sandbox raises a clear install hint only when constructed.
Public API Changes
Python — new
strands.experimental.sandbox.StrandsShellSandbox(requirespip install strands-agents[shell]):TypeScript — new
StrandsShellSandboxat@strands-agents/sdk/experimental/sandbox/strands-shell(requires the optional@strands-agents/shellpeer dependency):Both vend
sandbox_bash/sandbox_file_editor(built from the existingmake_bash/make_file_editorvended-tool factories) via thesandbox's tool-vending hook (
get_tools()/getTools()), so an agent constructed with the sandbox registers them automatically. Tooldescriptions surface the sandbox's mounts, timeout, and allowlists so the model knows what it can reach; the injected credential values
are never included.
Design notes for reviewers
base64command, so the basePosixShellSandboxfile-transfer path doesn't apply.File ops use the shell's native VFS API instead (which also yields real
sizemetadata), andexecute_codestages the source to a tempVFS file and runs the interpreter against it.
Shell(PyO3) is unsendable — it panics if created, used, or dropped on a thread otherthan its creator — while the SDK runs each
agent()call's event loop on a fresh thread. The Python sandbox therefore confines the shellto one dedicated worker thread (
_ShellWorker) and routes every operation through it, including teardown via aweakref.finalizethatdrops the shell on its owning thread. The TS native binding manages its own thread, so the TS sandbox just awaits it.
timeoutis intentionally ignored. Strands Shell enforces a single wall-clock timeout set at construction; the per-calltimeoutfrom the baseSandboxcontract has no effect here. This is documented on the class in both SDKs.Related Issues
Documentation PR
Type of Change
New feature
Testing
Unit tests cover the thread-pinned worker lifecycle (including a regression test that the worker thread terminates and the shell is
dropped on GC), command/code execution, native file ops, cwd/env scoping, error mapping, and the vended tool descriptions. An end-to-end
test drives a scripted model through the auto-registered tools (create a file via the editor, read it back via bash). Integration tests
exercise the same paths against the real
strands-shell/@strands-agents/shellpackage and are skipped when it isn't installed.hatch run prepareChecklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your
choice.