Skip to content

feat(sandbox): add experimental StrandsShellSandbox (Python + TS)#2839

Open
mkmeral wants to merge 1 commit into
strands-agents:mainfrom
mkmeral:feat/strands-shell-sandbox
Open

feat(sandbox): add experimental StrandsShellSandbox (Python + TS)#2839
mkmeral wants to merge 1 commit into
strands-agents:mainfrom
mkmeral:feat/strands-shell-sandbox

Conversation

@mkmeral

@mkmeral mkmeral commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Description

Strands Shell is a Bourne-compatible shell that runs entirely in-process — no fork, no
exec, no syscalls — and mediates access declaratively: the agent only reaches the host paths you bind, the URLs you allowlist, and
credentials it never sees. That makes it a natural execution backend for agents that need a real shell (run tests, grep, curl, edit files)
without handing over the host.

The SDK already has a Sandbox abstraction with Docker and SSH backends. This adds StrandsShellSandbox, a Sandbox backed by Strands
Shell, in both the Python and TypeScript SDKs. It ships under experimental because the configuration surface is still settling and we
want to iterate with community feedback before committing to it.

Because Strands Shell is an optional, separately-installed package, it is an optional dependency in both SDKs — the SDK imports cleanly
without it, and the sandbox raises a clear install hint only when constructed.

Public API Changes

Python — new strands.experimental.sandbox.StrandsShellSandbox (requires pip install strands-agents[shell]):

from strands import Agent
from strands.experimental.sandbox import StrandsShellSandbox

sandbox = StrandsShellSandbox(
    binds=[{"source": "/my/project", "destination": "/workspace", "mode": "copy"}],
    timeout=30.0,
)
# Vends sandbox_bash and sandbox_file_editor, auto-registered by the agent.
agent = Agent(sandbox=sandbox)
agent("List the Python files in /workspace and summarize them")

TypeScript — new StrandsShellSandbox at @strands-agents/sdk/experimental/sandbox/strands-shell (requires the optional
@strands-agents/shell peer dependency):

import { Agent } from '@strands-agents/sdk'
import { StrandsShellSandbox } from '@strands-agents/sdk/experimental/sandbox/strands-shell'

const sandbox = new StrandsShellSandbox({
  binds: [{ source: '/my/project', destination: '/workspace', mode: 'copy' }],
  timeout: 30,
})
const agent = new Agent({ sandbox })
await agent.invoke('List the files in /workspace and summarize them')

Both vend sandbox_bash / sandbox_file_editor (built from the existing make_bash / make_file_editor vended-tool factories) via the
sandbox's tool-vending hook (get_tools() / getTools()), so an agent constructed with the sandbox registers them automatically. Tool
descriptions surface the sandbox's mounts, timeout, and allowlists so the model knows what it can reach; the injected credential values
are never included.

Design notes for reviewers

  • File and code operations. Strands Shell ships no base64 command, so the base PosixShellSandbox file-transfer path doesn't apply.
    File ops use the shell's native VFS API instead (which also yields real size metadata), and execute_code stages the source to a temp
    VFS file and runs the interpreter against it.
  • Thread-pinning (Python). The native Python Shell (PyO3) is unsendable — it panics if created, used, or dropped on a thread other
    than its creator — while the SDK runs each agent() call's event loop on a fresh thread. The Python sandbox therefore confines the shell
    to one dedicated worker thread (_ShellWorker) and routes every operation through it, including teardown via a weakref.finalize that
    drops the shell on its owning thread. The TS native binding manages its own thread, so the TS sandbox just awaits it.
  • Per-call timeout is intentionally ignored. Strands Shell enforces a single wall-clock timeout set at construction; the per-call
    timeout from the base Sandbox contract has no effect here. This is documented on the class in both SDKs.

Related Issues

Documentation PR

Type of Change

New feature

Testing

Unit tests cover the thread-pinned worker lifecycle (including a regression test that the worker thread terminates and the shell is
dropped on GC), command/code execution, native file ops, cwd/env scoping, error mapping, and the vended tool descriptions. An end-to-end
test drives a scripted model through the auto-registered tools (create a file via the editor, read it back via bash). Integration tests
exercise the same paths against the real strands-shell / @strands-agents/shell package and are skipped when it isn't installed.

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have reviewed and understand every line of code in this PR, including any generated by AI tools, and I can explain why it works
  • My change is focused and reasonably small; I have split unrelated work into separate PRs
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your
choice.

Adds StrandsShellSandbox, an experimental Sandbox backed by the optional
Strands Shell package (an in-process, declaratively-mediated shell). The native
shell is thread-pinned, so the Python implementation confines it to a dedicated
worker thread (_ShellWorker). File ops use the native VFS API; execute_code
stages source to a temp VFS file. Vends sandbox_bash/sandbox_file_editor via
get_tools() using the canonical vended_tools factories, auto-registered by the
agent.

Both SDKs place the backend under experimental and name it for the product:
- Python: strands.experimental.sandbox.strands_shell
- TS: @strands-agents/sdk/experimental/sandbox/strands-shell

strands-shell is an optional dependency (Python extra / TS optional peer dep,
lazy-imported). Experimental: API subject to change.
@github-actions github-actions Bot added size/xl enhancement New feature or request area-server Related to strands Server, sandbox, runtime container strands-running labels Jun 16, 2026
@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 96.05263% with 6 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../src/strands/experimental/sandbox/strands_shell.py 96.26% 3 Missing and 1 partial ⚠️
...nds-py/src/strands/experimental/sandbox/_worker.py 95.12% 1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

@agent-of-mkmeral agent-of-mkmeral left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — feat(sandbox): add experimental StrandsShellSandbox (Python + TS)

Reviewed head 6e0f615. Cloned the PR, read all 14 files end-to-end, ran the dependency-free worker tests (test_worker.py6/6 pass), and traced the implementation against the base Sandbox contract and the existing vended-tool factories in both SDKs. The real-shell tests are correctly skipif-gated on the optional package, so they don't run in this environment.

Overall: strong, well-documented PR. The thread-pinning design for the unsendable native shell is genuinely careful — the _ShellWorker + weakref.finalize + del factory + the "don't capture self" comment, backed by test_dropping_sandbox_terminates_worker_thread, is exactly the right shape and the regression test is excellent. The VFS-native file-ops detour around the missing base64 command is a sensible deviation from PosixShellSandbox, and staging code to a temp VFS file (instead of shell-escaping) is the right call.

A handful of things I'd want resolved before this ships — the first two are the important ones:

🔴 1. Python and TS vend tools under different names (and TS contradicts the PR description)

The PR description states: "Both vend sandbox_bash / sandbox_file_editor." That's true for Python (name="sandbox_bash" / "sandbox_file_editor", asserted in test_get_tools_returns_prefixed_bash_and_file_editor), but the TS side vends bash / fileEditor — it calls makeFileEditor(this, {description}) / makeBash(this, {description}) without a name, so it falls through to the factory defaults. The TS unit test even bakes this in: expect(names).toStrictEqual(['fileEditor', 'bash']).

Tool names are part of the agent-facing contract. A skill/prompt that references sandbox_bash works on Python and silently misses on TS. Pick one (the description says sandbox_*) and make both SDKs agree. Inline thread on the TS line.

🔴 2. Bind/Cred native objects are built on the main thread, then moved to the worker thread

The entire reason _ShellWorker exists is that strands_shell.Shell is PyO3-unsendable — it panics if touched off its creating thread. But in __init__, bind_objs and cred_objs (strands_shell.Bind(...) / strands_shell.Cred(...)) are constructed on the calling thread and then captured into _build, which runs on the worker thread and passes them into Shell(...). If Bind/Cred are also unsendable (very common for PyO3 types that aren't explicitly Send), this is a latent interpreter panic that CI won't catch because the integ tests are skipped here. The plain dicts (self._binds, self._credentials) are safe to capture — so the fix is to move the Bind/Cred construction inside _build. Inline thread with detail.

🟡 3. The vended sandbox_bash tool still exposes a timeout parameter that is a silent no-op

The class docstring correctly says per-call timeout is ignored, but the tool schema the model sees (make_bashtimeout: int = 120; TS sandboxBashInputSchematimeout) still advertises it. The model will set timeout to bound a long command and be silently surprised when it has no effect. Consider surfacing "per-call timeout is ignored; the sandbox enforces a fixed Ns wall-clock" in the dynamic description suffix (you already build one), so the model isn't misled.

🟡 4. SANDBOX_BASH_DESCRIPTION tells the model state does not persist — but this sandbox is stateful

The shared SANDBOX_BASH_DESCRIPTION reads: "Each call runs in a fresh shell; state such as variables and the working directory does not persist across calls." For Strands Shell that's falsetest_session_state_persists_across_calls proves export FOO=bar survives into the next call (session env/cwd persist; only the explicit cwd=/env= overrides are scoped via the subshell wrap). The model is being told the opposite of the truth, which will affect how it sequences commands. Since you already append a dynamic suffix, consider also correcting the persistence claim for this backend.

🟡 5. timeout validation is asymmetric across SDKs

test_missing_timeout_validation expects StrandsShellSandbox(timeout=0)ValueError(match="positive, finite"), but there's no validation in Python __init__ — it's delegated to the native Shell constructor (surfaced through _ShellWorker.__init__). That works for Python, but TS creates the shell lazily (getShell() on first use), so new StrandsShellSandbox({timeout: 0}) constructs fine and only blows up on the first execute(). Different failure timing for the same misuse. Worth a note or a cheap eager check in TS.


Nits / things I liked

  • execute_code write-failure-as-ExecutionResult path (test_execute_code_reports_write_failure_as_result) + best-effort temp cleanup is solid, and symmetric across PY/TS.
  • list_files mapping FileNotFoundError/ENOENTSandboxPathNotFoundError keeps the file-editor contract intact in both SDKs.
  • ✅ Credentials never leak into descriptions (test_tool_descriptions_surface_urls_and_credentials asserts "secret" not in description) — good defensive test.
  • Minor: the dynamic suffix says "Writes outside mounted paths are in-memory only" but doesn't note that writes inside a direct (non-readonly) bind do reach the host. Could be worth one clause so the model understands the host-write surface.

None of these are merge-blockers on the architecture — #1 and #2 are correctness/consistency issues I'd fix first; #3#5 are model-context/devx polish. Nice work overall.

@agent-of-mkmeral agent-of-mkmeral left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two inline threads pinpointing the 🔴 findings from my summary.

const suffix = this._dynamicInfo()
return [
makeFileEditor(this, { description: `${DEFAULT_FILE_EDITOR_DESCRIPTION}${suffix}` }),
makeBash(this, { description: `${SANDBOX_BASH_DESCRIPTION}${suffix}` }),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Cross-SDK tool-name divergence. These calls omit name, so they fall through to the makeBash/makeFileEditor defaults bash / fileEditor. But the Python side uses name="sandbox_bash" / "sandbox_file_editor", and the PR description says both vend sandbox_*. The TS unit test bakes the mismatch in (expect(names).toStrictEqual(['fileEditor', 'bash'])). Tool names are an agent-facing contract — a skill referencing sandbox_bash works on PY and silently no-ops on TS. Pass { name: 'sandbox_bash', ... } / { name: 'sandbox_file_editor', ... } here (and update the test) so the two SDKs agree.

)
for b in self._binds
]
cred_objs = [

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Unsendable objects built off the worker thread. bind_objs/cred_objs (strands_shell.Bind(...) / Cred(...)) are constructed here on the calling thread, then captured into _build, which runs on the dedicated worker thread and feeds them to Shell(...). The whole point of _ShellWorker is that the native PyO3 types are unsendable and panic when touched off their creating thread — if Bind/Cred are unsendable too (typical unless explicitly Send), this is a latent interpreter panic that CI won't catch because the integ tests are skipif-gated here. The plain dicts (self._binds, self._credentials) are safe to capture, so move the Bind/Cred construction inside _build so every native object is created on the worker thread.

def test_construction_failure_does_not_leak(tmp_path):
# A bad bind source should fail construction; the worker thread is cleaned up.
with pytest.raises(Exception): # noqa: B017 - native error type is opaque
StrandsShellSandbox(binds=[{"source": "/nonexistent/path/xyz", "destination": "/w", "mode": "copy"}])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The test name and comment claim "the worker thread is cleaned up," but the body only asserts that construction raises — it never verifies the worker thread actually terminated. If a future change leaked the thread on construction failure, this test would still pass.

Suggestion: Either assert no lingering strands-shell thread remains after the failure (as test_dropping_sandbox_terminates_worker_thread does), or trim the comment to match what's actually verified.

*,
binds: Sequence[dict[str, Any]] | None = None,
credentials: Sequence[dict[str, str]] | None = None,
allowed_urls: Sequence[str] | None = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: binds/credentials are typed as Sequence[dict[str, Any]], then accessed by string keys (b["source"], c["url"], etc.). This loses type safety on a public constructor — typos and missing required keys surface only as runtime KeyErrors, and the Any value type works against the repo guidance to avoid Any. The TS side already models these as proper interfaces (ShellBindConfig, ShellCredConfig).

Suggestion: Define TypedDicts (e.g. BindSpec, CredentialSpec) for these, mirroring the TS interfaces. That documents the required/optional keys, gives editor autocomplete, and keeps the two SDKs symmetric — at no runtime cost.

entries = await sandbox.list_files("/tmp")
sized = next(f for f in entries if f.name == "sized.txt")
assert sized.is_dir is False
assert sized.size == 5

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: This asserts is_dir and size field-by-field, which silently ignores any other (unexpected or regressed) fields on FileInfo. The TS equivalent already asserts the whole shape with toStrictEqual([{ name: 'a.txt', isDir: false, size: 3 }]).

Suggestion: Assert the full object in one equality check, e.g. assert sized == FileInfo(name="sized.txt", is_dir=False, size=5). Same applies to the streaming test (test_streaming_yields_chunks_then_result), which uses a loose any(...) rather than asserting the exact chunk sequence the TS test pins down.

self._binds = [dict(b) for b in (binds or [])]
self._allowed_urls = list(allowed_urls or [])
self._credentials = [dict(c) for c in (credentials or [])]
self._timeout = timeout

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: The docstring documents Raises: ValueError: If timeout is not a positive, finite number, but nothing in this constructor validates timeout — the check is delegated entirely to the native strands_shell.Shell. The matching unit test (test_missing_timeout_validation) asserts on the substring "positive, finite", which is the external package's error text, so the SDK's documented contract is coupled to a dependency's message wording. It's also asymmetric with TS, which constructs the shell lazily and so wouldn't reject a bad timeout until first use.

Suggestion: Validate timeout here in the SDK (e.g. raise ValueError when not None and not a positive finite float) so the SDK owns the contract it documents, the error is raised eagerly and consistently across both SDKs, and the test no longer depends on a third-party message.

@github-actions

Copy link
Copy Markdown
Contributor

Assessment: Comment

Solid, well-documented experimental feature. The thread-pinning design is clearly explained, the optional-dependency handling is clean in both SDKs, and the cross-SDK mirroring is faithful. My feedback is mostly about tightening the public contract and a few test-rigor points — none are blocking for an experimental module.

Review themes
  • API design: The constructor accepts loosely-typed Sequence[dict[str, Any]] for binds/credentials on the Python side where TS uses real interfaces. TypedDicts would restore type safety and keep the SDKs symmetric.
  • Contract ownership: The documented timeout ValueError isn't actually enforced by the SDK — it's delegated to the native package's error text, which a unit test then matches on. Validating in-SDK makes the contract real, eager, and consistent across Python/TS.
  • Test rigor: A couple of Python tests assert field-by-field or with loose any(...) where the TS suite already pins the full object/sequence; one test's name over-claims what it verifies.
  • Process: This adds a new public class customers will use. Per team/API_BAR_RAISING.md that's a "moderate" change warranting an api/needs-review label + an API reviewer pass before merge — worth confirming since the label isn't set.

Nicely done overall — the design notes in the PR description made this an easy review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-server Related to strands Server, sandbox, runtime container enhancement New feature or request size/xl

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants