
feat: nested-Docker sandbox — --no-bwrap, greywall-netns-helper, --netns, --bridge-port#88

Draft
tito wants to merge 3 commits into main from mathieu/feat-nested-sandbox

tito commented Apr 23, 2026

Draft. Three stacked commits that together make greywall usable in environments where bubblewrap cannot create user namespaces — most notably inside Docker (including Docker Desktop) running as a non-root user, where uid_map writes are blocked by the host kernel regardless of cap_add SYS_ADMIN + seccomp=unconfined + apparmor=unconfined. See containers/bubblewrap#505 for the underlying limitation.

Each commit adds a self-contained feature; they are filed together because the later ones depend on the first. Happy to split if that makes review easier.

Commits

1. feat: --no-bwrap mode for nested-Docker / rootless environments (7c10e28)

Adds a --no-bwrap flag that skips bubblewrap entirely and enforces the sandbox using primitives that work for an unprivileged process:

  • Landlock: reused from the existing --landlock-apply wrapper.
  • seccomp-bpf: new direct loader via prctl(PR_SET_NO_NEW_PRIVS) + seccomp(SECCOMP_SET_MODE_FILTER) in a new internal/sandbox/linux_seccomp_apply.go. generateBPFInstructions() was factored out of the existing writeBPFProgram so both the file-writing path (bwrap --seccomp 3) and the direct-load path can share it.
  • env-var SOCKS5/HTTP proxy: WrapCommandLinuxNoBwrap (new file) emits the proxy URL via ALL_PROXY / HTTPS_PROXY instead of bwrap's tun2socks. This is fine for well-behaved HTTP clients, but a raw-socket bypass is possible at this layer (commit 2 addresses that).
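The idea of producing the BPF program in memory, so that one generator can feed both the file-writing path and the direct-load path, can be sketched as follows. This is a minimal illustration and not the code in this PR: `sockFilter` and `denyProgram` are hypothetical names, the syscall numbers are the x86_64 ones, and the architecture check that a production filter should perform first is omitted for brevity.

```go
package main

import "fmt"

// sockFilter mirrors the kernel's struct sock_filter (one classic-BPF instruction).
// Hypothetical type for illustration; the PR's own types may differ.
type sockFilter struct {
	Code uint16
	Jt   uint8
	Jf   uint8
	K    uint32
}

// Opcode and return-value constants from the Linux UAPI headers.
const (
	opLoadAbsW   = 0x20       // BPF_LD | BPF_W | BPF_ABS
	opJumpEqK    = 0x15       // BPF_JMP | BPF_JEQ | BPF_K
	opRetK       = 0x06       // BPF_RET | BPF_K
	retAllow     = 0x7fff0000 // SECCOMP_RET_ALLOW
	retErrnoBase = 0x00050000 // SECCOMP_RET_ERRNO
	ePerm        = 1          // EPERM
)

// denyProgram builds a deny-list filter: every listed syscall number
// returns EPERM, everything else is allowed.
func denyProgram(denied []uint32) []sockFilter {
	// Load seccomp_data.nr, which sits at offset 0.
	prog := []sockFilter{{Code: opLoadAbsW, K: 0}}
	for _, nr := range denied {
		prog = append(prog,
			// On a match fall through to the RET below; otherwise skip it.
			sockFilter{Code: opJumpEqK, Jt: 0, Jf: 1, K: nr},
			sockFilter{Code: opRetK, K: retErrnoBase | ePerm},
		)
	}
	return append(prog, sockFilter{Code: opRetK, K: retAllow})
}

func main() {
	// x86_64 syscall numbers: ptrace=101, mount=165, kexec_load=246.
	for i, ins := range denyProgram([]uint32{101, 165, 246}) {
		fmt.Printf("%2d: code=0x%02x jt=%d jf=%d k=0x%08x\n", i, ins.Code, ins.Jt, ins.Jf, ins.K)
	}
}
```

The same slice could then be serialized to a file descriptor for bwrap or handed to the seccomp() syscall after PR_SET_NO_NEW_PRIVS is set; neither step is shown here.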

Files:

cmd/greywall/main.go                         +33
internal/sandbox/linux_nobwrap.go            +69 (new)
internal/sandbox/linux_seccomp.go            +30 / −18 (refactor)
internal/sandbox/linux_seccomp_apply.go      +86 (new)
internal/sandbox/linux_seccomp_apply_stub.go  +8 (new)
internal/sandbox/linux_stub.go                +5
internal/sandbox/manager.go                  +25

Behaviour when --no-bwrap is unset: unchanged.

2. feat: greywall-netns-helper + --netns flag for transparent capture (5de0788)

Adds a separate helper binary greywall-netns-helper (installed with setcap cap_net_admin,cap_sys_admin+ep, never root) that:

  • create --proxy URL: unshare(CLONE_NEWNET), bring up tun0 at 198.18.0.1/15, set the default route via tun0, launch the existing embedded tun2socks inside the netns pointed at the SOCKS5 proxy, bind-mount /proc/self/ns/net to /run/greywall/ns-<uuid> to pin it, print the pin path, and exit.
  • exec --netns PATH -- CMD: setns into the pin, drop all caps (effective + permitted + inheritable + ambient + bounding set), then syscall.Exec the user command. Rejects paths outside /run/greywall so the file caps can't be leveraged to enter an arbitrary netns.
  • destroy PATH: SIGTERM the recorded tun2socks pid, then unmount and remove the pin.
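The path restriction on exec could look roughly like the sketch below. It is an assumption about the shape of the check, not the helper's actual code: `validPinPath` is a hypothetical name, and the check is purely lexical. A real helper would likely also want to guard against symlinks and confirm that the file really is a pinned namespace.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// pinDir is the only directory from which pins are accepted.
const pinDir = "/run/greywall"

// validPinPath (hypothetical) accepts only absolute paths that, once
// cleaned, sit directly inside pinDir and are named ns-<something>.
func validPinPath(p string) bool {
	if !filepath.IsAbs(p) {
		return false
	}
	clean := filepath.Clean(p) // resolves "." and ".." lexically
	return filepath.Dir(clean) == pinDir &&
		strings.HasPrefix(filepath.Base(clean), "ns-")
}

func main() {
	for _, p := range []string{
		"/run/greywall/ns-1234",
		"/run/greywall/../../proc/1/ns/net",
		"/run/greywall-evil/ns-1234",
	} {
		fmt.Printf("%-40s %v\n", p, validPinPath(p))
	}
}
```

Comparing the cleaned parent directory for equality, rather than using a string-prefix test on the whole path, avoids accepting sibling directories such as /run/greywall-evil.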

The main greywall binary gets a paired --netns <path> flag that, when combined with --no-bwrap, prefixes the command chain with greywall-netns-helper exec --netns <path> -- so the wrapped process enters the prepared netns before Landlock/seccomp are applied.
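As a rough illustration of the prefixing described above, the sketch below builds the final argv. The function name `wrapArgv` and the example pin path are hypothetical; the inner command follows the `greywall --landlock-apply --seccomp -- bash -c ...` shape given in the first commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// wrapArgv (hypothetical) puts the netns helper in front of the existing
// wrapper chain when a pin path is given, and leaves the chain untouched
// otherwise.
func wrapArgv(helper, pin string, inner []string) []string {
	if pin == "" {
		return inner
	}
	argv := []string{helper, "exec", "--netns", pin, "--"}
	return append(argv, inner...)
}

func main() {
	inner := []string{"greywall", "--landlock-apply", "--seccomp", "--", "bash", "-c", "id"}
	// The pin path below is a made-up example value.
	argv := wrapArgv("greywall-netns-helper", "/run/greywall/ns-example", inner)
	fmt.Println(strings.Join(argv, " "))
}
```

Because the helper runs first, the process is already inside the prepared netns and has dropped its capabilities by the time Landlock and seccomp are applied.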

Result: kernel-enforced egress capture (every TCP/UDP packet goes through tun0 → tun2socks → SOCKS5) without requiring the wrapped process to hold any capabilities and without relying on it honoring proxy env vars. Raw sockets are no longer a bypass.

CAP_NET_ADMIN is raised into the ambient capability set so that the ip and tun2socks children of the helper inherit it, without the helper having to reimplement the netlink calls in Go.

3. feat(netns-helper): --bridge-port (e735ca7)

Optional --bridge-port N flag on create that makes a TCP server inside the netns reachable from the host netns (typical use: a sandboxed server whose API is consumed by a trusted host-netns orchestrator). Mechanism:

  • Inside netns: socat UNIX-LISTEN:<pin>.sock → TCP:127.0.0.1:N
  • Host netns (sibling spawned before unshare): socat TCP4-LISTEN:N,bind=127.0.0.1 → UNIX-CONNECT:<pin>.sock
  • The shared Unix socket lives on the regular filesystem, so it crosses the netns boundary transparently.
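The two socat invocations can be written down as argument vectors, as in the sketch below. `bridgeCommands` and the example pin path are hypothetical; only the address forms named above are used, and socat options the description does not mention (for example fork or reuseaddr) are deliberately left out.

```go
package main

import (
	"fmt"
	"strings"
)

// bridgeCommands (hypothetical) returns the socat argv for each side of
// the bridge. Both sides meet at <pin>.sock on the shared filesystem.
func bridgeCommands(pin string, port int) (inside, host []string) {
	sock := pin + ".sock"
	// Inside the netns: Unix socket -> loopback TCP port of the sandboxed server.
	inside = []string{"socat", "UNIX-LISTEN:" + sock, fmt.Sprintf("TCP:127.0.0.1:%d", port)}
	// Host netns: loopback TCP listener -> Unix socket.
	host = []string{"socat", fmt.Sprintf("TCP4-LISTEN:%d,bind=127.0.0.1", port), "UNIX-CONNECT:" + sock}
	return inside, host
}

func main() {
	// The pin path is a made-up example value.
	inside, host := bridgeCommands("/run/greywall/ns-example", 4096)
	fmt.Println("netns:", strings.Join(inside, " "))
	fmt.Println("host: ", strings.Join(host, " "))
}
```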

destroy now reads a multi-line PID sidecar (<pin>.pid carrying tun2socks + both socat pids) and SIGTERMs every recorded pid. The _bridge-host internal subcommand validates that the socket path sits alongside a valid pin so the file caps can't be abused to proxy arbitrary Unix sockets.
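Reading such a sidecar might look like the sketch below. The on-disk format is an assumption (one decimal pid per line), and `parsePidfile` is a hypothetical name. Rejecting values of 1 or lower is a defensive choice in this sketch, since signalling pid 0 or a negative pid would address process groups rather than a single process.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parsePidfile (hypothetical) parses a sidecar with one pid per line,
// skipping blank lines and failing on anything that is not a pid > 1.
func parsePidfile(content string) ([]int, error) {
	var pids []int
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		pid, err := strconv.Atoi(line)
		if err != nil || pid <= 1 {
			return nil, fmt.Errorf("invalid pid line %q", line)
		}
		pids = append(pids, pid)
	}
	return pids, sc.Err()
}

func main() {
	// Example content with made-up pids for tun2socks and the two socat processes.
	pids, err := parsePidfile("4101\n4102\n4103\n")
	fmt.Println(pids, err)
}
```

A caller would then send SIGTERM to each returned pid before removing the pin, the pidfile and the socket.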

Security notes

  • The main greywall binary stays uncap'd; all privileged setup is isolated to the tiny greywall-netns-helper binary.
  • greywall-netns-helper exec validates its --netns path is under /run/greywall/ and drops every cap set (effective, permitted, inheritable, ambient, bounding) before syscall.Exec.
  • greywall-netns-helper _bridge-host (internal) validates its --socket path similarly.
  • The file caps (cap_net_admin,cap_sys_admin+ep) follow the principle of least privilege for the helper: CAP_SYS_ADMIN is required for unshare(CLONE_NEWNET) and setns; CAP_NET_ADMIN for ip tuntap/addr/link/route. Nothing else is granted.

Install delta

  • Linux: package dependencies gain iproute2 (the ip tool) and libcap2-bin (at install-time, for setcap). tun2socks binary is already embedded in the greywall source tree for amd64/arm64.
  • A writable /run/greywall/ directory owned by the invoking user must exist at runtime (helper does not attempt to chown). Suggested deployment: mkdir -p /run/greywall && chown $USER:$USER /run/greywall, or a systemd-tmpfiles drop-in.

Testing

All three commits verified end-to-end inside a plain python:3.12-slim Docker container running as a non-root user with the usual cap_add SYS_ADMIN NET_ADMIN + security_opt seccomp=unconfined apparmor=unconfined:

  • --no-bwrap alone: wrapped process has uid=1000, CapEff=0, NoNewPrivs=1, Seccomp=2, Seccomp_filters=1, Landlock rules enforced (write to /etc → EACCES), write to cwd allowed.
  • --no-bwrap --netns <pin>: same sandbox state plus the process netns differs from the host netns; ip -br link inside shows only lo + tun0, eth0 not visible.
  • --bridge-port 4096: a host-netns client connects to 127.0.0.1:4096 and talks to an HTTP server bound at 127.0.0.1:4096 inside the netns; round-trip confirmed with a standard health-check response.
  • Cleanup on destroy: all three pids (tun2socks + 2 socat) are gone, pin + .sock + .pid files unlinked.
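The sandbox-state values quoted above come from /proc/<pid>/status. One way to pull them out programmatically is sketched below; `statusFields` is a hypothetical helper, and the sample text is illustrative rather than captured output. Note that the kernel prints CapEff as a 16-digit hex mask, so "CapEff=0" corresponds to all zeros.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// statusFields (hypothetical) extracts the requested keys from the text
// of a /proc/<pid>/status file.
func statusFields(status string, keys ...string) map[string]string {
	want := make(map[string]bool, len(keys))
	for _, k := range keys {
		want[k] = true
	}
	out := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		// Each line has the form "Key:\tvalue".
		key, val, ok := strings.Cut(sc.Text(), ":")
		if ok && want[key] {
			out[key] = strings.TrimSpace(val)
		}
	}
	return out
}

func main() {
	// Illustrative sample, not captured from a real run.
	sample := "Name:\tsh\nCapEff:\t0000000000000000\nNoNewPrivs:\t1\nSeccomp:\t2\nSeccomp_filters:\t1\n"
	f := statusFields(sample, "CapEff", "NoNewPrivs", "Seccomp", "Seccomp_filters")
	fmt.Println(f["CapEff"], f["NoNewPrivs"], f["Seccomp"], f["Seccomp_filters"])
}
```

In a real check the input would be the contents of /proc/self/status read from inside the wrapped process.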

Builds cleanly with both go build ./... and GOOS=linux GOARCH=arm64 go build ./... .

Test plan

  • go build ./... on darwin/arm64
  • GOOS=linux GOARCH=arm64 go build ./...
  • --no-bwrap -- <cmd> as non-root user in Docker → Landlock + seccomp applied
  • greywall-netns-helper create → netns pinned, tun2socks up, helper exits cleanly
  • greywall --no-bwrap --netns <pin> -- <cmd> → child runs in isolated netns with no caps
  • greywall-netns-helper destroy → all pids SIGTERM'd, files cleaned up
  • --bridge-port N → bidirectional TCP reachable from host netns
  • Cross-review of the setcap'd helper's security posture by maintainers

Mathieu Virbel and others added 3 commits April 22, 2026 18:25
Bubblewrap cannot create user namespaces inside Docker Desktop's VM
when running as a non-root user (uid_map write is blocked regardless
of cap_add SYS_ADMIN + seccomp/apparmor unconfined). This commit adds
a --no-bwrap flag that skips bubblewrap entirely and enforces the
sandbox using primitives that work unprivileged:

- Landlock (already applied via the internal --landlock-apply wrapper)
- seccomp-bpf (loaded directly via prctl + SECCOMP_SET_MODE_FILTER)
- env-var-based SOCKS5/HTTP proxy (HTTP libraries honor ALL_PROXY)

Layout:

- internal/sandbox/linux_seccomp.go: factored out generateBPFInstructions()
  so the BPF program can be produced in memory without writing a file.
- internal/sandbox/linux_seccomp_apply.go (NEW): ApplySeccompFilter loads
  the filter directly into the current process via the seccomp() syscall.
  Idempotent wrt PR_SET_NO_NEW_PRIVS (Landlock's Apply already sets it).
- internal/sandbox/linux_nobwrap.go (NEW): WrapCommandLinuxNoBwrap emits
  a short shell script that exports GREYWALL_CONFIG_JSON + proxy env
  vars, then execs `greywall --landlock-apply --seccomp -- bash -c ...`.
- cmd/greywall/main.go: added --no-bwrap flag; extended runLandlockWrapper
  to accept --seccomp which triggers ApplySeccompFilter after Landlock.
- internal/sandbox/manager.go: Manager.noBwrap + SetNoBwrap; dispatches
  to WrapCommandLinuxNoBwrap in WrapCommand; skips proxy/DNS bridge
  initialization in no-bwrap mode (no Unix-socket bind target anyway).

What this path gives up vs. full bwrap:
- mount namespace / FS view isolation (Landlock denies, doesn't hide)
- PID namespace
- transparent tun2socks capture (needs a netns; out of scope here —
  follow-up work will add a --netns flag + a privileged netns helper)

What it keeps:
- Landlock filesystem access control
- seccomp syscall denial (ptrace, mount, kexec, TIOCSTI, ...)
- env-based proxy routing
- zero privileges required on the wrapper or wrapped process

Adds Stage B of --no-bwrap: a separate setcap'd helper binary that
builds a persistent network namespace with tun2socks, paired with a
new --netns flag on greywall that routes the sandbox through it.
Together with --no-bwrap this restores kernel-enforced egress
capture (all traffic from the sandbox → tun0 → tun2socks → SOCKS5
proxy) without requiring the sandboxed command to hold any
privileges and without relying on the process to honor proxy env
vars.

greywall-netns-helper subcommands:

  create --proxy URL [--tun2socks PATH]
    unshare CLONE_NEWNET, bring up tun0 at 198.18.0.1/15, add
    default route via tun0, launch tun2socks inside the netns,
    pin at /run/greywall/ns-<uuid> via bind-mount, print pin path.
    Needs CAP_NET_ADMIN + CAP_SYS_ADMIN (via file caps, not root).
    Ambient-caps raise CAP_NET_ADMIN so ip/tun2socks children
    inherit it.

  exec --netns PATH -- CMD [ARGS...]
    setns into the pinned netns, clear all cap sets (effective,
    permitted, inheritable, ambient) + bounding set, then
    syscall.Exec CMD. Strictly rejects netns paths outside
    /run/greywall to prevent abuse of the file caps for arbitrary
    namespace entry.

  destroy PATH
    SIGTERM the recorded tun2socks pid, unmount and unlink the pin
    and its .pid sidecar.

Greywall CLI:

  --netns <path>        Requires --no-bwrap. Inserts
                        `greywall-netns-helper exec --netns <path> --`
                        in front of the landlock-apply wrapper chain.

  --netns-helper <path> Override helper location (default: PATH).

When --netns is set, the env-var SOCKS5 proxy injection in
WrapCommandLinuxNoBwrap is skipped: traffic is already captured by
tun0 inside the netns, and ALL_PROXY would just double-proxy to an
unreachable localhost port.

Installation requires:

  setcap cap_net_admin,cap_sys_admin+ep /usr/local/bin/greywall-netns-helper

and a writable /run/greywall (deployment-side; the helper itself
does not attempt to chown /run).

Verified in Docker (non-root agent user, no bubblewrap):

  * helper create → pin created, tun2socks running in netns
  * helper exec   → child runs with CapEff=0 inside netns
  * full chain    → greywall --no-bwrap --netns <path> yields
                    uid=1000, CapEff=0, NoNewPrivs=1, Seccomp=2,
                    Seccomp_filters=1, isolated netns (tun0 + lo,
                    no eth0). Parent shell's netns unaffected.

…socket

Adds an optional `--bridge-port N` flag to `create` that lets a host-netns
client reach a TCP listener inside the pinned netns. Mechanism:

1. Before unshare(CLONE_NEWNET), `create` spawns a sibling via an internal
   `_bridge-host` subcommand that stays in the host netns. It waits for
   the shared Unix socket to appear, drops all caps, then execs socat
   TCP4-LISTEN:N,bind=127.0.0.1 -> UNIX-CONNECT:<pin>.sock.
2. After entering the new netns, `create` also spawns socat
   UNIX-LISTEN:<pin>.sock -> TCP:127.0.0.1:N inside the netns, so
   incoming connections to the host port land on the in-netns TCP port.

`destroy` now reads a multi-line pid sidecar and SIGTERMs every recorded
pid (host-bridge, tun2socks, inside-bridge), then removes the pin, pidfile
and .sock. `_bridge-host` validates that the socket lives alongside a
valid pin path so the setcap'd helper can't be leveraged to proxy
arbitrary Unix sockets.

Use case: an orchestrator that drives an HTTP/RPC server inside the
pinned sandbox netns from a host-netns control process. Without the
bridge, dearmail-style designs (the sandboxed process exposes an API
that the trusted orchestrator consumes) couldn't survive the netns
isolation.