Agent Diagnostic
GPU gateway fails on DGX Spark — missing cgroup controllers (kubepods)
Summary
The GPU-enabled gateway fails to start on a DGX Spark (Founders Edition) when the NemoClaw installer runs openshell gateway start --name nemoclaw --gpu. The non-GPU gateway works on the same machine: openshell sandbox create succeeds and I can connect to the sandbox without issues.
Environment
- Hardware: NVIDIA DGX Spark (Founders Edition), GB10 Grace Blackwell Superchip, 128GB unified memory
- Architecture: aarch64
- OS: DGX OS (Ubuntu-based)
- Kernel: (run uname -r and paste here)
- OpenShell version: 0.0.6
- Docker:
  - Cgroup Driver: systemd
  - Cgroup Version: 2
- NVIDIA Container Toolkit: 1.19.0
- NVIDIA GPU: 1 GPU detected, 124610 MB VRAM
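The kernel field above is still a placeholder. A minimal sketch for collecting the environment values in one pass is shown here; the function name collect_env is hypothetical, and it assumes docker and nvidia-ctk are on the PATH (each is skipped if not found).

```shell
#!/bin/sh
# Hypothetical helper: print the environment details requested in this report.
collect_env() {
    echo "kernel: $(uname -r)"
    echo "arch: $(uname -m)"
    # Filesystem type of the cgroup mount; cgroup2fs indicates cgroup v2.
    echo "cgroup fs: $(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)"
    if command -v docker >/dev/null 2>&1; then
        # CgroupDriver and CgroupVersion are fields of `docker info`.
        echo "docker cgroup: $(docker info --format '{{.CgroupDriver}} / v{{.CgroupVersion}}' 2>/dev/null)"
    fi
    if command -v nvidia-ctk >/dev/null 2>&1; then
        echo "nvidia-ctk: $(nvidia-ctk --version 2>/dev/null | head -n 1)"
    fi
}

collect_env
```

The output can be pasted into the Environment list as-is.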
Steps to Reproduce
- Install OpenShell v0.0.6 via curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh
- Confirm non-GPU gateway works: openshell sandbox create → succeeds, sandbox created and connectable
- Stop the existing gateway: openshell gateway stop
- Clone NemoClaw: git clone https://github.com/NVIDIA/NemoClaw.git
- Run cd NemoClaw && ./install.sh
- Installer reaches step [2/7] and runs openshell gateway start --name nemoclaw --gpu
Expected Behaviour
GPU-enabled gateway starts successfully.
Actual Behaviour
Gateway container starts but the K3s cluster inside it fails at kubelet startup with missing cgroup controllers:
Error: × K8s namespace not ready
╰─▶ gateway container is not running while waiting for namespace 'openshell': container exited
(status=EXITED, exit_code=1)
Key error from container logs:
E0316 21:10:05.545118 118 cgroup_manager_linux.go:406] "Unhandled Error" err="cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods/pids.max: no such file or directory"
E0316 21:10:05.545156 118 kubelet.go:1745] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: error validating root container [kubepods] : cgroup [\"kubepods\"] has some missing controllers: cpu, cpuset, hugetlb, memory, pids"
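To pull the list of missing controllers out of the container logs without reading them by eye, a small filter along these lines may help. It is only a sketch: the function name missing_from_log is hypothetical, and it assumes the kubelet message keeps the "has some missing controllers:" wording shown above.

```shell
#!/bin/sh
# Hypothetical helper: read kubelet log text on stdin and print each
# controller named after "has some missing controllers:", one per line.
missing_from_log() {
    sed -n 's/.*has some missing controllers: \([a-z, ]*\).*/\1/p' \
        | tr -d ' ' \
        | tr ',' '\n' \
        | sort -u
}
```

Usage, where the container name is a placeholder to be replaced with the actual gateway container: docker logs <gateway-container> 2>&1 | missing_from_log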
Additional Context
- The DGX Spark uses cgroup v2 with the systemd driver (stat -fc %T /sys/fs/cgroup/ returns cgroup2fs).
- The non-GPU gateway works fine on this system, so the issue appears specific to the --gpu gateway variant.
- The cgroup error mentions /sys/fs/cgroup/kubepods/pids.max. This is a unified-hierarchy (cgroup v2) path; under cgroup v1 the file would sit below a per-controller mount such as /sys/fs/cgroup/pids/. The kubelet inside the GPU gateway container therefore appears to be using cgroup v2, and the cpu, cpuset, hugetlb, memory and pids controllers may not be enabled for the kubepods cgroup.
- NVIDIA Container Toolkit 1.19.0 is installed and functional.
- This was tested on the day of the NemoClaw GTC 2026 announcement (16 March 2026).
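One way to narrow this down is to compare the controllers the kubelet asked for with the ones the cgroup v2 hierarchy actually exposes. The sketch below is an assumption-laden diagnostic, not part of OpenShell: the function name missing_controllers is hypothetical, and the required list is taken from the kubelet error above. On cgroup v2, cgroup.controllers lists the controllers available to a cgroup and cgroup.subtree_control lists those enabled for its children.

```shell
#!/bin/sh
# Controllers the kubelet reported as missing in the log above.
REQUIRED="cpu cpuset hugetlb memory pids"

# Hypothetical helper: print each required controller that is absent from
# the given cgroup v2 controller file.
# $1: path to a cgroup.controllers or cgroup.subtree_control file.
missing_controllers() {
    available=$(cat "$1" 2>/dev/null)
    for c in $REQUIRED; do
        case " $available " in
            *" $c "*) ;;                 # controller is present
            *) printf '%s\n' "$c" ;;     # controller is missing
        esac
    done
}

# Report on the root of the hierarchy, if readable.
for f in /sys/fs/cgroup/cgroup.controllers /sys/fs/cgroup/cgroup.subtree_control; do
    [ -r "$f" ] || continue
    echo "$f: missing: $(missing_controllers "$f" | tr '\n' ' ')"
done
```

Running it both on the host and inside the gateway container would show whether the controllers are absent on the host or only not passed down to the container.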