bug: GPU gateway fails on DGX Spark - missing cgroup controller #374

Description

@lupinrider

Agent Diagnostic

GPU gateway fails on DGX Spark — missing cgroup controllers (kubepods)

Summary

The GPU-enabled gateway fails to start on a DGX Spark (Founders Edition) when the NemoClaw installer runs openshell gateway start --name nemoclaw --gpu. The non-GPU gateway works on the same machine: openshell sandbox create succeeds and I can connect to the sandbox without issues.

Environment

  • Hardware: NVIDIA DGX Spark (Founders Edition), GB10 Grace Blackwell Superchip, 128GB unified memory
  • Architecture: aarch64
  • OS: DGX OS (Ubuntu-based)
  • Kernel: (run uname -r and paste here)
  • OpenShell version: 0.0.6
  • Docker:
    • Cgroup Driver: systemd
    • Cgroup Version: 2
  • NVIDIA Container Toolkit: 1.19.0
  • NVIDIA GPU: 1 GPU detected, 124610 MB VRAM
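
The environment details above (including the kernel version that is still a placeholder) could be gathered in one pass with something like the following. This is a minimal sketch, not part of OpenShell or NemoClaw: the probe and collect_env helper names are made up for illustration, and each probe prints "unavailable" if the underlying tool is missing or fails.

```shell
#!/bin/sh
# probe LABEL CMD [ARGS...]
# Runs CMD and prints "LABEL: <first line of its output>", or
# "LABEL: unavailable" if the command is missing, fails, or prints nothing.
probe() {
    label="$1"; shift
    if out="$("$@" 2>/dev/null)" && [ -n "$out" ]; then
        printf '%s: %s\n' "$label" "$(printf '%s\n' "$out" | head -n 1)"
    else
        printf '%s: unavailable\n' "$label"
    fi
}

# Hypothetical helper: collects the fields listed under "Environment".
collect_env() {
    probe "kernel" uname -r
    probe "architecture" uname -m
    probe "cgroup filesystem" stat -fc %T /sys/fs/cgroup/
    probe "docker cgroup driver" docker info --format '{{.CgroupDriver}}'
    probe "docker cgroup version" docker info --format '{{.CgroupVersion}}'
    probe "nvidia container toolkit" nvidia-ctk --version
    probe "gpu" nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
}

collect_env
```

The output is one "label: value" line per field, which can be pasted directly into the issue.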

Steps to Reproduce

  1. Install OpenShell v0.0.6 via curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh
  2. Confirm non-GPU gateway works: openshell sandbox create → succeeds, sandbox created and connectable
  3. Stop the existing gateway: openshell gateway stop
  4. Clone NemoClaw: git clone https://github.com/NVIDIA/NemoClaw.git
  5. Run cd NemoClaw && ./install.sh
  6. Installer reaches step [2/7] and runs openshell gateway start --name nemoclaw --gpu

Expected Behaviour

GPU-enabled gateway starts successfully.

Actual Behaviour

Gateway container starts but the K3s cluster inside it fails at kubelet startup with missing cgroup controllers:

Error:   × K8s namespace not ready
  ╰─▶ gateway container is not running while waiting for namespace 'openshell': container exited
      (status=EXITED, exit_code=1)

Key error from container logs:

E0316 21:10:05.545118     118 cgroup_manager_linux.go:406] "Unhandled Error" err="cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods/pids.max: no such file or directory"
E0316 21:10:05.545156     118 kubelet.go:1745] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: error validating root container [kubepods] : cgroup [\"kubepods\"] has some missing controllers: cpu, cpuset, hugetlb, memory, pids"
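
To pull the list of missing controllers out of a longer container log, a small filter along these lines may help. It is a sketch that assumes the kubelet message keeps the exact wording shown above ("has some missing controllers: ..."); the missing_controllers function name is hypothetical.

```shell
#!/bin/sh
# Reads kubelet log text on stdin and prints each controller that the
# kubelet reported as missing, one per line, without duplicates.
missing_controllers() {
    sed -n 's/.*has some missing controllers: \([a-z, ]*\).*/\1/p' \
        | tr -d ' ' \
        | tr ',' '\n' \
        | sort -u
}

# Example (the container name is a placeholder, not a known OpenShell name):
#   docker logs "$GATEWAY_CONTAINER" 2>&1 | missing_controllers
```

Applied to the second log line above, this prints cpu, cpuset, hugetlb, memory, and pids on separate lines.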

Additional Context

  • The DGX Spark uses cgroup v2 with the systemd driver (stat -fc %T /sys/fs/cgroup/ returns cgroup2fs).
  • The non-GPU gateway works fine on this system, so the issue appears specific to the --gpu gateway variant.
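  • The failing path, /sys/fs/cgroup/kubepods/pids.max, follows the cgroup v2 unified layout (a cgroup v1 path would include the controller name, as in /sys/fs/cgroup/pids/kubepods/pids.max), so the kubelet appears to be targeting cgroup v2. The missing pids.max file, together with the "missing controllers" error, suggests that the cpu, cpuset, hugetlb, memory, and pids controllers are not enabled for the kubepods cgroup inside the GPU gateway container, for example because the parent cgroup's cgroup.subtree_control does not delegate them.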
  • NVIDIA Container Toolkit 1.19.0 is installed and functional.
  • This was tested on the day of the NemoClaw GTC 2026 announcement (16 March 2026).
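
One way to narrow this down is to check, for each controller named in the kubelet error, whether it is available in a given cgroup v2 directory and whether it is delegated to child cgroups. The sketch below reads the standard cgroup v2 interface files cgroup.controllers and cgroup.subtree_control; the check_cgroup_controllers function name is hypothetical, and whether delegation is the actual root cause here is an assumption, not something the logs confirm.

```shell
#!/bin/sh
# check_cgroup_controllers [DIR]
# For each controller the kubelet reported as missing, prints whether it is
# listed in DIR/cgroup.controllers (available) and in
# DIR/cgroup.subtree_control (delegated to children).
# DIR defaults to /sys/fs/cgroup.
check_cgroup_controllers() {
    dir="${1:-/sys/fs/cgroup}"
    if [ ! -f "$dir/cgroup.controllers" ]; then
        echo "$dir: no cgroup.controllers file (not a cgroup v2 directory?)"
        return 1
    fi
    # Pad with spaces so that "cpu" does not match inside "cpuset".
    available=" $(cat "$dir/cgroup.controllers") "
    delegated=" $(cat "$dir/cgroup.subtree_control" 2>/dev/null) "
    for c in cpu cpuset hugetlb memory pids; do
        case "$available" in *" $c "*) a=yes ;; *) a=no ;; esac
        case "$delegated" in *" $c "*) d=yes ;; *) d=no ;; esac
        echo "$c available=$a delegated=$d"
    done
}
```

Running check_cgroup_controllers on the host, and the same check inside the gateway container (for example via docker exec, if the container stays up long enough), would show whether the controllers are missing at the container's cgroup root or only below it.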

Agent-First Checklist

  • I pointed my agent at the repo and had it investigate this issue
  • I loaded relevant skills (e.g., debug-openshell-cluster, debug-inference, openshell-cli)
  • My agent could not resolve this — the diagnostic above explains why
