bug: GPU gateway fails on DGX Spark - missing cgroup controller #374

### Agent Diagnostic

# GPU gateway fails on DGX Spark — missing cgroup controllers (kubepods)

## Summary

The GPU-enabled gateway fails to start on a DGX Spark (Founders Edition) when NemoClaw runs `openshell gateway start --name nemoclaw --gpu`. The non-GPU gateway works perfectly — `openshell sandbox create` succeeds and I can connect to the sandbox without issues.

## Environment

- **Hardware:** NVIDIA DGX Spark (Founders Edition), GB10 Grace Blackwell Superchip, 128GB unified memory
- **Architecture:** aarch64
- **OS:** DGX OS (Ubuntu-based)
- **Kernel:** *(run `uname -r` and paste here)*
- **OpenShell version:** 0.0.6
- **Docker:**
  - Cgroup Driver: systemd
  - Cgroup Version: 2
- **NVIDIA Container Toolkit:** 1.19.0
- **NVIDIA GPU:** 1 GPU detected, 124610 MB VRAM
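
The environment values above (including the kernel version, which is still a placeholder) can be gathered in one pass. The following is a minimal sketch, not part of OpenShell or NemoClaw: `probe` is a hypothetical helper name, and it assumes `docker`, `nvidia-ctk`, and `nvidia-smi` are on `PATH`, reporting any that are not instead of failing.

```shell
#!/usr/bin/env bash
# Collect the environment facts listed in this report.
# `probe` is a hypothetical helper: it prints "LABEL: first line of output",
# or a note when the command is not installed.
probe() {
  # Usage: probe LABEL COMMAND [ARGS...]
  local label="$1"; shift
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s: %s\n' "$label" "$("$@" 2>/dev/null | head -n 1)"
  else
    printf '%s: (%s not found)\n' "$label" "$1"
  fi
}

probe "Kernel" uname -r
probe "Architecture" uname -m
probe "Cgroup filesystem" stat -fc %T /sys/fs/cgroup/
probe "Docker cgroup driver / version" docker info --format '{{.CgroupDriver}} / {{.CgroupVersion}}'
probe "NVIDIA Container Toolkit" nvidia-ctk --version
probe "GPU" nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
```

Pasting the output into the list above fills in the kernel field and lets maintainers confirm the cgroup driver and version.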

## Steps to Reproduce

1. Install OpenShell v0.0.6 via `curl -LsSf https://raw.githubusercontent.com/NVIDIA/OpenShell/main/install.sh | sh`
2. Confirm non-GPU gateway works: `openshell sandbox create` → succeeds, sandbox created and connectable
3. Stop the existing gateway: `openshell gateway stop`
4. Clone NemoClaw: `git clone https://github.com/NVIDIA/NemoClaw.git`
5. Run `cd NemoClaw && ./install.sh`
6. Installer reaches step [2/7] and runs `openshell gateway start --name nemoclaw --gpu`

## Expected Behaviour

GPU-enabled gateway starts successfully.

## Actual Behaviour

Gateway container starts but the K3s cluster inside it fails at kubelet startup with missing cgroup controllers:

```
Error:   × K8s namespace not ready
  ╰─▶ gateway container is not running while waiting for namespace 'openshell': container exited
      (status=EXITED, exit_code=1)
```

Key error from container logs:

```
E0316 21:10:05.545118     118 cgroup_manager_linux.go:406] "Unhandled Error" err="cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubepods/pids.max: no such file or directory"
E0316 21:10:05.545156     118 kubelet.go:1745] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: error validating root container [kubepods] : cgroup [\"kubepods\"] has some missing controllers: cpu, cpuset, hugetlb, memory, pids"
```
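
To pull the same lines out of a longer log, a small filter may help. This is a sketch under assumptions: `cgroup_errors` is a hypothetical helper, and `GATEWAY_CONTAINER` is a placeholder for the gateway container name, which this report does not record (it can be looked up with `docker ps -a`).

```shell
# Keep only the kubelet cgroup failures from a log stream on stdin.
# Returns success even when nothing matches, so it is safe under `set -e`.
cgroup_errors() {
  grep -E 'cgroup_manager_linux\.go|Failed to start ContainerManager|missing controllers' || true
}

# Example (GATEWAY_CONTAINER is a placeholder, not a known name):
#   docker logs "$GATEWAY_CONTAINER" 2>&1 | cgroup_errors
```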

## Additional Context

- The DGX Spark uses cgroup v2 with the systemd driver (`stat -fc %T /sys/fs/cgroup/` returns `cgroup2fs`).
- The non-GPU gateway works fine on this system, so the issue appears specific to the `--gpu` gateway variant.
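- The cgroup error mentions `/sys/fs/cgroup/kubepods/pids.max`. That path follows the unified cgroup v2 layout (under v1 it would sit below a per-controller mount such as `/sys/fs/cgroup/pids/`), so the kubelet appears to be using cgroup v2. The failure is that the `kubepods` cgroup inside the GPU gateway container lacks the `cpu`, `cpuset`, `hugetlb`, `memory`, and `pids` controllers; one possible cause is that these controllers are not delegated to the container's cgroup.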
- NVIDIA Container Toolkit 1.19.0 is installed and functional.
- This was tested on the day of the NemoClaw GTC 2026 announcement (16 March 2026).
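
On cgroup v2, a child cgroup only receives the controllers listed in its parent's `cgroup.subtree_control`, so comparing that file against the controllers named in the kubelet error is a quick way to narrow this down. The sketch below is an illustration, not a confirmed diagnosis: `missing_controllers` is a hypothetical helper, and whether delegation is the actual cause here is an assumption that still needs checking on the host and inside the gateway container.

```shell
# Controllers the kubelet reported as missing for the kubepods cgroup.
REQUIRED_CONTROLLERS="cpu cpuset hugetlb memory pids"

# Print the required controllers that are absent from a cgroup v2 controller
# list file (for example cgroup.controllers or cgroup.subtree_control).
missing_controllers() {
  local file="$1" enabled ctrl missing=""
  [ -r "$file" ] || return 1
  # Pad with spaces so each controller can be matched as a whole word.
  enabled=" $(tr '\n' ' ' < "$file") "
  for ctrl in $REQUIRED_CONTROLLERS; do
    case "$enabled" in
      *" $ctrl "*) ;;                      # controller is present
      *) missing="$missing $ctrl" ;;       # controller is absent
    esac
  done
  echo "${missing# }"
}

# Check the root of the cgroup v2 hierarchy visible to this process.
for f in /sys/fs/cgroup/cgroup.controllers /sys/fs/cgroup/cgroup.subtree_control; do
  if [ -r "$f" ]; then
    echo "$f missing: $(missing_controllers "$f")"
  fi
done
```

An empty list after `missing:` means all five controllers are available at that level. Running the same check on the host and inside the gateway container would show at which level the controllers disappear.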

### Agent-First Checklist

- [x] I pointed my agent at the repo and had it investigate this issue
- [x] I loaded relevant skills (e.g., `debug-openshell-cluster`, `debug-inference`, `openshell-cli`)
- [x] My agent could not resolve this — the diagnostic above explains why
