Add no-disable-device-node-modification-hook nvcdi feature flag #1833

Closed
jkjk-ant wants to merge 1 commit into NVIDIA:main from jkjk-ant:feat/disable-device-node-modification-hook-flag

Conversation

jkjk-ant commented May 18, 2026

The disable-device-node-modification CDI hook bind-mounts a tmpfs file over /proc/driver/nvidia/params inside the container. With procMount: Unmasked (Kubernetes 1.34+), that overmount makes the kernel's mount_too_revealing() check reject any subsequent fresh procfs mount from a less-privileged namespace — for example a nested user namespace created by bubblewrap. Workloads that sandbox themselves inside a GPU container can no longer mount procfs:

bwrap: Can't mount proc on /newroot/proc: Operation not permitted
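
A rough way to see which mounts could trip this check is to list everything mounted below /proc in the container's mountinfo. The sketch below is a simplification and not the kernel's logic: mount_too_revealing() also takes mount flags and other properties of each child mount into account. The helper name submounts_under and the sample mountinfo text are made up for illustration.

```python
def submounts_under(mountinfo_text, prefix):
    """Return mount points strictly below `prefix` (hypothetical helper)."""
    prefix = prefix.rstrip("/")
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # The fifth whitespace-separated field of a mountinfo line
        # is the mount point.
        if len(fields) < 5:
            continue
        mount_point = fields[4]
        if mount_point.startswith(prefix + "/"):
            found.append(mount_point)
    return found


# Illustrative sample, not captured from a real container.
SAMPLE = """\
100 90 0:50 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
101 100 0:51 /params /proc/driver/nvidia/params ro,nosuid,nodev,noexec,relatime - tmpfs tmpfs rw
102 90 0:52 / /dev rw,nosuid - tmpfs tmpfs rw
"""

print(submounts_under(SAMPLE, "/proc"))  # ['/proc/driver/nvidia/params']
```

Inside a container, the same function could be fed the contents of /proc/self/mountinfo. A non-empty result for /proc is only a hint that a fresh procfs mount from a nested user namespace may be rejected.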

The hook can be skipped in static cdi mode via nvidia-ctk cdi generate --disable-hooks, but jit-cdi mode (the default since 1.18.0) has no way to suppress an individual hook.

Setting NVreg_ModifyDeviceFiles=0 on the host (which the hook short-circuits on) is not viable on systems with NVSwitch: fabricmanager fails to initialize with NV_ERR_INVALID_STATE when the parameter is set, even with device nodes pre-created via udev or mknod. (Tested on H100 SXM with driver 570.195.03: the /dev/nvidia* and /dev/nvidia-nvswitch* device files were all present and correct, but nv-fabricmanager still returned "request to acquire required privileges to access NVSwitch devices failed".)

This adds an nvcdi feature flag, following the no-additional-gids-for-device-nodes naming pattern, that suppresses the hook:

[nvidia-container-runtime.modes.jit-cdi]
nvcdi-feature-flags = ["no-disable-device-node-modification-hook"]

The hook's purpose is to prevent in-container nvidia-smi/libnvidia-ml from creating extra /dev/nvidiaN device nodes (#927). That prevention is already enforced by cgroup device controls in container runtimes, so disabling the hook does not affect device isolation in those environments. Because the hook does provide defense-in-depth where cgroup enforcement is absent or misconfigured, the flag is opt-in and off by default.

Verification

On a Kubernetes 1.34 node with the patched runtime and the flag enabled, a pod with procMount: Unmasked shows the following:

  • /proc/self/mountinfo shows no submount under /proc/driver/nvidia (the hook never ran)
  • /proc/driver/nvidia/params reads the host's real value (ModifyDeviceFiles: 1)
  • bwrap --unshare-all --bind / / --proc /proc --dev /dev /bin/true succeeds
  • nvidia-smi -L returns the allocated GPU
  • fabricmanager and persistenced are unaffected

Without the flag, the same pod sees the hook's overmount and bwrap fails with Operation not permitted.
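
The params check from the list above could be scripted roughly as follows. This is a minimal sketch that assumes the file consists of "Key: value" lines, as the ModifyDeviceFiles: 1 reading suggests. The helper name parse_nvidia_params and the sample text are illustrative, not output captured from a real node.

```python
def parse_nvidia_params(text):
    """Parse 'Key: value' lines into a dict of strings (hypothetical helper)."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            params[key.strip()] = value.strip()
    return params


# Illustrative sample of the file's contents.
SAMPLE_PARAMS = "ModifyDeviceFiles: 1\nDeviceFileMode: 438\n"

params = parse_nvidia_params(SAMPLE_PARAMS)
print(params["ModifyDeviceFiles"])  # 1
```

In a pod, the input would be the contents of /proc/driver/nvidia/params. With the flag enabled, the verification above expects the host's value, 1.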

(Open to a different name if no-disable- reads too awkwardly — the inner hook name's disable- prefix makes most options collide.)

copy-pr-bot Bot commented May 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jkjk-ant force-pushed the feat/disable-device-node-modification-hook-flag branch from 36a659a to 1e7875f May 18, 2026 12:03

The disable-device-node-modification CDI hook bind-mounts a tmpfs file over
/proc/driver/nvidia/params inside the container. With procMount: Unmasked
(Kubernetes 1.34+), that overmount makes the kernel's mount_too_revealing()
check reject any subsequent fresh procfs mount from a less-privileged
namespace -- for example a nested user namespace created by bubblewrap.
Workloads that sandbox themselves inside a GPU container can no longer
mount procfs.

The hook can be skipped in static cdi mode via nvidia-ctk cdi generate
--disable-hooks, but jit-cdi mode (the default since 1.18.0) has no way to
suppress an individual hook. Setting NVreg_ModifyDeviceFiles=0 on the host
(which the hook short-circuits on) is not viable on systems with NVSwitch:
fabricmanager fails to initialize with NV_ERR_INVALID_STATE when the
parameter is set, even with device nodes pre-created via udev or mknod.

This adds an nvcdi feature flag, following the no-additional-gids-for-device-nodes
naming pattern, that suppresses the hook in jit-cdi mode:

  [nvidia-container-runtime.modes.jit-cdi]
  nvcdi-feature-flags = ["no-disable-device-node-modification-hook"]

The hook's purpose is to prevent in-container nvidia-smi/libnvidia-ml from
creating extra /dev/nvidiaN device nodes. That prevention is already
enforced by cgroup device controls in container runtimes, so disabling the
hook does not affect device isolation. The flag is opt-in and off by
default.

Signed-off-by: Jack Kleeman <jkjk@anthropic.com>

jkjk-ant force-pushed the feat/disable-device-node-modification-hook-flag branch from 1e7875f to 1d543ab May 18, 2026 12:10
jkjk-ant changed the title from "Add disable-device-node-modification-hook nvcdi feature flag" to "Add no-disable-device-node-modification-hook nvcdi feature flag" May 18, 2026
jkjk-ant marked this pull request as ready for review May 18, 2026 12:32
jkjk-ant closed this May 18, 2026