Add no-disable-device-node-modification-hook nvcdi feature flag#1833
Closed
jkjk-ant wants to merge 1 commit into
The disable-device-node-modification CDI hook bind-mounts a tmpfs file over /proc/driver/nvidia/params inside the container. With procMount: Unmasked (Kubernetes 1.34+), that overmount makes the kernel's mount_too_revealing() check reject any subsequent fresh procfs mount from a less-privileged namespace -- for example a nested user namespace created by bubblewrap. Workloads that sandbox themselves inside a GPU container can no longer mount procfs.

The hook can be skipped in static cdi mode via nvidia-ctk cdi generate --disable-hooks, but jit-cdi mode (the default since 1.18.0) has no way to suppress an individual hook.

Setting NVreg_ModifyDeviceFiles=0 on the host (which the hook short-circuits on) is not viable on systems with NVSwitch: fabricmanager fails to initialize with NV_ERR_INVALID_STATE when the parameter is set, even with device nodes pre-created via udev or mknod.

This adds a nvcdi feature flag, following the no-additional-gids-for-device-nodes naming pattern, that suppresses the hook in jit-cdi mode:

    [nvidia-container-runtime.modes.jit-cdi]
    nvcdi-feature-flags = ["no-disable-device-node-modification-hook"]

The hook's purpose is to prevent in-container nvidia-smi/libnvidia-ml from creating extra /dev/nvidiaN device nodes. That prevention is already enforced by cgroup device controls in container runtimes, so disabling the hook does not affect device isolation. The flag is opt-in and off by default.

Signed-off-by: Jack Kleeman <jkjk@anthropic.com>
The `disable-device-node-modification` CDI hook bind-mounts a tmpfs file over `/proc/driver/nvidia/params` inside the container. With `procMount: Unmasked` (Kubernetes 1.34+), that overmount makes the kernel's `mount_too_revealing()` check reject any subsequent fresh procfs mount from a less-privileged namespace, for example a nested user namespace created by bubblewrap. Workloads that sandbox themselves inside a GPU container can no longer mount procfs.

The hook can be skipped in static `cdi` mode via `nvidia-ctk cdi generate --disable-hooks`, but `jit-cdi` mode (the default since 1.18.0) has no way to suppress an individual hook.

Setting `NVreg_ModifyDeviceFiles=0` on the host (which the hook short-circuits on) is not viable on systems with NVSwitch: fabricmanager fails to initialize with `NV_ERR_INVALID_STATE` when the parameter is set, even with device nodes pre-created via udev or `mknod`. (Tested on H100 SXM with driver 570.195.03: `/dev/nvidia*` and `/dev/nvidia-nvswitch*` device files were all present and correct, but `nv-fabricmanager` still returned `request to acquire required privileges to access NVSwitch devices failed`.)

This adds a `nvcdi` feature flag, following the `no-additional-gids-for-device-nodes` naming pattern, that suppresses the hook in `jit-cdi` mode. It is enabled by setting `nvcdi-feature-flags = ["no-disable-device-node-modification-hook"]` under `[nvidia-container-runtime.modes.jit-cdi]`.

The hook's purpose is to prevent in-container `nvidia-smi`/`libnvidia-ml` from creating extra `/dev/nvidiaN` device nodes (#927). That prevention is already enforced by cgroup device controls in container runtimes, so disabling the hook does not affect device isolation in those environments. Because the hook does provide defense-in-depth where cgroup enforcement is absent or misconfigured, the flag is opt-in and off by default.

Verification

On a Kubernetes 1.34 node with the patched runtime and the flag enabled, a pod with `procMount: Unmasked`:

- `/proc/self/mountinfo` shows no submount under `/proc/driver/nvidia` (the hook never ran)
- `/proc/driver/nvidia/params` reads the host's real value (`ModifyDeviceFiles: 1`)
- `bwrap --unshare-all --bind / / --proc /proc --dev /dev /bin/true` succeeds
- `nvidia-smi -L` returns the allocated GPU

Without the flag, the same pod sees the hook's overmount and `bwrap` fails with `Operation not permitted`.

(Open to a different name if `no-disable-` reads too awkwardly; the inner hook name's `disable-` prefix makes most options collide.)
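
For anyone repeating the first two verification checks by script rather than by eye, here is a minimal Python sketch. It is not part of this PR: the helper names `mount_points_under` and `parse_params` are hypothetical. It assumes the standard `mountinfo` layout, where the fifth whitespace-separated field is the mount point, and it assumes `/proc/driver/nvidia/params` consists of `Key: value` lines like the `ModifyDeviceFiles: 1` line quoted above.

```python
from typing import Dict, List


def mount_points_under(mountinfo_text: str, prefix: str) -> List[str]:
    """Return mount points at or below `prefix` from mountinfo-formatted text."""
    prefix = prefix.rstrip("/")
    found = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        # The fifth field of a mountinfo line is the mount point.
        # Lines too short to contain it are skipped.
        if len(fields) < 5:
            continue
        mount_point = fields[4]
        if mount_point == prefix or mount_point.startswith(prefix + "/"):
            found.append(mount_point)
    return found


def parse_params(params_text: str) -> Dict[str, str]:
    """Parse 'Key: value' lines (assumed format of the params file)."""
    params = {}
    for line in params_text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip()
    return params
```

Inside the pod, passing the contents of `/proc/self/mountinfo` to `mount_points_under(text, "/proc/driver/nvidia")` should give an empty list when the hook did not run, and a list containing `/proc/driver/nvidia/params` when the overmount is present. Mount points containing spaces are escaped in `mountinfo`, which this sketch does not decode.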