1. Quick Debug Information
- OS/Version: Rocky 8.10
- Kernel Version: 4.18.0-553.111.1.el8_10.x86_64
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): cri-docker with nvidia-container-runtime
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k8s 1.34 (also fails on 1.32)
2. Issue or feature description
With MIG enabled on A100 and MIG_STRATEGY=single:
E0407 94 main.go:188] error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: error building device map: error building device map from config.resources: error building MIG device map: error visiting devices: error visiting device: error visiting device: error visiting MIG device: error visiting MIG device: error getting MIG profile for MIG device at index '(0, 0)': error getting parent memory info: Insufficient Permissions
Commit e3323ce seems relevant: nvidia-smi inside the container also fails to get memory info (it reports N/A), although it works on the host. Without MIG, this failure is only logged as a warning, but with MIG it is an error.
This used to work with the exact same config and hardware on some earlier combination of k8s-device-plugin/cuda/docker/kernel versions, but I haven't been able to find a working combination since.
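To make the host versus container difference easier to compare, here is a minimal sketch of a check that can be run in both places. It assumes `nvidia-smi` is on the PATH and that a readable memory value is printed in the form `<number> MiB`; anything else (for example `[N/A]`) is treated as unreadable. The function names are hypothetical and are not part of k8s-device-plugin.

```python
import re
import subprocess

# Assumption: a readable memory value looks like "40960 MiB".
_MEMORY_VALUE = re.compile(r"^\d+\s*MiB$")


def parse_memory_rows(csv_output):
    """Parse 'index, memory.total' CSV rows into (index, memory, readable) tuples."""
    rows = []
    for line in csv_output.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue  # skip anything that is not an 'index, memory' row
        index, memory = parts
        rows.append((index, memory, bool(_MEMORY_VALUE.match(memory))))
    return rows


def query_memory():
    """Run nvidia-smi and return the parsed rows (hypothetical helper)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_memory_rows(result.stdout)


if __name__ == "__main__":
    for index, memory, readable in query_memory():
        status = "ok" if readable else "NOT READABLE"
        print(f"GPU {index}: memory.total={memory} ({status})")
```

Running this on the host and again inside the device plugin container should show whether the parent GPU's memory info is readable in each environment.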