[Bug]: MIG: error getting parent memory info: Insufficient Permissions #1683

@dylex

1. Quick Debug Information

  • OS/Version: Rocky 8.10
  • Kernel Version: 4.18.0-553.111.1.el8_10.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): cri-docker with nvidia-container-runtime
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k8s 1.34 (also fails on 1.32)

2. Issue or feature description

With MIG enabled on an A100 and MIG_STRATEGY=single, the device plugin fails to start:

E0407 94 main.go:188] error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: error building device map: error building device map from config.resources: error building MIG device map: error visiting devices: error visiting device: error visiting device: error visiting MIG device: error visiting MIG device: error getting MIG profile for MIG device at index '(0, 0)': error getting parent memory info: Insufficient Permissions

Commit e3323ce seems relevant: nvidia-smi inside the container also fails to get memory info (it reports N/A), though it works on the host. Without MIG, this failure is only logged as a warning, but with MIG it becomes an error.
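For anyone trying to reproduce the host vs. container difference, here is a rough sketch of a check that can be run in both places. It only wraps `nvidia-smi --query-gpu` and flags any GPU whose `memory.total` does not come back as a number. The helper names (`parse_memory_rows`, `query_nvidia_smi`) are made up for illustration, and treating every non-numeric value (such as `[N/A]`) as "unavailable" is an assumption on my part, not something taken from the device plugin.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; memory.total is the value that shows up
# as N/A inside the container in this report.
QUERY = "index,name,memory.total"


def parse_memory_rows(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output.

    Returns a list of (index, name, memory_mib) tuples. memory_mib is None
    whenever the field is not a plain number (assumption: any such value,
    e.g. "[N/A]", means the memory info could not be read).
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        index, name, memory = (f.strip() for f in fields)
        rows.append((index, name, int(memory) if memory.isdigit() else None))
    return rows


def query_nvidia_smi():
    """Run nvidia-smi and return the parsed rows (needs nvidia-smi on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_memory_rows(out)
```

Calling `query_nvidia_smi()` once on the host and once inside the plugin container, then comparing which rows have `None` for memory, should show whether the container is the only place where the parent GPU's memory info is unreadable.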

This used to work with the exact same config and hardware under some earlier combination of k8s-device-plugin/cuda/docker/kernel, but I haven't been able to find a working combination since.

Labels: bug, needs-triage
