When KUBERNETES_VIRTUAL_GPUS is enabled, metrics are not reported for all GPUs. #611

@zhyocean

What is the version?

4.4.2-4.7.0

What happened?

When KUBERNETES_VIRTUAL_GPUS is enabled on a node with eight GPUs, three of which are in use and five of which are idle, the exporter reports metrics only for the GPUs in use. The idle GPUs are missing from the output entirely instead of being reported with a value of 0.

What did you expect to happen?

I expect the unused GPUs to be reported as well, with a value of 0.

What is the GPU model?

H200, 4090, 5090

What is the environment?

Kubernetes cluster 1.27.5
The GPU node is Ubuntu 22.04

How did you deploy the dcgm-exporter and what is the configuration?

Through the gpu-operator Helm chart. Everything uses default values, except that the KUBERNETES_VIRTUAL_GPUS environment variable was added.

How to reproduce the issue?

On the GPU node, run pods so that some GPUs are in use while the remaining GPUs stay idle. The /metrics endpoint then reports metrics only for the GPUs in use:

# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="3",UUID="GPU-dfgfdg5-8e72-af03-1f8c-2e64646392bf",pci_bus_id="00000000:5A:00.0",device="nvidia3",modelName="NVIDIA GeForce RTX 5090",Hostname="test-app-64-46-msxf",DCGM_FI_DRIVER_VERSION="570.172.08",container="qwen3-8b",namespace="acs-env-test1-150",pod="qwen3-8b-5ffbf94f64-5w4vn",pod_uid=""} 27213
DCGM_FI_DEV_FB_USED{gpu="4",UUID="GPU-7050631b-6524-21e3-41cf-6e08fc89220d",pci_bus_id="00000000:98:00.0",device="nvidia4",modelName="NVIDIA GeForce RTX 5090",Hostname="test-app-64-46-msxf",DCGM_FI_DRIVER_VERSION="570.172.08",container="test-1705",namespace="acs-env-test2-150",pod="test-1705-867cf98c48-mv2v7",pod_uid=""} 30645
DCGM_FI_DEV_FB_USED{gpu="5",UUID="GPU-f7219cd7-e3ef-3b1b-4fa2-e84e8d5d4ce5",pci_bus_id="00000000:B8:00.0",device="nvidia5",modelName="NVIDIA GeForce RTX 5090",Hostname="test-app-64-46-msxf",DCGM_FI_DRIVER_VERSION="570.172.08",container="test-1705",namespace="acs-env-test2-150",pod="test-1705-867cf98c48-cz2ps",pod_uid=""} 30665
# HELP DCGM_FI_DEV_VGPU_LICENSE_STATUS vGPU License status
# TYPE DCGM_FI_DEV_VGPU_LICENSE_STATUS gauge
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="3",UUID="GPU-8cae1965-8e72-af03-1f8c-2e64646392bf",pci_bus_id="00000000:5A:00.0",device="nvidia3",modelName="NVIDIA GeForce RTX 5090",Hostname="test-app-64-46-msxf",DCGM_FI_DRIVER_VERSION="570.172.08",container="qwen3-8b",namespace="acs-env-test1-150",pod="qwen3-8b-5ffbf94f64-5w4vn",pod_uid=""} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="4",UUID="GPU-7050631b-6524-21e3-41cf-6e08fc89220d",pci_bus_id="00000000:98:00.0",device="nvidia4",modelName="NVIDIA GeForce RTX 5090",Hostname="test-app-64-46-msxf",DCGM_FI_DRIVER_VERSION="570.172.08",container="test-1705",namespace="acs-env-test2-150",pod="test-1705-867cf98c48-mv2v7",pod_uid=""} 0
DCGM_FI_DEV_VGPU_LICENSE_STATUS{gpu="5",UUID="GPU-f7219cd7-e3ef-3b1b-4fa2-e84e8d5d4ce5",pci_bus_id="00000000:B8:00.0",device="nvidia5",modelName="NVIDIA GeForce RTX 5090",Hostname="test-app-64-46-msxf",DCGM_FI_DRIVER_VERSION="570.172.08",container="test-1705",namespace="acs-env-test2-150",pod="test-1705-867cf98c48-cz2ps",pod_uid=""} 0
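
To confirm which GPU indices are absent, the /metrics text can be checked with a short script. This is only a minimal sketch: it assumes the node exposes GPUs indexed 0 through 7, the helper name `missing_gpus` is hypothetical, and the sample text is an abbreviated copy of the output above with most labels dropped.

```python
import re

# Matches the gpu="N" label on a Prometheus text-format sample line.
GPU_LABEL = re.compile(r'\bgpu="(\d+)"')

# Abbreviated sample (assumption: labels trimmed for readability).
SAMPLE = """\
# HELP DCGM_FI_DEV_FB_USED Framebuffer memory used (in MiB).
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="3",device="nvidia3"} 27213
DCGM_FI_DEV_FB_USED{gpu="4",device="nvidia4"} 30645
DCGM_FI_DEV_FB_USED{gpu="5",device="nvidia5"} 30665
"""


def missing_gpus(metrics_text, metric, gpu_count):
    """Return the sorted GPU indices that have no sample for `metric`."""
    seen = set()
    for line in metrics_text.splitlines():
        # Skip HELP/TYPE comments and samples that belong to other metrics.
        if not line.startswith(metric + "{"):
            continue
        match = GPU_LABEL.search(line)
        if match:
            seen.add(int(match.group(1)))
    return sorted(set(range(gpu_count)) - seen)


if __name__ == "__main__":
    # Prints [0, 1, 2, 6, 7] for the sample above.
    print(missing_gpus(SAMPLE, "DCGM_FI_DEV_FB_USED", 8))
```

Running the same function on the full text saved from the exporter's /metrics endpoint should list the five idle GPUs, which matches the behavior described in this report.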

Anything else we need to know?

No response

Labels: bug (Something isn't working)
