What is the version?
4.4.1-4.5.2
What happened?
DCGM Exporter currently does not provide accurate per-pod GPU utilization metrics when multiple pods share a single GPU via time-slicing (see https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html).
When the --kubernetes-virtual-gpus=true flag is enabled and 3 pods share a GPU with time-slicing, all pods receive identical device-level utilization values:
Current Behavior:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-1", gpu="0",...} 0.714
DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-2", gpu="0",...} 0.714 # Same value
DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-3", gpu="0",...} 0.714 # Same value
This occurs because DCGM reports device-level metrics, and the current implementation copies the same value to every pod sharing the GPU. As a result, the GPU consumption of individual workloads cannot be monitored.
What did you expect to happen?
When --kubernetes-virtual-gpus=true is enabled and multiple pods share a GPU, each pod should report its own actual GPU utilization, not the same device-level value copied across all pods.
Expected metrics:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-1", namespace="default", container="main", gpu="0",...} 0.24
DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-2", namespace="default", container="main", gpu="0",...} 0.31
DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-3", namespace="default", container="main", gpu="0",...} 0.18
Each value should reflect the individual pod's GPU consumption.
What is the GPU model?
Tesla T4
What is the environment?
Kubernetes environment
How did you deploy the dcgm-exporter and what is the configuration?
No response
How to reproduce the issue?
No response
Anything else we need to know?
I'm willing to submit a pull request to fix this bug if the proposed approach is acceptable.
Proposed Solution:
Enhance DCGM Exporter to track actual per-pod GPU utilization when --kubernetes-virtual-gpus=true is enabled:
- Collect per-process metrics using NVML's nvmlDeviceGetProcessUtilization() API to obtain GPU utilization for each process running on the GPU
- Map processes to pods
- Export pod-specific metrics with the same DCGM_FI_PROF_GR_ENGINE_ACTIVE metric name but with actual per-pod utilization values
Expected output when 3 pods share a GPU:
DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-1", namespace="default", container="main", gpu="0",...} 0.24
DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-2", namespace="default", container="main", gpu="0",...} 0.31
DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-3", namespace="default", container="main", gpu="0",...} 0.18
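To make the three steps above more concrete, here is a minimal sketch of the mapping and aggregation logic. It is an illustration under assumptions, not the exporter's actual code: the helper names `pod_uid_from_cgroup` and `per_pod_utilization` are hypothetical, the per-process samples are assumed to be (host PID, SM utilization in percent) pairs such as those reported by nvmlDeviceGetProcessUtilization(), and the pod UID is assumed to be recoverable from the contents of /proc/<pid>/cgroup. NVML's SM utilization is also not necessarily the same quantity as DCGM's graphics-engine-active ratio, so the numbers may differ from what DCGM would report.

```python
import re
from collections import defaultdict

# Matches the pod UID in both cgroupfs paths ("pod<uid>") and systemd
# paths ("pod<uid with underscores>.slice"). The layouts are assumptions
# about the node's cgroup driver.
_POD_UID_RE = re.compile(
    r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})"
)


def pod_uid_from_cgroup(cgroup_text):
    """Return the pod UID found in /proc/<pid>/cgroup contents, or None."""
    match = _POD_UID_RE.search(cgroup_text)
    if match is None:
        return None
    # systemd encodes the dashes of the UID as underscores.
    return match.group(1).replace("_", "-")


def per_pod_utilization(samples, pid_to_pod_uid):
    """Sum per-process SM utilization (percent) into per-pod ratios (0..1).

    samples: iterable of (pid, sm_util_percent) pairs.
    pid_to_pod_uid: dict mapping host PIDs to pod UIDs.
    """
    totals = defaultdict(float)
    for pid, sm_util_percent in samples:
        uid = pid_to_pod_uid.get(pid)
        if uid is None:
            continue  # process does not belong to any known pod
        totals[uid] += sm_util_percent / 100.0
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical samples: two processes belong to the same pod.
    samples = [(100, 24), (101, 31), (102, 10), (103, 8)]
    owners = {100: "uid-1", 101: "uid-2", 102: "uid-3", 103: "uid-3"}
    for uid, ratio in sorted(per_pod_utilization(samples, owners).items()):
        print(uid, round(ratio, 2))
```

Summing per process means a pod with several GPU processes reports their combined utilization, and processes that cannot be mapped to a pod are dropped rather than attributed to an arbitrary pod. Resolving host PIDs this way would likely require the exporter to see the host PID namespace.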
If changing the existing DCGM_FI_PROF_GR_ENGINE_ACTIVE metric is not desired, we can introduce a new metric name specifically for per-process data (e.g.,
DCGM_FI_PROC_GR_ENGINE_ACTIVE{pod="gpu-workload-3", namespace="default", container="main", gpu="0",...} 0.18) to clearly distinguish between device-level and process-level metrics.
This issue also affects other metrics when --kubernetes-virtual-gpus=true is enabled, such as DCGM_FI_DEV_FB_USED (per-pod memory usage). The PR implementation would address these additional metrics as well.