--kubernetes-virtual-gpus exports identical values for all pods instead of per-pod utilization #587

@krystiancastai

What is the version?

4.4.1-4.5.2

What happened?

DCGM Exporter currently does not provide accurate per-pod GPU utilization metrics when multiple pods share a single GPU via time-slicing (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html).

When the --kubernetes-virtual-gpus=true flag is enabled and 3 pods share a GPU with time-slicing, all pods receive identical device-level utilization values:

Current Behavior:

  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-1", gpu="0",...} 0.714
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-2", gpu="0",...} 0.714  # Same value
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-3", gpu="0",...} 0.714  # Same value

This occurs because DCGM provides device-level metrics, which the current implementation duplicates across all pods sharing the device. As a result, it is impossible to monitor the GPU consumption of an individual workload.

What did you expect to happen?

When --kubernetes-virtual-gpus=true is enabled and multiple pods share a GPU, each pod should report its own actual GPU utilization, not the same device-level value copied across all pods.

Expected metrics:

  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-1", namespace="default", container="main", gpu="0",...} 0.24
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-2", namespace="default", container="main", gpu="0",...} 0.31
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-3", namespace="default", container="main", gpu="0",...} 0.18

Each value should reflect the individual pod's GPU consumption.

What is the GPU model?

Tesla T4

What is the environment?

Kubernetes environment

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

I'm willing to submit a pull request to fix this bug if the proposed approach is acceptable.

Proposed Solution:
Enhance DCGM Exporter to track actual per-pod GPU utilization when --kubernetes-virtual-gpus=true is enabled:

  1. Collect per-process metrics using NVML's nvmlDeviceGetProcessUtilization() API to obtain GPU utilization for each process running on the GPU
  2. Map processes to pods
  3. Export pod-specific metrics with the same DCGM_FI_PROF_GR_ENGINE_ACTIVE metric name but with actual per-pod utilization values
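
Steps 1 and 2 could be sketched roughly as follows in Go. This is an illustration under stated assumptions, not the exporter's code: `ProcessSample`, `PodUIDFromCgroup`, and `UtilizationByPod` are hypothetical names, the samples stand in for what `nvmlDeviceGetProcessUtilization()` would return (a PID plus SM utilization in percent), and the PID-to-pod mapping assumes the exporter can see host PIDs and read `/proc/<pid>/cgroup`. NVML's per-process SM utilization may not be the same quantity as DCGM's graphics-engine activity, so the resulting values would only approximate the existing metric.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// ProcessSample holds the two fields of interest from an NVML
// per-process utilization sample (hypothetical type for this sketch).
type ProcessSample struct {
	PID    uint32
	SMUtil uint32 // SM utilization in percent, 0..100
}

// podUIDPattern matches the pod UID in cgroupfs paths (".../pod<uid>/...")
// and in systemd paths, where the dashes in the UID become underscores.
var podUIDPattern = regexp.MustCompile(
	`pod([0-9a-f]{8}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{4}[-_][0-9a-f]{12})`)

// PodUIDFromCgroup extracts a pod UID from the contents of /proc/<pid>/cgroup.
// It returns false when the process does not belong to a pod cgroup.
func PodUIDFromCgroup(cgroup string) (string, bool) {
	m := podUIDPattern.FindStringSubmatch(cgroup)
	if m == nil {
		return "", false
	}
	return strings.ReplaceAll(m[1], "_", "-"), true
}

// UtilizationByPod sums per-process SM utilization for each pod and
// returns it as a fraction (percent / 100). Processes without a known
// pod are skipped.
func UtilizationByPod(samples []ProcessSample, podOf map[uint32]string) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		pod, ok := podOf[s.PID]
		if !ok {
			continue
		}
		out[pod] += float64(s.SMUtil) / 100.0
	}
	return out
}

func main() {
	// Made-up sample data for illustration only.
	samples := []ProcessSample{{PID: 101, SMUtil: 24}, {PID: 102, SMUtil: 31}, {PID: 103, SMUtil: 18}}
	podOf := map[uint32]string{101: "gpu-workload-1", 102: "gpu-workload-2", 103: "gpu-workload-3"}
	byPod := UtilizationByPod(samples, podOf)
	for _, pod := range []string{"gpu-workload-1", "gpu-workload-2", "gpu-workload-3"} {
		fmt.Printf("%s %.2f\n", pod, byPod[pod])
	}
}
```

Summing across processes means a pod running several GPU processes reports their combined utilization. Resolving a pod UID to a pod name and namespace would still require a lookup against the kubelet or the API server, which this sketch leaves out.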

Expected output when 3 pods share a GPU:

  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-1", namespace="default", container="main", gpu="0",...} 0.24
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-2", namespace="default", container="main", gpu="0",...} 0.31
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-3", namespace="default", container="main", gpu="0",...} 0.18

If changing the existing DCGM_FI_PROF_GR_ENGINE_ACTIVE metric is not desired, we can introduce a new metric name specifically for per-process data (e.g.,
DCGM_FI_PROC_GR_ENGINE_ACTIVE{pod="gpu-workload-3", namespace="default", container="main", gpu="0",...} 0.18) to clearly distinguish between device-level and process-level metrics.
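
To make the naming choice concrete, here is a minimal sketch of rendering one sample in the Prometheus text format with the metric name passed in as a parameter, so the same code path could emit either name. `Label` and `FormatSample` are hypothetical helpers written for this issue, not part of DCGM Exporter.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Label is one name/value pair attached to a sample.
type Label struct{ Name, Value string }

// labelEscaper escapes the characters that the Prometheus text format
// requires to be escaped inside label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// FormatSample renders a single sample line: name{labels} value.
// Labels are emitted in the order given.
func FormatSample(metric string, labels []Label, value float64) string {
	parts := make([]string, 0, len(labels))
	for _, l := range labels {
		parts = append(parts, l.Name+`="`+labelEscaper.Replace(l.Value)+`"`)
	}
	return metric + "{" + strings.Join(parts, ",") + "} " +
		strconv.FormatFloat(value, 'g', -1, 64)
}

func main() {
	labels := []Label{
		{"pod", "gpu-workload-3"},
		{"namespace", "default"},
		{"container", "main"},
		{"gpu", "0"},
	}
	// Same labels and value under the existing name and the proposed one.
	fmt.Println(FormatSample("DCGM_FI_PROF_GR_ENGINE_ACTIVE", labels, 0.18))
	fmt.Println(FormatSample("DCGM_FI_PROC_GR_ENGINE_ACTIVE", labels, 0.18))
}
```

A separate name would let existing dashboards keep reading the device-level series unchanged while new queries opt in to the per-process one.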

This issue also affects other metrics when --kubernetes-virtual-gpus=true is enabled, such as DCGM_FI_DEV_FB_USED (per-pod memory usage). The PR implementation would address these additional metrics as well.
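
For the memory case, a comparable sketch would sum per-process framebuffer usage by pod. It assumes the per-process figures come from an NVML call such as `nvmlDeviceGetComputeRunningProcesses()`, which reports used GPU memory in bytes, and that the result should be in MiB to match DCGM_FI_DEV_FB_USED. `ProcessMemory` and `FBUsedMiBByPod` are hypothetical names, and the PID-to-pod map is taken as given.

```go
package main

import "fmt"

// ProcessMemory is the per-process GPU memory figure for one PID
// (hypothetical type for this sketch).
type ProcessMemory struct {
	PID       uint32
	UsedBytes uint64
}

const bytesPerMiB = 1 << 20

// FBUsedMiBByPod totals used framebuffer memory per pod, in MiB.
// Bytes are summed first and converted once, so rounding down happens
// per pod rather than per process. Unmapped PIDs are skipped.
func FBUsedMiBByPod(procs []ProcessMemory, podOf map[uint32]string) map[string]uint64 {
	bytesByPod := make(map[string]uint64)
	for _, p := range procs {
		pod, ok := podOf[p.PID]
		if !ok {
			continue
		}
		bytesByPod[pod] += p.UsedBytes
	}
	out := make(map[string]uint64, len(bytesByPod))
	for pod, b := range bytesByPod {
		out[pod] = b / bytesPerMiB
	}
	return out
}

func main() {
	// Made-up sample data for illustration only.
	procs := []ProcessMemory{
		{PID: 101, UsedBytes: 512 * bytesPerMiB},
		{PID: 102, UsedBytes: 1024 * bytesPerMiB},
	}
	podOf := map[uint32]string{101: "gpu-workload-1", 102: "gpu-workload-2"}
	byPod := FBUsedMiBByPod(procs, podOf)
	fmt.Println(byPod["gpu-workload-1"], byPod["gpu-workload-2"])
}
```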
