--kubernetes-virtual-gpus exports identical values for all pods instead of per-pod utilization #587

### What is the version?

4.4.1-4.5.2

### What happened?

DCGM Exporter currently does not provide accurate per-pod GPU utilization metrics when multiple pods share a single GPU via time-slicing (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html).

When the `--kubernetes-virtual-gpus=true` flag is enabled and 3 pods share a GPU with time-slicing, all pods receive identical device-level utilization values:

Current behavior:
```
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-1", gpu="0",...} 0.714
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-2", gpu="0",...} 0.714  # Same value
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-3", gpu="0",...} 0.714  # Same value
```

This occurs because DCGM provides device-level metrics, which the current implementation duplicates across all pods sharing the GPU. As a result, it is impossible to monitor the GPU consumption of individual workloads.


### What did you expect to happen?

When `--kubernetes-virtual-gpus=true` is enabled and multiple pods share a GPU, each pod should report its own actual GPU utilization rather than the same device-level value copied across all pods.

Expected metrics:
```
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-1", namespace="default", container="main", gpu="0",...} 0.24
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-2", namespace="default", container="main", gpu="0",...} 0.31
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-3", namespace="default", container="main", gpu="0",...} 0.18
```
Each value should reflect the individual pod's GPU consumption.

### What is the GPU model?

Tesla T4

### What is the environment?

Kubernetes environment

### How did you deploy the dcgm-exporter and what is the configuration?

_No response_

### How to reproduce the issue?

_No response_

### Anything else we need to know?

I'm willing to submit a pull request to fix this bug if the proposed approach is acceptable.

**Proposed Solution:**
Enhance DCGM Exporter to track actual per-pod GPU utilization when `--kubernetes-virtual-gpus=true` is enabled:

1. Collect per-process metrics using NVML's `nvmlDeviceGetProcessUtilization()` API to obtain the GPU utilization of each process running on the GPU.
2. Map processes to pods.
3. Export pod-specific metrics under the same `DCGM_FI_PROF_GR_ENGINE_ACTIVE` metric name, but with actual per-pod utilization values.
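
To make the three steps concrete, here is a minimal sketch of the attribution logic in Python. It only illustrates the idea and is not the exporter's Go code. It assumes the per-process samples have already been read from NVML as `(pid, smUtil percent)` pairs, and that a process can be mapped to a pod by finding the pod UID in `/proc/<pid>/cgroup`. The helper names and the cgroup path patterns are assumptions for illustration. Note also that NVML's sampled `smUtil` is not the same quantity as DCGM's graphics-engine-active ratio, so the values would be an approximation.

```python
import re
from collections import defaultdict

# Assumption: the pod UID appears in the cgroup path either as "pod<uid>"
# (cgroupfs driver) or "pod<uid with underscores>.slice" (systemd driver).
_POD_UID = re.compile(r"pod([0-9a-f]{8}(?:[-_][0-9a-f]{4}){3}[-_][0-9a-f]{12})")


def pod_uid_from_cgroup(cgroup_text):
    """Return the pod UID found in /proc/<pid>/cgroup content, or None."""
    match = _POD_UID.search(cgroup_text)
    if match is None:
        return None
    # The systemd driver writes underscores where the UID has dashes.
    return match.group(1).replace("_", "-")


def per_pod_utilization(samples, pid_to_pod):
    """Sum per-process SM utilization (percent) into a 0..1 ratio per pod.

    samples:    iterable of (pid, sm_util_percent) pairs
    pid_to_pod: dict mapping pid -> pod name; unknown PIDs are skipped
    """
    totals = defaultdict(float)
    for pid, sm_util_percent in samples:
        pod = pid_to_pod.get(pid)
        if pod is not None:
            totals[pod] += sm_util_percent / 100.0
    return dict(totals)
```

A pod running several GPU processes gets the sum of their samples, and processes that cannot be mapped to a pod (for example, host processes) are left out rather than attributed to an arbitrary pod.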


Expected output when 3 pods share a GPU:
```
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-1", namespace="default", container="main", gpu="0",...} 0.24
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-2", namespace="default", container="main", gpu="0",...} 0.31
  DCGM_FI_PROF_GR_ENGINE_ACTIVE{pod="gpu-workload-3", namespace="default", container="main", gpu="0",...} 0.18
```


If changing the existing `DCGM_FI_PROF_GR_ENGINE_ACTIVE` metric is not desired, we can introduce a new metric name specifically for per-process data (e.g.,
`DCGM_FI_PROC_GR_ENGINE_ACTIVE{pod="gpu-workload-3", namespace="default", container="main", gpu="0",...} 0.18`) to clearly distinguish between device-level and process-level metrics.
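
As a rough illustration of the separate-metric option, the sketch below renders one sample in the Prometheus text exposition format under a caller-supplied metric name. The `format_sample` helper is hypothetical and written in Python for brevity. It only shows that the device-level and process-level series could carry the same labels while differing in name.

```python
def format_sample(name, labels, value):
    """Render one sample line in the Prometheus text exposition format."""

    def escape(text):
        # Label values must escape backslash, double quote, and newline.
        return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    label_text = ",".join(
        f'{key}="{escape(str(val))}"' for key, val in labels.items()
    )
    return f"{name}{{{label_text}}} {value}"


labels = {"pod": "gpu-workload-3", "namespace": "default", "container": "main", "gpu": "0"}
# Device-level value keeps the existing name; per-process value gets the new one.
device_line = format_sample("DCGM_FI_PROF_GR_ENGINE_ACTIVE", labels, 0.714)
process_line = format_sample("DCGM_FI_PROC_GR_ENGINE_ACTIVE", labels, 0.18)
```

Keeping both series would let existing dashboards that rely on the device-level value continue to work unchanged.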

This issue also affects other metrics when `--kubernetes-virtual-gpus=true` is enabled, such as `DCGM_FI_DEV_FB_USED` (per-pod memory usage). The PR implementation would address these additional metrics as well.
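
The same attribution idea could apply to memory. The sketch below is an assumption about how that might look, not part of the proposal as written: it takes per-process framebuffer usage in bytes (as NVML reports for running compute processes, for example via `nvmlDeviceGetComputeRunningProcesses()`) and sums it per pod in MiB, the unit `DCGM_FI_DEV_FB_USED` uses. The function name and input shape are hypothetical.

```python
from collections import defaultdict

BYTES_PER_MIB = 1024 * 1024


def per_pod_fb_used_mib(process_memory, pid_to_pod):
    """Sum per-process framebuffer usage (bytes) into MiB per pod.

    process_memory: iterable of (pid, used_bytes); used_bytes may be None
                    when the driver cannot report a value for the process
    pid_to_pod:     dict mapping pid -> pod name; unknown PIDs are skipped
    """
    totals = defaultdict(int)
    for pid, used_bytes in process_memory:
        pod = pid_to_pod.get(pid)
        if pod is not None and used_bytes is not None:
            totals[pod] += used_bytes
    return {pod: total / BYTES_PER_MIB for pod, total in totals.items()}
```

Pods whose processes report no memory value are omitted from the result instead of being exported as zero, which avoids presenting missing data as a real measurement.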
