All pods show identical power with GPU time slicing and KUBERNETES_VIRTUAL_GPUS=true #582

### What is the version?

version: 6f3d599d-amd64, commit: 6f3d599

### What happened?

Version from the dcgm-exporter logs:
```
time="2025-11-11T05:49:39Z" level=info msg="version: 6f3d599d-amd64, commit: 6f3d599"
```

The OpenShift cluster has one node with a GPU. AWS instance type: `g4dn.xlarge`, GPU: `Tesla-T4`

GPU time slicing is enabled, and the GPU is time-sliced into 4 replicas.

```
❯ ./verify-gpu-timeslicing.sh
=== GPU Time-Slicing Verification ===

[1/5] Checking time-slicing ConfigMap...
✓ ConfigMap 'time-slicing-config' exists

Configuration:
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"any":"version: v1\nsharing:\n  timeSlicing:\n    resources:\n    - name: nvidia.com/gpu\n      replicas: 4"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"time-slicing-config","namespace":"nvidia-gpu-operator"}}
  creationTimestamp: "2025-11-11T05:57:32Z"
  name: time-slicing-config
  namespace: nvidia-gpu-operator
  resourceVersion: "41584"
  uid: 16ab5125-6a07-40f0-b907-98d786a4099c

[2/5] Checking ClusterPolicy configuration...
Device Plugin Configuration:
  ✓ Config name:    time-slicing-config
  ✓ Default config: any

DCGM Exporter Configuration:
---
    dcgmExporter:
      enabled: true
      env:
      - name: KUBERNETES_VIRTUAL_GPUS
        value: "true"
      serviceMonitor:
        additionalLabels:
          openshift.io/user-monitoring: "true"
        enabled: true
        interval: 30s
    devicePlugin:
---

Verification:
  ✓ DCGM Exporter enabled
  ✓ KUBERNETES_VIRTUAL_GPUS: true
  ✓ ServiceMonitor enabled (interval: 30s)

[3/5] Checking GPU nodes...
✓ Found 1 GPU node(s):
  - ip-10-0-23-187.us-west-2.compute.internal
    Instance: g4dn.xlarge, GPU: Tesla-T4

[4/5] Verifying GPU allocatable capacity...

Node: ip-10-0-23-187.us-west-2.compute.internal
  Physical GPUs:        1
  Allocatable GPUs:     4
  Sharing Strategy:     time-slicing
  Configured Replicas:  4
  ✓ Time-slicing is ACTIVE (1 × 4 = 4)

[5/5] Checking device plugin pods...
✓ Device plugin pods running:
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
nvidia-device-plugin-daemonset-5cth2   2/2     Running   0          106m   10.130.2.29   ip-10-0-23-187.us-west-2.compute.internal   <none>           <none>

=== Summary ===

Physical GPUs:     1
Allocatable GPUs:  4

✓ Time-slicing is ACTIVE with 4x replication
```

Two pods are running, each of which has requested 1 GPU.
```
❯ oc get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits["nvidia.com/gpu"])) | "\(.metadata.namespace)/\(.metadata.name)"'
default/gpu-burn-4p4rl
default/pytorch-gpu-long
```
The `nvidia-gpu-operator` namespace is scraped by UWM. A Grafana dashboard is set up to show GPU power using the PromQL query `DCGM_FI_DEV_POWER_USAGE`.

The graph shows 2 time series with identical power usage.

<img width="1682" height="967" alt="Image" src="https://github.com/user-attachments/assets/e4b43b5f-7fbc-4694-9e08-218e4bced632" />

It seems that the metric always reports the power of the whole GPU. With time slicing enabled, the exporter creates a separate series for each pod using the GPU, but the value attached to each series is the same regardless of the pod.
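
To rule out Grafana or Prometheus as the source of the duplication, the raw exporter output can be checked directly. The following is a minimal sketch, not part of the original setup: it parses Prometheus text-format lines and reports any GPU whose per-pod series all carry the same value. The label names `UUID`, `namespace`, and `pod`, the UUID `GPU-aaaa`, and the power value `68.5` are assumptions for illustration only; the real `/metrics` output may use different labels.

```python
import re
from collections import defaultdict

# Hypothetical sample of exporter output. Label names, the UUID and the
# power values are illustrative assumptions, not captured from the cluster.
SAMPLE = """\
# TYPE DCGM_FI_DEV_POWER_USAGE gauge
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-aaaa",namespace="default",pod="gpu-burn-4p4rl"} 68.5
DCGM_FI_DEV_POWER_USAGE{gpu="0",UUID="GPU-aaaa",namespace="default",pod="pytorch-gpu-long"} 68.5
"""

# metric_name{labels} value [optional timestamp]
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_series(text, metric):
    """Return a list of (labels, value) tuples for one metric name."""
    series = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        series.append((labels, float(match.group("value"))))
    return series


def gpus_with_identical_pod_values(series):
    """Map GPU UUID -> {pod: value} where 2+ pods share exactly one value."""
    by_gpu = defaultdict(dict)
    for labels, value in series:
        pod = f'{labels.get("namespace", "")}/{labels.get("pod", "")}'
        by_gpu[labels.get("UUID", "")][pod] = value
    return {
        uuid: pods
        for uuid, pods in by_gpu.items()
        if len(pods) > 1 and len(set(pods.values())) == 1
    }


if __name__ == "__main__":
    parsed = parse_series(SAMPLE, "DCGM_FI_DEV_POWER_USAGE")
    for uuid, pods in gpus_with_identical_pod_values(parsed).items():
        print(uuid, pods)
```

If the raw exporter output already shows the same value for every pod on one GPU, the duplication originates in the exporter rather than in the dashboard.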



### What did you expect to happen?

Power usage for each pod is shown separately.

### What is the GPU model?

AWS instance type: `g4dn.xlarge`, GPU: `Tesla-T4`


### What is the environment?

```
❯ oc version
Client Version: 4.20.0
Kustomize Version: v5.6.0
Server Version: 4.20.0
Kubernetes Version: v1.33.5
```

### How did you deploy the dcgm-exporter and what is the configuration?

Using the NVIDIA GPU Operator on OpenShift.

### How to reproduce the issue?

- Install the NVIDIA GPU Operator.
- Enable time slicing.
- Set up UWM and a Grafana dashboard.
- Run more than one pod, each requesting 1 `nvidia.com/gpu` resource:
```
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
```
- Check the `DCGM_FI_DEV_POWER_USAGE` metric in the dashboard.

The dashboard should show 2 time series whose `exported_pod` and `exported_namespace` labels refer to the workloads, but both time series show the same value.
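
The last step can also be checked without Grafana by running the same query against the Prometheus HTTP API. This is a rough sketch under stated assumptions: the base URL `https://prometheus.example` is hypothetical, authentication is omitted, and the sample response is hand-written in the shape of an instant-query result with made-up values, not data from this cluster.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def power_by_pod(response):
    """Map 'namespace/pod' -> value from an instant-query JSON response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    result = {}
    for sample in response["data"]["result"]:
        metric = sample["metric"]
        key = f'{metric.get("exported_namespace", "")}/{metric.get("exported_pod", "")}'
        _timestamp, value = sample["value"]  # the value arrives as a string
        result[key] = float(value)
    return result


# Hand-written response for illustration; timestamps and values are made up.
SAMPLE_RESPONSE = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"__name__": "DCGM_FI_DEV_POWER_USAGE",
                  "exported_namespace": "default",
                  "exported_pod": "gpu-burn-4p4rl"},
       "value": [1762840000, "68.5"]},
      {"metric": {"__name__": "DCGM_FI_DEV_POWER_USAGE",
                  "exported_namespace": "default",
                  "exported_pod": "pytorch-gpu-long"},
       "value": [1762840000, "68.5"]}
    ]
  }
}
""")

if __name__ == "__main__":
    print(instant_query_url("https://prometheus.example", "DCGM_FI_DEV_POWER_USAGE"))
    values = power_by_pod(SAMPLE_RESPONSE)
    print(values, "identical:", len(set(values.values())) == 1)
```

Fetching the URL with a bearer token and passing the decoded JSON to `power_by_pod` would give the per-pod values that the dashboard plots.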

### Anything else we need to know?

No.
