All pods show identical power with gpu time slicing, and KUBERNETES_VIRTUAL_GPUS=true #582

@vimalk78

What is the version?

version: 6f3d599d-amd64, commit: 6f3d599

What happened?

Version from the dcgm-exporter logs:

time="2025-11-11T05:49:39Z" level=info msg="version: 6f3d599d-amd64, commit: 6f3d599"

The OpenShift cluster has one node with a GPU. AWS instance type: g4dn.xlarge, GPU: Tesla-T4

GPU time slicing is enabled, and the GPU is time-sliced into 4 replicas.

❯ ./verify-gpu-timeslicing.sh
=== GPU Time-Slicing Verification ===

[1/5] Checking time-slicing ConfigMap...
✓ ConfigMap 'time-slicing-config' exists

Configuration:
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"any":"version: v1\nsharing:\n  timeSlicing:\n    resources:\n    - name: nvidia.com/gpu\n      replicas: 4"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"time-slicing-config","namespace":"nvidia-gpu-operator"}}
  creationTimestamp: "2025-11-11T05:57:32Z"
  name: time-slicing-config
  namespace: nvidia-gpu-operator
  resourceVersion: "41584"
  uid: 16ab5125-6a07-40f0-b907-98d786a4099c

[2/5] Checking ClusterPolicy configuration...
Device Plugin Configuration:
  ✓ Config name:    time-slicing-config
  ✓ Default config: any

DCGM Exporter Configuration:
---
    dcgmExporter:
      enabled: true
      env:
      - name: KUBERNETES_VIRTUAL_GPUS
        value: "true"
      serviceMonitor:
        additionalLabels:
          openshift.io/user-monitoring: "true"
        enabled: true
        interval: 30s
    devicePlugin:
---

Verification:
  ✓ DCGM Exporter enabled
  ✓ KUBERNETES_VIRTUAL_GPUS: true
  ✓ ServiceMonitor enabled (interval: 30s)

[3/5] Checking GPU nodes...
✓ Found 1 GPU node(s):
  - ip-10-0-23-187.us-west-2.compute.internal
    Instance: g4dn.xlarge, GPU: Tesla-T4

[4/5] Verifying GPU allocatable capacity...

Node: ip-10-0-23-187.us-west-2.compute.internal
  Physical GPUs:        1
  Allocatable GPUs:     4
  Sharing Strategy:     time-slicing
  Configured Replicas:  4
  ✓ Time-slicing is ACTIVE (1 × 4 = 4)

[5/5] Checking device plugin pods...
✓ Device plugin pods running:
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
nvidia-device-plugin-daemonset-5cth2   2/2     Running   0          106m   10.130.2.29   ip-10-0-23-187.us-west-2.compute.internal   <none>           <none>

=== Summary ===

Physical GPUs:     1
Allocatable GPUs:  4

✓ Time-slicing is ACTIVE with 4x replication

Two pods are running, each of which has requested 1 GPU.

❯ oc get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits["nvidia.com/gpu"])) | "\(.metadata.namespace)/\(.metadata.name)"'
default/gpu-burn-4p4rl
default/pytorch-gpu-long
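
For anyone scripting this check without jq, the same selection can be sketched in Python. This is only an illustration: the function name `gpu_pods` and the trimmed sample document are hypothetical (the `no-gpu-example` pod is invented for contrast), and in practice the input would be the JSON printed by `oc get pods -A -o json`.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_pods(pod_list: dict) -> list[str]:
    """Return "namespace/name" for pods where any container sets a GPU limit."""
    selected = []
    for item in pod_list.get("items", []):
        containers = item.get("spec", {}).get("containers", [])
        has_gpu_limit = any(
            ((c.get("resources") or {}).get("limits") or {}).get(GPU_RESOURCE) is not None
            for c in containers
        )
        if has_gpu_limit:
            meta = item["metadata"]
            selected.append(f'{meta["namespace"]}/{meta["name"]}')
    return selected


# Hypothetical, heavily trimmed stand-in for `oc get pods -A -o json` output.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"namespace": "default", "name": "gpu-burn-4p4rl"},
   "spec": {"containers": [{"resources": {"limits": {"nvidia.com/gpu": "1"}}}]}},
  {"metadata": {"namespace": "default", "name": "pytorch-gpu-long"},
   "spec": {"containers": [{"resources": {"limits": {"nvidia.com/gpu": "1"}}}]}},
  {"metadata": {"namespace": "default", "name": "no-gpu-example"},
   "spec": {"containers": [{"resources": {}}]}}
]}
""")

if __name__ == "__main__":
    for pod in gpu_pods(SAMPLE):
        print(pod)
```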

The nvidia-gpu-operator namespace is scraped by UWM (user workload monitoring). A Grafana dashboard is set up to show GPU power using the PromQL query DCGM_FI_DEV_POWER_USAGE.

The graph shows 2 time series with identical power usage.

Image

It seems that the metric always reports the power of the whole GPU. With time slicing enabled, a new series is created for each pod using the GPU, but the value attached to each series is the same regardless of the pod.
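
One way to confirm this observation programmatically is to group the scraped series by physical GPU and check whether every pod on that GPU carries the same value. The sketch below is a rough illustration only. It assumes each series has a `UUID` label identifying the device and an `exported_pod` label identifying the workload. The function name, the UUID string, and the wattage are hypothetical placeholders, not values taken from this cluster.

```python
from collections import defaultdict


def shared_value_gpus(samples):
    """Return UUIDs of GPUs whose per-pod series all carry one identical value.

    `samples` is an iterable of (labels, value) pairs, where `labels` is a dict
    of series labels and `value` is the latest DCGM_FI_DEV_POWER_USAGE reading.
    """
    by_gpu = defaultdict(dict)
    for labels, value in samples:
        # Assumed labels: UUID (device) and exported_pod (workload).
        by_gpu[labels["UUID"]][labels.get("exported_pod", "")] = value
    return sorted(
        uuid
        for uuid, pods in by_gpu.items()
        if len(pods) > 1 and len(set(pods.values())) == 1
    )


# Hypothetical samples: two pods on one time-sliced GPU, same reading.
SAMPLES = [
    ({"UUID": "GPU-hypothetical-0", "exported_pod": "gpu-burn-4p4rl"}, 68.5),
    ({"UUID": "GPU-hypothetical-0", "exported_pod": "pytorch-gpu-long"}, 68.5),
]

if __name__ == "__main__":
    print(shared_value_gpus(SAMPLES))
```

A GPU listed in the result has more than one pod series and no variation between them, which matches the behavior described above.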

What did you expect to happen?

Power usage for each pod shown separately.

What is the GPU model?

aws instance type: g4dn.xlarge, GPU: Tesla-T4

What is the environment?

❯ oc version
Client Version: 4.20.0
Kustomize Version: v5.6.0
Server Version: 4.20.0
Kubernetes Version: v1.33.5

How did you deploy the dcgm-exporter and what is the configuration?

Using the NVIDIA GPU Operator on OpenShift.

How to reproduce the issue?

  • Install the NVIDIA GPU Operator.
  • Enable time slicing.
  • Set up UWM and a Grafana dashboard.
  • Run more than one pod, each requesting 1 nvidia.com/gpu resource:
    resources:
      limits:
        nvidia.com/gpu: "1"
      requests:
        nvidia.com/gpu: "1"
  • Check the DCGM_FI_DEV_POWER_USAGE metric in the dashboard.
    The dashboard should show 2 time series with exported_pod and exported_namespace labels referring to the workloads, but both time series show the same value.
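
A practical consequence of the last step: if each pod series repeats the whole-GPU reading, a dashboard that sums the metric across pods would count the same device once per pod. The following sketch contrasts a naive sum with a sum that keeps one value per physical GPU. It assumes a `UUID` label identifies the device. The function names and sample readings are hypothetical and only illustrate the arithmetic.

```python
def naive_total_power(samples):
    """Sum every series, counting a shared GPU once per pod."""
    return sum(value for _labels, value in samples)


def per_gpu_total_power(samples):
    """Sum one value per physical GPU (the maximum seen for each UUID)."""
    per_gpu = {}
    for labels, value in samples:
        uuid = labels["UUID"]  # assumed device-identifying label
        per_gpu[uuid] = max(value, per_gpu.get(uuid, value))
    return sum(per_gpu.values())


# Hypothetical readings: two pods sharing one GPU, both reporting 70.0 W.
REPRO_SAMPLES = [
    ({"UUID": "GPU-hypothetical-0", "exported_pod": "gpu-burn-4p4rl"}, 70.0),
    ({"UUID": "GPU-hypothetical-0", "exported_pod": "pytorch-gpu-long"}, 70.0),
]

if __name__ == "__main__":
    print("naive sum:  ", naive_total_power(REPRO_SAMPLES))
    print("per-GPU sum:", per_gpu_total_power(REPRO_SAMPLES))
```

Neither total gives a per-pod breakdown, which is what this issue asks for. The second function only avoids double counting when the series are aggregated.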

Anything else we need to know?

No.

Labels: bug
