What is the version?
version: 6f3d599d-amd64, commit: 6f3d599
What happened?
Version from the dcgm-exporter logs:
time="2025-11-11T05:49:39Z" level=info msg="version: 6f3d599d-amd64, commit: 6f3d599"
The OpenShift cluster has one node with a GPU (AWS instance type: g4dn.xlarge, GPU: Tesla-T4).
GPU time slicing is enabled, and the GPU is time-sliced into 4 replicas.
❯ ./verify-gpu-timeslicing.sh
=== GPU Time-Slicing Verification ===
[1/5] Checking time-slicing ConfigMap...
✓ ConfigMap 'time-slicing-config' exists
Configuration:
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"any":"version: v1\nsharing:\n timeSlicing:\n resources:\n - name: nvidia.com/gpu\n replicas: 4"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"time-slicing-config","namespace":"nvidia-gpu-operator"}}
  creationTimestamp: "2025-11-11T05:57:32Z"
  name: time-slicing-config
  namespace: nvidia-gpu-operator
  resourceVersion: "41584"
  uid: 16ab5125-6a07-40f0-b907-98d786a4099c
[2/5] Checking ClusterPolicy configuration...
Device Plugin Configuration:
✓ Config name: time-slicing-config
✓ Default config: any
DCGM Exporter Configuration:
---
dcgmExporter:
  enabled: true
  env:
  - name: KUBERNETES_VIRTUAL_GPUS
    value: "true"
  serviceMonitor:
    additionalLabels:
      openshift.io/user-monitoring: "true"
    enabled: true
    interval: 30s
devicePlugin:
---
Verification:
✓ DCGM Exporter enabled
✓ KUBERNETES_VIRTUAL_GPUS: true
✓ ServiceMonitor enabled (interval: 30s)
[3/5] Checking GPU nodes...
✓ Found 1 GPU node(s):
- ip-10-0-23-187.us-west-2.compute.internal
Instance: g4dn.xlarge, GPU: Tesla-T4
[4/5] Verifying GPU allocatable capacity...
Node: ip-10-0-23-187.us-west-2.compute.internal
Physical GPUs: 1
Allocatable GPUs: 4
Sharing Strategy: time-slicing
Configured Replicas: 4
✓ Time-slicing is ACTIVE (1 × 4 = 4)
[5/5] Checking device plugin pods...
✓ Device plugin pods running:
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
nvidia-device-plugin-daemonset-5cth2   2/2     Running   0          106m   10.130.2.29   ip-10-0-23-187.us-west-2.compute.internal   <none>           <none>
=== Summary ===
Physical GPUs: 1
Allocatable GPUs: 4
✓ Time-slicing is ACTIVE with 4x replication
Two pods are running, each of which has requested 1 GPU.
❯ oc get pods -A -o json | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits["nvidia.com/gpu"])) | "\(.metadata.namespace)/\(.metadata.name)"'
default/gpu-burn-4p4rl
default/pytorch-gpu-long
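For readers less familiar with jq, the filter above can be read as the following Python sketch. It assumes the input has the shape of `oc get pods -A -o json` output; the function name `gpu_pods` and the third, non-GPU pod in the sample are hypothetical and exist only for illustration.

```python
def gpu_pods(pod_list, resource="nvidia.com/gpu"):
    """Return 'namespace/name' for pods where any container sets a limit for `resource`."""
    found = []
    for pod in pod_list.get("items", []):
        for container in pod["spec"]["containers"]:
            # Missing or null "resources"/"limits" are treated as "no limit set".
            limits = (container.get("resources") or {}).get("limits") or {}
            if limits.get(resource):
                meta = pod["metadata"]
                found.append(f'{meta["namespace"]}/{meta["name"]}')
                break  # one matching container is enough, like jq's any()
    return found


# Trimmed sample input; only the fields the filter reads are present.
sample_pods = {
    "items": [
        {"metadata": {"namespace": "default", "name": "gpu-burn-4p4rl"},
         "spec": {"containers": [{"resources": {"limits": {"nvidia.com/gpu": "1"}}}]}},
        {"metadata": {"namespace": "default", "name": "pytorch-gpu-long"},
         "spec": {"containers": [{"resources": {"limits": {"nvidia.com/gpu": "1"}}}]}},
        # Hypothetical pod without a GPU limit; it should be skipped.
        {"metadata": {"namespace": "default", "name": "no-gpu-example"},
         "spec": {"containers": [{"resources": {}}]}},
    ]
}

print(gpu_pods(sample_pods))
```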
The nvidia-gpu-operator namespace is scraped by UWM (user workload monitoring). A Grafana dashboard is set up to show GPU power using the PromQL query DCGM_FI_DEV_POWER_USAGE.
The graph shows 2 time series with identical power usage.
It seems the metric always reports the power draw of the whole GPU. With time slicing enabled, a new series is created for each pod using the GPU, but the value attached to each series is the same regardless of the pod.
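The effect can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `UUID` label name, the sample wattage, and the helper names are not taken from this cluster's output. It shows that if every per-pod series carries the whole-GPU reading, summing the series double-counts, while collapsing by GPU yields one value per device.

```python
from collections import defaultdict

# Hypothetical samples shaped like DCGM_FI_DEV_POWER_USAGE series.
# The wattage is made up; both pods share one physical GPU.
samples = [
    ({"UUID": "GPU-0", "exported_namespace": "default",
      "exported_pod": "gpu-burn-4p4rl"}, 68.5),
    ({"UUID": "GPU-0", "exported_namespace": "default",
      "exported_pod": "pytorch-gpu-long"}, 68.5),
]


def naive_total(samples):
    """Sum every series, as a dashboard 'sum' over pods would."""
    return sum(value for _, value in samples)


def per_gpu_power(samples):
    """Collapse series that describe the same GPU into one reading per GPU."""
    by_gpu = defaultdict(list)
    for labels, value in samples:
        by_gpu[labels["UUID"]].append(value)
    # The series are duplicates of one device-level reading, so max() picks it.
    return {uuid: max(values) for uuid, values in by_gpu.items()}


print(naive_total(samples))    # counts the same GPU twice
print(per_gpu_power(samples))  # one reading per physical GPU
```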
What did you expect to happen?
Power usage shown separately for each pod.
What is the GPU model?
AWS instance type: g4dn.xlarge, GPU: Tesla-T4
What is the environment?
❯ oc version
Client Version: 4.20.0
Kustomize Version: v5.6.0
Server Version: 4.20.0
Kubernetes Version: v1.33.5
How did you deploy the dcgm-exporter and what is the configuration?
Using the NVIDIA GPU Operator on OpenShift.
How to reproduce the issue?
- Install the NVIDIA GPU Operator.
- Enable time slicing.
- Set up UWM and a Grafana dashboard.
- Run more than one pod, each requesting 1 nvidia.com/gpu resource:
  resources:
    limits:
      nvidia.com/gpu: "1"
    requests:
      nvidia.com/gpu: "1"
- Check the DCGM_FI_DEV_POWER_USAGE metric in the dashboard.
The dashboard should show 2 time series with exported_pod and exported_namespace labels referring to the workloads, but both time series show the same value.
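The final check can also be done outside Grafana. The sketch below is one possible way, assuming the series come from a Prometheus instant query (the standard `data.result` vector format) and carry a `UUID` label alongside `exported_pod` and `exported_namespace`. The sample payload, its timestamp and wattage, and the function name are hypothetical.

```python
from collections import defaultdict


def identical_power_groups(response):
    """Return GPUs whose pods all report exactly the same power value."""
    by_gpu = defaultdict(dict)
    for series in response["data"]["result"]:
        labels = series["metric"]
        pod = f'{labels["exported_namespace"]}/{labels["exported_pod"]}'
        # Instant-query samples arrive as [timestamp, "value as string"].
        by_gpu[labels["UUID"]][pod] = float(series["value"][1])
    return {
        uuid: sorted(pods)
        for uuid, pods in by_gpu.items()
        if len(pods) > 1 and len(set(pods.values())) == 1
    }


# Hypothetical response for the query DCGM_FI_DEV_POWER_USAGE.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "DCGM_FI_DEV_POWER_USAGE", "UUID": "GPU-0",
                        "exported_namespace": "default",
                        "exported_pod": "gpu-burn-4p4rl"},
             "value": [1762840179, "68.5"]},
            {"metric": {"__name__": "DCGM_FI_DEV_POWER_USAGE", "UUID": "GPU-0",
                        "exported_namespace": "default",
                        "exported_pod": "pytorch-gpu-long"},
             "value": [1762840179, "68.5"]},
        ],
    },
}

print(identical_power_groups(sample_response))
```

A non-empty result means several pods on the same GPU report the same reading, which matches the behavior described in this report.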
Anything else we need to know?
No.