What is the version?
3.3.5-3.4.1
What happened?
Metrics such as DCGM_FI_PROF_GR_ENGINE_ACTIVE are exposed for only one pod, even though multiple pods share the same GPU.
What did you expect to happen?
Metrics should be exposed for all the pods that use the GPU.
What is the GPU model?
Tesla T4
What is the environment?
GKE
How did you deploy the dcgm-exporter and what is the configuration?
No response
How to reproduce the issue?
- Enable time-slicing using the device plugin
- Deploy DCGM and dcgm-exporter
- Deploy an app that uses the GPU, with multiple pods sharing it
- Check the metrics
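For the last step, a small script can make the check repeatable. This is only a sketch: it assumes the exporter attaches a `pod` label to each sample in its Prometheus text output, and `pods_reporting` is a helper name made up for this example, not part of dcgm-exporter.

```python
import re

# Assumption: each sample carries a pod="..." label in the Prometheus text format.
POD_LABEL = re.compile(r'\bpod="([^"]*)"')


def pods_reporting(metrics_text, metric_name):
    """Return the sorted pod names that have at least one sample of metric_name."""
    pods = set()
    for line in metrics_text.splitlines():
        # Skip "# HELP" / "# TYPE" comments and samples of other metrics.
        if line.startswith("#") or not line.startswith(metric_name + "{"):
            continue
        match = POD_LABEL.search(line)
        if match and match.group(1):
            pods.add(match.group(1))
    return sorted(pods)
```

Feeding it the text scraped from the exporter's metrics endpoint together with `"DCGM_FI_PROF_GR_ENGINE_ACTIVE"` should list every pod sharing the GPU; in this report it would list only one.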
Anything else we need to know?
From the debug log
time="2024-04-05T13:49:04Z" level=debug msg="Device to pod mapping: map[nvidia0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} nvidia0/vgpu0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} nvidia0/vgpu1:{Name:gpu-pod-c69f6664f-2v922 Namespace:default Container:extractor} nvidia0/vgpu2:{Name:gpu-pod-c69f6664f-wrcxw Namespace:default Container:extractor} nvidia0/vgpu3:{Name:gpu-pod-c69f6664f-ffs8r Namespace:default Container:extractor}]"
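To make the mapping in that log line easier to read, here is a rough Python sketch that parses it and groups pods by physical GPU. The regular expression and the helper names (`parse_mapping`, `pods_per_gpu`) are made up for this example and only assume the `device:{Name:... Namespace:... Container:...}` layout seen in the line above; they are not taken from the dcgm-exporter code.

```python
import re
from collections import defaultdict

# The debug line from this report, copied verbatim.
LOG_LINE = (
    'time="2024-04-05T13:49:04Z" level=debug msg="Device to pod mapping: '
    'map[nvidia0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} '
    'nvidia0/vgpu0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} '
    'nvidia0/vgpu1:{Name:gpu-pod-c69f6664f-2v922 Namespace:default Container:extractor} '
    'nvidia0/vgpu2:{Name:gpu-pod-c69f6664f-wrcxw Namespace:default Container:extractor} '
    'nvidia0/vgpu3:{Name:gpu-pod-c69f6664f-ffs8r Namespace:default Container:extractor}]"'
)

# Assumed entry layout: <device>:{Name:<pod> Namespace:<ns> Container:<container>}
ENTRY = re.compile(r'([\w/]+):\{Name:(\S+) Namespace:(\S+) Container:([^}]+)\}')


def parse_mapping(line):
    """Return a dict of device key -> 'namespace/pod' for every entry in the line."""
    return {dev: f"{ns}/{name}" for dev, name, ns, _ in ENTRY.findall(line)}


def pods_per_gpu(line):
    """Group pods by physical GPU, i.e. the part of the device key before '/'."""
    grouped = defaultdict(set)
    for device, pod in parse_mapping(line).items():
        grouped[device.split("/")[0]].add(pod)
    return {gpu: sorted(pods) for gpu, pods in grouped.items()}


if __name__ == "__main__":
    print(pods_per_gpu(LOG_LINE))
    print(parse_mapping(LOG_LINE)["nvidia0"])
```

Run against the line above, this shows four distinct pods under `nvidia0`, while the bare `nvidia0` key points to a single pod (`gpu-pod-c69f6664f-vkkcb`). My guess, which I have not confirmed, is that the metric labels come from that bare key.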