Per pod metrics not exposed with time-slicing enabled #307

@ThisIsQasim

What is the version?

3.3.5-3.4.1

What happened?

Metrics like DCGM_FI_PROF_GR_ENGINE_ACTIVE are exposed for only a single pod, even though multiple pods share the same GPU.

What did you expect to happen?

Metrics should be exposed for all the pods that use the GPU.

What is the GPU model?

Tesla T4

What is the environment?

GKE

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

  • Enable time-slicing using the device plugin
  • Deploy DCGM and dcgm-exporter
  • Deploy an app that uses the GPU
  • Check the metrics
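The last step can be scripted. Below is a minimal sketch that takes the exporter's metrics output in Prometheus text format and lists which pods appear for a given metric on each GPU. The label names `gpu` and `pod`, the helper name `pods_per_gpu`, and the sample text are assumptions made for illustration; the sample is invented and is not real output from this cluster.

```python
import re
from collections import defaultdict

# Hypothetical sample in Prometheus text format. Values, pod name and the
# label set are made up for illustration only.
SAMPLE = '''\
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",namespace="default",pod="gpu-pod-a"} 0.42
DCGM_FI_DEV_GPU_TEMP{gpu="0",namespace="default",pod="gpu-pod-a"} 55
'''

# Matches lines of the form: metric_name{label="value",...} value
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def pods_per_gpu(text, metric):
    """Return {gpu label: sorted pod names} for every sample of `metric`."""
    result = defaultdict(set)
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m or m.group("name") != metric:
            continue  # skips comments, blank lines and other metrics
        labels = dict(LABEL_RE.findall(m.group("labels")))
        result[labels.get("gpu", "?")].add(labels.get("pod", ""))
    return {gpu: sorted(pods) for gpu, pods in result.items()}


print(pods_per_gpu(SAMPLE, "DCGM_FI_PROF_GR_ENGINE_ACTIVE"))
```

With time-slicing working as expected, each GPU entry would list every pod sharing that GPU; the behaviour reported here corresponds to a single pod per GPU.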

Anything else we need to know?

From the debug log:

time="2024-04-05T13:49:04Z" level=debug msg="Device to pod mapping: map[nvidia0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} nvidia0/vgpu0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} nvidia0/vgpu1:{Name:gpu-pod-c69f6664f-2v922 Namespace:default Container:extractor} nvidia0/vgpu2:{Name:gpu-pod-c69f6664f-wrcxw Namespace:default Container:extractor} nvidia0/vgpu3:{Name:gpu-pod-c69f6664f-ffs8r Namespace:default Container:extractor}]"
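To make the mapping in that log line easier to read, here is a small sketch that parses it and groups pods by physical device. The regular expression and the names `ENTRY_RE`, `entries` and `pods_by_gpu` are illustrative assumptions, not part of dcgm-exporter; the mapping string is copied from the log above.

```python
import re
from collections import defaultdict

# The "Device to pod mapping" payload, copied from the debug log above.
LOG_MAPPING = (
    "map[nvidia0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} "
    "nvidia0/vgpu0:{Name:gpu-pod-c69f6664f-vkkcb Namespace:default Container:extractor} "
    "nvidia0/vgpu1:{Name:gpu-pod-c69f6664f-2v922 Namespace:default Container:extractor} "
    "nvidia0/vgpu2:{Name:gpu-pod-c69f6664f-wrcxw Namespace:default Container:extractor} "
    "nvidia0/vgpu3:{Name:gpu-pod-c69f6664f-ffs8r Namespace:default Container:extractor}]"
)

# One entry looks like: device:{Name:<pod> Namespace:<ns> Container:<container>}
ENTRY_RE = re.compile(
    r'(?P<device>[\w/]+):\{Name:(?P<pod>\S+) '
    r'Namespace:(?P<ns>\S+) Container:(?P<container>[^}\s]+)\}'
)

entries = [m.groupdict() for m in ENTRY_RE.finditer(LOG_MAPPING)]

# Group distinct pods by physical device (the part before any "/vgpuN").
pods_by_gpu = defaultdict(set)
for e in entries:
    pods_by_gpu[e["device"].split("/")[0]].add(e["pod"])
pods_by_gpu = {gpu: sorted(pods) for gpu, pods in pods_by_gpu.items()}

for e in entries:
    print(f'{e["device"]:15s} -> {e["ns"]}/{e["pod"]} ({e["container"]})')
print(pods_by_gpu)
```

The parsed mapping has five keys for one physical device: the bare `nvidia0` key points to a single pod, while the `nvidia0/vgpu0` to `nvidia0/vgpu3` keys cover four distinct pods.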
