Duplicated, missing or wrong metrics when using MIG; Grafana dashboard shows duplicated / false values

### What is the version?

3.3.5-3.4.1

### What happened?

When activating MIG we saw duplicated and plain wrong metrics in the provided Grafana dashboard (https://github.com/NVIDIA/dcgm-exporter/tree/main/grafana).

The issue seems to be two-fold, affecting both the Grafana dashboard and the raw metrics themselves:

1. **Firstly, the dashboard**: Legends, ... and the PromQL queries used to fetch metrics do not take MIG into account, so series reported per MIG subdevice (`GPU_I_ID`) are not handled.
GPU metrics regarding have not been up

2. **Secondly, the metrics**:

- Even if the queries were updated with aggregations such as `max()`, `avg()` or `sum()` to avoid duplication, some metrics are reported per `GPU_I_ID` although they do not have this granularity. See my comment https://github.com/NVIDIA/dcgm-exporter/issues/257#issuecomment-2210537130. If the power draw is not measured per `GPU_I_ID`, it cannot be returned per instance, because that would be returning false values.
- Reading https://github.com/NVIDIA/DCGM/issues/80#issuecomment-1550452634, it seems the GPU metrics should be replaced by `DCGM_FI_PROF_*`.
- But apparently there are even more open discussions around that: https://github.com/NVIDIA/DCGM/issues/138, https://github.com/NVIDIA/DCGM/issues/64 and https://github.com/NVIDIA/DCGM/issues/48.
- This comment by @bstollenvidia seems to sum up quite nicely how things work: https://github.com/NVIDIA/DCGM/issues/64#issuecomment-1400811885
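
To illustrate the double counting, here is a minimal Python sketch. The sample values, the host name and the assumption that power is measured once per physical GPU are hypothetical; the label names `Hostname`, `gpu` and `GPU_I_ID` are what I see on our series and may differ in other setups.

```python
from collections import defaultdict

# Hypothetical scrape result: one series per MIG instance (GPU_I_ID),
# although power is assumed to be measured once per physical GPU.
samples = [
    {"Hostname": "node-a", "gpu": "0", "GPU_I_ID": "1", "value": 300.0},
    {"Hostname": "node-a", "gpu": "0", "GPU_I_ID": "2", "value": 300.0},
    {"Hostname": "node-a", "gpu": "0", "GPU_I_ID": "3", "value": 300.0},
    {"Hostname": "node-a", "gpu": "1", "GPU_I_ID": "", "value": 250.0},  # whole GPU, no MIG
]


def naive_total(samples):
    """Sum every series; a device-level value is counted once per MIG instance."""
    return sum(s["value"] for s in samples)


def per_gpu_total(samples):
    """Collapse to one value per physical GPU (max), then sum across GPUs."""
    per_gpu = defaultdict(float)
    for s in samples:
        key = (s["Hostname"], s["gpu"])
        per_gpu[key] = max(per_gpu[key], s["value"])
    return sum(per_gpu.values())


print(naive_total(samples))    # 1150.0, GPU 0 is counted three times
print(per_gpu_total(samples))  # 550.0
```

In PromQL the second variant would roughly correspond to `sum(max by (Hostname, gpu) (DCGM_FI_DEV_POWER_USAGE))`. This only hides the duplication in the dashboard; the per-instance series themselves still carry a value that was never measured per instance.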



### What did you expect to happen?

Given that MIG and other ways of partitioning GPUs (vGPU, time-slicing, ...) are quite common, I'd expect the exporter and the provided dashboard to take them into account.

Metrics that are available per subdevice should be returned per subdevice. Metrics that are just duplicates of each other across subdevices should be dropped and only returned once per "main" GPU.
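
A rough sketch of the behaviour I have in mind, written as a post-processing step in Python. This is not how dcgm-exporter works internally; `collapse_duplicates` and the sample data are hypothetical. It also uses value equality as a stand-in for "this metric has no per-instance granularity", which is only a heuristic: a real implementation would need to know the granularity of each DCGM field.

```python
from collections import defaultdict


def collapse_duplicates(samples):
    """Drop GPU_I_ID from series whose MIG instances all report the same value.

    Heuristic only: identical values across instances are treated as a sign
    that the metric is measured per physical GPU, not per instance.
    """
    groups = defaultdict(list)
    for s in samples:
        groups[(s["metric"], s["Hostname"], s["gpu"])].append(s)

    result = []
    for (metric, host, gpu), group in groups.items():
        values = {s["value"] for s in group}
        if len(group) > 1 and len(values) == 1:
            # Duplicates: emit a single series for the "main" GPU.
            result.append({"metric": metric, "Hostname": host, "gpu": gpu,
                           "value": group[0]["value"]})
        else:
            # Genuinely per-instance values (or a single series): keep as-is.
            result.extend(group)
    return result


# Hypothetical input: power is duplicated, engine activity differs per instance.
example = [
    {"metric": "DCGM_FI_DEV_POWER_USAGE", "Hostname": "node-a", "gpu": "0",
     "GPU_I_ID": "1", "value": 300.0},
    {"metric": "DCGM_FI_DEV_POWER_USAGE", "Hostname": "node-a", "gpu": "0",
     "GPU_I_ID": "2", "value": 300.0},
    {"metric": "DCGM_FI_PROF_GR_ENGINE_ACTIVE", "Hostname": "node-a", "gpu": "0",
     "GPU_I_ID": "1", "value": 0.2},
    {"metric": "DCGM_FI_PROF_GR_ENGINE_ACTIVE", "Hostname": "node-a", "gpu": "0",
     "GPU_I_ID": "2", "value": 0.7},
]

for series in collapse_duplicates(example):
    print(series)
```

With this input, the two power series collapse into one series without `GPU_I_ID`, while both `DCGM_FI_PROF_GR_ENGINE_ACTIVE` series are kept per instance.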

### What is the GPU model?

H100s, using different MIG profiles and whole GPUs

### What is the environment?

Kubernetes

### How did you deploy the dcgm-exporter and what is the configuration?

Kubernetes with GPU-Operator

### How to reproduce the issue?

Enable MIG on a GPU and look at the dashboard.

### Anything else we need to know?

There are multiple related open issues for DCGM and the GPU Operator:

* https://github.com/NVIDIA/DCGM/issues/80
* https://github.com/NVIDIA/gpu-operator/issues/798
