Duplicated, missing or wrong metrics if using MIG, Grafana dashboard showing wrong duplicated / false values #353

@frittentheke

Description

What is the version?

3.3.5-3.4.1

What happened?

After enabling MIG, we saw duplicated and plainly wrong metrics in the provided Grafana dashboard (https://github.com/NVIDIA/dcgm-exporter/tree/main/grafana).

The issue seems to be two-fold, with Grafana as well as the raw metrics themselves:

  1. Firstly, the dashboard: the legends, ... and the PromQL queries used to fetch metrics do not take MIG into account, so metrics returned per MIG subdevice (label GPU_I_ID) are not considered.
    GPU metrics regarding have not been up

  2. Secondly the metrics:

What did you expect to happen?

Given that MIG and other ways of partitioning GPUs (vGPU, time-slicing, ...) are quite common, I'd expect the exporter and the provided dashboard to take them into account.

Metrics that are available per subdevice should be returned per subdevice. If the per-subdevice values are just duplicates of each other, they should be dropped and the metric returned only once, for the "main" GPU.
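To make the expectation concrete, here is a minimal sketch of the collapsing rule I have in mind. It is not how dcgm-exporter works today; the function name `collapse_duplicates`, the sample tuples and all values are made up for illustration. A real implementation would decide per metric field rather than by comparing values at one point in time.

```python
from collections import defaultdict


def collapse_duplicates(samples):
    """Report a metric once per GPU when all of its MIG instances agree.

    `samples` is a list of (metric_name, labels, value) tuples, where
    `labels` is a dict containing at least "gpu" and, for MIG instances,
    "GPU_I_ID". This data shape is hypothetical, for illustration only.
    """
    groups = defaultdict(list)
    for name, labels, value in samples:
        groups[(name, labels["gpu"])].append((labels, value))

    result = []
    for (name, gpu), members in groups.items():
        all_mig = all(labels.get("GPU_I_ID") for labels, _ in members)
        values = {value for _, value in members}
        if len(members) > 1 and all_mig and len(values) == 1:
            # Every MIG instance reports the same value: emit one series
            # for the "main" GPU and drop the per-instance copies.
            result.append((name, {"gpu": gpu}, values.pop()))
        else:
            # Values differ (or there is nothing to collapse): keep as-is.
            result.extend((name, labels, value) for labels, value in members)
    return result


# Made-up example values, not taken from a real system.
SAMPLES = [
    ("DCGM_FI_DEV_GPU_TEMP", {"gpu": "0", "GPU_I_ID": "1"}, 41.0),
    ("DCGM_FI_DEV_GPU_TEMP", {"gpu": "0", "GPU_I_ID": "2"}, 41.0),
    ("DCGM_FI_PROF_GR_ENGINE_ACTIVE", {"gpu": "0", "GPU_I_ID": "1"}, 0.25),
    ("DCGM_FI_PROF_GR_ENGINE_ACTIVE", {"gpu": "0", "GPU_I_ID": "2"}, 0.75),
]

for sample in collapse_duplicates(SAMPLES):
    print(sample)
```

In this sketch the temperature, which is identical for both instances, ends up as a single per-GPU series, while the engine activity stays per instance.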

What is the GPU model?

H100s, using different MIG profiles and whole GPUs

What is the environment?

Kubernetes

How did you deploy the dcgm-exporter and what is the configuration?

Kubernetes with GPU-Operator

How to reproduce the issue?

Enable MIG on a GPU and look at the dashboard.
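To check the raw metrics independently of Grafana, something like the following sketch could be run against a saved copy of the exporter's metrics output. It counts how many series exist per metric name and `gpu` label; more than one series for the same pair means that a dashboard query or legend keyed only on `gpu` cannot tell them apart. The helper name `series_per_gpu` and the sample text are my own assumptions, and the parser only handles simple `name{labels} value` lines.

```python
import re
from collections import Counter

# Matches lines like: NAME{key="value",...} 12.5 [optional timestamp]
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}'
    r'\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def series_per_gpu(text):
    """Count series per (metric name, gpu label) in exposition-format text."""
    counts = Counter()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if not match:
            continue  # skip lines this simple parser does not understand
        labels = dict(LABEL.findall(match.group("labels")))
        counts[(match.group("name"), labels.get("gpu"))] += 1
    return counts


# Made-up sample text, not copied from a real exporter.
SAMPLE_TEXT = '''
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",GPU_I_ID="1"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="0",GPU_I_ID="2"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 38
'''

for (name, gpu), count in series_per_gpu(SAMPLE_TEXT).items():
    if count > 1:
        print(f"{name} gpu={gpu}: {count} series share the same gpu label")
```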

Anything else we need to know?

There are multiple open issues concerning DCGM or the operator:

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working)
