GPU utilization does not reach 100% in DCGM metrics when running GPU Burn on A100 MIG (4g.40gb ; 2g.20gb)

## Description

### Problem
We are running an NVIDIA A100 GPU with MIG (Multi-Instance GPU) enabled. The GPU is partitioned as follows:
- 1x `4g.40gb`
- 1x `2g.20gb`
- 1x `1g.20gb`

We are using the `gpu-burn` container image to fully stress the `4g.40gb` MIG instance and expect to observe ~100% GPU utilization in DCGM metrics. However, although the pod runs successfully, GPU utilization does not reach 100% in DCGM exporter/monitoring metrics.

### MIG Configuration
```yaml
mig-enabled: true
mig-devices:
  "4g.40gb": 1
  "2g.20gb": 1
  "1g.20gb": 1
```
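As a sanity check on this layout, the sketch below adds up the compute slices and memory implied by the profile names. It assumes an A100 exposes 7 compute slices in MIG mode and that the `<N>g` prefix of a profile name is that instance's compute-slice count; the helper names are hypothetical. If a metric were normalized to the whole physical GPU rather than to the instance (an assumption, not something confirmed here), a fully busy `4g.40gb` instance would show roughly 4/7, or about 57%.

```python
# Sketch: compute-slice share of each MIG instance on one A100.
# Assumption: 7 compute slices per GPU in MIG mode, and the "<N>g" prefix
# of a profile name gives that instance's compute-slice count.
TOTAL_COMPUTE_SLICES = 7

# Instance counts copied from the MIG configuration above.
mig_devices = {"4g.40gb": 1, "2g.20gb": 1, "1g.20gb": 1}


def compute_slices(profile: str) -> int:
    """Compute-slice count encoded in a profile name such as '4g.40gb'."""
    return int(profile.split("g", 1)[0])


def memory_gb(profile: str) -> int:
    """Memory size in GB encoded in a profile name such as '4g.40gb'."""
    return int(profile.split(".", 1)[1][:-2])  # drop the trailing "gb"


used_slices = sum(compute_slices(p) * n for p, n in mig_devices.items())
used_memory = sum(memory_gb(p) * n for p, n in mig_devices.items())

# Share of the physical GPU's compute slices held by one instance of each profile.
share = {p: compute_slices(p) / TOTAL_COMPUTE_SLICES for p in mig_devices}

print(f"compute slices used: {used_slices}/{TOTAL_COMPUTE_SLICES}")
print(f"memory used: {used_memory} GB")
for profile, fraction in share.items():
    print(f"{profile}: {fraction:.1%} of the physical GPU's compute slices")
```

The per-profile share is the ceiling a whole-GPU-normalized utilization value could reach for a single saturated instance, which is useful when reading the graphs below.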

### Pod Manifest
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-stress-burn-single
  namespace: mlops-development
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: iankoulski/gpuburn
      command: ["/app/gpu_burn"]
      args:
        - "-tc"        # Tensor Core enabled
        - "-d"         # Double Precision enabled
        - "14400"      # 4 hours (in seconds)
      resources:
        limits:
          nvidia.com/mig-4g.40gb: "1"
        requests:
          nvidia.com/mig-4g.40gb: "1"
```

### Expected Behavior
- While `gpu-burn` is running:
  - DCGM metrics should show ~100% GPU utilization for the allocated MIG instance

### Actual Behavior
- Pod starts and `gpu_burn` runs successfully
- However, GPU utilization in DCGM metrics stays low and never approaches 100%

### Questions / Suspicions
1. Does DCGM exporter report utilization differently for MIG devices?
2. Are DCGM metrics calculated at physical GPU level instead of per-MIG instance?
3. Is a multi-process workload or additional flags required?
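One way to check questions 1 and 2 is to separate the per-MIG-instance series from the whole-GPU series in the exporter's Prometheus text output. The sketch below is a minimal parser for that purpose. The metric names `DCGM_FI_PROF_GR_ENGINE_ACTIVE` and `DCGM_FI_DEV_GPU_UTIL` and the `GPU_I_PROFILE` label are assumptions about what this DCGM Exporter deployment emits and should be verified against its real `/metrics` output; the sample text and its values are hypothetical, not measurements from this cluster.

```python
import re

# HYPOTHETICAL sample of exporter output. Metric names, labels, and values
# are illustrative assumptions only; replace with the real /metrics text.
SAMPLE = """\
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1"} 0.98
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_PROFILE="2g.20gb",GPU_I_ID="5"} 0.0
DCGM_FI_DEV_GPU_UTIL{gpu="0"} 0
"""

# One sample line: metric name, optional {labels}, then a numeric value.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return (name, labels, value) tuples, skipping comments and odd lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # e.g. lines carrying a trailing timestamp
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def by_mig_profile(samples, metric, profile):
    """Values of `metric` whose GPU_I_PROFILE label equals `profile`."""
    return [
        value
        for name, labels, value in samples
        if name == metric and labels.get("GPU_I_PROFILE") == profile
    ]


samples = parse_metrics(SAMPLE)
print(by_mig_profile(samples, "DCGM_FI_PROF_GR_ENGINE_ACTIVE", "4g.40gb"))
```

Running the same filter against the real exporter output for each utilization metric would show whether a series exists per MIG instance or only for the physical GPU.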

### Additional Context
- **GPU:** NVIDIA A100
- **MIG:** Enabled
- **Environment:** Kubernetes with NVIDIA device plugin
- **Monitoring:** DCGM Exporter

<img width="348" height="156" alt="Image" src="https://github.com/user-attachments/assets/45dae432-fc5c-4f29-b22d-081610fc6145" />

### Screenshots

<img width="2249" height="710" alt="Image" src="https://github.com/user-attachments/assets/8cd5e6ff-3db4-4671-8857-8cf30f1121c1" />

### Environment
- **GPU:** NVIDIA A100 80GB PCIe
- **Driver:** 550.54.15
- **CUDA:** 12.4
- **MIG profile:** 4g.40gb

nvidia-smi Output:

<img width="571" height="447" alt="Image" src="https://github.com/user-attachments/assets/935c4d8f-e8ab-4c3d-84de-46c51bdf14fb" />

Captured on the worker node while gpu-burn was running:


![Image](https://github.com/user-attachments/assets/00515e15-0d03-4c46-a70d-7bb6e31f31ba)

<img width="1682" height="518" alt="Image" src="https://github.com/user-attachments/assets/5bc15a04-df24-45b3-b3e0-6df1afe50441" />

Raw DCGM Metric:

![Image](https://github.com/user-attachments/assets/d7dde5da-1a1e-4490-a55a-841c17d2e796)

The gpu-burn command and pod specification are included above; I can provide additional diagnostics if needed to help reproduce the test.

Please let me know if you need any further data or validation with another workload.

Thanks.