Ask your question
Description
Problem
We are running an NVIDIA A100 GPU with MIG (Multi-Instance GPU) enabled. The GPU is partitioned as follows:
- 1x 4g.40gb
- 1x 2g.20gb
- 1x 1g.20gb
We are using the gpu-burn container image to fully stress the 4g.40gb MIG instance and expect to observe ~100% GPU utilization in DCGM metrics. However, although the pod runs successfully, GPU utilization does not reach 100% in DCGM exporter/monitoring metrics.
MIG Configuration
mig-enabled: true
mig-devices:
  "4g.40gb": 1
  "2g.20gb": 1
  "1g.20gb": 1
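To make the second question below concrete, here is a rough sketch of the compute share each instance holds. It assumes the A100 exposes 7 compute slices in MIG mode and that the leading number in a profile name (the "4" in 4g.40gb) is the compute-slice count. The helper names (compute_slices, gpu_share) are ours, for illustration only. If a metric were normalized to the whole physical GPU rather than to the MIG instance, a fully busy 4g.40gb instance might read around 4/7 instead of ~100%; whether DCGM actually behaves that way is what we are trying to confirm.

```python
import re

# Assumption: an A100 exposes 7 compute slices when MIG is enabled.
TOTAL_COMPUTE_SLICES = 7


def compute_slices(profile: str) -> int:
    """Return the compute-slice count encoded in a profile name such as '4g.40gb'."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"unrecognized MIG profile: {profile!r}")
    return int(match.group(1))


def gpu_share(mig_devices: dict[str, int], total: int = TOTAL_COMPUTE_SLICES) -> dict[str, float]:
    """Map each profile to the fraction of the physical GPU's compute slices it holds."""
    return {
        profile: compute_slices(profile) * count / total
        for profile, count in mig_devices.items()
    }


# Same layout as the MIG configuration above.
mig_devices = {"4g.40gb": 1, "2g.20gb": 1, "1g.20gb": 1}
shares = gpu_share(mig_devices)

for profile, share in shares.items():
    print(f"{profile}: {share:.1%} of the physical GPU's compute slices")
```

This is only arithmetic on the profile names, not a statement about how DCGM computes its values.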
Pod Manifest
apiVersion: v1
kind: Pod
metadata:
  name: gpu-stress-burn-single
  namespace: mlops-development
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: iankoulski/gpuburn
      command: ["/app/gpu_burn"]
      args:
        - "-tc"    # Tensor Core enabled
        - "-d"     # Double Precision enabled
        - "14400"  # 4 hours (in seconds)
      resources:
        limits:
          nvidia.com/mig-4g.40gb: "1"
        requests:
          nvidia.com/mig-4g.40gb: "1"
Expected Behavior
- While gpu-burn is running:
  - DCGM metrics should show ~100% GPU utilization for the allocated MIG instance
Actual Behavior
- Pod starts and gpu_burn runs successfully
- However:
  - GPU utilization in DCGM metrics appears low
  - It never reaches ~100%
Questions / Suspicions
- Does DCGM exporter report utilization differently for MIG devices?
- Are DCGM metrics calculated at physical GPU level instead of per-MIG instance?
- Is a multi-process workload or additional flags required?
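To help check the first two questions, here is a minimal sketch of how we could separate per-MIG-instance samples from GPU-level samples in the exporter's Prometheus text output. It assumes the exporter attaches GPU_I_PROFILE and GPU_I_ID labels to MIG-instance samples and that DCGM_FI_PROF_GR_ENGINE_ACTIVE is among the collected fields; both depend on the exporter version and its metrics configuration. The function names and the SAMPLE text are hypothetical, and the values in SAMPLE are placeholders, not measurements from our cluster.

```python
import re

# Matches `name{labels} value` with an optional trailing timestamp.
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)(?:\s+\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse labeled Prometheus samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip unlabeled or malformed lines
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def mig_samples(text, profile):
    """Keep only samples labeled with the given MIG profile (assumed label: GPU_I_PROFILE)."""
    return [s for s in parse_samples(text) if s[1].get("GPU_I_PROFILE") == profile]


# Hypothetical exporter output; the values are placeholders, not real readings.
SAMPLE = """\
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1"} 0.5
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_PROFILE="2g.20gb",GPU_I_ID="5"} 0.0
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0"} 0.25
"""

for name, labels, value in mig_samples(SAMPLE, "4g.40gb"):
    print(name, labels.get("GPU_I_ID"), value)
```

Running this against the real /metrics output would show whether the low value we see belongs to the 4g.40gb instance itself or to a GPU-level series.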
Additional Context
- GPU: NVIDIA A100
- MIG: Enabled
- Environment: Kubernetes with NVIDIA device plugin
- Monitoring: DCGM Exporter
Screenshots
Environment
GPU: NVIDIA A100 80GB PCIe
Driver: 550.54.15
CUDA: 12.4
MIG profile: 4g.40gb
nvidia-smi Output:
Captured on the worker node while gpu-burn was running:

Raw DCGM Metric:

I can also provide the gpu-burn command, pod specification, and additional diagnostics if needed to help reproduce the test.
Please let me know if you need any further data or validation with another workload.
Thanks.