GPU utilization does not reach 100% in DCGM metrics when running GPU Burn on A100 MIG (4g.40gb; 2g.20gb) #639

@dogukanpolatel

Description

Problem

We are running an NVIDIA A100 GPU with MIG (Multi-Instance GPU) enabled. The GPU is partitioned as follows:

  • 1x 4g.40gb
  • 1x 2g.20gb
  • 1x 1g.20gb

We use the gpu-burn container image to fully load the 4g.40gb MIG instance and expect DCGM metrics to show ~100% GPU utilization for it. The pod runs successfully, but the GPU utilization reported by DCGM exporter never reaches 100%.

MIG Configuration

mig-enabled: true
mig-devices:
  "4g.40gb": 1
  "2g.20gb": 1
  "1g.20gb": 1
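
As a sanity check on the layout above, here is a minimal Python sketch that sums the compute slices and memory implied by the profile names. It assumes the usual `<N>g.<M>gb` naming, and it assumes an A100 80GB exposes 7 compute slices and 80 GB in total; the helper name `layout_totals` is made up for this illustration. It only adds up the nominal numbers and does not check MIG placement rules.

```python
import re

# Assumed totals for an A100 80GB: 7 compute slices, 80 GB of memory.
A100_80GB_COMPUTE_SLICES = 7
A100_80GB_MEMORY_GB = 80

# Profile names look like "<compute slices>g.<memory>gb", e.g. "4g.40gb".
PROFILE_RE = re.compile(r"^(\d+)g\.(\d+)gb$")


def layout_totals(mig_devices):
    """Sum nominal compute slices and memory (GB) over a {profile: count} mapping."""
    slices = 0
    memory_gb = 0
    for profile, count in mig_devices.items():
        match = PROFILE_RE.match(profile)
        if match is None:
            raise ValueError(f"unrecognized MIG profile name: {profile!r}")
        slices += int(match.group(1)) * count
        memory_gb += int(match.group(2)) * count
    return slices, memory_gb


# The layout from the MIG configuration in this issue.
mig_devices = {"4g.40gb": 1, "2g.20gb": 1, "1g.20gb": 1}
slices, memory_gb = layout_totals(mig_devices)
print(f"compute slices: {slices}/{A100_80GB_COMPUTE_SLICES}")
print(f"memory: {memory_gb}/{A100_80GB_MEMORY_GB} GB")

# Share of the physical GPU's compute slices held by the 4g.40gb instance.
share = 4 / A100_80GB_COMPUTE_SLICES
print(f"4g.40gb compute share: {share:.1%}")
```

Under these assumptions the 4g.40gb instance holds 4 of the 7 compute slices. If a reported utilization figure were normalized to the whole physical GPU rather than to the instance, a fully loaded 4g.40gb instance would be expected to top out near that share instead of 100%. Whether that is what happens here is exactly what this issue is asking.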

Pod Manifest

apiVersion: v1
kind: Pod
metadata:
  name: gpu-stress-burn-single
  namespace: mlops-development
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: iankoulski/gpuburn
      command: ["/app/gpu_burn"]
      args:
        - "-tc"        # Tensor Core enabled
        - "-d"         # Double Precision enabled
        - "14400"      # 4 hours (in seconds)
      resources:
        limits:
          nvidia.com/mig-4g.40gb: "1"
        requests:
          nvidia.com/mig-4g.40gb: "1"

Expected Behavior

  • While gpu-burn is running:
    • DCGM metrics should show ~100% GPU utilization for the allocated MIG instance

Actual Behavior

  • Pod starts and gpu_burn runs successfully
  • However:
    • GPU utilization in DCGM metrics appears low
    • It never reaches ~100%

Questions / Suspicions

  1. Does DCGM exporter report utilization differently for MIG devices?
  2. Are DCGM metrics calculated at physical GPU level instead of per-MIG instance?
  3. Is a multi-process workload or additional flags required?
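
To help narrow down questions 1 and 2, here is a rough Python sketch for picking the per-MIG-instance samples out of a DCGM exporter scrape. It assumes the exporter attaches `GPU_I_PROFILE` and `GPU_I_ID` labels to MIG samples and that a profiling metric such as `DCGM_FI_PROF_GR_ENGINE_ACTIVE` (a ratio between 0 and 1) is enabled in the collector configuration; neither has been verified against this cluster. The scrape text in the example is made up for illustration and is not a measurement from this issue.

```python
import re

# One labeled sample in Prometheus text format: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (metric name, labels dict, value) for each labeled sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip unlabeled samples and lines with timestamps
        labels = dict(LABEL_RE.findall(match.group("labels")))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def mig_samples(samples, metric, profile):
    """Keep only samples of `metric` whose GPU_I_PROFILE label equals `profile`."""
    return [
        (labels, value)
        for name, labels, value in samples
        if name == metric and labels.get("GPU_I_PROFILE") == profile
    ]


# Made-up scrape text for illustration only; NOT data from this issue.
EXAMPLE_TEXT = '''
# TYPE DCGM_FI_PROF_GR_ENGINE_ACTIVE gauge
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_PROFILE="4g.40gb",GPU_I_ID="1"} 0.5
DCGM_FI_PROF_GR_ENGINE_ACTIVE{gpu="0",GPU_I_PROFILE="2g.20gb",GPU_I_ID="5"} 0.0
'''

rows = mig_samples(
    parse_samples(EXAMPLE_TEXT), "DCGM_FI_PROF_GR_ENGINE_ACTIVE", "4g.40gb"
)
for labels, value in rows:
    # The metric is assumed to be a 0..1 ratio, so scale it to a percentage.
    print(f"GPU_I_ID={labels['GPU_I_ID']}: {value * 100:.1f}%")
```

Running the same filter on the real `/metrics` output of the exporter pod would show whether the low number is attached to the 4g.40gb instance itself or only to the physical GPU.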

Additional Context

  • GPU: NVIDIA A100
  • MIG: Enabled
  • Environment: Kubernetes with NVIDIA device plugin
  • Monitoring: DCGM Exporter

Screenshots

Image

Environment

GPU: NVIDIA A100 80GB PCIe
Driver: 550.54.15
CUDA: 12.4
MIG profile: 4g.40gb

nvidia-smi Output:

Image

Captured on the worker node while gpu-burn was running:

Image

Image

Raw DCGM Metric:

Image

I can also provide the gpu-burn command, pod specification, and additional diagnostics if needed to help reproduce the test.

Please let me know if you need any further data or validation with another workload.

Thanks.
