Missing metrics DCGM_FI_DEV_RETIRED_SBE, DCGM_FI_DEV_RETIRED_DBE and DCGM_FI_DEV_XID_ERRORS #646

@aliya-do

What is the version?

4.4.1-4.5.2 & 4.5.2-4.8.1

What happened?

When curl-ing the dcgm-exporter metrics endpoint, we get no results for DCGM_FI_DEV_RETIRED_SBE, DCGM_FI_DEV_RETIRED_DBE, or DCGM_FI_DEV_XID_ERRORS:

# curl localhost:5000/metrics | grep DCGM_FI_DEV_RETIRED_SBE
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69847    0 69847    0     0  2490k      0 --:--:-- --:--:-- --:--:-- 2526k
# curl localhost:5000/metrics | grep DCGM_FI_DEV_RETIRED_DBE
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69850    0 69850    0     0  1420k      0 --:--:-- --:--:-- --:--:-- 1451k
# curl localhost:5000/metrics | grep DCGM_FI_DEV_XID_ERRORS
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69850    0 69850    0     0  2651k      0 --:--:-- --:--:-- --:--:-- 2728k

NOTE: We verified that the other configured metrics are exported as expected.
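The grep checks above can also be scripted so that all three metrics are tested in one pass. The following is a minimal sketch, not part of dcgm-exporter: `missing_metrics` and the sample text are hypothetical names and data introduced for illustration, and the parsing assumes the standard Prometheus text exposition format, where each sample line begins with the metric name.

```python
import re

# The three metrics this issue reports as missing.
EXPECTED = (
    "DCGM_FI_DEV_RETIRED_SBE",
    "DCGM_FI_DEV_RETIRED_DBE",
    "DCGM_FI_DEV_XID_ERRORS",
)


def missing_metrics(exposition_text, expected=EXPECTED):
    """Return the expected metric names that have no sample line in the text."""
    present = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line starts with the metric name, optionally followed by {labels}.
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            present.add(match.group(0))
    return [name for name in expected if name not in present]


# Hypothetical sample, not real exporter output.
sample = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 30
"""
print(missing_metrics(sample))
```

Against a live exporter, the text could be fetched with `urllib.request.urlopen("http://localhost:5000/metrics").read().decode()` and passed to the same function; the address is taken from the curl commands above.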

What did you expect to happen?

When curl-ing our metrics endpoint, we expect to see a value for each of the missing metrics, for each GPU.

What is the GPU model?

We've seen this issue on:

  • internal-a30-1x
  • H100 with 8 GPUs configured

Output of nvidia-smi for internal-a30-1x:

nvidia-smi
Wed Mar 25 19:21:45 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08              Driver Version: 575.57.08      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     On  |   00000000:83:00.0 Off |                    0 |
| N/A   30C    P0             30W /  165W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Output of nvidia-smi for the VM running h100x8:

nvidia-smi
Wed Mar 25 22:52:35 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:00:0A.0 Off |                    0 |
| N/A   58C    P0             80W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          On  |   00000000:00:0B.0 Off |                    0 |
| N/A   29C    P0             72W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          On  |   00000000:00:0C.0 Off |                    0 |
| N/A   29C    P0             71W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          On  |   00000000:00:0D.0 Off |                    0 |
| N/A   32C    P0             74W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          On  |   00000000:00:0E.0 Off |                    0 |
| N/A   55C    P0             75W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          On  |   00000000:00:0F.0 Off |                    0 |
| N/A   30C    P0             70W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          On  |   00000000:00:10.0 Off |                    0 |
| N/A   31C    P0             73W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          On  |   00000000:00:11.0 Off |                    0 |
| N/A   29C    P0             70W /  700W |       0MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

What is the environment?

We've seen this in both Kubernetes pods and virtual machines.

How did you deploy the dcgm-exporter and what is the configuration?

For our Kubernetes environment, we deploy the nvidia-dcgm helm chart and configure which metrics to expose with a k8s ConfigMap.

For virtual machines, we built the dcgm-exporter from source and configured the metrics to expose via a CSV file that is passed to the dcgm-exporter with the -f argument.

How to reproduce the issue?

  • Install dcgm-exporter (we reproduced the issue with both the 4.4.1-4.5.2 and 4.5.2-4.8.1 versions).
  • Run it like so:
/usr/bin/dcgm-exporter -f /etc/dcgm-exporter/dcgm-metrics.csv -c 1000 -a 127.0.0.1:5000

I've also attached the dcgm-metrics.csv config used to expose the metrics we want to see:

dcgm-metrics.csv
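To rule out the config itself, one could check that the three fields appear on uncommented lines of the CSV. The sketch below assumes the layout used by the exporter's default counters file (`DCGM field, Prometheus metric type, help message`, with `#` starting a comment). The sample content is hypothetical and is not the attached file; `enabled_fields` is a helper name introduced here for illustration.

```python
def enabled_fields(csv_text):
    """Return the DCGM field names found on non-comment, non-blank lines."""
    fields = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # commented-out entries are not exported
        # The field name is everything before the first comma.
        fields.append(line.split(",", 1)[0].strip())
    return fields


# Hypothetical config content for illustration only.
sample_csv = """\
# Format: DCGM field, Prometheus metric type, help message
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# DCGM_FI_DEV_RETIRED_SBE, counter, Retired pages due to single-bit errors.
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
"""

wanted = [
    "DCGM_FI_DEV_RETIRED_SBE",
    "DCGM_FI_DEV_RETIRED_DBE",
    "DCGM_FI_DEV_XID_ERRORS",
]
enabled = enabled_fields(sample_csv)
not_enabled = [name for name in wanted if name not in enabled]
print(not_enabled)
```

If this list is empty for the real file at /etc/dcgm-exporter/dcgm-metrics.csv, the fields are requested in the config and the gap lies somewhere between the config and the exported output.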

Anything else we need to know?

No response

Labels: bug
