4 changes: 2 additions & 2 deletions deployment/templates/metrics-configmap.yaml
@@ -67,8 +67,8 @@ data:
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
-DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
-# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
+DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge, NVLink total throughput for all lanes (MiB/s, TX+RX combined).
+# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, gauge, NVLink throughput for lane 0 (MiB/s, TX+RX combined).

# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
4 changes: 2 additions & 2 deletions deployment/values.yaml
@@ -322,8 +322,8 @@ kubernetesDRA:
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
-# DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
-# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
+# DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge, NVLink total throughput for all lanes (MiB/s, TX+RX combined).
+# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, gauge, NVLink throughput for lane 0 (MiB/s, TX+RX combined).

# VGPU License status
# DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
4 changes: 2 additions & 2 deletions etc/dcp-metrics-included.csv
@@ -55,8 +55,8 @@ DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
-DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
-# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
+DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge, NVLink total throughput for all lanes (MiB/s, TX+RX combined).
+# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, gauge, NVLink throughput for lane 0 (MiB/s, TX+RX combined).

# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
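
The new help strings contain a comma of their own (`MiB/s, TX+RX combined`), so any tool that reads these files has to limit how many times it splits a line. Below is a minimal sketch of such a reader, assuming the three-column `FIELD, type, help` layout and `#` comment convention visible in this diff. `MetricSpec` and `parse_metric_line` are hypothetical names for illustration and are not taken from the exporter's own code, which may parse these files differently.

```python
from typing import NamedTuple, Optional


class MetricSpec(NamedTuple):
    """One metric entry: DCGM field name, Prometheus type, help text."""
    field: str
    prom_type: str
    help_text: str


def parse_metric_line(line: str) -> Optional[MetricSpec]:
    """Parse one 'FIELD, type, help' line; return None for blanks and comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # Split at most twice so commas inside the help text are preserved.
    parts = [part.strip() for part in stripped.split(",", 2)]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    return MetricSpec(*parts)


spec = parse_metric_line(
    "DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge, "
    "NVLink total throughput for all lanes (MiB/s, TX+RX combined)."
)
commented = parse_metric_line(
    "# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, gauge, "
    "NVLink throughput for lane 0 (MiB/s, TX+RX combined)."
)
```

Here `spec.help_text` keeps the full parenthesised unit note, and the commented-out `L0` entry is skipped.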
4 changes: 2 additions & 2 deletions etc/default-counters.csv
@@ -62,8 +62,8 @@ DCGM_FI_DEV_FB_RESERVED, gauge, Framebuffer memory reserved (in MiB).
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
-DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes.
-# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, counter, The number of bytes of active NVLink rx or tx data including both header and payload.
+DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge, NVLink total throughput for all lanes (MiB/s, TX+RX combined).
+# DCGM_FI_DEV_NVLINK_BANDWIDTH_L0, gauge, NVLink throughput for lane 0 (MiB/s, TX+RX combined).

# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
2 changes: 1 addition & 1 deletion tests/integration/testdata/default-counters.csv
@@ -55,7 +55,7 @@ DCGM_FI_DEV_FB_USED, gauge, Frame buffer memory used (in MB).
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
-DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes
+DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge, NVLink total throughput for all lanes (MiB/s, TX+RX combined)

# VGPU License status
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
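
The type change affects how a consumer should interpret the samples. The sketch below contrasts the two readings, assuming (as the new help text states) that the value is already a throughput in MiB/s. Both function names are hypothetical and serve only as illustration. With counter semantics, a rate has to be derived from the difference between two samples. With gauge semantics, each sample is itself the rate and only needs a unit conversion.

```python
MIB = 1024 * 1024  # bytes per mebibyte


def rate_from_counter(prev_value: float, curr_value: float,
                      interval_seconds: float) -> float:
    """Counter reading: throughput is the delta between samples over time."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (curr_value - prev_value) / interval_seconds


def gauge_mib_per_s_to_bytes_per_s(value_mib_per_s: float) -> float:
    """Gauge reading: the sample is already a rate; convert MiB/s to bytes/s."""
    return value_mib_per_s * MIB


# A gauge sample of 2 MiB/s converts directly, with no second sample needed.
bytes_per_s = gauge_mib_per_s_to_bytes_per_s(2)
```

Dashboards or alerts that previously wrapped this metric in a rate calculation would likely need the same adjustment, since taking a rate of a value that is already a rate does not yield a meaningful throughput.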