fix: correct Prometheus type for NVLink bandwidth fields from counter to gauge #658

Open
allenz92 wants to merge 1 commit into NVIDIA:main from allenz92:fix/nvlink-bandwidth-prom-type
@allenz92
Summary

Fixes the Prometheus metric type for DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL
(and its commented-out per-lane variant DCGM_FI_DEV_NVLINK_BANDWIDTH_L0)
from counter to gauge.

Fixes #417

Root Cause

DCGM field 449 (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL) is an alias for
DCGM_FI_DEV_NVLINK_THROUGHPUT_TOTAL. Internally, DCGM reads the raw
NVML cumulative KiB counters (NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_RX
and NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_TX, field IDs 138/139, unit
documented as KiB in nvml.h) and converts them to an instantaneous
throughput rate in ReadAndCacheNvLinkBandwidth()
(dcgmlib/src/DcgmCacheManager.cpp):

double valueDbl = (double)(currentSum - prevValue->val2.i64);
valueDbl /= timeDiffSec;  // KiB/s
valueDbl /= 1000.0;       // MiB/s (approximate)

The published value is therefore an instantaneous MiB/s rate that
fluctuates with NVLink activity. It is a gauge, not a counter.
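
To make the conversion concrete, here is a minimal standalone sketch of the same arithmetic. It is not DCGM's code: the `NvLinkSample` struct and the `throughputMiBPerSec` function are hypothetical names introduced for illustration, and the zero-interval guard is an assumption rather than something taken from DcgmCacheManager.cpp.

```cpp
#include <cstdint>

// One cached reading of the cumulative NVLink counters (TX + RX, in KiB).
// Hypothetical type for illustration; not DCGM's actual cache structure.
struct NvLinkSample {
    std::int64_t sumKiB;  // cumulative KiB transferred so far
    double timestampSec;  // time the sample was taken, in seconds
};

// Convert two cumulative samples into a throughput value using the
// arithmetic quoted above. Dividing by 1000 rather than 1024 means the
// result is only approximately MiB/s.
// Returns 0.0 when no time has elapsed (assumed guard against division by zero).
double throughputMiBPerSec(const NvLinkSample& prev, const NvLinkSample& curr) {
    const double timeDiffSec = curr.timestampSec - prev.timestampSec;
    if (timeDiffSec <= 0.0) {
        return 0.0;
    }
    double value = static_cast<double>(curr.sumKiB - prev.sumKiB);
    value /= timeDiffSec;  // KiB/s
    value /= 1000.0;       // approximately MiB/s
    return value;
}
```

With a cumulative counter that reads 0, then 2,000,000, then 2,500,000 KiB at one-second intervals, this returns 2000 and then 500. The underlying counter only ever increases, but the derived value drops, which is the behavior reported in #417.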

Impact

Users whose Grafana dashboards apply rate() or increase() to this
metric (the correct treatment for a counter) end up taking the rate of
a value that is already a rate. The result is near zero or meaningless,
and the actual NVLink throughput is hidden.
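
The effect can be illustrated with a simplified model of how PromQL's rate() treats a series. This is a sketch, not Prometheus code: `simplifiedRate` is a hypothetical helper, it assumes evenly spaced samples, and it omits the extrapolation to window boundaries that the real rate() performs. It keeps the one behavior that matters here, which is that any decrease is interpreted as a counter reset.

```cpp
#include <cstddef>
#include <vector>

// Simplified model of PromQL rate() over evenly spaced samples:
// sum the increases between consecutive samples, treat any decrease as a
// counter reset (the new value counts as the increase since the reset),
// and divide by the time covered by the samples.
double simplifiedRate(const std::vector<double>& samples, double intervalSec) {
    if (samples.size() < 2 || intervalSec <= 0.0) {
        return 0.0;
    }
    double increase = 0.0;
    for (std::size_t i = 1; i < samples.size(); ++i) {
        const double delta = samples[i] - samples[i - 1];
        increase += (delta >= 0.0) ? delta : samples[i];
    }
    const double windowSec = intervalSec * static_cast<double>(samples.size() - 1);
    return increase / windowSec;
}
```

For a link holding steady at 1500 MiB/s, `simplifiedRate({1500, 1500, 1500, 1500}, 15.0)` is 0, so a saturated link looks idle. For a fluctuating series such as `{1500, 200, 1400, 300}` at the same interval, the result is about 37.8, a number with no relation to the throughput actually observed.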

Changes

  • Updated metric type from counter to gauge in all 5 config files
  • Updated help text to accurately describe the value as a throughput rate
    (MiB/s, TX+RX combined) rather than "number of NVLink bandwidth counters"
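
A reviewer who wants to confirm that every config file carries the new type could use something like the sketch below. It assumes the config files use the comma-separated layout `FIELD, type, help text` with one entry per line; `trim` and `declaredType` are hypothetical helpers written for this example, and the help text in the usage line is a placeholder, not the wording from this PR.

```cpp
#include <sstream>
#include <string>

// Strip leading and trailing spaces and tabs.
std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos) {
        return "";
    }
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Return the Prometheus type declared on a "FIELD, type, help" line,
// or an empty string if the line does not declare `field`.
// Assumed format; lines whose first column is not exactly `field`
// (including commented-out entries starting with '#') yield "".
std::string declaredType(const std::string& line, const std::string& field) {
    std::istringstream in(line);
    std::string name;
    std::string type;
    if (!std::getline(in, name, ',') || !std::getline(in, type, ',')) {
        return "";
    }
    return trim(name) == field ? trim(type) : "";
}
```

Calling `declaredType("DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, gauge, placeholder help", "DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL")` returns `gauge`. Because commented-out entries are skipped, the per-lane DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 line would still need to be checked by eye.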

… to gauge

DCGM field 449 (DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL) is declared as
`counter` but DCGM internally computes an instantaneous throughput rate
(MiB/s) via ReadAndCacheNvLinkBandwidth() in DcgmCacheManager.cpp:

    double valueDbl = (currentKiB - prevKiB) / timeDiffSec / 1000.0;

The NVML source fields (NVML_FI_DEV_NVLINK_THROUGHPUT_DATA_RX/TX, field
IDs 138/139) are documented as cumulative KiB in nvml.h. DCGM converts
them to an instantaneous MiB/s rate, so the exported value fluctuates
with GPU load and can decrease between samples.

This makes the metric a gauge by definition (Prometheus docs: "A gauge
is a metric that represents a single numerical value that can arbitrarily
go up and down"). Marking it as counter causes dashboards that apply
rate() or increase() to produce near-zero or meaningless results.

Also corrects the help text to accurately describe the value as
throughput rate rather than "number of counters".

Fixes NVIDIA#417

Signed-off-by: Allen Zhou <loveinjavac@gmail.com>
Development

Successfully merging this pull request may close these issues.

dcgm-exporter counter value goes down