
Releases: NVIDIA/dcgm-exporter

4.5.3-4.8.2

07 May 00:28
691c927

  • Update to DCGM 4.5.3 and DCGM Exporter 4.8.2.
  • Improve GPU health metrics, including reporting GPU-wide health incidents such as fallen-off-bus XIDs.
  • Make /debug/pprof profiling endpoints opt-in via --enable-pprof / DCGM_EXPORTER_ENABLE_PPROF.
  • Add PodMapper informer caching for Kubernetes pod mapping (#626) (@jaeeyoungkim).
  • Add per-process GPU metrics for time-sharing and MIG (#594) (@krystiancastai).
  • Make Helm priorityClassName configurable with explicit defaults (#444) (@runzhliu).
  • Add MIG device support for HPC job labels (#602) (@jay-mckay).
  • Update go-dcgm field metadata handling, deprecated field alias resolution, health constants, policy registration handling, and version info APIs.
  • Document IPv6 address formats for remote hostengine and metrics listen addresses.
  • Refresh dependencies, container base images, Docker image references, Helm chart values, Kubernetes manifests, and tests for this release.

4.5.2-4.8.1

09 Feb 15:43
52ffa18

  • Update to DCGM 4.5.2, the latest Go 1.24 release, and updated base containers
  • Fix distroless symlink issue
  • Fix for parsing blank XIDs
  • Fix for NVLink entities starting at offset 1

4.5.1-4.8.0

28 Jan 22:10
3cb017b

  • Update to DCGM 4.5.1
  • Enable monitoring of GPU bind/unbind events and automatic reloading (beta) (@nvvfedorov)
  • Sync default metric watchlist for Docker and Helm (@faizan-exe)
  • Fix health endpoint behavior (@Alja9)
  • Increase default memory limit to 512Mi (@faizan-exe)
  • Make scrapeTimeout configurable (@faizan-exe)
  • Fix P2P Status mappings (@wkd-woo)

NOTICE: The Helm chart now uses the distroless container by default.

4.4.2-4.7.1

10 Dec 15:30
b921c57

  • Update Go-DCGM
  • Update XID error texts based on "XID errors v580" (#588)
  • Fix violation time units (us->ns) (#589)
  • Update README to include OpenObserve blog and dashboards for … (#580)

4.4.2-4.7.0

18 Nov 17:47
54267f2

4.4.1-4.6.0

13 Oct 17:54
13ad457

  • Add allow list for pod label filtering (#564)
  • Handle uninitialized map (#563)
  • Add hostPID field to values.yaml and DaemonSet template for Helm (#503)
  • Add option to fail on NVML provider initialization error (#557)
  • Improve support for GPU NVLink monitoring

4.4.1-4.5.2

17 Sep 15:00
a5a5aa8

  • Follow FHS conventions for logging (#556)
  • Fix: enable metrics without pods when using kubernetes-enable-{dra,virtual-gpus} (#554)
  • Add a flag to disable startup validation (#555)
  • HPC: reduce error logging when the jobs directory does not exist

4.4.0-4.5.0

19 Aug 16:10
4ecf9b6

  • Update to DCGM 4.4 and CUDA 13.0
  • Kubernetes UID support (@andrew-leung)
  • Create distroless container target

4.3.1-4.4.0

07 Aug 15:12
6949141

  • Update to DCGM 4.3.1
  • Update podapi for DRA
  • Enable DCGM_EXP_P2P_STATUS for reporting GPU peer-to-peer NVLink status
  • Fix for empty HPC directory
  • Enable InitContainer support

4.2.3-4.2.0

11 Jul 14:25
9378144

DCGM-Exporter 4.2.3-4.2.0

  • [ISSUE-512] Add a debugging facility to dump runtime objects to files
  • Kubernetes pod label support