Releases: NVIDIA/dcgm-exporter
4.5.3-4.8.2
- Update to DCGM 4.5.3 and DCGM Exporter 4.8.2.
- Improve GPU health metrics, including reporting GPU-wide health incidents such as fallen-off-bus XIDs.
- Make /debug/pprof profiling endpoints opt-in via --enable-pprof / DCGM_EXPORTER_ENABLE_PPROF.
- Add PodMapper informer caching for Kubernetes pod mapping (#626) (@jaeeyoungkim).
- Add per-process GPU metrics for time-sharing and MIG (#594) (@krystiancastai).
- Make Helm priorityClassName configurable with explicit defaults (#444) (@runzhliu).
- Add MIG device support for HPC job labels (#602) (@jay-mckay).
- Update go-dcgm field metadata handling, deprecated field alias resolution, health constants, policy registration handling, and version info APIs.
- Document IPv6 address formats for remote hostengine and metrics listen addresses.
- Refresh dependencies, container base images, Docker image references, Helm chart values, Kubernetes manifests, and tests for this release.
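The IPv6 item above matters because a bare IPv6 literal contains colons, so a host:port string is ambiguous unless the address is wrapped in square brackets (for example [::1]:9400). The following is a minimal Python sketch of that general convention, not the exporter's own parsing code. The helper name `split_host_port` is hypothetical, the port numbers are only illustrative, and the exact formats the exporter accepts are those described in the project documentation.

```python
import ipaddress


def split_host_port(addr: str) -> tuple[str, int]:
    """Split 'host:port' or '[ipv6]:port' into (host, port).

    Hypothetical helper for illustration; it follows the common
    bracketed-IPv6 convention, not any dcgm-exporter API.
    """
    if addr.startswith("["):
        # Bracketed form: everything up to ']:' is the IPv6 literal.
        host, sep, port = addr[1:].partition("]:")
        if not sep:
            raise ValueError(f"missing ']:' in {addr!r}")
        ipaddress.IPv6Address(host)  # raises ValueError if not valid IPv6
    else:
        # Unbracketed form: hostname or IPv4, split at the last colon.
        host, sep, port = addr.rpartition(":")
        if not sep:
            raise ValueError(f"missing port in {addr!r}")
        if ":" in host:
            raise ValueError(f"IPv6 literal must be bracketed: {addr!r}")
    return host, int(port)


if __name__ == "__main__":
    for example in ("[::1]:9400", "[::]:9400", "localhost:5555", ":9400"):
        print(example, "->", split_host_port(example))
```

An empty host, as in :9400, is returned as an empty string here. Many servers treat that form as "listen on all interfaces", but this sketch leaves that interpretation to the caller.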
4.5.2-4.8.1
- Update to DCGM 4.5.2, latest Go 1.24, and base containers
- Fix distroless symlink issue
- Fix for parsing blank XIDs
- Fix for NVLink entities starting at offset 1
4.5.1-4.8.0
- Update to DCGM 4.5.1
- Enabled monitoring of GPU bind/unbind events and automatic reloading (@nvvfedorov) - beta
- Sync default metric watchlist for docker and helm (@faizan-exe)
- Fix health endpoint behavior (@Alja9)
- Increase default memory limit to 512Mi (@faizan-exe)
- Make scrapeTimeout configurable (@faizan-exe)
- Fix P2P Status mappings (@wkd-woo)
NOTICE: Helm chart now uses distroless container by default
4.4.2-4.7.1
4.4.2-4.7.0
- Security improvements in startup and pipeline @nvvfedorov
- Crash if collector cannot be initialized under startup validation (#578) @daveoy
- Track unallocated GPUs in DRA (#570) @JiangJiaWei1103
- Fix unbounded label cache size (#574) @andrew-leung
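The notes do not say how the label cache was bounded in #574. As a generic illustration of the problem class, the Python sketch below shows one common approach: a fixed-capacity cache that evicts the least recently used entry. The class name `BoundedCache` and its methods are hypothetical and are not taken from the exporter's code.

```python
from collections import OrderedDict


class BoundedCache:
    """Fixed-capacity cache with least-recently-used eviction.

    Illustrative only; not the implementation used by dcgm-exporter.
    """

    def __init__(self, max_entries: int) -> None:
        if max_entries <= 0:
            raise ValueError("max_entries must be positive")
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key):
        """Return the cached value, or None if the key is absent."""
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        """Insert or update a value, evicting old entries if over capacity."""
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # drop least recently used

    def __len__(self) -> int:
        return len(self._data)
```

With a capacity of 2, inserting a third key evicts whichever of the first two was touched least recently, so memory use stays flat no matter how many distinct label sets are seen over time.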
4.4.1-4.6.0
4.4.1-4.5.2
4.4.0-4.5.0
- Update to DCGM 4.4 and CUDA 13.0
- Kubernetes UID support (@andrew-leung)
- Create distroless container target
4.3.1-4.4.0
- Update to DCGM 4.3.1
- Update podapi for DRA
- Enable DCGM_EXP_P2P_STATUS for reporting GPU peer-to-peer NVLink status
- Fix for empty HPC directory
- Enable InitContainer support
4.2.3-4.2.0
DCGM-Exporter 4.2.3-4.2.0
- [ISSUE-512] Added a new debugging facility to dump runtime objects into files
- Kubernetes pod label support