I am seeing i/o timeout errors when Prometheus scrapes metrics from dcgm-exporter. In addition, I cannot get Pod mapping to work: the exported metrics are missing the pod and namespace labels.
Deployment method: Helm Chart 4.7.1
Image: nvcr.io/nvidia/k8s/dcgm-exporter:4.5.1-4.8.0-distroless
Kubernetes: 1.33.6
GPU: NVIDIA H100 (4 MIG)
values.yaml
kubernetes:
  enablePodLabels: true
  enablePodUID: true
rbac:
  create: true
image:
  repository: nvcr.io/nvidia/k8s/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 4.5.1-4.8.0-distroless
extraEnv:
  - name: DCGM_EXPORTER_INTERVAL
    value: "1000"
  - name: DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE
    value: "device-name"
serviceMonitor:
  enabled: true
  interval: 60s
  honorLabels: true
  additionalLabels:
    release: kube-prometheus-stack
LOGS
time=2026-02-02T04:42:44.018Z level=ERROR msg="Failed to write response." error="write tcp 10.42.0.54:9400->10.42.0.66:47334: i/o timeout"
time=2026-02-02T04:42:44.018Z level=INFO msg="http: superfluous response.WriteHeader call from github.com/NVIDIA/dcgm-exporter/internal/pkg/server.(*MetricsServer).Metrics (server.go:257)
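
To help reproduce both symptoms outside Prometheus, here is a minimal Python sketch that times a single scrape and counts how many samples lack the pod and namespace labels. The helper names (missing_label_report, timed_scrape) are my own and not part of dcgm-exporter. The sketch assumes the exporter serves the standard Prometheus text format on /metrics at port 9400 (the port shown in the log above), and it treats an empty label value as missing.

```python
import re
import time
import urllib.request

# Matches one label pair inside {...}, e.g. pod="trainer-0"
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def missing_label_report(exposition_text, required=("pod", "namespace")):
    """Return {metric_name: (total_samples, samples_lacking_a_required_label)}.

    A label counts as lacking when it is absent or has an empty value
    (assumption: unmapped GPUs may be exported with pod="").
    """
    report = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name, brace, rest = line.partition("{")
        if brace:
            # Everything before the last "}" is the label set.
            labels = dict(_LABEL_RE.findall(rest.rsplit("}", 1)[0]))
        else:
            # Sample without any labels: "name value"
            name, labels = line.split()[0], {}
        total, missing = report.get(name, (0, 0))
        lacks = any(not labels.get(key) for key in required)
        report[name] = (total + 1, missing + (1 if lacks else 0))
    return report


def timed_scrape(url, timeout_s=10.0):
    """Fetch the metrics page once and return (elapsed_seconds, body_text)."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    return time.monotonic() - start, body
```

Run from a pod that can reach the exporter, something like elapsed, body = timed_scrape("http://10.42.0.54:9400/metrics") followed by missing_label_report(body) should show whether the response is slow to arrive and which metric families come back without pod mapping. The address is taken from the log above and will differ per cluster.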