Scrape failed: "i/o timeout" and missing Pod metadata #625

@jpboy8

I am seeing i/o timeout errors when Prometheus scrapes metrics from dcgm-exporter. In addition, I cannot get Pod mapping to work: the exported metrics are missing the pod and namespace labels.
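To make the missing-label symptom concrete, here is a minimal sketch of how a saved copy of the /metrics output could be checked for samples that lack pod or namespace. The function name, the regular expressions, and the EXAMPLE text are illustrative assumptions; the sample lines are hypothetical and were not captured from this cluster.

```python
import re

# One exposition-format sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# A single label pair inside the braces, e.g. pod="trainer-0".
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

REQUIRED = ("pod", "namespace")


def samples_missing_pod_labels(text, required=REQUIRED):
    """Return (metric, missing_labels) for each sample lacking a non-empty required label."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = SAMPLE_RE.match(line)
        if not m:
            continue
        labels = dict(LABEL_RE.findall(m.group(2) or ""))
        missing = [k for k in required if not labels.get(k)]
        if missing:
            out.append((m.group(1), missing))
    return out


# Hypothetical scrape output, for illustration only (not taken from the cluster).
EXAMPLE = '''
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",Hostname="node-a"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1",Hostname="node-a",namespace="ml",pod="trainer-0"} 43
'''

if __name__ == "__main__":
    for metric, missing in samples_missing_pod_labels(EXAMPLE):
        print(metric, "missing:", ", ".join(missing))
```

An empty label value is treated the same as an absent label, since both leave the sample without a usable Pod mapping.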

Deployment method: Helm Chart 4.7.1
Image: nvcr.io/nvidia/k8s/dcgm-exporter:4.5.1-4.8.0-distroless
Kubernetes: 1.33.6
GPU: Nvidia H100 (4 MIG)

values.yaml

kubernetes:
  enablePodLabels: true
  enablePodUID: true
  rbac:
    create: true
image:
  repository: nvcr.io/nvidia/k8s/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 4.5.1-4.8.0-distroless
extraEnv:
  - name: DCGM_EXPORTER_INTERVAL
    value: "1000"
  - name: DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE
    value: "device-name"
serviceMonitor:
  enabled: true
  interval: 60s
  honorLabels: true
  additionalLabels:
    release: kube-prometheus-stack

LOGS

time=2026-02-02T04:42:44.018Z level=ERROR msg="Failed to write response." error="write tcp 10.42.0.54:9400->10.42.0.66:47334: i/o timeout"
time=2026-02-02T04:42:44.018Z level=INFO msg="http: superfluous response.WriteHeader call from github.com/NVIDIA/dcgm-exporter/internal/pkg/server.(*MetricsServer).Metrics (server.go:257)"
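For reading the first log line, a small sketch that splits the key=value fields and pulls the two endpoints out of the error text. Go reports a failed TCP write as "write tcp <local>-><remote>: <reason>", so the :9400 side is the exporter and the other address is the client that was scraping, presumably the Prometheus Pod (that part is an assumption). The helper name and regular expression are illustrative.

```python
import re
import shlex


def parse_log_line(line):
    """Split a key=value log line (values may be double-quoted) into a dict."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


# Matches Go's "write tcp <local>-><remote>: <reason>" error text.
ADDR_RE = re.compile(r"write tcp (\S+)->(\S+): (.+)")

# The ERROR line from the logs above.
LINE = ('time=2026-02-02T04:42:44.018Z level=ERROR msg="Failed to write response." '
        'error="write tcp 10.42.0.54:9400->10.42.0.66:47334: i/o timeout"')

fields = parse_log_line(LINE)
local, remote, reason = ADDR_RE.search(fields["error"]).groups()

if __name__ == "__main__":
    print("exporter:", local)
    print("client:  ", remote)
    print("reason:  ", reason)
```

Note that shlex.split raises ValueError on a line whose quotes are unbalanced, so truncated log lines need to be repaired or skipped before parsing.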
