DCGM Exporter DaemonSet crashes in Kubernetes due to missing ServiceAccount token (/var/run/secrets/.../token) #624

@MepHist2721Y

Summary

dcgm-exporter fails to run reliably as a DaemonSet in Kubernetes when DCGM_EXPORTER_KUBERNETES is enabled.
The container crashes with: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
Even when a ServiceAccount exists and is referenced, the token is not mounted or is ignored, causing the exporter to crash repeatedly.
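
To confirm whether the token volume is really absent (rather than present but unreadable), a small check like the one below can be run from any container that uses the same ServiceAccount. This is only a sketch: the directory comes from the error message above, the list of expected files (token, ca.crt, namespace) reflects what Kubernetes normally projects into that directory, and the function name missing_sa_files is made up for illustration.

```python
from pathlib import Path

# Mount point taken from the error message in this issue.
SA_DIR = Path("/var/run/secrets/kubernetes.io/serviceaccount")

# Files Kubernetes normally projects into that directory.
EXPECTED_FILES = ("token", "ca.crt", "namespace")


def missing_sa_files(sa_dir=SA_DIR):
    """Return the expected ServiceAccount files that are absent from sa_dir."""
    sa_dir = Path(sa_dir)
    # is_file() follows symlinks, which is how projected volumes expose files.
    return [name for name in EXPECTED_FILES if not (sa_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_sa_files()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("ServiceAccount token volume looks complete")
```

If all three files are reported missing, the volume was most likely never mounted, which points at the pod spec or the ServiceAccount rather than at the exporter itself.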

Environment

Kubernetes version: v1.28.15
OS: Ubuntu 22.04
GPU nodes: NVIDIA DGX A100 (SXM4 40GB)
Driver: 535.xx
CUDA: 12.2
MIG mode: Enabled (mixed)
Container runtime: containerd
Helm chart: dcgm-exporter
Image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.2-4.7.1-ubuntu22.04

Expected Behavior

The dcgm-exporter DaemonSet should:

  • Start successfully on all GPU nodes
  • Collect GPU / MIG metrics
  • Optionally map GPU usage to Kubernetes pods when DCGM_EXPORTER_KUBERNETES=true

Actual Behavior

  • Pods enter CrashLoopBackOff

  • Logs consistently show: ERROR msg="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"

  • This happens even when:
    - A ServiceAccount exists
    - The ServiceAccount is explicitly set in Helm values
    - automountServiceAccountToken is enabled
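
One way to verify the three conditions above against what the API server actually scheduled is to inspect the rendered pod object, for example the output of `kubectl get pod <pod> -n <namespace> -o json`. The sketch below assumes that JSON has been loaded into a dict; the function name token_mount_report and the sample pod are hypothetical. I have not verified that the chart passes the top-level automountServiceAccountToken value through to the pod spec, which is part of what this check would show. Note also that a ServiceAccount object can itself set automountServiceAccountToken: false; per the Kubernetes documentation, the pod-level field takes precedence when both are set.

```python
SA_MOUNT_PATH = "/var/run/secrets/kubernetes.io/serviceaccount"


def token_mount_report(pod):
    """Summarise the ServiceAccount settings of a pod object (dict from JSON)."""
    spec = pod.get("spec", {})
    # Names of containers that actually have the token directory mounted.
    mounted_in = [
        c.get("name")
        for c in spec.get("containers", [])
        if any(m.get("mountPath") == SA_MOUNT_PATH
               for m in c.get("volumeMounts", []))
    ]
    return {
        "serviceAccountName": spec.get("serviceAccountName"),
        # None means the field is not set on the pod at all.
        "automountServiceAccountToken": spec.get("automountServiceAccountToken"),
        "containers_with_token_mount": mounted_in,
    }


if __name__ == "__main__":
    # Hypothetical pod with no token mount, resembling the failing case.
    sample = {
        "spec": {
            "serviceAccountName": "dcgm-exporter",
            "containers": [
                {
                    "name": "exporter",
                    "volumeMounts": [
                        {"mountPath": "/var/lib/kubelet/pod-resources"}
                    ],
                }
            ],
        }
    }
    print(token_mount_report(sample))
```

An empty containers_with_token_mount list for a crashing pod would mean the token volume was never injected, regardless of what the Helm values say.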

Pod Logs

time=2026-01-29T07:46:24Z level=INFO msg="Starting dcgm-exporter"
time=2026-01-29T07:46:31Z level=INFO msg="DCGM successfully initialized!"
time=2026-01-29T07:46:32Z level=INFO msg="Collecting DCP Metrics"
time=2026-01-29T07:46:32Z level=ERROR msg="open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"
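
When collecting logs from several nodes, it can help to pull out only the ERROR entries. The sketch below assumes every line follows the time=... level=... msg="..." layout shown above; the regular expression and the function name error_messages are illustrative and not part of dcgm-exporter.

```python
import re

# Matches the key=value layout of the log lines quoted in this issue.
LINE_RE = re.compile(
    r'time=(?P<time>\S+) level=(?P<level>\S+) msg="(?P<msg>.*)"$'
)


def error_messages(log_text):
    """Return (timestamp, message) pairs for every ERROR line in log_text."""
    errors = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERROR":
            errors.append((match.group("time"), match.group("msg")))
    return errors


if __name__ == "__main__":
    sample = (
        'time=2026-01-29T07:46:31Z level=INFO msg="DCGM successfully initialized!"\n'
        'time=2026-01-29T07:46:32Z level=ERROR msg="open /var/run/secrets/'
        'kubernetes.io/serviceaccount/token: no such file or directory"\n'
    )
    for timestamp, message in error_messages(sample):
        print(timestamp, message)
```

In the logs above, the only ERROR entry is the token error, and it appears after DCGM has initialized and metric collection has started.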

Helm Values used

arguments:
  - "-c=1000"
  - "--collectors=nvml,dcgm"

extraEnv:
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"

nodeSelector:
  nvidia.com/gpu.present: "true"

resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

serviceAccount:
  create: false
  name: dcgm-exporter

automountServiceAccountToken: true

Additional Observations

  • nvidia-dcgm.service is running correctly on the host:

    systemctl status nvidia-dcgm
    Active: active (running)

  • /var/lib/kubelet/pod-resources is mounted correctly

  • GPU labels (nvidia.com/gpu.present=true) are present

  • Node has no taints

  • Issue occurs on multiple nodes

  • Disabling Kubernetes mode:

    DCGM_EXPORTER_KUBERNETES=false

    avoids the token error, but then pod-level metrics are unavailable

Questions

  • How can I get this pod up and running?
  • Is DCGM_EXPORTER_KUBERNETES=true strictly required for the dcgm-exporter DaemonSet?
  • Is there a supported way to run dcgm-exporter in Kubernetes without requiring a ServiceAccount token?
  • Is Kubernetes v1.28 officially supported for pod-level GPU metrics?

Thank you

Thanks for maintaining dcgm-exporter — any guidance on the correct configuration or confirmation of a bug would be appreciated.

@NVIDIA
@kubernetes
@helm
@prometheus
@grafana
