"Couldn't get pod metadata" and client-go throttling when using Kubernetes pod labels #551

@Li357

What is the version?

4.4.0-4.5.0

What happened?

I am using DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS=true and am getting client-go throttling (some pod names and namespaces are redacted in the logs below). I added the ClusterRole and ClusterRoleBindings manually since I'm deploying via GPU Operator.

time=2025-09-02T06:37:39.937Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-container-toolkit-daemonset-2vqvd namespace=... error="client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
time=2025-09-02T06:37:39.937Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-mig-manager-lxtgc namespace=... error="client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
time=2025-09-02T06:37:39.937Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-operator-validator-5wcrx namespace=... error="client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
E0902 06:37:39.954272       1 request.go:1360] "Unexpected error when reading response body" err="context deadline exceeded"
time=2025-09-02T06:37:39.954Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-dcgm-exporter-vq78p namespace=... error="unexpected error when reading response body. Please retry. Original error: context deadline exceeded"
time=2025-09-02T06:37:39.955Z level=ERROR msg="Failed to write response." error="write tcp 10.68.3.25:9400->10.68.0.19:38888: i/o timeout"
time=2025-09-02T06:37:39.955Z level=INFO msg="http: superfluous response.WriteHeader call from github.com/NVIDIA/dcgm-exporter/internal/pkg/server.(*MetricsServer).Metrics (server.go:196)"
E0902 06:37:40.155182       1 request.go:1360] "Unexpected error when reading response body" err="context deadline exceeded"
time=2025-09-02T06:37:40.155Z level=WARN msg="Couldn't get pod metadata" pod=kube-prometheus-stack-prometheus-node-exporter-n6cqg namespace=... error="unexpected error when reading response body. Please retry. Original error: context deadline exceeded"
I0902 06:37:40.343074       1 request.go:752] "Waited before sending request" delay="9.990030013s" reason="client-side throttling, not priority and fairness" verb="GET" URL="https://34.118.224.1:443/api/v1/namespaces/default/pods/..."
time=2025-09-02T06:37:40.353Z level=WARN msg="Couldn't get pod metadata" pod=admin-shell-msc2q namespace=default error="Get \"https://34.118.224.1:443/api/v1/namespaces/default/pods/...\": context deadline exceeded"

It seems like the initial few requests fail ("client-side throttling, not priority and fairness"), and then the client reaches its QPS limit and throttles itself.
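To get a rough feel for the delays involved, here is a minimal sketch of the token-bucket arithmetic. It assumes the exporter issues one GET per pod back to back, and that the client uses client-go's documented defaults of QPS 5 and burst 10 (whether dcgm-exporter actually leaves them at the defaults is an assumption, not something confirmed here). The function name `throttle_delays` is made up for illustration.

```python
def throttle_delays(num_requests, qps=5.0, burst=10):
    """Estimate the client-side wait (seconds) before each request in a batch.

    Models a token bucket that starts full with `burst` tokens and refills at
    `qps` tokens per second. All requests are assumed to be queued at t=0.
    The defaults mirror client-go's default QPS/burst (an assumption about
    how the exporter's client is configured).
    """
    delays = []
    for i in range(num_requests):
        if i < burst:
            # The first `burst` requests consume the initial tokens at once.
            delays.append(0.0)
        else:
            # Every later request waits for one more token to be refilled.
            delays.append((i - burst + 1) / qps)
    return delays


delays = throttle_delays(50)
print(max(delays))  # -> 8.0
```

Under these assumptions a single pass over 50 pods starting from a full bucket finishes queuing within about 8 seconds. If the bucket has not refilled since the previous pass, the waits grow further, and any request whose wait would outlast its context deadline fails immediately with "would exceed context deadline".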

What did you expect to happen?

There shouldn't be this level of throttling; I have fewer than 50 pods. Is there a way to change the default QPS? Or is there some other underlying error that's causing the first requests to fail?

What is the GPU model?

Tested on A100 40GB, A100 80GB, H100 80GB.

What is the environment?

Via the GPU Operator's DCGM Exporter DaemonSet. Here's the container spec:

      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: /etc/dcgm-exporter/dcgm-metrics.csv
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS
          value: "true"
        - name: DCGM_EXPORTER_INTERVAL
          value: "1000"
        image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.0-4.5.0-ubuntu22.04

How did you deploy the dcgm-exporter and what is the configuration?

DCGM Exporter deployed via GPU Operator with extra ClusterRole and ClusterRoleBindings to enable querying pods:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dcgm-exporter-read-pods
rules:
- apiGroups:
  - ""
  - resource.k8s.io
  resources:
  - pods
  - resourceslices
  verbs:
  - get
  - list
  - watch

How to reproduce the issue?

No response

Anything else we need to know?

No response

Labels: bug (Something isn't working)
