"Couldn't get pod metadata" and client-go throttling when using Kubernetes pod labels

### What is the version?

4.4.0-4.5.0

### What happened?

I am using `DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS=true` and I'm seeing client-go throttling (some pod names and namespaces are removed from the logs below). I added the ClusterRole and ClusterRoleBindings manually since I'm deploying via GPU Operator.

```
time=2025-09-02T06:37:39.937Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-container-toolkit-daemonset-2vqvd namespace=... error="client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
time=2025-09-02T06:37:39.937Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-mig-manager-lxtgc namespace=... error="client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
time=2025-09-02T06:37:39.937Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-operator-validator-5wcrx namespace=... error="client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
E0902 06:37:39.954272       1 request.go:1360] "Unexpected error when reading response body" err="context deadline exceeded"
time=2025-09-02T06:37:39.954Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-dcgm-exporter-vq78p namespace=... error="unexpected error when reading response body. Please retry. Original error: context deadline exceeded"
time=2025-09-02T06:37:39.955Z level=ERROR msg="Failed to write response." error="write tcp 10.68.3.25:9400->10.68.0.19:38888: i/o timeout"
time=2025-09-02T06:37:39.955Z level=INFO msg="http: superfluous response.WriteHeader call from github.com/NVIDIA/dcgm-exporter/internal/pkg/server.(*MetricsServer).Metrics (server.go:196)"
E0902 06:37:40.155182       1 request.go:1360] "Unexpected error when reading response body" err="context deadline exceeded"
time=2025-09-02T06:37:40.155Z level=WARN msg="Couldn't get pod metadata" pod=kube-prometheus-stack-prometheus-node-exporter-n6cqg namespace=... error="unexpected error when reading response body. Please retry. Original error: context deadline exceeded"
I0902 06:37:40.343074       1 request.go:752] "Waited before sending request" delay="9.990030013s" reason="client-side throttling, not priority and fairness" verb="GET" URL="https://34.118.224.1:443/api/v1/namespaces/default/pods/..."
time=2025-09-02T06:37:40.353Z level=WARN msg="Couldn't get pod metadata" pod=admin-shell-msc2q namespace=default error="Get \"https://34.118.224.1:443/api/v1/namespaces/default/pods/...\": context deadline exceeded"
```

It looks like the first few requests fail, and then the client reaches its QPS limit and throttles on the client side (the log says "client-side throttling, not priority and fairness").
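To put rough numbers on the throttling described above, here is a minimal sketch of a token-bucket limiter. It assumes client-go's default limits of 5 QPS with a burst of 10 are in effect; I have not confirmed what dcgm-exporter actually configures. It also assumes all pod lookups are queued at the same moment. The function name `token_bucket_delays` and the 10 second deadline are hypothetical, chosen only for illustration.

```python
def token_bucket_delays(n_requests, qps=5.0, burst=10):
    """Return the wait in seconds before each of n_requests requests
    that are all queued at t=0 against a full token bucket.

    Assumption: qps=5.0 and burst=10 mirror client-go's defaults;
    the exporter's real settings may differ.
    """
    delays = []
    for i in range(n_requests):
        if i < burst:
            # The bucket starts full, so the first `burst` requests go out at once.
            delays.append(0.0)
        else:
            # Every later request waits for one more token, refilled at `qps` per second.
            delays.append((i - burst + 1) / qps)
    return delays


def count_over_deadline(delays, deadline_s):
    """Count requests whose limiter wait alone reaches or exceeds the deadline."""
    return sum(1 for d in delays if d >= deadline_s)


if __name__ == "__main__":
    delays = token_bucket_delays(60)
    print("last request waits", delays[-1], "seconds")
    # Hypothetical 10 s deadline per scrape, for illustration only.
    print("requests at or over a 10 s deadline:", count_over_deadline(delays, 10.0))
```

Under these assumptions, 50 lookups issued together would leave the last one waiting 8 seconds in the limiter before it is even sent, which is in the same range as the `delay="9.990030013s"` in the log. If a new scrape starts before the previous backlog drains, the waits would stack further.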

### What did you expect to happen?

There shouldn't be this level of throttling; I have fewer than 50 pods. Is there a way to change the default QPS? Or is there some other underlying error that's causing the first requests to fail?

### What is the GPU model?

Tested on A100 40GB, A100 80GB, H100 80GB.

### What is the environment?

Via GPU operator's DCGM exporter daemonset. Here's the container spec:

```
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: /etc/dcgm-exporter/dcgm-metrics.csv
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS
          value: "true"
        - name: DCGM_EXPORTER_INTERVAL
          value: "1000"
        image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.0-4.5.0-ubuntu22.04
```

### How did you deploy the dcgm-exporter and what is the configuration?

DCGM Exporter is deployed via GPU Operator, with an extra ClusterRole and ClusterRoleBindings to allow querying pods:

```
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dcgm-exporter-read-pods
rules:
- apiGroups:
  - ""
  - resource.k8s.io
  resources:
  - pods
  - resourceslices
  verbs:
  - get
  - list
  - watch
```

### How to reproduce the issue?

_No response_

### Anything else we need to know?

_No response_

"Couldn't get pod metadata" and client-go throttling when using Kubernetes pod labels #551
