What is the version?
4.4.0-4.5.0
What happened?
I am using DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS=true but am seeing client-go throttling (some pod names and namespaces removed from the logs). I added the ClusterRole and ClusterRoleBindings manually since I'm deploying via GPU Operator.
time=2025-09-02T06:37:39.937Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-container-toolkit-daemonset-2vqvd namespace=... error="client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
time=2025-09-02T06:37:39.937Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-mig-manager-lxtgc namespace=... error="client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
time=2025-09-02T06:37:39.937Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-operator-validator-5wcrx namespace=... error="client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline"
E0902 06:37:39.954272 1 request.go:1360] "Unexpected error when reading response body" err="context deadline exceeded"
time=2025-09-02T06:37:39.954Z level=WARN msg="Couldn't get pod metadata" pod=nvidia-dcgm-exporter-vq78p namespace=... error="unexpected error when reading response body. Please retry. Original error: context deadline exceeded"
time=2025-09-02T06:37:39.955Z level=ERROR msg="Failed to write response." error="write tcp 10.68.3.25:9400->10.68.0.19:38888: i/o timeout"
time=2025-09-02T06:37:39.955Z level=INFO msg="http: superfluous response.WriteHeader call from github.com/NVIDIA/dcgm-exporter/internal/pkg/server.(*MetricsServer).Metrics (server.go:196)"
E0902 06:37:40.155182 1 request.go:1360] "Unexpected error when reading response body" err="context deadline exceeded"
time=2025-09-02T06:37:40.155Z level=WARN msg="Couldn't get pod metadata" pod=kube-prometheus-stack-prometheus-node-exporter-n6cqg namespace=... error="unexpected error when reading response body. Please retry. Original error: context deadline exceeded"
I0902 06:37:40.343074 1 request.go:752] "Waited before sending request" delay="9.990030013s" reason="client-side throttling, not priority and fairness" verb="GET" URL="https://34.118.224.1:443/api/v1/namespaces/default/pods/..."
time=2025-09-02T06:37:40.353Z level=WARN msg="Couldn't get pod metadata" pod=admin-shell-msc2q namespace=default error="Get \"https://34.118.224.1:443/api/v1/namespaces/default/pods/...\": context deadline exceeded"
It seems like the first few requests fail, then the client reaches its QPS limit and throttles on the client side ("client-side throttling, not priority and fairness").
What did you expect to happen?
I wouldn't expect this level of throttling with fewer than 50 pods. Is there a way to change the default QPS? Or is there some other underlying error that's causing the first requests to fail?
What is the GPU model?
Tested on A100 40GB, A100 80GB, H100 80GB.
What is the environment?
Via GPU operator's DCGM exporter daemonset. Here's the container spec:
containers:
- env:
  - name: DCGM_EXPORTER_LISTEN
    value: :9400
  - name: DCGM_EXPORTER_KUBERNETES
    value: "true"
  - name: DCGM_EXPORTER_COLLECTORS
    value: /etc/dcgm-exporter/dcgm-metrics.csv
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  - name: DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS
    value: "true"
  - name: DCGM_EXPORTER_INTERVAL
    value: "1000"
  image: nvcr.io/nvidia/k8s/dcgm-exporter:4.4.0-4.5.0-ubuntu22.04
How did you deploy the dcgm-exporter and what is the configuration?
DCGM Exporter deployed via GPU Operator with extra ClusterRole and ClusterRoleBindings to enable querying pods:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: dcgm-exporter-read-pods
rules:
- apiGroups:
  - ""
  - resource.k8s.io
  resources:
  - pods
  - resourceslices
  verbs:
  - get
  - list
  - watch
How to reproduce the issue?
No response
Anything else we need to know?
No response