What is the version?
4.2.3-4.1.1
What happened?
A large number of pods are running on one GPU node. Querying that node's metrics from the dcgm-exporter returns an internal server error.
$ kubectl get po -n monitor-system -owide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE
prometheus-node-exporter-mb9d8   1/1     Running   0          98d   10.11.11.12   node1
$ curl 10.11.11.12:9400/metrics
internal server error
The dcgm-exporter logs:
......
time=2025-06-06T06:31:16.632Z level=ERROR msg="Failed to apply transformations on metrics" error="failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4753177 vs. 4194304)" fieldEntityGroup=GPU metrics="map[{Fi
......
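To make the numbers in that error concrete: 4194304 bytes is exactly 4 MiB, which is the default maximum receive size of a grpc-go client, and the pod resources response was 4753177 bytes. The following is a minimal sketch, not code from dcgm-exporter, that parses the size pair out of the message and computes how far over the limit the response was. The helper names `parseOverflow` and `nextPowerOfTwoMiB` are hypothetical, and rounding up to a power-of-two MiB value is only one assumed way to pick a larger limit.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sizeRe matches the "(received vs. limit)" pair in the error text from the log above.
var sizeRe = regexp.MustCompile(`larger than max \((\d+) vs\. (\d+)\)`)

// parseOverflow (hypothetical helper) extracts the received message size
// and the configured limit, both in bytes, from the error message.
func parseOverflow(msg string) (received, limit int, err error) {
	m := sizeRe.FindStringSubmatch(msg)
	if m == nil {
		return 0, 0, fmt.Errorf("no size pair found in %q", msg)
	}
	if received, err = strconv.Atoi(m[1]); err != nil {
		return 0, 0, err
	}
	if limit, err = strconv.Atoi(m[2]); err != nil {
		return 0, 0, err
	}
	return received, limit, nil
}

// nextPowerOfTwoMiB (hypothetical helper) returns the smallest
// power-of-two multiple of 1 MiB that is at least n bytes.
func nextPowerOfTwoMiB(n int) int {
	const mib = 1 << 20
	size := mib
	for size < n {
		size *= 2
	}
	return size
}

func main() {
	msg := "rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4753177 vs. 4194304)"
	received, limit, err := parseOverflow(msg)
	if err != nil {
		panic(err)
	}
	// Prints: received=4753177 limit=4194304 over=558873 suggested=8388608
	fmt.Printf("received=%d limit=%d over=%d suggested=%d\n",
		received, limit, received-limit, nextPowerOfTwoMiB(received))
}
```

The response exceeded the limit by 558873 bytes, roughly 13%. In grpc-go the client-side limit can be raised with the `grpc.MaxCallRecvMsgSize` call option; whether and where dcgm-exporter should apply it to its kubelet pod resources client is an open question for the maintainers.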
What did you expect to happen?
Querying the pod list through kubelet.sock should succeed without an error, even when many pods are running on the node.
What is the GPU model?
No response
What is the environment?
No response
How did you deploy the dcgm-exporter and what is the configuration?
No response
How to reproduce the issue?
Create a large number of pods on a GPU node, then deploy the dcgm-exporter and query its metrics endpoint.
Anything else we need to know?
No response