When the number of Pods on a node is large, a ResourceExhausted error occurs when listing pods #509

@halcyon-r

What is the version?

4.2.3-4.1.1

What happened?

A large number of pods are running on one GPU node. Querying the dcgm-exporter for that node's metrics returns an internal server error.

$ kubectl get po -n monitor-system -owide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE
prometheus-node-exporter-mb9d8   1/1     Running   0          98d   10.11.11.12   node1
$ curl 10.11.11.12:9400/metrics
internal server error

The dcgm-exporter logs:

......
time=2025-06-06T06:31:16.632Z level=ERROR msg="Failed to apply transformations on metrics" error="failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4753177 vs. 4194304)" fieldEntityGroup=GPU metrics="map[{Fi
......

What did you expect to happen?

Querying the pod list through kubelet.sock should not return an error.

What is the GPU model?

No response

What is the environment?

No response

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

Create a large number of Pods on a GPU node, then deploy the dcgm-exporter.

Anything else we need to know?

No response

Labels: bug
