When the number of Pods on a node is large, a ResourceExhausted error occurs when listing pods #509

### What is the version?

4.2.3-4.1.1

### What happened?

A large number of pods are running on one GPU node. Querying the dcgm-exporter metrics endpoint for that node returns `internal server error`.
```
$ kubectl get po -n monitor-system -owide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE
prometheus-node-exporter-mb9d8   1/1     Running   0          98d   10.11.11.12   node1
$ curl 10.11.11.12:9400/metrics
internal server error
```

The dcgm-exporter logs:
```
......
time=2025-06-06T06:31:16.632Z level=ERROR msg="Failed to apply transformations on metrics" error="failure getting pod resources; err: rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4753177 vs. 4194304)" fieldEntityGroup=GPU metrics="map[{Fi
......
```
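The size pair in the error is the key detail: the response to the pod list request was 4753177 bytes, while the client accepted at most 4194304 bytes (4 MiB, which matches the gRPC-Go default receive limit). As a minimal sketch of that arithmetic, the snippet below pulls the two numbers out of the error text and reports the overshoot. The helper name `parseOversize` is hypothetical and is not part of dcgm-exporter.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// sizeRe matches the size pair in the gRPC error text quoted in the log,
// e.g. "received message larger than max (4753177 vs. 4194304)".
var sizeRe = regexp.MustCompile(`received message larger than max \((\d+) vs\. (\d+)\)`)

// parseOversize is a hypothetical helper (not part of dcgm-exporter).
// It returns the received message size and the configured limit, in bytes.
func parseOversize(msg string) (int64, int64, bool) {
	m := sizeRe.FindStringSubmatch(msg)
	if m == nil {
		return 0, 0, false
	}
	received, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, 0, false
	}
	limit, err := strconv.ParseInt(m[2], 10, 64)
	if err != nil {
		return 0, 0, false
	}
	return received, limit, true
}

func main() {
	// Error text copied from the dcgm-exporter log line in this report.
	msg := "rpc error: code = ResourceExhausted desc = grpc: received message larger than max (4753177 vs. 4194304)"

	received, limit, ok := parseOversize(msg)
	if !ok {
		fmt.Println("no size information found in error message")
		return
	}
	fmt.Printf("received=%d B limit=%d B over=%d B\n", received, limit, received-limit)
}
```

In gRPC-Go, a client can raise this limit with the `grpc.MaxCallRecvMsgSize` call option. Whether dcgm-exporter exposes such a setting in this version is not confirmed here.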



### What did you expect to happen?

Querying the pod list through kubelet.sock should not report an error, even when the node runs a large number of Pods.

### What is the GPU model?

_No response_

### What is the environment?

_No response_

### How did you deploy the dcgm-exporter and what is the configuration?

_No response_

### How to reproduce the issue?

Create a large number of Pods on a GPU node, then deploy the dcgm-exporter and query its metrics endpoint.

### Anything else we need to know?

_No response_
