Connections entering unstable state with waiting_for_connection error #2422

@joshbautista

Description

Current Behavior

Beginning with versions 0.38.0 and 0.39.0, we have intermittently observed the following error from PGAdapter:

CallOptions deadline exceeded after 54.999971981s. Name resolution delay 0.000000000 seconds. [closed=[], open=[[buffered_nanos=55012051961, waiting_for_connection]]]

The deadline of ~55 seconds is expected, since we run SET statement_timeout = 55000; on every read query to prevent Spanner from running queries longer than necessary.
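For reference, this is the session-level setting applied on the read connections (a minimal sketch; the value is the one quoted above):

```sql
-- Applied once per read connection; value is in milliseconds (~55 s).
-- This is what produces the ~55 s CallOptions deadline in the error above.
SET statement_timeout = 55000;
```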

When a PGAdapter instance enters this unstable state, some connections in the pool appear to be tainted while others work just fine. We can run psql against a faulted instance and execute simple queries (e.g. select * from table limit 1;) successfully a few times, but the session stalls randomly and eventually times out with the waiting_for_connection error. The stall also occurs without a statement_timeout set; we just don't see the error surface, because the query keeps attempting to run indefinitely.

We have also seen this on write statements (e.g. UPDATE, INSERT), albeit with a lower timeout of 30 seconds, which does seem to match the Spanner gRPC configuration defaults.

Reproducing the error state is quite difficult, since each pod receives all sorts of traffic and the problem only seems to manifest on one or two pods out of 120 at a time. It generally occurs about 24 hours after a new Deployment, but sometimes shows up in as little as a few hours.

Reverting to version 0.37.0 resolves the behavior (so far, at least).

Context (Environment)

  • Running PGAdapter 0.38.0 or 0.39.0 as a sidecar in GKE
  • 120 Pods in the Deployment
  • PGAdapter configured with 1 vCPU and 2 GB memory limits (actual CPU usage hovers at ~100 mCPU)
  • PGAdapter executed with the following args:
- args:
  - -p
  - <REDACTED>
  - -i
  - <REDACTED>
  - -d
  - <REDACTED>
  - -enable_otel
  - -otel_trace_ratio=0.05

Environment Variables:
JDK_JAVA_OPTIONS = "-Xmx1600M -Xms1600M -XshowSettings:vm" 
  • PGAdapter is running as a sidecar with a Node.js app using Knex.js as a query builder and connection pooler (using tarn.js under the hood)
  • The application creates two Knex.js connection pools, one for reading and one for writing where the read pool is configured to use SET statement_timeout = 55000;
  • Knex.js read pool is configured with 30/75 min/max connections
  • Knex.js write pool is configured with 40/40 min/max connections
  • Both pools point to the same PGAdapter instance
  • PGAdapter session pool values are kept at the defaults, which we believe are 100 min sessions / 400 max sessions / 4 gRPC channels
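To make the pool setup concrete, here is a minimal sketch of how the two Knex.js pools are presumably configured. The min/max values and the statement_timeout are taken from the figures above; the host/port, variable names, and the use of the afterCreate hook to apply the timeout are assumptions about our setup, not confirmed details:

```javascript
// Sketch only: pool sizes and timeout from this report; everything else is illustrative.
const knex = require('knex');

const readDb = knex({
  client: 'pg',
  connection: { host: '127.0.0.1', port: 5432 }, // PGAdapter sidecar (port assumed)
  pool: {
    min: 30,
    max: 75,
    // Run on every newly created connection so reads never exceed ~55 s on Spanner.
    afterCreate: (conn, done) => {
      conn.query('SET statement_timeout = 55000;', (err) => done(err, conn));
    },
  },
});

const writeDb = knex({
  client: 'pg',
  connection: { host: '127.0.0.1', port: 5432 },
  pool: { min: 40, max: 40 }, // no statement_timeout on the write pool
});
```

At full load the two pools can hold 75 + 40 = 115 client connections against one PGAdapter instance, which stays well under the default 400-session ceiling mentioned above.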

Other Information

  • CPU resources seem fine; usage hovers around 100 mCPU
  • Memory resources seem fine, based on the values reported in the thread dumps and no indication of aggressive GC
  • Attempted issuing RESET ALL to see whether it was a session-state issue, but there was no improvement

We were able to gather some thread dumps (kill -3 <PID>) in case you can glean insight from them:
