fix: Widen wool keepalive margin to stop GOAWAY too_many_pings #73

Merged
conradbzura merged 1 commit into master from fix/grpc-keepalive-too-many-pings on Jun 26, 2026

@conradbzura

Collaborator

The worker dispatch channel ran on wool's default gRPC options, which set the client keepalive interval (keepalive_time_ms=30s) exactly equal to the server's no-data ping floor (http2_min_recv_ping_interval_without_data_ms=30s) with a 2-strike budget. A workflow dispatch is a long-lived stream-stream RPC that goes quiet during a subprocess stage (samtools/sort on a large file); with keepalive_permit_without_calls on, the client keeps pinging into that silence. Over Fargate's awsvpc ENI, inter-ping jitter lands some pings a hair under the 30s floor, and the server counts each early ping as a strike. On the third strike, one past the budget, it sends GOAWAY too_many_pings, which surfaces on the API as UNAVAILABLE and fails the job.
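
To make the failure mode concrete, here is a simplified model of the server-side strike accounting described above. It is only a sketch: `first_goaway_ping` is a hypothetical helper, the arrival times are made-up jitter values rather than measurements, and the model ignores the strike reset that happens when the server sends data (reasonable here, since the stream is silent).

```python
from typing import Optional, Sequence


def first_goaway_ping(
    arrivals_ms: Sequence[int],
    min_interval_ms: int,
    max_strikes: int,
) -> Optional[int]:
    """Return the index of the ping that would trigger GOAWAY, or None.

    Simplified model: a ping that arrives less than `min_interval_ms` after
    the previous one is a strike, and the connection is closed once the
    strike count exceeds `max_strikes`.
    """
    strikes = 0
    previous_ms: Optional[int] = None
    for index, now_ms in enumerate(arrivals_ms):
        if previous_ms is not None and now_ms - previous_ms < min_interval_ms:
            strikes += 1
            if strikes > max_strikes:
                return index
        previous_ms = now_ms
    return None


# Hypothetical arrivals: a 30s cadence where each gap is 10-20 ms short.
old_arrivals_ms = [0, 29_980, 59_970, 89_950]
# The same jitter applied to a 60s cadence.
new_arrivals_ms = [0, 59_980, 119_970, 179_950]

old_result = first_goaway_ping(old_arrivals_ms, min_interval_ms=30_000, max_strikes=2)
new_result = first_goaway_ping(new_arrivals_ms, min_interval_ms=20_000, max_strikes=5)
print(old_result, new_result)
```

Under the old settings every slightly early ping is a strike, so the fourth ping (the third strike) closes the connection; under the new settings none of the pings come anywhere near the floor.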

Give the cadence real margin via a shared worker_grpc_options(): the client pings once a minute, the server floor drops to 20s, and the strike budget rises to 5. The ping interval is now 3x the floor, so a ping can arrive up to 40s early and still clear it. This is wired into both worker entrypoints (worker_main for ECS, worker_lan for local) from one config point; the worker advertises the channel options to clients via discovery metadata, so both directions stay consistent.
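
For reference, the new values could be expressed roughly as follows. This is a hedged sketch, not the code in this PR: `worker_grpc_options_sketch` and `keepalive_margin` are hypothetical names, the list-of-tuples shape is the one plain grpc channels and servers accept (how wool wraps these options is not shown here), and the key strings are the standard gRPC core channel-argument names.

```python
from typing import List, Sequence, Tuple

GrpcOption = Tuple[str, int]

# Values taken from the description above.
CLIENT_KEEPALIVE_MS = 60_000          # client pings once a minute
SERVER_MIN_PING_INTERVAL_MS = 20_000  # server no-data ping floor
SERVER_MAX_PING_STRIKES = 5           # strike budget


def worker_grpc_options_sketch() -> List[GrpcOption]:
    """Hypothetical single config point shared by both worker entrypoints."""
    return [
        ("grpc.keepalive_time_ms", CLIENT_KEEPALIVE_MS),
        (
            "grpc.http2.min_recv_ping_interval_without_data_ms",
            SERVER_MIN_PING_INTERVAL_MS,
        ),
        ("grpc.http2.max_ping_strikes", SERVER_MAX_PING_STRIKES),
    ]


def keepalive_margin(options: Sequence[GrpcOption]) -> Tuple[float, int]:
    """Return (interval / floor, interval - floor in ms) for a set of options."""
    values = dict(options)
    interval_ms = values["grpc.keepalive_time_ms"]
    floor_ms = values["grpc.http2.min_recv_ping_interval_without_data_ms"]
    return interval_ms / floor_ms, interval_ms - floor_ms


ratio, slack_ms = keepalive_margin(worker_grpc_options_sketch())
print(ratio, slack_ms)
```

Deriving the margin from the same option list the channel uses keeps the 3x ratio checkable in a unit test, so a later change to either value cannot silently bring the interval back down to the floor.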

@conradbzura conradbzura merged commit 75c4fc9 into master Jun 26, 2026
3 checks passed