Summary
createPubSubToNode in pool.go creates a redis.Client with no timeout configuration:
directClient := redis.NewClient(&redis.Options{Addr: nodeAddr})
Problem
go-redis dials lazily — the actual TCP connection is established when the first command is sent (e.g., via pubsub.Subscribe), not at redis.NewClient time. Because no DialTimeout, ReadTimeout, or WriteTimeout are set, connections to unreachable nodes (temporarily down, restarting, or behind a slow network path) will block for the OS-level TCP timeout, which can be several minutes.
This affects:
resubscribeOnNewNode goroutines: the per-command 10s context helps, but the actual dial can outlive the context in some TCP stack implementations
- Manual calls to
SubscribeSync during node unavailability: the call blocks until the caller's context deadline, but the underlying event loop and metadata continue to live after the caller gives up
- Stall detection:
migrationStallCheck defaults to 2s, but the effective dial attempt lasts 5s+ with go-redis defaults, causing stall signals to fire before the first connection attempt even completes
Comparison
The topology fallback path at topology.go (seed node recovery) correctly sets explicit timeouts:
&redis.Options{
DialTimeout: 1 * time.Second,
ReadTimeout: 1 * time.Second,
WriteTimeout: 1 * time.Second,
}
createPubSubToNode should follow the same discipline.
Suggested Fix
Add configurable timeout fields (e.g., directClientDialTimeout, directClientReadTimeout, directClientWriteTimeout) to config with sensible defaults (e.g., 3s dial, 5s read/write), and apply them in createPubSubToNode:
directClient := redis.NewClient(&redis.Options{
Addr: nodeAddr,
DialTimeout: p.config.directClientDialTimeout,
ReadTimeout: p.config.directClientReadTimeout,
WriteTimeout: p.config.directClientWriteTimeout,
})
Expose via a WithDirectClientTimeout(dial, rw time.Duration) option.
Note: The stall check default (migrationStallCheck = 2s) should be re-evaluated relative to directClientDialTimeout to avoid false positive stall signals.
Related Issues
Summary
createPubSubToNodeinpool.gocreates aredis.Clientwith no timeout configuration:Problem
go-redis dials lazily — the actual TCP connection is established when the first command is sent (e.g., via
pubsub.Subscribe), not atredis.NewClienttime. Because noDialTimeout,ReadTimeout, orWriteTimeoutare set, connections to unreachable nodes (temporarily down, restarting, or behind a slow network path) will block for the OS-level TCP timeout, which can be several minutes.This affects:
resubscribeOnNewNodegoroutines: the per-command 10s context helps, but the actual dial can outlive the context in some TCP stack implementationsSubscribeSyncduring node unavailability: the call blocks until the caller's context deadline, but the underlying event loop and metadata continue to live after the caller gives upmigrationStallCheckdefaults to 2s, but the effective dial attempt lasts 5s+ with go-redis defaults, causing stall signals to fire before the first connection attempt even completesComparison
The topology fallback path at
topology.go(seed node recovery) correctly sets explicit timeouts:createPubSubToNodeshould follow the same discipline.Suggested Fix
Add configurable timeout fields (e.g.,
directClientDialTimeout,directClientReadTimeout,directClientWriteTimeout) toconfigwith sensible defaults (e.g., 3s dial, 5s read/write), and apply them increatePubSubToNode:Expose via a
WithDirectClientTimeout(dial, rw time.Duration)option.Note: The stall check default (
migrationStallCheck = 2s) should be re-evaluated relative todirectClientDialTimeoutto avoid false positive stall signals.Related Issues
pubSubMetadataentries described there. Fixing timeouts here reduces how long leaked resources are held.migrationTimeout: Slow dials caused by this issue directly increase the likelihood of hitting the migration timeout, which then triggers the permanent subscription loss described in Subscriptions permanently lost after migrationTimeout expires with no recovery path #4.