Skip to content

No dial/read/write timeouts on direct redis.Client in createPubSubToNode #2

Description

@erkattak

Summary

createPubSubToNode in pool.go creates a redis.Client with no timeout configuration:

directClient := redis.NewClient(&redis.Options{Addr: nodeAddr})

Problem

go-redis dials lazily — the actual TCP connection is established when the first command is sent (e.g., via pubsub.Subscribe), not at redis.NewClient time. Because no DialTimeout, ReadTimeout, or WriteTimeout are set, connections to unreachable nodes (temporarily down, restarting, or behind a slow network path) will block for the OS-level TCP timeout, which can be several minutes.

This affects:

  • resubscribeOnNewNode goroutines: the per-command 10s context helps, but the actual dial can outlive the context in some TCP stack implementations
  • Manual calls to SubscribeSync during node unavailability: the call blocks until the caller's context deadline, but the underlying event loop and metadata continue to live after the caller gives up
  • Stall detection: migrationStallCheck defaults to 2s, but the effective dial attempt lasts 5s+ with go-redis defaults, causing stall signals to fire before the first connection attempt even completes

Comparison

The topology fallback path at topology.go (seed node recovery) correctly sets explicit timeouts:

&redis.Options{
    DialTimeout:  1 * time.Second,
    ReadTimeout:  1 * time.Second,
    WriteTimeout: 1 * time.Second,
}

createPubSubToNode should follow the same discipline.

Suggested Fix

Add configurable timeout fields (e.g., directClientDialTimeout, directClientReadTimeout, directClientWriteTimeout) to config with sensible defaults (e.g., 3s dial, 5s read/write), and apply them in createPubSubToNode:

directClient := redis.NewClient(&redis.Options{
    Addr:         nodeAddr,
    DialTimeout:  p.config.directClientDialTimeout,
    ReadTimeout:  p.config.directClientReadTimeout,
    WriteTimeout: p.config.directClientWriteTimeout,
})

Expose via a WithDirectClientTimeout(dial, rw time.Duration) option.

Note: The stall check default (migrationStallCheck = 2s) should be re-evaluated relative to directClientDialTimeout to avoid false positive stall signals.


Related Issues

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions