No dial/read/write timeouts on direct redis.Client in createPubSubToNode

## Summary

`createPubSubToNode` in `pool.go` creates a `redis.Client` with no timeout configuration:

```go
directClient := redis.NewClient(&redis.Options{Addr: nodeAddr})
```

## Problem

go-redis dials lazily: the TCP connection is established when the first command is sent (e.g., via `pubsub.Subscribe`), not at `redis.NewClient` time. Because `DialTimeout`, `ReadTimeout`, and `WriteTimeout` are left unset, the client falls back to the go-redis defaults (5s dial, 3s read/write) rather than values tuned for this path. A connection attempt to an unreachable node (temporarily down, restarting, or behind a slow network path) can therefore block for several seconds, which is longer than the stall-detection window described below.

This affects:
- `resubscribeOnNewNode` goroutines: the per-command 10s context helps, but the actual dial can outlive the context in some TCP stack implementations
- Manual calls to `SubscribeSync` during node unavailability: the call blocks until the caller's context deadline, but the underlying event loop and metadata continue to live after the caller gives up
- Stall detection: `migrationStallCheck` defaults to 2s, but the effective dial attempt lasts 5s+ with go-redis defaults, causing stall signals to fire before the first connection attempt even completes

## Comparison

The topology fallback path in `topology.go` (seed node recovery) already sets explicit timeouts:
```go
&redis.Options{
    DialTimeout:  1 * time.Second,
    ReadTimeout:  1 * time.Second,
    WriteTimeout: 1 * time.Second,
}
```
`createPubSubToNode` should follow the same discipline.

## Suggested Fix

Add configurable timeout fields (e.g., `directClientDialTimeout`, `directClientReadTimeout`, `directClientWriteTimeout`) to `config` with sensible defaults (e.g., 3s dial, 5s read/write), and apply them in `createPubSubToNode`:

```go
directClient := redis.NewClient(&redis.Options{
    Addr:         nodeAddr,
    DialTimeout:  p.config.directClientDialTimeout,
    ReadTimeout:  p.config.directClientReadTimeout,
    WriteTimeout: p.config.directClientWriteTimeout,
})
```

Expose via a `WithDirectClientTimeout(dial, rw time.Duration)` option.

**Note:** The stall check default (`migrationStallCheck = 2s`) should be re-evaluated relative to `directClientDialTimeout` to avoid false positive stall signals.

---

## Related Issues

- **#3 — Connection leak after failure:** Connections that hang due to missing timeouts are the primary source of the leaked `pubSubMetadata` entries described there. Fixing timeouts here reduces how long leaked resources are held.
- **#4 — No recovery after `migrationTimeout`:** Slow dials caused by this issue directly increase the likelihood of hitting the migration timeout, which then triggers the permanent subscription loss described in #4.
- **#5 — Concurrent resubscription goroutines:** Long-running connection attempts caused by this issue extend the window in which two migration goroutines can overlap on the same hashslot, making the race in #5 more likely to manifest.
