feat(spanner): add DCP DirectPath fallback#14612
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements a DirectPath fallback mechanism for the Spanner client's dynamic channel pool, enabling a sticky switch to CloudPath when DirectPath experiences high error rates. The implementation introduces a dcpFallbackSlot to manage dual connection pools and a shared state for health tracking. Review feedback identifies a race condition in the fallback activation logic where non-atomic counter updates could lead to incorrect error rate calculations. Additionally, it is suggested to monitor SendMsg failures in the stream wrapper to ensure all relevant errors contribute to the fallback decision.
| func (s *dcpFallbackSlot) maybeActivateFallback() { | ||
| failures := s.state.primaryFailures.Load() | ||
| successes := s.state.primarySuccesses.Load() | ||
| total := failures + successes | ||
| if total == 0 || failures < uint64(directPathFallbackMinFailedCalls) { | ||
| return | ||
| } | ||
| if float32(failures)/float32(total) < directPathFallbackErrorRateThreshold { | ||
| return | ||
| } | ||
| s.state.fallbackActive.Store(true) | ||
| } |
There was a problem hiding this comment.
The fallback activation logic is susceptible to a race condition because primaryFailures and primarySuccesses are updated and reset independently. If a reset occurs between loading failures (line 993) and successes (line 994) in maybeActivateFallback, the calculated error rate can be incorrectly high (e.g., 100% if failures is from the old window and successes is reset to 0), leading to a premature and permanent (sticky) fallback to CloudPath. To ensure atomicity, consider grouping these counters and the timestamp into a single struct managed by an atomic.Pointer.
| type dcpFallbackMonitoredStream struct { | ||
| grpc.ClientStream | ||
| once sync.Once | ||
| record func(error) | ||
| } |
There was a problem hiding this comment.
The dcpFallbackMonitoredStream currently only monitors RecvMsg. Failures can also occur during SendMsg (e.g., due to connection issues), which should also contribute to the fallback decision. Please verify if this behavior is already guaranteed by an underlying or delegated component (such as a backing stream like watchStream) before assuming it is missing and implementing explicit monitoring here.
References
- When reviewing an iterator or stream for a specific behavior, do not assume the behavior is missing if it's not explicitly implemented in the struct. Verify if the behavior is guaranteed by an underlying or delegated component.
Split of #14604
Internal reference: go/go-dcp-design