feat(spanner): add DCP DirectPath fallback #14612

Open
rahul2393 wants to merge 2 commits into main from dcp-split/2-directpath-fallback

Conversation

@rahul2393
Contributor

Split of #14604

Internal reference: go/go-dcp-design

@rahul2393 rahul2393 requested review from a team as code owners May 19, 2026 12:44
@product-auto-label product-auto-label Bot added the api: spanner Issues related to the Spanner API. label May 19, 2026
@rahul2393 rahul2393 requested a review from olavloite May 19, 2026 12:44
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements a DirectPath fallback mechanism for the Spanner client's dynamic channel pool, enabling a sticky switch to CloudPath when DirectPath experiences high error rates. The implementation introduces a dcpFallbackSlot to manage dual connection pools and a shared state for health tracking. Review feedback identifies a race condition in the fallback activation logic where non-atomic counter updates could lead to incorrect error rate calculations. Additionally, it is suggested to monitor SendMsg failures in the stream wrapper to ensure all relevant errors contribute to the fallback decision.
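
To make the "sticky switch" concrete, here is a minimal sketch of a slot that holds two pools and a one-way flag. The `pool` type, the field names, and `pick` are hypothetical stand-ins chosen for illustration; the PR's actual `dcpFallbackSlot` is only partially visible in this thread and may be structured differently.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// pool is a hypothetical stand-in for a channel pool.
type pool struct{ name string }

// fallbackSlot holds both pools plus a flag that is only ever set, never cleared.
type fallbackSlot struct {
	primary  *pool // stands in for the DirectPath pool
	fallback *pool // stands in for the CloudPath pool
	active   atomic.Bool
}

// pick returns the fallback pool once the flag is set. Because nothing
// resets the flag, the switch is sticky for the lifetime of the slot.
func (s *fallbackSlot) pick() *pool {
	if s.active.Load() {
		return s.fallback
	}
	return s.primary
}

func main() {
	s := &fallbackSlot{primary: &pool{"directpath"}, fallback: &pool{"cloudpath"}}
	fmt.Println(s.pick().name)
	s.active.Store(true)
	fmt.Println(s.pick().name)
}
```

Since the flag is one-way in this sketch, a false positive in the activation check is never corrected, which is why the accuracy of the error-rate calculation matters.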

Comment on lines +992 to +1003
func (s *dcpFallbackSlot) maybeActivateFallback() {
	failures := s.state.primaryFailures.Load()
	successes := s.state.primarySuccesses.Load()
	total := failures + successes
	if total == 0 || failures < uint64(directPathFallbackMinFailedCalls) {
		return
	}
	if float32(failures)/float32(total) < directPathFallbackErrorRateThreshold {
		return
	}
	s.state.fallbackActive.Store(true)
}
Contributor

high

The fallback activation logic is susceptible to a race condition because primaryFailures and primarySuccesses are updated and reset independently. If a reset occurs between loading failures (line 993) and successes (line 994) in maybeActivateFallback, the calculated error rate can be incorrectly high (e.g., 100% if failures is from the old window and successes is reset to 0), leading to a premature and permanent (sticky) fallback to CloudPath. To ensure atomicity, consider grouping these counters and the timestamp into a single struct managed by an atomic.Pointer.
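
One possible shape for the suggested fix is sketched below, assuming the counters and window start time move into an immutable snapshot that is swapped through `atomic.Pointer`. The names `window`, `fallbackState`, `record`, and `reset`, as well as the threshold values, are hypothetical; the PR's real constants and reset logic are not shown in this thread.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// window is an immutable snapshot of one measurement window.
type window struct {
	start     time.Time
	failures  uint64
	successes uint64
}

// Hypothetical thresholds; the values used in the PR are not shown here.
const (
	minFailedCalls     = 3
	errorRateThreshold = 0.5
)

type fallbackState struct {
	win            atomic.Pointer[window]
	fallbackActive atomic.Bool
}

func newFallbackState() *fallbackState {
	s := &fallbackState{}
	s.win.Store(&window{start: time.Now()})
	return s
}

// record copies the current snapshot, bumps one counter, and publishes the
// copy with compare-and-swap, retrying if another goroutine got there first.
func (s *fallbackState) record(failed bool) {
	for {
		old := s.win.Load()
		next := *old
		if failed {
			next.failures++
		} else {
			next.successes++
		}
		if s.win.CompareAndSwap(old, &next) {
			return
		}
	}
}

// reset starts a fresh window; both counters and the timestamp change together.
func (s *fallbackState) reset() {
	s.win.Store(&window{start: time.Now()})
}

func (s *fallbackState) maybeActivateFallback() {
	w := s.win.Load() // single load: both counters come from the same window
	total := w.failures + w.successes
	if total == 0 || w.failures < minFailedCalls {
		return
	}
	if float64(w.failures)/float64(total) < errorRateThreshold {
		return
	}
	s.fallbackActive.Store(true)
}

func main() {
	s := newFallbackState()
	for i := 0; i < 3; i++ {
		s.record(true)
	}
	s.record(false)
	s.maybeActivateFallback()
	fmt.Println("fallback active:", s.fallbackActive.Load())
}
```

The trade-off in this sketch is one small allocation per recorded call in exchange for a consistent read; whether that cost is acceptable on the RPC hot path is a judgment for the PR author.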

Comment on lines +1012 to +1016
type dcpFallbackMonitoredStream struct {
	grpc.ClientStream
	once   sync.Once
	record func(error)
}
Contributor

medium

The dcpFallbackMonitoredStream currently only monitors RecvMsg. Failures can also occur during SendMsg (e.g., due to connection issues), which should also contribute to the fallback decision. Please verify if this behavior is already guaranteed by an underlying or delegated component (such as a backing stream like watchStream) before assuming it is missing and implementing explicit monitoring here.

References
  1. When reviewing an iterator or stream for a specific behavior, do not assume the behavior is missing if it's not explicitly implemented in the struct. Verify if the behavior is guaranteed by an underlying or delegated component.
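
If explicit monitoring turns out to be needed, a wrapper along the following lines could record `SendMsg` failures as well. This is a sketch under assumptions: `clientStream` is a hypothetical two-method subset of `grpc.ClientStream` used to keep the example dependency-free, and `fakeStream` and `errConnection` exist only for the demonstration. The sketch skips `io.EOF` from `SendMsg` because gRPC documents that value as a signal to read the real status from `RecvMsg`.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync"
)

// clientStream is a hypothetical subset of grpc.ClientStream.
type clientStream interface {
	SendMsg(m any) error
	RecvMsg(m any) error
}

// monitoredStream reports the stream's outcome to record at most once.
type monitoredStream struct {
	clientStream
	once   sync.Once
	record func(error)
}

// SendMsg records client-side send failures. io.EOF is not recorded here:
// it means the stream already ended and RecvMsg will surface the status.
func (s *monitoredStream) SendMsg(m any) error {
	err := s.clientStream.SendMsg(m)
	if err != nil && !errors.Is(err, io.EOF) {
		s.once.Do(func() { s.record(err) })
	}
	return err
}

// RecvMsg records success on a clean io.EOF and failure on any other error.
func (s *monitoredStream) RecvMsg(m any) error {
	err := s.clientStream.RecvMsg(m)
	switch {
	case err == nil:
		// More messages may follow; the outcome is not known yet.
	case errors.Is(err, io.EOF):
		s.once.Do(func() { s.record(nil) })
	default:
		s.once.Do(func() { s.record(err) })
	}
	return err
}

// errConnection and fakeStream are demo-only helpers.
var errConnection = errors.New("connection reset")

type fakeStream struct{ sendErr, recvErr error }

func (f *fakeStream) SendMsg(any) error { return f.sendErr }
func (f *fakeStream) RecvMsg(any) error { return f.recvErr }

func main() {
	s := &monitoredStream{
		clientStream: &fakeStream{sendErr: errConnection},
		record:       func(err error) { fmt.Println("recorded:", err) },
	}
	_ = s.SendMsg(nil)
}
```

The shared `sync.Once` keeps a stream that fails on both send and receive from being counted twice in the error rate.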

Base automatically changed from dcp-split/1-dynamic-channel-pool to main May 26, 2026 07:16

Labels

api: spanner Issues related to the Spanner API.
