feat(spanner): add dynamic channel pool #14611

Merged
rahul2393 merged 10 commits into main from dcp-split/1-dynamic-channel-pool on May 26, 2026

@rahul2393
Contributor

Split of #14604

Internal reference: go/go-dcp-design

rahul2393 requested review from a team as code owners on May 19, 2026 12:42
product-auto-label (bot) added the "api: spanner" label on May 19, 2026
rahul2393 requested a review from olavloite on May 19, 2026 12:44
Contributor

gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a dynamic gRPC channel pool (DCP) for the Spanner client, allowing it to scale the number of gRPC channels dynamically based on RPC load. The implementation includes a new dynamicChannelPool type with support for power-of-two least-busy and round-robin selection strategies, along with background workers for scaling up (event-driven) and scaling down (periodic). The PR also refactors internal interfaces to use a requestIDHeaderProvider instead of concrete gRPC clients to accommodate the dynamic nature of the pool. Integration points are added in the session manager and client configuration to enable this opt-in feature. I have no feedback to provide as there were no review comments to assess.
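
For readers unfamiliar with the "power-of-two least-busy" strategy mentioned in the summary, here is a minimal sketch of the general technique. It is not the PR's code: the `channel` type, the `inflight` counter, and `pickLeastBusyOfTwo` are hypothetical names chosen for illustration. The idea is to sample two distinct channels at random and route the RPC to whichever has fewer requests in flight.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync/atomic"
)

// channel is a hypothetical stand-in for one pooled gRPC connection.
type channel struct {
	id       int
	inflight atomic.Int64 // RPCs currently in flight on this channel
}

// pickLeastBusyOfTwo samples two distinct channels using intn and returns
// the one with fewer in-flight RPCs. intn must behave like rand.Intn.
func pickLeastBusyOfTwo(chans []*channel, intn func(n int) int) *channel {
	switch len(chans) {
	case 0:
		return nil
	case 1:
		return chans[0]
	}
	i := intn(len(chans))
	j := intn(len(chans) - 1)
	if j >= i {
		j++ // shift past i so the two candidates are always distinct
	}
	a, b := chans[i], chans[j]
	if b.inflight.Load() < a.inflight.Load() {
		return b
	}
	return a
}

func main() {
	chans := make([]*channel, 4)
	for i := range chans {
		chans[i] = &channel{id: i}
	}
	chans[0].inflight.Store(10) // pretend channel 0 is heavily loaded

	picks := make(map[int]int)
	for n := 0; n < 1000; n++ {
		picks[pickLeastBusyOfTwo(chans, rand.Intn).id]++
	}
	// Channel 0 is never chosen here, because it always loses the comparison.
	fmt.Println("picks per channel:", picks)
}
```

Comparing only two candidates keeps selection cheap (two atomic loads) while still steering load away from busy channels. A real pool would also increment the counter when an RPC starts and decrement it when the RPC finishes.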

Comment thread spanner/client.go Outdated
// environment variable has been set or client has passed the opt-in
// option in ClientConfig.
endToEndTracingEnvironmentVariable := os.Getenv("SPANNER_ENABLE_END_TO_END_TRACING")
if config.EnableEndToEndTracing || endToEndTracingEnvironmentVariable == "true" {
Contributor

nit: to be consistent with how other env vars are being processed, this should also use a case-insensitive comparison
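
A minimal sketch of the case-insensitive check suggested here, using `strings.EqualFold` from the standard library. The helper names `isTrue` and `envFlagEnabled` are hypothetical and not taken from the PR; only the environment variable name comes from the quoted diff.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// isTrue reports whether value equals "true", ignoring case.
func isTrue(value string) bool {
	return strings.EqualFold(value, "true")
}

// envFlagEnabled reads the named environment variable and applies isTrue.
func envFlagEnabled(name string) bool {
	return isTrue(os.Getenv(name))
}

func main() {
	os.Setenv("SPANNER_ENABLE_END_TO_END_TRACING", "TRUE")
	// With a plain == "true" comparison this value would be rejected.
	fmt.Println(envFlagEnabled("SPANNER_ENABLE_END_TO_END_TRACING"))
}
```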

if cfg.DCPPrimeMaxAttempts == 0 {
	cfg.DCPPrimeMaxAttempts = def.DCPPrimeMaxAttempts
}
switch {
Contributor

Can we add a check here for DCPScaleDownCheckInterval <= 0?

Contributor Author

Added
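
For context, a hedged sketch of what such a validation case could look like. The `dcpConfig` struct, its field types, and the error messages are assumptions for illustration; only the field names `DCPMinChannels`, `DCPMaxChannels`, and `DCPScaleDownCheckInterval` appear in this conversation. One reason the check matters, if the interval ends up driving a `time.Ticker` (an assumption about the implementation): `time.NewTicker` panics when given a non-positive duration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// dcpConfig is a hypothetical, trimmed-down config used only for this sketch.
type dcpConfig struct {
	DCPMinChannels            int
	DCPMaxChannels            int
	DCPScaleDownCheckInterval time.Duration
}

// validateDCPConfig rejects values that would break the pool at runtime.
func validateDCPConfig(cfg dcpConfig) error {
	switch {
	case cfg.DCPMinChannels <= 0:
		return errors.New("DCPMinChannels must be positive")
	case cfg.DCPMaxChannels < cfg.DCPMinChannels:
		return errors.New("DCPMaxChannels must be >= DCPMinChannels")
	case cfg.DCPScaleDownCheckInterval <= 0:
		return fmt.Errorf("DCPScaleDownCheckInterval must be positive, got %v",
			cfg.DCPScaleDownCheckInterval)
	}
	return nil
}

func main() {
	cfg := dcpConfig{
		DCPMinChannels:            4,
		DCPMaxChannels:            256,
		DCPScaleDownCheckInterval: -time.Second, // invalid on purpose
	}
	fmt.Println(validateDCPConfig(cfg))
}
```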

Comment thread spanner/client.go Outdated

dcpEnabled := config.DynamicChannelPoolConfig.DCPEnabled && gme == nil && !isExperimentalLocationAPIEnabledForConfig(config) && os.Getenv("SPANNER_EMULATOR_HOST") == ""
if dcpEnabled {
	reqIDInjector := new(requestIDHeaderInjector)
Contributor

The newClientWithConfig(..) method is getting very long. Can we separate out this block into a separate method for initializing DCP? And maybe also put the other large branch of this if statement in a separate method (the one for else if isFallbackEnabled && isDirectPathEnabled)
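
One small step in that direction, sketched under assumptions: the long enable condition from the quoted diff can live in a named predicate that is easy to unit-test on its own. The `dcpInputs` struct and `shouldUseDCP` are hypothetical names; each field mirrors one clause of the original condition.

```go
package main

import (
	"fmt"
	"os"
)

// dcpInputs gathers the conditions the quoted snippet checks inline.
// The field names are hypothetical; each mirrors one clause of the diff.
type dcpInputs struct {
	optedIn            bool   // config.DynamicChannelPoolConfig.DCPEnabled
	gmeConfigured      bool   // gme != nil
	locationAPIEnabled bool   // isExperimentalLocationAPIEnabledForConfig(config)
	emulatorHost       string // os.Getenv("SPANNER_EMULATOR_HOST")
}

// shouldUseDCP returns true only when the client opted in and none of the
// incompatible modes is active.
func shouldUseDCP(in dcpInputs) bool {
	return in.optedIn &&
		!in.gmeConfigured &&
		!in.locationAPIEnabled &&
		in.emulatorHost == ""
}

func main() {
	in := dcpInputs{
		optedIn:      true,
		emulatorHost: os.Getenv("SPANNER_EMULATOR_HOST"),
	}
	fmt.Println("use DCP:", shouldUseDCP(in))
}
```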

Comment thread spanner/client.go Outdated
dial := func(dialCtx context.Context) (gtransport.ConnPool, error) {
	return gtransport.DialPool(dialCtx, allClientOpts(1, config.Compression, config.EnableDirectAccess, dcpOpts...)...)
}
dcp, err := newDynamicChannelPool(ctx, sc, config.DynamicChannelPoolConfig, 0, dial)
Contributor

If I understand it correctly, the 0 here is for initial (as in: initial number of channels). But that is also already included in the config, and the initial argument does not appear to be used anywhere. Can we remove it?

	return
default:
}
p.dialMu.Lock()
Contributor

This lock seems to be held during the entire scale-up, including creating the connection and priming the channel (including any retries of the priming). Would it be possible to release the lock at least during the priming?

Contributor Author

Updated
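
The thread does not show how the update was made, so the following is only a sketch of one common pattern for this kind of change, with hypothetical names throughout (`pool`, `pending`, `scaleUp`, `dialAndPrime`). The lock is held briefly to reserve a slot, released during the slow dial and priming work, and taken again only to publish the result.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errAtCapacity = errors.New("pool at capacity")

// conn is a hypothetical stand-in for a dialed and primed channel.
type conn struct{ id int }

type pool struct {
	mu      sync.Mutex
	conns   []*conn
	pending int // scale-ups in progress, counted against max
	max     int
}

// scaleUp reserves a slot under the lock, runs the slow dial and priming
// step without the lock, then publishes the new channel under the lock.
func (p *pool) scaleUp(dialAndPrime func() (*conn, error)) error {
	p.mu.Lock()
	if len(p.conns)+p.pending >= p.max {
		p.mu.Unlock()
		return errAtCapacity
	}
	p.pending++ // reservation keeps concurrent callers from exceeding max
	p.mu.Unlock()

	c, err := dialAndPrime() // slow path: no lock held, retries can happen here

	p.mu.Lock()
	defer p.mu.Unlock()
	p.pending-- // release the reservation whether or not the dial succeeded
	if err != nil {
		return err
	}
	p.conns = append(p.conns, c)
	return nil
}

func main() {
	p := &pool{max: 4}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			_ = p.scaleUp(func() (*conn, error) { return &conn{id: id}, nil })
		}(i)
	}
	wg.Wait()
	fmt.Println("channels:", len(p.conns))
}
```

The `pending` counter is what makes releasing the lock safe: without it, several goroutines could pass the capacity check at the same time and overshoot the maximum.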

Comment thread spanner/dynamic_channel_pool.go Outdated
type dcpStreamRef struct {
	once   sync.Once
	finish func(error)
	closed chan struct{}
Contributor

This does not appear to be used. Is it something that is needed in a follow-up PR? If not, can we remove it?


type dcpStreamRef struct {
	once   sync.Once
	finish func(error)
Contributor

This does not appear to be used. Is it something that is needed in a follow-up PR? If not, can we remove it?

Comment thread spanner/dynamic_channel_pool.go Outdated
cancel context.CancelFunc
sc *sessionClient
database string
disableRouteToLeader bool
Contributor

I don't think you need this field. You can just read it from sessionClient.

	return ref
}

func (c *dcpSpannerClient) CreateSession(ctx context.Context, req *spannerpb.CreateSessionRequest, opts ...gax.CallOption) (*spannerpb.Session, error) {
Contributor

Should we add a TODO / issue to investigate if we can refactor this into using something like a gRPC interceptor or some other more generic solution, than writing a wrapper for each RPC?

	return gsc.generateRequestIDHeaderInjector(), nil
}

func (c *dcpResolvingSpannerClient) CreateSession(ctx context.Context, req *spannerpb.CreateSessionRequest, opts ...gax.CallOption) (*spannerpb.Session, error) {
Contributor

Here also in combination with the comment above (not for this pull request, but as a potential improvement in the future): use interceptors instead of repeating each RPC.

@rahul2393
Contributor Author

rahul2393 commented May 21, 2026

DCP load validation summary

I ran a side-by-side stale query concurrency test using two Go probers:

| Variant | Client | Channel config |
| --- | --- | --- |
| Static baseline | released Go Spanner client | static channel pool |
| DCP | this PR branch | DCPMinChannels=4, DCPMaxChannels=256 |

Workload:

| Setting | Value |
| --- | --- |
| Query | fixed-key stale query |
| Load mode | concurrency |
| Load shape | 10,50,100,200,300,400,500,400,300,200,100,50,10 |
| Step duration | 180s |
| Warmup discarded per step | 60s |
| CPU limit | removed from both pods (c2-standard-16) |

Steady-state results:

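| Workers | Static QPS | Static avg latency | DCP QPS | DCP avg latency |
| --- | --- | --- | --- | --- |
| 100 | ~3.8K | ~26ms | ~4.8K | ~21ms |
| 200 | ~3.9K | ~52ms | ~11.1K | ~18ms |
| 300 | ~3.7K | ~81ms | ~16.2K | ~19ms |
| 400 | ~3.3K | ~120ms | ~23.3K | ~17ms |
| 500 | ~3.3K | ~151ms | ~23.1K | ~22ms |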
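
As a quick sanity check on these numbers (not part of the original test run), Little's law says that in steady state the number of requests in flight equals throughput times latency. For a closed-loop prober where each worker issues one query at a time, QPS × average latency should therefore land close to the worker count. The sketch below applies that to the approximate values in the table; the `sample` type and `impliedConcurrency` helper are names made up for this example.

```go
package main

import "fmt"

// sample holds one row of the steady-state table (values are approximate).
type sample struct {
	workers   int
	qps       float64 // queries per second
	latencyMs float64 // average latency in milliseconds
}

// impliedConcurrency applies Little's law: in-flight = throughput * latency.
func impliedConcurrency(qps, latencyMs float64) float64 {
	return qps * latencyMs / 1000
}

func main() {
	static := []sample{
		{100, 3800, 26}, {200, 3900, 52}, {300, 3700, 81},
		{400, 3300, 120}, {500, 3300, 151},
	}
	dcp := []sample{
		{100, 4800, 21}, {200, 11100, 18}, {300, 16200, 19},
		{400, 23300, 17}, {500, 23100, 22},
	}
	for i := range static {
		s, d := static[i], dcp[i]
		fmt.Printf("workers=%d static=%.0f dcp=%.0f qps-ratio=%.1fx\n",
			s.workers,
			impliedConcurrency(s.qps, s.latencyMs),
			impliedConcurrency(d.qps, d.latencyMs),
			d.qps/s.qps)
	}
}
```

Each product comes out within a few percent of the configured worker count for both variants, which suggests the reported QPS and latency figures are internally consistent. The QPS ratio is roughly 7x at 400 and 500 workers.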

Observations:

| Area | Result |
| --- | --- |
| Static behavior | Saturated around ~3–4K QPS; latency increased with concurrency. |
| DCP behavior | Scaled channels and sustained ~23K QPS at 400–500 workers. |
| DCP latency | Stayed around ~17–22ms at high concurrency. |
| Server-side latency | spanner/gfe_latency stayed low. |
| Channel distribution | Temporary per-channel load snapshots showed DCP scaling from 4 to 16 active channels. |

[Image: screenshot taken 2026-05-21 at 3:52 PM]

rahul2393 merged commit 51a53ce into main on May 26, 2026
15 checks passed
rahul2393 deleted the dcp-split/1-dynamic-channel-pool branch on May 26, 2026 07:16