perf(spanner) optimize Spanner built-in metrics hot path #14605

Draft
rahul2393 wants to merge 1 commit into googleapis:main from rahul2393:irahul-metrics-hotpath-opt

Contributor

rahul2393 commented on May 17, 2026

Summary

  • avoid eager Header() calls on streaming read/query paths used by built-in and legacy GFE latency metrics
  • remove map/regexp allocations from server-timing parsing on the built-in metrics hot path
  • record built-in metrics with attribute.Set and avoid per-metric attribute slice/map construction
  • keep legacy OpenTelemetry GFE latency metrics intact while moving streaming header capture off the pre-Recv request path
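
To illustrate the second bullet, here is a minimal sketch of scanning a server-timing value without a map or a regexp. It is not the code in this PR: the function name `parseServerTimingDur` is hypothetical, and the header shape (`name; dur=<milliseconds>` entries separated by commas) is an assumption made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// parseServerTimingDur scans a value such as "gfet4t7; dur=12.5, afe; dur=3"
// and returns the "dur" parameter of the named entry. It walks the string
// with strings.Cut, so it builds no map and compiles no regexp.
// Hypothetical helper; the header format is assumed for this sketch.
func parseServerTimingDur(header, name string) (time.Duration, bool) {
	rest := header
	for rest != "" {
		var entry string
		entry, rest, _ = strings.Cut(rest, ",")
		metric, params, _ := strings.Cut(entry, ";")
		if strings.TrimSpace(metric) != name {
			continue
		}
		// Found the entry; look for its dur=<ms> parameter.
		for params != "" {
			var p string
			p, params, _ = strings.Cut(params, ";")
			key, val, ok := strings.Cut(strings.TrimSpace(p), "=")
			if !ok || key != "dur" {
				continue
			}
			ms, err := strconv.ParseFloat(val, 64)
			if err != nil {
				return 0, false
			}
			return time.Duration(ms * float64(time.Millisecond)), true
		}
		return 0, false
	}
	return 0, false
}

func main() {
	d, ok := parseServerTimingDur("gfet4t7; dur=12.5, afe; dur=3", "gfet4t7")
	fmt.Println(d, ok)
}
```

Returning the single value the caller asks for, rather than a map of every entry, is what lets the parser work on substrings of the original header.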

Why

When built-in metrics and legacy OpenTelemetry Spanner metrics are enabled, the streaming query and read paths currently call Header() before the first Recv(). Under load, goroutine profiles showed many requests blocked in grpc.(*clientStream).Header, which increased request latency even when server/GFE latency was low. This change defers header capture until after the first Recv() returns, which preserves the metrics while removing the blocking header wait from request dispatch.
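
The deferral described above could be sketched roughly as follows. This is not the wrapper from the PR: `headerStream`, `lazyHeaderStream`, and `fakeStream` are hypothetical stand-ins that use only the standard library, with `headerStream` modelling just the two calls of a gRPC client stream that matter here.

```go
package main

import (
	"fmt"
	"io"
)

// headerStream is a hypothetical stand-in for the part of a gRPC client
// stream this sketch needs; it is not the real grpc.ClientStream.
type headerStream interface {
	Header() (map[string][]string, error) // may block until headers arrive
	Recv() (string, error)
}

// lazyHeaderStream reads headers only after the first Recv has returned,
// so dispatch never waits on Header.
type lazyHeaderStream struct {
	headerStream
	captured bool
	onHeader func(map[string][]string) // metrics hook, invoked at most once
}

func (s *lazyHeaderStream) Recv() (string, error) {
	msg, err := s.headerStream.Recv()
	if !s.captured {
		s.captured = true
		if md, herr := s.headerStream.Header(); herr == nil && s.onHeader != nil {
			s.onHeader(md)
		}
	}
	return msg, err
}

// fakeStream records the order of calls so the ordering can be inspected.
type fakeStream struct {
	calls []string
	msgs  []string
}

func (f *fakeStream) Header() (map[string][]string, error) {
	f.calls = append(f.calls, "Header")
	return map[string][]string{"server-timing": {"gfet4t7; dur=5"}}, nil
}

func (f *fakeStream) Recv() (string, error) {
	f.calls = append(f.calls, "Recv")
	if len(f.msgs) == 0 {
		return "", io.EOF
	}
	m := f.msgs[0]
	f.msgs = f.msgs[1:]
	return m, nil
}

func main() {
	fs := &fakeStream{msgs: []string{"row1", "row2"}}
	s := &lazyHeaderStream{headerStream: fs, onHeader: func(md map[string][]string) {
		fmt.Println("captured:", md["server-timing"])
	}}
	for {
		if _, err := s.Recv(); err != nil {
			break
		}
	}
	fmt.Println(fs.calls)
}
```

In the sketch the hook fires once, after the first Recv, regardless of whether that Recv succeeded; a real wrapper would also need to decide how to treat a Header error.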

Contributor

gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the metrics collection logic within the Spanner client to improve performance and consistency. Key changes include a more efficient server-timing header parser that avoids map allocations, wrapper types for streaming clients that capture telemetry during result iteration, and improved resource cleanup during client initialization. The review identified two issues: a critical potential nil pointer dereference in the BatchWriteResponseIterator when metrics are disabled, and a high-severity possible out-of-bounds panic when processing OpenTelemetry attributes, caused by a fixed-size array.

Comment thread spanner/client.go
Comment on lines +1393 to +1396
mt.currOp.incrementAttemptCount()
mt.currOp.currAttempt = &attemptTracer{
	startTime: time.Now(),
}
Contributor

critical

This code will panic if built-in metrics are disabled. When tf.enabled is false, createBuiltinMetricsTracer returns an empty builtinMetricsTracer whose currOp is nil. Calling mt.currOp.incrementAttemptCount() or assigning to mt.currOp.currAttempt then dereferences a nil pointer. These lines should be guarded by a check for mt.method != "" or mt.currOp != nil.

if mt.method != "" {
	mt.currOp.incrementAttemptCount()
	mt.currOp.currAttempt = &attemptTracer{
		startTime: time.Now(),
	}
}
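
As a standalone illustration of the other guard mentioned above (checking currOp directly), here is a minimal sketch. The types are simplified, hypothetical versions of the ones named in the comment, and `startAttempt` is an invented helper, not a method from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified, hypothetical versions of the tracer types named in the review.
type attemptTracer struct{ startTime time.Time }

type opTracer struct {
	attemptCount int
	currAttempt  *attemptTracer
}

func (o *opTracer) incrementAttemptCount() { o.attemptCount++ }

type builtinMetricsTracer struct {
	method string
	currOp *opTracer // assumed to be nil when built-in metrics are disabled
}

// startAttempt begins a new attempt only when an operation tracer exists,
// and reports whether an attempt was recorded. Hypothetical helper.
func (mt *builtinMetricsTracer) startAttempt() bool {
	if mt == nil || mt.currOp == nil {
		return false
	}
	mt.currOp.incrementAttemptCount()
	mt.currOp.currAttempt = &attemptTracer{startTime: time.Now()}
	return true
}

func main() {
	disabled := &builtinMetricsTracer{} // what a disabled factory might return
	fmt.Println(disabled.startAttempt())

	enabled := &builtinMetricsTracer{method: "BatchWrite", currOp: &opTracer{}}
	fmt.Println(enabled.startAttempt(), enabled.currOp.attemptCount)
}
```

Guarding on the pointer that is actually dereferenced keeps the check valid even if method and currOp ever stop being set together.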

Comment thread spanner/ot_metrics.go
Comment on lines +226 to +229
var attr [8]attribute.KeyValue
n := copy(attr[:], otConfig.attributeMap)
attr[n] = attributeKeyMethod.String(keyMethod)
otConfig.gfeLatency.Record(ctx, timing.gfeLatency.Milliseconds(), metric.WithAttributes(attr[:n+1]...))
Contributor

high

There is a potential out-of-bounds panic here. If otConfig.attributeMap contains 8 or more elements, copy(attr[:], otConfig.attributeMap) returns 8, and attr[n] (with n == 8) indexes past the end of the [8]attribute.KeyValue array. Because attributeMap contains several standard tags and can be extended via commonTags, it is safer to use a larger buffer or to append to a slice backed by a stack-allocated array, since append grows the slice instead of panicking if the buffer is ever exceeded.

var attrStack [16]attribute.KeyValue
attr := attrStack[:0]
attr = append(attr, otConfig.attributeMap...)
attr = append(attr, attributeKeyMethod.String(keyMethod))
otConfig.gfeLatency.Record(ctx, timing.gfeLatency.Milliseconds(), metric.WithAttributes(attr...))
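
The difference between the two patterns can be reproduced with the standard library alone. In this sketch `keyValue` is a stand-in for attribute.KeyValue, and `fixedIndex` and `withMethod` are hypothetical helpers written for the comparison, not functions from the PR.

```go
package main

import "fmt"

// keyValue stands in for attribute.KeyValue so the sketch needs only the
// standard library.
type keyValue struct{ Key, Value string }

// fixedIndex mirrors the flagged pattern: copy into a fixed array, then
// index at n. It recovers so the caller can observe the panic.
func fixedIndex(base []keyValue, method string) (out []keyValue, panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	var attr [8]keyValue
	n := copy(attr[:], base)
	attr[n] = keyValue{Key: "method", Value: method} // out of range when n == 8
	return attr[:n+1], false
}

// withMethod mirrors the suggested pattern: append onto a slice backed by
// buf. append reallocates if buf is too small instead of panicking.
func withMethod(buf, base []keyValue, method string) []keyValue {
	out := append(buf[:0], base...)
	return append(out, keyValue{Key: "method", Value: method})
}

func main() {
	base := make([]keyValue, 8) // exactly fills the fixed array
	_, panicked := fixedIndex(base, "ExecuteStreamingSql")
	fmt.Println("fixed array panicked:", panicked)

	var stack [16]keyValue
	attrs := withMethod(stack[:], base, "ExecuteStreamingSql")
	fmt.Println("append result length:", len(attrs))
}
```

Whether the 16-element backing array really stays on the stack depends on escape analysis at the call site, so the allocation benefit is worth confirming with a benchmark.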

rahul2393 changed the title from "Optimize Spanner built-in metrics hot path" to "perf(spanner) optimize Spanner built-in metrics hot path" on May 17, 2026