perf(spanner) optimize Spanner built-in metrics hot path#14605
perf(spanner) optimize Spanner built-in metrics hot path#14605rahul2393 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the metrics collection logic within the Spanner client to improve performance and consistency. Key changes include the introduction of a more efficient server timing header parser that avoids map allocations, the use of wrapper types for streaming clients to capture telemetry during result iteration, and improved resource cleanup during client initialization. The review identified two critical issues: a potential nil pointer dereference in the BatchWriteResponseIterator when metrics are disabled, and a possible out-of-bounds panic when processing OpenTelemetry attributes due to a fixed-size array.
| mt.currOp.incrementAttemptCount() | ||
| mt.currOp.currAttempt = &attemptTracer{ | ||
| startTime: time.Now(), | ||
| } |
There was a problem hiding this comment.
This code will panic if built-in metrics are disabled. When tf.enabled is false, createBuiltinMetricsTracer returns an empty builtinMetricsTracer where currOp is nil. Accessing mt.currOp.incrementAttemptCount() and assigning to mt.currOp.currAttempt will result in a nil pointer dereference. These lines should be guarded by a check for mt.method != "" or mt.currOp != nil.
if mt.method != "" {
mt.currOp.incrementAttemptCount()
mt.currOp.currAttempt = &attemptTracer{
startTime: time.Now(),
}
}| var attr [8]attribute.KeyValue | ||
| n := copy(attr[:], otConfig.attributeMap) | ||
| attr[n] = attributeKeyMethod.String(keyMethod) | ||
| otConfig.gfeLatency.Record(ctx, timing.gfeLatency.Milliseconds(), metric.WithAttributes(attr[:n+1]...)) |
There was a problem hiding this comment.
There is a potential out-of-bounds panic here. If otConfig.attributeMap contains 8 or more elements, copy(attr[:], otConfig.attributeMap) will return 8, and attr[n] (where n=8) will exceed the bounds of the [8]attribute.KeyValue array. Given that attributeMap contains several standard tags and can be extended via commonTags, it is safer to use a larger buffer or use append with a stack-allocated slice.
var attrStack [16]attribute.KeyValue
attr := attrStack[:0]
attr = append(attr, otConfig.attributeMap...)
attr = append(attr, attributeKeyMethod.String(keyMethod))
otConfig.gfeLatency.Record(ctx, timing.gfeLatency.Milliseconds(), metric.WithAttributes(attr...))
Summary
Header()calls on streaming read/query paths used by built-in and legacy GFE latency metricsattribute.Setand avoid per-metric attribute slice/map constructionRecvrequest pathWhy
When built-in metrics and legacy OpenTelemetry Spanner metrics are enabled, streaming query/read currently calls
Header()before the firstRecv(). Under load, goroutine profiles showed many requests blocked ingrpc.(*clientStream).Header, increasing request latency even when server/GFE latency was low. This change defers header capture until after the firstRecv()returns, preserving metrics but removing the blocking header wait from request dispatch.