Skip to content

Stamp SampleRate on shape spans from a tracestate rate hint#4562

Open
erik-the-implementer wants to merge 7 commits into
mainfrom
tracestate-sample-rate
Open

Stamp SampleRate on shape spans from a tracestate rate hint#4562
erik-the-implementer wants to merge 7 commits into
mainfrom
tracestate-sample-rate

Conversation

@erik-the-implementer

@erik-the-implementer erik-the-implementer commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Motivation

The Cloudflare worker in front of Electric Cloud head-samples successful requests at 1:20 and keeps all >= 500 responses at sampleRate 1, weighting its exported spans accordingly. Electric only ever received the sampled/not-sampled bit via traceparent, which causes an ~20x undercount: Electric spans carry no SampleRate attribute, so aggregates over electric-region spans under-report worker traffic ~20x relative to the worker's own (weighted) spans.

Server-side errors are already captured as Sentry events, so this PR deliberately does not create extra spans to carry error attributes — it only stamps the sampling weight onto spans that are already being recorded.

The worker now sends its sampling rate alongside traceparent on every proxied request (regardless of the sampled flag) via the W3C tracestate header:

tracestate: electric=rate:<positive-integer>    # currently rate:20

Tracing backends that understand sampling weights natively read an integer span attribute named SampleRate and weight aggregates by it.

Behavior matrix (per shape GET request)

Remote parent Rate hint Response status Behavior
none (direct traffic) any Unchanged — spans recorded, no SampleRate attribute
sampled=1 rate:N < 500 Spans exported as today plus SampleRate = N on the root span and the stream_chunk child spans
sampled=1 rate:N >= 500 Spans exported as today plus SampleRate = 1 (error responses ignore the hint — mirrors the worker's keep-all-errors-at-rate-1 semantics)
sampled=0 any any Unchanged — no spans exported
any unparseable / rate:<1 any Hint ignored — behaves as if no hint was sent

Implementation

  • Electric.Plug.TraceContextPlug — after the existing traceparent extraction, parses the electric=rate:N member out of the (already-decoded) tracestate carried by the extracted span context, and stores %{sample_rate_hint: N} in conn.private. Exposes trace_context/1 and sample_rate_attrs/2 for downstream plugs.
  • Electric.Plug.ServeShapePlugadd_span_attrs_from_conn/1 (run at span start and again from emit_shape_telemetry/1 once the final response status is known) merges the SampleRate attribute into the conn-derived attributes and stamps them onto the current root span via add_span_attributes/1. When the remote parent was not sampled the parent-based sampler left no recording span, so the stamp is a no-op and nothing is exported — the deliberate volume win for unsampled traffic.
  • Electric.Shapes.Api.Response.send_stream/2 — the per-request shape_get.plug.stream_chunk child spans (which survive the flattening in Flatten 1:1 shape GET child spans into Plug_shape_get root-span attributes #4561 because they are 1:n, not 1:1) get the same SampleRate attribute at creation, from the stashed hint and the response status.

Notes / scope decisions

  • Sampler's exclude_spans semantics and the flattening work from the base branch are untouched.
  • Embedded usage (no MIX_TARGET=application, no OTel SDK): the propagator no-ops, so no trace context is ever stored and every path is unchanged.

Tests

  • Unit tests cover the tracestate parsing: valid electric=rate:N members, unparseable values, and rate:<1 all map to the expected hint / ignored behavior.

🤖 Generated with Claude Code

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.37%. Comparing base (0947230) to head (a87c415).
⚠️ Report is 6 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4562      +/-   ##
==========================================
+ Coverage   58.11%   58.37%   +0.26%     
==========================================
  Files         369      370       +1     
  Lines       40459    40674     +215     
  Branches    11471    11555      +84     
==========================================
+ Hits        23513    23744     +231     
+ Misses      16871    16856      -15     
+ Partials       75       74       -1     
Flag Coverage Δ
packages/agents 71.37% <ø> (ø)
packages/agents-mcp 77.70% <ø> (+0.15%) ⬆️
packages/agents-mobile 75.49% <ø> (ø)
packages/agents-runtime 82.52% <ø> (+0.37%) ⬆️
packages/agents-server 74.86% <ø> (-0.12%) ⬇️
packages/agents-server-ui 6.25% <ø> (+<0.01%) ⬆️
packages/electric-ax 46.42% <ø> (ø)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 91.83% <ø> (+0.11%) ⬆️
packages/y-electric 56.05% <ø> (ø)
typescript 58.37% <ø> (+0.26%) ⬆️
unit-tests 58.37% <ø> (+0.26%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown

Claude Code Review

Summary

Since the last review the PR has been substantially simplified: commit a87c415d2 ("Drop 5xx error-tail span synthesis; keep SampleRate stamping") removes the entire synthesized-span machinery and the PR is now a focused, low-risk change — it only stamps the SampleRate attribute (from an upstream tracestate: electric=rate:N hint) onto spans that are already being recorded. This directly resolves the main open item from iteration 6. The remaining code is clean and merge-ready.

What's Working Well

  • The riskiest code is gone, not just retested. Iteration 6's open concern was that the synthesized 5xx error-tail span (forcing the sampled bit, parenting onto the remote span, backdating start_time, "exactly one span / no duplicates") was untested after the integration suite was deleted. Dropping that path entirely is the cleanest possible resolution — there is no longer any subtle OTel-SDK glue to cover. What remains is a plain add_span_attributes(%{"SampleRate" => N}) on an already-recording span, which is a no-op when the parent wasn't sampled.
  • Scope and motivation are now consistent. The PR body was rewritten to drop the "~95% missing error traces" goal and state explicitly that server-side errors are already captured as Sentry events, so no extra error-carrying spans are created. The behavior matrix, the changeset (@core/sync-service: patch), and the code all agree on the four cases now.
  • Defensive parsing, exhaustively tested. The sample_rate_hint/1 with chain rejects rate:0, negatives, floats (1.5), trailing garbage (20x), the empty value, and wrong member formats — all table-driven in trace_context_plug_test.exs, plus the rate:1 boundary and the multi-member / whitespace tracestate cases.
  • tracestate_value/2 is guarded correctly. Record.is_record(span_ctx, :span_ctx) gates the record access, and the result is normalized across binary / non-empty charlist / empty cases to nil. It's exercised transitively by every TraceContextPlug.call/2 test that asserts a parsed hint, so a wrong arity or return shape from :otel_tracestate.get/2 would fail the suite.

Issues Found

Critical (Must Fix)

None.

Important (Should Fix)

None. The iteration-6 "Important" item (no SDK-level coverage for the synthesized error-tail span) is moot now that the synthesis is removed.

Suggestions (Nice to Have)

  • Dangling semicolontrace_context_plug.ex:15: the moduledoc still ends a sentence with (see sample_rate_attrs/2); before the blank line and the final "Hints that are missing…" sentence. Leftover from an earlier bulleted list; reads as stray punctuation. Trivial.
  • No end-to-end assertion that an exported span carries SampleRate. trace_context_plug_test.exs covers the pure layer (parsing + sample_rate_attrs/2 computation) thoroughly, but nothing asserts that the attribute actually lands on the exported Plug_shape_get root span or the stream_chunk children. This was a real gap when the synthesis existed; with the synthesis gone the residual risk is low (the stamp is a one-line SDK call on an already-recording span), so I'd treat a small router-level test as nice-to-have rather than blocking.

Issue Conformance

No linked issue (flagged per project convention, as before). The PR description is thorough and now internally consistent after the rewrite — motivation (~20x undercount), full per-request behavior matrix, the Sentry-handles-errors rationale for not synthesizing spans, and the embedded-mode no-op note. The changeset accurately scopes the change. No stale references to the removed synthesis remain in the description.

Previous Review Status

Iteration 7 — change since last review: commit a87c415d2 ("Drop 5xx error-tail span synthesis; keep SampleRate stamping").

  • Resolved (the iteration-6 Important item): the synthesized 5xx error-tail span and its OpenTelemetry.export_unsampled_error_span/1 + export_unsampled_remote_span/4 helpers are deleted. I confirmed zero remaining references to export_unsampled* / error-tail synthesis anywhere in packages/sync-service, and open_telemetry.ex now adds only the tracestate_value/2 reader. The deleted integration suite is no longer relevant because the path it covered no longer exists.
  • Resolved (the iteration-6 description nit): the PR body no longer cites the deleted integration tests or the "~95% missing error traces" motivation; it now documents the deliberate decision to lean on Sentry for errors.
  • Still open (downgraded to Suggestion): the dangling semicolon at trace_context_plug.ex:15.
  • Not regressed: the iteration-5 fold into add_span_attrs_from_conn/1, uniform string attr keys in stream_chunk ("SampleRate" + "chunk_size"), and vendor-neutral language are all intact. The two call sites of add_span_attrs_from_conn/1 (span start with status == nil, and emit time with the final status) match the documented "stamp-then-overwrite" behavior.

The change is correct, well-scoped, and merge-ready; only the trivial doc nit remains.


Review iteration: 7 | 2026-06-14

Base automatically changed from flatten-shape-get-spans to main June 11, 2026 13:10
alco and others added 3 commits June 11, 2026 15:15
…tail

The Cloudflare worker in front of Electric head-samples successful
requests at 1:N (currently 1:20) and keeps all >= 500 responses at
sampleRate 1, weighting its exported spans accordingly. Electric only
received the sampled/not-sampled bit via traceparent, so:

 1. Electric spans carried no SampleRate attribute and electric-region
    aggregates under-reported worker traffic ~20x;
 2. ~95% of error traces had no server-side spans at all, because the
    worker's keep-on-error decision happens at export time, after a
    traceparent with sampled=0 was already propagated.

The worker now sends its rate hint on every proxied request via the
W3C tracestate header (member: `electric=rate:<N>`). This commit makes
Electric honor it:

 * TraceContextPlug parses the hint (missing/unparseable/rate < 1 hints
   are ignored) and stashes it in the conn together with the remote
   parent span context and its sampled flag.

 * For sampled remote parents, ServeShapePlug stamps `SampleRate` = N
   (status < 500) or `SampleRate` = 1 (status >= 500) on the
   Plug_shape_get root span once the response status is known; the
   shape_get.plug.stream_chunk child spans get the same attribute at
   creation.

 * For unsampled remote parents, a 5xx response now synthesizes a
   single root request span at response time with `SampleRate` = 1,
   carrying the standard root-span attributes, backdated to the request
   start, and parented onto the remote span context with the sampled
   trace flag forced on — so the parent-based sampler records it and it
   lands in the same trace as the worker's kept-on-error spans.
   Unsampled successful requests still export nothing.

Direct traffic (no remote parent) is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alco alco force-pushed the tracestate-sample-rate branch from f1158ed to 7fe2939 Compare June 11, 2026 13:16
@alco alco self-assigned this Jun 14, 2026
@erik-the-implementer erik-the-implementer changed the title Stamp Honeycomb SampleRate from tracestate rate hint and export 5xx error tail for unsampled traces Stamp SampleRate on shape spans from a tracestate rate hint Jun 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants