fix(otel): one v2 logger owns the global provider; scope tenant OTLP creds per exporter #30590
… per exporter The proxy published the OTel global TracerProvider before callbacks were initialized, so no preset logger existed yet and a second generic logger was built that won the global provider. Server spans then exported through a different provider than the preset's gen-ai spans, orphaning the LLM span on the preset backend. Publish after callback init and reuse the already-built logger instead. Separately, per-request tenant OTLP credentials were stamped onto every OTLP exporter, leaking one backend's key onto a co-configured backend. Tag each exporter with the preset that contributed it and apply dynamic credentials only to the matching owner.
Greptile Summary
This PR fixes two bugs in the
Confidence Score: 5/5
Safe to merge. Both changes are narrowly scoped to the OTel V2 path behind the LITELLM_OTEL_V2 flag, and the new ExporterSpec.owner field defaults to None, making it backwards-compatible. The startup-ordering fix is straightforward, and the injectable helpers make the behavior verifiable without mutating global OTel state. The owner-matching logic in _config_with_headers is simple (exact string equality with a None default that preserves prior behavior for untagged exporters). All new and updated tests directly exercise the regression scenarios without mocking over the behaviors they are meant to guard. No files require special attention.
| Filename | Overview |
|---|---|
| litellm/proxy/proxy_server.py | Moves OTel V2 global-provider publish to after callback initialization; delegates to publish_global_otel_v2_provider with injected registered owner and setter — fixes the orphan-span root cause cleanly. |
| litellm/integrations/otel/logger.py | Adds select_global_otel_v2_logger and publish_global_otel_v2_provider with injected dependencies; selection prefers the registered canonical owner over a list scan, falling back correctly for the SDK path. |
| litellm/integrations/otel/plumbing/routing.py | _config_with_headers now filters on spec.owner == self._callback_name in addition to the OTLP-kind check, scoping per-request credentials to only the matching exporter — the core security fix for credential bleed. |
| litellm/integrations/otel/model/config.py | Adds ExporterSpec.owner (optional str, default None) with a clear description linking it to the credential-scoping mechanism; backwards-compatible field addition. |
| tests/test_litellm/integrations/otel/test_otel_v2_logger.py | Adds four well-scoped unit tests for select_global_otel_v2_logger and publish_global_otel_v2_provider covering reuse, preference ordering, fallback creation, and provider publishing. |
| tests/test_litellm/integrations/otel/test_otel_v2_dynamic.py | Updates existing test to add owner tags (required by new code), and adds test_dynamic_headers_do_not_leak_to_other_owners_exporter covering the cross-backend credential-leak regression. |
    wrapped = getattr(proxy_startup_event, "__wrapped__", proxy_startup_event)
    source = inspect.getsource(wrapped)
    init_pos = source.find("_initialize_startup_logging(")
    publish_pos = source.find("select_global_otel_v2_logger(")
    assert init_pos != -1, "callback init call not found in proxy_startup_event"
    assert publish_pos != -1, "OTEL global publish not found in proxy_startup_event"
    assert init_pos < publish_pos, (
        "OTEL global provider is published before callbacks are initialized; a "
        "preset logger will not exist yet and a second generic logger will own "
        "the global provider, orphaning gen-ai spans"
    )
Ordering invariant verified via source-text position
The test uses inspect.getsource() + .find() to assert that _initialize_startup_logging( appears before select_global_otel_v2_logger( in the function source. This check is brittle: it fails spuriously if either call is ever moved into a helper, renamed, or called through an alias, and it can pass spuriously if either string appears in a docstring or comment earlier in the function body. A more robust approach would exercise the ordering at runtime: replace both functions with mocks that record their calls, run the startup path, then assert the recorded sequence.
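A minimal sketch of the runtime ordering check suggested above, using only unittest.mock. The startup function and its two injected parameters are hypothetical stand-ins, not the real proxy_startup_event signature; the point is only that attaching both stubs to one parent mock yields a single ordered call log.

```python
from unittest.mock import Mock


def startup(initialize_callbacks, publish_provider):
    """Hypothetical stand-in for the relevant part of the startup path."""
    initialize_callbacks()
    publish_provider()


def recorded_call_order(startup_fn):
    # Both stubs are children of one parent mock, so the parent's
    # mock_calls list records their calls in the order they happened.
    parent = Mock()
    startup_fn(parent.initialize_callbacks, parent.publish_provider)
    # Each entry is a (name, args, kwargs) triple; keep only the name.
    return [call[0] for call in parent.mock_calls]


order = recorded_call_order(startup)
assert order == ["initialize_callbacks", "publish_provider"]
```

Unlike the source-text check, this still holds if either call is moved into a helper, as long as the helper ends up invoking the injected callables.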
Type the logger-selection parameter as Sequence[object] (isinstance narrows it), cast the list[Any] global at the single call site, and pass model_copy a typed dict[str, str] update so no changed line carries an Any value.
select_global_otel_v2_logger consumes litellm._in_memory_loggers, a shared List[Any] global this change does not own. A cast doesn't satisfy the Any-discipline checker (it inspects the inner expression), and re-annotating the global is out of scope, so mark the single boundary line any-ok.
…helper The publish step lived inline in proxy_startup_event (a FastAPI lifespan unit tests do not execute), so its lines were uncovered though the selection logic was tested. Extract publish_global_otel_v2_provider, which selects the single v2 logger and publishes its provider through an injected setter, and unit-test that the published provider is the selected logger's. proxy_server delegates to it.
    existing = next(
        (cb for cb in in_memory_loggers if isinstance(cb, OpenTelemetryV2)), None
    )
Why can't the callback factory just publish the provider?
… a list scan The startup publish picked the global TracerProvider by scanning _in_memory_loggers for the first OpenTelemetryV2, re-deriving an answer the factory already settled: the first logger built registers itself as proxy_server.open_telemetry_logger, and every other v2 path (guardrail, identity seeding, phase spans) routes through that owner via _registered_v2_logger. Pass that owner into select_global_otel_v2_logger so the global provider reuses the same logger instead of an independent, order-dependent guess; the list scan remains the SDK-path fallback. The owner is injected at the proxy call site to keep the helper free of hidden global reads.
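The selection and publish steps described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in litellm/integrations/otel/logger.py: the OpenTelemetryV2 class here is a stand-in, the tracer_provider attribute name is hypothetical, and the parameter names and order are guesses.

```python
from typing import Callable, Optional, Sequence


class OpenTelemetryV2:
    """Stand-in for the real logger; only what the sketch needs."""

    def __init__(self, name: str = "generic") -> None:
        self.name = name
        self.tracer_provider = object()  # hypothetical attribute name


def select_global_otel_v2_logger(
    registered: Optional[OpenTelemetryV2],
    in_memory_loggers: Sequence[object],
) -> OpenTelemetryV2:
    # Prefer the owner the factory already registered.
    if registered is not None:
        return registered
    # SDK-path fallback: first v2 logger found in the list.
    existing = next(
        (cb for cb in in_memory_loggers if isinstance(cb, OpenTelemetryV2)), None
    )
    # Build a generic logger only when none was configured.
    return existing if existing is not None else OpenTelemetryV2()


def publish_global_otel_v2_provider(
    registered: Optional[OpenTelemetryV2],
    in_memory_loggers: Sequence[object],
    set_global_provider: Callable[[object], None],
) -> OpenTelemetryV2:
    logger = select_global_otel_v2_logger(registered, in_memory_loggers)
    # The setter is injected so tests never touch global OTel state.
    set_global_provider(logger.tracer_provider)
    return logger
```

Passing a list's append method as the setter is enough to check which provider would have been published.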
…itellm_otel_v2_per_exporter_tenant_routing
            owner="weave_otel",
        ),
    ],
    # Weave consumes OpenInference + a small Weave-specific overlay.
Nit: owner should be an enum
The owner field carried free-form strings that had to match preset callback names. Introduce a str-based ExporterOwner enum (values equal to the callback names, so per-request credential routing's owner==callback_name comparison still holds) and have each preset tag its exporter with the enum member.
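A small sketch of why a str-based enum keeps the owner == callback_name comparison working. The values come from the snippets in this thread; the member names (in particular WEAVE_OTEL) and the set of members are illustrative and may differ from the merged code.

```python
from enum import Enum


class ExporterOwner(str, Enum):
    """Illustrative only: values equal the public callback names."""

    ARIZE = "arize"
    ARIZE_PHOENIX = "arize_phoenix"
    WEAVE_OTEL = "weave_otel"  # hypothetical member name


# Mixing in str means a member compares equal to the bare string,
# so routing code that compares against a callback name is unaffected.
assert ExporterOwner.ARIZE == "arize"
assert "weave_otel" == ExporterOwner.WEAVE_OTEL

# Lookup by value returns the same singleton member.
assert ExporterOwner("arize_phoenix") is ExporterOwner.ARIZE_PHOENIX
```

The enum gives presets a closed set of names to tag with, while existing string comparisons continue to hold.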
    registered = (
        open_telemetry_logger
        if isinstance(open_telemetry_logger, OpenTelemetryV2)
        else None
    )
why not use select_global_otel_v2_logger? tbh this can be a single method called select_and_publish_global_otel_provider
    ARIZE = "arize"
    ARIZE_PHOENIX = "arize_phoenix"
pls make naming convention consistent
    can match an exporter's owner against the credential source's callback name.
    A ``str`` enum so the value compares equal to the bare callback-name string."""

    ARIZE = "arize"
Isn't this the same as Arize Phoenix?
Distinguish the hosted Arize AX backend from Arize Phoenix at the member level while keeping the value 'arize' (the public callback name routing compares against). Add a comment noting AX and Phoenix are separate backends.
Merged commit 1f93237 into litellm_internal_staging
Resolves LIT-3787
Relevant issues
Orphan and duplicate gen-ai spans in OpenInference OTEL v2, and cross-backend leakage of per-request OTLP credentials.
Summary
With LITELLM_OTEL_V2 enabled and a preset callback configured (for example arize), a streamed chat completion produced two problems on the backend.

First, the LLM-call span showed up orphaned (no parent server span) and often duplicated. The proxy ended up with two OpenTelemetryV2 loggers that exported to different providers, so the FastAPI server span and the gen-ai spans landed in different traces.

Second, when a vendor preset was configured alongside another OTLP exporter, a request that carried one tenant's vendor credentials had those credentials stamped onto every OTLP exporter, so one backend's key was sent to a co-configured backend.

Example: callbacks: ["arize"] with OTEL_* also set. A streaming request returns the first chunk, and on Arize the chat <model> span has no parent server span and appears twice. Separately, with arize plus a self-hosted OTLP collector, a request carrying a team's Arize space/key rewrites the collector exporter's headers with that Arize key.

Root cause
Orphan/duplicate: proxy_startup_event published the OTel global TracerProvider before request callbacks were initialized. At that point no preset logger existed yet, so the publish step fell through to building a generic OpenTelemetryV2() and made its provider the global one. Callbacks were initialized later, building the actual preset logger (which folds the OTEL_* base exporter and its own exporter into a single logger). The FastAPI instrumentation binds the server span to the global provider, so the server span exported through the generic logger while the preset logger's gen-ai spans exported through the preset's provider. On the preset backend the LLM span's parent never arrived, and both loggers emitting on the success callback produced the duplicate.

Credential bleed: TenantTracerCache._config_with_headers applied a request's dynamic OTLP headers to every OTLP exporter on the logger, regardless of which integration the credentials belonged to.

Fix
Publish the global provider after callback initialization and reuse the logger the factory already built, instead of constructing a second generic one before any logger exists. The select-and-publish step is extracted into publish_global_otel_v2_provider, with the global setter injected, so the behavior is unit-testable rather than buried in the FastAPI lifespan. A generic logger is built only when no logger was configured.

Selection reuses the logger the factory already designated as canonical rather than re-deriving it. The first OpenTelemetryV2 the factory builds registers itself as proxy_server.open_telemetry_logger, and every other v2 path (guardrail emission, identity seeding, phase spans) already routes through that owner via _registered_v2_logger. select_global_otel_v2_logger now takes that registered owner and returns it directly, so the global provider points at the same logger the rest of the v2 code emits through instead of an independent, order-dependent scan of _in_memory_loggers; the scan stays as the SDK-path fallback. The owner is resolved at the proxy call site and injected, keeping the helper free of hidden global reads.

Add an owner field to ExporterSpec and tag every preset's exporter with its callback name. _config_with_headers now applies per-request dynamic credentials only to the exporter whose owner matches the credential source, leaving co-configured exporters' base headers untouched.

Type
🐛 Bug Fix / 🔒 Security
Changes
- litellm/proxy/proxy_server.py: move the OTel v2 global-provider publish to after callback initialization; delegate to publish_global_otel_v2_provider.
- litellm/integrations/otel/logger.py: add select_global_otel_v2_logger (return the injected registered owner when present, else fall back to scanning _in_memory_loggers, building one only when none is registered) and publish_global_otel_v2_provider (select, then publish via an injected setter; the registered owner is injected too).
- litellm/integrations/otel/model/config.py: add ExporterSpec.owner.
- litellm/integrations/otel/plumbing/routing.py: apply per-request dynamic OTLP credentials only to the exporter whose owner matches the credential source.
- litellm/integrations/otel/presets/{arize,langfuse,weave,phoenix,levo,agentops}.py: tag each preset's exporter with its owner.
- Tests: global-logger selection reuses the existing logger and never duplicates, and builds exactly one when none is registered; selection prefers the injected registered owner over the list scan when the two would disagree (two presets configured, list order pointing at a different logger); the published provider is the selected logger's; the publish runs after callback initialization in the lifespan; per-request credentials never leak onto a different owner's exporter; each dynamic-credential preset tags its exporter with the matching owner.
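The owner-scoped credential routing listed for routing.py can be illustrated with a simplified sketch. The ExporterSpec dataclass below is a stand-in with hypothetical kind and headers fields (the real model lives in model/config.py and is copied with model_copy), and config_with_headers is a free function rather than the actual TenantTracerCache method. How the real code treats untagged exporters is not modelled here; in this sketch an exporter with owner None simply never matches a named callback.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ExporterSpec:
    """Simplified stand-in; field names other than owner are assumptions."""

    kind: str
    headers: Dict[str, str] = field(default_factory=dict)
    owner: Optional[str] = None


def config_with_headers(
    exporters: List[ExporterSpec],
    callback_name: str,
    dynamic_headers: Dict[str, str],
) -> List[ExporterSpec]:
    scoped = []
    for spec in exporters:
        # Only the OTLP exporter contributed by the credential source's
        # preset receives the per-request headers.
        if spec.kind == "otlp" and spec.owner == callback_name:
            spec = replace(spec, headers={**spec.headers, **dynamic_headers})
        scoped.append(spec)
    return scoped


exporters = [
    ExporterSpec("otlp", {"api_key": "base-arize-key"}, owner="arize"),
    ExporterSpec("otlp", {"authorization": "collector-token"}),
]
result = config_with_headers(exporters, "arize", {"api_key": "tenant-key"})
assert result[0].headers["api_key"] == "tenant-key"
assert result[1].headers == {"authorization": "collector-token"}
```

The second exporter keeps its base headers, which is the property the cross-backend leak regression test is described as guarding.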
Testing
Verified live: with a single preset configured, exactly one logger owns the global provider, a streamed request produces one connected trace rooted at the server span with the LLM-call span nested beneath it carrying the OpenInference vocabulary, and the same trace fans out to every configured exporter.


Known limitation: configuring more than one preset at once still builds one logger per preset and therefore a duplicate gen-ai span per backend (correctly parented, not orphaned). Collapsing multiple presets into a single logger with a merged mapper chain is a larger change tracked separately.