fix(otel): one v2 logger owns the global provider; scope tenant OTLP creds per exporter#30590

Merged: yassin-berriai merged 8 commits into litellm_internal_staging from litellm_otel_v2_per_exporter_tenant_routing on Jun 19, 2026
Conversation

@yucheng-berri yucheng-berri commented Jun 17, 2026

Resolves LIT-3787

Relevant issues

Orphan and duplicate gen-ai spans in OpenInference OTEL v2, and cross-backend leakage of per-request OTLP credentials.

Summary

With LITELLM_OTEL_V2 enabled and a preset callback configured (for example arize), a streamed chat completion produced two problems on the backend.

First, the LLM-call span showed up orphaned (no parent server span) and often duplicated. The proxy ended up with two OpenTelemetryV2 loggers that exported to different providers, so the FastAPI server span and the gen-ai spans landed in different traces.

Second, when a vendor preset was configured alongside another OTLP exporter, a request that carried one tenant's vendor credentials had those credentials stamped onto every OTLP exporter, so one backend's key was sent to a co-configured backend.

Example: callbacks: ["arize"] with OTEL_* also set. A streaming request returns the first chunk, and on Arize the chat <model> span has no parent server span and appears twice. Separately, with arize plus a self-hosted OTLP collector, a request carrying a team's Arize space/key rewrites the collector exporter's headers with that Arize key.

Root cause

Orphan/duplicate: proxy_startup_event published the OTel global TracerProvider before request callbacks were initialized. At that point no preset logger existed yet, so the publish step fell through to building a generic OpenTelemetryV2() and made its provider the global one. Callbacks were initialized later, building the actual preset logger (which folds the OTEL_* base exporter and its own exporter into a single logger). The FastAPI instrumentation binds the server span to the global provider, so the server span exported through the generic logger while the preset logger's gen-ai spans exported through the preset's provider. On the preset backend the LLM span's parent never arrived, and both loggers emitting on the success callback produced the duplicate.

Credential bleed: TenantTracerCache._config_with_headers applied a request's dynamic OTLP headers to every OTLP exporter on the logger, regardless of which integration the credentials belonged to.

Fix

Publish the global provider after callback initialization and reuse the logger the factory already built, instead of constructing a second generic one before any logger exists. The select-and-publish step is extracted into publish_global_otel_v2_provider, with the global-setter injected, so the behavior is unit-testable rather than buried in the FastAPI lifespan. A generic logger is built only when no logger was configured.

Selection reuses the logger the factory already designated as canonical rather than re-deriving it. The first OpenTelemetryV2 the factory builds registers itself as proxy_server.open_telemetry_logger, and every other v2 path (guardrail emission, identity seeding, phase spans) already routes through that owner via _registered_v2_logger. select_global_otel_v2_logger now takes that registered owner and returns it directly, so the global provider points at the same logger the rest of the v2 code emits through instead of an independent, order-dependent scan of _in_memory_loggers; the scan stays as the SDK-path fallback. The owner is resolved at the proxy call site and injected, keeping the helper free of hidden global reads.
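The select-and-publish flow described above can be sketched roughly as follows. This is a minimal illustration built on stand-in types, not the litellm implementation: `FakeV2Logger`, its `tracer_provider` attribute, and the parameter names are assumptions; only the two function names come from this PR, and their real signatures may differ.

```python
from typing import Callable, Optional, Sequence


class FakeV2Logger:
    """Hypothetical stand-in for OpenTelemetryV2 (not the litellm class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        # Stand-in for the logger's TracerProvider object.
        self.tracer_provider = f"provider-of-{name}"


def select_global_otel_v2_logger(
    in_memory_loggers: Sequence[object],
    registered: Optional[FakeV2Logger] = None,
) -> FakeV2Logger:
    # 1. Prefer the owner the factory already registered.
    if registered is not None:
        return registered
    # 2. SDK-path fallback: the first v2 logger found in the list.
    for cb in in_memory_loggers:
        if isinstance(cb, FakeV2Logger):
            return cb
    # 3. Nothing configured: build exactly one generic logger.
    return FakeV2Logger("generic")


def publish_global_otel_v2_provider(
    in_memory_loggers: Sequence[object],
    registered: Optional[FakeV2Logger],
    set_tracer_provider: Callable[[object], None],
) -> FakeV2Logger:
    # Select once, then publish through the injected setter so tests
    # never have to mutate real global OTel state.
    logger = select_global_otel_v2_logger(in_memory_loggers, registered)
    set_tracer_provider(logger.tracer_provider)
    return logger


arize = FakeV2Logger("arize")
langfuse = FakeV2Logger("langfuse")
published: list = []

# List order points at langfuse, but the registered owner (arize) wins.
chosen = publish_global_otel_v2_provider([langfuse, arize], arize, published.append)
```

Injecting the setter means a unit test can pass a plain list's `append` and inspect what was published, which is the property the PR description relies on for testability.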

Add an owner field to ExporterSpec and tag every preset's exporter with its callback name. _config_with_headers now applies per-request dynamic credentials only to the exporter whose owner matches the credential source, leaving co-configured exporters' base headers untouched.
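A rough sketch of the owner-matching rule, assuming a much simplified `ExporterSpec`. The field names other than `owner`, the header keys, the endpoints, and the free function `config_with_headers` are all hypothetical; the real logic lives in `TenantTracerCache._config_with_headers`. In this sketch an untagged exporter (`owner=None`) never receives dynamic credentials, which is an assumption rather than something the PR states.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ExporterSpec:
    """Simplified stand-in; the real ExporterSpec has more fields."""

    kind: str
    endpoint: str
    headers: Dict[str, str] = field(default_factory=dict)
    owner: Optional[str] = None  # callback name of the contributing preset


def config_with_headers(
    specs: List[ExporterSpec],
    callback_name: str,
    dynamic_headers: Dict[str, str],
) -> List[ExporterSpec]:
    scoped = []
    for spec in specs:
        if spec.kind == "otlp" and spec.owner == callback_name:
            # Only the matching owner's exporter gets the tenant credentials.
            merged = {**spec.headers, **dynamic_headers}
            scoped.append(replace(spec, headers=merged))
        else:
            # Co-configured exporters keep their base headers untouched.
            scoped.append(spec)
    return scoped


specs = [
    ExporterSpec("otlp", "https://arize.example/v1/traces",
                 {"api_key": "base-key"}, owner="arize"),
    ExporterSpec("otlp", "http://collector.example:4318/v1/traces",
                 {"x-team": "infra"}),
]
scoped = config_with_headers(specs, "arize", {"api_key": "tenant-key"})
```

Here only the first exporter ends up with the tenant key; the self-hosted collector's headers are returned unchanged, and the original `specs` list is not mutated.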

Type

🐛 Bug Fix / 🔒 Security

Changes

litellm/proxy/proxy_server.py: move the OTel v2 global-provider publish to after callback initialization; delegate to publish_global_otel_v2_provider.

litellm/integrations/otel/logger.py: add select_global_otel_v2_logger (return the injected registered owner when present, else fall back to scanning _in_memory_loggers, building one only when none is registered) and publish_global_otel_v2_provider (select, then publish via an injected setter; the registered owner is injected too).

litellm/integrations/otel/model/config.py: add ExporterSpec.owner.

litellm/integrations/otel/plumbing/routing.py: apply per-request dynamic OTLP credentials only to the exporter whose owner matches the credential source.

litellm/integrations/otel/presets/{arize,langfuse,weave,phoenix,levo,agentops}.py: tag each preset's exporter with its owner.

Tests: global-logger selection reuses the existing logger and never duplicates, and builds exactly one when none is registered; selection prefers the injected registered owner over the list scan when the two would disagree (two presets configured, list order pointing at a different logger); the published provider is the selected logger's; the publish runs after callback initialization in the lifespan; per-request credentials never leak onto a different owner's exporter; each dynamic-credential preset tags its exporter with the matching owner.

Testing

pytest tests/test_litellm/integrations/otel/test_otel_v2_logger.py \
       tests/test_litellm/integrations/otel/test_otel_v2_dynamic.py \
       tests/test_litellm/integrations/otel/test_otel_v2_presets.py \
       tests/test_litellm/proxy/proxy_server/test_lifecycle.py

Verified live: with a single preset configured, exactly one logger owns the global provider, a streamed request produces one connected trace rooted at the server span with the LLM-call span nested beneath it carrying the OpenInference vocabulary, and the same trace fans out to every configured exporter.

Known limitation: configuring more than one preset at once still builds one logger per preset and therefore a duplicate gen-ai span per backend (correctly parented, not orphaned). Collapsing multiple presets into a single logger with a merged mapper chain is a larger change tracked separately.

… per exporter

The proxy published the OTel global TracerProvider before callbacks were
initialized, so no preset logger existed yet and a second generic logger was
built that won the global provider. Server spans then exported through a
different provider than the preset's gen-ai spans, orphaning the LLM span on
the preset backend. Publish after callback init and reuse the already-built
logger instead.

Separately, per-request tenant OTLP credentials were stamped onto every OTLP
exporter, leaking one backend's key onto a co-configured backend. Tag each
exporter with the preset that contributed it and apply dynamic credentials
only to the matching owner.
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

codecov Bot commented Jun 17, 2026

Codecov Report

❌ Patch coverage is 80.55556% with 7 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| litellm/proxy/proxy_server.py | 30.00% | 7 Missing ⚠️ |

greptile-apps Bot commented Jun 17, 2026

Greptile Summary

This PR fixes two bugs in the LITELLM_OTEL_V2 path: orphaned/duplicated gen-ai spans caused by the global TracerProvider being published before preset loggers were initialized, and per-request OTLP credentials leaking across co-configured exporters.

  • Orphan-span fix: moves the set_tracer_provider call to after _initialize_startup_logging in the proxy lifespan, and replaces the ad-hoc inline scan with publish_global_otel_v2_provider — which prefers the registered canonical owner (proxy_server.open_telemetry_logger) over re-deriving from list order, keeping the FastAPI server span and gen-ai spans on the same provider.
  • Credential-scoping fix: adds ExporterSpec.owner (tagged by each preset on the exporter it contributes) and narrows TenantTracerCache._config_with_headers to apply per-request dynamic headers only to the exporter whose owner matches self._callback_name, leaving co-configured exporters' base headers untouched.
  • Both fixes are backed by focused regression tests: selection preference, publish wiring, cross-owner header isolation, and startup-ordering guard.

Confidence Score: 5/5

Safe to merge — both changes are narrowly scoped to the OTel V2 path behind the LITELLM_OTEL_V2 flag, and the new ExporterSpec.owner field defaults to None making it backwards-compatible.

The startup-ordering fix is straightforward and the injectable helpers make the behavior verifiable without mutating global OTel state. The owner-matching logic in _config_with_headers is simple (exact string equality with a None default that preserves prior behavior for untagged exporters). All new and updated tests directly exercise the regression scenarios without mocking over the behaviors they are meant to guard.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| litellm/proxy/proxy_server.py | Moves OTel V2 global-provider publish to after callback initialization; delegates to publish_global_otel_v2_provider with injected registered owner and setter, which fixes the orphan-span root cause cleanly. |
| litellm/integrations/otel/logger.py | Adds select_global_otel_v2_logger and publish_global_otel_v2_provider with injected dependencies; selection prefers the registered canonical owner over a list scan, falling back correctly for the SDK path. |
| litellm/integrations/otel/plumbing/routing.py | _config_with_headers now filters on spec.owner == self._callback_name in addition to the OTLP-kind check, scoping per-request credentials to only the matching exporter. This is the core security fix for credential bleed. |
| litellm/integrations/otel/model/config.py | Adds ExporterSpec.owner (optional str, default None) with a clear description linking it to the credential-scoping mechanism; backwards-compatible field addition. |
| tests/test_litellm/integrations/otel/test_otel_v2_logger.py | Adds four well-scoped unit tests for select_global_otel_v2_logger and publish_global_otel_v2_provider covering reuse, preference ordering, fallback creation, and provider publishing. |
| tests/test_litellm/integrations/otel/test_otel_v2_dynamic.py | Updates existing test to add owner tags (required by new code), and adds test_dynamic_headers_do_not_leak_to_other_owners_exporter covering the cross-backend credential-leak regression. |

Reviews (2): Last reviewed commit: "Merge remote-tracking branch 'origin/lit..."

Comment on lines +522 to +532
wrapped = getattr(proxy_startup_event, "__wrapped__", proxy_startup_event)
source = inspect.getsource(wrapped)
init_pos = source.find("_initialize_startup_logging(")
publish_pos = source.find("select_global_otel_v2_logger(")
assert init_pos != -1, "callback init call not found in proxy_startup_event"
assert publish_pos != -1, "OTEL global publish not found in proxy_startup_event"
assert init_pos < publish_pos, (
    "OTEL global provider is published before callbacks are initialized; a "
    "preset logger will not exist yet and a second generic logger will own "
    "the global provider, orphaning gen-ai spans"
)

P2 Ordering invariant verified via source-text position

The test uses inspect.getsource() + .find() to assert that _initialize_startup_logging( appears before select_global_otel_v2_logger( in the function source. This check is brittle in both directions: it fails spuriously if either call is moved into a helper, renamed, or called through an alias, and it can pass spuriously if either string appears in a docstring or comment earlier in the function body. A more robust approach would exercise the ordering at runtime: replace both functions with mocks that record their calls, run the startup path, then assert the recorded sequence.
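A runtime ordering check of the kind suggested here could look roughly like this. It is a sketch under stated assumptions: `startup` is a hypothetical stand-in for the relevant part of `proxy_startup_event`, and a real test would instead patch the two proxy functions and attach the patches to a shared parent mock.

```python
from unittest.mock import MagicMock, call


def startup(initialize_callbacks, publish_provider):
    """Hypothetical stand-in for the ordering inside proxy_startup_event."""
    initialize_callbacks()
    publish_provider()


def test_publish_runs_after_callback_init():
    # Child mocks of one parent record their calls, in order,
    # on parent.mock_calls.
    parent = MagicMock()
    startup(parent.initialize_callbacks, parent.publish_provider)
    assert parent.mock_calls == [
        call.initialize_callbacks(),
        call.publish_provider(),
    ]
```

Because the assertion is on recorded calls rather than source text, it keeps working if either call is moved into a helper, as long as both are still invoked in the same order.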

Type the logger-selection parameter as Sequence[object] (isinstance narrows
it), cast the list[Any] global at the single call site, and pass model_copy a
typed dict[str, str] update so no changed line carries an Any value.
select_global_otel_v2_logger consumes litellm._in_memory_loggers, a shared
List[Any] global this change does not own. A cast doesn't satisfy the
Any-discipline checker (it inspects the inner expression), and re-annotating the
global is out of scope, so mark the single boundary line any-ok.
…helper

The publish step lived inline in proxy_startup_event (a FastAPI lifespan unit
tests do not execute), so its lines were uncovered though the selection logic
was tested. Extract publish_global_otel_v2_provider, which selects the single v2
logger and publishes its provider through an injected setter, and unit-test that
the published provider is the selected logger's. proxy_server delegates to it.
@yucheng-berri yucheng-berri changed the title from "fix(otel): one v2 logger owns the global provider; scope tenant creds per exporter" to "fix(otel): one v2 logger owns the global provider; scope tenant OTLP creds per exporter" Jun 17, 2026
Comment on lines +564 to +566
existing = next(
(cb for cb in in_memory_loggers if isinstance(cb, OpenTelemetryV2)), None
)

Why can't the callback factory just publish the provider?

… a list scan

The startup publish picked the global TracerProvider by scanning
_in_memory_loggers for the first OpenTelemetryV2, re-deriving an answer the
factory already settled: the first logger built registers itself as
proxy_server.open_telemetry_logger, and every other v2 path (guardrail, identity
seeding, phase spans) routes through that owner via _registered_v2_logger. Pass
that owner into select_global_otel_v2_logger so the global provider reuses the
same logger instead of an independent, order-dependent guess; the list scan
remains the SDK-path fallback. The owner is injected at the proxy call site to
keep the helper free of hidden global reads.
@yassin-berriai

@greptileai

owner="weave_otel",
),
],
# Weave consumes OpenInference + a small Weave-specific overlay.

Nit: owner should be an enum

The owner field carried free-form strings that had to match preset callback
names. Introduce a str-based ExporterOwner enum (values equal to the callback
names, so per-request credential routing's owner==callback_name comparison still
holds) and have each preset tag its exporter with the enum member.
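To illustrate why a str-based enum keeps the owner == callback_name comparison working, here is a small sketch. The two members shown are the ones visible in the review snippet of this PR; the full member list and any later renames are not reproduced here, so treat this as an illustrative subset.

```python
from enum import Enum


class ExporterOwner(str, Enum):
    """Illustrative subset; values equal the preset callback names."""

    ARIZE = "arize"
    ARIZE_PHOENIX = "arize_phoenix"


callback_name = "arize"  # plain string, as carried by the credential source

# Mixing in str makes each member compare equal to its bare value.
matches = ExporterOwner.ARIZE == callback_name
other = ExporterOwner.ARIZE_PHOENIX == callback_name

# A raw callback name can also be converted back to the member.
member = ExporterOwner(callback_name)
```

With a plain `Enum` (no `str` mixin) the first comparison would be False, so routing code that compares against raw callback-name strings would need to change.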
Comment on lines +902 to +906
registered = (
open_telemetry_logger
if isinstance(open_telemetry_logger, OpenTelemetryV2)
else None
)

why not use select_global_otel_v2_logger? tbh this can be a single method called select_and_publish_global_otel_provider

Comment on lines +33 to +34
ARIZE = "arize"
ARIZE_PHOENIX = "arize_phoenix"

pls make naming convention consistent

can match an exporter's owner against the credential source's callback name.
A ``str`` enum so the value compares equal to the bare callback-name string."""

ARIZE = "arize"

isn't this the same as Arize Phoenix?

Distinguish the hosted Arize AX backend from Arize Phoenix at the member level
while keeping the value 'arize' (the public callback name routing compares
against). Add a comment noting AX and Phoenix are separate backends.
@yassin-berriai yassin-berriai merged commit 1f93237 into litellm_internal_staging Jun 19, 2026
122 of 123 checks passed
@yassin-berriai yassin-berriai deleted the litellm_otel_v2_per_exporter_tenant_routing branch June 19, 2026 18:15

3 participants