fix: keep late mark spans in parent traces#296
Conversation
Cache completed span contexts in the OpenTelemetry and OpenInference event processors so mark events that arrive after their parent scope has ended still inherit the original trace context. Keep true orphan marks on the existing fallback path, bound the completed-context cache, and add regression coverage for late parented marks after tool scope completion. Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (4)
📜 Recent review details🧰 Additional context used📓 Path-based instructions (15)**/*.rs📄 CodeRabbit inference engine (.agents/skills/add-binding-feature/SKILL.md)
Files:
{crates/adaptive/**/*.rs,**/*test*.{rs,py,go,ts,js},**/*adaptive*test*.{rs,py,go,ts,js},docs/plugins/adaptive/**}📄 CodeRabbit inference engine (.agents/skills/maintain-optimizer/SKILL.md)
Files:
**/{Cargo.toml,**/*.rs}📄 CodeRabbit inference engine (.agents/skills/maintain-packaging/SKILL.md)
Files:
**/*.{h,hpp,c,cpp,rs}📄 CodeRabbit inference engine (.agents/skills/maintain-packaging/SKILL.md)
Files:
{crates/core,crates/adaptive}/**/*📄 CodeRabbit inference engine (.agents/skills/prepare-pr/SKILL.md)
Files:
**/*.{rs,toml}📄 CodeRabbit inference engine (.agents/skills/rename-surfaces/SKILL.md)
Files:
crates/core/**/*.rs📄 CodeRabbit inference engine (.agents/skills/test-go-binding/SKILL.md)
Files:
crates/{core,adaptive}/**📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)
Files:
**/*.{rs,py,js,ts,tsx,jsx,go,sh,toml,yaml,yml,md}📄 CodeRabbit inference engine (AGENTS.md)
Files:
**/*.{rs,py,go,js,ts,tsx}📄 CodeRabbit inference engine (AGENTS.md)
Files:
crates/**/*.rs📄 CodeRabbit inference engine (AGENTS.md)
Files:
**⚙️ CodeRabbit configuration file
Files:
crates/{core,adaptive}/**/*.rs⚙️ CodeRabbit configuration file
Files:
{crates/**/tests/**,python/tests/**,go/nemo_relay/**/*_test.go}⚙️ CodeRabbit configuration file
Files:
crates/core/src/observability/{atif,otel,openinference}.rs📄 CodeRabbit inference engine (.agents/skills/maintain-observability/SKILL.md)
Files:
🔇 Additional comments (5)
WalkthroughBoth ChangesCompleted span context cache for late-parented events
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/core/src/observability/openinference.rs`:
- Around line 543-544: In the process_start method, when removing an event UUID
from completed_span_contexts, also remove that same UUID from the
completed_span_order queue to maintain synchronization between both data
structures. This prevents stale queue entries from later evicting fresh context
entries and causing loss of parent linkage for completed spans. Additionally,
review and apply the same synchronization fix to any similar span context
removal operations in the code around lines 624-637 where the same issue may
exist.
In `@crates/core/src/observability/otel.rs`:
- Around line 537-538: The process_start method removes entries from
completed_span_contexts but does not remove the corresponding UUID from
completed_span_order queue, causing synchronization issues. When removing an
event UUID from completed_span_contexts, also remove it from
completed_span_order to keep both data structures synchronized. Apply the same
fix to all locations where entries are removed from completed_span_contexts
(including the sections noted around lines 615-628) to ensure the queue and map
remain consistent and prevent stale queue entries from evicting fresh contexts.
In `@crates/core/tests/unit/observability/openinference_tests.rs`:
- Around line 1754-1809: Add a new test function that verifies the bounded
eviction behavior of the completed span context cache. The test should create
and complete more than COMPLETED_SPAN_CONTEXT_LIMIT + 1 spans, then create mark
events parented to both the oldest (evicted) completed span and a recent
completed span. Add assertions to verify that the mark event for the oldest span
does not link to its parent (indicating cache eviction) while the mark event for
a recent span successfully reuses the completed parent trace context. This test
should follow the same pattern as the existing
late_parented_marks_reuse_completed_parent_trace_context test but specifically
validate the cache boundary behavior when the limit is exceeded.
In `@crates/core/tests/unit/observability/otel_tests.rs`:
- Around line 755-805: Add a new regression test after
late_parented_marks_reuse_completed_parent_trace_context that covers the bounded
eviction behavior. The test should create and complete
COMPLETED_SPAN_CONTEXT_LIMIT + 1 span events, then create mark events
referencing both the oldest completed span and a recent completed span. Verify
that the mark for the oldest span becomes a true orphan with distinct trace_id
from its parent and the orphan attribute set to true, while the mark for a
recent span still reuses the parent's trace_id and has the correct
parent_span_id relationship. This ensures the cache boundary behavior is
properly tested alongside the happy path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: de2008ad-b8f9-464a-8cd7-c94facab5ea2
📒 Files selected for processing (4)
crates/core/src/observability/openinference.rscrates/core/src/observability/otel.rscrates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rs
📜 Review details
⏰ Context from checks skipped due to timeout. (2)
- GitHub Check: Check / Run
- GitHub Check: Preview docs
🧰 Additional context used
📓 Path-based instructions (15)
**/*.rs
📄 CodeRabbit inference engine (.agents/skills/add-binding-feature/SKILL.md)
Use
snake_casenaming convention for Rust identifiers (e.g.,nemo_relay_tool_call)
**/*.rs: Any Rust change must runjust test-rust
Any Rust change must runcargo fmt --all
Any Rust change must runcargo clippy --workspace --all-targets -- -D warnings
**/*.rs: Runcargo fmt --allfor all FFI work since it is Rust work
Runjust test-rustto validate FFI changes
Runcargo clippy --workspace --all-targets -- -D warningsto enforce strict linting on FFI workWhen Rust files changed as part of Go work, also run
cargo fmt --all,just test-rust, andcargo clippy --workspace --all-targets -- -D warnings
**/*.rs: Runcargo fmt --allwhen Rust files are changed as part of Node work
Runcargo clippy --workspace --all-targets -- -D warningswhen Rust files are changed as part of Node work
Runjust test-rustwhen Rust files are changed as part of Node work
**/*.rs: Runcargo fmt --allto format all Rust code
Runcargo clippy --workspace --all-targets -- -D warningsto enforce all clippy lints as errors
**/*.rs: Runcargo fmt --allwhen Rust files changed as part of WebAssembly work
Runcargo clippy --workspace --all-targets -- -D warningswhen Rust files changed as part of WebAssembly work
**/*.rs: If any Rust code changed, always runjust test-rust
If any Rust code changed, also runcargo fmt --all
If any Rust code changed, also runcargo clippy --workspace --all-targets -- -D warnings
Run Rust formatting withcargo fmt --all
Run Rust linting withcargo clippy --workspace --all-targets -- -D warnings
**/*.rs: Usecargo fmtfor Rust code formatting
Runcargo clippy -- -D warningsto lint Rust code and treat all warnings as errors
Use Rust snake_case naming convention for Rust identifiers
Include SPDX license header in all Rust source files using double-slash comment syntax
Validate Rust code withuv run pre-commit run --all-filesto enforce cargo fmt formatting check, cargo clippy lints, and cargo deny aud...
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
{crates/adaptive/**/*.rs,**/*test*.{rs,py,go,ts,js},**/*adaptive*test*.{rs,py,go,ts,js},docs/plugins/adaptive/**}
📄 CodeRabbit inference engine (.agents/skills/maintain-optimizer/SKILL.md)
Maintain documented and tested validation and report behavior for adaptive surfaces
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rs
**/{Cargo.toml,**/*.rs}
📄 CodeRabbit inference engine (.agents/skills/maintain-packaging/SKILL.md)
Maintain consistency between Rust package names in
Cargo.tomland their actual usage across the codebase
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
**/*.{h,hpp,c,cpp,rs}
📄 CodeRabbit inference engine (.agents/skills/maintain-packaging/SKILL.md)
Ensure FFI header and library naming follows consistent conventions across platform-specific builds
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
{crates/core,crates/adaptive}/**/*
📄 CodeRabbit inference engine (.agents/skills/prepare-pr/SKILL.md)
Changes to
crates/coreorcrates/adaptivemust run the full language matrix
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
**/*.{rs,toml}
📄 CodeRabbit inference engine (.agents/skills/rename-surfaces/SKILL.md)
Update Rust crate names and module prefixes during coordinated rename operations
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
crates/core/**/*.rs
📄 CodeRabbit inference engine (.agents/skills/test-go-binding/SKILL.md)
If the change touched
crates/coreor shared runtime semantics, also usevalidate-changefor broader validation
crates/core/**/*.rs: UseJson = serde_json::Valuein Rust-facing runtime APIs where the existing code expects JSON payloads.
UseResult<T>withFlowErrorin core runtime paths. Keep errors explicit and binding-appropriate at the wrapper layer.
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
crates/{core,adaptive}/**
📄 CodeRabbit inference engine (.agents/skills/validate-change/SKILL.md)
If
crates/coreorcrates/adaptivechanged, run the full matrix across Rust, Python, Go, Node.js, and WebAssembly
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
**/*.{rs,py,js,ts,tsx,jsx,go,sh,toml,yaml,yml,md}
📄 CodeRabbit inference engine (AGENTS.md)
Keep SPDX headers on source, docs, scripts, and configuration files. The project is Apache-2.0.
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
**/*.{rs,py,go,js,ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
Follow binding naming conventions: Rust and Python use
snake_case, C FFI exports prefixednemo_relay_, Go usesPascalCasefor public APIs, Node.js usescamelCase.
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
crates/**/*.rs
📄 CodeRabbit inference engine (AGENTS.md)
crates/**/*.rs: Keep async behavior on the existing tokio-based model. Bindings should preserve callback and future lifetimes rather than blocking or hiding async work unexpectedly.
UseJson = serde_json::Valuein Rust-facing runtime APIs for JSON payload handling.
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
**
⚙️ CodeRabbit configuration file
**:AGENTS.md
This file provides guidance to agents, including Claude Code and OpenAI Codex, when working in this repository.
Project Overview
NeMo Relay is a multi-language agent runtime framework for execution scopes, lifecycle events, middleware, plugins, and observability around tool and LLM calls. The core runtime is Rust. Primary supported bindings are Rust, Python, and Node.js. Go, WebAssembly, and the raw C FFI are experimental and source-first.
The shared runtime model is:
- Scope stacks decide where work belongs and which scope-local behavior is visible.
- Middleware registries decide what guardrails and intercepts run around managed calls.
- Plugins install reusable runtime behavior from configuration.
- Events record runtime behavior in ATOF form.
- Subscribers and exporters consume events in-process or export them to ATIF, OpenTelemetry, OpenInference, or other backends.
Repository Structure
The repository layout separates the Rust runtime, language bindings, documentation,
integration patches, and agent-facing skills.crates/ core/ # Rust core runtime crate, published as nemo-relay adaptive/ # Adaptive runtime primitives and plugin components python/ # PyO3 native extension for the Python package ffi/ # Raw C ABI layer used by downstream bindings such as Go node/ # NAPI Node.js binding and JavaScript/TypeScript entry points wasm/ # wasm-bindgen WebAssembly binding and JS wrappers python/ nemo_relay/ # Python wrapper package: scopes, tools, LLM, middleware, typed helpers, plugins, adaptive helpers tests/ # Python tests go/ nemo_relay/ # Experimental Go CGo binding and tests fern/ # Fern documentation site scripts/ # Stable wrappers and helper scripts; build/test/docs entry points live in justfile third_party/ # P...
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
crates/{core,adaptive}/**/*.rs
⚙️ CodeRabbit configuration file
crates/{core,adaptive}/**/*.rs: Review the Rust runtime for async correctness, scope isolation, middleware ordering, and event lifecycle regressions.
Pay close attention to task-local/thread-local scope propagation, callback lifetimes, stream finalization, and root_uuid isolation.
Public API changes should preserve existing behavior unless tests and docs show the intended migration path.
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rscrates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
{crates/**/tests/**,python/tests/**,go/nemo_relay/**/*_test.go}
⚙️ CodeRabbit configuration file
{crates/**/tests/**,python/tests/**,go/nemo_relay/**/*_test.go}: Tests should cover the behavior promised by the changed API surface, including error paths and cross-request isolation where relevant.
Prefer assertions on lifecycle events, scope stacks, middleware ordering, and binding parity over shallow smoke tests.
Files:
crates/core/tests/unit/observability/openinference_tests.rscrates/core/tests/unit/observability/otel_tests.rs
crates/core/src/observability/{atif,otel,openinference}.rs
📄 CodeRabbit inference engine (.agents/skills/maintain-observability/SKILL.md)
When changing event fields in ATIF, OpenTelemetry, or OpenInference observability surfaces, keep the core event model in
crates/core/src/observability/atif.rs,crates/core/src/observability/otel.rs, andcrates/core/src/observability/openinference.rsin sync
Files:
crates/core/src/observability/openinference.rscrates/core/src/observability/otel.rs
🔇 Additional comments (3)
crates/core/src/observability/openinference.rs (2)
19-19: 📐 Maintainability & Code QualityConfirm the required Rust/core validation.
Please confirm this PR ran the required Rust checks plus the crates/core language matrix, not just the focused observability tests.
As per coding guidelines, “Any Rust change must run
just test-rust”, “Run Rust formatting withcargo fmt --all”, and “Run Rust linting withcargo clippy --workspace --all-targets -- -D warnings”.
As per path instructions, “Changes tocrates/coreorcrates/adaptivemust run the full language matrix”.Sources: Coding guidelines, Path instructions
48-49: LGTM!Also applies to: 503-517, 561-561, 597-604
crates/core/src/observability/otel.rs (1)
19-19: LGTM!Also applies to: 42-43, 497-511, 555-555, 588-595
mnajafian-nv
left a comment
There was a problem hiding this comment.
LGTM, pending addressing inline suggestions and green CI :)
willkill07
left a comment
There was a problem hiding this comment.
Coderabbit feedback seems sensible to resolve. Otherwise LGTM.
Remove completed span UUIDs from both the context map and FIFO order queue when a span restarts, preventing stale queue entries from later evicting fresh parent contexts. Add regression coverage for restart synchronization and OpenInference completed-context cache eviction behavior. Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
|
Ack @mnajafian-nv will check for conflicts on those PRs when this is in and comment if anything looks off |
|
/merge |
|
Thanks @willkill07 |
Overview
This PR keeps late mark spans attached to the trace of their completed parent scope in the OpenTelemetry and OpenInference exporters. When a mark event arrives after the parent span has already ended, the exporters now reuse the completed parent's span context instead of starting a new root trace.
Details
Validation:
cargo fmt --checkcargo test -p nemo-relay observability::otel::tests::cargo test -p nemo-relay observability::openinference::tests::cargo test -p nemo-relay late_parented_marks_reuse_completed_parent_trace_contextWhere should the reviewer start?
Start with
crates/core/src/observability/otel.rsandcrates/core/src/observability/openinference.rs, specifically the parent context lookup and completed span context cache. The matching regression tests are incrates/core/tests/unit/observability/otel_tests.rsandcrates/core/tests/unit/observability/openinference_tests.rs.Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit
Bug Fixes
Tests