
feat: centralize logging and otel export #20

Merged
JaeAeich merged 3 commits into dev from log
Mar 9, 2026

JaeAeich (Owner) commented Mar 8, 2026

Description

  • Fixes #(issue number)

Comments

Summary by Sourcery

Centralize tracing and logging setup across API and engine services, enabling optional OpenTelemetry export and improved workflow run observability.

New Features:

  • Introduce a shared telemetry crate that configures tracing, supports JSON log output, and optionally exports spans to an OTLP endpoint.
  • Propagate an API span identifier through NATS run requests so engine workflow spans can be correlated back to the originating API call.

Enhancements:

  • Refine workflow and cancellation logging in the engine and API with structured spans, additional context fields, and clearer log levels.
  • Initialize tracing in the engine bootstrap and API startup using the shared telemetry initialization function.
  • Adjust HTTP tracing on API routes to log responses at info level and failures at error level, and treat certain database write failures as errors instead of warnings.

Build:

  • Add a new telemetry crate to the workspace and wire shared OpenTelemetry and tracing dependencies through workspace configuration.

Documentation:

  • Document engine internals and crate structure for developers, including how to add custom engines.
  • Add tracing documentation covering span structure, configuration via environment variables, NATS trace correlation, and OTEL export setup.
  • Simplify and cross-link engine configuration and architecture docs to point to the new engine internals documentation.

vercel Bot commented Mar 8, 2026

Deployment status: project metis is Ready (updated Mar 9, 2026 1:00am UTC).

sourcery-ai Bot commented Mar 8, 2026

Reviewer's Guide

Centralizes tracing/logging configuration via a new telemetry crate, wires OTEL export into API and engine binaries, and enriches workflow/run lifecycle tracing (including NATS correlation) while updating docs to describe the new internals and tracing model.

Sequence diagram for create_run to engine workflow_run with trace correlation

sequenceDiagram
    actor User
    participant Api as metis_api
    participant RunService
    participant Nats as NATS
    participant Engine as Engine_runtime

    User->>Api: HTTP POST /runs
    Api->>RunService: create_run(run_id, user_id, req)
    note over RunService: #[tracing::instrument] creates span create_run

    RunService->>RunService: EngineRequestValidator::new(engine_config)
    RunService->>RunService: validate(req)

    RunService->>RunService: span_id = Span::current().id()
    RunService->>RunService: trace_id = format(span_id.into_u64())

    RunService->>Nats: publish_run(topic, RunRequestMessage{run_id, request, user_id, trace_id})
    RunService->>RunService: info!("Run created and dispatched")
    RunService-->>Api: Ok(())

    Nats-->>Engine: RunRequestMessage{run_id, request, user_id, trace_id}

    Engine->>Engine: EngineRuntime::handle_run_message(...)
    Engine->>Engine: run(run_id, user_id, request, trace_id)
    note over Engine: info_span workflow_run created with api_trace_id field

    Engine->>Engine: run_inner(run_id, user_id, request)
    Engine->>Engine: state=initializing → building → executing → running → parsing → updating_db → cleanup
    Engine->>Engine: duration_ms recorded
    Engine->>Engine: on error: record error field and log "Workflow execution failed"

    Engine-->>Nats: completion/heartbeat events (unchanged)
    Engine-->>Api: DB state updated, visible via existing endpoints
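
The correlation step in the diagram above can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `span_id_to_trace_id` and `build_message` are hypothetical helpers, the `RunRequestMessage` here omits the `request` field, and a plain `u64` stands in for the value the diagram obtains from `span_id.into_u64()`.

```rust
/// Simplified stand-in for the message published over NATS.
/// The real struct also carries a `request` field, omitted here.
#[derive(Debug, Clone, PartialEq)]
struct RunRequestMessage {
    run_id: String,
    user_id: String,
    /// Hex-encoded span ID of the originating API call, if one was active.
    trace_id: Option<String>,
}

/// Hypothetical helper: render a raw span ID as lowercase hex.
fn span_id_to_trace_id(span_id: Option<u64>) -> Option<String> {
    span_id.map(|id| format!("{:x}", id))
}

/// Hypothetical helper: build the message the API side would publish.
fn build_message(run_id: &str, user_id: &str, span_id: Option<u64>) -> RunRequestMessage {
    RunRequestMessage {
        run_id: run_id.to_string(),
        user_id: user_id.to_string(),
        trace_id: span_id_to_trace_id(span_id),
    }
}
```

Keeping `trace_id` as an `Option` lets the engine accept messages that were published without an active span; in that case the `api_trace_id` field would simply stay unset.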

Sequence diagram for run cancellation tracing

sequenceDiagram
    actor User
    participant Api as metis_api
    participant RunService
    participant Nats as NATS
    participant Engine as Engine_runtime

    User->>Api: HTTP POST /runs/{id}/cancel
    Api->>RunService: request_cancel(id)
    note over RunService: #[tracing::instrument] span request_cancel with run_id

    RunService->>RunService: run = repo.find_by_id(id)
    alt run.state == Queued
        RunService->>RunService: update_state_if(Queued -> Canceled)
        RunService->>RunService: info!("Queued run canceled immediately")
        RunService-->>Api: Ok(None)
    else run.state == Running
        RunService->>RunService: engine_id = run.engine_id
        RunService->>Nats: publish_cancel(engine_id, run_id)
        RunService->>RunService: info!("Cancel signal dispatched to engine")
        RunService-->>Api: Ok(Some(engine_id))

        Nats-->>Engine: cancel message
        Engine->>Engine: EngineRuntime::cancel(run_id)
        note over Engine: info_span workflow_cancel
        Engine->>Engine: mark run_id as cancelled
        Engine->>Engine: pid = pid_store.get(run_id)
        alt pid exists
            Engine->>Engine: ProcessExecutor::cancel(pid)
            Engine->>Engine: pid_store.remove(run_id)
            Engine->>Engine: info!("Workflow cancellation completed")
        else pid missing
            Engine->>Engine: warn!("No PID found for run_id...")
        end
    else
        RunService-->>Api: Ok(None)  %% no-op for terminal states
    end
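
The API-side branching in the cancellation diagram can be summarized as a small state match. The sketch below is an assumption-laden illustration: `Run`, `RunState::Complete`, and the free function `request_cancel` are hypothetical stand-ins, the repository lookup and NATS publish are left out, and the handling of a running run without an engine ID is a guess rather than something the PR states.

```rust
/// Run states named in the diagram, plus a hypothetical terminal state.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RunState {
    Queued,
    Running,
    Canceled,
    Complete, // assumption: stands in for any terminal state
}

/// Hypothetical, trimmed-down run record.
struct Run {
    state: RunState,
    engine_id: Option<String>,
}

/// Returns the engine ID that should receive a cancel signal, if any.
/// Mirrors the diagram: Queued runs are canceled in place, Running runs
/// yield their engine ID, and every other state is a no-op.
fn request_cancel(run: &mut Run) -> Option<String> {
    match run.state {
        RunState::Queued => {
            // No engine has picked the run up yet, so flip the state directly.
            run.state = RunState::Canceled;
            None
        }
        // The caller would publish a cancel message to this engine.
        RunState::Running => run.engine_id.clone(),
        // Terminal states: nothing to cancel.
        RunState::Canceled | RunState::Complete => None,
    }
}
```

Returning `Option<String>` matches the `Ok(None)` / `Ok(Some(engine_id))` results shown in the diagram, so the caller can tell whether a signal still has to be dispatched.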

Class diagram for telemetry initialization and engine runtime changes

classDiagram
    class Telemetry {
        +init_tracing(service_name: &'static str) void
        +tracing
        +tracing_subscriber
    }

    class ApiTracingModule {
        +init_tracing() void
    }

    class ApiMain {
        +main() Result
    }

    class EngineServer {
        +bootstrap(engine: Engine, name: &'static str) EngineResult
    }

    class EngineRuntime {
        +config() &Arc_FullEngineConfig
        +run(run_id: Uuid, user_id: String, req: ValidatedRunRequest, trace_id: Option_String) Result_RunSummary_EngineError
        +run_inner(run_id: Uuid, user_id: String, req: ValidatedRunRequest) Result_RunSummary_EngineError
        +cancel(run_id: &str) Result_void_EngineError
    }

    class RunService {
        +create_run(run_id: &str, user_id: &str, req: CreateRunRequest) ServiceResult_void
        +request_cancel(id: &RunId) ServiceResult_Option_String
    }

    class RunRequestMessage {
        +run_id: String
        +request: ValidatedRunRequest
        +user_id: String
        +trace_id: Option_String
    }

    class Engine {
        <<trait>>
        +new() Self
        +get_workflow_results() Result_HashMap_Category_Files
        +get_task_logs() Result_Vec_TaskLog
    }

    Telemetry <|.. ApiTracingModule : uses
    ApiTracingModule <|.. ApiMain : calls
    Telemetry <|.. EngineServer : uses

    EngineServer o--> EngineRuntime
    EngineRuntime ..> RunRequestMessage
    RunService ..> RunRequestMessage

    Engine <|.. GenericEngine

    class GenericEngine {
        +new() Self
    }
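
The split between `run` and `run_inner` shown in the class diagram suggests a wrapper that owns the span bookkeeping while the inner function does the work. The following std-only sketch illustrates that shape under stated assumptions: `SpanRecord` is a hypothetical stand-in for the `workflow_run` span fields, the `fail` flag and the error text are invented for the example, and the real signatures take a run ID, user ID, and validated request.

```rust
use std::time::Instant;

/// Hypothetical stand-in for the fields recorded on the workflow_run span.
#[derive(Debug, Default)]
struct SpanRecord {
    api_trace_id: Option<String>,
    states: Vec<&'static str>,
    duration_ms: Option<u128>,
    error: Option<String>,
}

/// Inner function: performs the work and records state transitions.
fn run_inner(span: &mut SpanRecord, fail: bool) -> Result<(), String> {
    for state in ["initializing", "building", "executing"] {
        span.states.push(state);
    }
    if fail {
        // Invented failure used only to exercise the error path.
        return Err("workflow failed".to_string());
    }
    for state in ["running", "parsing", "updating_db", "cleanup"] {
        span.states.push(state);
    }
    Ok(())
}

/// Outer wrapper: sets up the record, times the call, and captures errors.
fn run(trace_id: Option<String>, fail: bool) -> (Result<(), String>, SpanRecord) {
    let mut span = SpanRecord {
        api_trace_id: trace_id,
        ..Default::default()
    };
    let started = Instant::now();
    let result = run_inner(&mut span, fail);
    span.duration_ms = Some(started.elapsed().as_millis());
    if let Err(err) = &result {
        span.error = Some(err.clone());
    }
    (result, span)
}
```

With this arrangement the duration and error are recorded in one place regardless of where `run_inner` returns early.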

File-Level Changes

Change: Introduce shared telemetry crate and wire it into API and engine binaries for centralized tracing and optional OTEL export.
Details:
  • Add new telemetry crate that configures tracing subscribers, JSON vs text format via LOG_FORMAT, and optional OTLP exporter via OTEL_EXPORTER_OTLP_ENDPOINT
  • Expose init_tracing(service_name) plus re-exports of tracing and tracing_subscriber from the telemetry crate
  • Update workspace Cargo.toml to include telemetry crate and OTEL-related dependencies
  • Use init_tracing in metis-api main and in engine bootstrap, and plumb static service name into generic engine main
  • Drop direct tracing-subscriber dependency from API and engine crates in favor of telemetry
Files:
  • Cargo.toml
  • crates/telemetry/Cargo.toml
  • crates/telemetry/src/lib.rs
  • crates/api/Cargo.toml
  • crates/api/src/main.rs
  • crates/engine/Cargo.toml
  • crates/engine/src/lib.rs
  • crates/engine/src/server.rs
  • crates/engines/generic/src/main.rs
  • crates/api/src/tracing.rs

Change: Enhance structured tracing around run creation, workflow execution, and cancellation, including API↔engine correlation via NATS messages.
Details:
  • Add #[tracing::instrument] to RunService::create_run and request_cancel with key span fields for run/user/workflow metadata
  • Capture current API span ID as hex trace_id and embed it in RunRequestMessage sent over NATS
  • Extend RunRequestMessage with an optional trace_id field shared between API and engine
  • Change EngineRuntime::run signature to accept an optional trace_id, introduce run_inner, and create a workflow_run span that records state, duration, api_trace_id, and error
  • Update internal state transitions in run_inner to record onto the current span instead of a local span handle
  • Wrap EngineRuntime::cancel body in a workflow_cancel span via .instrument
  • Promote several database write failures from warn! to error! for clearer signal
Files:
  • crates/api/src/services/run.rs
  • crates/common/src/models.rs
  • crates/engine/src/runtime.rs

Change: Adjust HTTP tracing configuration for the API and refine documentation to describe engine internals and tracing behavior.
Details:
  • Customize tower_http TraceLayer to log HTTP spans/responses at INFO and failures at ERROR
  • Add new docs page describing tracing setup, span taxonomy, correlation via api_trace_id, and OTEL export configuration
  • Add new docs page for engine internals and crate layout, and point previous architecture/engine-configuration content to it
  • Update engine-configuration and architecture docs to reference the new Engine Internals docs instead of duplicating crate and engine descriptions
  • Extend docs sidebar with Dev section linking to Engine Internals and Tracing pages
Files:
  • crates/api/src/routes.rs
  • docs/dev/tracing.md
  • docs/dev/engine.md
  • docs/engine-configuration.md
  • docs/architecture.md
  • docs/.vitepress/config.mjs
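
The environment-driven configuration described for the telemetry crate could be resolved roughly as follows. This is a sketch only: `TelemetryConfig`, `LogFormat`, and `config_from` are hypothetical names, and the assumption that the value `json` (case-insensitive) selects JSON output while anything else means text is not confirmed by the PR. Only the variable names `LOG_FORMAT` and `OTEL_EXPORTER_OTLP_ENDPOINT` come from the change list.

```rust
/// Hypothetical output formats selectable via LOG_FORMAT.
#[derive(Debug, PartialEq)]
enum LogFormat {
    Text,
    Json,
}

/// Hypothetical resolved settings for tracing initialization.
#[derive(Debug, PartialEq)]
struct TelemetryConfig {
    format: LogFormat,
    /// OTLP export is enabled only when an endpoint is present.
    otlp_endpoint: Option<String>,
}

/// Resolve settings from a key lookup, so the logic is testable
/// without touching the process environment.
fn config_from(lookup: impl Fn(&str) -> Option<String>) -> TelemetryConfig {
    // Assumption: "json" selects JSON logs; any other value means text.
    let format = match lookup("LOG_FORMAT").as_deref() {
        Some(v) if v.eq_ignore_ascii_case("json") => LogFormat::Json,
        _ => LogFormat::Text,
    };
    // Treat an empty or whitespace-only endpoint as "not configured".
    let otlp_endpoint =
        lookup("OTEL_EXPORTER_OTLP_ENDPOINT").filter(|v| !v.trim().is_empty());
    TelemetryConfig { format, otlp_endpoint }
}
```

A service would call it as `config_from(|key| std::env::var(key).ok())`; injecting the lookup keeps the parsing logic independent of global state.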


@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 3 issues, and left some high level feedback:

  • The telemetry crate is declared with edition = "2024", which is not yet a stable Rust edition; consider aligning it with the rest of the workspace (likely edition = "2021") to avoid toolchain issues.
  • The engine crate still uses tracing macros (e.g., info!, error!) but its Cargo.toml no longer declares a direct tracing dependency; either re-add tracing as a dependency or import it via telemetry::tracing and adjust the use statements accordingly.
  • In telemetry::init_tracing, failing OTEL exporter construction currently expects and panics; you might want to downgrade this to a logged warning and fall back to local logging so a misconfigured OTEL endpoint does not crash services at startup.
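
The fallback suggested in the last point can be illustrated without any OpenTelemetry dependency. In this sketch `ExportMode` and `choose_export_mode` are hypothetical, and the `build` closure stands in for whatever constructs the real exporter; the point is only the control flow of warning and continuing instead of calling `expect`.

```rust
/// Hypothetical outcome of telemetry initialization.
#[derive(Debug, PartialEq)]
enum ExportMode {
    /// Logs and spans stay local (stdout/stderr only).
    LocalOnly,
    /// Spans are additionally exported to the given OTLP endpoint.
    Otlp(String),
}

/// Decide the export mode, degrading to local logging on failure.
/// `build` is a stand-in for the real exporter constructor.
fn choose_export_mode(
    endpoint: Option<&str>,
    build: impl Fn(&str) -> Result<(), String>,
) -> ExportMode {
    match endpoint {
        None => ExportMode::LocalOnly,
        Some(ep) => match build(ep) {
            Ok(()) => ExportMode::Otlp(ep.to_string()),
            Err(err) => {
                // Warn instead of panicking so a bad endpoint
                // does not take the service down at startup.
                eprintln!(
                    "warning: OTEL exporter unavailable ({}); using local logging only",
                    err
                );
                ExportMode::LocalOnly
            }
        },
    }
}
```

The trade-off is that a misconfigured endpoint becomes a silent loss of exported spans unless the warning is noticed, so some deployments may still prefer a hard failure.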
## Individual Comments

### Comment 1
<location path="crates/telemetry/Cargo.toml" line_range="10" />
<code_context>
+tracing-subscriber = { features = [ "env-filter", "json" ], workspace = true }
+
+[package]
+edition = "2024"
+name = "telemetry"
+version = "0.1.0"
</code_context>
<issue_to_address>
**issue (bug_risk):** The `edition = "2024"` setting will not compile on current stable Rust toolchains.

Rust 2021 is the latest stable edition; `2024` is not yet supported and will cause Cargo to error on all stable toolchains. Unless you’re targeting a specific nightly with 2024 support, this should be set to `"2021"` to avoid build failures for consumers.
</issue_to_address>

### Comment 2
<location path="crates/telemetry/src/lib.rs" line_range="18-25" />
<code_context>
+            .with_endpoint(endpoint)
+            .build()
+            .expect("Failed to build OTEL exporter");
+        let tracer_provider = opentelemetry_sdk::trace::TracerProvider::builder()
+            .with_batch_exporter(exporter, opentelemetry_sdk::runtime::Tokio)
+            .build();
</code_context>
<issue_to_address>
**suggestion:** The tracer provider is not configured with a resource, so spans may lack standard service metadata.

Consider building the `TracerProvider` with a `Resource` that sets `service.name` (and any other relevant attributes) from `service_name`, so traces from different binaries are distinguishable in backends. For example:

```rust
let resource = opentelemetry_sdk::Resource::new(vec![
    opentelemetry::KeyValue::new("service.name", service_name),
]);

let tracer_provider = opentelemetry_sdk::trace::TracerProvider::builder()
    .with_resource(resource)
    .with_batch_exporter(exporter, opentelemetry_sdk::runtime::Tokio)
    .build();
```

```suggestion
        let exporter = opentelemetry_otlp::SpanExporter::builder()
            .with_tonic()
            .with_endpoint(endpoint)
            .build()
            .expect("Failed to build OTEL exporter");

        let resource = opentelemetry_sdk::Resource::new(vec![
            opentelemetry::KeyValue::new("service.name", service_name),
        ]);

        let tracer_provider = opentelemetry_sdk::trace::TracerProvider::builder()
            .with_resource(resource)
            .with_batch_exporter(exporter, opentelemetry_sdk::runtime::Tokio)
            .build();
```
</issue_to_address>

### Comment 3
<location path="docs/dev/engine.md" line_range="15" />
<code_context>
+
+## Engine Trait
+
+New engines implement two methods:
+
+```rust
</code_context>
<issue_to_address>
**issue (typo):** The text says "two methods" but the trait example shows three methods.

This wording is inconsistent with the `Engine` trait example, which defines three methods (`new`, `get_workflow_results`, `get_task_logs`). Please update the sentence to match the actual number of methods or refer to “the following methods.”

```suggestion
New engines implement the following methods:
```
</issue_to_address>


Comment thread crates/telemetry/Cargo.toml
Comment thread crates/telemetry/src/lib.rs Outdated
Comment thread docs/dev/engine.md Outdated
@JaeAeich JaeAeich merged commit adbbe22 into dev Mar 9, 2026
9 checks passed