feat: centralize logging and otel export by JaeAeich · Pull Request #20 · JaeAeich/metis

JaeAeich · 2026-03-08T06:25:35Z

Description

Fixes #(issue number)

Comments

Summary by Sourcery

Centralize tracing and logging setup across API and engine services, enabling optional OpenTelemetry export and improved workflow run observability.

New Features:

Introduce a shared telemetry crate that configures tracing, supports JSON log output, and optionally exports spans to an OTLP endpoint.
Propagate an API span identifier through NATS run requests so engine workflow spans can be correlated back to the originating API call.

Enhancements:

Refine workflow and cancellation logging in the engine and API with structured spans, additional context fields, and clearer log levels.
Initialize tracing in the engine bootstrap and API startup using the shared telemetry initialization function.
Adjust HTTP tracing on API routes to log responses at info level and failures at error level, and treat certain database write failures as errors instead of warnings.

Build:

Add a new telemetry crate to the workspace and wire shared OpenTelemetry and tracing dependencies through workspace configuration.

Documentation:

Document engine internals and crate structure for developers, including how to add custom engines.
Add tracing documentation covering span structure, configuration via environment variables, NATS trace correlation, and OTEL export setup.
Simplify and cross-link engine configuration and architecture docs to point to the new engine internals documentation.

vercel · 2026-03-08T06:25:39Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| metis | Ready | Preview, Comment | Mar 9, 2026 1:00am |

sourcery-ai · 2026-03-08T06:25:41Z

Reviewer's Guide

Centralizes tracing/logging configuration via a new telemetry crate, wires OTEL export into API and engine binaries, and enriches workflow/run lifecycle tracing (including NATS correlation) while updating docs to describe the new internals and tracing model.

Sequence diagram for create_run to engine workflow_run with trace correlation

sequenceDiagram
    actor User
    participant Api as metis_api
    participant RunService
    participant Nats as NATS
    participant Engine as Engine_runtime

    User->>Api: HTTP POST /runs
    Api->>RunService: create_run(run_id, user_id, req)
    note over RunService: #[tracing::instrument] creates span create_run

    RunService->>RunService: EngineRequestValidator::new(engine_config)
    RunService->>RunService: validate(req)

    RunService->>RunService: span_id = Span::current().id()
    RunService->>RunService: trace_id = format(span_id.into_u64())

    RunService->>Nats: publish_run(topic, RunRequestMessage{run_id, request, user_id, trace_id})
    RunService->>RunService: info!("Run created and dispatched")
    RunService-->>Api: Ok(())

    Nats-->>Engine: RunRequestMessage{run_id, request, user_id, trace_id}

    Engine->>Engine: EngineRuntime::handle_run_message(...)
    Engine->>Engine: run(run_id, user_id, request, trace_id)
    note over Engine: info_span workflow_run created with api_trace_id field

    Engine->>Engine: run_inner(run_id, user_id, request)
    Engine->>Engine: state=initializing → building → executing → running → parsing → updating_db → cleanup
    Engine->>Engine: duration_ms recorded
    Engine->>Engine: on error: record error field and log "Workflow execution failed"

    Engine-->>Nats: completion/heartbeat events (unchanged)
    Engine-->>Api: DB state updated, visible via existing endpoints
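
The correlation step in the diagram above can be sketched roughly as follows. This is a minimal, std-only illustration, not the PR's code: the real `RunRequestMessage` also carries the validated request, and `format_trace_id`, `build_run_message`, and `api_trace_id_field` are hypothetical helper names. The span ID is passed in as a plain `u64`, standing in for the value obtained from `Span::current().id()`.

```rust
/// Simplified, hypothetical stand-in for the message published over NATS.
/// The real message also carries the validated run request.
#[derive(Debug, Clone, PartialEq)]
pub struct RunRequestMessage {
    pub run_id: String,
    pub user_id: String,
    /// Hex-encoded API span ID; `None` when no span was active.
    pub trace_id: Option<String>,
}

/// Formats a numeric span ID as lowercase hex (assumed encoding).
pub fn format_trace_id(span_id: Option<u64>) -> Option<String> {
    span_id.map(|id| format!("{id:x}"))
}

/// API side: builds the message that would be published for a run.
pub fn build_run_message(run_id: &str, user_id: &str, span_id: Option<u64>) -> RunRequestMessage {
    RunRequestMessage {
        run_id: run_id.to_owned(),
        user_id: user_id.to_owned(),
        trace_id: format_trace_id(span_id),
    }
}

/// Engine side: the value that would be recorded as `api_trace_id` on the
/// `workflow_run` span; an empty string when the API sent no trace ID.
pub fn api_trace_id_field(msg: &RunRequestMessage) -> &str {
    msg.trace_id.as_deref().unwrap_or("")
}
```

Keeping `trace_id` optional means messages published without an active span still deserialize and run; the engine span simply carries an empty correlation field in that case.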

Sequence diagram for run cancellation tracing

sequenceDiagram
    actor User
    participant Api as metis_api
    participant RunService
    participant Nats as NATS
    participant Engine as Engine_runtime

    User->>Api: HTTP POST /runs/{id}/cancel
    Api->>RunService: request_cancel(id)
    note over RunService: #[tracing::instrument] span request_cancel with run_id

    RunService->>RunService: run = repo.find_by_id(id)
    alt run.state == Queued
        RunService->>RunService: update_state_if(Queued -> Canceled)
        RunService->>RunService: info!("Queued run canceled immediately")
        RunService-->>Api: Ok(None)
    else run.state == Running
        RunService->>RunService: engine_id = run.engine_id
        RunService->>Nats: publish_cancel(engine_id, run_id)
        RunService->>RunService: info!("Cancel signal dispatched to engine")
        RunService-->>Api: Ok(Some(engine_id))

        Nats-->>Engine: cancel message
        Engine->>Engine: EngineRuntime::cancel(run_id)
        note over Engine: info_span workflow_cancel
        Engine->>Engine: mark run_id as cancelled
        Engine->>Engine: pid = pid_store.get(run_id)
        alt pid exists
            Engine->>Engine: ProcessExecutor::cancel(pid)
            Engine->>Engine: pid_store.remove(run_id)
            Engine->>Engine: info!("Workflow cancellation completed")
        else pid missing
            Engine->>Engine: warn!("No PID found for run_id...")
        end
    else run already in a terminal state (no-op)
        RunService-->>Api: Ok(None)
    end
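
The API-side branching in the cancellation diagram can be sketched as a pure decision function. This is an assumption-laden illustration: `CancelAction`, `plan_cancel`, and `api_result` are hypothetical names, the `Completed` and `Failed` variants are guesses at what the terminal states look like, and treating a running run without an engine ID as a no-op is a choice made here, not something the PR states.

```rust
/// Run states; `Completed` and `Failed` are assumed terminal states.
#[derive(Debug, Clone, PartialEq)]
pub enum RunState {
    Queued,
    Running,
    Completed,
    Failed,
    Canceled,
}

/// Minimal view of a run record (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Run {
    pub state: RunState,
    pub engine_id: Option<String>,
}

/// What the service should do for a cancel request.
#[derive(Debug, PartialEq)]
pub enum CancelAction {
    /// Queued run: flip Queued -> Canceled in the database directly.
    CancelImmediately,
    /// Running run: publish a cancel message to this engine.
    SignalEngine(String),
    /// Terminal state (or no engine to signal): nothing to do.
    Noop,
}

/// Chooses the cancellation path based on the run's current state.
pub fn plan_cancel(run: &Run) -> CancelAction {
    match (&run.state, &run.engine_id) {
        (RunState::Queued, _) => CancelAction::CancelImmediately,
        (RunState::Running, Some(engine_id)) => CancelAction::SignalEngine(engine_id.clone()),
        _ => CancelAction::Noop,
    }
}

/// Maps the action to the `Option` of engine ID returned to the API handler.
pub fn api_result(action: &CancelAction) -> Option<&str> {
    match action {
        CancelAction::SignalEngine(id) => Some(id.as_str()),
        _ => None,
    }
}
```

Separating the decision from the side effects (database update, NATS publish) keeps the branching easy to unit test without a broker or database.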

Class diagram for telemetry initialization and engine runtime changes

classDiagram
    class Telemetry {
        +init_tracing(service_name: &'static str) void
        +tracing
        +tracing_subscriber
    }

    class ApiTracingModule {
        +init_tracing() void
    }

    class ApiMain {
        +main() Result
    }

    class EngineServer {
        +bootstrap(engine: Engine, name: &'static str) EngineResult
    }

    class EngineRuntime {
        +config() &Arc_FullEngineConfig
        +run(run_id: Uuid, user_id: String, req: ValidatedRunRequest, trace_id: Option_String) Result_RunSummary_EngineError
        +run_inner(run_id: Uuid, user_id: String, req: ValidatedRunRequest) Result_RunSummary_EngineError
        +cancel(run_id: &str) Result_void_EngineError
    }

    class RunService {
        +create_run(run_id: &str, user_id: &str, req: CreateRunRequest) ServiceResult_void
        +request_cancel(id: &RunId) ServiceResult_Option_String
    }

    class RunRequestMessage {
        +run_id: String
        +request: ValidatedRunRequest
        +user_id: String
        +trace_id: Option_String
    }

    class Engine {
        <<trait>>
        +new() Self
        +get_workflow_results() Result_HashMap_Category_Files
        +get_task_logs() Result_Vec_TaskLog
    }

    ApiTracingModule ..> Telemetry : uses
    ApiMain ..> ApiTracingModule : calls
    EngineServer ..> Telemetry : uses

    EngineServer o--> EngineRuntime
    EngineRuntime ..> RunRequestMessage
    RunService ..> RunRequestMessage

    Engine <|.. GenericEngine

    class GenericEngine {
        +new() Self
    }
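
The split between `run` and `run_inner` shown in the class diagram can be illustrated with a small wrapper. This is only a sketch under stated assumptions: the real code records these values as fields on a `tracing` span, whereas here they are collected into a hypothetical `WorkflowRunRecord` struct, `RunSummary` is reduced to a single assumed field, and errors are plain strings rather than `EngineError`.

```rust
use std::time::Instant;

/// Hypothetical, reduced stand-in for the engine's run summary.
#[derive(Debug, PartialEq)]
pub struct RunSummary {
    pub exit_code: i32,
}

/// Values the `workflow_run` span would carry, gathered in a plain struct.
#[derive(Debug)]
pub struct WorkflowRunRecord {
    pub api_trace_id: Option<String>,
    pub state: &'static str,
    pub duration_ms: u128,
    pub error: Option<String>,
}

/// Outer wrapper: owns timing, correlation, and error recording, and
/// delegates the actual work to `run_inner`, which updates `state` as it
/// moves through the workflow phases.
pub fn run<F>(trace_id: Option<String>, run_inner: F) -> (Result<RunSummary, String>, WorkflowRunRecord)
where
    F: FnOnce(&mut &'static str) -> Result<RunSummary, String>,
{
    let started = Instant::now();
    let mut state = "initializing";
    let result = run_inner(&mut state);
    let record = WorkflowRunRecord {
        api_trace_id: trace_id,
        state,
        duration_ms: started.elapsed().as_millis(),
        error: result.as_ref().err().cloned(),
    };
    (result, record)
}
```

With this shape the inner function never needs to know about the trace ID, and the duration and error are recorded in one place regardless of which phase failed.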

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Introduce shared telemetry crate and wire it into API and engine binaries for centralized tracing and optional OTEL export. | Add new telemetry crate that configures tracing subscribers, JSON vs text format via LOG_FORMAT, and optional OTLP exporter via OTEL_EXPORTER_OTLP_ENDPOINT<br>Expose init_tracing(service_name) plus re-exports of tracing and tracing_subscriber from the telemetry crate<br>Update workspace Cargo.toml to include telemetry crate and OTEL-related dependencies<br>Use init_tracing in metis-api main and in engine bootstrap, and plumb static service name into generic engine main<br>Drop direct tracing-subscriber dependency from API and engine crates in favor of telemetry | `Cargo.toml`<br>`crates/telemetry/Cargo.toml`<br>`crates/telemetry/src/lib.rs`<br>`crates/api/Cargo.toml`<br>`crates/api/src/main.rs`<br>`crates/engine/Cargo.toml`<br>`crates/engine/src/lib.rs`<br>`crates/engine/src/server.rs`<br>`crates/engines/generic/src/main.rs`<br>`crates/api/src/tracing.rs` |
| Enhance structured tracing around run creation, workflow execution, and cancellation, including API↔engine correlation via NATS messages. | Add #[tracing::instrument] to RunService::create_run and request_cancel with key span fields for run/user/workflow metadata<br>Capture current API span ID as hex trace_id and embed it in RunRequestMessage sent over NATS<br>Extend RunRequestMessage with an optional trace_id field shared between API and engine<br>Change EngineRuntime::run signature to accept an optional trace_id, introduce run_inner, and create a workflow_run span that records state, duration, api_trace_id, and error<br>Update internal state transitions in run_inner to record onto the current span instead of a local span handle<br>Wrap EngineRuntime::cancel body in a workflow_cancel span via .instrument<br>Promote several database write failures from warn! to error! for clearer signal | `crates/api/src/services/run.rs`<br>`crates/common/src/models.rs`<br>`crates/engine/src/runtime.rs` |
| Adjust HTTP tracing configuration for the API and refine documentation to describe engine internals and tracing behavior. | Customize tower_http TraceLayer to log HTTP spans/responses at INFO and failures at ERROR<br>Add new docs page describing tracing setup, span taxonomy, correlation via api_trace_id, and OTEL export configuration<br>Add new docs page for engine internals and crate layout, and point previous architecture/engine-configuration content to it<br>Update engine-configuration and architecture docs to reference the new Engine Internals docs instead of duplicating crate and engine descriptions<br>Extend docs sidebar with Dev section linking to Engine Internals and Tracing pages | `crates/api/src/routes.rs`<br>`docs/dev/tracing.md`<br>`docs/dev/engine.md`<br>`docs/engine-configuration.md`<br>`docs/architecture.md`<br>`docs/.vitepress/config.mjs` |
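
The environment-driven configuration described in the first row can be sketched as below. The variable names `LOG_FORMAT` and `OTEL_EXPORTER_OTLP_ENDPOINT` come from the PR; everything else is an assumption for illustration, including the `TelemetryConfig` type, the rule that the value `json` (case-insensitive) selects JSON output with text as the default, and treating an empty endpoint as "export disabled".

```rust
/// Output format for log lines.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LogFormat {
    Text,
    Json,
}

/// Hypothetical resolved telemetry settings.
#[derive(Debug, Clone, PartialEq)]
pub struct TelemetryConfig {
    pub format: LogFormat,
    /// `None` means spans are not exported over OTLP.
    pub otlp_endpoint: Option<String>,
}

impl TelemetryConfig {
    /// Resolves settings from any key lookup, which keeps the parsing
    /// testable without touching the process environment.
    pub fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        // Assumed rule: only "json" switches formats; anything else is text.
        let format = match lookup("LOG_FORMAT") {
            Some(v) if v.trim().eq_ignore_ascii_case("json") => LogFormat::Json,
            _ => LogFormat::Text,
        };
        // Assumed rule: unset or blank endpoint disables OTLP export.
        let otlp_endpoint = lookup("OTEL_EXPORTER_OTLP_ENDPOINT")
            .map(|v| v.trim().to_owned())
            .filter(|v| !v.is_empty());
        Self { format, otlp_endpoint }
    }

    /// Resolves settings from the real process environment.
    pub fn from_env() -> Self {
        Self::from_lookup(|key| std::env::var(key).ok())
    }
}
```

A service would call `TelemetryConfig::from_env()` once at startup and then decide which subscriber layers to install based on the result.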

sourcery-ai

Hey - I've found 3 issues, and left some high-level feedback:

The telemetry crate is declared with edition = "2024", which requires Rust 1.85 or newer; if the rest of the workspace uses a different edition (likely edition = "2021") or supports older toolchains, consider aligning it to avoid toolchain issues.
The engine crate still uses tracing macros (e.g., info!, error!) but its Cargo.toml no longer declares a direct tracing dependency; either re-add tracing as a dependency or import it via telemetry::tracing and adjust the use statements accordingly.
In telemetry::init_tracing, failing OTEL exporter construction currently expects and panics; you might want to downgrade this to a logged warning and fall back to local logging so a misconfigured OTEL endpoint does not crash services at startup.
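
The fallback suggested in the last point might look roughly like this. It is a std-only sketch, not the telemetry crate's code: `build_exporter` is a hypothetical stand-in for the real OTLP exporter construction (it only checks the URL scheme), `ExportMode` and `select_export_mode` are illustrative names, and the warning goes to stderr here because the tracing subscriber would not be installed yet at this point.

```rust
/// Which export path the service ends up using.
#[derive(Debug, PartialEq)]
pub enum ExportMode {
    /// Spans are exported to this OTLP endpoint.
    Otlp(String),
    /// Only local log output is configured.
    LocalOnly,
}

/// Hypothetical stand-in for exporter construction. The real builder can
/// fail for other reasons; here only the URL scheme is validated.
pub fn build_exporter(endpoint: &str) -> Result<String, String> {
    if endpoint.starts_with("http://") || endpoint.starts_with("https://") {
        Ok(endpoint.to_owned())
    } else {
        Err(format!("unsupported OTLP endpoint: {endpoint}"))
    }
}

/// Chooses an export mode without panicking: a misconfigured endpoint
/// produces a warning and falls back to local logging.
pub fn select_export_mode(endpoint: Option<&str>) -> ExportMode {
    match endpoint.map(build_exporter) {
        Some(Ok(exporter)) => ExportMode::Otlp(exporter),
        Some(Err(err)) => {
            eprintln!("warning: OTEL export disabled, falling back to local logging: {err}");
            ExportMode::LocalOnly
        }
        None => ExportMode::LocalOnly,
    }
}
```

The trade-off is that a typo in the endpoint no longer stops the service at startup, so the warning needs to be visible enough for operators to notice that traces are missing.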

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The `telemetry` crate is declared with `edition = "2024"`, which requires Rust 1.85 or newer; if the rest of the workspace uses a different edition (likely `edition = "2021"`) or supports older toolchains, consider aligning it to avoid toolchain issues.
- The `engine` crate still uses `tracing` macros (e.g., `info!`, `error!`) but its `Cargo.toml` no longer declares a direct `tracing` dependency; either re-add `tracing` as a dependency or import it via `telemetry::tracing` and adjust the `use` statements accordingly.
- In `telemetry::init_tracing`, failing OTEL exporter construction currently `expect`s and panics; you might want to downgrade this to a logged warning and fall back to local logging so a misconfigured OTEL endpoint does not crash services at startup.

## Individual Comments

### Comment 1
<location path="crates/telemetry/Cargo.toml" line_range="10" />
<code_context>
+tracing-subscriber = { features = [ "env-filter", "json" ], workspace = true }
+
+[package]
+edition = "2024"
+name = "telemetry"
+version = "0.1.0"
</code_context>
<issue_to_address>
**issue (bug_risk):** The `edition = "2024"` setting will not compile on toolchains older than Rust 1.85.

The 2024 edition was stabilized in Rust 1.85; Cargo on older toolchains will reject the manifest. Unless the workspace already requires Rust 1.85 or newer, set this to `"2021"` (or declare a matching `rust-version`) to avoid build failures for consumers.
</issue_to_address>

### Comment 2
<location path="crates/telemetry/src/lib.rs" line_range="18-25" />
<code_context>
+            .with_endpoint(endpoint)
+            .build()
+            .expect("Failed to build OTEL exporter");
+        let tracer_provider = opentelemetry_sdk::trace::TracerProvider::builder()
+            .with_batch_exporter(exporter, opentelemetry_sdk::runtime::Tokio)
+            .build();
</code_context>
<issue_to_address>
**suggestion:** The tracer provider is not configured with a resource, so spans may lack standard service metadata.

Consider building the `TracerProvider` with a `Resource` that sets `service.name` (and any other relevant attributes) from `service_name`, so traces from different binaries are distinguishable in backends. For example:

```rust
let resource = opentelemetry_sdk::Resource::new(vec![
    opentelemetry::KeyValue::new("service.name", service_name),
]);

let tracer_provider = opentelemetry_sdk::trace::TracerProvider::builder()
    .with_resource(resource)
    .with_batch_exporter(exporter, opentelemetry_sdk::runtime::Tokio)
    .build();
```

```suggestion
        let exporter = opentelemetry_otlp::SpanExporter::builder()
            .with_tonic()
            .with_endpoint(endpoint)
            .build()
            .expect("Failed to build OTEL exporter");

        let resource = opentelemetry_sdk::Resource::new(vec![
            opentelemetry::KeyValue::new("service.name", service_name),
        ]);

        let tracer_provider = opentelemetry_sdk::trace::TracerProvider::builder()
            .with_resource(resource)
            .with_batch_exporter(exporter, opentelemetry_sdk::runtime::Tokio)
            .build();
```
</issue_to_address>

### Comment 3
<location path="docs/dev/engine.md" line_range="15" />
<code_context>
+
+## Engine Trait
+
+New engines implement two methods:
+
+```rust
</code_context>
<issue_to_address>
**issue (typo):** The text says "two methods" but the trait example shows three methods.

This wording is inconsistent with the `Engine` trait example, which defines three methods (`new`, `get_workflow_results`, `get_task_logs`). Please update the sentence to match the actual number of methods or refer to “the following methods.”

```suggestion
New engines implement the following methods:
```
</issue_to_address>

feat: centralize logging and otel

f536da7

vercel Bot deployed to Preview March 8, 2026 06:25

sourcery-ai Bot reviewed Mar 8, 2026

Comment thread crates/telemetry/Cargo.toml

Comment thread crates/telemetry/src/lib.rs Outdated

Comment thread docs/dev/engine.md Outdated

chore: do not log health requests

243eb80

vercel Bot deployed to Preview March 9, 2026 00:50

chore: reviews and docs

6d3879f

vercel Bot deployed to Preview March 9, 2026 01:00

JaeAeich merged commit adbbe22 into dev Mar 9, 2026
9 checks passed

JaeAeich merged 3 commits into dev from log