diff --git a/backend/README.md b/backend/README.md index ca10612..c409478 100644 --- a/backend/README.md +++ b/backend/README.md @@ -283,7 +283,7 @@ High-performance endpoints for monitoring application health and system metrics. - `/api/v1/profiling/health`: System health status. - `/api/status`: Unified health, metrics, and active recovery tasks. -### 🔭 OpenTelemetry Tracing (Issue #251) +### 🔭 OpenTelemetry Tracing (Issue #365) Production-grade distributed tracing with OTLP exporter and Jaeger integration. - **Full instrumentation**: HTTP handlers, database queries, Redis operations, background jobs - **Semantic conventions**: W3C trace context, OpenTelemetry semantic conventions @@ -472,8 +472,8 @@ All workers are designed to be production-ready with comprehensive error handlin 2. **Run the backend** with tracing enabled: ```bash - export OTLP_ENDPOINT=http://localhost:4317 - export ENV=dev + export APP_OBSERVABILITY__TRACING_ENDPOINT=http://localhost:4318/v1/traces + export APP_ENV=development cargo run -p backend ``` @@ -569,8 +569,8 @@ otel.kind = "internal" | Variable | Default | Description | |---|---|---| -| `OTLP_ENDPOINT` | `http://localhost:4317` | OTLP gRPC endpoint | -| `ENV` | `dev` | Environment (dev, staging, production) | +| `APP_OBSERVABILITY__TRACING_ENDPOINT` | `http://localhost:4318/v1/traces` | OTLP HTTP traces endpoint | +| `APP_ENV` | `development` | Environment (development, staging, production) | | `RUST_LOG` | `info,crucible=debug` | Log level filter | #### Sampling Strategies @@ -664,7 +664,7 @@ Crucible uses a typed contract system for all API endpoints to ensure consistenc Production-grade distributed tracing with OTLP exporter and Jaeger integration. 1. **Start Jaeger**: `docker-compose -f docker-compose-jaeger.yml up -d` -2. **Run Backend**: `export OTLP_ENDPOINT=http://localhost:4317; cargo run -p backend` +2. **Run Backend**: `export APP_OBSERVABILITY__TRACING_ENDPOINT=http://localhost:4318/v1/traces; cargo run -p backend` 3. **View Traces**: Open `http://localhost:16686` Spans from every `#[tracing::instrument]`-annotated function are exported with **< 1% latency overhead**. @@ -888,6 +888,7 @@ If `APP_ENV` is omitted, the application safely defaults to `development` and em | `APP_REDIS__URL` | String | None | Yes | Redis connection string. **(SENSITIVE)** | | `APP_REDIS__JOB_QUEUE_URL` | String | Falls back to URL | No | Separate Redis instance for the job queue. **(SENSITIVE)** | | `APP_OBSERVABILITY__LOG_LEVEL` | String | (from TOML) | No | Logging verbosity (e.g., `info`, `warn`, `debug`). | +| `APP_OBSERVABILITY__TRACING_ENDPOINT` | String | `http://localhost:4318/v1/traces` | No | OTLP HTTP traces endpoint for Jaeger or an OpenTelemetry Collector. | ### How to Add a New Configuration Field diff --git a/backend/TRACING_IMPLEMENTATION_SUMMARY.md b/backend/TRACING_IMPLEMENTATION_SUMMARY.md index 6d5481d..ba24ca5 100644 --- a/backend/TRACING_IMPLEMENTATION_SUMMARY.md +++ b/backend/TRACING_IMPLEMENTATION_SUMMARY.md @@ -1,4 +1,4 @@ -# OpenTelemetry Tracing Implementation Summary - Issue #251 +# OpenTelemetry Tracing Implementation Summary - Issue #365 **Implementation Date:** 2026-04-29 **Status:** ✅ COMPLETE @@ -23,7 +23,7 @@ Successfully implemented production-grade OpenTelemetry tracing across the entir **Status:** ✅ Already implemented (enhanced) The `TracingService` provides: -- OTLP/gRPC exporter with configurable endpoint +- OTLP/HTTP exporter with configurable endpoint - Environment-based sampling strategies (100% dev, 10% staging, 1% prod) - Semantic convention-compliant span factories - Resource detection (service name, version, environment) @@ -220,8 +220,8 @@ Based on benchmarks and sampling strategies: | Variable | Default | Description | |---|---|---| -| `OTLP_ENDPOINT` | `http://localhost:4317` | OTLP gRPC endpoint | -| `ENV` | `dev` | Environment (dev, staging, production) | +| `APP_OBSERVABILITY__TRACING_ENDPOINT` | `http://localhost:4318/v1/traces` | OTLP HTTP traces endpoint | +| `APP_ENV` | `development` | Environment (development, staging, production) | | `RUST_LOG` | `info,crucible=debug` | Log level filter | #### Sampling Strategies @@ -266,8 +266,8 @@ Created `backend/jaeger-sampling.json` with: docker-compose -f docker-compose-jaeger.yml up -d # Run backend with tracing -export OTLP_ENDPOINT=http://localhost:4317 -export ENV=dev +export APP_OBSERVABILITY__TRACING_ENDPOINT=http://localhost:4318/v1/traces +export APP_ENV=development cargo run -p backend # View traces @@ -416,24 +416,24 @@ Created `backend/src/services/TRACING_RECON.md` with: #### Development - [x] Start Jaeger: `docker-compose -f docker-compose-jaeger.yml up -d` -- [x] Set `OTLP_ENDPOINT=http://localhost:4317` -- [x] Set `ENV=dev` +- [x] Set `APP_OBSERVABILITY__TRACING_ENDPOINT=http://localhost:4318/v1/traces` +- [x] Set `APP_ENV=development` - [x] Run backend: `cargo run -p backend` - [x] View traces: `http://localhost:16686` #### Staging - [ ] Deploy Jaeger Collector with persistent storage -- [ ] Set `OTLP_ENDPOINT=http://jaeger-collector:4317` -- [ ] Set `ENV=staging` (10% sampling) +- [ ] Set `APP_OBSERVABILITY__TRACING_ENDPOINT=http://jaeger-collector:4318/v1/traces` +- [ ] Set `APP_ENV=staging` (10% sampling) - [ ] Monitor span drop rates - [ ] Validate trace quality #### Production - [ ] Deploy Jaeger with Elasticsearch backend -- [ ] Set `OTLP_ENDPOINT=http://jaeger-collector:4317` -- [ ] Set `ENV=production` (1% sampling) +- [ ] Set `APP_OBSERVABILITY__TRACING_ENDPOINT=http://jaeger-collector:4318/v1/traces` +- [ ] Set `APP_ENV=production` (1% sampling) - [ ] Set up alerts for high error rates - [ ] Monitor collector metrics - [ ] Validate performance impact diff --git a/backend/src/api/handlers/profiling.rs b/backend/src/api/handlers/profiling.rs index 6fac342..d206494 100644 --- a/backend/src/api/handlers/profiling.rs +++ b/backend/src/api/handlers/profiling.rs @@ -7,11 +7,11 @@ use crate::services::{ error_recovery::ErrorManager, log_aggregator::LogAggregator, sys_metrics::MetricsExporter, tracing::TracingService, }; +use crate::AppError; use axum::{extract::State, response::IntoResponse, Json}; use chrono::{DateTime, Utc}; use redis::Client as RedisClient; use serde::{Deserialize, Serialize}; -use sqlx::PgPool; use std::sync::Arc; use tracing::{info, info_span, instrument}; use utoipa::ToSchema; @@ -134,16 +134,40 @@ pub async fn get_health(State(state): State>) -> Result redis::cmd("PING") + .query_async::<_, String>(&mut conn) + .await + .map(|pong| pong == "PONG") + .unwrap_or_else(|e| { + TracingService::record_error(&redis_span, &e.to_string(), "redis_ping"); + false + }), + Err(e) => { + TracingService::record_error(&redis_span, &e.to_string(), "redis_connection"); + false + } + }; + drop(_redis_enter); + let response = HealthResponse { - status: if db_healthy { "healthy" } else { "degraded" }.to_string(), + status: if db_healthy && redis_healthy { + "healthy" + } else { + "degraded" + } + .to_string(), version: env!("CARGO_PKG_VERSION").to_string(), timestamp: Utc::now(), database_connected: db_healthy, - redis_connected: true, + redis_connected: redis_healthy, }; info!( db_connected = db_healthy, + redis_connected = redis_healthy, version = env!("CARGO_PKG_VERSION"), "Health check completed" ); diff --git a/backend/src/api/middleware/logging.rs b/backend/src/api/middleware/logging.rs index 5bf7bc1..f47f37c 100644 --- a/backend/src/api/middleware/logging.rs +++ b/backend/src/api/middleware/logging.rs @@ -1,13 +1,8 @@ use crate::api::handlers::profiling::AppState; -use axum::{ - body::Body, - extract::State, - http::{Request, Response}, - middleware::Next, - response::IntoResponse, -}; +use crate::services::tracing::TracingService; +use axum::{body::Body, extract::State, http::Request, middleware::Next, response::IntoResponse}; use std::{sync::Arc, time::Instant}; -use tracing::{info_span, Instrument}; +use tracing::Instrument; /// Middleware to log HTTP requests and responses. /// @@ -28,13 +23,8 @@ pub async fn logging_middleware( let uri = request.uri().clone(); let version = request.version(); - // Create a tracing span for this request - let span = info_span!( - "http_request", - %method, - %uri, - ?version, - ); + let path = uri.path().to_string(); + let span = TracingService::http_request_span(method.as_str(), &path, None); async move { // Log the incoming request @@ -44,11 +34,16 @@ pub async fn logging_middleware( let latency = start_time.elapsed(); let status = response.status(); + span.record("http.status_code", status.as_u16()); + if status.is_server_error() { + TracingService::record_error(&span, status.as_str(), "http_server_error"); + } // Log the response tracing::info!( latency_ms = latency.as_millis(), status = status.as_u16(), + ?version, "Finished processing request" ); diff --git a/backend/src/config/defaults/default.toml b/backend/src/config/defaults/default.toml index dd87905..af494e1 100644 --- a/backend/src/config/defaults/default.toml +++ b/backend/src/config/defaults/default.toml @@ -17,4 +17,5 @@ max_retries = 3 [observability] log_level = "info" +tracing_endpoint = "http://localhost:4318/v1/traces" enable_metrics = false diff --git a/backend/src/config/observability.rs b/backend/src/config/observability.rs index f42970e..f1ef0b6 100644 --- a/backend/src/config/observability.rs +++ b/backend/src/config/observability.rs @@ -6,7 +6,6 @@ use serde::{Deserialize, Serialize}; use std::str::FromStr; use tracing::Level; -use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter}; /// Observability configuration (logs, tracing, metrics). #[derive(Clone, Debug, Deserialize, Serialize)] @@ -25,36 +24,11 @@ impl ObservabilityConfig { Level::from_str(&self.log_level).unwrap_or(Level::INFO) } - /// Initializes the tracing subscriber globally based on the environment. - /// Development uses a pretty formatter, while Staging/Production use JSON. - pub fn init_tracing(&self, env: crate::config::Environment) { - let filter = - EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new(&self.log_level)); - - match env { - crate::config::Environment::Development => { - let fmt_layer = tracing_subscriber::fmt::layer() - .pretty() - .with_thread_ids(true) - .with_target(true); - - // try_init silently returns an error if a subscriber is already set, preventing panics in tests - let _ = tracing_subscriber::registry() - .with(filter) - .with(fmt_layer) - .try_init(); - } - crate::config::Environment::Staging | crate::config::Environment::Production => { - let fmt_layer = tracing_subscriber::fmt::layer() - .json() - .with_span_list(true) - .with_current_span(true); - - let _ = tracing_subscriber::registry() - .with(filter) - .with(fmt_layer) - .try_init(); - } - } + /// Returns true when logs should be emitted as JSON. + pub fn json_logs(&self, env: crate::config::Environment) -> bool { + matches!( + env, + crate::config::Environment::Staging | crate::config::Environment::Production + ) } } diff --git a/backend/src/main.rs b/backend/src/main.rs index ae67fd9..390f32f 100644 --- a/backend/src/main.rs +++ b/backend/src/main.rs @@ -44,10 +44,7 @@ async fn main() -> Result<(), anyhow::Error> { let env = Environment::from_env(); let config = AppConfig::load(env).expect("Failed to load configuration"); - // Initialize observability using the new config system - config.observability.init_tracing(env); - - // Initialize OpenTelemetry tracing FIRST - before any other services + // Initialize OpenTelemetry tracing before other services so startup work is captured. let tracing_config = TracingConfig::new( "crucible-backend".to_string(), env!("CARGO_PKG_VERSION").to_string(), @@ -58,10 +55,14 @@ async fn main() -> Result<(), anyhow::Error> { .observability .tracing_endpoint .clone() - .unwrap_or_else(|| "http://localhost:4317".to_string()), + .unwrap_or_else(|| "http://localhost:4318/v1/traces".to_string()), ); - TracingService::init(tracing_config)?; + let _tracing_guard = TracingService::init_with_filter( + tracing_config, + Some(&config.observability.log_level), + config.observability.json_logs(env), + )?; let span = info_span!("app.startup"); let _enter = span.enter(); @@ -115,6 +116,8 @@ async fn main() -> Result<(), anyhow::Error> { metrics_exporter: metrics_exporter.clone(), error_manager: error_manager.clone(), config_manager: config_manager.clone(), + log_aggregator: log_aggregator.clone(), + redis: redis_client.clone(), }); // Create dashboard state @@ -278,19 +281,7 @@ async fn main() -> Result<(), anyhow::Error> { Err(_) => tracing::warn!("Timeout waiting for in-flight requests to complete"), } - // Flush tracing and logging - tracing::info!("Flushing tracing and logging subscribers"); - // In practice, we'd use a tracing subscriber guard to flush - // For now, we note that tracing is flushed when the subscriber is dropped - // which happens naturally at the end of the program - - // Close database connection pool - tracing::info!("Closing database connection pool"); - drop(state.db); // This closes the pool - - // Close Redis connection - tracing::info!("Closing Redis connection"); - drop(state.redis); // This closes the connection manager + tracing::info!("Flushing tracing provider and releasing service clients"); tracing::info!("Graceful shutdown completed successfully"); diff --git a/backend/src/services/tracing.rs b/backend/src/services/tracing.rs index ccb3ec9..9f25611 100644 --- a/backend/src/services/tracing.rs +++ b/backend/src/services/tracing.rs @@ -14,12 +14,25 @@ use std::time::Duration; use tracing::{info_span, warn}; -use tracing_subscriber::layer::SubscriberExt; -use tracing_subscriber::{EnvFilter, Registry}; +use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter}; /// Central tracing service for initialization and span creation pub struct TracingService; +/// Keeps the OpenTelemetry provider alive and flushes spans during shutdown. +#[derive(Debug)] +pub struct TracingGuard { + tracer_provider: SdkTracerProvider, +} + +impl Drop for TracingGuard { + fn drop(&mut self) { + if let Err(error) = self.tracer_provider.shutdown() { + eprintln!("failed to shut down OpenTelemetry tracer provider: {error:?}"); + } + } +} + /// Configuration for the tracing service #[derive(Clone, Debug)] pub struct TracingConfig { @@ -44,7 +57,7 @@ pub struct TracingConfig { impl Default for TracingConfig { fn default() -> Self { Self { - otlp_endpoint: "http://localhost:4317".to_string(), + otlp_endpoint: "http://localhost:4318/v1/traces".to_string(), service_name: "crucible-backend".to_string(), service_version: env!("CARGO_PKG_VERSION").to_string(), environment: std::env::var("ENV").unwrap_or("dev".to_string()), @@ -107,7 +120,7 @@ impl TracingService { tracing::info!("Service: {}", config.service_name); tracing::info!("Environment: {}", config.environment); - Ok(()) + Ok(TracingGuard { tracer_provider }) } /// Create an HTTP request span with semantic conventions