Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
13 changes: 7 additions & 6 deletions backend/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -283,7 +283,7 @@ High-performance endpoints for monitoring application health and system metrics.
- `/api/v1/profiling/health`: System health status.
- `/api/status`: Unified health, metrics, and active recovery tasks.

### 🔭 OpenTelemetry Tracing (Issue #251)
### 🔭 OpenTelemetry Tracing (Issue #365)
Production-grade distributed tracing with OTLP exporter and Jaeger integration.
- **Full instrumentation**: HTTP handlers, database queries, Redis operations, background jobs
- **Semantic conventions**: W3C trace context, OpenTelemetry semantic conventions
Expand Down Expand Up @@ -472,8 +472,8 @@ All workers are designed to be production-ready with comprehensive error handlin

2. **Run the backend** with tracing enabled:
```bash
export OTLP_ENDPOINT=http://localhost:4317
export ENV=dev
export APP_OBSERVABILITY__TRACING_ENDPOINT=http://localhost:4318/v1/traces
export APP_ENV=development
cargo run -p backend
```

Expand Down Expand Up @@ -569,8 +569,8 @@ otel.kind = "internal"

| Variable | Default | Description |
|---|---|---|
| `OTLP_ENDPOINT` | `http://localhost:4317` | OTLP gRPC endpoint |
| `ENV` | `dev` | Environment (dev, staging, production) |
| `APP_OBSERVABILITY__TRACING_ENDPOINT` | `http://localhost:4318/v1/traces` | OTLP HTTP traces endpoint |
| `APP_ENV` | `development` | Environment (development, staging, production) |
| `RUST_LOG` | `info,crucible=debug` | Log level filter |

#### Sampling Strategies
Expand Down Expand Up @@ -664,7 +664,7 @@ Crucible uses a typed contract system for all API endpoints to ensure consistenc
Production-grade distributed tracing with OTLP exporter and Jaeger integration.

1. **Start Jaeger**: `docker-compose -f docker-compose-jaeger.yml up -d`
2. **Run Backend**: `export OTLP_ENDPOINT=http://localhost:4317; cargo run -p backend`
2. **Run Backend**: `export APP_OBSERVABILITY__TRACING_ENDPOINT=http://localhost:4318/v1/traces; cargo run -p backend`
3. **View Traces**: Open `http://localhost:16686`

Spans from every `#[tracing::instrument]`-annotated function are exported with **< 1% latency overhead**.
Expand Down Expand Up @@ -888,6 +888,7 @@ If `APP_ENV` is omitted, the application safely defaults to `development` and em
| `APP_REDIS__URL` | String | None | Yes | Redis connection string. **(SENSITIVE)** |
| `APP_REDIS__JOB_QUEUE_URL` | String | Falls back to URL | No | Separate Redis instance for the job queue. **(SENSITIVE)** |
| `APP_OBSERVABILITY__LOG_LEVEL` | String | (from TOML) | No | Logging verbosity (e.g., `info`, `warn`, `debug`). |
| `APP_OBSERVABILITY__TRACING_ENDPOINT` | String | `http://localhost:4318/v1/traces` | No | OTLP HTTP traces endpoint for Jaeger or an OpenTelemetry Collector. |

### How to Add a New Configuration Field

Expand Down
24 changes: 12 additions & 12 deletions backend/TRACING_IMPLEMENTATION_SUMMARY.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
# OpenTelemetry Tracing Implementation Summary - Issue #251
# OpenTelemetry Tracing Implementation Summary - Issue #365

**Implementation Date:** 2026-04-29
**Status:** ✅ COMPLETE
Expand All @@ -23,7 +23,7 @@ Successfully implemented production-grade OpenTelemetry tracing across the entir
**Status:** ✅ Already implemented (enhanced)

The `TracingService` provides:
- OTLP/gRPC exporter with configurable endpoint
- OTLP/HTTP exporter with configurable endpoint
- Environment-based sampling strategies (100% dev, 10% staging, 1% prod)
- Semantic convention-compliant span factories
- Resource detection (service name, version, environment)
Expand Down Expand Up @@ -220,8 +220,8 @@ Based on benchmarks and sampling strategies:

| Variable | Default | Description |
|---|---|---|
| `OTLP_ENDPOINT` | `http://localhost:4317` | OTLP gRPC endpoint |
| `ENV` | `dev` | Environment (dev, staging, production) |
| `APP_OBSERVABILITY__TRACING_ENDPOINT` | `http://localhost:4318/v1/traces` | OTLP HTTP traces endpoint |
| `APP_ENV` | `development` | Environment (development, staging, production) |
| `RUST_LOG` | `info,crucible=debug` | Log level filter |

#### Sampling Strategies
Expand Down Expand Up @@ -266,8 +266,8 @@ Created `backend/jaeger-sampling.json` with:
docker-compose -f docker-compose-jaeger.yml up -d

# Run backend with tracing
export OTLP_ENDPOINT=http://localhost:4317
export ENV=dev
export APP_OBSERVABILITY__TRACING_ENDPOINT=http://localhost:4318/v1/traces
export APP_ENV=development
cargo run -p backend

# View traces
Expand Down Expand Up @@ -416,24 +416,24 @@ Created `backend/src/services/TRACING_RECON.md` with:
#### Development

- [x] Start Jaeger: `docker-compose -f docker-compose-jaeger.yml up -d`
- [x] Set `OTLP_ENDPOINT=http://localhost:4317`
- [x] Set `ENV=dev`
- [x] Set `APP_OBSERVABILITY__TRACING_ENDPOINT=http://localhost:4318/v1/traces`
- [x] Set `APP_ENV=development`
- [x] Run backend: `cargo run -p backend`
- [x] View traces: `http://localhost:16686`

#### Staging

- [ ] Deploy Jaeger Collector with persistent storage
- [ ] Set `OTLP_ENDPOINT=http://jaeger-collector:4317`
- [ ] Set `ENV=staging` (10% sampling)
- [ ] Set `APP_OBSERVABILITY__TRACING_ENDPOINT=http://jaeger-collector:4318/v1/traces`
- [ ] Set `APP_ENV=staging` (10% sampling)
- [ ] Monitor span drop rates
- [ ] Validate trace quality

#### Production

- [ ] Deploy Jaeger with Elasticsearch backend
- [ ] Set `OTLP_ENDPOINT=http://jaeger-collector:4317`
- [ ] Set `ENV=production` (1% sampling)
- [ ] Set `APP_OBSERVABILITY__TRACING_ENDPOINT=http://jaeger-collector:4318/v1/traces`
- [ ] Set `APP_ENV=production` (1% sampling)
- [ ] Set up alerts for high error rates
- [ ] Monitor collector metrics
- [ ] Validate performance impact
Expand Down
30 changes: 27 additions & 3 deletions backend/src/api/handlers/profiling.rs
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,11 @@ use crate::services::{
error_recovery::ErrorManager, log_aggregator::LogAggregator, sys_metrics::MetricsExporter,
tracing::TracingService,
};
use crate::AppError;
use axum::{extract::State, response::IntoResponse, Json};
use chrono::{DateTime, Utc};
use redis::Client as RedisClient;
use serde::{Deserialize, Serialize};
use sqlx::PgPool;
use std::sync::Arc;
use tracing::{info, info_span, instrument};
use utoipa::ToSchema;
Expand Down Expand Up @@ -134,16 +134,40 @@ pub async fn get_health(State(state): State<Arc<AppState>>) -> Result<impl IntoR
};
drop(_db_enter);

let redis_span = TracingService::redis_command_span("PING", None);
let _redis_enter = redis_span.enter();
let redis_healthy = match state.redis.get_multiplexed_async_connection().await {
Ok(mut conn) => redis::cmd("PING")
.query_async::<_, String>(&mut conn)
.await
.map(|pong| pong == "PONG")
.unwrap_or_else(|e| {
TracingService::record_error(&redis_span, &e.to_string(), "redis_ping");
false
}),
Err(e) => {
TracingService::record_error(&redis_span, &e.to_string(), "redis_connection");
false
}
};
drop(_redis_enter);

let response = HealthResponse {
status: if db_healthy { "healthy" } else { "degraded" }.to_string(),
status: if db_healthy && redis_healthy {
"healthy"
} else {
"degraded"
}
.to_string(),
version: env!("CARGO_PKG_VERSION").to_string(),
timestamp: Utc::now(),
database_connected: db_healthy,
redis_connected: true,
redis_connected: redis_healthy,
};

info!(
db_connected = db_healthy,
redis_connected = redis_healthy,
version = env!("CARGO_PKG_VERSION"),
"Health check completed"
);
Expand Down
25 changes: 10 additions & 15 deletions backend/src/api/middleware/logging.rs
Original file line number Diff line number Diff line change
@@ -1,13 +1,8 @@
use crate::api::handlers::profiling::AppState;
use axum::{
body::Body,
extract::State,
http::{Request, Response},
middleware::Next,
response::IntoResponse,
};
use crate::services::tracing::TracingService;
use axum::{body::Body, extract::State, http::Request, middleware::Next, response::IntoResponse};
use std::{sync::Arc, time::Instant};
use tracing::{info_span, Instrument};
use tracing::Instrument;

/// Middleware to log HTTP requests and responses.
///
Expand All @@ -28,13 +23,8 @@ pub async fn logging_middleware(
let uri = request.uri().clone();
let version = request.version();

// Create a tracing span for this request
let span = info_span!(
"http_request",
%method,
%uri,
?version,
);
let path = uri.path().to_string();
let span = TracingService::http_request_span(method.as_str(), &path, None);

async move {
// Log the incoming request
Expand All @@ -44,11 +34,16 @@ pub async fn logging_middleware(

let latency = start_time.elapsed();
let status = response.status();
span.record("http.status_code", status.as_u16());
if status.is_server_error() {
TracingService::record_error(&span, status.as_str(), "http_server_error");
}

// Log the response
tracing::info!(
latency_ms = latency.as_millis(),
status = status.as_u16(),
?version,
"Finished processing request"
);

Expand Down
1 change: 1 addition & 0 deletions backend/src/config/defaults/default.toml
Original file line number Diff line number Diff line change
Expand Up @@ -17,4 +17,5 @@ max_retries = 3

[observability]
log_level = "info"
tracing_endpoint = "http://localhost:4318/v1/traces"
enable_metrics = false
38 changes: 6 additions & 32 deletions backend/src/config/observability.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,6 @@
use serde::{Deserialize, Serialize};
use std::str::FromStr;
use tracing::Level;
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

/// Observability configuration (logs, tracing, metrics).
#[derive(Clone, Debug, Deserialize, Serialize)]
Expand All @@ -25,36 +24,11 @@ impl ObservabilityConfig {
Level::from_str(&self.log_level).unwrap_or(Level::INFO)
}

/// Initializes the tracing subscriber globally based on the environment.
/// Development uses a pretty formatter, while Staging/Production use JSON.
pub fn init_tracing(&self, env: crate::config::Environment) {
let filter =
EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new(&self.log_level));

match env {
crate::config::Environment::Development => {
let fmt_layer = tracing_subscriber::fmt::layer()
.pretty()
.with_thread_ids(true)
.with_target(true);

// try_init silently returns an error if a subscriber is already set, preventing panics in tests
let _ = tracing_subscriber::registry()
.with(filter)
.with(fmt_layer)
.try_init();
}
crate::config::Environment::Staging | crate::config::Environment::Production => {
let fmt_layer = tracing_subscriber::fmt::layer()
.json()
.with_span_list(true)
.with_current_span(true);

let _ = tracing_subscriber::registry()
.with(filter)
.with(fmt_layer)
.try_init();
}
}
/// Returns true when logs should be emitted as JSON.
pub fn json_logs(&self, env: crate::config::Environment) -> bool {
matches!(
env,
crate::config::Environment::Staging | crate::config::Environment::Production
)
}
}
29 changes: 10 additions & 19 deletions backend/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -44,10 +44,7 @@ async fn main() -> Result<(), anyhow::Error> {
let env = Environment::from_env();
let config = AppConfig::load(env).expect("Failed to load configuration");

// Initialize observability using the new config system
config.observability.init_tracing(env);

// Initialize OpenTelemetry tracing FIRST - before any other services
// Initialize OpenTelemetry tracing before other services so startup work is captured.
let tracing_config = TracingConfig::new(
"crucible-backend".to_string(),
env!("CARGO_PKG_VERSION").to_string(),
Expand All @@ -58,10 +55,14 @@ async fn main() -> Result<(), anyhow::Error> {
.observability
.tracing_endpoint
.clone()
.unwrap_or_else(|| "http://localhost:4317".to_string()),
.unwrap_or_else(|| "http://localhost:4318/v1/traces".to_string()),
);

TracingService::init(tracing_config)?;
let _tracing_guard = TracingService::init_with_filter(
tracing_config,
Some(&config.observability.log_level),
config.observability.json_logs(env),
)?;

let span = info_span!("app.startup");
let _enter = span.enter();
Expand Down Expand Up @@ -115,6 +116,8 @@ async fn main() -> Result<(), anyhow::Error> {
metrics_exporter: metrics_exporter.clone(),
error_manager: error_manager.clone(),
config_manager: config_manager.clone(),
log_aggregator: log_aggregator.clone(),
redis: redis_client.clone(),
});

// Create dashboard state
Expand Down Expand Up @@ -278,19 +281,7 @@ async fn main() -> Result<(), anyhow::Error> {
Err(_) => tracing::warn!("Timeout waiting for in-flight requests to complete"),
}

// Flush tracing and logging
tracing::info!("Flushing tracing and logging subscribers");
// In practice, we'd use a tracing subscriber guard to flush
// For now, we note that tracing is flushed when the subscriber is dropped
// which happens naturally at the end of the program

// Close database connection pool
tracing::info!("Closing database connection pool");
drop(state.db); // This closes the pool

// Close Redis connection
tracing::info!("Closing Redis connection");
drop(state.redis); // This closes the connection manager
tracing::info!("Flushing tracing provider and releasing service clients");

tracing::info!("Graceful shutdown completed successfully");

Expand Down
21 changes: 17 additions & 4 deletions backend/src/services/tracing.rs
Original file line number Diff line number Diff line change
Expand Up @@ -14,12 +14,25 @@

use std::time::Duration;
use tracing::{info_span, warn};
use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::{EnvFilter, Registry};
use tracing_subscriber::{layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

/// Central tracing service for initialization and span creation
pub struct TracingService;

/// Keeps the OpenTelemetry provider alive and flushes spans during shutdown.
#[derive(Debug)]
pub struct TracingGuard {
tracer_provider: SdkTracerProvider,
}

impl Drop for TracingGuard {
fn drop(&mut self) {
if let Err(error) = self.tracer_provider.shutdown() {
eprintln!("failed to shut down OpenTelemetry tracer provider: {error:?}");
}
}
}

/// Configuration for the tracing service
#[derive(Clone, Debug)]
pub struct TracingConfig {
Expand All @@ -44,7 +57,7 @@ pub struct TracingConfig {
impl Default for TracingConfig {
fn default() -> Self {
Self {
otlp_endpoint: "http://localhost:4317".to_string(),
otlp_endpoint: "http://localhost:4318/v1/traces".to_string(),
service_name: "crucible-backend".to_string(),
service_version: env!("CARGO_PKG_VERSION").to_string(),
environment: std::env::var("ENV").unwrap_or("dev".to_string()),
Expand Down Expand Up @@ -107,7 +120,7 @@ impl TracingService {
tracing::info!("Service: {}", config.service_name);
tracing::info!("Environment: {}", config.environment);

Ok(())
Ok(TracingGuard { tracer_provider })
}

/// Create an HTTP request span with semantic conventions
Expand Down