docs: add OpenTelemetry tracing integration trade-off analysis #52
Analyze the trade-offs of adding distributed tracing support via OpenTelemetry. The project already uses OpenTelemetry for metrics and the tracing crate for structured logging, making integration straightforward.

Key findings:
- Low implementation effort (~5-7 hours)
- Opt-in (disabled by default) with zero overhead when disabled
- Leverages existing tracing infrastructure
- Recommended: OTLP export with tracing-opentelemetry bridge

Resolves: #51
@forkline implement it

❌ Task Failed
I encountered an error while trying to complete this task.
Error:
If this issue persists, please check the task logs or try again.
Task ID: 1db17742-167d-4f46-87ef-59b7c593c9c9

@forkline retry
Add OTLP exporter support with opt-in configuration:
- Add tracing-opentelemetry bridge layer
- Add opentelemetry-otlp with grpc-tonic features
- Add configuration options for tracing (enabled, endpoint, sampling, service name)
- Create otel_tracing module with initialization and cleanup
- Integrate tracing into main.rs with proper shutdown handling
- Default disabled for zero overhead when not in use

Resolves: #51
Summary
Successfully implemented OpenTelemetry distributed tracing integration for RobotLB as specified in the trade-off analysis (issue #51).
Implementation Details
Core Integration (Phase 1 - Complete):
- Dependencies Added:
  - opentelemetry-otlp with trace and grpc-tonic features
  - opentelemetry-semantic-conventions
  - tracing-opentelemetry bridge layer
  - Updated tracing-subscriber with env-filter feature
- Configuration (src/config.rs:108-127):
  - ROBOTLB_TRACING_ENABLED (default: false) - Opt-in with zero overhead
  - ROBOTLB_OTLP_ENDPOINT (default: http://localhost:4317)
  - ROBOTLB_TRACING_SAMPLE_RATIO (default: 1.0)
  - ROBOTLB_SERVICE_NAME (default: robotlb)
- Tracing Module (src/otel_tracing.rs):
  - OTLP exporter with gRPC/Tonic transport
  - Configurable sampling (AlwaysOn or TraceIdRatioBased)
  - Resource with service name attribution
  - Batch exporter for efficient trace delivery
  - Proper shutdown handling via TracingGuard
- Integration (src/main.rs:101-102, 170):
  - Initializes tracing at startup with configuration
  - Properly shuts down tracer provider on exit
  - Bridges existing #[tracing::instrument] annotations automatically
- Tests:
  - Updated config tests to include new tracing fields
  - All 27 tests passing
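To make the configuration defaults and the sampler choice above concrete, here is a minimal, std-only sketch. It is not the code in src/config.rs or src/otel_tracing.rs: the `TracingConfig` and `SamplerChoice` types, the `from_lookup` helper, and the rule "ratio of 1.0 or more means AlwaysOn" are assumptions made for illustration. Only the environment variable names and their defaults are taken from the summary.

```rust
/// Hypothetical mirror of the tracing settings; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct TracingConfig {
    pub enabled: bool,
    pub otlp_endpoint: String,
    pub sample_ratio: f64,
    pub service_name: String,
}

/// The two sampler kinds named in the summary, modeled as a plain enum.
#[derive(Debug, Clone, PartialEq)]
pub enum SamplerChoice {
    AlwaysOn,
    TraceIdRatioBased(f64),
}

impl TracingConfig {
    /// Builds the config from any key lookup so tests do not need to touch
    /// the process environment. Unset or unparsable values fall back to the
    /// defaults listed in the summary.
    pub fn from_lookup<F>(lookup: F) -> Self
    where
        F: Fn(&str) -> Option<String>,
    {
        Self {
            enabled: lookup("ROBOTLB_TRACING_ENABLED")
                .and_then(|v| v.parse::<bool>().ok())
                .unwrap_or(false),
            otlp_endpoint: lookup("ROBOTLB_OTLP_ENDPOINT")
                .unwrap_or_else(|| "http://localhost:4317".to_string()),
            sample_ratio: lookup("ROBOTLB_TRACING_SAMPLE_RATIO")
                .and_then(|v| v.parse::<f64>().ok())
                .unwrap_or(1.0),
            service_name: lookup("ROBOTLB_SERVICE_NAME")
                .unwrap_or_else(|| "robotlb".to_string()),
        }
    }

    /// Assumed mapping: a ratio of 1.0 or more samples everything, anything
    /// lower becomes ratio-based sampling (negative values clamp to 0.0).
    pub fn sampler(&self) -> SamplerChoice {
        if self.sample_ratio >= 1.0 {
            SamplerChoice::AlwaysOn
        } else {
            SamplerChoice::TraceIdRatioBased(self.sample_ratio.max(0.0))
        }
    }
}
```

Reading from the real environment would then be `TracingConfig::from_lookup(|key| std::env::var(key).ok())`. Taking the lookup as a closure is a design choice for this sketch so the defaults can be checked without mutating process state.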
Key Features
✅ Zero overhead when disabled (default behavior)
✅ OTLP standard protocol (works with Jaeger, Tempo, SigNoz, etc.)
✅ Automatic instrumentation of existing #[tracing::instrument] calls
✅ Configurable sampling for production cost control
✅ Proper cleanup on shutdown
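The "disabled by default" and "cleanup on shutdown" points can be illustrated with a drop-based guard. This is a sketch under assumptions, not the actual TracingGuard: the closure field, the `new` constructor, and the `init_tracing_sketch` function are hypothetical, and the real cleanup would shut down the tracer provider rather than call an arbitrary closure.

```rust
/// Hypothetical shutdown guard; the real TracingGuard in src/otel_tracing.rs
/// may be structured differently.
pub struct TracingGuard {
    // Cleanup to run exactly once, e.g. flushing and shutting down a
    // tracer provider. `None` means tracing was never initialized.
    shutdown: Option<Box<dyn FnOnce()>>,
}

impl TracingGuard {
    /// Guard for the disabled case: dropping it does nothing.
    #[must_use]
    pub fn empty() -> Self {
        Self { shutdown: None }
    }

    /// Guard that runs `shutdown` when it goes out of scope.
    #[must_use]
    pub fn new(shutdown: impl FnOnce() + 'static) -> Self {
        Self {
            shutdown: Some(Box::new(shutdown)),
        }
    }
}

impl Drop for TracingGuard {
    fn drop(&mut self) {
        // `take` ensures the cleanup cannot run twice.
        if let Some(shutdown) = self.shutdown.take() {
            shutdown();
        }
    }
}

/// Stand-in for initialization: when tracing is disabled no exporter is
/// built and an empty guard is returned, so nothing runs on exit.
pub fn init_tracing_sketch(
    enabled: bool,
    on_shutdown: impl FnOnce() + 'static,
) -> TracingGuard {
    if enabled {
        TracingGuard::new(on_shutdown)
    } else {
        TracingGuard::empty()
    }
}
```

Holding the guard in a `let _guard = ...;` binding at the top of `main` keeps it alive for the whole run; binding it to a bare `_` would drop it immediately.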
Next Steps (Future PRs)
- Add Helm chart values for tracing configuration
- Update README with usage examples
- Create example Grafana Tempo/Jaeger deployment manifests
- Add enhanced instrumentation for Hetzner API calls (Phase 2)
The implementation follows the recommended approach from the trade-off analysis and provides production-ready distributed tracing capabilities for debugging and observability.
Commit: ebf4e85
Branch: docs/opentelemetry-tracing-analysis
Resolves: #51
@forkline could you add support to the helm chart?
- Add tracing section to values.yaml with OTLP endpoint, protocol, sampling, and service name
- Update deployment template to inject tracing env vars when enabled
- Fix missing metrics.service section in values.yaml
Added OpenTelemetry tracing support to the Helm chart:
values.yaml - New tracing section:
- enabled: default false (zero overhead)
- otlp.endpoint: OTLP collector endpoint
- otlp.protocol: grpc (default) or http
- samplingRatio: 1.0 default
- serviceName: optional override

deployment.yaml - Injects env vars when tracing enabled:
- ROBOTLB_TRACING_ENABLED
- ROBOTLB_TRACING_OTLP_ENDPOINT
- ROBOTLB_TRACING_OTLP_PROTOCOL
- ROBOTLB_TRACING_SAMPLING_RATIO
- ROBOTLB_TRACING_SERVICE_NAME
Also fixed a pre-existing bug: the metrics.service section was missing from values.yaml.
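On the consuming side, the two constrained values injected by the chart (protocol and sampling ratio) could be checked along these lines. This is an illustrative sketch, not code from the PR: `OtlpProtocol`, `parse_sampling_ratio`, the case-insensitive matching, and the error strings are all assumptions. The accepted protocol names (grpc, http) come from the values.yaml description, and the 0.0 to 1.0 range is the usual meaning of a ratio-based sampler argument.

```rust
use std::str::FromStr;

/// Hypothetical enum for the otlp.protocol value (grpc or http).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OtlpProtocol {
    Grpc,
    Http,
}

impl FromStr for OtlpProtocol {
    type Err = String;

    /// Accepts "grpc" or "http"; ignoring case and surrounding whitespace
    /// is an assumption of this sketch.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "grpc" => Ok(Self::Grpc),
            "http" => Ok(Self::Http),
            other => Err(format!("unsupported OTLP protocol: {other}")),
        }
    }
}

/// Parses a sampling ratio and rejects values outside 0.0..=1.0
/// (NaN is rejected as well, since it is not contained in the range).
pub fn parse_sampling_ratio(raw: &str) -> Result<f64, String> {
    let ratio: f64 = raw
        .trim()
        .parse()
        .map_err(|e| format!("invalid sampling ratio {raw:?}: {e}"))?;
    if (0.0..=1.0).contains(&ratio) {
        Ok(ratio)
    } else {
        Err(format!("sampling ratio {ratio} is outside 0.0..=1.0"))
    }
}
```

Rejecting bad values at startup, rather than silently falling back to a default, is one possible design; the PR does not say which behavior the operator implements.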
@forkline could you ensure that pre-commit passes and add a helm schema values JSON file to ENFORCE the configuration options
Done. Added helm/values.schema.json to enforce configuration options for the Helm chart, and ensured all pre-commit hooks pass (clippy, fmt, yamllint, etc.).
The schema validates:
- Image configuration (repository, pullPolicy, tag)
- Service account and RBAC permissions
- Metrics configuration (ServiceMonitor, PrometheusRules, Grafana dashboards)
- Tracing configuration (enabled, otlp endpoint/protocol, sampling ratio, service name)
- Add #[allow(clippy::struct_excessive_bools)] to OperatorConfig
- Add #[must_use] to TracingGuard::empty()
- Add # Errors doc to init_tracing
- Remove useless LevelFilter::from() conversions
- Fix import ordering
139af51 to c1f7564
Summary
This PR adds a comprehensive trade-off analysis for integrating OpenTelemetry distributed tracing support into RobotLB, addressing issue #51.
Key Findings
Recommendation: Proceed with OpenTelemetry tracing integration
tracing crate and OpenTelemetry for metrics
Current State
The project already has:
- tracing crate for structured logging
- #[tracing::instrument] annotations on key functions
Proposed Integration
Add OTLP (OpenTelemetry Protocol) export with:
- tracing-opentelemetry bridge layer
Benefits
Costs
Next Steps
Resolves: #51