Skip to content

docs: add OpenTelemetry tracing integration trade-off analysis#52

Merged
pando85 merged 6 commits intomasterfrom
docs/opentelemetry-tracing-analysis
Mar 4, 2026
Merged

docs: add OpenTelemetry tracing integration trade-off analysis#52
pando85 merged 6 commits intomasterfrom
docs/opentelemetry-tracing-analysis

Conversation

@forkline-bot
Copy link

@forkline-bot forkline-bot bot commented Mar 4, 2026

Summary

This PR adds a comprehensive trade-off analysis for integrating OpenTelemetry distributed tracing support into RobotLB, addressing issue #51.

Key Findings

Recommendation: Proceed with OpenTelemetry tracing integration

  • Low implementation effort: ~5-7 hours estimated
  • Leverages existing infrastructure: Already using tracing crate and OpenTelemetry for metrics
  • Opt-in by default: Zero overhead when disabled
  • Cloud-native standard: Aligns with Kubernetes ecosystem best practices

Current State

The project already has:

  • OpenTelemetry SDK 0.31 for metrics
  • tracing crate for structured logging
  • #[tracing::instrument] annotations on key functions

Proposed Integration

Add OTLP (OpenTelemetry Protocol) export with:

  • tracing-opentelemetry bridge layer
  • Configurable OTLP endpoint
  • Sampling ratio support
  • Default disabled (zero overhead)

Benefits

Benefit Impact
End-to-end visibility High
Performance debugging High
Error correlation High
Unified observability Medium

Costs

Cost Impact
Binary size Low (~500KB-1MB)
Runtime overhead Low (1-3% when enabled)
Dependency complexity Medium (3-4 new crates)

Next Steps

  1. Approve integration approach
  2. Implement Phase 1 (Core Integration)
  3. Add Helm chart values for tracing configuration
  4. Update documentation
  5. Create example Grafana Tempo/Jaeger deployment

Resolves: #51

Analyze the trade-offs of adding distributed tracing support via
OpenTelemetry. The project already uses OpenTelemetry for metrics
and the tracing crate for structured logging, making integration
straightforward.

Key findings:
- Low implementation effort (~5-7 hours)
- Opt-in by default with zero overhead when disabled
- Leverages existing tracing infrastructure
- Recommended: OTLP export with tracing-opentelemetry bridge

Resolves: #51
@pando85
Copy link

pando85 commented Mar 4, 2026

@forkline implement it

@forkline-bot
Copy link
Author

forkline-bot bot commented Mar 4, 2026

Task Failed

I encountered an error while trying to complete this task.

Error:

model inactivity timeout: no output for 300 seconds

If this issue persists, please check the task logs or try again.


Task ID: 1db17742-167d-4f46-87ef-59b7c593c9c9

@pando85
Copy link

pando85 commented Mar 4, 2026

@forkline retry

Add OTLP exporter support with opt-in configuration:
- Add tracing-opentelemetry bridge layer
- Add opentelemetry-otlp with grpc-tonic features
- Add configuration options for tracing (enabled, endpoint, sampling, service name)
- Create otel_tracing module with initialization and cleanup
- Integrate tracing into main.rs with proper shutdown handling
- Default disabled for zero overhead when not in use

Resolves: #51
Copy link
Author

@forkline-bot forkline-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Successfully implemented OpenTelemetry distributed tracing integration for RobotLB as specified in the trade-off analysis (issue #51).

Implementation Details

Core Integration (Phase 1 - Complete):

  1. Dependencies Added:

    • opentelemetry-otlp with trace and grpc-tonic features
    • opentelemetry-semantic-conventions
    • tracing-opentelemetry bridge layer
    • Updated tracing-subscriber with env-filter feature
  2. Configuration (src/config.rs:108-127):

    • ROBOTLB_TRACING_ENABLED (default: false) - Opt-in with zero overhead
    • ROBOTLB_OTLP_ENDPOINT (default: http://localhost:4317)
    • ROBOTLB_TRACING_SAMPLE_RATIO (default: 1.0)
    • ROBOTLB_SERVICE_NAME (default: robotlb)
  3. Tracing Module (src/otel_tracing.rs):

    • OTLP exporter with gRPC/Tonic transport
    • Configurable sampling (AlwaysOn or TraceIdRatioBased)
    • Resource with service name attribution
    • Batch exporter for efficient trace delivery
    • Proper shutdown handling via TracingGuard
  4. Integration (src/main.rs:101-102, 170):

    • Initializes tracing at startup with configuration
    • Properly shuts down tracer provider on exit
    • Bridges existing #[tracing::instrument] annotations automatically
  5. Tests:

    • Updated config tests to include new tracing fields
    • All 27 tests passing

Key Features

Zero overhead when disabled (default behavior)
OTLP standard protocol (works with Jaeger, Tempo, SigNoz, etc.)
Automatic instrumentation of existing #[tracing::instrument] calls
Configurable sampling for production cost control
Proper cleanup on shutdown

Next Steps (Future PRs)

  • Add Helm chart values for tracing configuration
  • Update README with usage examples
  • Create example Grafana Tempo/Jaeger deployment manifests
  • Add enhanced instrumentation for Hetzner API calls (Phase 2)

The implementation follows the recommended approach from the trade-off analysis and provides production-ready distributed tracing capabilities for debugging and observability.

Commit: ebf4e85
Branch: docs/opentelemetry-tracing-analysis
Resolves: #51

@pando85
Copy link

pando85 commented Mar 4, 2026

@forkline could you add support to the helm chart?

- Add tracing section to values.yaml with OTLP endpoint, protocol, sampling, and service name
- Update deployment template to inject tracing env vars when enabled
- Fix missing metrics.service section in values.yaml
Copy link
Author

@forkline-bot forkline-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added OpenTelemetry tracing support to the Helm chart:

values.yaml - New tracing section:

  • enabled: default false (zero overhead)
  • otlp.endpoint: OTLP collector endpoint
  • otlp.protocol: grpc (default) or http
  • samplingRatio: 1.0 default
  • serviceName: optional override

deployment.yaml - Injects env vars when tracing enabled:

  • ROBOTLB_TRACING_ENABLED
  • ROBOTLB_TRACING_OTLP_ENDPOINT
  • ROBOTLB_TRACING_OTLP_PROTOCOL
  • ROBOTLB_TRACING_SAMPLING_RATIO
  • ROBOTLB_TRACING_SERVICE_NAME

Also fixed pre-existing bug: missing metrics.service section in values.yaml.

@pando85
Copy link

pando85 commented Mar 4, 2026

@forkline could you ensure that pre-commit passes and add a helm schema values JSON file to ENFORCE the configuration options

Copy link
Author

@forkline-bot forkline-bot bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added helm/values.schema.json to enforce configuration options for the Helm chart, and ensured all pre-commit hooks pass (clippy, fmt, yamllint, etc.).

The schema validates:

  • Image configuration (repository, pullPolicy, tag)
  • Service account and RBAC permissions
  • Metrics configuration (ServiceMonitor, PrometheusRules, Grafana dashboards)
  • Tracing configuration (enabled, otlp endpoint/protocol, sampling ratio, service name)

- Add #[allow(clippy::struct_excessive_bools)] to OperatorConfig
- Add #[must_use] to TracingGuard::empty()
- Add # Errors doc to init_tracing
- Remove useless LevelFilter::from() conversions
- Fix import ordering
@forkline-bot forkline-bot bot force-pushed the docs/opentelemetry-tracing-analysis branch from 139af51 to c1f7564 Compare March 4, 2026 20:03
@pando85 pando85 merged commit 32012da into master Mar 4, 2026
1 check passed
@pando85 pando85 deleted the docs/opentelemetry-tracing-analysis branch March 4, 2026 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Analyze the trade-off of integrating with opentelemetry and traces

1 participant