feat(observability): add distributed tracing, metrics, log aggregatio…#4

Open
Hexor-Hash wants to merge 2 commits into teslims2:dashboard-implementation from Hexor-Hash:feat/observability-518

@Hexor-Hash

…n, alerting, and dashboards

  • Add OpenTelemetry SDK with auto-instrumentation for HTTP, PostgreSQL, Redis
  • Add TracingService with withSpan() helper for custom spans
  • Extend MetricsService with new metrics: http_request_duration_seconds, enrollments_total, course_completions_total, auth_attempts_total, active_connections
  • Add Loki transport to Winston logger for log aggregation (opt-in via LOKI_URL)
  • Add Prometheus alerting rules: HighErrorRate, SlowHttpResponses, SlowStellarRpc, HighMemoryUsage, HighEventLoopLag, ServiceDown, HighAuthFailureRate
  • Add Alertmanager configuration with critical/warning routing
  • Add Brain-Storm Overview Grafana dashboard with 11 panels
  • Add Loki datasource to Grafana provisioning
  • Add Loki and Alertmanager services to docker-compose.monitoring.yml
  • Add comprehensive observability documentation (docs/observability.md)
  • Add OpenTelemetry (OTel), winston-loki, and nest-winston packages to backend package.json
  • Add OTEL_EXPORTER_OTLP_ENDPOINT and LOKI_URL to .env.example
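
For reviewers unfamiliar with the withSpan() pattern mentioned above, here is a minimal sketch of what such a helper usually does: start a span, run the callback, record any error on the span, and always end the span. This is not the code from this PR. `SpanLike`, `TracerLike`, `RecordingTracer`, and `demo` are hypothetical names, and the two interfaces are simplified stand-ins for the span and tracer types in `@opentelemetry/api` (which, for example, uses a numeric status code enum rather than the strings used here).

```typescript
// Hypothetical stand-ins for the small part of the OpenTelemetry span/tracer
// surface this sketch needs. The real types live in @opentelemetry/api.
type AttributeValue = string | number | boolean;

interface SpanLike {
  setAttribute(key: string, value: AttributeValue): void;
  recordException(error: Error): void;
  setStatus(status: { code: "OK" | "ERROR"; message?: string }): void;
  end(): void;
}

interface TracerLike {
  startSpan(name: string): SpanLike;
}

// Runs `fn` inside a span. Errors are recorded on the span and rethrown,
// and the span is ended whether `fn` succeeds or fails.
async function withSpan<T>(
  tracer: TracerLike,
  name: string,
  fn: (span: SpanLike) => Promise<T> | T,
): Promise<T> {
  const span = tracer.startSpan(name);
  try {
    const result = await fn(span);
    span.setStatus({ code: "OK" });
    return result;
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    span.recordException(error);
    span.setStatus({ code: "ERROR", message: error.message });
    throw err;
  } finally {
    span.end();
  }
}

// In-memory tracer used only to demonstrate the helper without an exporter.
interface RecordedSpan {
  name: string;
  status?: "OK" | "ERROR";
  errors: string[];
  attributes: Record<string, AttributeValue>;
  ended: boolean;
}

class RecordingTracer implements TracerLike {
  readonly spans: RecordedSpan[] = [];

  startSpan(name: string): SpanLike {
    const record: RecordedSpan = { name, errors: [], attributes: {}, ended: false };
    this.spans.push(record);
    return {
      setAttribute: (key, value) => { record.attributes[key] = value; },
      recordException: (error) => { record.errors.push(error.message); },
      setStatus: (status) => { record.status = status.code; },
      end: () => { record.ended = true; },
    };
  }
}

// Usage: one successful span and one failing span (span names are made up).
async function demo(): Promise<RecordedSpan[]> {
  const tracer = new RecordingTracer();
  await withSpan(tracer, "enrollment.create", (span) => {
    span.setAttribute("course.id", "course-123");
    return "ok";
  });
  await withSpan<void>(tracer, "enrollment.fail", () => {
    throw new Error("boom");
  }).catch(() => undefined); // swallow the rethrown error for the demo
  return tracer.spans;
}
```

The `finally` block is the key design choice: a span that is never ended is typically never exported, so ending it on both the success and failure paths keeps traces complete. With the real SDK, `tracer.startActiveSpan` is the usual starting point for a helper like this because it also makes the new span the active one for nested calls.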

Closes BrainTease#518

Description

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Dependency update
  • CI/CD improvement

Related Issues

Closes #

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • E2E tests added/updated (if applicable)
  • Manual testing performed

Documentation

  • README updated (if applicable)
  • API documentation updated (if applicable)
  • Code comments added for complex logic
  • Migration guide added (if breaking changes)

Breaking Changes

  • No breaking changes
  • Breaking changes documented below:

Checklist

  • My code follows the project's style guidelines
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests passed locally with my changes
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Context
