Feature: Observability & Monitoring Stack
Background
RELab has no active monitoring. Issues are discovered reactively — someone notices the app is broken, then we SSH in and run docker logs -f. There is no visibility into performance degradation, rising error rates, or infrastructure pressure before they become user-facing problems.
Goal
Implement observability covering logs, metrics, and traces, with alerting, so that production health is visible at a glance and problems are surfaced proactively.
This is split into two phases. V1 is fast to ship: two lightweight self-hosted tools with minimal infrastructure. V2 replaces or extends V1 with a full self-hosted observability stack.
V1 — Lightweight self-hosted (two new Docker services)
Stack: Dozzle · Uptime Kuma
Both are open-source, single-container, near-zero config, and use negligible resources.
Dozzle — log viewer
A read-only web UI for Docker container logs. No storage, no agents, no config — just mount the Docker socket. Replaces docker logs -f with a browser UI that supports multi-container tailing, search, and filtering.
Uptime Kuma — uptime & alerting
Point it at the existing /live healthcheck endpoint for each service. This gives:
- Uptime monitoring with a status page
- Alerts via email, Slack, Discord, ntfy, or webhook on downtime
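To make the check concrete, here is a minimal sketch of the kind of probe Uptime Kuma would run against /live. It is not Uptime Kuma's implementation. The service URL, the function names, and the "2xx means up" rule are assumptions for illustration.

```python
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical service URL; the real hostnames and ports are not given in this issue.
SERVICES = {"backend": "http://backend:8000/live"}


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        return exc.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def check_services(
    services: dict[str, str], fetch: Callable[[str], int] = http_status
) -> dict[str, bool]:
    """Map each service name to True when its /live endpoint answers with 2xx."""
    return {name: 200 <= fetch(url) < 300 for name, url in services.items()}


def newly_down(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Services that were up (or unseen) in the last round and are down now."""
    return sorted(
        name for name, up in current.items() if not up and previous.get(name, True)
    )
```

Alerting only on the up-to-down transition, rather than on every failed poll, is what keeps a long outage from producing a notification per check interval.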
Both services should be added to compose.prod.yml and exposed via the Cloudflare Tunnel with access control.
V1 Acceptance Criteria
V2 — Self-hosted Grafana LGTM Stack (full observability)
Stack: Prometheus · Loki · Tempo · Grafana · Grafana Alloy
All components are open-source and run on the existing server as additional Docker Compose services. Grafana is exposed via the existing Cloudflare Tunnel (with access control).
Backend instrumentation
- OpenTelemetry auto-instrumentation for FastAPI, SQLAlchemy, and Redis. Export traces to Tempo via OTLP.
- Prometheus /metrics endpoint (request rate, latency histograms, error rate per endpoint).
- Trace IDs injected into loguru records for log-trace correlation.
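The log-trace correlation item could be sketched as follows. This is a library-free illustration, not the project's code: `format_ids`, `make_patcher`, and the `current_span_ids` callback are hypothetical names. The hex widths (32 characters for a trace ID, 16 for a span ID) follow W3C Trace Context.

```python
from typing import Callable, Optional, Tuple

# (trace_id, span_id) as integers, or None when no span is active.
SpanIds = Optional[Tuple[int, int]]


def format_ids(trace_id: int, span_id: int) -> dict[str, str]:
    """Render the IDs as zero-padded lowercase hex, the form trace backends display."""
    return {"trace_id": format(trace_id, "032x"), "span_id": format(span_id, "016x")}


def make_patcher(current_span_ids: Callable[[], SpanIds]) -> Callable[[dict], None]:
    """Build a function that copies the active trace IDs into a record's "extra" dict."""

    def patcher(record: dict) -> None:
        ids = current_span_ids()
        if ids is not None:
            record["extra"].update(format_ids(*ids))

    return patcher


# Assumed wiring (not executed here): with loguru the patcher would be attached via
# logger.patch(...), and current_span_ids would read trace_id and span_id from
# opentelemetry.trace.get_current_span().get_span_context().
```

Once every record carries `trace_id`, a Loki log line can link to the matching trace in Tempo, provided the log format actually emits the extra field.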
New Compose services
Prometheus, Loki, Tempo, Grafana, Grafana Alloy, postgres_exporter — all on an internal network.
Alloy tails Docker log streams → Loki and scrapes host metrics (CPU, memory, disk, network).
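Alloy handles log shipping itself, so none of this needs to be written by hand. As an illustration of what ends up in Loki, the sketch below builds the JSON body accepted by Loki's HTTP push endpoint (`/loki/api/v1/push`). The `container` label and the function name are assumptions.

```python
import json
import time
from typing import Optional


def loki_push_body(
    labels: dict[str, str], lines: list[str], now_ns: Optional[int] = None
) -> str:
    """Build a Loki push payload: one stream of (timestamp, line) pairs under a label set."""
    # Loki expects the timestamp as a string of Unix nanoseconds.
    ts = str(now_ns if now_ns is not None else time.time_ns())
    payload = {
        "streams": [
            {
                "stream": labels,  # e.g. {"container": "backend"} (hypothetical label)
                "values": [[ts, line] for line in lines],
            }
        ]
    }
    return json.dumps(payload)
```

The label set is what the dashboards and alert rules later filter on, so it is worth settling on a small, stable set of labels (service or container name, level) early.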
Dashboards (provisioned as JSON, no manual setup)
- Service health — request rate, error rate, p95/p99 latency per endpoint
- Infrastructure — CPU, memory, disk, network per container and host
- Database & cache — Postgres connection pool, Redis hit/miss ratio
- Logs — Loki panels per service with level breakdown
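The p95/p99 panels would be derived from the latency histograms exposed on /metrics. The sketch below shows the usual estimation technique: linear interpolation inside cumulative buckets, similar in spirit to PromQL's `histogram_quantile`. It is an illustration, not Prometheus's exact implementation, and the bucket layout is an assumption.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by upper bound,
    ending with an overflow bucket whose bound is math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies in the overflow bucket: report the highest finite bound.
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

Because the estimate can never be finer than the bucket boundaries, the histogram should have a bucket edge near any latency threshold that matters for alerting.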
Alerting
Notify via email or webhook on:
- Any service healthcheck failing > 1 min
- HTTP error rate > 1% over 5 min
- p99 latency > 2s over 5 min
- Disk > 80% or memory > 90% sustained
- No logs from a service for > 5 min (silent crash detection)
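In practice these conditions would live in Grafana or Prometheus alert rules. The sketch below only pins down the thresholds as executable logic. `WindowStats`, its field names, and the alert names are hypothetical, and "sustained" is not modelled because the issue gives no duration for it.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Measurements for one service; all field names are hypothetical."""

    requests_5m: int
    errors_5m: int
    p99_latency_s: float
    disk_used_pct: float
    memory_used_pct: float
    seconds_since_last_log: float
    healthcheck_failing_s: float


def firing_alerts(s: WindowStats) -> list[str]:
    """Return the names of the alert conditions listed above that currently hold."""
    alerts = []
    if s.healthcheck_failing_s > 60:
        alerts.append("healthcheck_failing")
    # Skip the ratio when there was no traffic, to avoid dividing by zero.
    if s.requests_5m > 0 and s.errors_5m / s.requests_5m > 0.01:
        alerts.append("high_error_rate")
    if s.p99_latency_s > 2.0:
        alerts.append("high_p99_latency")
    if s.disk_used_pct > 80:
        alerts.append("disk_pressure")
    if s.memory_used_pct > 90:
        alerts.append("memory_pressure")
    if s.seconds_since_last_log > 300:
        alerts.append("no_logs")
    return alerts
```

All comparisons are strict, matching the ">" in the list: a service at exactly 1% errors or exactly 80% disk does not fire.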
V2 Acceptance Criteria
Out of Scope
- Frontend RUM
- CI observability (test timing, flake detection)