1. Summary
This RFC proposes an Online Monitoring subsystem in RL-Insight to improve live operational visibility during complex RL training (e.g., long context, agentic loops, distributed train/infer). The subsystem will provide:
- Metrics: counter increment, gauge set, histogram / distribution observe
- Traces: spans with attributes (latency and cross-component timelines)
- Logs (optional / phased): structured logging hook with a stable API surface
- Multi-process / multi-node ingestion and aggregation toward Grafana-friendly backends
- Grafana as the primary UI (metrics + traces + state-style timelines), including dashboard JSON export/import
- Optional offline export of time-sliced markers / trace-like JSON for post-hoc inspection
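To make the producer surface listed above more concrete, the following is a minimal in-memory sketch. The class name `Telemetry` and the method names `inc`, `set_gauge`, `observe`, and `span` are hypothetical placeholders, not the API this RFC commits to; a real implementation would forward to a collector instead of storing values locally.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Telemetry:
    """Hypothetical in-memory producer; illustrative only, not the RL-Insight API."""

    def __init__(self):
        self.counters = defaultdict(float)   # monotonically increasing totals
        self.gauges = {}                     # last-written values
        self.histograms = defaultdict(list)  # raw observations per metric
        self.spans = []                      # finished spans, in completion order

    def inc(self, name, value=1.0):
        """Counter increment."""
        self.counters[name] += value

    def set_gauge(self, name, value):
        """Gauge set: overwrite the current value."""
        self.gauges[name] = value

    def observe(self, name, value):
        """Histogram / distribution observe."""
        self.histograms[name].append(value)

    @contextmanager
    def span(self, name, **attributes):
        """Record a span with attributes; duration is measured on exit."""
        start = time.perf_counter()
        try:
            # Yield the attribute dict so callers can add fields mid-span.
            yield attributes
        finally:
            self.spans.append({
                "name": name,
                "attributes": attributes,
                "duration_s": time.perf_counter() - start,
            })


telemetry = Telemetry()
with telemetry.span("rollout", rank=0) as attrs:
    telemetry.inc("rollout_episodes_total")
    telemetry.set_gauge("replay_buffer_size", 128)
    telemetry.observe("tool_call_latency_s", 0.25)
    attrs["num_turns"] = 3
```

Yielding the attribute dictionary from the span lets a caller attach values that are only known once the work has finished (here, the number of turns), which is a common need in agentic loops.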
2. Motivation
Modern RL systems are increasingly distributed and agentic. Scalar experiment trackers are often insufficient for runtime questions, while continuous deep profiling is too expensive for day-to-day operation.
| Pain Point | Description |
| --- | --- |
| Coarse signals | Tools like TensorBoard / WandB are strong for curves but often miss runtime structure (rank skew, stalls, phase boundaries, tool-loop latency). |
| Profiling cost | Profiling is rich but heavy for long runs; it should complement, not replace, lightweight online telemetry. |
| Cross-component correlation | Train/infer split and multi-turn tools require correlated metrics + spans across processes, not isolated logs. |
RL-Insight focuses on performance insight with a well-defined data protocol; online monitoring extends that with always-on, low-friction telemetry aligned with the Grafana ecosystem.
3. Design Goals
- G1: Provide a small, framework-agnostic producer API (metrics + spans + structured log hook).
- G2: Support multi-node / multi-process ingestion with documented behavior under load (batching, backpressure, cardinality guidance).
- G3: Integrate with Grafana: Prometheus-style metrics, OTLP-compatible traces, and state-timeline-style panels where appropriate; ship versioned dashboard JSON for reproducible setup.
- G4: Define a versioned event schema so collectors, backends, and dashboards can evolve without breaking producers.
- G5: Provide an optional offline path: export time-sliced or trace-like JSON for external viewers (e.g., Chrome trace–compatible workflows where feasible).
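As one possible reading of G4 and G5 together, the sketch below tags each span event with a schema version and converts a batch of events into the Chrome Trace Event JSON object format (complete events, `"ph": "X"`, timestamps in microseconds). The `SpanEvent` fields, the `SCHEMA_VERSION` constant, and the policy of skipping events from a newer schema are all assumptions for illustration; the RFC does not yet fix any of them.

```python
import json
from dataclasses import dataclass, field

SCHEMA_VERSION = 1  # hypothetical current version of the event schema


@dataclass
class SpanEvent:
    """Hypothetical versioned span record emitted by a producer."""
    name: str
    start_us: int        # start time in microseconds
    duration_us: int     # duration in microseconds
    process_id: int      # e.g. rank or worker index
    thread_id: int = 0
    attributes: dict = field(default_factory=dict)
    schema_version: int = SCHEMA_VERSION


def to_chrome_trace(events):
    """Convert span events into the Chrome Trace Event JSON object format."""
    trace_events = []
    for ev in events:
        # Assumed policy: an exporter ignores events from a schema newer
        # than the one it understands rather than guessing at their layout.
        if ev.schema_version > SCHEMA_VERSION:
            continue
        trace_events.append({
            "name": ev.name,
            "ph": "X",              # "complete" event: has both ts and dur
            "ts": ev.start_us,
            "dur": ev.duration_us,
            "pid": ev.process_id,
            "tid": ev.thread_id,
            "args": ev.attributes,
        })
    return {"traceEvents": trace_events}


events = [
    SpanEvent("generate", start_us=0, duration_us=1500, process_id=0),
    SpanEvent("train_step", start_us=1500, duration_us=900, process_id=1,
              attributes={"step": 7}),
    SpanEvent("future_event", start_us=0, duration_us=1, process_id=0,
              schema_version=99),
]
trace_json = json.dumps(to_chrome_trace(events), indent=2)
```

Writing `trace_json` to a file should yield something loadable in viewers that accept the Chrome trace format. Carrying the version on every event, rather than once per file, keeps mixed-version batches from different ranks inspectable.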
4. Scope & Non-Goals
- In scope: online operational telemetry, reference local stack documentation (e.g., compose), and redaction hooks for sensitive fields.
- Out of scope (initially): replacing full GPU/kernel profilers as the primary analysis tool; hosted SaaS; guaranteed automatic root-cause attribution for every slowdown.
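A redaction hook for sensitive fields could be as small as a function applied to span or log attributes before they leave the process. The sketch below is an assumption about shape only: the function name `redact`, the key pattern, and the `[REDACTED]` placeholder are hypothetical, and a real deployment would likely make the predicate configurable.

```python
import re

# Hypothetical default: treat keys that look like credentials as sensitive.
SENSITIVE_KEY = re.compile(r"(token|secret|password|api_key)", re.IGNORECASE)


def redact(attributes, is_sensitive=SENSITIVE_KEY.search):
    """Return a copy of `attributes` with sensitive values masked.

    `is_sensitive` is any callable that takes a key and returns a truthy
    value when the corresponding value must not be exported.
    """
    return {
        key: "[REDACTED]" if is_sensitive(key) else value
        for key, value in attributes.items()
    }


clean = redact({"tool": "search", "api_key": "abc123", "latency_ms": 42})
```

Returning a new dictionary instead of mutating the input keeps the original attributes available to in-process consumers while only the masked copy is exported.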