[RFC] RL-Insight Online Monitoring System

## 1. Summary

This RFC proposes an **Online Monitoring** subsystem in [RL-Insight](https://github.com/verl-project/rl-insight) to improve **live operational visibility** during complex RL training (e.g., long context, agentic loops, distributed train/infer). The subsystem will provide:

- **Metrics:** counter increment, gauge set, histogram / distribution observe  
- **Traces:** spans with attributes (latency and cross-component timelines)  
- **Logs (optional / phased):** structured logging hook with a stable API surface  
- **Multi-process / multi-node** ingestion and aggregation toward Grafana-friendly backends  
- **Grafana** as the primary UI (metrics + traces + state-style timelines), including **dashboard JSON** export/import  
- **Optional offline** export of time-sliced markers / trace-like JSON for post-hoc inspection  
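
To make the producer-side surface listed above concrete, here is a minimal in-memory sketch of the three metric operations plus a span context manager. The `Telemetry` class, its method names, and the metric names are hypothetical illustrations; the RFC does not yet define the actual API, and a real implementation would forward to a collector rather than store values in process.

```python
import threading
import time
from collections import defaultdict
from contextlib import contextmanager


class Telemetry:
    """Hypothetical in-memory producer; not the RL-Insight API."""

    def __init__(self):
        self._lock = threading.Lock()
        self.counters = defaultdict(float)
        self.gauges = {}
        self.histograms = defaultdict(list)
        self.spans = []

    @staticmethod
    def _key(name, labels):
        # Labels are part of the series identity, as in Prometheus-style metrics.
        return (name, tuple(sorted((labels or {}).items())))

    def inc(self, name, value=1.0, labels=None):
        """Counter increment."""
        with self._lock:
            self.counters[self._key(name, labels)] += value

    def set(self, name, value, labels=None):
        """Gauge set: the last written value wins."""
        with self._lock:
            self.gauges[self._key(name, labels)] = value

    def observe(self, name, value, labels=None):
        """Histogram observe: raw samples are kept here; a real backend would bucket them."""
        with self._lock:
            self.histograms[self._key(name, labels)].append(value)

    @contextmanager
    def span(self, name, **attributes):
        start = time.perf_counter()
        try:
            yield attributes  # the caller may add attributes while the span is open
        finally:
            duration = time.perf_counter() - start
            with self._lock:
                self.spans.append(
                    {"name": name, "duration_s": duration, "attributes": attributes}
                )


telemetry = Telemetry()
with telemetry.span("rollout", rank=0) as attrs:
    telemetry.inc("rollout_steps_total", labels={"rank": "0"})
    telemetry.inc("rollout_steps_total", labels={"rank": "0"})
    telemetry.set("kv_cache_utilization", 0.42)
    telemetry.observe("tool_call_latency_seconds", 0.125)
    attrs["turns"] = 3
```

Keeping the label set inside the series key is one way to make cardinality visible to the producer: every distinct label combination becomes a separate entry.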

---

## 2. Motivation

Modern RL systems are increasingly **distributed and agentic**. Scalar experiment trackers are often insufficient for **runtime** questions, while continuous deep profiling is too expensive for day-to-day operation.

| Pain Point | Description |
|------------|-------------|
| **Coarse signals** | Tools like TensorBoard / WandB are strong for curves but often miss **runtime structure** (rank skew, stalls, phase boundaries, tool-loop latency). |
| **Profiling cost** | Profiling is rich but **heavy** for long runs; it should complement—not replace—lightweight online telemetry. |
| **Cross-component correlation** | Train/infer split and multi-turn tools require **correlated metrics + spans** across processes, not isolated logs. |

RL-Insight focuses on **performance insight** with a well-defined data protocol; online monitoring extends that with **always-on, low-friction** telemetry aligned with the Grafana ecosystem.

---

## 3. Design Goals

- **G1:** Provide a **small, framework-agnostic** producer API (metrics + spans + structured log hook).  
- **G2:** Support **multi-node / multi-process** ingestion with documented behavior under load (batching, backpressure, cardinality guidance).  
- **G3:** Integrate with **Grafana**: Prometheus-style metrics, OTLP-compatible traces, and **state timeline** style panels where appropriate; ship **versioned dashboard JSON** for reproducible setup.  
- **G4:** Define a **versioned event schema** so collectors, backends, and dashboards can evolve without breaking producers.  
- **G5:** Provide an **optional offline** path: export time-sliced or trace-like JSON for external viewers (e.g., Chrome trace–compatible workflows where feasible).  
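
As a rough illustration of how G2, G4, and G5 could fit together, the sketch below defines a versioned event envelope, a bounded buffer that counts drops under load, and a conversion of span events to the Chrome trace event JSON layout. All names (`Event`, `BoundedBuffer`, `SCHEMA_VERSION`), the "same major version is compatible" rule, and the drop-oldest policy are assumptions made for this example, not decisions taken by the RFC.

```python
import json
from collections import deque
from dataclasses import dataclass, field

SCHEMA_VERSION = "1.0"  # hypothetical; the RFC does not fix a version string


@dataclass
class Event:
    kind: str            # "metric" or "span" in this sketch
    name: str
    ts_us: int           # start time in microseconds
    dur_us: int = 0      # span duration; 0 for point events
    rank: int = 0
    attrs: dict = field(default_factory=dict)
    schema_version: str = SCHEMA_VERSION


def is_compatible(event_version, reader_version=SCHEMA_VERSION):
    # Assumed policy: a reader can parse any event with the same major version.
    return event_version.split(".")[0] == reader_version.split(".")[0]


class BoundedBuffer:
    """Drop-oldest buffer that counts drops so overload stays visible."""

    def __init__(self, capacity):
        self._events = deque(maxlen=capacity)
        self.dropped = 0

    def put(self, event):
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # the deque evicts its oldest entry on append
        self._events.append(event)

    def drain(self):
        batch = list(self._events)
        self._events.clear()
        return batch


def to_chrome_trace(events):
    # Complete events ("ph": "X") carry a start timestamp and a duration,
    # both in microseconds.
    return {
        "traceEvents": [
            {"name": e.name, "ph": "X", "ts": e.ts_us, "dur": e.dur_us,
             "pid": e.rank, "tid": 0, "args": e.attrs}
            for e in events
            if e.kind == "span"
        ]
    }


buf = BoundedBuffer(capacity=2)
for i in range(3):
    buf.put(Event(kind="span", name=f"step_{i}", ts_us=i * 1000, dur_us=500, rank=i))
batch = buf.drain()
trace_json = json.dumps(to_chrome_trace(batch))
```

Dropping the oldest events favors recent data for live dashboards; a blocking or drop-newest policy would be equally valid choices and would only change `put`.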

---

## 4. Scope & Non-Goals

- **In scope:** online operational telemetry, reference local stack documentation (e.g., compose), and redaction hooks for sensitive fields.  
- **Out of scope (initially):** replacing full GPU/kernel profilers as the primary analysis tool; hosted SaaS; guaranteed automatic root-cause attribution for every slowdown.  
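
The redaction hooks mentioned in scope could take a shape like the following sketch, which masks configured keys in span or log attributes before they leave the process. The function name `redact`, the placeholder string, and the case-insensitive key matching are assumptions for illustration only.

```python
REDACTED = "[REDACTED]"


def redact(attributes, sensitive_keys):
    """Return a copy of `attributes` with sensitive values masked.

    Nested dicts are processed recursively; the input is not modified.
    """
    sensitive = {key.lower() for key in sensitive_keys}
    clean = {}
    for key, value in attributes.items():
        if key.lower() in sensitive:
            clean[key] = REDACTED
        elif isinstance(value, dict):
            clean[key] = redact(value, sensitive_keys)
        else:
            clean[key] = value
    return clean


raw = {
    "prompt": "secret text",
    "rank": 3,
    "tool": {"API_KEY": "abc", "name": "search"},
}
safe = redact(raw, ["prompt", "api_key"])
```

Returning a copy rather than mutating in place keeps the original attributes available to the training code while only the masked version is exported.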
