[RFC] RL-Insight Online Monitoring System #46

@mengchengTang

1. Summary

This RFC proposes an Online Monitoring subsystem in RL-Insight to improve live operational visibility during complex RL training runs (e.g., long-context training, agentic loops, distributed training and inference). The subsystem will provide:

  • Metrics: counter increment, gauge set, histogram / distribution observe
  • Traces: spans with attributes (latency and cross-component timelines)
  • Logs (optional / phased): structured logging hook with a stable API surface
  • Multi-process / multi-node ingestion and aggregation toward Grafana-friendly backends
  • Grafana as the primary UI (metrics + traces + state-style timelines), including dashboard JSON export/import
  • Optional offline export of time-sliced markers / trace-like JSON for post-hoc inspection
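To make the producer-side surface listed above concrete, here is a minimal in-memory sketch of the three metric operations and an attributed span. The `Telemetry` class, its method names, and the metric names are all hypothetical illustrations, not the RL-Insight API; a real producer would forward these records to a collector instead of holding them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Telemetry:
    """Hypothetical in-memory producer (illustration only, not the RL-Insight API)."""

    def __init__(self):
        self.counters = defaultdict(float)   # monotonically increasing totals
        self.gauges = {}                     # last-written values
        self.histograms = defaultdict(list)  # raw observations per metric
        self.spans = []                      # finished spans

    def inc(self, name, value=1.0):
        self.counters[name] += value

    def set_gauge(self, name, value):
        self.gauges[name] = value

    def observe(self, name, value):
        self.histograms[name].append(value)

    @contextmanager
    def span(self, name, **attributes):
        # Time the enclosed block and record it even if the block raises.
        start = time.perf_counter()
        try:
            yield attributes  # callers may add attributes while the span is open
        finally:
            self.spans.append({
                "name": name,
                "attributes": attributes,
                "duration_s": time.perf_counter() - start,
            })


telemetry = Telemetry()
with telemetry.span("rollout", rank=0) as attrs:
    telemetry.inc("rollout_steps_total")
    telemetry.set_gauge("replay_buffer_size", 128)
    telemetry.observe("tool_call_latency_s", 0.25)
    attrs["num_turns"] = 3
```

Keeping the span as a context manager means the duration is recorded on both normal exit and exceptions, which matters for stalls and failed tool calls.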

2. Motivation

Modern RL systems are increasingly distributed and agentic. Scalar experiment trackers are often insufficient for runtime questions, while continuous deep profiling is too expensive for day-to-day operation.

| Pain Point | Description |
| --- | --- |
| Coarse signals | Tools like TensorBoard / WandB are strong for curves but often miss runtime structure (rank skew, stalls, phase boundaries, tool-loop latency). |
| Profiling cost | Profiling is rich but heavy for long runs; it should complement, not replace, lightweight online telemetry. |
| Cross-component correlation | Train/infer split and multi-turn tools require correlated metrics + spans across processes, not isolated logs. |

RL-Insight focuses on performance insight with a well-defined data protocol; online monitoring extends that with always-on, low-friction telemetry aligned with the Grafana ecosystem.


3. Design Goals

  • G1: Provide a small, framework-agnostic producer API (metrics + spans + structured log hook).
  • G2: Support multi-node / multi-process ingestion with documented behavior under load (batching, backpressure, cardinality guidance).
  • G3: Integrate with Grafana: Prometheus-style metrics, OTLP-compatible traces, and state timeline style panels where appropriate; ship versioned dashboard JSON for reproducible setup.
  • G4: Define a versioned event schema so collectors, backends, and dashboards can evolve without breaking producers.
  • G5: Provide an optional offline path: export time-sliced or trace-like JSON for external viewers (e.g., Chrome trace–compatible workflows where feasible).
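As one possible reading of G4 and G5, the sketch below pairs a versioned span event with an export to the Chrome Trace Event JSON layout (complete events with `ph: "X"` and microsecond `ts` / `dur`). The `SpanEvent` fields, the `SCHEMA_VERSION` constant, and both helper functions are assumptions made for illustration; the RFC does not yet define the actual schema.

```python
import json
from dataclasses import dataclass, field

SCHEMA_VERSION = 1  # hypothetical; the RFC does not fix a version number


@dataclass
class SpanEvent:
    name: str
    start_us: int      # start timestamp in microseconds
    duration_us: int   # duration in microseconds
    pid: int           # producing process (e.g., one per rank)
    tid: int           # thread or logical lane within the process
    attributes: dict = field(default_factory=dict)
    schema_version: int = SCHEMA_VERSION


def parse_event(raw: dict) -> SpanEvent:
    """Reject unknown versions; ignore unknown fields so producers can add them."""
    version = raw.get("schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    known = {k: raw[k] for k in SpanEvent.__dataclass_fields__ if k in raw}
    return SpanEvent(**known)


def to_chrome_trace(events) -> dict:
    """Convert span events to the Chrome Trace Event 'complete event' form."""
    return {"traceEvents": [
        {"name": e.name, "ph": "X", "ts": e.start_us, "dur": e.duration_us,
         "pid": e.pid, "tid": e.tid, "args": e.attributes}
        for e in events
    ]}


raw = {"schema_version": 1, "name": "generate", "start_us": 1000,
       "duration_us": 250, "pid": 0, "tid": 1,
       "attributes": {"rank": 0}, "future_field": "ignored"}
event = parse_event(raw)
trace_json = json.dumps(to_chrome_trace([event]))
```

Carrying the version inside every event lets a collector reject or route incompatible payloads explicitly, while tolerating unknown fields keeps older collectors working when producers add optional data.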

4. Scope & Non-Goals

  • In scope: online operational telemetry, reference local stack documentation (e.g., compose), and redaction hooks for sensitive fields.
  • Out of scope (initially): replacing full GPU/kernel profilers as the primary analysis tool; hosted SaaS; guaranteed automatic root-cause attribution for every slowdown.
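A redaction hook of the kind mentioned in scope could be as small as a function applied to attributes before they leave the producer. The sketch below is an assumption about its shape: the `redact` name, the key set, and the mask string are hypothetical and not part of the RFC.

```python
SENSITIVE_KEYS = frozenset({"prompt", "api_key", "user_id"})  # hypothetical key names


def redact(attributes, sensitive_keys=SENSITIVE_KEYS, mask="[REDACTED]"):
    """Return a copy of attributes with sensitive values masked.

    Nested dicts are handled recursively; the input is left unmodified.
    """
    clean = {}
    for key, value in attributes.items():
        if key in sensitive_keys:
            clean[key] = mask
        elif isinstance(value, dict):
            clean[key] = redact(value, sensitive_keys, mask)
        else:
            clean[key] = value
    return clean


original = {"rank": 3, "prompt": "secret text", "tool": {"name": "search", "api_key": "abc"}}
safe = redact(original)
```

Returning a copy keeps the hook free of side effects, so the training process can still use the unredacted values locally.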
