[pipeline] feat: rl insight support online monitor#53

Merged
tardis-key merged 1 commit into
verl-project:mainfrom
mengchengTang:monitor
May 20, 2026

Conversation

Collaborator

@mengchengTang mengchengTang commented May 6, 2026

What does this PR do?

This PR introduces an experimental monitoring flow for RL-Insight in three parts:

  1. training-side metric and trace APIs
  2. a backend collector for aggregating metrics and exporting traces
  3. rl-insight server start / rl-insight server stop for managing the observability stack

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include pipeline, parser, visualizer, data, deployment, perf, algo, env, doc, cfg, ci, misc
    • If this PR involves multiple modules, separate them with , like [mstx, ci]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][mstx, torch_profile] feat: support timeline parsing

Test

  • rl-insight server start
  • verify Prometheus / Tempo / Grafana are up
  • verify Grafana loads Prometheus and Tempo datasources
  • verify Ctrl+C stops the full stack in foreground mode
  • verify rl-insight server stop works from another terminal

API and Usage Example

The following monitoring APIs are available from rl_insight:

  • rl_insight.init(config=None)
    Initialize the training-side monitor client.
  • rl_insight.close()
    Reset local monitoring state in the current process.
  • rl_insight.metric_count(name, amount=1.0, documentation="", **labels)
    Report a counter metric.
  • rl_insight.metric_value(name, value, documentation="", **labels)
    Report a gauge metric.
  • rl_insight.metric_distribution(name, value, documentation="", **labels)
    Report a histogram metric.
  • rl_insight.trace_state(state_name, state_lane_id=None, **labels)
    Record a state interval as a root span.
  • rl_insight.trace_op(name=None, extra_labels=None, **static_labels)
    Decorator for recording operation duration spans.
  • rl_insight.update_prometheus_config(conf, server_addresses, job_name=..., reload_mode=...)
    Update Prometheus scrape targets and trigger reload when supported.

import os
import ray
import rl_insight as insight

os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = "http://<server-ip>:4318/v1/traces"

ray.init(address="auto", namespace="rl-insight-monitor")
insight.init()

insight.metric_count("train_step_total", amount=1, worker="trainer_0")
insight.metric_value("reward_mean", value=1.23, worker="trainer_0")
insight.metric_distribution("step_latency_ms", value=42.5, worker="trainer_0")

with insight.trace_state("rollout", state_lane_id="trainer_0", step=10):
    run_rollout()

@insight.trace_op("update_model", stage="optimizer")
def update_model(batch):
    ...

insight.update_prometheus_config(
    conf={
        "prometheus": {
            "config_file": "/path/to/prometheus.yml",
            "prometheus_port": 9090,
            "reload": {"mode": "ray"},
        }
    },
    server_addresses=["127.0.0.1:9092"],
)
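
The three metric kinds listed above follow the usual Prometheus-style semantics: a counter only accumulates, a gauge keeps the last value written, and a histogram counts observations into buckets with inclusive upper bounds. The sketch below illustrates those semantics in plain Python. The class names `Counter`, `Gauge`, and `Histogram`, their methods, and the bucket bounds are assumptions made for illustration; they are not rl_insight internals.

```python
import bisect
from collections import defaultdict


class Counter:
    """Monotonically increasing total, keyed by label set."""

    def __init__(self):
        self.totals = defaultdict(float)

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.totals[frozenset(labels.items())] += amount


class Gauge:
    """Last-written value, keyed by label set."""

    def __init__(self):
        self.values = {}

    def set(self, value, **labels):
        self.values[frozenset(labels.items())] = float(value)


class Histogram:
    """Per-bucket counts plus a running sum and count (labels omitted for brevity)."""

    def __init__(self, bounds=(10.0, 50.0, 100.0)):  # hypothetical bucket bounds
        self.bounds = sorted(bounds)
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left picks the first bucket whose upper bound is >= value.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value
        self.count += 1

    def cumulative(self):
        """Return cumulative counts, the form in which buckets are exposed."""
        out, running = [], 0
        for bucket_count in self.bucket_counts:
            running += bucket_count
            out.append(running)
        return out
```

Under this reading, `train_step_total` fits a counter, `reward_mean` a gauge, and `step_latency_ms` a histogram, which matches how the usage example above reports them.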

Design & Code Changes

[Figure: rl-insight online monitoring architecture]

The diagram shows the end-to-end online monitoring flow in rl-insight.

The RL framework integrates with the rl-insight API, which provides lightweight collection APIs for function, variable, metric, and state monitoring. The API aggregates training-side monitoring events and forwards them to the rl-insight data collector. The collector handles backend data collection and processing, and can also interact with distributed runtime components such as Ray through RPC.

After aggregation and processing, the collector reports metrics and traces to the rl-insight server. The server manages the observability stack: Prometheus for metrics, Tempo for traces, and Grafana for visualization. rl-insight thus separates instrumentation, collection, storage, and visualization into distinct layers.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@mengchengTang mengchengTang marked this pull request as draft May 6, 2026 01:39
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an experimental online monitoring system for RL-Insight, leveraging Ray for event collection and Prometheus/OpenTelemetry for observability. It provides a high-level Python API for recording metrics and traces, a CLI for managing backend services via Docker Compose, and a central Ray actor to aggregate data. The review feedback highlights several performance and portability improvements, specifically recommending asynchronous event submission to avoid blocking the training loop, using batch processing for OpenTelemetry spans, and replacing external 'curl' dependencies with native Python libraries for better cross-environment compatibility.

Comment on lines +85 to +86
ref = self._actor.apply_event.remote(event)
ray.get(ref)
Contributor

high

Calling ray.get() on every event submission makes monitoring synchronous and blocking. This can significantly degrade the performance of the training loop, especially when recording metrics or traces at high frequency. Since monitoring events are typically fire-and-forget, it is recommended to remove the ray.get() call to allow asynchronous submission via Ray.

Suggested change
- ref = self._actor.apply_event.remote(event)
- ray.get(ref)
+ self._actor.apply_event.remote(event)

resource=self._otel.Resource.create(resource_attributes),
)
exporter = self._otel.OTLPSpanExporter(endpoint=resolved_endpoint)
provider.add_span_processor(self._otel.SimpleSpanProcessor(exporter))
Contributor

medium

SimpleSpanProcessor exports spans synchronously as they are ended, which can block the collector's execution thread if the OTLP endpoint is slow or under load. Using BatchSpanProcessor is recommended for better performance as it buffers spans and exports them in the background. Note that you will also need to update the imports and the SimpleNamespace in _require_opentelemetry to use BatchSpanProcessor.

Suggested change
- provider.add_span_processor(self._otel.SimpleSpanProcessor(exporter))
+ provider.add_span_processor(self._otel.BatchSpanProcessor(exporter))

Comment thread experimental/utils/prometheus_utils.py Outdated
Comment on lines +242 to +251
import subprocess

if not reload_url:
    hostname = socket.gethostname()
    ip_address = socket.gethostbyname(hostname)
    reload_url = f"http://{ip_address}:{port}/-/reload"

try:
    subprocess.run(["curl", "-X", "POST", reload_url], capture_output=True, text=True, timeout=10)
    print(f"Reloading Prometheus on node: {reload_url}")
Contributor

medium

Using subprocess.run(["curl", ...]) introduces an external dependency on the curl binary being present on all nodes in the cluster. It is more portable and robust to use Python's built-in urllib.request to perform the POST request for reloading Prometheus.

Suggested change
- import subprocess
- if not reload_url:
-     hostname = socket.gethostname()
-     ip_address = socket.gethostbyname(hostname)
-     reload_url = f"http://{ip_address}:{port}/-/reload"
- try:
-     subprocess.run(["curl", "-X", "POST", reload_url], capture_output=True, text=True, timeout=10)
-     print(f"Reloading Prometheus on node: {reload_url}")
+ import urllib.request
+ if not reload_url:
+     hostname = socket.gethostname()
+     ip_address = socket.gethostbyname(hostname)
+     reload_url = f"http://{ip_address}:{port}/-/reload"
+ try:
+     req = urllib.request.Request(reload_url, method="POST")
+     with urllib.request.urlopen(req, timeout=10):
+         print(f"Reloading Prometheus on node: {reload_url}")
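
A self-contained version of the standard-library approach might look like the sketch below. The function name `reload_prometheus` mirrors the one in the PR, but the boolean return value and the error handling are assumptions, not the merged implementation. Note that Prometheus serves `/-/reload` only when started with `--web.enable-lifecycle`.

```python
import urllib.error
import urllib.request


def reload_prometheus(reload_url, timeout=10.0):
    """POST to a Prometheus reload URL; return True on a 2xx response."""
    req = urllib.request.Request(reload_url, data=b"", method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError) as exc:
        # URLError covers HTTP error responses and connection failures;
        # OSError covers socket-level timeouts.
        print(f"Failed to reload Prometheus at {reload_url}: {exc}")
        return False
```

Returning a boolean instead of swallowing the exception lets the caller decide whether a failed reload should abort or merely be logged.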

@mengchengTang mengchengTang changed the title rl insight support online monitor [online monitor] rl insight support online monitor May 6, 2026
@mengchengTang mengchengTang changed the title [online monitor] rl insight support online monitor [online monitor] feat: rl insight support online monitor May 6, 2026
@mengchengTang mengchengTang force-pushed the monitor branch 7 times, most recently from 3d78032 to 981188a Compare May 9, 2026 07:13
@@ -0,0 +1,133 @@
"""Trainer vs observability-stack paths and loaders for RL-Insight monitoring."""

from __future__ import annotations
Collaborator

RL-Insight has so far used a simple argparse parser to control parameters. As the project grows, it does make sense to consider omegaconf for configuration management. My personal impression is that verl's approach is somewhat complex; RL-Insight could try using omegaconf in a simpler, more direct way.

Comment thread experimental/README.md
@@ -0,0 +1,122 @@
# RL-Insight Monitor
Collaborator

I suggest running some high-concurrency tests to determine the load ceiling of the current Ray-based backend data transport. (Our current needs may be nowhere near it.)

Comment thread experimental/README.md Outdated
|---|---:|---|
| `namespace` | `rl_insight_monitor` | Business namespace for metrics / traces |
| `backend.type` | `ray` | Only `ray` is currently supported |
| `prometheus.metrics_report_port` | `9092` | Port on which the monitor hub exposes `/metrics` |
Collaborator

Could the monitoring platform later be made extensible, the way the backend is?

Collaborator Author

This can be added as a follow-up feature. In principle it is supportable.

Comment thread experimental/api.py
Comment thread pyproject.toml
[tool.setuptools.packages.find]
where = ["."]
- include = ["rl_insight"]
+ include = ["rl_insight", "rl_insight.*", "experimental", "experimental.*"]
Collaborator

The monitor introduces quite a few environment dependencies. Please tidy them up promptly and update the README, toml, requirements, and so on.



@ray.remote()
class MonitorHubActor:
Collaborator

If the collector is meant to be replaceable, I suggest providing a base class. Alternatively, describe the required interface in the documentation or the code, so that generic collector capabilities are separated from the Ray-specific collector code.

Collaborator Author

A base class will be extracted in a follow-up.
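
One way such a base class could look is sketched below. Only the `apply_event` method name comes from the PR. `BaseCollector`, `InMemoryCollector`, `close`, and the dict-shaped event are hypothetical names and shapes chosen for illustration.

```python
from abc import ABC, abstractmethod


class BaseCollector(ABC):
    """Transport-agnostic contract: accept events, then release resources."""

    @abstractmethod
    def apply_event(self, event: dict) -> None:
        """Consume one monitoring event (metric or trace)."""

    @abstractmethod
    def close(self) -> None:
        """Flush and release any backend resources."""


class InMemoryCollector(BaseCollector):
    """Trivial backend for tests; a Ray-backed collector would subclass the same base."""

    def __init__(self):
        self.events = []
        self.closed = False

    def apply_event(self, event: dict) -> None:
        self.events.append(event)

    def close(self) -> None:
        self.closed = True
```

With a contract like this, the Ray actor becomes one implementation among several, and the training-side API can depend on the abstract type alone.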

return MonitorRayClient(handle)


class MonitorRayClient:
Collaborator

What is the role of the client? I did not fully understand it at first read. I suggest updating the architecture in the RFC.

Collaborator Author

It is a proxy for data collection.

@@ -0,0 +1,4 @@
global:
Collaborator

We presumably have custom layout files for Grafana. Those can be committed to the repository as well.

Collaborator Author

Yes. Once the first version of the verl instrumentation points and the visualization are confirmed, they will be committed and shipped with the release.

start_http_server(port, addr=addr)


class MetricRegistry:
Collaborator

For now the code should only collect and display data. If data analysis and processing capabilities are needed later, please implement them following the parser interface of the offline pipeline.

Collaborator Author

Visualization processing is currently configured mainly in the Grafana frontend. Later we can consider adding data-processing classes to the collector backend.

Comment thread experimental/utils/prometheus_utils.py Outdated
@@ -0,0 +1,288 @@
# Copyright 2026 Meituan Ltd. and/or its affiliates
Collaborator

I am a little confused about the role of Prometheus. Backend aggregation is done by the collector, and frontend display is done by Grafana. Prometheus seems to handle only the transport and collection of generic metrics. Could this be simplified?

Collaborator Author

The Prometheus service needs a configuration file that records which IP addresses to monitor. This utility module provides functions that let users add IP addresses to that configuration file.
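
The core of such a utility is a merge into the `scrape_configs` section of the Prometheus configuration. The sketch below operates on an already-loaded config dict and leaves file I/O and reloading aside. The function name `add_scrape_targets` and the job name used in the usage note are hypothetical; only the `scrape_configs` / `static_configs` / `targets` layout comes from the Prometheus configuration format.

```python
def add_scrape_targets(config, job_name, targets):
    """Merge targets into the named job of a Prometheus config dict, in place."""
    scrape_configs = config.setdefault("scrape_configs", [])
    job = next((j for j in scrape_configs if j.get("job_name") == job_name), None)
    if job is None:
        job = {"job_name": job_name, "static_configs": [{"targets": []}]}
        scrape_configs.append(job)
    static_configs = job.setdefault("static_configs", [])
    if not static_configs:
        static_configs.append({"targets": []})
    existing = static_configs[0].setdefault("targets", [])
    for target in targets:
        if target not in existing:  # Keep the list free of duplicates.
            existing.append(target)
    return config
```

For example, calling `add_scrape_targets(cfg, "rl_insight", ["127.0.0.1:9092"])` twice leaves a single entry for that address, so repeated calls from multiple trainers do not grow the file.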

@mengchengTang mengchengTang force-pushed the monitor branch 3 times, most recently from 0687bef to d8ea87d Compare May 11, 2026 12:56
@mengchengTang mengchengTang marked this pull request as ready for review May 14, 2026 08:10
@tardis-key tardis-key mentioned this pull request May 16, 2026
24 tasks
@mengchengTang mengchengTang force-pushed the monitor branch 2 times, most recently from 7ac27f4 to 772fb2b Compare May 18, 2026 11:58
@mengchengTang mengchengTang changed the title [online monitor] feat: rl insight support online monitor [pipeline] feat: rl insight support online monitor May 19, 2026
@mengchengTang mengchengTang force-pushed the monitor branch 3 times, most recently from 00777ad to 7acb81f Compare May 19, 2026 13:56
@mengchengTang
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the RL-Insight Monitor, an experimental observability stack for RL training metrics and traces utilizing Prometheus, Tempo, and Grafana. It includes a new CLI for managing the Docker Compose stack, a Ray-based monitor hub actor for event collection, and a suite of Python APIs for reporting metrics and spans. Feedback highlights several improvement opportunities: switching to BatchSpanProcessor to avoid blocking the single-threaded hub actor during trace exports, enhancing URL parsing robustness in the CLI, moving Ray tasks to the module level to prevent redundant registration, and replacing curl with the requests library for better portability when reloading Prometheus.

resource=Resource.create(resource_attributes),
)
exporter = OTLPSpanExporter(endpoint=resolved_endpoint)
provider.add_span_processor(SimpleSpanProcessor(exporter))
Contributor

high

Using SimpleSpanProcessor results in synchronous span exports. Since the MonitorHubActor (which uses this collector) is single-threaded and processes events sequentially, every trace event will block the hub until the OTLP export (HTTP POST) completes. This can significantly limit the event processing throughput of the monitoring system. It is highly recommended to use BatchSpanProcessor instead, which exports spans asynchronously in batches.

Suggested change
- provider.add_span_processor(SimpleSpanProcessor(exporter))
+ from opentelemetry.sdk.trace.export import BatchSpanProcessor
+ provider.add_span_processor(BatchSpanProcessor(exporter))
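
The idea behind batched export can be sketched with the standard library alone: the hot path only enqueues, and a background thread drains the queue and exports in batches. This is a simplified stand-in built on `queue` and `threading`, not the OpenTelemetry implementation. The class name `BatchingProcessor`, the `export_fn` callback, and the default sizes are assumptions.

```python
import queue
import threading


class BatchingProcessor:
    """Buffer finished items and export them in batches on a background thread."""

    _STOP = object()

    def __init__(self, export_fn, max_batch_size=64, flush_interval_s=0.5):
        self._export_fn = export_fn
        self._max_batch_size = max_batch_size
        self._flush_interval_s = flush_interval_s
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def on_end(self, item):
        """Called on the hot path: only enqueues, never performs I/O."""
        self._queue.put(item)

    def _run(self):
        batch = []
        while True:
            try:
                item = self._queue.get(timeout=self._flush_interval_s)
                timed_out = False
            except queue.Empty:
                item, timed_out = None, True  # Interval elapsed with nothing new.
            if item is self._STOP:
                break
            if not timed_out:
                batch.append(item)
            if batch and (timed_out or len(batch) >= self._max_batch_size):
                self._export_fn(batch)
                batch = []
        if batch:
            self._export_fn(batch)  # Final flush on shutdown.

    def shutdown(self):
        """Stop the worker after it has drained everything already queued."""
        self._queue.put(self._STOP)
        self._worker.join()
```

A slow export now delays only the background thread, so a single-threaded hub can keep accepting events. The cost is that items still buffered are lost if the process exits without calling `shutdown()`.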

Comment thread experimental/cli.py Outdated

def _otlp_http_publish_port(traces_endpoint: str) -> int:
    """Publish host port implied by ``otel.traces_endpoint``."""
    parsed = urlparse(traces_endpoint.strip())
Contributor

medium

urlparse may fail to correctly identify the port if the traces_endpoint string does not include a scheme (e.g., "127.0.0.1:4318"). In such cases, parsed.port will be None. It's safer to ensure the endpoint has a scheme before parsing.

    endpoint = traces_endpoint.strip()
    if "://" not in endpoint:
        endpoint = f"http://{endpoint}"
    parsed = urlparse(endpoint)
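
Put together as a complete function, the normalization could look like the sketch below. The public name `otlp_http_port` and the fallback to 4318 (the conventional OTLP/HTTP port) when the endpoint has no explicit port are assumptions; the PR's own fallback behavior is not shown in this thread.

```python
from urllib.parse import urlparse

DEFAULT_OTLP_HTTP_PORT = 4318  # Conventional OTLP/HTTP port; using it as a fallback is an assumption.


def otlp_http_port(traces_endpoint, default=DEFAULT_OTLP_HTTP_PORT):
    """Return the port in an OTLP traces endpoint, tolerating a missing scheme."""
    endpoint = traces_endpoint.strip()
    if "://" not in endpoint:
        # Without a scheme, urlparse does not treat "host:port" as a network location.
        endpoint = f"http://{endpoint}"
    parsed = urlparse(endpoint)
    return parsed.port if parsed.port is not None else default
```

Accessing `parsed.port` raises `ValueError` for a non-numeric or out-of-range port, so a caller that accepts user-supplied endpoints may want to catch that and report the offending string.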

Comment thread experimental/utils/prometheus_utils.py Outdated
Comment on lines +251 to +275
@ray.remote(num_cpus=0)
def write_config_file(config_data, path):
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        yaml.dump(config_data, f, default_flow_style=False, indent=2)
    return True

# Ray task: ask node's Prometheus HTTP API to reload configuration.
@ray.remote(num_cpus=0)
def reload_prometheus(port, r_url=None):
    url = str(r_url) if r_url else None
    if not url:
        hostname = socket.gethostname()
        ip_address = socket.gethostbyname(hostname)
        url = f"http://{ip_address}:{int(port)}/-/reload"
    try:
        subprocess.run(
            ["curl", "-X", "POST", url],
            capture_output=True,
            text=True,
            timeout=10,
        )
        print(f"Reloading Prometheus on node: {url}")
    except Exception:
        pass
Contributor

medium

Defining Ray tasks (write_config_file, reload_prometheus) inside the update_prometheus_config function causes them to be redefined and re-registered with Ray every time the function is called. This is inefficient and can lead to issues in Ray's task management. These tasks should be moved to the module level.
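
The underlying Python behavior is easy to demonstrate without Ray: a decorated `def` inside a function body is executed again on every call, while a module-level one is executed once at import. In the sketch below, `fake_remote` is a hypothetical counting decorator standing in for `ray.remote`; it does not model what Ray actually does when a task is registered.

```python
registrations = []


def fake_remote(func):
    """Stand-in for a registering decorator; records each decoration."""
    registrations.append(func.__name__)
    return func


def update_config_nested():
    @fake_remote  # Runs on every call because the def sits inside the function body.
    def write_config_file(path):
        return path

    return write_config_file


@fake_remote  # Runs once, at import time.
def write_config_file_module_level(path):
    return path


def update_config_module_level():
    return write_config_file_module_level
```

Calling `update_config_nested()` twice yields two distinct function objects and two decorator runs, whereas `update_config_module_level()` always returns the same object.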

Comment thread experimental/utils/prometheus_utils.py Outdated
Comment on lines +267 to +272
subprocess.run(
    ["curl", "-X", "POST", url],
    capture_output=True,
    text=True,
    timeout=10,
)
Contributor

medium

Using subprocess.run(["curl", ...]) to reload Prometheus is less portable and harder to debug than using a Python library. Since requests is already a project dependency, it should be used here instead. This also avoids potential FileNotFoundError if curl is not installed on the Ray nodes.

Suggested change
- subprocess.run(
-     ["curl", "-X", "POST", url],
-     capture_output=True,
-     text=True,
-     timeout=10,
- )
+ import requests
+ try:
+     requests.post(url, timeout=10).raise_for_status()
+     print(f"Reloading Prometheus on node: {url}")
+ except Exception as e:
+     print(f"Failed to reload Prometheus on node {url}: {e}")

@tardis-key
Collaborator

Merge this PR as an experimental feature for further optimization and iteration in practical use. Track relevant requirements in the roadmap. #49

@tardis-key tardis-key merged commit cf6ffb1 into verl-project:main May 20, 2026
5 checks passed