[pipeline] feat: rl insight support online monitor by mengchengTang · Pull Request #53 · verl-project/rl-insight

mengchengTang · 2026-05-06T01:39:04Z

What does this PR do?

This PR introduces an experimental monitoring flow for RL-Insight in three parts:

training-side metric and trace APIs
a backend collector for aggregating metrics and exporting traces
rl-insight server start / rl-insight server stop for managing the observability stack

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include pipeline, parser, visualizer, data, deployment, perf, algo, env, doc, cfg, ci, misc
- If this PR involves multiple modules, separate them with , like [mstx, ci]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][mstx, torch_profile] feat: support timeline parsing

Test

rl-insight server start
verify Prometheus / Tempo / Grafana are up
verify Grafana loads Prometheus and Tempo datasources
verify Ctrl+C stops the full stack in foreground mode
verify rl-insight server stop works from another terminal

API and Usage Example

The following monitoring APIs are available from rl_insight:

rl_insight.init(config=None)
Initialize the training-side monitor client.
rl_insight.close()
Reset local monitoring state in the current process.
rl_insight.metric_count(name, amount=1.0, documentation="", **labels)
Report a counter metric.
rl_insight.metric_value(name, value, documentation="", **labels)
Report a gauge metric.
rl_insight.metric_distribution(name, value, documentation="", **labels)
Report a histogram metric.
rl_insight.trace_state(state_name, state_lane_id=None, **labels)
Record a state interval as a root span.
rl_insight.trace_op(name=None, extra_labels=None, **static_labels)
Decorator for recording operation duration spans.
rl_insight.update_prometheus_config(conf, server_addresses, job_name=..., reload_mode=...)
Update Prometheus scrape targets and trigger reload when supported.

import os
import ray
import rl_insight as insight

os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = "http://<server-ip>:4318/v1/traces"

ray.init(address="auto", namespace="rl-insight-monitor")
insight.init()

insight.metric_count("train_step_total", amount=1, worker="trainer_0")
insight.metric_value("reward_mean", value=1.23, worker="trainer_0")
insight.metric_distribution("step_latency_ms", value=42.5, worker="trainer_0")

with insight.trace_state("rollout", state_lane_id="trainer_0", step=10):
    run_rollout()

@insight.trace_op("update_model", stage="optimizer")
def update_model(batch):
    ...

insight.update_prometheus_config(
    conf={
        "prometheus": {
            "config_file": "/path/to/prometheus.yml",
            "prometheus_port": 9090,
            "reload": {"mode": "ray"},
        }
    },
    server_addresses=["127.0.0.1:9092"],
)

Design & Code Changes

This architecture shows the end-to-end online monitoring flow in rl-insight.
The RL framework first integrates with the RL-Insight API, which provides lightweight collection APIs for function, variable, metric, and state monitoring. The API aggregates training-side monitoring events and forwards them to the RL-Insight data collector. The collector handles backend data collection and processing, and can also interact with distributed runtime components such as Ray through RPC.
After aggregation and processing, the collector reports metrics and traces to the RL-Insight server. The server manages the observability stack: Prometheus for metrics, Tempo for traces, and Grafana for visualization. In this way, RL-Insight separates instrumentation, collection, storage, and visualization into clear layers.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...

gemini-code-assist

Code Review

This pull request introduces an experimental online monitoring system for RL-Insight, leveraging Ray for event collection and Prometheus/OpenTelemetry for observability. It provides a high-level Python API for recording metrics and traces, a CLI for managing backend services via Docker Compose, and a central Ray actor to aggregate data. The review feedback highlights several performance and portability improvements, specifically recommending asynchronous event submission to avoid blocking the training loop, using batch processing for OpenTelemetry spans, and replacing external 'curl' dependencies with native Python libraries for better cross-environment compatibility.

gemini-code-assist · 2026-05-06T01:41:03Z

+        ref = self._actor.apply_event.remote(event)
+        ray.get(ref)


Calling ray.get() on every event submission makes monitoring synchronous and blocking. This can significantly degrade the performance of the training loop, especially when recording metrics or traces at high frequency. Since monitoring events are typically fire-and-forget, it is recommended to remove the ray.get() call to allow asynchronous submission via Ray.

Suggested change

-        ref = self._actor.apply_event.remote(event)
-        ray.get(ref)
+        self._actor.apply_event.remote(event)
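
The trade-off can be illustrated without Ray. The sketch below is not the PR's code: it uses a single-worker `concurrent.futures.ThreadPoolExecutor` as a stand-in for a single-threaded hub actor, and `submit_blocking` and `submit_fire_and_forget` are hypothetical names. Waiting on every result (analogous to `ray.get`) ties the caller to the handler's latency, while dropping the wait lets the caller move on.

```python
import time
from concurrent.futures import ThreadPoolExecutor

processed = []  # events the stand-in "actor" has handled


def apply_event(event):
    """Stand-in for the hub actor's handler; sleeps to mimic work."""
    time.sleep(0.01)
    processed.append(event)


def submit_blocking(pool, events):
    """Analogous to calling ray.get() on every submission."""
    start = time.perf_counter()
    for event in events:
        pool.submit(apply_event, event).result()  # caller waits each time
    return time.perf_counter() - start


def submit_fire_and_forget(pool, events):
    """Analogous to dropping ray.get(): enqueue and return immediately."""
    start = time.perf_counter()
    for event in events:
        pool.submit(apply_event, event)  # no wait on the result
    return time.perf_counter() - start


with ThreadPoolExecutor(max_workers=1) as pool:
    blocking_s = submit_blocking(pool, range(20))
    async_s = submit_fire_and_forget(pool, range(20, 40))
# Leaving the `with` block waits for queued events, so none are lost.
```

One caveat of the fire-and-forget form: exceptions raised inside the handler are not raised at the call site, so they probably need to be logged on the receiving side.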

gemini-code-assist · 2026-05-06T01:41:03Z

+            resource=self._otel.Resource.create(resource_attributes),
+        )
+        exporter = self._otel.OTLPSpanExporter(endpoint=resolved_endpoint)
+        provider.add_span_processor(self._otel.SimpleSpanProcessor(exporter))


SimpleSpanProcessor exports spans synchronously as they are ended, which can block the collector's execution thread if the OTLP endpoint is slow or under load. Using BatchSpanProcessor is recommended for better performance as it buffers spans and exports them in the background. Note that you will also need to update the imports and the SimpleNamespace in _require_opentelemetry to use BatchSpanProcessor.

Suggested change

-        provider.add_span_processor(self._otel.SimpleSpanProcessor(exporter))
+        provider.add_span_processor(self._otel.BatchSpanProcessor(exporter))

gemini-code-assist · 2026-05-06T01:41:03Z

+            import subprocess
+
+            if not reload_url:
+                hostname = socket.gethostname()
+                ip_address = socket.gethostbyname(hostname)
+                reload_url = f"http://{ip_address}:{port}/-/reload"
+
+            try:
+                subprocess.run(["curl", "-X", "POST", reload_url], capture_output=True, text=True, timeout=10)
+                print(f"Reloading Prometheus on node: {reload_url}")


Using subprocess.run(["curl", ...]) introduces an external dependency on the curl binary being present on all nodes in the cluster. It is more portable and robust to use Python's built-in urllib.request to perform the POST request for reloading Prometheus.

Suggested change

-            import subprocess
-
-            if not reload_url:
-                hostname = socket.gethostname()
-                ip_address = socket.gethostbyname(hostname)
-                reload_url = f"http://{ip_address}:{port}/-/reload"
-
-            try:
-                subprocess.run(["curl", "-X", "POST", reload_url], capture_output=True, text=True, timeout=10)
-                print(f"Reloading Prometheus on node: {reload_url}")
+            import urllib.request
+
+            if not reload_url:
+                hostname = socket.gethostname()
+                ip_address = socket.gethostbyname(hostname)
+                reload_url = f"http://{ip_address}:{port}/-/reload"
+
+            try:
+                req = urllib.request.Request(reload_url, method="POST")
+                with urllib.request.urlopen(req, timeout=10):
+                    print(f"Reloading Prometheus on node: {reload_url}")
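
As a self-contained illustration of the standard-library approach, here is a minimal sketch. The helper names `build_reload_url` and `reload_prometheus` are hypothetical and not taken from the PR, and the sketch assumes Prometheus was started with `--web.enable-lifecycle`, which the `/-/reload` endpoint requires.

```python
import urllib.error
import urllib.request


def build_reload_url(host: str, port: int) -> str:
    """Build the URL of Prometheus' lifecycle reload endpoint."""
    return f"http://{host}:{int(port)}/-/reload"


def reload_prometheus(reload_url: str, timeout: float = 10.0) -> bool:
    """POST to the reload endpoint; return True on a 2xx response."""
    req = urllib.request.Request(reload_url, data=b"", method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError) as exc:
        # Covers HTTP error statuses, refused connections, and timeouts.
        print(f"Failed to reload Prometheus at {reload_url}: {exc}")
        return False
```

Returning a boolean rather than swallowing the exception lets the caller decide whether a failed reload should be retried or merely logged.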

tardis-key · 2026-05-11T02:22:10Z

@@ -0,0 +1,133 @@
+"""Trainer vs observability-stack paths and loaders for RL-Insight monitoring."""
+
+from __future__ import annotations


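RL-Insight has so far used a simple argument parser (argparse) to control parameters. As the project grows, we do need to consider OmegaConf for managing configuration. My personal impression is that verl's approach is a bit complex; RL-Insight could try using OmegaConf in a simpler, more direct way.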

tardis-key · 2026-05-11T02:25:39Z

@@ -0,0 +1,122 @@
+# RL-Insight Monitor


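I suggest running some high-concurrency tests to establish the load ceiling of the current Ray-based backend data transport. (Our current needs may be nowhere near it.)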

tardis-key · 2026-05-11T02:27:09Z

+|---|---:|---|
+| `namespace` | `rl_insight_monitor` | Business namespace for metrics / traces |
+| `backend.type` | `ray` | Currently only `ray` is supported |
+| `prometheus.metrics_report_port` | `9092` | Port on which the monitor hub exposes `/metrics` |


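Could the monitoring platform later be made extensible, the way the backend is?

This can be added later as a feature; in principle it can be supported.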

tardis-key · 2026-05-11T02:44:20Z

 [tool.setuptools.packages.find]
 where = ["."]
-include = ["rl_insight"]
+include = ["rl_insight", "rl_insight.*", "experimental", "experimental.*"]


Monitor introduces quite a few environment dependencies. Please tidy them up promptly and update the README, toml, requirements, etc.

tardis-key · 2026-05-11T02:55:04Z

+
+
+@ray.remote()
+class MonitorHubActor:


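If the collector is meant to be replaceable, I suggest providing a base class, or documenting the required interface in the docs or code, so that generic collector capabilities are separated from the Ray collector code.

A base class will be extracted later.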
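
To make the base-class suggestion concrete, here is one possible shape for it. This is only a sketch under stated assumptions: `BaseCollector` and `InMemoryCollector` are hypothetical names that do not appear in the PR, and only the method name `apply_event` is borrowed from the reviewed code.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseCollector(ABC):
    """Hypothetical backend-agnostic collector interface."""

    @abstractmethod
    def apply_event(self, event: Dict[str, Any]) -> None:
        """Consume one monitoring event (metric or trace)."""

    def close(self) -> None:
        """Release backend resources; a no-op by default."""


class InMemoryCollector(BaseCollector):
    """Toy backend that only shows the contract; a Ray-based
    collector would implement the same methods."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def apply_event(self, event: Dict[str, Any]) -> None:
        self.events.append(dict(event))  # copy so callers can reuse the dict


collector: BaseCollector = InMemoryCollector()
collector.apply_event({"kind": "counter", "name": "train_step_total", "amount": 1.0})
```

With an interface like this, the Ray actor would become one implementation among several, and tests could run against the in-memory variant without a cluster.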

tardis-key · 2026-05-11T02:55:58Z

+    return MonitorRayClient(handle)
+
+
+class MonitorRayClient:


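What is the role of the Client? I didn't quite grasp it at first. I suggest refreshing the architecture in the RFC.

It is a proxy for data collection.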

tardis-key · 2026-05-11T02:57:30Z

@@ -0,0 +1,4 @@
+global:


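We should have custom layout files for Grafana, right? Those can be committed to the repository as well.

Yes. Once the first version of the verl instrumentation points and the visualization are confirmed, they will be committed and shipped with the release.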

tardis-key · 2026-05-11T03:00:28Z

+    start_http_server(port, addr=addr)
+
+
+class MetricRegistry:


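For now, the code should only collect and display data. If data analysis and processing capabilities are needed later, please follow the parser interface of the offline pipeline when implementing them.

Currently, visualization-side data processing is mainly configured in the Grafana frontend; later we can consider adding data-processing classes to the collector backend.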

tardis-key · 2026-05-11T03:02:34Z

@@ -0,0 +1,288 @@
+# Copyright 2026 Meituan Ltd. and/or its affiliates


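I'm a little confused about the role of Prometheus. Backend data aggregation is done by the collector, and frontend display by Grafana. Prometheus seems to handle only the transport and collection of generic metrics; could this be simplified?

The Prometheus service needs a configuration file that records which IP addresses to monitor; this utility class provides functions that let users add IP addresses to that file.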
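
To show what "adding IP addresses to the configuration file" amounts to, here is a minimal sketch that works on the already-parsed config as a plain dict. It is not the PR's implementation: `add_scrape_targets`, the job name `rl_insight`, and the `scrape_interval` value are assumptions for illustration; only the `scrape_configs` / `static_configs` / `targets` layout comes from the Prometheus configuration format.

```python
from typing import Any, Dict, List


def add_scrape_targets(
    config: Dict[str, Any], job_name: str, targets: List[str]
) -> Dict[str, Any]:
    """Merge `targets` into the static_configs of `job_name`,
    creating the job entry if it does not exist yet."""
    scrape_configs = config.setdefault("scrape_configs", [])
    job = next((j for j in scrape_configs if j.get("job_name") == job_name), None)
    if job is None:
        job = {"job_name": job_name, "static_configs": [{"targets": []}]}
        scrape_configs.append(job)
    static_configs = job.setdefault("static_configs", [])
    if not static_configs:
        static_configs.append({"targets": []})
    existing = static_configs[0].setdefault("targets", [])
    for target in targets:
        if target not in existing:  # keep the list free of duplicates
            existing.append(target)
    return config


config = {"global": {"scrape_interval": "15s"}}  # assumed starting config
add_scrape_targets(config, "rl_insight", ["127.0.0.1:9092"])
add_scrape_targets(config, "rl_insight", ["127.0.0.1:9092", "10.0.0.2:9092"])
```

After the dict is written back to `prometheus.yml`, Prometheus still has to reload its configuration before the new targets are scraped.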

mengchengTang · 2026-05-19T13:59:17Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces the RL-Insight Monitor, an experimental observability stack for RL training metrics and traces utilizing Prometheus, Tempo, and Grafana. It includes a new CLI for managing the Docker Compose stack, a Ray-based monitor hub actor for event collection, and a suite of Python APIs for reporting metrics and spans. Feedback highlights several improvement opportunities: switching to BatchSpanProcessor to avoid blocking the single-threaded hub actor during trace exports, enhancing URL parsing robustness in the CLI, moving Ray tasks to the module level to prevent redundant registration, and replacing curl with the requests library for better portability when reloading Prometheus.

gemini-code-assist · 2026-05-19T14:02:10Z

+            resource=Resource.create(resource_attributes),
+        )
+        exporter = OTLPSpanExporter(endpoint=resolved_endpoint)
+        provider.add_span_processor(SimpleSpanProcessor(exporter))


Using SimpleSpanProcessor results in synchronous span exports. Since the MonitorHubActor (which uses this collector) is single-threaded and processes events sequentially, every trace event will block the hub until the OTLP export (HTTP POST) completes. This can significantly limit the event processing throughput of the monitoring system. It is highly recommended to use BatchSpanProcessor instead, which exports spans asynchronously in batches.

Suggested change

-        provider.add_span_processor(SimpleSpanProcessor(exporter))
+        from opentelemetry.sdk.trace.export import BatchSpanProcessor
+        provider.add_span_processor(BatchSpanProcessor(exporter))

gemini-code-assist · 2026-05-19T14:02:11Z

+
+def _otlp_http_publish_port(traces_endpoint: str) -> int:
+    """Publish host port implied by ``otel.traces_endpoint``."""
+    parsed = urlparse(traces_endpoint.strip())


urlparse may fail to correctly identify the port if the traces_endpoint string does not include a scheme (e.g., "127.0.0.1:4318"). In such cases, parsed.port will be None. It's safer to ensure the endpoint has a scheme before parsing.

    endpoint = traces_endpoint.strip()
    if "://" not in endpoint:
        endpoint = f"http://{endpoint}"
    parsed = urlparse(endpoint)
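
The behavior is easy to reproduce in isolation. The sketch below wraps the normalization in a hypothetical helper `otlp_http_port`; the fallback to 4318 (the conventional OTLP/HTTP port) when no port is present is an assumption added for illustration, not something the PR specifies.

```python
from urllib.parse import urlparse


def otlp_http_port(traces_endpoint: str, default: int = 4318) -> int:
    """Return the port of an OTLP/HTTP endpoint, tolerating a missing scheme."""
    endpoint = traces_endpoint.strip()
    if "://" not in endpoint:
        # Without "//", urlparse leaves netloc empty, so .port is None.
        endpoint = f"http://{endpoint}"
    port = urlparse(endpoint).port
    return port if port is not None else default


# Scheme-less input: the raw parse yields no port, the helper recovers it.
raw_port = urlparse("127.0.0.1:4318").port
fixed_port = otlp_http_port("127.0.0.1:4318")
```

Here `raw_port` is `None` because the port is derived from the network location, which `urlparse` only fills in when the string contains `//`.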

gemini-code-assist · 2026-05-19T14:02:11Z

+        @ray.remote(num_cpus=0)
+        def write_config_file(config_data, path):
+            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
+            with open(path, "w", encoding="utf-8") as f:
+                yaml.dump(config_data, f, default_flow_style=False, indent=2)
+            return True
+
+        # Ray task: ask node's Prometheus HTTP API to reload configuration.
+        @ray.remote(num_cpus=0)
+        def reload_prometheus(port, r_url=None):
+            url = str(r_url) if r_url else None
+            if not url:
+                hostname = socket.gethostname()
+                ip_address = socket.gethostbyname(hostname)
+                url = f"http://{ip_address}:{int(port)}/-/reload"
+            try:
+                subprocess.run(
+                    ["curl", "-X", "POST", url],
+                    capture_output=True,
+                    text=True,
+                    timeout=10,
+                )
+                print(f"Reloading Prometheus on node: {url}")
+            except Exception:
+                pass


Defining Ray tasks (write_config_file, reload_prometheus) inside the update_prometheus_config function causes them to be redefined and re-registered with Ray every time the function is called. This is inefficient and can lead to issues in Ray's task management. These tasks should be moved to the module level.

gemini-code-assist · 2026-05-19T14:02:11Z

+                subprocess.run(
+                    ["curl", "-X", "POST", url],
+                    capture_output=True,
+                    text=True,
+                    timeout=10,
+                )


Using subprocess.run(["curl", ...]) to reload Prometheus is less portable and harder to debug than using a Python library. Since requests is already a project dependency, it should be used here instead. This also avoids potential FileNotFoundError if curl is not installed on the Ray nodes.

Suggested change

-                subprocess.run(
-                    ["curl", "-X", "POST", url],
-                    capture_output=True,
-                    text=True,
-                    timeout=10,
-                )
+                import requests
+                try:
+                    requests.post(url, timeout=10).raise_for_status()
+                    print(f"Reloading Prometheus on node: {url}")
+                except Exception as e:
+                    print(f"Failed to reload Prometheus on node {url}: {e}")

tardis-key · 2026-05-20T01:43:45Z

Merge this PR as an experimental feature for further optimization and iteration in practical use. Track relevant requirements in the roadmap. #49

mengchengTang marked this pull request as draft May 6, 2026 01:39

gemini-code-assist Bot reviewed May 6, 2026

mengchengTang force-pushed the monitor branch from 89e6ce1 to 4f05d63 Compare May 6, 2026 03:53

mengchengTang changed the title ~~rl insight support online monitor~~ [online monitor] rl insight support online monitor May 6, 2026

mengchengTang changed the title ~~[online monitor] rl insight support online monitor~~ [online monitor] feat: rl insight support online monitor May 6, 2026

mengchengTang force-pushed the monitor branch 7 times, most recently from 3d78032 to 981188a Compare May 9, 2026 07:13

tardis-key reviewed May 11, 2026

mengchengTang force-pushed the monitor branch 3 times, most recently from 0687bef to d8ea87d Compare May 11, 2026 12:56

tardis-key mentioned this pull request May 12, 2026

[RFC] RL-Insight Online Monitoring System #46

Closed

mengchengTang marked this pull request as ready for review May 14, 2026 08:10

tardis-key mentioned this pull request May 16, 2026

[roadmap] RL-Insight Q2 Roadmap #49

Open

24 tasks

mengchengTang force-pushed the monitor branch 2 times, most recently from 7ac27f4 to 772fb2b Compare May 18, 2026 11:58

mengchengTang changed the title ~~[online monitor] feat: rl insight support online monitor~~ [pipeline] feat: rl insight support online monitor May 19, 2026

mengchengTang force-pushed the monitor branch 3 times, most recently from 00777ad to 7acb81f Compare May 19, 2026 13:56

gemini-code-assist Bot reviewed May 19, 2026

rl insight support online monitor

119c9bc

mengchengTang force-pushed the monitor branch from 7acb81f to 119c9bc Compare May 19, 2026 16:11

tardis-key merged commit cf6ffb1 into verl-project:main May 20, 2026
5 checks passed
