Merged
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -11,7 +11,7 @@ repos:
     rev: "v1.17.0"
     hooks:
       - id: mypy
-        additional_dependencies: [types-requests]
+        additional_dependencies: [types-PyYAML, types-requests]

   - repo: local
     hooks:
153 changes: 153 additions & 0 deletions experimental/README.md
@@ -0,0 +1,153 @@
# RL-Insight Monitor

[Review comment, Collaborator] I suggest running some high-concurrency tests to establish the load ceiling of the current Ray-based data transport to the backend. (Our current needs may be nowhere near that limit.)


RL-Insight Monitor provides an observability stack for RL training metrics and traces based on Prometheus, Tempo, and Grafana.

It has two parts:

- `rl-insight server ...`: manage the observability Docker stack.
- `rl_insight`: training-side Python APIs for metrics and traces.

## Quickstart

### 1. Install

From the repository root:

```bash
pip install -r requirements.txt
pip install -e .
```

### 2. Start the observability stack

Default foreground mode:

```bash
rl-insight server start
```

This mode starts Docker Compose silently, keeps the CLI attached, and stops the whole stack when you press `Ctrl+C`.

Grafana will be provisioned automatically with Prometheus and Tempo datasources plus an empty starter dashboard. The datasources follow the configured Prometheus and Tempo published ports.

Background mode:

```bash
rl-insight server start --detach
```

Foreground mode with compose/container logs attached:

```bash
rl-insight server start --attach-logs
```

Use a custom config file:

```bash
rl-insight server start --config path/to/config.yaml
```

Stop the stack explicitly from another terminal:

```bash
rl-insight server stop
```

After startup, the CLI prints:

- Prometheus config file path
- Trainer OTLP traces URL
- Prometheus, Tempo, and Grafana access URLs

### 3. Initialize the training side

```python
import os
import ray
import rl_insight as insight

os.environ["OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"] = "http://<server-ip>:4318/v1/traces"

ray.init(address="auto", namespace="rl-insight-monitor")
insight.init()
```

Notes:

- `ray.init(namespace="rl-insight-monitor")` is used to find the monitor hub actor.
- If the `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` environment variable is set, it takes precedence over the `otel.traces_endpoint` key of the config passed to `insight.init(config)`.
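
The precedence rule in the second note can be sketched roughly as follows. `resolve_traces_endpoint` is a hypothetical helper written for illustration and is not part of `rl_insight`; the fallback value is the `otel.traces_endpoint` default listed in the configuration tables, and the nested `config["otel"]["traces_endpoint"]` layout is an assumption.

```python
import os

# Default taken from the `otel.traces_endpoint` row of the config tables.
DEFAULT_TRACES_ENDPOINT = "http://127.0.0.1:4318/v1/traces"
ENV_VAR = "OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"


def resolve_traces_endpoint(config=None, environ=None):
    """Hypothetical helper: environment variable first, then config, then default."""
    environ = os.environ if environ is None else environ
    from_env = environ.get(ENV_VAR)
    if from_env:
        return from_env
    # Assumed layout: the dotted key `otel.traces_endpoint` maps to a nested dict.
    otel = (config or {}).get("otel", {})
    return otel.get("traces_endpoint", DEFAULT_TRACES_ENDPOINT)
```

Passing `environ` explicitly keeps the helper easy to exercise without touching the real process environment.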

### 4. Emit metrics and traces

```python
import rl_insight as insight

insight.metric_count("train_step_total", amount=1, worker="trainer_0")
insight.metric_value("reward_mean", value=1.23, worker="trainer_0")
insight.metric_distribution("step_latency_ms", value=42.5, worker="trainer_0")

with insight.trace_state("rollout", state_lane_id="trainer_0", step=10):
    run_rollout()

@insight.trace_op("update_model", stage="optimizer")
def update_model(batch):
    ...
```

## APIs

| API | Purpose |
|---|---|
| `init(config=None)` | Initialize training-side monitoring |
| `close()` | Reset monitor state in the current process |
| `metric_count()` | Report a counter |
| `metric_value()` | Report a gauge |
| `metric_distribution()` | Report a histogram |
| `trace_state()` | Report a state interval |
| `trace_op()` | Decorator for operation latency traces |
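
To illustrate the calling shape of the two trace APIs without a running stack, here is a minimal standard-library sketch. The `trace_state` and `trace_op` defined below are local stand-ins that only record durations in a list; they are an assumption about the general pattern (a context manager for intervals, a decorator for call latency) and not the `rl_insight` implementation.

```python
import functools
import time
from contextlib import contextmanager

# (name, attributes, duration_seconds) tuples; for illustration only.
RECORDED = []


@contextmanager
def trace_state(name, **attrs):
    """Stand-in: record how long the body of the with-block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        RECORDED.append((name, attrs, time.perf_counter() - start))


def trace_op(name, **attrs):
    """Stand-in: decorator that records the latency of each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with trace_state(name, **attrs):
                return func(*args, **kwargs)
        return wrapper
    return decorator


@trace_op("update_model", stage="optimizer")
def update_model(batch):
    return sum(batch)


with trace_state("rollout", state_lane_id="trainer_0", step=10):
    update_model([1, 2, 3])
```

In this sketch the inner operation is recorded before the enclosing state, because each record is appended when its block exits.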

## CLI Reference

### `rl-insight server start`

| Argument | Default | Description |
|---|---:|---|
| `--detach` | `false` | Start in background and return immediately |
| `--attach-logs` | `false` | Run in foreground and stream compose/container logs |
| `--config` | `experimental/config/services/config.yaml` | Server config file path |
| `--log-level` | `INFO` | Python log level |

### `rl-insight server stop`

| Argument | Default | Description |
|---|---:|---|
| `--config` | `experimental/config/services/config.yaml` | Server config file path |
| `--log-level` | `INFO` | Python log level |

## Server YAML

| Key | Default | Description |
|---|---:|---|
| `server.backend` | `docker_compose` | Stack startup backend |
| `server.compose_file` | `docker-compose.yaml` | Compose file path |
| `server.project_name` | `rl-insight-monitor` | Compose project name |
| `prometheus.prometheus_port` | `9090` | Prometheus HTTP port |
| `prometheus.config_file` | `prometheus.yml` | Prometheus config file |
| `tempo.query_port` | `3200` | Tempo query port |
| `otel.traces_endpoint` | `http://127.0.0.1:4318/v1/traces` | Trainer trace export endpoint |
| `grafana.port` | `3000` | Grafana HTTP port |
| `grafana.provisioning_dir` | `provisioning` | Grafana provisioning directory mounted into the container |
| `grafana.dashboards_dir` | `dashboards` | Grafana dashboard JSON directory mounted into the container |
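
The dotted keys in the table presumably correspond to nested mappings in the YAML file; that reading is an assumption, as is treating the port values as integers. Under those assumptions, the sketch below expands the table's defaults into the nested structure a config file would carry. `expand_dotted` and `SERVER_DEFAULTS` are hypothetical names used only for this example.

```python
def expand_dotted(flat):
    """Turn {"a.b": 1} into {"a": {"b": 1}}."""
    nested = {}
    for dotted_key, value in flat.items():
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Defaults copied from the table above; the nesting is an assumed layout.
SERVER_DEFAULTS = expand_dotted({
    "server.backend": "docker_compose",
    "server.compose_file": "docker-compose.yaml",
    "server.project_name": "rl-insight-monitor",
    "prometheus.prometheus_port": 9090,
    "prometheus.config_file": "prometheus.yml",
    "tempo.query_port": 3200,
    "otel.traces_endpoint": "http://127.0.0.1:4318/v1/traces",
    "grafana.port": 3000,
    "grafana.provisioning_dir": "provisioning",
    "grafana.dashboards_dir": "dashboards",
})
```

For example, `SERVER_DEFAULTS["grafana"]` groups the three `grafana.*` rows into one mapping, which is how they would appear under a `grafana:` key in YAML.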

## `insight.init(config)`

| Key | Default | Description |
|---|---:|---|
| `namespace` | `rl_insight_monitor` | Metrics and trace namespace |
| `backend.type` | `ray` | Currently only `ray` is supported |
| `prometheus.metrics_report_port` | `9092` | Monitor hub `/metrics` port |
| `prometheus.prometheus_port` | `9090` | Prometheus HTTP port used for reload |
| `prometheus.config_file` | bundled absolute path | Prometheus config file to rewrite |
| `prometheus.reload.mode` | `ray` | `ray` or `none` |
| `otel.traces_endpoint` | `http://127.0.0.1:4318/v1/traces` | Trainer trace export endpoint |
51 changes: 51 additions & 0 deletions experimental/__init__.py
@@ -0,0 +1,51 @@
# Copyright (c) 2026 verl-project authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Experimental online monitoring: Ray hub, Prometheus ``/metrics``, and OTLP trace export."""

from .api import (
    close,
    init,
    metric_count,
    metric_distribution,
    metric_value,
    trace_op,
    trace_state,
)
from .config import (
    MONITOR_HUB_ACTOR_NAME,
    MONITOR_RAY_NAMESPACE,
    load_monitor_config,
    load_server_config_file,
    resolve_monitor_stack_paths,
)
from .utils import PROMETHEUS_SCRAPE_JOB_NAME, update_prometheus_config


__all__ = [
    "close",
    "init",
    "load_monitor_config",
    "load_server_config_file",
    "MONITOR_HUB_ACTOR_NAME",
    "MONITOR_RAY_NAMESPACE",
    "metric_count",
    "metric_distribution",
    "metric_value",
    "PROMETHEUS_SCRAPE_JOB_NAME",
    "resolve_monitor_stack_paths",
    "trace_op",
    "trace_state",
    "update_prometheus_config",
]
21 changes: 21 additions & 0 deletions experimental/__main__.py
@@ -0,0 +1,21 @@
# Copyright (c) 2026 verl-project authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import annotations

from .cli import main


if __name__ == "__main__":
    raise SystemExit(main())