[pipeline] feat: rl online monitor statetimeline optime by mengchengTang · Pull Request #61 · verl-project/rl-insight

mengchengTang · 2026-05-26T12:56:02Z

What does this PR do?

Summary
Merge overlapping trace_state intervals in MonitorHubActor before exporting to Tempo, so Grafana state timelines show one row per process instead of many overlapping spans.

Changes
Add _state_pending to buffer in-flight state intervals keyed by (state_lane_id, state_name).
For state_interval traces, merge overlapping spans into a single envelope [min(start), max(end)].
Flush the pending interval to OTLP when a new span starts after a gap (non-overlapping start time).
Other trace types (e.g. trace_op) are exported unchanged.
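
The merge-and-flush behaviour listed above can be sketched in isolation. This is a minimal illustration, not the PR's code: the `StateIntervalMerger` class and its `exported` list are hypothetical stand-ins for `MonitorHubActor` and the OTLP export, and the lane and state names are made up.

```python
from typing import Any


class StateIntervalMerger:
    """Toy stand-in for the merging logic; not the real MonitorHubActor."""

    def __init__(self) -> None:
        # One buffered envelope per (lane_id, state_name) key.
        self.pending: dict[tuple[str, str], dict[str, Any]] = {}
        # Hypothetical sink standing in for the OTLP export.
        self.exported: list[tuple[str, int, int]] = []

    def add(self, lane_id: str, name: str, start_ns: int, end_ns: int) -> None:
        key = (lane_id, name)
        cur = self.pending.get(key)
        if cur is not None and start_ns > cur["end_ns"]:
            # Gap: the buffered envelope is complete, so emit it.
            self.exported.append((name, cur["start_ns"], cur["end_ns"]))
            cur = None
        if cur is None:
            self.pending[key] = {"start_ns": start_ns, "end_ns": end_ns}
            return
        # Overlap: widen the envelope to [min(start), max(end)].
        cur["start_ns"] = min(cur["start_ns"], start_ns)
        cur["end_ns"] = max(cur["end_ns"], end_ns)


merger = StateIntervalMerger()
merger.add("rank0", "rollout", 0, 10)
merger.add("rank0", "rollout", 5, 20)   # overlaps, envelope becomes [0, 20]
merger.add("rank0", "rollout", 30, 40)  # starts after a gap, [0, 20] is exported
print(merger.exported)                       # [('rollout', 0, 20)]
print(merger.pending[("rank0", "rollout")])  # {'start_ns': 30, 'end_ns': 40}
```

Note that the last interval, [30, 40], is still buffered when the example ends, because nothing has arrived after it to trigger an export.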

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include pipeline, parser, visualizer, data, deployment, perf, algo, env, doc, cfg, ci, misc
- If this PR involves multiple modules, separate them with , like [mstx, ci]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][mstx, torch_profile] feat: support timeline parsing

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots or evaluation results.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...

gemini-code-assist

Code Review

This pull request introduces state interval trace merging in the Ray monitor hub, so that overlapping state intervals are merged before they are exported as OTLP spans. The review highlights two critical issues. First, direct dictionary access to state_lane_id and other keys could raise a KeyError and crash the shared actor; these accesses should use defensive .get() fallbacks. Second, buffered traces in _state_pending are never flushed at the end of a run, so the final state interval is lost; adding a public flush method would resolve this.

gemini-code-assist · 2026-05-26T12:57:22Z

+    def _handle_state_interval_trace(
+        self, event: dict[str, Any], attrs: dict[str, Any]
+    ) -> None:
+        """Merge overlapping state intervals per (lane, name); flush on gap."""
+        key = (str(attrs["state_lane_id"]), event["name"])
+        start_ns = int(event["start_time_ns"])
+        end_ns = int(event["end_time_ns"])
+
+        pending = self._state_pending.get(key)
+        if pending is not None and start_ns > pending["end_ns"]:
+            self._export_trace_span(
+                pending["name"],
+                pending["start_ns"],
+                pending["end_ns"],
+                pending["attributes"],
+            )
+            pending = None
+
+        if pending is None:
+            self._state_pending[key] = {
+                "name": event["name"],
+                "start_ns": start_ns,
+                "end_ns": end_ns,
+                "attributes": attrs,
+            }
+            return
+
+        pending["start_ns"] = min(pending["start_ns"], start_ns)
+        pending["end_ns"] = max(pending["end_ns"], end_ns)
+


The _handle_state_interval_trace method accesses attrs["state_lane_id"] directly. If state_lane_id is missing from the attributes dictionary, this raises a KeyError and can crash the shared MonitorHubActor. To make the handler robust, read these keys with .get() and a default fallback value.

    def _handle_state_interval_trace(
        self, event: dict[str, Any], attrs: dict[str, Any]
    ) -> None:
        """Merge overlapping state intervals per (lane, name); flush on gap."""
        lane_id = attrs.get("state_lane_id", "unknown")
        key = (str(lane_id), event.get("name", "unknown"))
        start_ns = int(event.get("start_time_ns", 0))
        end_ns = int(event.get("end_time_ns", 0))

        pending = self._state_pending.get(key)
        if pending is not None and start_ns > pending["end_ns"]:
            self._export_trace_span(
                pending["name"],
                pending["start_ns"],
                pending["end_ns"],
                pending["attributes"],
            )
            pending = None

        if pending is None:
            self._state_pending[key] = {
                "name": event.get("name", "unknown"),
                "start_ns": start_ns,
                "end_ns": end_ns,
                "attributes": attrs,
            }
            return

        pending["start_ns"] = min(pending["start_ns"], start_ns)
        pending["end_ns"] = max(pending["end_ns"], end_ns)
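
One point to weigh with the fallbacks suggested above: a missing timestamp that defaults to 0 keeps the actor alive, but the `min()` in the merge step could then stretch a pending envelope back to time 0. A possible alternative, sketched here under assumptions, is to validate the event first and skip it when it is malformed. The helper name `parse_state_interval` is hypothetical and not part of the PR, and rejecting intervals whose end precedes their start is an extra assumption of this sketch.

```python
from typing import Any, Optional


def parse_state_interval(
    event: dict[str, Any], attrs: dict[str, Any]
) -> Optional[tuple[tuple[str, str], int, int]]:
    """Return ((lane, name), start_ns, end_ns), or None if the event is malformed."""
    lane_id = attrs.get("state_lane_id")
    name = event.get("name")
    start = event.get("start_time_ns")
    end = event.get("end_time_ns")
    if lane_id is None or name is None or start is None or end is None:
        return None  # a required field is missing: skip instead of guessing
    try:
        start_ns, end_ns = int(start), int(end)
    except (TypeError, ValueError):
        return None  # timestamps are not integer-like
    if end_ns < start_ns:
        return None  # assumption of this sketch: reject inverted intervals
    return (str(lane_id), str(name)), start_ns, end_ns


good = parse_state_interval(
    {"name": "rollout", "start_time_ns": 5, "end_time_ns": 9},
    {"state_lane_id": 3},
)
bad = parse_state_interval(
    {"name": "rollout", "end_time_ns": 9},  # start_time_ns missing
    {"state_lane_id": 3},
)
print(good)  # (('3', 'rollout'), 5, 9)
print(bad)   # None
```

A caller would return early (and perhaps log a warning) when the helper returns None, so one malformed event neither crashes the actor nor distorts a merged span.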

gemini-code-assist · 2026-05-26T12:57:22Z

+    def _export_trace_span(
+        self,
+        name: str,
+        start_time_ns: int,
+        end_time_ns: int,
+        attributes: dict[str, Any],
+    ) -> None:
+        """Export one root span via OTLP (no-op if collector disabled)."""
+        if self._trace_collector is None:
+            return
+        self._trace_collector.record_span(
+            name,
+            start_time_ns,
+            end_time_ns,
+            attributes=attributes,
        )


State interval traces are buffered in self._state_pending and flushed only when a later span with the same (lane, name) key starts after a gap. As a result, the final buffered interval for each key is never exported to OTLP, and the last state of the training run is lost. Adding a public flush method lets clients or the lifecycle manager export any remaining pending spans at the end of the run.

    def _export_trace_span(
        self,
        name: str,
        start_time_ns: int,
        end_time_ns: int,
        attributes: dict[str, Any],
    ) -> None:
        """Export one root span via OTLP (no-op if collector disabled)."""
        if self._trace_collector is None:
            return
        self._trace_collector.record_span(
            name,
            start_time_ns,
            end_time_ns,
            attributes=attributes,
        )

    def flush(self) -> None:
        """Flush all pending state interval traces to OTLP."""
        for key, pending in list(self._state_pending.items()):
            self._export_trace_span(
                pending["name"],
                pending["start_ns"],
                pending["end_ns"],
                pending["attributes"],
            )
        self._state_pending.clear()
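
The effect of an end-of-run flush can be checked with a small stand-alone sketch. The `PendingBuffer` class below is a hypothetical stand-in that mirrors only the `_state_pending` dictionary and the suggested `flush()`; its `exported` list replaces the real OTLP collector, and the lane and state names are made up.

```python
from typing import Any


class PendingBuffer:
    """Toy model of the pending-span buffer; not the real MonitorHubActor."""

    def __init__(self) -> None:
        self._state_pending: dict[tuple[str, str], dict[str, Any]] = {}
        self.exported: list[tuple[str, int, int]] = []  # stands in for OTLP

    def _export_trace_span(
        self, name: str, start_ns: int, end_ns: int, attributes: dict[str, Any]
    ) -> None:
        self.exported.append((name, start_ns, end_ns))

    def flush(self) -> None:
        """Export every buffered span, then empty the buffer."""
        for pending in self._state_pending.values():
            self._export_trace_span(
                pending["name"],
                pending["start_ns"],
                pending["end_ns"],
                pending["attributes"],
            )
        self._state_pending.clear()


buf = PendingBuffer()
# Two lanes each hold a final interval that no later span will push out.
buf._state_pending[("rank0", "train")] = {
    "name": "train", "start_ns": 100, "end_ns": 200, "attributes": {},
}
buf._state_pending[("rank1", "train")] = {
    "name": "train", "start_ns": 110, "end_ns": 210, "attributes": {},
}

buf.flush()
print(buf.exported)        # [('train', 100, 200), ('train', 110, 210)]
buf.flush()                # second call is a no-op because the buffer is empty
print(len(buf.exported))   # 2
```

Clearing the buffer after exporting makes repeated calls harmless, which matters if more than one shutdown path ends up calling `flush()`.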

rl online monitor statetimeline optime

4750818

mengchengTang wants to merge 1 commit into verl-project:main from mengchengTang:0526
