[Feature] Agentic Rollout — training real agent apps with strict app–RL-framework separation and lightweight API integration

# Agentic Rollout — Training Real Agent Apps with Strict App–RL-Framework Separation and Lightweight API Integration

---

## 1. Motivation

Modern RL-with-LLM workflows increasingly revolve around **real agent applications** — LangChain / LangGraph agents, SWE-agent, OpenHands, browser-use, Claude-Code-style assistants, and in-house tool-using agents. These apps already ship as independently runnable services with their own main loops, tool dispatch, state management, and (often) UIs.

The question this issue asks is: **how do we train these agents with RL without forcing them to rewrite themselves against our framework?**

Today Relax has a working Agentic rollout (`examples/deepeyes/rollout.py` + `BaseInteractionEnv`), but integrating a **new** agent app into training is heavy. We want:

- **Real agent apps** — not just synthetic single-turn chat, but multi-turn, tool-calling, branching, possibly multi-agent workflows.
- **Strict app ↔ RL-framework separation** — the agent code does **not** import the RL framework; the framework does **not** understand the agent's business logic (how many tools, how to parse them, how to keep state).
- **Lightweight API integration** — ideally a couple of lines of code, or just a change of `base_url`, is all it takes to plug an existing agent into training.

---

## 2. Current State & Pain Points

Evidence collected from the code base (`examples/deepeyes/`, `relax/engine/rollout/`):

### 2.1 Rollout layer — users copy 400+ lines of "framework code"

[`examples/deepeyes/rollout.py`](../../examples/deepeyes/rollout.py) is ~564 LOC. **~85 %** of it is generic multi-turn scaffolding:
- Token concatenation across turns
- `loss_mask` construction
- Multimodal accumulation
- Budget / max-turn management
- Partial-rollout (abort / resume) bookkeeping

Every new agent example today has to **copy-paste** this file, patch `DEFAULT_ENV_MODULE`, and maintain its own drift-prone fork.

### 2.2 Env layer — `BaseInteractionEnv.step(response_text)` is a synchronous callback

```python
# Current model (examples/deepeyes/base_env.py)
env.reset()
for turn in range(max_turns):
    response = llm_generate(tokens)        # framework drives the loop
    obs, done, info = env.step(response)   # env is a passive callback
    tokens += encode(obs)
```

This does **not** fit:

| Agent shape | Why the current model hurts |
|---|---|
| Browser / Code agents | They already have `while not done: plan → act → observe`; being forced back into a callback requires major surgery |
| LangGraph / DAG agents | Control flow is a graph, not a linear step |
| Multi-agent setups | Multiple agents call the LLM independently — cannot be mapped to a single env |
| Async agents | `_process_env_step` supports `awaitable`, but execution is still step-wise & sequential |

### 2.3 Intrusion points — env must understand framework internals

- `format_observation()` must return a chat-message dict (role + content list).
- The env has to know the tokenizer and the multimodal encoding rules.
- The env must mutate framework-private fields like `sample.metadata["_env_current_image"]` to make partial-rollout recovery work.

**Net effect**: "integrate a new agent" today means *"read 500+ lines of framework code, copy half of it, and learn our internal `Sample` layout"*. That is neither lightweight nor separated.

---

## 3. Lessons from the Industry (verl / AReaL / OpenRLHF)

Before proposing a design, we align with existing open-source practice. Three frameworks have converged on a very similar pattern:

### 3.1 verl — `AgentLoop` (a.k.a. "rollout-as-a-service")

[verl](https://github.com/volcengine/verl) introduced `AgentLoop` to solve exactly the problem this issue raises:

- An **`AgentLoopManager`** (Ray actor) owns N **`AgentLoopWorker`** processes.
- Each Worker hosts a pluggable **`BaseAgentLoop` subclass** (`ToolAgentLoop`, `SingleTurnAgentLoop`, user-defined `MyAgentLoop`) — this is where *business logic* lives.
- The Worker talks to a headless vLLM/SGLang server via **`AsyncLLMServerManager.generate(...)`**, which is an OpenAI-style async generate call.
- The framework **never looks inside the loop**; it only consumes the returned `AgentLoopOutput(prompt_ids, response_ids, response_mask, num_turns, metrics)`.

Key insight: **the user only writes one class with one `async def run(...)` method**. Everything else (batching, dispatching, weight sync, load-balancing SGLang) is framework-owned.

### 3.2 AReaL — `Controller` + `LLMAgent`

[AReaL](https://github.com/inclusionAI/AReaL) splits concerns even more explicitly:

- **`RolloutController`** holds a `WorkflowExecutor` with an input queue & result queue.
- The user writes a **`RolloutWorkflow.arun_episode(engine, data)`** coroutine — this is the agent.
- Inside `arun_episode`, the user calls **`engine.agenerate(...)`** (an HTTP client to the remote inference server) as many times as they want, in any shape.
- The result is returned as a flat list of `Trajectory` objects; the trainer side calls `controller.prepare_batch()` to drain.

Key insight: **the agent is just an async Python coroutine**. No callback, no state machine, no message-diff trickery.

### 3.3 OpenRLHF — `ExperienceMaker`

Thinner, synchronous, but the same idea: the agent logic is a callable owned by the user and is invoked by the framework with a *remote* LLM handle.

### 3.4 Convergent pattern

| Concern | verl | AReaL | OpenRLHF | **Relax (proposed)** |
|---|---|---|---|---|
| Who owns the loop? | `AgentLoop` subclass | `RolloutWorkflow` coroutine | `ExperienceMaker` | **`AgenticRollout` (user class) OR OpenAI-base_url (zero-code)** |
| How does the loop talk to the engine? | `AsyncLLMServerManager.generate` | `engine.agenerate` | direct inference call | **OpenAI-compatible HTTP + optional Python `AsyncEngineClient`** |
| Where is token/mask/logprob bookkeeping? | Framework (`AgentLoopWorker`) | Framework (`WorkflowExecutor`) | Framework | **Framework (`AgenticRolloutProxy`)** |
| How is a trajectory finalized? | return `AgentLoopOutput` | return `Trajectory` | return dict | **`finish(reward, success)` call, or return from user coroutine** |
| Framework import in user code? | 1 base class | 1 base class | 1 base class | **0 (HTTP) or 1 base class (Python)** |

**Takeaway for Relax**: we should offer **two interoperable front-ends** on top of the same backend:

1. **Zero-dep HTTP front-end** (our unique selling point, for LangChain / SWE-agent / browser-use / external apps).
2. **verl/AReaL-style Python front-end** (`AgenticRollout` subclass) for users who *do* want a thin in-process coroutine and max throughput.

Both drop into the same `AgenticRolloutProxy` / `AgenticRolloutManager` internals (the Relax analogue of verl's `AgentLoopManager`), so there is one backend to maintain.

---

## 4. Proposed Design — **Inversion of Control (IoC) Agentic Rollout**

**Core idea**: **Invert the control flow**. Instead of the framework driving the env, the **agent app drives itself** and calls the framework's LLM endpoint over an **OpenAI-compatible HTTP API** (or a thin Python coroutine à la verl/AReaL). The framework passively records the trajectory.

### 4.1 Three-layer architecture (aligned with verl / AReaL)

```
┌──────────────────────────────────────────────────────────────────┐
│  Layer 1 · Agent (user code)                                     │
│    Front-end A: pure HTTP (OpenAI SDK, langchain, swe-agent …)   │
│    Front-end B: `class MyAgent(AgenticRollout): async def run()` │
│                      (verl AgentLoop / AReaL Workflow style)     │
└──────────────────────────────────────────────────────────────────┘
                          │  requests                     ▲
                          ▼                               │ samples
┌──────────────────────────────────────────────────────────────────┐
│  Layer 2 · Orchestration (framework, new)                        │
│    • `AgenticRolloutManager` (Ray actor) — like verl AgentLoopMgr│
│    • Spawns N `AgentWorker`s (subprocess | ray_actor | docker)   │
│    • `AgenticRolloutProxy` (FastAPI) — OpenAI-compatible ingress │
│    • Per-trajectory Sample state machine + asyncio.Queue         │
└──────────────────────────────────────────────────────────────────┘
                          │  engine.agenerate (async)     ▲
                          ▼                               │ tokens+logprobs
┌──────────────────────────────────────────────────────────────────┐
│  Layer 3 · Inference (existing, unchanged)                       │
│    • SGLang router + engines                                     │
│    • Partial rollout / weight update / MoE routed_experts        │
└──────────────────────────────────────────────────────────────────┘
```

### 4.2 Deployment diagram

```
┌─────────────────────────────────────────────────────────────┐
│  Agent App (user code, zero framework dependency)           │
│    - LangChain / LangGraph / Custom / ...                   │
│    - while not done:                                        │
│        reply = openai.chat.completions.create(...) ──┐      │
│        tool_result = run_tool(reply.tool_calls)      │      │
│        messages.append(tool_result)                  │      │
└──────────────────────────────────────────────────────┼──────┘
                                                       │  HTTP  /v1/chat/completions
                                                       │  (OpenAI-compatible)
                                                       ▼
┌─────────────────────────────────────────────────────────────┐
│  Relax Rollout Gateway (NEW: AgenticRolloutProxy)           │
│   • Intercepts each completion request                      │
│   • Keyed by trajectory_id (X-Relax-Trajectory-Id header)   │
│   • Calls SGLang engine, records logprobs + routed_experts  │
│   • Incrementally builds a per-trajectory Sample:           │
│       - messages_in  → loss_mask=0 span (env input)         │
│       - completion  → loss_mask=1 span (model output)       │
│   • Trajectory finishes via /v1/trajectories/{id}/finish    │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼   (tokens, loss_mask, logprobs, routed_experts)
          ┌──────────────────────────┐
          │  Existing Sample & training │
          │  GRPO / PPO / Reward / ...  │
          └──────────────────────────┘
```
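
The bookkeeping in the gateway box can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: `TrajectoryBuffer` and its method names are hypothetical, and plain integers stand in for real tokenizer output.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectoryBuffer:
    """Hypothetical per-trajectory accumulator (illustrative names only)."""
    tokens: list[int] = field(default_factory=list)
    loss_mask: list[int] = field(default_factory=list)
    logprobs: list[float] = field(default_factory=list)

    def add_input(self, ids: list[int]) -> None:
        # Prompt, tool results, env observations: recorded but not trained on.
        self.tokens.extend(ids)
        self.loss_mask.extend([0] * len(ids))

    def add_completion(self, ids: list[int], logprobs: list[float]) -> None:
        # Model output: trained on, exactly one logprob per generated token.
        if len(ids) != len(logprobs):
            raise ValueError("need exactly one logprob per generated token")
        self.tokens.extend(ids)
        self.loss_mask.extend([1] * len(ids))
        self.logprobs.extend(logprobs)


buf = TrajectoryBuffer()
buf.add_input([101, 102, 103])             # turn 1: user prompt
buf.add_completion([7, 8], [-0.1, -0.3])   # turn 1: model reply
buf.add_input([201])                       # turn 2: tool result
buf.add_completion([9], [-0.2])            # turn 2: model reply
print(buf.loss_mask)                       # [0, 0, 0, 1, 1, 0, 1]
```

Keeping the three lists in one object makes the invariant `len(logprobs) == sum(loss_mask)` easy to check at every turn.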

### 4.2.1 User integration — *lightweight by construction*

**Before** (an existing agent app):
```python
import openai
client = openai.OpenAI(base_url="https://api.openai.com/v1", api_key="sk-xxx")
```

**After** (two-line change):
```python
import openai
client = openai.OpenAI(
    base_url=f"{RELAX_ROLLOUT_URL}/v1",                         # ① swap endpoint
    default_headers={"X-Relax-Trajectory-Id": task_id},         # ② tag trajectory
    api_key="dummy",            # the SDK requires a key unless OPENAI_API_KEY is set
)
# ... the rest of the agent code is unchanged ...
```

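At the end of a trajectory the agent reports the reward and termination status with one plain HTTP call:
```python
import httpx  # already installed as a dependency of the openai SDK
httpx.post(f"{RELAX_ROLLOUT_URL}/v1/trajectories/{task_id}/finish",
           json={"reward": my_reward, "success": True})
```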

Result: the agent never imports the framework, so `grep -r "import relax" agent_app/` returns **zero** matches.

### 4.3 Alternative — Python front-end (verl/AReaL style)

For users who prefer an in-process coroutine (no subprocess, no HTTP overhead, easy debugging):

```python
# examples/agentic_gsm8k/agent_loop.py
from relax.engine.rollout.agentic import AgenticRollout, register_agentic

@register_agentic("gsm8k_react")
class GSM8KReActAgent(AgenticRollout):
    async def run(self, sample, engine):
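        messages = [{"role": "user", "content": sample.prompt}]
        final_text = ""  # text of the last assistant turn that ended with "stop"
        for turn in range(5):
            resp = await engine.agenerate(messages, tools=TOOLS)  # remote, async
            messages.append(resp.message)
            if resp.finish_reason == "stop":
                final_text = resp.message.content
                break
            tool_result = await run_tool(resp.tool_calls[0])
            messages.append({"role": "tool", "content": tool_result})
        # Grade the assistant's final text; messages[-1] may be a tool dict
        # (no `.content` attribute) when the turn budget runs out.
        return {"reward": grade(final_text, sample.answer),
                "success": True}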
```

This is semantically identical to the HTTP path; both produce the same `Sample` objects downstream. The framework picks the front-end based on `--agent-launcher-type`:

- `python_coroutine` → Python front-end (max throughput, like verl AgentLoop)
- `subprocess` / `ray_actor` / `docker` / `http` → HTTP front-end (max isolation, for real agent apps)

### 4.4 End-to-end sequence diagram

```mermaid
sequenceDiagram
    participant T as Trainer (existing)
    participant M as AgenticRolloutManager
    participant W as AgentWorker #k
    participant A as Agent App (user)
    participant P as AgenticRolloutProxy
    participant E as SGLang Engine

    T->>M: generate_rollout_agentic(prompts)
    loop for each prompt
        M->>W: spawn(trajectory_id, prompt, proxy_url)
        W->>A: exec agent_app.py (env: URL + TID + PROMPT)
    end

    loop agent main loop
        A->>P: POST /v1/chat/completions (X-Relax-Trajectory-Id)
        P->>P: diff messages → new_input_ids, loss_mask=0
        P->>E: async generate(tokens, mm_inputs)
        E-->>P: response_ids, logprobs, routed_experts
        P->>P: append response, loss_mask=1
        P-->>A: OpenAI-format completion
        A->>A: run_tool(tool_calls)
    end

    A->>P: POST /v1/trajectories/{id}/finish (reward, success)
    P->>P: sample.status = COMPLETED, put on queue
    M->>P: drain(N)
    P-->>M: List[Sample]
    M-->>T: build_rollout_fn_output(...)
```

---

## 5. Implementation Plan

### Step 1 · Add `AgenticRolloutProxy` component *(3–4 days)*

New file: `relax/engine/rollout/agentic/proxy.py`

```python
class AgenticRolloutProxy:
    """FastAPI ingress implementing OpenAI-compatible /v1/chat/completions.
    Each request is traced by X-Relax-Trajectory-Id and token-level data is
    accumulated into per-trajectory Sample objects."""

    def __init__(self, args, sglang_router_url):
        self._samples: dict[str, Sample] = {}          # trajectory_id -> Sample
        self._pending: asyncio.Queue[Sample] = asyncio.Queue()
        self._tokenizer = load_tokenizer(args)
        self._processor = load_processor(args)

    @app.post("/v1/chat/completions")
    async def chat_completions(self, body, headers):
        tid = headers["X-Relax-Trajectory-Id"]
        sample = self._samples.setdefault(tid, Sample(id=tid, tokens=[], loss_mask=[]))
        # 1. diff messages -> new input tokens (loss_mask=0)
        new_input_ids, mm_inputs = self._diff_and_encode(sample, body["messages"])
        sample.tokens.extend(new_input_ids)
        sample.loss_mask.extend([0] * len(new_input_ids))
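        # 2. forward to SGLang engine, receive response + logprobs
        #    (OpenAI clients send temperature / top_p / max_tokens as top-level
        #    fields, so there is no "sampling_params" key to index directly)
        sampling_params = {k: body[k] for k in ("temperature", "top_p", "max_tokens") if k in body}
        resp = await self._sglang_generate(sample.tokens, sampling_params, mm_inputs)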
        # 3. append response (loss_mask=1, real logprobs)
        sample.tokens.extend(resp.tokens)
        sample.loss_mask.extend([1] * len(resp.tokens))
        sample.rollout_log_probs.extend(resp.logprobs)
        # 4. return OpenAI-format response to the agent
        return _to_openai_format(resp)

    @app.post("/v1/trajectories/{tid}/finish")
    async def finish(self, tid, body):
        sample = self._samples.pop(tid)
        sample.reward = body["reward"]
        sample.status = (
            Sample.Status.COMPLETED if body.get("success")
            else Sample.Status.TRUNCATED
        )
        await self._pending.put(sample)

    async def drain(self, n: int) -> list[Sample]:
        return [await self._pending.get() for _ in range(n)]
```
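
Step 1 of `chat_completions` relies on `_diff_and_encode`, which is not spelled out above. One possible shape of its message-level half is sketched here, assuming the agent only ever appends to the conversation. `diff_messages` is a hypothetical helper name; a real proxy would then render the returned suffix through the chat template before tokenizing.

```python
def diff_messages(seen: list[dict], incoming: list[dict]) -> list[dict]:
    """Return the messages in `incoming` that were not part of `seen`.

    Raises ValueError if the agent rewrote history, because the tokens already
    recorded for the trajectory would no longer match the conversation.
    """
    if incoming[:len(seen)] != seen:
        raise ValueError("conversation history diverged from the recorded prefix")
    return incoming[len(seen):]


# State recorded by the proxy after the first completion: the user prompt
# plus the assistant reply it generated itself.
seen = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "calling calculator"},
]
# The next request repeats the history and appends one tool result.
incoming = seen + [{"role": "tool", "content": "4"}]

new_messages = diff_messages(seen, incoming)
print(new_messages)  # [{'role': 'tool', 'content': '4'}]
```

Failing loudly on a diverged prefix is a deliberate choice in this sketch: silently re-encoding would leave `tokens` and `loss_mask` describing a conversation the agent no longer holds.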

### Step 2 · Add `AgenticRolloutManager` + `generate_rollout_agentic` *(2 days)*

New files: `relax/engine/rollout/agentic/manager.py`, `relax/engine/rollout/agentic/generate.py`

`AgenticRolloutManager` is the analogue of verl's `AgentLoopManager`:

```python
@ray.remote
class AgenticRolloutManager:
    def __init__(self, args, proxy_handle, sglang_router_url):
        self._launcher = build_launcher(args.agent_launcher_type)
        self._proxy = proxy_handle
        self._concurrency = args.agent_concurrency or args.rollout_batch_size

    async def run_batch(self, prompts: list[Sample]) -> list[Sample]:
        sem = asyncio.Semaphore(self._concurrency)
        async def _one(p):
            async with sem:
                await self._launcher.start(p, self._proxy.url)
                return await self._proxy.wait(p.id)  # blocks until /finish
        return await asyncio.gather(*[_one(p) for p in prompts])
```

And `generate.py` is the tiny adapter that plugs into the existing rollout entrypoint:

```python
async def generate_rollout_agentic(
    args, rollout_id, data_source, data_system_client, evaluation=False,
):
    """Drop-in replacement for sglang_rollout.generate_rollout that routes
    prompts through the AgenticRolloutManager + Proxy."""
    manager = AgenticRolloutManager.get_or_create(args)
    prompts = data_source.get_samples(args.rollout_batch_size)
    samples = await manager.run_batch.remote(prompts)
    return build_rollout_fn_output(
        samples, rollout_id=rollout_id, args=args,
        data_system_client=data_system_client, evaluation=evaluation,
    )
```
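
`run_batch` awaits `self._proxy.wait(p.id)`, while the proxy sketch in Step 1 only exposes `drain(n)`. One way to bridge the two is a per-trajectory future that `/finish` resolves. The sketch below assumes everything runs on a single event loop; `FinishRegistry` is a hypothetical name, not part of the proposal.

```python
import asyncio
from typing import Any, Optional


class FinishRegistry:
    """Hypothetical helper: lets a manager await one specific trajectory."""

    def __init__(self) -> None:
        self._futures: dict[str, asyncio.Future] = {}

    def _future(self, tid: str) -> asyncio.Future:
        # Created lazily so wait() and finish() may arrive in either order.
        if tid not in self._futures:
            self._futures[tid] = asyncio.get_running_loop().create_future()
        return self._futures[tid]

    async def wait(self, tid: str, timeout: Optional[float] = None) -> Any:
        try:
            return await asyncio.wait_for(self._future(tid), timeout)
        finally:
            self._futures.pop(tid, None)  # avoid leaking finished entries

    def finish(self, tid: str, result: Any) -> None:
        fut = self._future(tid)
        if not fut.done():
            fut.set_result(result)


async def demo() -> Any:
    reg = FinishRegistry()
    waiter = asyncio.create_task(reg.wait("t1", timeout=1.0))
    await asyncio.sleep(0)                # let the waiter start
    reg.finish("t1", {"reward": 1.0})     # what the /finish handler would do
    return await waiter


result = asyncio.run(demo())
print(result)  # {'reward': 1.0}
```

The `timeout` argument is where `--agent-timeout-secs` could plug in, so a hung agent cannot block its batch forever.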

### Step 3 · Agent-launcher abstraction *(1.5 days)*

New package: `relax/engine/rollout/agentic/launchers/`

| Launcher | Front-end | Use-case |
|---|---|---|
| `python_coroutine.py` | Python | `AgenticRollout` subclass, in-process, fastest (verl-style) |
| `subprocess.py` | HTTP | Start the agent app as a Python subprocess (most generic) |
| `ray_actor.py` | HTTP | Ray Actor (GPU-friendly, e.g. browser agents with vision models) |
| `docker.py` | HTTP | Containerized sandbox (code agents, SWE-agent) |
| `http_callback.py` | HTTP | The user already runs an agent server; framework just POSTs `/run` |

Common interface:

```python
class BaseAgentLauncher:
    async def start(self, sample: Sample, proxy_url: str) -> None: ...
    async def kill(self, sample_id: str) -> None: ...
```
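
A subprocess launcher behind this interface might look like the sketch below. It is illustrative only: to stay self-contained it takes a sample id and prompt instead of a `Sample`, and it returns the exit code instead of `None`. The environment variable names follow the ones `agent_app.py` reads (`RELAX_ROLLOUT_URL`, `RELAX_TRAJECTORY_ID`, `RELAX_PROMPT`); `SubprocessLauncher` itself is a hypothetical class.

```python
import asyncio
import os
import sys


class SubprocessLauncher:
    """Hypothetical launcher: one OS process per trajectory."""

    def __init__(self, argv: list[str], timeout_secs: float = 600.0) -> None:
        self._argv = argv
        self._timeout = timeout_secs
        self._procs: dict[str, asyncio.subprocess.Process] = {}

    async def start(self, sample_id: str, prompt: str, proxy_url: str) -> int:
        # Hand the trajectory context to the agent through its environment.
        env = dict(os.environ,
                   RELAX_ROLLOUT_URL=proxy_url,
                   RELAX_TRAJECTORY_ID=sample_id,
                   RELAX_PROMPT=prompt)
        proc = await asyncio.create_subprocess_exec(*self._argv, env=env)
        self._procs[sample_id] = proc
        try:
            return await asyncio.wait_for(proc.wait(), self._timeout)
        except asyncio.TimeoutError:
            await self.kill(sample_id)    # do not leave a runaway agent behind
            raise
        finally:
            self._procs.pop(sample_id, None)

    async def kill(self, sample_id: str) -> None:
        proc = self._procs.get(sample_id)
        if proc is not None and proc.returncode is None:
            proc.kill()
            await proc.wait()


# Stand-in "agent": exits 0 only if it received the expected trajectory id.
child = "import os, sys; sys.exit(0 if os.environ['RELAX_TRAJECTORY_ID'] == 't1' else 3)"
launcher = SubprocessLauncher([sys.executable, "-c", child], timeout_secs=30)
exit_code = asyncio.run(launcher.start("t1", "What is 2+2?", "http://127.0.0.1:8000"))
print(exit_code)
```

Passing context through environment variables keeps the agent free of framework imports, which is the point of the HTTP front-end.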

### Step 4 · New CLI arguments *(0.5 day)*

In `relax/utils/arguments.py`, rollout argument group:

```python
parser.add_argument(
    "--rollout-function-path",
    default="relax.engine.rollout.sglang_rollout.generate_rollout",
)
# To use agentic rollout:
#   --rollout-function-path relax.engine.rollout.agentic.generate.generate_rollout_agentic

parser.add_argument(
    "--agent-launcher", type=str, default=None,
    help=("Entrypoint of the agent app. "
          "For python_coroutine: 'examples.agentic_gsm8k.agent_loop:GSM8KReActAgent'. "
          "For subprocess: 'examples.swe_agent.run:main'. "
          "For http: 'http://agent-svc/run'."),
)
parser.add_argument(
    "--agent-launcher-type",
    choices=["python_coroutine", "subprocess", "ray_actor", "docker", "http"],
    default="subprocess",
)
parser.add_argument("--agent-timeout-secs", type=int, default=600)
parser.add_argument(
    "--agent-concurrency", type=int, default=None,
    help="Max concurrent agent instances; default = rollout_batch_size",
)
```
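
The `module:attribute` form of `--agent-launcher` can be resolved with the standard library alone. The helper below is a sketch with a hypothetical name (`resolve_entrypoint`); it covers only the Python entrypoint forms, so a URL passed for the `http` launcher type would need separate handling.

```python
import importlib


def resolve_entrypoint(spec: str):
    """Resolve 'package.module:attribute' to the Python object it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    return getattr(importlib.import_module(module_name), attr)


# Demonstrated on a standard-library target so the example runs anywhere.
dumps = resolve_entrypoint("json:dumps")
print(dumps({"ok": True}))  # {"ok": true}
```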

### Step 5 · First example *(1–2 days)*

New directory: `examples/agentic_gsm8k/`

```
examples/agentic_gsm8k/
├── agent_app.py           # pure user code: a ~50-line ReAct agent using openai SDK
├── run_agentic_gsm8k.sh   # launch script
└── README.md
```

`agent_app.py` — **no `import relax` anywhere**:
```python
import os, openai, httpx   # httpx ships as a dependency of the openai SDK

base_url = os.environ["RELAX_ROLLOUT_URL"]
tid = os.environ["RELAX_TRAJECTORY_ID"]
client = openai.OpenAI(
    base_url=base_url + "/v1",
    default_headers={"X-Relax-Trajectory-Id": tid},
    api_key="dummy",
)
messages = [{"role": "user", "content": os.environ["RELAX_PROMPT"]}]
answer = ""  # text of the final assistant turn, if the model reaches one
for turn in range(5):
    resp = client.chat.completions.create(model="policy", messages=messages, tools=TOOLS)
    messages.append(resp.choices[0].message)
    if resp.choices[0].finish_reason == "stop":
        answer = resp.choices[0].message.content or ""
        break
    tool_result = run_tool(resp.choices[0].message.tool_calls[0])
    messages.append({"role": "tool", "content": tool_result, "tool_call_id": ...})

# TOOLS, run_tool, extract_answer and ground_truth are defined elsewhere in the app.
reward = 1.0 if extract_answer(answer) == ground_truth else 0.0
# `client.post` does not accept a `json=` keyword, so report the result with httpx.
httpx.post(
    f"{base_url}/v1/trajectories/{tid}/finish",
    json={"reward": reward, "success": True},
)
```

---

## 6. Key Technical Challenges & Mitigations

| Challenge | Mitigation |
|---|---|
| **Token-level diff of messages** | Maintain `sample.last_messages_hash`; on each new request, tokenize only the delta via `tokenizer.apply_chat_template` — the same trick used in DeepEyes `_encode_observation_for_generation` |
| **Tool messages carry no tokens yet** | Route tool results through the chat template → encode → `loss_mask=0`, consistent with the existing DeepEyes path |
| **Multimodal inputs (images / video URLs)** | Proxy downloads / base64-encodes and injects into SGLang `image_data`; transparent to the user |
| **Partial-rollout compatibility** | On abort, proxy keeps the sample state; the next request simply resumes appending |
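| **Weight-update blocking** | When `can_do_update_weight_for_async()` is active, proxy rejects new completions with HTTP 503; the agent's OpenAI client retries automatically (the SDK retries 5xx responses, 2 attempts by default, so long updates may need a higher `max_retries`) |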
| **Concurrency & isolation** | Per-trajectory `asyncio.Lock`; proxy-internal `asyncio.Queue`; reuse the existing sglang_router load balancing |
| **Per-step reward attribution** | `/v1/trajectories/{id}/finish` optionally accepts `turn_rewards: [...]`; written to `sample.metadata["step_rewards"]` for custom reward functions |
| **MoE routing replay** | Read `routed_experts` from SGLang `meta_info` and accumulate into `sample.rollout_routed_experts`, identical to the DeepEyes `_update_routed_experts` path |
| **Async / fully-async modes** | Proxy lives inside the Rollout process, so it inherits TransferQueue / DCS / staleness semantics for free |
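
The "Concurrency & isolation" row can be illustrated with a lazily created lock per trajectory id. This is a sketch under the assumption that all requests are handled on one event loop; `handle` and the request names are placeholders, and the `sleep` stands in for the engine call.

```python
import asyncio
from collections import defaultdict


async def handle(tid, name, locks, log):
    async with locks[tid]:            # serialize requests of one trajectory
        log.append((tid, name, "start"))
        await asyncio.sleep(0.01)     # stands in for the engine call
        log.append((tid, name, "end"))


async def demo():
    locks = defaultdict(asyncio.Lock)  # one lock per trajectory, created on demand
    log = []
    await asyncio.gather(
        handle("t1", "req-a", locks, log),
        handle("t1", "req-b", locks, log),   # same trajectory: must wait for req-a
        handle("t2", "req-c", locks, log),   # other trajectory: runs concurrently
    )
    return log


log = asyncio.run(demo())
t1_events = [(name, phase) for tid, name, phase in log if tid == "t1"]
print(t1_events)
```

Requests for `t1` never interleave, while `t2` starts before `req-a` has finished, so isolation does not cost cross-trajectory throughput.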

---

## 7. Testing & Validation Plan

### L1 · Unit tests *(0.5 day)*
| Test | File | Focus |
|---|---|---|
| `test_message_diff_encoding` | `tests/rollout/test_agentic_proxy.py` | Incremental messages → correct tokens / loss_mask |
| `test_openai_compat` | same | chat.completions / tool_calls responses parse with the unmodified OpenAI SDK (schema-compatible; ids and timestamps will differ) |
| `test_trajectory_isolation` | same | Concurrent trajectories never cross-contaminate |

### L2 · Integration smoke test *(1 day)*
1. Start Relax rollout with the proxy only (no training loop).
2. Run `agent_app.py` + a real SGLang engine against 10 GSM8K prompts.
3. **Assertions**:
   - Exactly 10 Samples collected.
   - `sum(loss_mask) ≈ #generated tokens`.
   - `len(sample.rollout_log_probs) == sum(loss_mask)`.
   - Reward distribution is reasonable.
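
The per-sample assertions above can be collected into one checker. A minimal sketch follows, assuming a sample exposes `tokens`, `loss_mask` and `rollout_log_probs` as plain lists; `check_sample` is a hypothetical helper name.

```python
def check_sample(tokens, loss_mask, rollout_log_probs):
    """Return the list of violated invariants (empty means the sample is consistent)."""
    problems = []
    if len(tokens) != len(loss_mask):
        problems.append("tokens and loss_mask differ in length")
    if any(m not in (0, 1) for m in loss_mask):
        problems.append("loss_mask must contain only 0 or 1")
    if len(rollout_log_probs) != sum(loss_mask):
        problems.append("expected one logprob per loss_mask=1 token")
    return problems


good = check_sample([5, 6, 7, 8], [0, 0, 1, 1], [-0.2, -0.4])
bad = check_sample([5, 6, 7], [0, 1, 1], [-0.2])
print(good)  # []
print(bad)   # ['expected one logprob per loss_mask=1 token']
```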

### L3 · End-to-end training *(2 days)*
- **Hardware**: 8×H800 single node (colocate mode), matching the DeepEyes minimal setup.
- **Model**: Qwen3-4B (fast iteration).
- **Dataset**: GSM8K with tool calling.
- **Baseline**: same data, same model, DeepEyes-style rollout (current path).
- **Metrics**:
  - Throughput (tokens/s) within 10 % of baseline.
  - Reward-vs-step curve shape matches baseline.
  - After 100 iters, GSM8K pass@1 gap < 1 %.
- **Success criteria**: no curve regression, **≥ 70 % LOC reduction** for new agent integrations.

### L4 · Compatibility matrix *(1 day)*
| Scenario | Expected |
|---|---|
| Colocate + Bridge | ✅ |
| Fully-async | ✅ proxy correctly returns 503 during weight update |
| Partial rollout | ✅ aborted samples resume correctly |
| MoE (routed_experts replay) | ✅ |
| Multimodal (Qwen3-VL) | ✅ image URLs pass through |
| Elastic rollout scale-out | ✅ proxy is engine-transparent |

### L5 · Real agent-framework integration *(2–3 days)*
Pick 2–3 representative real agents:
- **LangChain ReAct agent** (classic tool calling).
- **SWE-agent / OpenHands** (code agent with its own complex main loop).
- **browser-use** (multimodal + tools).

Target: **only `base_url` changes, zero business-logic change** → each runs one full RL iteration successfully.

---

## 8. Hardware Plan

This issue is primarily a **framework capability**; it needs the hardware an existing training run already uses, plus CPU capacity for N agent subprocesses.

### Development (single engineer)
- 1× H800 80 GB + 8-core CPU
- Qwen3-4B, batch = 4, concurrency = 4
- Agents run as host-local subprocesses (CPU overhead negligible)

### Validation (baseline comparison)
- 8× H800 single node, Qwen3-4B fully-async, batch = 64
- Agent concurrency = 64 → CPU ≥ 32 cores (tool execution / parsing)

### Production (at scale)
- 32× H800 (4 nodes), Qwen3-32B, batch = 256
- **Recommended**: schedule agent workers onto dedicated CPU nodes via Ray placement groups:
  ```bash
  --agent-launcher-type ray_actor \
  --resource '{"actor":[3,8], "rollout":[1,8], "agent_workers":[1,0]}'
  ```
- Network: agent workers ↔ rollout proxy on intra-cluster gRPC/HTTP, < 1 ms latency

### Special cases
- **Browser agents**: each worker runs a headless Chrome → 1 node ≈ 16 workers (CPU-bound).
- **Code agents**: each worker uses a Docker sandbox → use `--agent-launcher-type docker`; 1 node ≈ 8 workers.

---

## 9. Milestones & Effort

| # | Task | Effort | Deliverable |
|---|---|---|---|
| M1 | Proxy + message-diff encoding | 3 d | Can intercept completions and build Samples correctly |
| M2 | `AgenticRolloutManager` + subprocess launcher + `generate_rollout_agentic` | 2 d | One GSM8K agent integration works via HTTP front-end |
| M3 | `python_coroutine` launcher + `AgenticRollout` base class | 1 d | verl/AReaL-style front-end works (same Sample output) |
| M4 | L1 + L2 tests green | 1 d | Unit + smoke tests pass |
| M5 | L3 end-to-end training parity | 2 d | Curves match the DeepEyes baseline |
| M6 | Ray-actor + docker launchers | 2 d | Production-grade launchers |
| M7 | L5 real-agent PoCs (LangChain, SWE-agent) | 3 d | 2 real-agent example projects |
| M8 | Docs + migration guide | 1 d | `docs/agentic-rollout.md`; DeepEyes path marked legacy |
| **Total** | | **~15 working days** | |

---

## 10. Risks & Rollback

| Risk | Likelihood | Mitigation |
|---|---|---|
| Token-level message diff breaks when chat templates change | Medium | LCS-based token alignment as a fallback + rich assertions |
| OpenAI API compatibility edge cases (streaming, `n > 1`) | Medium | Initially support only `stream=false, n=1`; reject others with HTTP 400 |
| Agent subprocess leaks / timeouts | Low | Launcher-level kill on timeout + Ray worker-level GC |
| Regression in the existing DeepEyes path | Low | New path is a separate module; switchable via `--rollout-function-path`; old path preserved |
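
The "support only `stream=false, n=1`" mitigation amounts to a small guard in front of the completion handler. The sketch below returns an (HTTP status, message) pair; `validate_request` is a hypothetical helper name, and the error wording is illustrative.

```python
def validate_request(body: dict) -> tuple[int, str]:
    """Return (HTTP status, message) for an incoming chat-completion body."""
    if body.get("stream", False):
        return 400, "stream=true is not supported yet"
    if body.get("n", 1) != 1:
        return 400, "only n=1 is supported"
    return 200, "ok"


print(validate_request({"messages": [], "stream": True}))  # (400, 'stream=true is not supported yet')
print(validate_request({"messages": [], "n": 4}))          # (400, 'only n=1 is supported')
print(validate_request({"messages": []}))                  # (200, 'ok')
```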

**Rollback strategy**: all changes are additive; the only edit to an existing file is the four new CLI arguments in `arguments.py`. If anything misbehaves, flip `--rollout-function-path` back to `relax.engine.rollout.sglang_rollout.generate_rollout`.

---

