[Feature] Agentic Rollout — training real agent apps with strict app–RL-framework separation and lightweight API integration #16

@sjmshsh

Agentic Rollout — Training Real Agent Apps with Strict App–RL-Framework Separation and Lightweight API Integration


1. Motivation

Modern RL-with-LLM workflows increasingly revolve around real agent applications — LangChain / LangGraph agents, SWE-agent, OpenHands, browser-use, Claude-Code-style assistants, and in-house tool-using agents. These apps already ship as independently runnable services with their own main loops, tool dispatch, state management, and (often) UIs.

The question this issue asks is: how do we train these agents with RL without forcing them to rewrite themselves against our framework?

Today Relax has a working Agentic rollout (examples/deepeyes/rollout.py + BaseInteractionEnv), but integrating a new agent app into training is heavy. We want:

  • Real agent apps — not just synthetic single-turn chat, but multi-turn, tool-calling, branching, possibly multi-agent workflows.
  • Strict app ↔ RL-framework separation — the agent code does not import the RL framework; the framework does not understand the agent's business logic (how many tools, how to parse them, how to keep state).
  • Lightweight API integration — ideally a couple of lines of code, or just a change of base_url, is all it takes to plug an existing agent into training.

2. Current State & Pain Points

Evidence collected from the code base (examples/deepeyes/, relax/engine/rollout/):

2.1 Rollout layer — users copy 400+ lines of "framework code"

examples/deepeyes/rollout.py is ~564 LOC. ~85 % of it is generic multi-turn scaffolding:

  • Token concatenation across turns
  • loss_mask construction
  • Multimodal accumulation
  • Budget / max-turn management
  • Partial-rollout (abort / resume) bookkeeping

Every new agent example today has to copy-paste this file, patch DEFAULT_ENV_MODULE, and maintain its own drift-prone fork.
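Most of the scaffolding listed above reduces to appending token spans with the right mask value. The following is a minimal sketch of that bookkeeping, assuming token IDs are plain integer lists; `Trajectory`, `append_observation`, `append_response`, and `within_budget` are hypothetical names for illustration, not the actual Relax code.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical container; the real Relax Sample carries more fields."""
    tokens: list[int] = field(default_factory=list)
    loss_mask: list[int] = field(default_factory=list)

    def append_observation(self, ids: list[int]) -> None:
        # Prompt / tool / env tokens are context only: masked out of the loss.
        self.tokens.extend(ids)
        self.loss_mask.extend([0] * len(ids))

    def append_response(self, ids: list[int]) -> None:
        # Model-generated tokens are the ones the policy is trained on.
        self.tokens.extend(ids)
        self.loss_mask.extend([1] * len(ids))


def within_budget(traj: Trajectory, max_tokens: int) -> bool:
    """Budget check: the episode may continue while the context is not full."""
    return len(traj.tokens) < max_tokens


traj = Trajectory()
traj.append_observation([101, 102, 103])  # prompt
traj.append_response([7, 8])              # turn 1 model output
traj.append_observation([201])            # tool result
traj.append_response([9])                 # turn 2 model output
```

Because this logic is identical for every agent, it is a natural candidate to live in the framework rather than in each example.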

2.2 Env layer — BaseInteractionEnv.step(response_text) is a synchronous callback

# Current model (examples/deepeyes/base_env.py)
env.reset()
for turn in range(max_turns):
    response = llm_generate(tokens)        # framework drives the loop
    obs, done, info = env.step(response)   # env is a passive callback
    tokens += encode(obs)

This does not fit:

| Agent shape | Why the current model hurts |
| --- | --- |
| Browser / Code agents | They already have while not done: plan → act → observe; being forced back into a callback requires major surgery |
| LangGraph / DAG agents | Control flow is a graph, not a linear step |
| Multi-agent setups | Multiple agents call the LLM independently, so they cannot be mapped to a single env |
| Async agents | _process_env_step supports awaitable, but execution is still step-wise & sequential |

2.3 Intrusion points — env must understand framework internals

  • format_observation() must return a chat-message dict (role + content list).
  • The env has to know the tokenizer and the multimodal encoding rules.
  • The env must mutate framework-private fields like sample.metadata["_env_current_image"] to make partial-rollout recovery work.

Net effect: "integrate a new agent" today means "read 500+ lines of framework code, copy half of it, and learn our internal Sample layout". That is neither lightweight nor separated.


3. Lessons from the Industry (verl / AReaL / OpenRLHF)

Before proposing a design, we align with existing open-source practice. Three frameworks have converged on a very similar pattern:

3.1 verl — AgentLoop (a.k.a. "rollout-as-a-service")

verl introduced AgentLoop to solve exactly the problem this issue raises:

  • An AgentLoopManager (Ray actor) owns N AgentLoopWorker processes.
  • Each Worker hosts a pluggable BaseAgentLoop subclass (ToolAgentLoop, SingleTurnAgentLoop, user-defined MyAgentLoop) — this is where business logic lives.
  • The Worker talks to a headless vLLM/SGLang server via AsyncLLMServerManager.generate(...), which is an OpenAI-style async generate call.
  • The framework never looks inside the loop; it only consumes the returned AgentLoopOutput(prompt_ids, response_ids, response_mask, num_turns, metrics).

Key insight: the user only writes one class with one async def run(...) method. Everything else (batching, dispatching, weight sync, load-balancing SGLang) is framework-owned.

3.2 AReaL — Controller + LLMAgent

AReaL splits concerns even more explicitly:

  • RolloutController holds a WorkflowExecutor with an input queue & result queue.
  • The user writes a RolloutWorkflow.arun_episode(engine, data) coroutine — this is the agent.
  • Inside arun_episode, the user calls engine.agenerate(...) (an HTTP client to the remote inference server) as many times as they want, in any shape.
  • The result is returned as a flat list of Trajectory objects; the trainer side calls controller.prepare_batch() to drain.

Key insight: the agent is just an async Python coroutine. No callback, no state machine, no message-diff trickery.

3.3 OpenRLHF — ExperienceMaker

Thinner, synchronous, but the same idea: the agent logic is a callable owned by the user and is invoked by the framework with a remote LLM handle.

3.4 Convergent pattern

| Concern | verl | AReaL | OpenRLHF | Relax (proposed) |
| --- | --- | --- | --- | --- |
| Who owns the loop? | AgentLoop subclass | RolloutWorkflow coroutine | ExperienceMaker | AgenticRollout (user class) OR OpenAI-base_url (zero-code) |
| How does the loop talk to the engine? | AsyncLLMServerManager.generate | engine.agenerate | direct inference call | OpenAI-compatible HTTP + optional Python AsyncEngineClient |
| Where is token/mask/logprob bookkeeping? | Framework (AgentLoopWorker) | Framework (WorkflowExecutor) | Framework | Framework (AgenticRolloutProxy) |
| How is a trajectory finalized? | return AgentLoopOutput | return Trajectory | return dict | finish(reward, success) call, or return from user coroutine |
| Framework import in user code? | 1 base class | 1 base class | 1 base class | 0 (HTTP) or 1 base class (Python) |

Takeaway for Relax: we should offer two interoperable front-ends on top of the same backend:

  1. Zero-dep HTTP front-end (our unique selling point, for LangChain / SWE-agent / browser-use / external apps).
  2. verl/AReaL-style Python front-end (AgenticRollout subclass) for users who do want a thin in-process coroutine and max throughput.

Both drop into the same AgenticRolloutProxy / AgentLoopManager internals, so there is one backend to maintain.


4. Proposed Design — Inverted Control (IoC) Agentic Rollout

Core idea: Invert the control flow. Instead of the framework driving the env, the agent app drives itself and calls the framework's LLM endpoint over an OpenAI-compatible HTTP API (or a thin Python coroutine à la verl/AReaL). The framework passively records the trajectory.

4.1 Three-layer architecture (aligned with verl / AReaL)

┌──────────────────────────────────────────────────────────────────┐
│  Layer 1 · Agent (user code)                                     │
│    Front-end A: pure HTTP (OpenAI SDK, langchain, swe-agent …)   │
│    Front-end B: `class MyAgent(AgenticRollout): async def run()` │
│                      (verl AgentLoop / AReaL Workflow style)     │
└──────────────────────────────────────────────────────────────────┘
                          │  requests                     ▲
                          ▼                               │ samples
┌──────────────────────────────────────────────────────────────────┐
│  Layer 2 · Orchestration (framework, new)                        │
│    • `AgenticRolloutManager` (Ray actor) — like verl AgentLoopMgr│
│    • Spawns N `AgentWorker`s (subprocess | ray_actor | docker)   │
│    • `AgenticRolloutProxy` (FastAPI) — OpenAI-compatible ingress │
│    • Per-trajectory Sample state machine + asyncio.Queue         │
└──────────────────────────────────────────────────────────────────┘
                          │  engine.agenerate (async)     ▲
                          ▼                               │ tokens+logprobs
┌──────────────────────────────────────────────────────────────────┐
│  Layer 3 · Inference (existing, unchanged)                       │
│    • SGLang router + engines                                     │
│    • Partial rollout / weight update / MoE routed_experts        │
└──────────────────────────────────────────────────────────────────┘

4.2 Deployment diagram

┌─────────────────────────────────────────────────────────────┐
│  Agent App (user code, zero framework dependency)           │
│    - LangChain / LangGraph / Custom / ...                   │
│    - while not done:                                        │
│        reply = openai.chat.completions.create(...) ──┐      │
│        tool_result = run_tool(reply.tool_calls)      │      │
│        messages.append(tool_result)                  │      │
└──────────────────────────────────────────────────────┼──────┘
                                                       │  HTTP  /v1/chat/completions
                                                       │  (OpenAI-compatible)
                                                       ▼
┌─────────────────────────────────────────────────────────────┐
│  Relax Rollout Gateway (NEW: AgenticRolloutProxy)           │
│   • Intercepts each completion request                      │
│   • Keyed by trajectory_id (X-Relax-Trajectory-Id header)   │
│   • Calls SGLang engine, records logprobs + routed_experts  │
│   • Incrementally builds a per-trajectory Sample:           │
│       - messages_in  → loss_mask=0 span (env input)         │
│       - completion  → loss_mask=1 span (model output)       │
│   • Trajectory finishes via /v1/trajectories/{id}/finish    │
└─────────────────────────────────────────────────────────────┘
                    │
                    ▼   (tokens, loss_mask, logprobs, routed_experts)
          ┌──────────────────────────┐
          │  Existing Sample & training │
          │  GRPO / PPO / Reward / ...  │
          └──────────────────────────┘

4.2.1 User integration: lightweight by construction

Before (an existing agent app):

import openai
client = openai.OpenAI(base_url="https://api.openai.com/v1", api_key="sk-xxx")

After (two-line change):

import openai
client = openai.OpenAI(
    base_url=f"{RELAX_ROLLOUT_URL}/v1",                         # ① swap endpoint
    default_headers={"X-Relax-Trajectory-Id": task_id},         # ② tag trajectory
)
# ... the rest of the agent code is unchanged ...

At the end of a trajectory the agent reports reward/termination:

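# The finish endpoint is not part of the OpenAI API, so call it with a plain HTTP client.
import httpx
httpx.post(f"{RELAX_ROLLOUT_URL}/v1/trajectories/{task_id}/finish",
           json={"reward": my_reward, "success": True})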

Result: grep -r "relax" agent_app/ returns zero matches.

4.3 Alternative — Python front-end (verl/AReaL style)

For users who prefer an in-process coroutine (no subprocess, no HTTP overhead, easy debugging):

# examples/agentic_gsm8k/agent_loop.py
from relax.engine.rollout.agentic import AgenticRollout, register_agentic

@register_agentic("gsm8k_react")
class GSM8KReActAgent(AgenticRollout):
    async def run(self, sample, engine):
        messages = [{"role": "user", "content": sample.prompt}]
        for turn in range(5):
            resp = await engine.agenerate(messages, tools=TOOLS)  # remote, async
            messages.append(resp.message)
            if resp.finish_reason == "stop":
                break
            tool_result = await run_tool(resp.tool_calls[0])
            messages.append({"role": "tool", "content": tool_result})
        return {"reward": grade(messages[-1].content, sample.answer),
                "success": True}

This is semantically identical to the HTTP path; both produce the same Sample objects downstream. The framework picks the front-end based on --agent-launcher-type:

  • python_coroutine → Python front-end (max throughput, like verl AgentLoop)
  • subprocess / ray_actor / docker / http → HTTP front-end (max isolation, for real agent apps)

4.4 End-to-end sequence diagram

sequenceDiagram
    participant T as Trainer (existing)
    participant M as AgenticRolloutManager
    participant W as AgentWorker #k
    participant A as Agent App (user)
    participant P as AgenticRolloutProxy
    participant E as SGLang Engine

    T->>M: generate_rollout_agentic(prompts)
    loop for each prompt
        M->>W: spawn(trajectory_id, prompt, proxy_url)
        W->>A: exec agent_app.py (env: URL + TID + PROMPT)
    end

    loop agent main loop
        A->>P: POST /v1/chat/completions (X-Relax-Trajectory-Id)
        P->>P: diff messages → new_input_ids, loss_mask=0
        P->>E: async generate(tokens, mm_inputs)
        E-->>P: response_ids, logprobs, routed_experts
        P->>P: append response, loss_mask=1
        P-->>A: OpenAI-format completion
        A->>A: run_tool(tool_calls)
    end

    A->>P: POST /v1/trajectories/{id}/finish (reward, success)
    P->>P: sample.status = COMPLETED, put on queue
    M->>P: drain(N)
    P-->>M: List[Sample]
    M-->>T: build_rollout_fn_output(...)

5. Implementation Plan

Step 1 · Add AgenticRolloutProxy component (3–4 days)

New file: relax/engine/rollout/agentic/proxy.py

class AgenticRolloutProxy:
    """FastAPI ingress implementing OpenAI-compatible /v1/chat/completions.
    Each request is traced by X-Relax-Trajectory-Id and token-level data is
    accumulated into per-trajectory Sample objects."""

    def __init__(self, args, sglang_router_url):
        self._samples: dict[str, Sample] = {}          # trajectory_id -> Sample
        self._pending: asyncio.Queue[Sample] = asyncio.Queue()
        self._tokenizer = load_tokenizer(args)
        self._processor = load_processor(args)

    @app.post("/v1/chat/completions")
    async def chat_completions(self, body, headers):
        tid = headers["X-Relax-Trajectory-Id"]
        sample = self._samples.setdefault(tid, Sample(id=tid, tokens=[], loss_mask=[]))
        # 1. diff messages -> new input tokens (loss_mask=0)
        new_input_ids, mm_inputs = self._diff_and_encode(sample, body["messages"])
        sample.tokens.extend(new_input_ids)
        sample.loss_mask.extend([0] * len(new_input_ids))
        # 2. forward to SGLang engine, receive response + logprobs
        resp = await self._sglang_generate(sample.tokens, body["sampling_params"], mm_inputs)
        # 3. append response (loss_mask=1, real logprobs)
        sample.tokens.extend(resp.tokens)
        sample.loss_mask.extend([1] * len(resp.tokens))
        sample.rollout_log_probs.extend(resp.logprobs)
        # 4. return OpenAI-format response to the agent
        return _to_openai_format(resp)

    @app.post("/v1/trajectories/{tid}/finish")
    async def finish(self, tid, body):
        sample = self._samples.pop(tid)
        sample.reward = body["reward"]
        sample.status = (
            Sample.Status.COMPLETED if body.get("success")
            else Sample.Status.TRUNCATED
        )
        await self._pending.put(sample)

    async def drain(self, n: int) -> list[Sample]:
        return [await self._pending.get() for _ in range(n)]
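The `_diff_and_encode` step is referenced above but not shown. One possible sketch of its message-level half follows, assuming the agent resends the full conversation on every request so that the recorded history is always an exact prefix of the incoming list. `diff_messages` is a hypothetical helper; it uses a plain prefix comparison rather than a hash, and tokenization of the returned delta is left out.

```python
def diff_messages(seen: list[dict], incoming: list[dict]) -> list[dict]:
    """Return the messages in `incoming` that have not been recorded yet.

    Raises ValueError when the recorded history is not a prefix of the
    request, e.g. because the agent rewrote or truncated earlier turns.
    """
    if incoming[: len(seen)] != seen:
        raise ValueError("conversation history diverged from recorded prefix")
    return incoming[len(seen):]


# Turn 1: nothing recorded yet, so the whole request is new input.
seen: list[dict] = []
req1 = [{"role": "user", "content": "What is 2 + 2?"}]
delta1 = diff_messages(seen, req1)

# The proxy also records the reply it generated, so the assistant message
# echoed back in the next request is not encoded a second time.
seen = req1 + [{"role": "assistant", "content": "calling calculator"}]

# Turn 2: only the tool result is new.
req2 = seen + [{"role": "tool", "content": "4"}]
delta2 = diff_messages(seen, req2)
```

Only `delta1` and `delta2` would need to be tokenized and appended with loss_mask=0; a diverged history is surfaced as an error instead of silently corrupting the Sample.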

Step 2 · Add AgenticRolloutManager + generate_rollout_agentic (2 days)

New files: relax/engine/rollout/agentic/manager.py, relax/engine/rollout/agentic/generate.py

AgenticRolloutManager is the analogue of verl's AgentLoopManager:

@ray.remote
class AgenticRolloutManager:
    def __init__(self, args, proxy_handle, sglang_router_url):
        self._launcher = build_launcher(args.agent_launcher_type)
        self._proxy = proxy_handle
        self._concurrency = args.agent_concurrency or args.rollout_batch_size

    async def run_batch(self, prompts: list[Sample]) -> list[Sample]:
        sem = asyncio.Semaphore(self._concurrency)
        async def _one(p):
            async with sem:
                await self._launcher.start(p, self._proxy.url)
                return await self._proxy.wait(p.id)  # blocks until /finish
        return await asyncio.gather(*[_one(p) for p in prompts])
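The semaphore-plus-gather pattern in run_batch can be exercised in isolation. The sketch below replaces the launcher and proxy with a stub coroutine (`fake_agent`, hypothetical) to illustrate that at most `limit` episodes are in flight while results still come back in input order.

```python
import asyncio


async def run_bounded(jobs, limit: int):
    """Run coroutine factories with at most `limit` in flight; keep input order."""
    sem = asyncio.Semaphore(limit)

    async def _one(job):
        async with sem:
            return await job()

    return await asyncio.gather(*(_one(j) for j in jobs))


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_agent(i: int) -> int:
        # Stands in for "start the agent, then wait for /finish".
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return i * i

    jobs = [lambda i=i: fake_agent(i) for i in range(8)]
    results = await run_bounded(jobs, limit=3)
    return results, peak


results, peak = asyncio.run(_demo())
```

Note that asyncio.gather propagates the first exception by default, so a production manager would probably also need a per-trajectory failure policy.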

And generate.py is the tiny adapter that plugs into the existing rollout entrypoint:

async def generate_rollout_agentic(
    args, rollout_id, data_source, data_system_client, evaluation=False,
):
    """Drop-in replacement for sglang_rollout.generate_rollout that routes
    prompts through the AgenticRolloutManager + Proxy."""
    manager = AgenticRolloutManager.get_or_create(args)
    prompts = data_source.get_samples(args.rollout_batch_size)
    samples = await manager.run_batch.remote(prompts)
    return build_rollout_fn_output(
        samples, rollout_id=rollout_id, args=args,
        data_system_client=data_system_client, evaluation=evaluation,
    )

Step 3 · Agent-launcher abstraction (1.5 days)

New package: relax/engine/rollout/agentic/launchers/

| Launcher | Front-end | Use-case |
| --- | --- | --- |
| python_coroutine.py | Python | AgenticRollout subclass, in-process, fastest (verl-style) |
| subprocess.py | HTTP | Start the agent app as a Python subprocess (most generic) |
| ray_actor.py | HTTP | Ray Actor (GPU-friendly, e.g. browser agents with vision models) |
| docker.py | HTTP | Containerized sandbox (code agents, SWE-agent) |
| http_callback.py | HTTP | The user already runs an agent server; framework just POSTs /run |

Common interface:

class BaseAgentLauncher:
    async def start(self, sample: Sample, proxy_url: str) -> None: ...
    async def kill(self, sample_id: str) -> None: ...
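As an illustration of what a subprocess launcher might look like, here is a sketch that passes the trajectory context through the RELAX_ROLLOUT_URL, RELAX_TRAJECTORY_ID, and RELAX_PROMPT environment variables used by the example agent, and kills the process on timeout. It is an assumption-laden variant, not the proposed implementation: `SubprocessLauncher` is hypothetical, `start` takes an id and prompt instead of a Sample so the block is self-contained, and it waits for the process to exit rather than returning immediately.

```python
import asyncio
import os
import sys


class SubprocessLauncher:
    """Hypothetical launcher: one OS process per trajectory."""

    def __init__(self, argv: list[str], timeout_secs: float = 600.0):
        self._argv = argv
        self._timeout = timeout_secs
        self._procs: dict[str, asyncio.subprocess.Process] = {}

    async def start(self, sample_id: str, prompt: str, proxy_url: str) -> int:
        # The agent learns everything it needs from its environment.
        env = dict(
            os.environ,
            RELAX_ROLLOUT_URL=proxy_url,
            RELAX_TRAJECTORY_ID=sample_id,
            RELAX_PROMPT=prompt,
        )
        proc = await asyncio.create_subprocess_exec(*self._argv, env=env)
        self._procs[sample_id] = proc
        try:
            # Returns the exit code, or raises asyncio.TimeoutError.
            return await asyncio.wait_for(proc.wait(), timeout=self._timeout)
        except asyncio.TimeoutError:
            await self.kill(sample_id)
            raise
        finally:
            self._procs.pop(sample_id, None)

    async def kill(self, sample_id: str) -> None:
        proc = self._procs.get(sample_id)
        if proc is not None and proc.returncode is None:
            proc.kill()
            await proc.wait()  # reap the child so it does not linger
```

A call such as `asyncio.run(SubprocessLauncher([sys.executable, "agent_app.py"]).start("t1", "What is 2 + 2?", proxy_url))` would run one episode; the default timeout mirrors the 600-second default of --agent-timeout-secs.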

Step 4 · New CLI arguments (0.5 day)

In relax/utils/arguments.py, rollout argument group:

parser.add_argument(
    "--rollout-function-path",
    default="relax.engine.rollout.sglang_rollout.generate_rollout",
)
# To use agentic rollout:
#   --rollout-function-path relax.engine.rollout.agentic.generate.generate_rollout_agentic

parser.add_argument(
    "--agent-launcher", type=str, default=None,
    help=("Entrypoint of the agent app. "
          "For python_coroutine: 'examples.agentic_gsm8k.agent_loop:GSM8KReActAgent'. "
          "For subprocess: 'examples.swe_agent.run:main'. "
          "For http: 'http://agent-svc/run'."),
)
parser.add_argument(
    "--agent-launcher-type",
    choices=["python_coroutine", "subprocess", "ray_actor", "docker", "http"],
    default="subprocess",
)
parser.add_argument("--agent-timeout-secs", type=int, default=600)
parser.add_argument(
    "--agent-concurrency", type=int, default=None,
    help="Max concurrent agent instances; default = rollout_batch_size",
)
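The --agent-launcher help text uses a 'module:Attribute' notation for Python entrypoints. How Relax would resolve it is not specified; one common approach, sketched here with a hypothetical `resolve_entrypoint` helper, is to import the module and walk the attribute path.

```python
import importlib


def resolve_entrypoint(spec: str):
    """Resolve 'package.module:attribute' to the Python object it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    obj = importlib.import_module(module_name)
    # Allow dotted attribute paths such as 'pkg.mod:Outer.Inner'.
    for part in attr.split("."):
        obj = getattr(obj, part)
    return obj


# Standard-library targets stand in for a real agent class here.
ordered_dict_cls = resolve_entrypoint("collections:OrderedDict")
join_fn = resolve_entrypoint("os.path:join")
```

URL-style values for the http launcher type should bypass this helper entirely, since a string like 'http://agent-svc/run' also contains a colon.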

Step 5 · First example (1–2 days)

New directory: examples/agentic_gsm8k/

examples/agentic_gsm8k/
├── agent_app.py           # pure user code: a ~50-line ReAct agent using openai SDK
├── run_agentic_gsm8k.sh   # launch script
└── README.md

agent_app.py (no import relax anywhere):

import os

import httpx
import openai

base = os.environ["RELAX_ROLLOUT_URL"]
tid = os.environ["RELAX_TRAJECTORY_ID"]
client = openai.OpenAI(
    base_url=base + "/v1",
    default_headers={"X-Relax-Trajectory-Id": tid},
    api_key="dummy",
)
# TOOLS, run_tool, extract_answer and ground_truth are defined elsewhere in the app.
messages = [{"role": "user", "content": os.environ["RELAX_PROMPT"]}]
answer = ""
for turn in range(5):
    resp = client.chat.completions.create(model="policy", messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    messages.append(msg)
    if resp.choices[0].finish_reason == "stop":
        answer = msg.content or ""   # final answer; stays empty if the turn budget runs out
        break
    call = msg.tool_calls[0]
    messages.append({"role": "tool", "content": run_tool(call), "tool_call_id": call.id})

reward = 1.0 if extract_answer(answer) == ground_truth else 0.0
# The finish endpoint is not part of the OpenAI API, so use a plain HTTP client.
httpx.post(
    f"{base}/v1/trajectories/{tid}/finish",
    json={"reward": reward, "success": True},
)

6. Key Technical Challenges & Mitigations

| Challenge | Mitigation |
| --- | --- |
| Token-level diff of messages | Maintain sample.last_messages_hash; on each new request, tokenize only the delta via tokenizer.apply_chat_template (the same trick used in DeepEyes _encode_observation_for_generation) |
| Tool messages carry no tokens yet | Route tool results through the chat template → encode → loss_mask=0, consistent with the existing DeepEyes path |
| Multimodal inputs (images / video URLs) | Proxy downloads / base64-encodes and injects into SGLang image_data; transparent to the user |
| Partial-rollout compatibility | On abort, proxy keeps the sample state; the next request simply resumes appending |
| Weight-update blocking | When can_do_update_weight_for_async() is active, proxy rejects new completions with HTTP 503; the agent's OpenAI client retries naturally |
| Concurrency & isolation | Per-trajectory asyncio.Lock; proxy-internal asyncio.Queue; reuse the existing sglang_router load balancing |
| Per-step reward attribution | /v1/trajectories/{id}/finish optionally accepts turn_rewards: [...]; written to sample.metadata["step_rewards"] for custom reward functions |
| MoE routing replay | Read routed_experts from SGLang meta_info and accumulate into sample.rollout_routed_experts, identical to the DeepEyes _update_routed_experts path |
| Async / fully-async modes | Proxy lives inside the Rollout process, so it inherits TransferQueue / DCS / staleness semantics for free |
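The weight-update mitigation relies on the client retrying after a 503. Whether an SDK's built-in retry budget outlasts a weight update depends on its configuration, so an agent may want an explicit policy. A generic exponential-backoff sketch follows; `ServiceUnavailable` and `call_with_backoff` are hypothetical stand-ins, not part of any SDK, and the sleep function is injectable so the example runs instantly.

```python
import time


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 returned by the proxy (hypothetical)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call `fn`, retrying on ServiceUnavailable with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the error
            sleep(base_delay * 2 ** attempt)


# Demo: the "proxy" is busy for two calls, then answers.
calls = {"n": 0}
delays = []


def flaky_completion():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ServiceUnavailable()
    return "ok"


result = call_with_backoff(flaky_completion, sleep=delays.append)
```

Here the third attempt succeeds after recorded waits of 0.5 s and 1.0 s.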

7. Testing & Validation Plan

L1 · Unit tests (0.5 day)

| Test | File | Focus |
| --- | --- | --- |
| test_message_diff_encoding | tests/rollout/test_agentic_proxy.py | Incremental messages → correct tokens / loss_mask |
| test_openai_compat | same | chat.completions / tool_calls response is byte-identical to OpenAI |
| test_trajectory_isolation | same | Concurrent trajectories never cross-contaminate |

L2 · Integration smoke test (1 day)

  1. Start Relax rollout with the proxy only (no training loop).
  2. Run agent_app.py + a real SGLang engine against 10 GSM8K prompts.
  3. Assertions:
    • Exactly 10 Samples collected.
    • sum(loss_mask) ≈ #generated tokens.
    • len(sample.rollout_log_probs) == sum(loss_mask).
    • Reward distribution is reasonable.
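The structural assertions above can be captured in a small checker. This is a sketch over plain lists rather than the real Sample class; `check_sample` is a hypothetical name, and it assumes log-probs are recorded only for generated tokens, as in the proxy sketch of Step 1.

```python
def check_sample(tokens, loss_mask, rollout_log_probs):
    """Return a list of invariant violations (empty means consistent)."""
    problems = []
    if len(tokens) != len(loss_mask):
        problems.append("tokens and loss_mask differ in length")
    if any(m not in (0, 1) for m in loss_mask):
        problems.append("loss_mask must contain only 0 or 1")
    if len(rollout_log_probs) != sum(loss_mask):
        problems.append("need exactly one logprob per generated token")
    return problems


good = check_sample(
    tokens=[1, 2, 3, 4, 5],
    loss_mask=[0, 0, 1, 1, 0],
    rollout_log_probs=[-0.1, -0.7],
)
bad = check_sample(
    tokens=[1, 2, 3],
    loss_mask=[0, 1],
    rollout_log_probs=[],
)
```

Returning a list of problems instead of asserting makes it easy to report every broken invariant for a sample at once.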

L3 · End-to-end training (2 days)

  • Hardware: 8×H800 single node (colocate mode), matching the DeepEyes minimal setup.
  • Model: Qwen3-4B (fast iteration).
  • Dataset: GSM8K with tool calling.
  • Baseline: same data, same model, DeepEyes-style rollout (current path).
  • Metrics:
    • Throughput (tokens/s) within 10 % of baseline.
    • Reward-vs-step curve shape matches baseline.
    • After 100 iters, GSM8K pass@1 gap < 1 %.
  • Success criteria: no curve regression, ≥ 70 % LOC reduction for new agent integrations.

L4 · Compatibility matrix (1 day)

| Scenario | Expected |
| --- | --- |
| Colocate + Bridge | |
| Fully-async | ✅ proxy correctly returns 503 during weight update |
| Partial rollout | ✅ aborted samples resume correctly |
| MoE (routed_experts replay) | |
| Multimodal (Qwen3-VL) | ✅ image URLs pass through |
| Elastic rollout scale-out | ✅ proxy is engine-transparent |

L5 · Real agent-framework integration (2–3 days)

Pick 2–3 representative real agents:

  • LangChain ReAct agent (classic tool calling).
  • SWE-agent / OpenHands (code agent with its own complex main loop).
  • browser-use (multimodal + tools).

Target: only base_url changes, zero business-logic change → each runs one full RL iteration successfully.


8. Hardware Plan

This issue is primarily a framework capability; hardware demand equals the existing training demand plus N agent subprocesses.

Development (single engineer)

  • 1× H800 80 GB + 8-core CPU
  • Qwen3-4B, batch = 4, concurrency = 4
  • Agents run as host-local subprocesses (CPU overhead negligible)

Validation (baseline comparison)

  • 8× H800 single node, Qwen3-4B fully-async, batch = 64
  • Agent concurrency = 64 → CPU ≥ 32 cores (tool execution / parsing)

Production (at scale)

  • 32× H800 (4 nodes), Qwen3-32B, batch = 256
  • Recommended: schedule agent workers onto dedicated CPU nodes via Ray placement groups:
    --agent-launcher-type ray_actor \
    --resource '{"actor":[3,8], "rollout":[1,8], "agent_workers":[1,0]}'
  • Network: agent workers ↔ rollout proxy on intra-cluster gRPC/HTTP, < 1 ms latency

Special cases

  • Browser agents: each worker runs a headless Chrome → 1 node ≈ 16 workers (CPU-bound).
  • Code agents: each worker uses a Docker sandbox → use docker_launcher; 1 node ≈ 8 workers.
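The per-node worker densities above translate into node counts with simple ceiling division. The sketch below assumes agent concurrency equals the rollout batch size (the stated default of --agent-concurrency); `agent_nodes_needed` is a hypothetical helper.

```python
import math


def agent_nodes_needed(concurrency: int, workers_per_node: int) -> int:
    """Number of agent nodes required to host `concurrency` workers."""
    return math.ceil(concurrency / workers_per_node)


# Production setup (batch = 256) with the densities quoted above.
browser_nodes = agent_nodes_needed(256, 16)  # headless Chrome, ~16 per node
code_nodes = agent_nodes_needed(256, 8)      # Docker sandbox, ~8 per node
```

Under these assumptions a production run needs about 16 nodes for browser agents or 32 for sandboxed code agents, unless concurrency is capped below the batch size.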

9. Milestones & Effort

| # | Task | Effort | Deliverable |
| --- | --- | --- | --- |
| M1 | Proxy + message-diff encoding | 3 d | Can intercept completions and build Samples correctly |
| M2 | AgenticRolloutManager + subprocess launcher + generate_rollout_agentic | 2 d | One GSM8K agent integration works via HTTP front-end |
| M3 | python_coroutine launcher + AgenticRollout base class | 1 d | verl/AReaL-style front-end works (same Sample output) |
| M4 | L1 + L2 tests green | 1 d | Unit + smoke tests pass |
| M5 | L3 end-to-end training parity | 2 d | Curves match the DeepEyes baseline |
| M6 | Ray-actor + docker launchers | 2 d | Production-grade launchers |
| M7 | L5 real-agent PoCs (LangChain, SWE-agent) | 3 d | 2 real-agent example projects |
| M8 | Docs + migration guide | 1 d | docs/agentic-rollout.md; DeepEyes path marked legacy |
| Total | | ~15 working days | |

10. Risks & Rollback

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| Token-level message diff breaks when chat templates change | Medium | LCS-based token alignment as a fallback + rich assertions |
| OpenAI API compatibility edge cases (streaming, n > 1) | Medium | Initially support only stream=false, n=1; reject others with HTTP 400 |
| Agent subprocess leaks / timeouts | Low | Launcher-level kill on timeout + Ray worker-level GC |
| Regression in the existing DeepEyes path | Low | New path is a separate module; switchable via --rollout-function-path; old path preserved |
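The "stream=false, n=1 only" restriction can be enforced with a small guard at the top of the completions handler. A sketch follows, assuming the request body has already been parsed into a dict; `validate_completion_request` is a hypothetical helper, and the non-empty messages check is an extra assumption beyond what the table states.

```python
def validate_completion_request(body: dict) -> tuple[int, str]:
    """Return (http_status, message) for an incoming chat-completions body."""
    if body.get("stream", False):
        return 400, "stream=true is not supported yet"
    if body.get("n", 1) != 1:
        return 400, "only n=1 is supported"
    if not body.get("messages"):
        return 400, "messages must be a non-empty list"
    return 200, "ok"


accepted = validate_completion_request(
    {"model": "policy", "messages": [{"role": "user", "content": "hi"}]}
)
rejected_stream = validate_completion_request(
    {"messages": [{"role": "user", "content": "hi"}], "stream": True}
)
rejected_n = validate_completion_request(
    {"messages": [{"role": "user", "content": "hi"}], "n": 4}
)
```

Rejecting unsupported options explicitly keeps a trajectory from silently recording only one of several sampled completions.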

Rollback strategy: all changes are additive; the only edit to an existing file is the four new CLI arguments in arguments.py. If anything misbehaves, flip --rollout-function-path back to relax.engine.rollout.sglang_rollout.generate_rollout.

