You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Agentic Rollout — Training Real Agent Apps with Strict App–RL-Framework Separation and Lightweight API Integration
1. Motivation
Modern RL-with-LLM workflows increasingly revolve around real agent applications — LangChain / LangGraph agents, SWE-agent, OpenHands, browser-use, Claude-Code-style assistants, and in-house tool-using agents. These apps already ship as independently runnable services with their own main loops, tool dispatch, state management, and (often) UIs.
The question this issue asks is: how do we train these agents with RL without forcing them to rewrite themselves against our framework?
Today Relax has a working Agentic rollout (examples/deepeyes/rollout.py + BaseInteractionEnv), but integrating a new agent app into training is heavy. We want:
Real agent apps — not just synthetic single-turn chat, but multi-turn, tool-calling, branching, possibly multi-agent workflows.
Strict app ↔ RL-framework separation — the agent code does not import the RL framework; the framework does not understand the agent's business logic (how many tools, how to parse them, how to keep state).
Lightweight API integration — ideally a couple of lines of code, or just a change of base_url, is all it takes to plug an existing agent into training.
2. Current State & Pain Points
Evidence collected from the code base (examples/deepeyes/, relax/engine/rollout/):
Every new agent example today has to copy-paste this file, patch DEFAULT_ENV_MODULE, and maintain its own drift-prone fork.
2.2 Env layer — BaseInteractionEnv.step(response_text) is a synchronous callback
# Current model (examples/deepeyes/base_env.py)env.reset()
forturninrange(max_turns):
response=llm_generate(tokens) # framework drives the loopobs, done, info=env.step(response) # env is a passive callbacktokens+=encode(obs)
This does not fit:
Agent shape
Why the current model hurts
Browser / Code agents
They already have while not done: plan → act → observe; being forced back into a callback requires major surgery
LangGraph / DAG agents
Control flow is a graph, not a linear step
Multi-agent setups
Multiple agents call the LLM independently — cannot be mapped to a single env
Async agents
_process_env_step supports awaitable, but execution is still step-wise & sequential
2.3 Intrusion points — env must understand framework internals
format_observation() must return a chat-message dict (role + content list).
The env has to know the tokenizer and the multimodal encoding rules.
The env must mutate framework-private fields like sample.metadata["_env_current_image"] to make partial-rollout recovery work.
Net effect: "integrate a new agent" today means "read 500+ lines of framework code, copy half of it, and learn our internal Sample layout". That is neither light-weight nor separated.
3. Lessons from the Industry (verl / AReaL / OpenRLHF)
Before proposing a design, we align with existing open-source practice. Three frameworks have converged on a very similar pattern:
verl introduced AgentLoop to solve exactly the problem this issue raises:
A AgentLoopManager (Ray actor) owns N AgentLoopWorker processes.
Each Worker hosts a pluggable BaseAgentLoop subclass (ToolAgentLoop, SingleTurnAgentLoop, user-defined MyAgentLoop) — this is where business logic lives.
The Worker talks to a headless vLLM/SGLang server via AsyncLLMServerManager.generate(...), which is an OpenAI-style async generate call.
The framework never looks inside the loop; it only consumes the returned AgentLoopOutput(prompt_ids, response_ids, response_mask, num_turns, metrics).
Key insight: the user only writes one class with one async def run(...) method. Everything else (batching, dispatching, weight sync, load-balancing SGLang) is framework-owned.
finish(reward, success) call, or return from user coroutine
Framework import in user code?
1 base class
1 base class
1 base class
0 (HTTP) or 1 base class (Python)
Takeaway for Relax: we should offer two interoperable front-ends on top of the same backend:
Zero-dep HTTP front-end (our unique selling point, for LangChain / SWE-agent / browser-use / external apps).
verl/AReaL-style Python front-end (AgenticRollout subclass) for users who do want a thin in-process coroutine and max throughput.
Both drop into the same AgenticRolloutProxy / AgentLoopManager internals, so there is one backend to maintain.
4. Proposed Design — Inverted Control (IoC) Agentic Rollout
Core idea: Invert the control flow. Instead of the framework driving the env, the agent app drives itself and calls the framework's LLM endpoint over an OpenAI-compatible HTTP API (or a thin Python coroutine à la verl/AReaL). The framework passively records the trajectory.
4.1 Three-layer architecture (aligned with verl / AReaL)
importopenaiclient=openai.OpenAI(
base_url=f"{RELAX_ROLLOUT_URL}/v1", # ① swap endpointdefault_headers={"X-Relax-Trajectory-Id": task_id}, # ② tag trajectory
)
# ... the rest of the agent code is unchanged ...
At the end of a trajectory the agent reports reward/termination:
This is semantically identical to the HTTP path; both produce the same Sample objects downstream. The framework picks the front-end based on --agent-launcher-type:
python_coroutine → Python front-end (max throughput, like verl AgentLoop)
subprocess / ray_actor / docker / http → HTTP front-end (max isolation, for real agent apps)
4.4 End-to-end sequence diagram
sequenceDiagram
participant T as Trainer (existing)
participant M as AgenticRolloutManager
participant W as AgentWorker #k
participant A as Agent App (user)
participant P as AgenticRolloutProxy
participant E as SGLang Engine
T->>M: generate_rollout_agentic(prompts)
loop for each prompt
M->>W: spawn(trajectory_id, prompt, proxy_url)
W->>A: exec agent_app.py (env: URL + TID + PROMPT)
end
loop agent main loop
A->>P: POST /v1/chat/completions (X-Relax-Trajectory-Id)
P->>P: diff messages → new_input_ids, loss_mask=0
P->>E: async generate(tokens, mm_inputs)
E-->>P: response_ids, logprobs, routed_experts
P->>P: append response, loss_mask=1
P-->>A: OpenAI-format completion
A->>A: run_tool(tool_calls)
end
A->>P: POST /v1/trajectories/{id}/finish (reward, success)
P->>P: sample.status = COMPLETED, put on queue
M->>P: drain(N)
P-->>M: List[Sample]
M-->>T: build_rollout_fn_output(...)
Maintain sample.last_messages_hash; on each new request, tokenize only the delta via tokenizer.apply_chat_template — the same trick used in DeepEyes _encode_observation_for_generation
Tool messages carry no tokens yet
Route tool results through the chat template → encode → loss_mask=0, consistent with the existing DeepEyes path
Multimodal inputs (images / video URLs)
Proxy downloads / base64-encodes and injects into SGLang image_data; transparent to the user
Partial-rollout compatibility
On abort, proxy keeps the sample state; the next request simply resumes appending
Weight-update blocking
When can_do_update_weight_for_async() is active, proxy rejects new completions with HTTP 503; the agent's OpenAI client retries naturally
Concurrency & isolation
Per-trajectory asyncio.Lock; proxy-internal asyncio.Queue; reuse the existing sglang_router load balancing
Per-step reward attribution
/v1/trajectories/{id}/finish optionally accepts turn_rewards: [...]; written to sample.metadata["step_rewards"] for custom reward functions
MoE routing replay
Read routed_experts from SGLang meta_info and accumulate into sample.rollout_routed_experts, identical to the DeepEyes _update_routed_experts path
Async / fully-async modes
Proxy lives inside the Rollout process, so it inherits TransferQueue / DCS / staleness semantics for free
7. Testing & Validation Plan
L1 · Unit tests (0.5 day)
Test
File
Focus
test_message_diff_encoding
tests/rollout/test_agentic_proxy.py
Incremental messages → correct tokens / loss_mask
test_openai_compat
same
chat.completions / tool_calls response is byte-identical to OpenAI
test_trajectory_isolation
same
Concurrent trajectories never cross-contaminate
L2 · Integration smoke test (1 day)
Start Relax rollout with the proxy only (no training loop).
Run agent_app.py + a real SGLang engine against 10 GSM8K prompts.
Assertions:
Exactly 10 Samples collected.
sum(loss_mask) ≈ #generated tokens.
len(sample.rollout_log_probs) == sum(loss_mask).
Reward distribution is reasonable.
L3 · End-to-end training (2 days)
Hardware: 8×H800 single node (colocate mode), matching the DeepEyes minimal setup.
Model: Qwen3-4B (fast iteration).
Dataset: GSM8K with tool calling.
Baseline: same data, same model, DeepEyes-style rollout (current path).
Metrics:
Throughput (tokens/s) within 10 % of baseline.
Reward-vs-step curve shape matches baseline.
After 100 iters, GSM8K pass@1 gap < 1 %.
Success criteria: no curve regression, ≥ 70 % LOC reduction for new agent integrations.
L4 · Compatibility matrix (1 day)
Scenario
Expected
Colocate + Bridge
✅
Fully-async
✅ proxy correctly returns 503 during weight update
Partial rollout
✅ aborted samples resume correctly
MoE (routed_experts replay)
✅
Multimodal (Qwen3-VL)
✅ image URLs pass through
Elastic rollout scale-out
✅ proxy is engine-transparent
L5 · Real agent-framework integration (2–3 days)
Pick 2–3 representative real agents:
LangChain ReAct agent (classic tool calling).
SWE-agent / OpenHands (code agent with its own complex main loop).
browser-use (multimodal + tools).
Target: only base_url changes, zero business-logic change → each runs one full RL iteration successfully.
8. Hardware Plan
This issue is primarily a framework capability; hardware demand is existing-training-demand plus N agent subprocesses.
Development (single engineer)
1× H800 80 GB + 8-core CPU
Qwen3-4B, batch = 4, concurrency = 4
Agents run as host-local subprocesses (CPU overhead negligible)
Validation (baseline comparison)
8× H800 single node, Qwen3-4B fully-async, batch = 64
Token-level message diff breaks when chat templates change
Medium
LCS-based token alignment as a fallback + rich assertions
OpenAI API compatibility edge cases (streaming, n > 1)
Medium
Initially support only stream=false, n=1; reject others with HTTP 400
Agent subprocess leaks / timeouts
Low
Launcher-level kill on timeout + Ray worker-level GC
Regression in the existing DeepEyes path
Low
New path is a separate module; switchable via --rollout-function-path; old path preserved
Rollback strategy: all changes are additive except four CLI arguments in arguments.py. If anything misbehaves, flip --rollout-function-path back to relax.engine.rollout.sglang_rollout.generate_rollout.
Agentic Rollout — Training Real Agent Apps with Strict App–RL-Framework Separation and Lightweight API Integration
1. Motivation
Modern RL-with-LLM workflows increasingly revolve around real agent applications — LangChain / LangGraph agents, SWE-agent, OpenHands, browser-use, Claude-Code-style assistants, and in-house tool-using agents. These apps already ship as independently runnable services with their own main loops, tool dispatch, state management, and (often) UIs.
The question this issue asks is: how do we train these agents with RL without forcing them to rewrite themselves against our framework?
Today Relax has a working Agentic rollout (
examples/deepeyes/rollout.py+BaseInteractionEnv), but integrating a new agent app into training is heavy. We want:base_url, is all it takes to plug an existing agent into training.2. Current State & Pain Points
Evidence collected from the code base (
examples/deepeyes/,relax/engine/rollout/):2.1 Rollout layer — users copy 400+ lines of "framework code"
examples/deepeyes/rollout.pyis ~564 LOC. ~85 % of it is generic multi-turn scaffolding:loss_maskconstructionEvery new agent example today has to copy-paste this file, patch
DEFAULT_ENV_MODULE, and maintain its own drift-prone fork.2.2 Env layer —
BaseInteractionEnv.step(response_text)is a synchronous callbackThis does not fit:
while not done: plan → act → observe; being forced back into a callback requires major surgery_process_env_stepsupportsawaitable, but execution is still step-wise & sequential2.3 Intrusion points — env must understand framework internals
format_observation()must return a chat-message dict (role + content list).sample.metadata["_env_current_image"]to make partial-rollout recovery work.Net effect: "integrate a new agent" today means "read 500+ lines of framework code, copy half of it, and learn our internal
Samplelayout". That is neither light-weight nor separated.3. Lessons from the Industry (verl / AReaL / OpenRLHF)
Before proposing a design, we align with existing open-source practice. Three frameworks have converged on a very similar pattern:
3.1 verl —
AgentLoop(a.k.a. "rollout-as-a-service")verl introduced
AgentLoopto solve exactly the problem this issue raises:AgentLoopManager(Ray actor) owns NAgentLoopWorkerprocesses.BaseAgentLoopsubclass (ToolAgentLoop,SingleTurnAgentLoop, user-definedMyAgentLoop) — this is where business logic lives.AsyncLLMServerManager.generate(...), which is an OpenAI-style async generate call.AgentLoopOutput(prompt_ids, response_ids, response_mask, num_turns, metrics).Key insight: the user only writes one class with one
async def run(...)method. Everything else (batching, dispatching, weight sync, load-balancing SGLang) is framework-owned.3.2 AReaL —
Controller+LLMAgentAReaL splits concerns even more explicitly:
RolloutControllerholds aWorkflowExecutorwith an input queue & result queue.RolloutWorkflow.arun_episode(engine, data)coroutine — this is the agent.arun_episode, the user callsengine.agenerate(...)(an HTTP client to the remote inference server) as many times as they want, in any shape.Trajectoryobjects; the trainer side callscontroller.prepare_batch()to drain.Key insight: the agent is just an async Python coroutine. No callback, no state machine, no message-diff trickery.
3.3 OpenRLHF —
ExperienceMakerThinner, synchronous, but the same idea: the agent logic is a callable owned by the user and is invoked by the framework with a remote LLM handle.
3.4 Convergent pattern
AgentLoopsubclassRolloutWorkflowcoroutineExperienceMakerAgenticRollout(user class) OR OpenAI-base_url (zero-code)AsyncLLMServerManager.generateengine.agenerateAsyncEngineClientAgentLoopWorker)WorkflowExecutor)AgenticRolloutProxy)AgentLoopOutputTrajectoryfinish(reward, success)call, or return from user coroutineTakeaway for Relax: we should offer two interoperable front-ends on top of the same backend:
AgenticRolloutsubclass) for users who do want a thin in-process coroutine and max throughput.Both drop into the same
AgenticRolloutProxy/AgentLoopManagerinternals, so there is one backend to maintain.4. Proposed Design — Inverted Control (IoC) Agentic Rollout
Core idea: Invert the control flow. Instead of the framework driving the env, the agent app drives itself and calls the framework's LLM endpoint over an OpenAI-compatible HTTP API (or a thin Python coroutine à la verl/AReaL). The framework passively records the trajectory.
4.1 Three-layer architecture (aligned with verl / AReaL)
4.2 Deployment diagram
3.2 User integration — lightweight by construction
Before (an existing agent app):
After (two-line change):
At the end of a trajectory the agent reports reward/termination:
Result:
grep -r "relax" agent_app/returns zero matches.4.3 Alternative — Python front-end (verl/AReaL style)
For users who prefer an in-process coroutine (no subprocess, no HTTP overhead, easy debugging):
This is semantically identical to the HTTP path; both produce the same
Sampleobjects downstream. The framework picks the front-end based on--agent-launcher-type:python_coroutine→ Python front-end (max throughput, like verl AgentLoop)subprocess/ray_actor/docker/http→ HTTP front-end (max isolation, for real agent apps)4.4 End-to-end sequence diagram
sequenceDiagram participant T as Trainer (existing) participant M as AgenticRolloutManager participant W as AgentWorker #k participant A as Agent App (user) participant P as AgenticRolloutProxy participant E as SGLang Engine T->>M: generate_rollout_agentic(prompts) loop for each prompt M->>W: spawn(trajectory_id, prompt, proxy_url) W->>A: exec agent_app.py (env: URL + TID + PROMPT) end loop agent main loop A->>P: POST /v1/chat/completions (X-Relax-Trajectory-Id) P->>P: diff messages → new_input_ids, loss_mask=0 P->>E: async generate(tokens, mm_inputs) E-->>P: response_ids, logprobs, routed_experts P->>P: append response, loss_mask=1 P-->>A: OpenAI-format completion A->>A: run_tool(tool_calls) end A->>P: POST /v1/trajectories/{id}/finish (reward, success) P->>P: sample.status = COMPLETED, put on queue M->>P: drain(N) P-->>M: List[Sample] M-->>T: build_rollout_fn_output(...)5. Implementation Plan
Step 1 · Add
AgenticRolloutProxycomponent (3–4 days)New file:
relax/engine/rollout/agentic/proxy.pyStep 2 · Add
AgenticRolloutManager+generate_rollout_agentic(2 days)New files:
relax/engine/rollout/agentic/manager.py,relax/engine/rollout/agentic/generate.pyAgenticRolloutManageris the analogue of verl'sAgentLoopManager:And
generate.pyis the tiny adapter that plugs into the existing rollout entrypoint:Step 3 · Agent-launcher abstraction (1.5 days)
New package:
relax/engine/rollout/agentic/launchers/python_coroutine.pyAgenticRolloutsubclass, in-process, fastest (verl-style)subprocess.pyray_actor.pydocker.pyhttp_callback.py/runCommon interface:
Step 4 · New CLI arguments (0.5 day)
In
relax/utils/arguments.py, rollout argument group:Step 5 · First example (1–2 days)
New directory:
examples/agentic_gsm8k/agent_app.py— noimport relaxanywhere:6. Key Technical Challenges & Mitigations
sample.last_messages_hash; on each new request, tokenize only the delta viatokenizer.apply_chat_template— the same trick used in DeepEyes_encode_observation_for_generationloss_mask=0, consistent with the existing DeepEyes pathimage_data; transparent to the usercan_do_update_weight_for_async()is active, proxy rejects new completions with HTTP 503; the agent's OpenAI client retries naturallyasyncio.Lock; proxy-internalasyncio.Queue; reuse the existing sglang_router load balancing/v1/trajectories/{id}/finishoptionally acceptsturn_rewards: [...]; written tosample.metadata["step_rewards"]for custom reward functionsrouted_expertsfrom SGLangmeta_infoand accumulate intosample.rollout_routed_experts, identical to the DeepEyes_update_routed_expertspath7. Testing & Validation Plan
L1 · Unit tests (0.5 day)
test_message_diff_encodingtests/rollout/test_agentic_proxy.pytest_openai_compattest_trajectory_isolationL2 · Integration smoke test (1 day)
agent_app.py+ a real SGLang engine against 10 GSM8K prompts.sum(loss_mask) ≈ #generated tokens.len(sample.rollout_log_probs) == sum(loss_mask).L3 · End-to-end training (2 days)
L4 · Compatibility matrix (1 day)
L5 · Real agent-framework integration (2–3 days)
Pick 2–3 representative real agents:
Target: only
base_urlchanges, zero business-logic change → each runs one full RL iteration successfully.8. Hardware Plan
This issue is primarily a framework capability; hardware demand is existing-training-demand plus N agent subprocesses.
Development (single engineer)
Validation (baseline comparison)
Production (at scale)
--agent-launcher-type ray_actor \ --resource '{"actor":[3,8], "rollout":[1,8], "agent_workers":[1,0]}'Special cases
docker_launcher; 1 node ≈ 8 workers.9. Milestones & Effort
AgenticRolloutManager+ subprocess launcher +generate_rollout_agenticpython_coroutinelauncher +AgenticRolloutbase classdocs/agentic-rollout.md; DeepEyes path marked legacy10. Risks & Rollback
n > 1)stream=false, n=1; reject others with HTTP 400--rollout-function-path; old path preservedRollback strategy: all changes are additive except four CLI arguments in
arguments.py. If anything misbehaves, flip--rollout-function-pathback torelax.engine.rollout.sglang_rollout.generate_rollout.