[RFC] Agent Abstractions and Trajectory Gateway for VERL
Summary
This RFC proposes two new abstractions for VERL's agent-based reinforcement learning pipeline:
- AgentFramework: an abstract base class for agent lifecycle management and reward computation, replacing the current tight coupling between AgentLoopManager and specific agent implementations.
- AgentGateway: a Gateway subsystem owned by the serving layer (LLMServerManager) that intercepts agent LLM calls via the OpenAI Chat Completions API, performs canonical tokenization, and assembles token-level trajectory data with strict token-truth guarantees.
Together, they enable any OpenAI-compatible agent system to be integrated into VERL's training loop without modifications to agent code, while producing continuous multi-turn trajectories with loss masks directly consumable by VERL's training engine.
Motivation
VERL's current agent integration (AgentLoopManager + AgentLoopBase) tightly couples three concerns: LLM infrastructure management, agent lifecycle, and trajectory collection. This creates friction when integrating new agent types:
- Each new agent framework requires dedicated adapter code inside the agent loop.
- Trajectory collection logic is embedded in the agent loop itself, making it non-reusable.
- Only coroutine-based agents are natively supported; subprocess and remote agents require ad-hoc integration (e.g., SWE-Agent's custom ModelProxy).
Community contributions such as AWS AgentCore (PR #4216) and Aliyun Remote Agent (Issue #5737) further demonstrate the need for a pluggable agent abstraction that cleanly separates these concerns.
This RFC addresses these issues by:
- Defining AgentFramework as a thin standard interface for agent-based rollout. The common contract is generate_sequences(prompts: DataProto) -> DataProto; internal execution structure, reward computation flow, and batching strategy remain implementation-specific.
- Extracting trajectory collection into AgentGateway, a serving-side subsystem that works with any Framework implementation. The Gateway handles tokenization, prefix consistency, and trajectory assembly, all of which are orthogonal to how agents are launched or managed.
- Extracting infrastructure management (LLM server initialization, load balancing) out of the framework layer, so that AgentFramework remains focused on trainer-facing generation semantics rather than serving ownership.
Design Overview
Architecture
VERL Training Loop
│
├── AgentFramework
│     Agent lifecycle management
│     Reward computation
│     Batch orchestration + DataProto assembly
│
└── Serving Runtime
      Owns: LLMServerManager, load balancer, Gateway subsystem
      ├── GatewayManager (internal session-routing component)
      ├── Gateway Actor 1 (Ray actor, FastAPI)
      ├── Gateway Actor 2 (Ray actor, FastAPI)
      └── Gateway Actor N (Ray actor, FastAPI)
AgentFramework and AgentGateway are independent — the Gateway does not know which Framework implementation is using it, and the Framework does not know the Gateway's internal trajectory assembly logic. They interact only through a well-defined session API.
To avoid single-point bottlenecks, multiple Gateway instances run as Ray actors, each hosting a FastAPI HTTP server. An AgentGatewayManager routes session creation requests across Gateway actors (e.g., round-robin or least-loaded). Once a session is created on a specific Gateway actor, all subsequent requests for that session are pinned to that actor. This follows the same pattern as VERL's existing GlobalRequestLoadBalancer for LLM server replicas.
Data Flow
A single session proceeds as follows:
- Framework creates a session on the AgentGatewayManager, which selects a Gateway actor and returns a session-specific base_url.
- Framework starts the agent (subprocess, coroutine, or remote call), injecting the base_url as the agent's LLM endpoint.
- Agent makes standard OpenAI Chat Completion requests to the assigned Gateway actor. On each request, the Gateway tokenizes, checks prefix consistency, routes to the inference backend, records the token-level interaction, and returns a standard OpenAI response. The agent is unaware of the interception.
- Agent completes (process exits, coroutine returns, or calls the optional /complete endpoint).
- Framework finalizes the session via the AgentGatewayManager, receiving the assembled trajectories.
- Framework computes trajectory-aligned rewards and packages the resulting samples into DataProto.
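The steps above can be sketched from the Framework's side roughly as follows. This is an illustration under assumptions, not the RFC's implementation: FakeGatewayManager, run_agent, the session_id field on GatewaySession, and the dict-shaped trajectories are hypothetical stand-ins. Only the create, run, finalize (or abort) ordering comes from the text.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class GatewaySession:
    session_id: str  # assumed field; the RFC only states that base_url is present
    base_url: str


class FakeGatewayManager:
    """Hypothetical in-memory stand-in for AgentGatewayManager."""

    def __init__(self) -> None:
        self.finalized: list[str] = []
        self.aborted: list[str] = []

    async def create_session(self, session_id: str) -> GatewaySession:
        # URL shape follows the example given in the RFC; host and port are made up.
        return GatewaySession(session_id, f"http://gateway-0:8000/sessions/{session_id}/v1")

    async def finalize_session(self, session_id: str) -> list[dict]:
        self.finalized.append(session_id)
        return [{"session_id": session_id, "response_ids": [1, 2, 3]}]

    async def abort_session(self, session_id: str) -> None:
        self.aborted.append(session_id)


async def run_agent(base_url: str) -> float:
    """Placeholder agent: a real one would send chat completions to base_url."""
    await asyncio.sleep(0)
    return 1.0  # pretend task score


async def run_session(manager: FakeGatewayManager, session_id: str) -> list[dict]:
    # Create the session and obtain the session-specific endpoint.
    session = await manager.create_session(session_id)
    try:
        # Start the agent with the injected endpoint and wait for it to complete.
        score = await run_agent(session.base_url)
    except Exception:
        # On agent failure, discard the partial session state.
        await manager.abort_session(session_id)
        raise
    # Finalize to receive the assembled trajectories, then attach rewards.
    trajectories = await manager.finalize_session(session_id)
    for traj in trajectories:
        traj["reward_info"] = {"score": score}
    return trajectories
```

Calling asyncio.run(run_session(FakeGatewayManager(), "s0")) returns one trajectory dict carrying the attached reward. Packaging into DataProto is left out because its layout is outside this sketch.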
AgentFramework
Interface
from abc import ABC, abstractmethod

from verl import DataProto


class AgentFramework(ABC):
    @abstractmethod
    async def generate_sequences(self, prompts: DataProto) -> DataProto:
        """Process a trainer batch and return a training-ready DataProto."""
        ...
AgentFramework is intentionally thin. Users are expected to implement agent lifecycle management, reward computation, and batch orchestration through the generate_sequences interface. Common patterns may be factored into subclasses and helpers where useful.
This keeps the interface compatible with heterogeneous execution models:
- VERL-native agent loops
- subprocess-based OpenAI-compatible agents
- remote services and cloud-hosted agent frameworks
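As one possible shape for a concrete implementation, the sketch below fans each prompt out into its own concurrent rollout and gathers the results in order. It rests on stated assumptions: a plain list of dicts stands in for DataProto, and EchoAgentFramework, _rollout_one, and max_concurrency are hypothetical names that do not appear in the RFC.

```python
import asyncio
from abc import ABC, abstractmethod

Batch = list[dict]  # stand-in for DataProto, which is not modeled here


class AgentFramework(ABC):
    @abstractmethod
    async def generate_sequences(self, prompts: Batch) -> Batch: ...


class EchoAgentFramework(AgentFramework):
    """Hypothetical subclass: one rollout per prompt, bounded concurrency."""

    def __init__(self, max_concurrency: int = 4) -> None:
        self._max_concurrency = max_concurrency

    async def _rollout_one(self, sem: asyncio.Semaphore, prompt: dict) -> dict:
        async with sem:
            # A real implementation would create a Gateway session, run the
            # agent, finalize the session, and compute a reward here.
            await asyncio.sleep(0)
            return {"uid": prompt["uid"], "response": prompt["text"].upper(), "reward": 1.0}

    async def generate_sequences(self, prompts: Batch) -> Batch:
        sem = asyncio.Semaphore(self._max_concurrency)
        # asyncio.gather returns results in the order the awaitables were passed.
        results = await asyncio.gather(*(self._rollout_one(sem, p) for p in prompts))
        return list(results)
```

Because the contract is only generate_sequences, the concurrency limit and per-prompt fan-out stay private to the subclass, which is the flexibility the thin interface is meant to preserve.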
Reward Computation
Reward assignment remains a shared concern, but it is not a required abstract method on AgentFramework.
Typical patterns include:
- Framework-collected. When the framework can directly access agent output (subprocess stdout, coroutine return value), it parses the relevant information locally.
- Agent-uploaded. When the agent runs remotely and the framework has no direct access to its output, the agent uploads reward_info via the Gateway's optional /complete endpoint, and the framework reads it back during session finalization.
- Helper-normalized. Framework implementations may reuse shared helpers to normalize one session-level reward into per-trajectory rewards, validate trajectory/reward alignment, and assemble DataProto.
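A helper of the "helper-normalized" kind might look like the sketch below. The function name normalize_rewards and the broadcast policy (every trajectory in a session receives the same scalar) are assumptions made for illustration; the RFC does not prescribe how a session-level reward is split.

```python
from typing import Sequence, Union


def normalize_rewards(
    reward: Union[float, Sequence[float]], num_trajectories: int
) -> list[float]:
    """Return exactly one reward per trajectory, or raise on misalignment."""
    if num_trajectories <= 0:
        raise ValueError("a session must produce at least one trajectory")
    if isinstance(reward, (int, float)):
        # Assumed policy: broadcast a single session-level reward to all trajectories.
        return [float(reward)] * num_trajectories
    rewards = [float(r) for r in reward]
    if len(rewards) != num_trajectories:
        raise ValueError(
            f"got {len(rewards)} rewards for {num_trajectories} trajectories"
        )
    return rewards
```

For example, normalize_rewards(0.5, 3) yields three copies of 0.5, while passing two rewards for three trajectories raises ValueError instead of silently misaligning rows.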
Reference Implementations
OpenAICompatibleAgentFramework is the preferred first implementation target. It launches or contacts an OpenAI-compatible agent, injects a session base_url, and relies on Gateway-backed /v1/chat/completions traffic as the trajectory truth source.
CliAgentFramework is a straightforward variant that launches external agent programs as subprocesses, injecting the Gateway session URL via the OPENAI_BASE_URL environment variable. Any agent that uses the OpenAI Chat Completions API works without code changes. Completion is detected via process exit.
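The subprocess launch described above could be sketched as follows. launch_cli_agent is a hypothetical helper, not part of the RFC, and the placeholder OPENAI_API_KEY is an assumption: many OpenAI clients refuse to start without some key even when the endpoint ignores it.

```python
import os
import subprocess
import sys


def launch_cli_agent(
    cmd: list[str], base_url: str, timeout: float = 60.0
) -> subprocess.CompletedProcess:
    """Run an external agent with its LLM endpoint pointed at a Gateway session."""
    env = dict(os.environ)
    env["OPENAI_BASE_URL"] = base_url
    # Assumption: supply a dummy key only if the caller has not set one.
    env.setdefault("OPENAI_API_KEY", "EMPTY")
    # subprocess.run returns once the process exits, which is the completion signal.
    return subprocess.run(cmd, env=env, capture_output=True, text=True, timeout=timeout)


if __name__ == "__main__":
    # Stand-in "agent" that only prints the endpoint it was given.
    result = launch_cli_agent(
        [sys.executable, "-c", "import os; print(os.environ['OPENAI_BASE_URL'])"],
        "http://127.0.0.1:8000/sessions/s0/v1",
    )
    print(result.returncode, result.stdout.strip())
```

Since the endpoint is passed through the environment, the agent program itself does not need to know that it is talking to a Gateway.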
AgentLoopManager remains an important follow-up migration target, but it is not part of the first implementation milestone. Supporting it cleanly will likely require a token-request ingress in addition to the chat-completions path described in this RFC.
Custom implementations can support other execution models such as remote services or cloud-hosted frameworks by using the same session API. For remote agents without an external notification channel, the Gateway provides wait_for_completion() to block until the agent signals completion via /complete.
Migration from AgentLoopManager
The current AgentLoopManager remains on the legacy path in the first implementation stage.
The staged plan is:
- First, land a Gateway-backed chat-completions path for OpenAI-compatible / remote-style agents.
- Then, add token-request ingress as a dedicated extension point.
- Finally, migrate AgentLoopManager and related VERL-native agent loops onto the new serving/Gateway model without introducing double trajectory bookkeeping in production.
AgentGateway
Overview
Each AgentGateway instance is a Ray actor running a FastAPI HTTP server that exposes the OpenAI Chat Completions API. It manages multiple concurrent sessions, each maintaining independent trajectory state. The Gateway is the single canonical tokenization authority for the chat-completions path — all messages -> token_ids conversions happen here, using the inference backend's tokenizer and chat template.
In the first implementation stage, /v1/chat/completions is the primary ingress. A token-request ingress for legacy VERL-native paths is reserved as a follow-up extension.
Multiple Gateway actors are managed by an AgentGatewayManager, which handles session routing and provides a unified interface to the Framework.
Gateway Manager and Scaling
Each Gateway actor manages its own sessions independently. The manager's only responsibility is routing — selecting which actor handles a new session and forwarding subsequent calls to the correct actor.
from __future__ import annotations

import itertools


class AgentGatewayManager:
    """Manages multiple Gateway actors with session routing."""

    def __init__(self, gateways: list[AgentGateway]):
        # `gateways` holds Ray actor handles, so methods are invoked via `.remote()`.
        self.gateways = gateways
        self._session_to_gateway: dict[str, AgentGateway] = {}
        self._round_robin = itertools.cycle(gateways)

    def _select_gateway(self) -> AgentGateway:
        """Illustrative round-robin policy; least-loaded is an alternative."""
        return next(self._round_robin)

    async def create_session(self, session_id: str) -> GatewaySession:
        """Select a Gateway actor (e.g., round-robin) and create a session."""
        gateway = self._select_gateway()
        session = await gateway.create_session.remote(session_id)
        # Pin the session so all later calls reach the same actor.
        self._session_to_gateway[session_id] = gateway
        return session

    async def finalize_session(self, session_id: str) -> list[Trajectory]:
        """Route to the correct Gateway actor and finalize."""
        gateway = self._session_to_gateway.pop(session_id)
        return await gateway.finalize_session.remote(session_id)

    async def abort_session(self, session_id: str) -> None:
        """Discard the session if it exists; unknown IDs are ignored."""
        gateway = self._session_to_gateway.pop(session_id, None)
        if gateway:
            await gateway.abort_session.remote(session_id)

    async def wait_for_completion(self, session_id: str, timeout: float) -> None:
        gateway = self._session_to_gateway[session_id]
        await gateway.wait_for_completion.remote(session_id, timeout)
Session Management
The Gateway provides a session API that the Framework uses to manage session lifecycles:
@ray.remote
class AgentGateway:
    def __init__(
        self,
        tokenizer: AutoTokenizer,
        chat_template: str,
        backend: InferenceBackend,
        config: GatewayConfig,
    ): ...

    async def create_session(self, session_id: str) -> GatewaySession:
        """Create a trajectory session."""
        ...

    async def finalize_session(self, session_id: str) -> list[Trajectory]:
        """Assemble and return trajectories, clean up session state.

        Returns one trajectory per prefix-consistent segment."""
        ...

    async def abort_session(self, session_id: str) -> None:
        """Discard session state."""
        ...

    async def wait_for_completion(self, session_id: str, timeout: float) -> None:
        """Block until agent calls /complete. For remote agents only."""
        ...
create_session returns a GatewaySession containing a session-specific base_url (e.g., http://{gateway_host}:{port}/sessions/{session_id}/v1). The agent uses this URL for all LLM calls, and the Gateway routes requests to the correct session by URL path. This approach requires no special headers or client modifications — the agent simply uses a different base URL.
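Path-based routing of this kind can be illustrated with a few lines of plain Python. make_base_url and route are hypothetical helpers written for this sketch; a FastAPI path parameter would fill the same role in the actual server.

```python
import re
from typing import Optional

# Matches /sessions/<session_id>/<rest>, e.g. /sessions/abc/v1/chat/completions
_SESSION_PATH = re.compile(r"^/sessions/(?P<session_id>[^/]+)/(?P<rest>.+)$")


def make_base_url(host: str, port: int, session_id: str) -> str:
    """Build the session-specific base URL handed to the agent."""
    return f"http://{host}:{port}/sessions/{session_id}/v1"


def route(path: str) -> Optional[tuple[str, str]]:
    """Split a request path into (session_id, endpoint), or None if it is not a session path."""
    match = _SESSION_PATH.match(path)
    if match is None:
        return None
    return match.group("session_id"), "/" + match.group("rest")
```

An OpenAI-style client appends /chat/completions to its base URL, so a client configured with make_base_url("10.0.0.1", 8000, "abc") sends requests whose path route() resolves to session "abc" and endpoint /v1/chat/completions.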
HTTP Endpoints
The Gateway exposes two endpoints per session:
POST /sessions/{id}/v1/chat/completions — Standard OpenAI Chat Completions. The agent calls this as its normal LLM endpoint. The Gateway intercepts the request, performs tokenization, routes to the inference backend, records the interaction, and returns a standard response. This is the mandatory first-stage endpoint.
POST /sessions/{id}/complete — Optional. Allows the agent to explicitly signal session completion and optionally upload reward-related information. This is useful for remote agents that have no other completion notification channel, or for VERL-native agent loops that want to pass structured results. Agents that do not call this endpoint are unaffected — Framework detects completion through other means (process exit, coroutine return, etc.).
Request Handling
On each Chat Completion request, the Gateway performs:
- Message-level prefix check. Compare the incoming normalized messages with the session's recorded message history. If the prefix matches, only the new incremental messages are tokenized and appended to the accumulated token sequence. If the prefix does not match, the current trajectory is finalized and a new trajectory begins with full tokenization.
- Inference routing. Send prompt_ids to the inference backend via its token-level generation API (e.g., AsyncLLMServerManager). Receive response_ids and logprobs.
- Interaction recording. Record the turn's prompt_ids, response_ids, and logprobs. Update the session's accumulated token sequence and message history.
- Response reconstruction. Detokenize the response and construct a standard OpenAI Chat Completion response for the agent. When tool calling is enabled, structured tool_calls fields are reconstructed from the raw token output.
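The message-level prefix check and incremental tokenization can be sketched as below. This is a toy model under assumptions: toy_tokenize replaces the real chat-template tokenizer with a per-character encoding that concatenates cleanly across messages (real templates need more care at message boundaries), and SessionState, prepare_prompt, and record_response are hypothetical names.

```python
from dataclasses import dataclass, field

Message = dict  # {"role": ..., "content": ...}


def toy_tokenize(messages: list[Message]) -> list[int]:
    """Stand-in tokenizer: one token per character of '<role>content'."""
    return [ord(c) for m in messages for c in f"<{m['role']}>{m['content']}"]


@dataclass
class SessionState:
    history: list[Message] = field(default_factory=list)
    token_ids: list[int] = field(default_factory=list)
    closed: list[list[int]] = field(default_factory=list)  # finalized trajectories


def prepare_prompt(state: SessionState, messages: list[Message]) -> list[int]:
    """Return prompt_ids for this request, extending or restarting the trajectory."""
    n = len(state.history)
    if messages[:n] == state.history:
        # Prefix matches: tokenize only the new messages and append them.
        state.token_ids += toy_tokenize(messages[n:])
    else:
        # Prefix mismatch: close the current trajectory and start a new one.
        if state.token_ids:
            state.closed.append(state.token_ids)
        state.token_ids = toy_tokenize(messages)
    state.history = list(messages)
    return list(state.token_ids)


def record_response(state: SessionState, assistant: Message, response_ids: list[int]) -> None:
    """Store generated tokens verbatim so later prefixes reuse them unchanged."""
    state.history.append(assistant)
    state.token_ids += response_ids
```

Appending response_ids as generated, rather than re-tokenizing the assistant message on the next request, is what keeps the accumulated sequence identical to what the model actually produced.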
Trajectory Output
After a session completes, finalize_session assembles the recorded interactions into a list[Trajectory]. Each trajectory is a prefix-consistent, continuous token sequence that constitutes an independent training sample:
from dataclasses import dataclass


@dataclass
class Trajectory:
    uid: str                        # Each prompt has a unique uuid from the dataset
    session_id: int                 # Each sample in a group has a session_id in [0, n)
    trajectory_id: int              # Each session outputs m trajectories, indexed in [0, m)
    reward_info: dict
    prompt_ids: list[int]
    response_ids: list[int]
    response_logprobs: list[float]
    loss_mask: list[int]            # 1 for response tokens, 0 for prompt
    ...
A session produces multiple trajectories when the Gateway detects a message prefix mismatch mid-session, possibly due to context compression, skill switching, or other agent-side context rewrites. The Gateway does not need to understand why the context changed; it only enforces consistency within each trajectory. In the final DataProto output, each trajectory becomes one row.
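To make the assembly step concrete, the sketch below merges the recorded turns of one prefix-consistent segment into a single sample. It assumes a particular layout that the RFC does not spell out: the first turn's prompt becomes prompt_ids, everything afterwards is concatenated into response_ids, and loss_mask is aligned with response_ids, with 0 for tokens the model did not generate (for example tool results or later user messages). Turn and assemble are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    prompt_ids: list[int]    # full prompt sent to the backend for this turn
    response_ids: list[int]  # tokens generated in this turn
    logprobs: list[float]    # one log-probability per generated token


def assemble(turns: list[Turn]) -> dict:
    """Merge prefix-consistent turns into one continuous training sample."""
    if not turns:
        raise ValueError("at least one turn is required")
    prompt_ids = list(turns[0].prompt_ids)
    response_ids: list[int] = []
    logprobs: list[float] = []
    loss_mask: list[int] = []
    for turn in turns:
        if len(turn.logprobs) != len(turn.response_ids):
            raise ValueError("logprobs must align with response_ids")
        seen = prompt_ids + response_ids
        if turn.prompt_ids[: len(seen)] != seen:
            raise ValueError("turn is not prefix-consistent with earlier turns")
        # Tokens added to the context since the previous turn were not generated.
        delta = turn.prompt_ids[len(seen):]
        response_ids += delta + turn.response_ids
        logprobs += [0.0] * len(delta) + turn.logprobs
        loss_mask += [0] * len(delta) + [1] * len(turn.response_ids)
    return {
        "prompt_ids": prompt_ids,
        "response_ids": response_ids,
        "response_logprobs": logprobs,
        "loss_mask": loss_mask,
    }
```

With turns ([1, 2] -> [3, 4]) and ([1, 2, 3, 4, 5] -> [6]), token 5 is an injected context token, so the assembled response_ids are [3, 4, 5, 6] with loss_mask [1, 1, 0, 1].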
Extensions
Multimodal Support
The Gateway architecture supports multimodal inputs via an optional preprocessor. When present, the Gateway applies the multimodal preprocessor during tokenization and stores processor outputs alongside token sequences. Specific processor adapters will be added as model support grows.
Tool Call Reconstruction
For models that produce tool calls via special tokens, the Gateway uses a configurable tool parser to reconstruct structured tool_calls fields in the OpenAI response returned to the agent. The raw token sequence is always preserved as-is for training.
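As an illustration of what a tool parser does, the sketch below handles one common convention, in which the model wraps a JSON object with "name" and "arguments" keys in <tool_call> tags. The tag format, the parse_tool_calls name, and the generated call IDs are assumptions for this example. The RFC leaves the parser configurable and does not fix a format.

```python
import json
import re
import uuid

_TOOL_CALL = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> tuple[str, list[dict]]:
    """Split decoded model output into plain content and OpenAI-style tool_calls."""
    calls: list[dict] = []

    def _replace(match: re.Match) -> str:
        try:
            payload = json.loads(match.group(1))
            name = payload["name"]
        except (json.JSONDecodeError, KeyError):
            # Leave malformed blocks in the content instead of dropping them.
            return match.group(0)
        calls.append(
            {
                "id": f"call_{uuid.uuid4().hex[:8]}",
                "type": "function",
                "function": {
                    "name": name,
                    # The OpenAI schema carries arguments as a JSON-encoded string.
                    "arguments": json.dumps(payload.get("arguments", {})),
                },
            }
        )
        return ""

    content = _TOOL_CALL.sub(_replace, text).strip()
    return content, calls
```

Only the response returned to the agent is restructured this way; the token IDs recorded for training are untouched, consistent with the raw sequence being preserved as-is.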
Token-Request Ingress
A token-request ingress is reserved for follow-up work so that legacy VERL-native paths such as AgentLoopManager can migrate onto the same Gateway/session model without keeping legacy trajectory bookkeeping as a production truth source. This extension is explicitly out of scope for the first implementation stage.
Prefix-Sharing Storage
In scenarios with repeated sampling or partial context overlap, multiple trajectories may share common prefixes. Tree-structured storage could reduce memory and disk usage, but training-side benefits depend on algorithm-level support (e.g., DTA). This is deferred to future work pending further analysis.
Update on May 11
PR #6299 is the most up-to-date draft, superseding PR #5931.
What's done:
- The core implementation for AgentGateway, including the gateway actor and the gateway serving runtime.
- A high-level example implementation of an OpenAI-request-compatible framework.
- Adaptation to the new main_ppo_sync.py entry point based on Transfer Queue.
- Multimodal and tool parsing support.
- Many AI-generated unit tests and a few smoke tests that may need to be trimmed down.
WIP:
Future directions: