[RFC] Agent Abstractions and Trajectory Gateway for VERL #5790

@zackcxb

Summary

This RFC proposes two new abstractions for VERL's agent-based reinforcement learning pipeline:

  1. AgentFramework — an abstract base class for agent lifecycle management and reward computation, replacing the current tight coupling between AgentLoopManager and specific agent implementations.
  2. AgentGateway — a Gateway subsystem owned by the serving layer (LLMServerManager) that intercepts agent LLM calls via the OpenAI Chat Completions API, performs canonical tokenization, and assembles token-level trajectory data with strict token-truth guarantees.

Together, they enable any OpenAI-compatible agent system to be integrated into VERL's training loop without modifications to agent code, while producing continuous multi-turn trajectories with loss masks directly consumable by VERL's training engine.

Motivation

VERL's current agent integration (AgentLoopManager + AgentLoopBase) tightly couples three concerns: LLM infrastructure management, agent lifecycle, and trajectory collection. This creates friction when integrating new agent types:

  • Each new agent framework requires dedicated adapter code inside the agent loop.
  • Trajectory collection logic is embedded in the agent loop itself, making it non-reusable.
  • Only coroutine-based agents are natively supported; subprocess and remote agents require ad-hoc integration (e.g., SWE-Agent's custom ModelProxy).

Community contributions such as AWS AgentCore (PR #4216) and Aliyun Remote Agent (Issue #5737) further demonstrate the need for a pluggable agent abstraction that cleanly separates these concerns.

This RFC addresses these issues by:

  1. Defining AgentFramework as a thin standard interface for agent-based rollout. The common contract is generate_sequences(prompts: DataProto) -> DataProto; internal execution structure, reward computation flow, and batching strategy remain implementation-specific.
  2. Extracting trajectory collection into AgentGateway, a serving-side subsystem that works with any Framework implementation. The Gateway handles tokenization, prefix consistency, and trajectory assembly — concerns that are orthogonal to how agents are launched or managed.
  3. Extracting infrastructure management (LLM server initialization, load balancing) out of the framework layer, so that AgentFramework remains focused on trainer-facing generation semantics rather than serving ownership.

Design Overview

Architecture

VERL Training Loop
│ 
│
├── AgentFramework
│     Agent lifecycle management
│     Reward computation
│     Batch orchestration + DataProto assembly
│
└── Serving Runtime
      Owns: LLMServerManager, load balancer, Gateway subsystem
      ├── AgentGatewayManager (internal session-routing component)
      ├── Gateway Actor 1 (Ray actor, FastAPI)
      ├── Gateway Actor 2 (Ray actor, FastAPI)
      └── Gateway Actor N (Ray actor, FastAPI)

AgentFramework and AgentGateway are independent — the Gateway does not know which Framework implementation is using it, and the Framework does not know the Gateway's internal trajectory assembly logic. They interact only through a well-defined session API.

To avoid single-point bottlenecks, multiple Gateway instances run as Ray actors, each hosting a FastAPI HTTP server. An AgentGatewayManager routes session creation requests across Gateway actors (e.g., round-robin or least-loaded). Once a session is created on a specific Gateway actor, all subsequent requests for that session are pinned to that actor. This follows the same pattern as VERL's existing GlobalRequestLoadBalancer for LLM server replicas.
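As a rough illustration of the routing and pinning behavior described above, here is a minimal least-loaded router sketch. The class name LeastLoadedRouter, the string gateway identifiers, and the release method are hypothetical and only stand in for real Ray actor handles; the RFC does not prescribe this policy.

```python
class LeastLoadedRouter:
    """Hypothetical sketch: pick the gateway with the fewest open sessions."""

    def __init__(self, gateway_ids: list[str]):
        self._load = {g: 0 for g in gateway_ids}   # open sessions per gateway
        self._pinned: dict[str, str] = {}          # session_id -> gateway

    def assign(self, session_id: str) -> str:
        # A session that already exists stays pinned to its gateway.
        if session_id in self._pinned:
            return self._pinned[session_id]
        gateway = min(self._load, key=self._load.get)
        self._load[gateway] += 1
        self._pinned[session_id] = gateway
        return gateway

    def release(self, session_id: str) -> None:
        # Called when a session is finalized or aborted.
        gateway = self._pinned.pop(session_id)
        self._load[gateway] -= 1
```

Ties are broken by insertion order here; a production router would likely also account for in-flight request counts rather than session counts alone.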

Data Flow

A single session proceeds as follows:

  1. Framework creates a session on the AgentGatewayManager, which selects a Gateway actor and returns a session-specific base_url.
  2. Framework starts the agent (subprocess, coroutine, or remote call), injecting the base_url as the agent's LLM endpoint.
  3. Agent makes standard OpenAI Chat Completion requests to the assigned Gateway actor. On each request, the Gateway tokenizes, checks prefix consistency, routes to the inference backend, records the token-level interaction, and returns a standard OpenAI response. The agent is unaware of the interception.
  4. Agent completes (process exits, coroutine returns, or calls the optional /complete endpoint).
  5. Framework finalizes the session via the AgentGatewayManager, receiving the assembled trajectories.
  6. Framework computes trajectory-aligned rewards and packages the resulting samples into DataProto.
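The six steps above can be sketched end to end with an in-memory stand-in for the manager. FakeGatewayManager, FakeSession, record_turn, run_agent, and the toy reward rule are all hypothetical names invented for this illustration; in the actual design, turn recording happens inside the Gateway when the agent issues chat-completions requests.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeSession:
    session_id: str
    base_url: str


@dataclass
class FakeGatewayManager:
    """In-memory stand-in for AgentGatewayManager (illustration only)."""
    recorded: dict = field(default_factory=dict)

    async def create_session(self, session_id: str) -> FakeSession:
        self.recorded[session_id] = []
        url = f"http://127.0.0.1:8000/sessions/{session_id}/v1"
        return FakeSession(session_id, url)

    async def record_turn(self, session_id: str, turn: dict) -> None:
        # In the RFC this is done by the Gateway on each chat request.
        self.recorded[session_id].append(turn)

    async def finalize_session(self, session_id: str) -> list:
        return self.recorded.pop(session_id)


async def run_agent(base_url: str, manager: FakeGatewayManager, session_id: str) -> str:
    # Step 3: a real agent would POST to f"{base_url}/chat/completions".
    await manager.record_turn(session_id, {"prompt_ids": [1, 2], "response_ids": [3]})
    return "done"  # step 4: the coroutine returns


async def rollout_one(manager: FakeGatewayManager, session_id: str) -> dict:
    session = await manager.create_session(session_id)               # step 1
    result = await run_agent(session.base_url, manager, session_id)  # steps 2-4
    trajectories = await manager.finalize_session(session_id)        # step 5
    reward = 1.0 if result == "done" else 0.0                        # step 6 (toy reward)
    return {"trajectories": trajectories, "reward": reward}
```

Calling asyncio.run(rollout_one(FakeGatewayManager(), "s0")) walks through one session; packaging the result into DataProto is left out because it depends on trainer-side conventions.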

AgentFramework

Interface

from abc import ABC, abstractmethod

from verl import DataProto


class AgentFramework(ABC):
    @abstractmethod
    async def generate_sequences(self, prompts: DataProto) -> DataProto:
        """Process a trainer batch and return a training-ready DataProto."""
        ...

AgentFramework is intentionally thin. Implementations are expected to handle agent lifecycle management, reward computation, and batch orchestration inside generate_sequences. Common patterns may be factored into subclasses and helpers where useful.

This keeps the interface compatible with heterogeneous execution models:

  • VERL-native agent loops
  • subprocess-based OpenAI-compatible agents
  • remote services and cloud-hosted agent frameworks

Reward Computation

Reward assignment remains a shared concern, but it is not a required abstract method on AgentFramework.

Typical patterns include:

  • Framework-collected. When the framework can directly access agent output (subprocess stdout, coroutine return value), it parses the relevant information locally.
  • Agent-uploaded. When the agent runs remotely and the framework has no direct access to its output, the agent uploads reward_info via the Gateway's optional /complete endpoint, and the framework reads it back during session finalization.
  • Helper-normalized. Framework implementations may reuse shared helpers to normalize one session-level reward into per-trajectory rewards, validate trajectory/reward alignment, and assemble DataProto.
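One possible shape for the helper-normalized pattern is sketched below. The function names are hypothetical, and broadcasting the same session-level reward to every trajectory is an assumption made for illustration; other policies (for example, crediting only the last trajectory) are equally plausible and the RFC does not choose one.

```python
def broadcast_session_reward(session_reward: float, num_trajectories: int) -> list[float]:
    """Assumed policy: every trajectory in the session receives the same reward."""
    if num_trajectories <= 0:
        raise ValueError("a finalized session must contain at least one trajectory")
    return [float(session_reward)] * num_trajectories


def validate_alignment(trajectories: list, rewards: list[float]) -> None:
    """Fail loudly if rewards and trajectories cannot be paired one-to-one."""
    if len(trajectories) != len(rewards):
        raise ValueError(
            f"{len(trajectories)} trajectories but {len(rewards)} rewards"
        )
```

A framework would typically call validate_alignment right before assembling DataProto rows, so that a mismatch surfaces at rollout time rather than inside the trainer.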

Reference Implementations

OpenAICompatibleAgentFramework is the preferred first implementation target. It launches or contacts an OpenAI-compatible agent, injects a session base_url, and relies on Gateway-backed /v1/chat/completions traffic as the trajectory truth source.

CliAgentFramework is a straightforward variant that launches external agent programs as subprocesses, injecting the Gateway session URL via the OPENAI_BASE_URL environment variable. Any agent that uses the OpenAI Chat Completions API works without code changes. Completion is detected via process exit.
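A minimal sketch of the subprocess launch, assuming the agent reads its endpoint from OPENAI_BASE_URL as described. The function name launch_cli_agent and the timeout default are hypothetical; the child command in the usage note is a stand-in that only echoes the variable.

```python
import os
import subprocess
import sys


def launch_cli_agent(cmd: list[str], base_url: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
    """Run an external agent with the Gateway session URL injected via the environment."""
    env = dict(os.environ)             # copy so the parent environment is untouched
    env["OPENAI_BASE_URL"] = base_url  # session-specific Gateway endpoint
    # Completion is detected by process exit: run() returns once the child terminates.
    return subprocess.run(cmd, env=env, capture_output=True, text=True, timeout=timeout)
```

For example, launch_cli_agent([sys.executable, "-c", "import os; print(os.environ['OPENAI_BASE_URL'])"], url) runs a child that prints the injected URL. Some OpenAI clients also require an API key to be set, which a real framework would need to supply as well.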

AgentLoopManager remains an important follow-up migration target, but it is not part of the first implementation milestone. Supporting it cleanly will likely require a token-request ingress in addition to the chat-completions path described in this RFC.

Custom implementations can support other execution models such as remote services or cloud-hosted frameworks by using the same session API. For remote agents without an external notification channel, the Gateway provides wait_for_completion() to block until the agent signals completion via /complete.
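The blocking behavior of wait_for_completion() could be built on an asyncio.Event per session, as in the sketch below. CompletionSignal, open, and complete are hypothetical names; in the RFC the signal is raised by the agent calling the /complete endpoint, which is simulated here by a direct method call.

```python
import asyncio


class CompletionSignal:
    """Hypothetical per-session completion tracker (illustration only)."""

    def __init__(self):
        self._events: dict[str, asyncio.Event] = {}
        self.reward_info: dict[str, dict] = {}

    def open(self, session_id: str) -> None:
        # Call from within the running event loop, e.g. in create_session.
        self._events[session_id] = asyncio.Event()

    def complete(self, session_id: str, reward_info: dict = None) -> None:
        # What a /complete handler would do: store the upload, then wake waiters.
        self.reward_info[session_id] = reward_info or {}
        self._events[session_id].set()

    async def wait_for_completion(self, session_id: str, timeout: float) -> None:
        # Raises asyncio.TimeoutError if the agent never signals completion.
        await asyncio.wait_for(self._events[session_id].wait(), timeout=timeout)
```

On timeout the caller would most likely abort the session rather than finalize it, though that policy is left to the framework.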

Migration from AgentLoopManager

The current AgentLoopManager remains on the legacy path in the first implementation stage.

The staged plan is:

  • First, land a Gateway-backed chat-completions path for OpenAI-compatible / remote-style agents.
  • Then, add token-request ingress as a dedicated extension point.
  • Finally, migrate AgentLoopManager and related VERL-native agent loops onto the new serving/Gateway model without introducing double trajectory bookkeeping in production.

AgentGateway

Overview

Each AgentGateway instance is a Ray actor running a FastAPI HTTP server that exposes the OpenAI Chat Completions API. It manages multiple concurrent sessions, each maintaining independent trajectory state. The Gateway is the single canonical tokenization authority for the chat-completions path — all messages -> token_ids conversions happen here, using the inference backend's tokenizer and chat template.

In the first implementation stage, /v1/chat/completions is the primary ingress. A token-request ingress for legacy VERL-native paths is reserved as a follow-up extension.

Multiple Gateway actors are managed by an AgentGatewayManager, which handles session routing and provides a unified interface to the Framework.

Gateway Manager and Scaling

Each Gateway actor manages its own sessions independently. The manager's only responsibility is routing — selecting which actor handles a new session and forwarding subsequent calls to the correct actor.

import itertools


class AgentGatewayManager:
    """Manages multiple Gateway actors with session routing."""

    def __init__(self, gateways: list[AgentGateway]):
        # `gateways` holds Ray actor handles, one per Gateway actor.
        if not gateways:
            raise ValueError("at least one Gateway actor is required")
        self.gateways = gateways
        self._round_robin = itertools.cycle(gateways)
        self._session_to_gateway: dict[str, AgentGateway] = {}

    def _select_gateway(self) -> AgentGateway:
        """Round-robin selection; a least-loaded policy could be swapped in."""
        return next(self._round_robin)

    async def create_session(self, session_id: str) -> GatewaySession:
        """Select a Gateway actor and create a session pinned to it."""
        gateway = self._select_gateway()
        session = await gateway.create_session.remote(session_id)
        self._session_to_gateway[session_id] = gateway
        return session

    async def finalize_session(self, session_id: str) -> list[Trajectory]:
        """Route to the correct Gateway actor and finalize."""
        gateway = self._session_to_gateway.pop(session_id)
        return await gateway.finalize_session.remote(session_id)

    async def abort_session(self, session_id: str) -> None:
        """Discard the session if it exists; unknown sessions are ignored."""
        gateway = self._session_to_gateway.pop(session_id, None)
        if gateway is not None:
            await gateway.abort_session.remote(session_id)

    async def wait_for_completion(self, session_id: str, timeout: float) -> None:
        """Block until the agent signals completion on its pinned actor."""
        gateway = self._session_to_gateway[session_id]
        await gateway.wait_for_completion.remote(session_id, timeout)

Session Management

The Gateway provides a session API for Framework to manage session lifecycles:

@ray.remote
class AgentGateway:

    def __init__(
        self,
        tokenizer: AutoTokenizer,
        chat_template: str,
        backend: InferenceBackend,
        config: GatewayConfig,
    ): ...

    async def create_session(self, session_id: str) -> GatewaySession:
        """Create a trajectory session."""
        ...

    async def finalize_session(self, session_id: str) -> list[Trajectory]:
        """Assemble and return trajectories, clean up session state.
        Returns one trajectory per prefix-consistent segment."""
        ...

    async def abort_session(self, session_id: str) -> None:
        """Discard session state."""
        ...

    async def wait_for_completion(self, session_id: str, timeout: float) -> None:
        """Block until agent calls /complete. For remote agents only."""
        ...

create_session returns a GatewaySession containing a session-specific base_url (e.g., http://{gateway_host}:{port}/sessions/{session_id}/v1). The agent uses this URL for all LLM calls, and the Gateway routes requests to the correct session by URL path. This approach requires no special headers or client modifications — the agent simply uses a different base URL.
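To make the URL-based routing concrete, the sketch below builds a session base_url in the format given above and recovers the session ID from a request path. Both function names are hypothetical, and the parsing rule is an assumption about how a Gateway might dispatch requests.

```python
from urllib.parse import urlparse


def build_session_base_url(host: str, port: int, session_id: str) -> str:
    """Format from the RFC: http://{gateway_host}:{port}/sessions/{session_id}/v1"""
    return f"http://{host}:{port}/sessions/{session_id}/v1"


def session_id_from_path(path: str) -> str:
    """Extract the session ID from paths such as /sessions/{id}/v1/chat/completions."""
    parts = path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "sessions":
        raise ValueError(f"not a session-scoped path: {path!r}")
    return parts[1]
```

Because the session ID lives in the path, an unmodified OpenAI client that appends /chat/completions to its base URL lands on the right session without extra headers.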

HTTP Endpoints

The Gateway exposes two endpoints per session:

POST /sessions/{id}/v1/chat/completions — Standard OpenAI Chat Completions. The agent calls this as its normal LLM endpoint. The Gateway intercepts the request, performs tokenization, routes to the inference backend, records the interaction, and returns a standard response. This is the mandatory first-stage endpoint.

POST /sessions/{id}/complete — Optional. Allows the agent to explicitly signal session completion and optionally upload reward-related information. This is useful for remote agents that have no other completion notification channel, or for VERL-native agent loops that want to pass structured results. Agents that do not call this endpoint are unaffected — Framework detects completion through other means (process exit, coroutine return, etc.).

Request Handling

On each Chat Completion request, the Gateway performs:

  1. Message-level prefix check. Compare the incoming normalized messages with the session's recorded message history. If the prefix matches, only the new incremental messages are tokenized and appended to the accumulated token sequence. If the prefix does not match, the current trajectory is finalized and a new trajectory begins with full tokenization.
  2. Inference routing. Send prompt_ids to the inference backend via its token-level generation API (e.g. AsyncLLMServerManager). Receive response_ids and logprobs.
  3. Interaction recording. Record the turn's prompt_ids, response_ids, and logprobs. Update the session's accumulated token sequence and message history.
  4. Response reconstruction. Detokenize the response and construct a standard OpenAI Chat Completion response for the agent. When tool calling is enabled, structured tool_calls fields are reconstructed from the raw token output.
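The message-level prefix check in step 1 might look like the following sketch. The normalize function and the choice of compared fields (role, content, tool_calls) are assumptions made for illustration; the RFC does not specify the normalization rules.

```python
def normalize(message: dict) -> dict:
    # Assumption: only these fields participate in prefix comparison.
    return {key: message.get(key) for key in ("role", "content", "tool_calls")}


def check_prefix(history: list[dict], incoming: list[dict]) -> tuple[bool, list[dict]]:
    """Return (prefix_ok, messages_to_tokenize).

    prefix_ok=True:  only the new tail needs incremental tokenization.
    prefix_ok=False: the current trajectory ends and the full incoming
                     message list is tokenized to start a new one.
    """
    hist = [normalize(m) for m in history]
    inc = [normalize(m) for m in incoming]
    if len(inc) >= len(hist) and inc[: len(hist)] == hist:
        return True, inc[len(hist):]
    return False, inc
```

With a history of system, user, and assistant messages, appending a tool message yields (True, [tool message]), while an agent that rewrites its system prompt yields (False, all incoming messages).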

Trajectory Output

After a session completes, finalize_session assembles the recorded interactions into a list[Trajectory]. Each trajectory is a prefix-consistent, continuous token sequence that constitutes an independent training sample:

@dataclass
class Trajectory:
    uid: str                         # Each prompt has a unique uuid from dataset
    session_id: int                  # Each group sampling has a session_id: [0, n)
    trajectory_id: int               # Each sampling outputs m trajectories: [0, m)
    reward_info: dict

    prompt_ids: list[int]
    response_ids: list[int]
    response_logprobs: list[float]
    loss_mask: list[int]             # 1 for response tokens, 0 for prompt
    ...

A session produces multiple trajectories when the Gateway detects a message prefix mismatch mid-session, possibly due to context compression, skill switching, or other agent-side context rewrites. The Gateway does not need to understand why the context changed; it only enforces consistency within each trajectory. In the final DataProto output, each trajectory becomes one row.
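The sketch below shows one way recorded turns could be folded into a single continuous sequence with a loss mask. It assumes that each turn stores the full prompt sent to the backend, that context tokens added between model responses (tool outputs, new user messages) are placed in response_ids with mask 0, and that those tokens receive a placeholder logprob of 0.0. The Turn class and assemble_trajectory are hypothetical names, not part of the proposed API.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    prompt_ids: list[int]    # full context sent to the backend for this turn
    response_ids: list[int]  # tokens generated by the model
    logprobs: list[float]    # one logprob per generated token


def assemble_trajectory(turns: list[Turn]) -> dict:
    prompt_ids = list(turns[0].prompt_ids)
    accumulated = list(prompt_ids)
    response_ids, response_logprobs, loss_mask = [], [], []
    for turn in turns:
        if turn.prompt_ids[: len(accumulated)] != accumulated:
            raise ValueError("token prefix mismatch: start a new trajectory instead")
        # Context appended since the previous response: not trained on.
        new_context = turn.prompt_ids[len(accumulated):]
        response_ids += new_context
        response_logprobs += [0.0] * len(new_context)
        loss_mask += [0] * len(new_context)
        # Model-generated tokens: trained on.
        response_ids += turn.response_ids
        response_logprobs += turn.logprobs
        loss_mask += [1] * len(turn.response_ids)
        accumulated = turn.prompt_ids + turn.response_ids
    return {
        "prompt_ids": prompt_ids,
        "response_ids": response_ids,
        "response_logprobs": response_logprobs,
        "loss_mask": loss_mask,
    }
```

For two turns where the second prompt extends the first prompt and response with two observation tokens, the mask comes out as 1s for the first response, 0s for the observation, and 1s for the second response.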

Extensions

Multimodal Support

The Gateway architecture supports multimodal inputs via an optional preprocessor. When present, the Gateway applies the multimodal preprocessor during tokenization and stores processor outputs alongside token sequences. Specific processor adapters will be added as model support grows.

Tool Call Reconstruction

For models that produce tool calls via special tokens, the Gateway uses a configurable tool parser to reconstruct structured tool_calls fields in the OpenAI response returned to the agent. The raw token sequence is always preserved as-is for training.
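As an illustration of what a tool parser does, the sketch below converts model text into the tool_calls structure used by OpenAI Chat Completions responses. The <tool_call>{...}</tool_call> wrapper with name and arguments keys is an assumed output format chosen for the example; real models differ, which is why the RFC makes the parser configurable. parse_tool_calls is a hypothetical name.

```python
import json
import re
import uuid

# Assumed model output format: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> tuple[str, list[dict]]:
    """Split decoded model output into plain content and OpenAI-style tool_calls."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        payload = json.loads(match.group(1))
        calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": payload["name"],
                # OpenAI responses carry arguments as a JSON-encoded string.
                "arguments": json.dumps(payload.get("arguments", {})),
            },
        })
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls
```

Only the response returned to the agent is restructured this way; the recorded response_ids stay exactly as generated.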

Token-Request Ingress

A token-request ingress is reserved for follow-up work so that legacy VERL-native paths such as AgentLoopManager can migrate onto the same Gateway/session model without keeping legacy trajectory bookkeeping as a production truth source. This extension is explicitly out of scope for the first implementation stage.

Prefix-Sharing Storage

In scenarios with repeated sampling or partial context overlap, multiple trajectories may share common prefixes. Tree-structured storage could reduce memory and disk usage, but training-side benefits depend on algorithm-level support (e.g., DTA). This is deferred to future work pending further analysis.
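To give a sense of the potential savings, here is a toy trie over token IDs that stores each shared prefix once. TokenTrie is a hypothetical name, and this is only a storage-side sketch; it says nothing about how a training algorithm would consume tree-structured data.

```python
class TokenTrie:
    """Toy prefix tree: each node is a dict mapping token ID -> child node."""

    def __init__(self):
        self.root: dict = {}
        self.num_nodes = 0  # tokens actually stored

    def insert(self, token_ids: list[int]) -> None:
        node = self.root
        for token in token_ids:
            if token not in node:
                node[token] = {}
                self.num_nodes += 1
            node = node[token]


sequences = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6]]
trie = TokenTrie()
for seq in sequences:
    trie.insert(seq)

flat_tokens = sum(len(seq) for seq in sequences)  # tokens stored without sharing
```

In this example the three sequences hold 11 tokens when stored flat but only 6 trie nodes, because the prefixes [1, 2] and [1, 2, 3] are shared.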

Update on May 11

PR #6299 is the latest draft PR and supersedes PR #5931.

What's done:

  • The core implementation of AgentGateway, including the gateway actor and the gateway serving runtime.
  • A high-level example implementation of an OpenAI-request-compatible framework.
  • Adaptation to the new main_ppo_sync.py entry point, based on Transfer Queue.
  • Multimodal and tool-parsing support.
  • Many AI-generated unit tests and a few smoke tests that may need to be trimmed down.

WIP:

  • A deepeyes_with_gateway recipe
  • A SWE-agent recipe
  • A CliAgentFramework example implementation
  • Integrate Gateway/Framework configs into the current VERL config system
  • CI hygiene

Future directions:

  • Turn-wise/completion-wise trajectory collection
  • Multi-agent support
  • Default deployment strategy for gateway actors
