[RFC] Agent Abstractions and Trajectory Gateway for VERL

# [RFC] Agent Abstractions and Trajectory Gateway for VERL

## Summary

This RFC proposes two new abstractions for VERL's agent-based reinforcement learning pipeline:

1. **AgentFramework** — an abstract base class for agent lifecycle management and reward computation, replacing the current tight coupling between `AgentLoopManager` and specific agent implementations.
1. **AgentGateway** — a Gateway subsystem owned by the serving layer (`LLMServerManager`) that intercepts agent LLM calls via the OpenAI Chat Completions API, performs canonical tokenization, and assembles token-level trajectory data with strict token-truth guarantees.

Together, they enable any OpenAI-compatible agent system to be integrated into VERL's training loop without modifications to agent code, while producing continuous multi-turn trajectories with loss masks directly consumable by VERL's training engine.

## Motivation

VERL's current agent integration (`AgentLoopManager` + `AgentLoopBase`) tightly couples three concerns: LLM infrastructure management, agent lifecycle, and trajectory collection. This creates friction when integrating new agent types:

- Each new agent framework requires dedicated adapter code inside the agent loop.
- Trajectory collection logic is embedded in the agent loop itself, making it non-reusable.
- Only coroutine-based agents are natively supported; subprocess and remote agents require ad-hoc integration (e.g., SWE-Agent's custom `ModelProxy`).

Community contributions such as AWS AgentCore (PR #4216) and Aliyun Remote Agent (Issue #5737) further demonstrate the need for a pluggable agent abstraction that cleanly separates these concerns.

This RFC addresses these issues by:

1. **Defining** `AgentFramework` as a thin standard interface for agent-based rollout. The common contract is `generate_sequences(prompts: DataProto) -> DataProto`; internal execution structure, reward computation flow, and batching strategy remain implementation-specific.
1. **Extracting trajectory collection into** `AgentGateway`, a serving-side subsystem that works with any Framework implementation. The Gateway handles tokenization, prefix consistency, and trajectory assembly — concerns that are orthogonal to how agents are launched or managed.
1. **Extracting infrastructure management** (LLM server initialization, load balancing) out of the framework layer, so that `AgentFramework` remains focused on trainer-facing generation semantics rather than serving ownership.

## Design Overview

<img width="1264" height="701" alt="Image" src="https://github.com/user-attachments/assets/ce0b71b8-d01d-467f-966f-ec982f6c504c" />

### Architecture

```text
VERL Training Loop
│ 
│
├── AgentFramework
│     Agent lifecycle management
│     Reward computation
│     Batch orchestration + DataProto assembly
│
└── Serving Runtime
      Owns: LLMServerManager, load balancer, Gateway subsystem
      ├── GatewayManager (internal session-routing component)
      ├── Gateway Actor 1 (Ray actor, FastAPI)
      ├── Gateway Actor 2 (Ray actor, FastAPI)
      └── Gateway Actor N (Ray actor, FastAPI)
```

`AgentFramework` and `AgentGateway` are independent — the Gateway does not know which Framework implementation is using it, and the Framework does not know the Gateway's internal trajectory assembly logic. They interact only through a well-defined session API.

To avoid single-point bottlenecks, multiple Gateway instances run as Ray actors, each hosting a FastAPI HTTP server. An `AgentGatewayManager` routes session creation requests across Gateway actors (e.g., round-robin or least-loaded). Once a session is created on a specific Gateway actor, all subsequent requests for that session are pinned to that actor. This follows the same pattern as VERL's existing `GlobalRequestLoadBalancer` for LLM server replicas.

### Data Flow

A single session proceeds as follows:

1. **Framework** creates a session on the `AgentGatewayManager`, which selects a Gateway actor and returns a session-specific `base_url`.
1. **Framework** starts the agent (subprocess, coroutine, or remote call), injecting the `base_url` as the agent's LLM endpoint.
1. **Agent** makes standard OpenAI Chat Completion requests to the assigned Gateway actor. On each request, the Gateway tokenizes, checks prefix consistency, routes to the inference backend, records the token-level interaction, and returns a standard OpenAI response. The agent is unaware of the interception.
1. **Agent** completes (process exits, coroutine returns, or calls the optional `/complete` endpoint).
1. **Framework** finalizes the session via the `AgentGatewayManager`, receiving the assembled trajectories.
1. **Framework** computes trajectory-aligned rewards and packages the resulting samples into `DataProto`.

## AgentFramework

### Interface

```python
class AgentFramework(ABC):
    @abstractmethod
    async def generate_sequences(self, prompts: DataProto) -> DataProto:
        """Process a trainer batch and return a training-ready DataProto."""
        ...
```

`AgentFramework` is intentionally thin. Users are expected to implement agent lifecycle management, reward computation, and batch orchestration through the `generate_sequences` interface. Common patterns may be factored into subclasses and helpers where useful.

This keeps the interface compatible with heterogeneous execution models:

- VERL-native agent loops
- subprocess-based OpenAI-compatible agents
- remote services and cloud-hosted agent frameworks

### Reward Computation

Reward assignment remains a shared concern, but not a required abstract method on `AgentFramework`.

Typical patterns include:

- **Framework-collected.** When the framework can directly access agent output (subprocess stdout, coroutine return value), it parses the relevant information locally.
- **Agent-uploaded.** When the agent runs remotely and the framework has no direct access to its output, the agent uploads `reward_info` via the Gateway's optional `/complete` endpoint, and the framework reads it back during session finalization.
- **Helper-normalized.** Framework implementations may reuse shared helpers to normalize one session-level reward into per-trajectory rewards, validate trajectory/reward alignment, and assemble `DataProto`.

### Reference Implementations

**OpenAICompatibleAgentFramework** is the preferred first implementation target. It launches or contacts an OpenAI-compatible agent, injects a session `base_url`, and relies on Gateway-backed `/v1/chat/completions` traffic as the trajectory truth source.

**CliAgentFramework** is a straightforward variant that launches external agent programs as subprocesses, injecting the Gateway session URL via the `OPENAI_BASE_URL` environment variable. Any agent that uses the OpenAI Chat Completions API works without code changes. Completion is detected via process exit.

**AgentLoopManager** remains an important follow-up migration target, but it is not part of the first implementation milestone. Supporting it cleanly will likely require a token-request ingress in addition to the chat-completions path described in this RFC.

Custom implementations can support other execution models such as remote services or cloud-hosted frameworks by using the same session API. For remote agents without an external notification channel, the Gateway provides `wait_for_completion()` to block until the agent signals completion via `/complete`.

### Migration from AgentLoopManager

The current `AgentLoopManager` remains on the legacy path in the first implementation stage.

The staged plan is:

- First, land a Gateway-backed chat-completions path for OpenAI-compatible / remote-style agents.
- Then, add token-request ingress as a dedicated extension point.
- Finally, migrate `AgentLoopManager` and related VERL-native agent loops onto the new serving/Gateway model without introducing double trajectory bookkeeping in production.

## AgentGateway

### Overview

Each AgentGateway instance is a Ray actor running a FastAPI HTTP server that exposes the OpenAI Chat Completions API. It manages multiple concurrent sessions, each maintaining independent trajectory state. The Gateway is the single canonical tokenization authority for the chat-completions path — all `messages -> token_ids` conversions happen here, using the inference backend's tokenizer and chat template.

In the first implementation stage, `/v1/chat/completions` is the primary ingress. A token-request ingress for legacy VERL-native paths is reserved as a follow-up extension.

Multiple Gateway actors are managed by an `AgentGatewayManager`, which handles session routing and provides a unified interface to the Framework.

#### Gateway Manager and Scaling

Each Gateway actor manages its own sessions independently. The manager's only responsibility is routing — selecting which actor handles a new session and forwarding subsequent calls to the correct actor.

```python
class AgentGatewayManager:
    """Manages multiple Gateway actors with session routing."""

    def __init__(self, gateways: list[AgentGateway]):
        self.gateways = gateways
        self._session_to_gateway: dict[str, AgentGateway] = {}

    async def create_session(self, session_id: str) -> GatewaySession:
        """Select a Gateway actor (e.g., round-robin) and create a session."""
        gateway = self._select_gateway()
        session = await gateway.create_session.remote(session_id)
        self._session_to_gateway[session_id] = gateway
        return session

    async def finalize_session(self, session_id: str) -> list[Trajectory]:
        """Route to the correct Gateway actor and finalize."""
        gateway = self._session_to_gateway.pop(session_id)
        return await gateway.finalize_session.remote(session_id)

    async def abort_session(self, session_id: str) -> None:
        gateway = self._session_to_gateway.pop(session_id, None)
        if gateway:
            await gateway.abort_session.remote(session_id)

    async def wait_for_completion(self, session_id: str, timeout: float) -> None:
        gateway = self._session_to_gateway[session_id]
        await gateway.wait_for_completion.remote(session_id, timeout)
```

### Session Management

The Gateway provides a session API for Framework to manage session lifecycles:

```python
@ray.remote
class AgentGateway:

    def __init__(
        self,
        tokenizer: AutoTokenizer,
        chat_template: str,
        backend: InferenceBackend,
        config: GatewayConfig,
    ): ...

    async def create_session(self, session_id: str) -> GatewaySession:
        """Create a trajectory session."""
        ...

    async def finalize_session(self, session_id: str) -> list[Trajectory]:
        """Assemble and return trajectories, clean up session state.
        Returns one trajectory per prefix-consistent segment."""
        ...

    async def abort_session(self, session_id: str) -> None:
        """Discard session state."""
        ...

    async def wait_for_completion(self, session_id: str, timeout: float) -> None:
        """Block until agent calls /complete. For remote agents only."""
        ...
```

`create_session` returns a `GatewaySession` containing a session-specific `base_url` (e.g., `http://{gateway_host}:{port}/sessions/{session_id}/v1`). The agent uses this URL for all LLM calls, and the Gateway routes requests to the correct session by URL path. This approach requires no special headers or client modifications — the agent simply uses a different base URL.

### HTTP Endpoints

The Gateway exposes two endpoints per session:

`POST /sessions/{id}/v1/chat/completions` — Standard OpenAI Chat Completions. The agent calls this as its normal LLM endpoint. The Gateway intercepts the request, performs tokenization, routes to the inference backend, records the interaction, and returns a standard response. This is the mandatory first-stage endpoint.

`POST /sessions/{id}/complete` — Optional. Allows the agent to explicitly signal session completion and optionally upload reward-related information. This is useful for remote agents that have no other completion notification channel, or for VERL-native agent loops that want to pass structured results. Agents that do not call this endpoint are unaffected — Framework detects completion through other means (process exit, coroutine return, etc.).

### Request Handling

On each Chat Completion request, the Gateway performs:

1. **Message-level prefix check.** Compare the incoming normalized messages with the session's recorded message history. If the prefix matches, only the new incremental messages are tokenized and appended to the accumulated token sequence. If the prefix does not match, the current trajectory is finalized and a new trajectory begins with full tokenization.
1. **Inference routing.** Send `prompt_ids` to the inference backend via its token-level generation API (e.g. `AsyncLLMServerManager`). Receive `response_ids` and `logprobs`.
1. **Interaction recording.** Record the turn's `prompt_ids`, `response_ids`, and `logprobs`. Update the session's accumulated token sequence and message history.
1. **Response reconstruction.** Detokenize the response and construct a standard OpenAI Chat Completion response for the agent. When tool calling is enabled, structured `tool_calls` fields are reconstructed from the raw token output.

### Trajectory Output

After a session completes, `finalize_session` assembles the recorded interactions into a `list[Trajectory]`. Each trajectory is a prefix-consistent, continuous token sequence that constitutes an independent training sample:

```python
@dataclass
class Trajectory:
    uid: str                         # Each prompt has a unique uuid from dataset
    session_id: int                  # Each group sampling has a session_id: [0, n)
    trajectory_id: int               # Each sampling outputs m trajectories: [0, m)
    reward_info: dict

    prompt_ids: list[int]
    response_ids: list[int]
    response_logprobs: list[float]
    loss_mask: list[int]             # 1 for response tokens, 0 for prompt
    ...
```

A session produces multiple trajectories when the Gateway detects a message prefix mismatch mid-session, possibly due to context compression, skill switching, or other agent-side context rewrites. The Gateway does not need to understand why the context changed; it only enforces consistency within each trajectory. In the final `DataProto` output, each trajectory becomes one row.

<img width="2560" height="840" alt="Image" src="https://github.com/user-attachments/assets/e1a127c1-3a04-497e-bd58-0da01d951293" />

## Extensions

### Multimodal Support

The Gateway architecture supports multimodal inputs via an optional preprocessor. When present, the Gateway applies the multimodal preprocessor during tokenization and stores processor outputs alongside token sequences. Specific processor adapters will be added as model support grows.

### Tool Call Reconstruction

For models that produce tool calls via special tokens, the Gateway uses a configurable tool parser to reconstruct structured `tool_calls` fields in the OpenAI response returned to the agent. The raw token sequence is always preserved as-is for training.

### Token-Request Ingress

A token-request ingress is reserved for follow-up work so that legacy VERL-native paths such as `AgentLoopManager` can migrate onto the same Gateway/session model without keeping legacy trajectory bookkeeping as a production truth source. This extension is explicitly out of scope for the first implementation stage.

### Prefix-Sharing Storage

In scenarios with repeated sampling or partial context overlap, multiple trajectories may share common prefixes. Tree-structured storage could reduce memory and disk usage, but training-side benefits depend on algorithm-level support (e.g., DTA). This is deferred to future work pending further analysis.

## Update on May 11

https://github.com/verl-project/verl/pull/6299 is the most updated draft PR, superseding PR #5931. 

### What's done:

- The core implementation for AgentGateway, including the gateway actor and the gateway serving runtime.
- A high-level example implementation of an OpenAI-request-compatible framework
- adaptation to the new main_ppo_sync.py entrance based on Transfer Queue.
- Multi-modal and tool parsing support
- Many AI-generated unit tests and a few smoke tests that may need to be trimmed down.

### WIP:

- [ ] A deepeyes_with_gateway recipe
- [ ] A SWE-agent recipe
- [ ] A CLIAgentFramework example implementation
- [ ] Integrate Gateway/Framework configs into the current VERL config system
- [ ] CI hygiene

### Future directions:
- [ ] Turn-wise/completion-wise trajectory collection
- [ ] Multi-agent support
- [ ] Default deployment strategy for gateway actors

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Agent Abstractions and Trajectory Gateway for VERL #5790

[RFC] Agent Abstractions and Trajectory Gateway for VERL

Summary

Motivation

Design Overview

Architecture

Data Flow

AgentFramework

Interface

Reward Computation

Reference Implementations

Migration from AgentLoopManager

AgentGateway

Overview

Gateway Manager and Scaling

Session Management

HTTP Endpoints

Request Handling

Trajectory Output

Extensions

Multimodal Support

Tool Call Reconstruction

Token-Request Ingress

Prefix-Sharing Storage

Update on May 11

What's done:

WIP:

Future directions:

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[RFC] Agent Abstractions and Trajectory Gateway for VERL #5790

Description

[RFC] Agent Abstractions and Trajectory Gateway for VERL

Summary

Motivation

Design Overview

Architecture

Data Flow

AgentFramework

Interface

Reward Computation

Reference Implementations

Migration from AgentLoopManager

AgentGateway

Overview

Gateway Manager and Scaling

Session Management

HTTP Endpoints

Request Handling

Trajectory Output

Extensions

Multimodal Support

Tool Call Reconstruction

Token-Request Ingress

Prefix-Sharing Storage

Update on May 11

What's done:

WIP:

Future directions:

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions