184 changes: 184 additions & 0 deletions docs/advance/agent_framework.rst
@@ -0,0 +1,184 @@
Agent Framework
===============

Last updated: 05/21/2026.

.. versionadded:: 0.8.0
   [status: alpha]

.. warning::
   Agent Framework is ready for use, but the API may change in future releases.

Agent Framework is a session-based orchestration layer for agentic RL training.
It runs user-defined agent logic (tool calls, multi-turn reasoning, environment
interaction) inside gateway-managed sessions, collects token-level trajectories,
and writes them to the TransferQueue for sync GRPO/PPO training.

Agent Framework coexists with the legacy :doc:`Agent Loop <agent_loop>` path.
Both produce the same trainer-consumable output; Agent Framework adds
session-level isolation, an OpenAI-compatible HTTP interface per session, and
structured reward dispatch.


Overview
--------

**Design goals:**

- Black-box agent runner: any async function that speaks the OpenAI chat completions API
- Session isolation: each rollout sample gets its own HTTP endpoint
- Reward flexibility: inline scoring via ``reward_loop_worker_handles`` or the
  framework-level ``reward.custom_reward_function`` bridge
- Subclass extensibility: ``AgentFramework`` is abstract, so you can supply your own subclass

**Non-goals:**

- Defining tool semantics (that is the agent runner's job)
- Replacing Agent Loop for single-turn or simple multi-turn use cases


System Architecture
-------------------

.. code-block:: text

   ┌─────────────────────────────────────────────────────────────┐
   │  Trainer (main_ppo_sync.py)                                 │
   │  └── AgentFrameworkRolloutAdapter.generate_sequences(batch) │
   └────────────────────────────┬────────────────────────────────┘
                                │ TensorDict prompts
   ┌─────────────────────────────────────────────────────────────┐
   │  OpenAICompatibleAgentFramework                             │
   │  ├── create sessions (1 per sample × rollout.n)             │
   │  ├── launch agent_runner coroutines                         │
   │  ├── wait for completion / finalize                         │
   │  ├── score trajectories (reward dispatch)                   │
   │  └── write to TransferQueue                                 │
   └────────────────────────────┬────────────────────────────────┘
                                │ session lifecycle
   ┌─────────────────────────────────────────────────────────────┐
   │  GatewayServingRuntime                                      │
   │  ├── GatewayManager (round-robin session routing)           │
   │  └── GatewayActor ×N (HTTP /v1/chat/completions per session)│
   │      └── backend: LLMServerClient.generate(token-level)     │
   └─────────────────────────────────────────────────────────────┘

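The fan-out in the diagram (one session per sample × ``rollout.n``, each driven by its own coroutine) can be sketched with plain ``asyncio``. This is a minimal illustration under stated assumptions, not the framework's implementation: ``run_all_sessions`` and ``ConcurrencyProbe`` are hypothetical names, and treating ``max_concurrent_sessions`` as a semaphore bound (``0`` = unlimited) is an assumed reading of that option.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# Hypothetical runner type: receives (sample_index, rollout_index).
Runner = Callable[[int, int], Awaitable[Tuple[int, int]]]


async def run_all_sessions(
    num_samples: int,
    n: int,
    runner: Runner,
    max_concurrent_sessions: int = 0,
) -> List[Tuple[int, int]]:
    """Launch num_samples * n runner coroutines, optionally bounded."""
    # 0 is treated as "unlimited" (assumption mirroring the config comment).
    limit = (
        asyncio.Semaphore(max_concurrent_sessions)
        if max_concurrent_sessions > 0
        else None
    )

    async def run_one(sample_index: int, rollout_index: int):
        if limit is None:
            return await runner(sample_index, rollout_index)
        async with limit:
            return await runner(sample_index, rollout_index)

    tasks = [run_one(s, r) for s in range(num_samples) for r in range(n)]
    # gather() returns results in submission order.
    return await asyncio.gather(*tasks)


class ConcurrencyProbe:
    """Fake runner that records how many sessions were active at once."""

    def __init__(self) -> None:
        self.active = 0
        self.peak = 0

    async def __call__(self, sample_index: int, rollout_index: int):
        self.active += 1
        self.peak = max(self.peak, self.active)
        await asyncio.sleep(0)  # yield so other sessions can start
        self.active -= 1
        return (sample_index, rollout_index)


if __name__ == "__main__":
    probe = ConcurrencyProbe()
    out = asyncio.run(run_all_sessions(3, 2, probe, max_concurrent_sessions=2))
    print(len(out), "sessions, peak concurrency", probe.peak)
```

With 3 samples and ``n=2`` the sketch runs 6 sessions; the probe confirms that no more than ``max_concurrent_sessions`` of them overlap.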

System Components
-----------------

+------------------------------------+-----------------------------------------------------------------------+
| Component                          | Role                                                                  |
+====================================+=======================================================================+
| ``AgentFramework``                 | Abstract base class. Subclasses implement ``from_config`` and         |
|                                    | ``generate_sequences``.                                               |
+------------------------------------+-----------------------------------------------------------------------+
| ``OpenAICompatibleAgentFramework`` | Default subclass. Manages sessions, runs agent_runner coroutines,     |
|                                    | dispatches reward scoring, writes TQ output.                          |
+------------------------------------+-----------------------------------------------------------------------+
| ``GatewayServingRuntime``          | Owns gateway actor lifecycle. ``gateway_count=0`` degrades to a thin  |
|                                    | LLM client passthrough (no HTTP layer).                               |
+------------------------------------+-----------------------------------------------------------------------+
| ``GatewayActor``                   | Ray actor running an HTTP server. Exposes ``/v1/chat/completions``    |
|                                    | to the agent runner and collects token-level trajectories.            |
+------------------------------------+-----------------------------------------------------------------------+
| ``AgentFrameworkRolloutAdapter``   | Trainer-facing glue in ``entry.py``. Satisfies the                    |
|                                    | ``agent_loop_manager_class`` extension point contract.                |
+------------------------------------+-----------------------------------------------------------------------+

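The round-robin session routing attributed to ``GatewayManager`` can be illustrated with a small stand-alone sketch. ``RoundRobinRouter`` and the gateway names are hypothetical, and the sticky lookup (a session keeps the gateway it was first assigned) is an assumption made for the example rather than documented behavior.

```python
import itertools
from typing import Dict, Sequence


class RoundRobinRouter:
    """Assign each new session to the next gateway in a fixed rotation."""

    def __init__(self, gateways: Sequence[str]) -> None:
        if not gateways:
            raise ValueError("at least one gateway is required")
        self._rotation = itertools.cycle(gateways)
        self._owner: Dict[str, str] = {}

    def assign(self, session_id: str) -> str:
        # A session seen before stays on its gateway (assumed stickiness);
        # a new session takes the next gateway in the rotation.
        if session_id not in self._owner:
            self._owner[session_id] = next(self._rotation)
        return self._owner[session_id]


if __name__ == "__main__":
    router = RoundRobinRouter(["gateway-0", "gateway-1"])
    for sid in ("s0", "s1", "s2", "s0"):
        print(sid, "->", router.assign(sid))
```

Rotating over a fixed list spreads sessions evenly without tracking load, which is usually adequate when every session has a similar lifetime.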

Writing a Custom Agent Runner
-----------------------------

An agent runner is any async callable with this signature:

.. code:: python

   import httpx


   async def my_agent_runner(
       *,
       raw_prompt: list[dict],    # OpenAI-format messages
       session: "SessionHandle",  # provided by the framework; .base_url is the per-session endpoint
       sample_index: int,
       **kwargs,                  # extra fields from dataset non_tensor columns
   ) -> None:
       """Run agent logic against the gateway session."""
       async with httpx.AsyncClient() as client:
           response = await client.post(
               f"{session.base_url}/chat/completions",
               json={"model": "any", "messages": raw_prompt},
           )
           response.raise_for_status()
           # ... tool calls, multi-turn loops, etc.

           # Signal that the session is complete (triggers trajectory finalization)
           await client.post(session.base_url.removesuffix("/v1") + "/complete")

The framework handles session creation, trajectory collection, reward scoring,
and TQ writes. The agent runner only needs to make HTTP requests and signal
completion.

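The "tool calls, multi-turn loops" step elided in the runner could look roughly like the sketch below. It assumes the gateway returns OpenAI-style chat completion bodies (``choices[0].message`` with optional ``tool_calls``). ``run_turns``, ``run_tool``, and the injected ``post`` transport are hypothetical names; the transport is injected so that the loop can be exercised without a live gateway.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

# Hypothetical transport: POSTs `payload` to `url`, returns the decoded JSON body.
PostFn = Callable[[str, Optional[dict]], Awaitable[dict]]


def run_tool(name: str, arguments: str) -> str:
    """Placeholder tool executor; a real runner would dispatch on `name`."""
    return f"result of {name}({arguments})"


async def run_turns(
    base_url: str,
    raw_prompt: List[dict],
    post: PostFn,
    max_turns: int = 5,
) -> List[dict]:
    messages = list(raw_prompt)
    for _ in range(max_turns):
        reply = await post(
            f"{base_url}/chat/completions",
            {"model": "any", "messages": messages},
        )
        message = reply["choices"][0]["message"]
        messages.append(message)
        tool_calls = message.get("tool_calls") or []
        if not tool_calls:
            break  # plain assistant answer: the episode is over
        for call in tool_calls:
            fn = call["function"]
            # Feed each tool result back as a `tool` message for the next turn.
            messages.append(
                {
                    "role": "tool",
                    "tool_call_id": call["id"],
                    "content": run_tool(fn["name"], fn["arguments"]),
                }
            )
    # Signal completion the same way the runner above does.
    await post(base_url.removesuffix("/v1") + "/complete", None)
    return messages
```

In a real runner, ``post`` would wrap ``httpx.AsyncClient.post`` and decode the JSON response, and ``max_turns`` could be supplied through ``agent_runner_kwargs``.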

Configuration Reference
-----------------------

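All fields live under ``actor_rollout_ref.rollout.custom.agent_framework``; the
adapter itself is selected through
``actor_rollout_ref.rollout.agent.agent_loop_manager_class``: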

.. code:: yaml

   actor_rollout_ref:
     rollout:
       agent:
         agent_loop_manager_class: verl.agent.framework.entry.AgentFrameworkRolloutAdapter
       custom:
         agent_framework:
           # Required: FQN of your agent runner function
           agent_runner_fqn: my_package.my_module.my_agent_runner

           # Number of gateway actors (HTTP servers). 0 = no gateway, passthrough only.
           gateway_count: 8

           # Optional: kwargs passed to agent_runner via functools.partial
           agent_runner_kwargs:
             max_turns: 5

           # Optional: tool config yaml for tool initialization
           tool_config_path: path/to/tool_config.yaml

           # Optional: timeout for session completion (seconds). null = no wait.
           completion_timeout_seconds: 30

           # Optional: max concurrent sessions (0 = unlimited)
           max_concurrent_sessions: 0

           # Optional: FQN of framework subclass (default: OpenAICompatibleAgentFramework)
           framework_class_fqn: verl.agent.framework.framework.OpenAICompatibleAgentFramework

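To make the ``agent_runner_fqn`` / ``agent_runner_kwargs`` pair concrete, here is one way a fully qualified name could be resolved and bound with ``functools.partial``, as the config comment describes. ``resolve_fqn`` and ``build_runner`` are hypothetical helpers, the ``importlib`` lookup is an assumption about how resolution might work, and ``json.dumps`` merely stands in for a real agent runner so that the sketch runs anywhere.

```python
import functools
import importlib
from typing import Any, Callable, Optional


def resolve_fqn(fqn: str) -> Any:
    """Import 'package.module.attr' and return the attribute."""
    module_name, _, attr_name = fqn.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'package.module.attr', got {fqn!r}")
    return getattr(importlib.import_module(module_name), attr_name)


def build_runner(
    agent_runner_fqn: str,
    agent_runner_kwargs: Optional[dict] = None,
) -> Callable:
    """Bind the configured kwargs onto the resolved callable."""
    runner = resolve_fqn(agent_runner_fqn)
    return functools.partial(runner, **(agent_runner_kwargs or {}))


# Stand-in target: json.dumps with sort_keys pre-bound, mirroring how
# max_turns would be pre-bound onto an agent runner.
dump_sorted = build_runner("json.dumps", {"sort_keys": True})

if __name__ == "__main__":
    print(dump_sorted({"b": 1, "a": 2}))
```

Because the kwargs are bound up front, the runner's own signature must accept them as keyword arguments (for example ``max_turns``) in addition to the per-session arguments.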

Usage Example
-------------

**Full training run** (requires GPU cluster + judge model):

.. code:: bash

   bash examples/grpo_trainer/run_deepeyes_gateway_grpo.sh

**Minimal CPU-only tutorial** (no GPU required):

.. code:: bash

   python examples/tutorial/agent_framework_get_started/minimal_e2e.py

The tutorial demonstrates the runtime → framework → generate_sequences path
with a fake rollout server, real gateway actor, and real framework orchestration.


See Also
--------

- :doc:`Agent Loop <agent_loop>` — legacy single/multi-turn rollout path
- :doc:`Agentic RL overview <../start/agentic_rl>` — high-level introduction
- :doc:`Reward Loop <reward_loop>` — reward worker integration
1 change: 1 addition & 0 deletions docs/index.rst
@@ -144,6 +144,7 @@ verl is fast with:
advance/rollout_trace.rst
advance/rollout_skip.rst
advance/agent_loop
advance/agent_framework
advance/reward_loop
data/transfer_queue.md
advance/grafana_prometheus.md
16 changes: 13 additions & 3 deletions docs/start/agentic_rl.rst
@@ -109,18 +109,28 @@ Follow :doc:`Rollout trace<../advance/rollout_trace>` to known more about trace
Agent Framework
---------------

For the session-based Agent Framework (``verl.agent.framework``), which provides
per-session HTTP isolation and structured reward dispatch for agentic RL, see
:doc:`Agent Framework <../advance/agent_framework>`.

The LangGraph-based agent path below is a separate recipe that uses LangChain
abstractions on top of the same inference backend.

LangGraph Agent
~~~~~~~~~~~~~~~

System Architecture
~~~~~~~~~~~~~~~~~~~
^^^^^^^^^^^^^^^^^^^

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/langgraph_agent.png?raw=true

System Components
~~~~~~~~~~~~~~~~~
^^^^^^^^^^^^^^^^^

+--------------------------+-----------------------------------------------------------------------------------------------+
| Component | Role |
+==========================+===============================================================================================+
| ChatModel | LLM object of LangChain, used to adapt to the generate api provided by LLMServerClient |
| ChatModel | LLM object of LangChain, used to adapt to the "generate" api provided by LLMServerClient |
+--------------------------+-----------------------------------------------------------------------------------------------+
| ReactAgentLoop | Agent adaptation layer, which by default supports a naive LangGraph Agentic. |
| | New classes can be derived to support user-defined Agents, and the run function needs to be |
84 changes: 84 additions & 0 deletions examples/grpo_trainer/run_deepeyes_gateway_grpo.sh
@@ -0,0 +1,84 @@
#!/usr/bin/env bash
# GRPO | Agent Framework + Gateway | DeepEyes multimodal tool-use
#
# This script trains a vision-language model with agentic tool-use rollouts
# using the Agent Framework + Gateway stack. Each rollout sample gets its own
# HTTP session where the agent runner can make multi-turn chat completions
# requests and invoke tools (e.g., image zoom).
#
# Prerequisites:
# - A judge/reward model serving at LLM_AS_A_JUDGE_BASE (default: localhost:18901)
# - DeepEyes dataset parquet file at TRAIN_FILE
# - Model checkpoint at MODEL_PATH (or HuggingFace model ID)
#
# See docs/advance/agent_framework.rst for architecture details.

set -xeuo pipefail

########################### user-adjustable ###########################
MODEL_PATH=${MODEL_PATH:-/data1/models/Qwen/Qwen3.5-4B}
TRAIN_FILE=${TRAIN_FILE:-/data1/datasets/deepeyes/data/data_0.1.2_visual_toolbox_v2.parquet}
VAL_FILE=${VAL_FILE:-${TRAIN_FILE}}

NGPUS_PER_NODE=${NGPUS_PER_NODE:-7}
TOTAL_TRAINING_STEPS=${TOTAL_TRAINING_STEPS:-50}
TRAIN_BATCH_SIZE=${TRAIN_BATCH_SIZE:-14}

# Agent Framework specific
GATEWAY_COUNT=${GATEWAY_COUNT:-7}
MAX_TURNS=${MAX_TURNS:-5}
COMPLETION_TIMEOUT=${COMPLETION_TIMEOUT:-}

# Reward judge endpoint
LLM_AS_A_JUDGE_BASE=${LLM_AS_A_JUDGE_BASE:-http://127.0.0.1:18901/v1}

PROJECT_NAME=${PROJECT_NAME:-deepeyes_gateway_grpo}
EXPERIMENT_NAME=${EXPERIMENT_NAME:-qwen35_4b_deepeyes_gateway_grpo}
########################### end user-adjustable ###########################

VERL_REPO_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
cd "${VERL_REPO_ROOT}"

export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3,4,5,6}"
export VERL_FORCE_TQ_NESTED_READBACK="${VERL_FORCE_TQ_NESTED_READBACK:-1}"
export LLM_AS_A_JUDGE_BASE
export WANDB_MODE="${WANDB_MODE:-offline}"
export NCCL_P2P_DISABLE="${NCCL_P2P_DISABLE:-1}"
export NCCL_SHM_DISABLE="${NCCL_SHM_DISABLE:-1}"
export PYTHONUNBUFFERED=1
export HYDRA_FULL_ERROR=1

python3 -m verl.trainer.main_ppo_sync \
--config-path="${VERL_REPO_ROOT}/recipe/deepeyes_with_gateway/configs" \
--config-name=deepeyes_gateway_grpo \
data.train_files="${TRAIN_FILE}" \
"data.val_files=[${VAL_FILE}]" \
data.train_batch_size="${TRAIN_BATCH_SIZE}" \
data.max_prompt_length=4096 \
data.max_response_length=1024 \
trainer.total_training_steps="${TOTAL_TRAINING_STEPS}" \
trainer.val_before_train=False \
trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
trainer.nnodes=1 \
'trainer.logger=[console,wandb,tensorboard]' \
trainer.project_name="${PROJECT_NAME}" \
trainer.experiment_name="${EXPERIMENT_NAME}" \
trainer.save_freq=-1 \
trainer.test_freq=-1 \
actor_rollout_ref.model.path="${MODEL_PATH}" \
actor_rollout_ref.model.use_fused_kernels=False \
actor_rollout_ref.model.use_remove_padding=True \
'+actor_rollout_ref.model.override_config.attn_implementation=eager' \
actor_rollout_ref.actor.ppo_mini_batch_size="${TRAIN_BATCH_SIZE}" \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.n=4 \
actor_rollout_ref.rollout.response_length=1024 \
actor_rollout_ref.rollout.max_model_len=8192 \
actor_rollout_ref.rollout.max_num_seqs=4 \
actor_rollout_ref.rollout.max_num_batched_tokens=16384 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.55 \
actor_rollout_ref.rollout.enforce_eager=True \
actor_rollout_ref.rollout.dtype=float16 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.custom.agent_framework.gateway_count="${GATEWAY_COUNT}" \
actor_rollout_ref.rollout.custom.agent_framework.agent_runner_kwargs.max_turns="${MAX_TURNS}"
60 changes: 60 additions & 0 deletions examples/tutorial/agent_framework_get_started/README.md
@@ -0,0 +1,60 @@
# Agent Framework Get Started

Minimal runnable entry for the `verl.agent.framework` + `verl.agent.gateway`
stack (PR #6299).

It demonstrates three boundaries:

1. The caller creates `GatewayServingRuntime` externally (entry.py does this
in production; here we do it manually for visibility).
2. `GatewayServingRuntime` is injected into `OpenAICompatibleAgentFramework`.
3. The framework is exercised with one `generate_sequences(...)` call on a
minimal `TensorDict`.

Inside the script, the agent side is split into two layers:

- `agent_runner(...)`: the framework-facing adapter that receives a
`SessionHandle` and extracts `session.base_url`
- `run_mock_agent(base_url, raw_prompt)`: an external-agent-style function
that only knows an OpenAI-compatible backend URL plus prompt messages

That keeps the gateway-specific lifecycle shim visible, while showing how a
normal agent can treat the gateway as its backend URL.

This is intentionally **not** a trainer integration example. It uses:

- a tiny fake rollout server actor (Ray remote),
- the real `GlobalRequestLoadBalancer`,
- the real `GatewayServingRuntime` with `gateway_count=1`,
- the real `GatewayActor` (HTTP server),
- the real `OpenAICompatibleAgentFramework`.

The example runs CPU-only and requires no GPU. `reward_loop_worker_handles=None`
means reward scoring is skipped; `rm_scores` is zero-filled in the TQ output
(matching the framework's default behavior when no reward workers are available).

## Run

```bash
python examples/tutorial/agent_framework_get_started/minimal_e2e.py
```

The script will:

1. Start Ray (local mode).
2. Start one fake rollout server actor.
3. Create a `GlobalRequestLoadBalancer`.
4. Create a `GatewayServingRuntime` with one gateway actor.
5. Construct `OpenAICompatibleAgentFramework` with the runtime.
6. Send one chat-completions request through the gateway.
7. Call `generate_sequences(...)` which writes to a fake TransferQueue.
8. Print a JSON summary of the output.
9. Shut down the runtime and Ray.

## Architecture Reference

For the full architecture, configuration reference, and production usage, see
[docs/advance/agent_framework.rst](../../../docs/advance/agent_framework.rst).

For a full training run with GPU cluster, see
[examples/grpo_trainer/run_deepeyes_gateway_grpo.sh](../../grpo_trainer/run_deepeyes_gateway_grpo.sh).