verl-project · zackcxb · May 8, 2026 · May 8, 2026 · May 8, 2026 · May 16, 2026
@@ -0,0 +1,184 @@
+Agent Framework
+===============
+
+Last updated: 05/21/2026.
+
+.. versionadded:: 0.8.0
+   [status: alpha]
+
+.. warning::
+   Agent Framework is ready for use, but the API may change in future releases.
+
+Agent Framework is a session-based orchestration layer for agentic RL training.
+It runs user-defined agent logic (tool calls, multi-turn reasoning, environment
+interaction) inside gateway-managed sessions, collects token-level trajectories,
+and writes them to the TransferQueue for sync GRPO/PPO training.
+
+Agent Framework coexists with the legacy :doc:`Agent Loop <agent_loop>` path.
+Both produce the same trainer-consumable output; Agent Framework adds
+session-level isolation, an OpenAI-compatible HTTP interface per session, and
+structured reward dispatch.
+
+
+Overview
+--------
+
+**Design goals:**
+
+- Black-box agent runner: any async function that speaks OpenAI chat completions
+- Session isolation: each rollout sample gets its own HTTP endpoint
+- Reward flexibility: inline scoring via ``reward_loop_worker_handles`` or
+  framework-level ``reward.custom_reward_function`` bridge
+- Subclass extensibility: ``AgentFramework`` is abstract; ship your own
+
+**Non-goals:**
+
+- Defining tool semantics (that is the agent runner's job)
+- Replacing Agent Loop for single-turn or simple multi-turn use cases
+
+
+System Architecture
+-------------------
+
+.. code-block:: text
+
+   ┌─────────────────────────────────────────────────────────────┐
+   │ Trainer (main_ppo_sync.py)                                  │
+   │   └── AgentFrameworkRolloutAdapter.generate_sequences(batch) │
+   └────────────────────────────┬────────────────────────────────┘
+                                │ TensorDict prompts
+                                ▼
+   ┌─────────────────────────────────────────────────────────────┐
+   │ OpenAICompatibleAgentFramework                              │
+   │   ├── create sessions (1 per sample × rollout.n)            │
+   │   ├── launch agent_runner coroutines                        │
+   │   ├── wait for completion / finalize                        │
+   │   ├── score trajectories (reward dispatch)                  │
+   │   └── write to TransferQueue                                │
+   └────────────────────────────┬────────────────────────────────┘
+                                │ session lifecycle
+                                ▼
+   ┌─────────────────────────────────────────────────────────────┐
+   │ GatewayServingRuntime                                       │
+   │   ├── GatewayManager (round-robin session routing)          │
+   │   └── GatewayActor ×N (HTTP /v1/chat/completions per session)│
+   │         └── backend: LLMServerClient.generate(token-level)  │
+   └─────────────────────────────────────────────────────────────┘
+
+
+System Components
+-----------------
+
++--------------------------------------+-----------------------------------------------------------------------+
+| Component                            | Role                                                                  |
++======================================+=======================================================================+
+| ``AgentFramework``                   | Abstract base class. Subclasses implement ``from_config`` and         |
+|                                      | ``generate_sequences``.                                               |
++--------------------------------------+-----------------------------------------------------------------------+
+| ``OpenAICompatibleAgentFramework``   | Default subclass. Manages sessions, runs agent_runner coroutines,     |
+|                                      | dispatches reward scoring, writes TQ output.                          |
++--------------------------------------+-----------------------------------------------------------------------+
+| ``GatewayServingRuntime``            | Owns gateway actor lifecycle. ``gateway_count=0`` degrades to a thin  |
+|                                      | LLM client passthrough (no HTTP layer).                               |
++--------------------------------------+-----------------------------------------------------------------------+
+| ``GatewayActor``                     | Ray actor running an HTTP server. Exposes ``/v1/chat/completions``    |
+|                                      | to the agent runner and collects token-level trajectories.            |
++--------------------------------------+-----------------------------------------------------------------------+
+| ``AgentFrameworkRolloutAdapter``     | Trainer-facing glue in ``entry.py``. Satisfies the                    |
+|                                      | ``agent_loop_manager_class`` extension point contract.                |
++--------------------------------------+-----------------------------------------------------------------------+
+
+
+Writing a Custom Agent Runner
+-----------------------------
+
+An agent runner is any async callable with this signature:
+
+.. code:: python
+
+   async def my_agent_runner(
+       *,
+       raw_prompt: list[dict],   # OpenAI-format messages
+       session: SessionHandle,   # .base_url is the per-session endpoint
+       sample_index: int,
+       **kwargs,                 # extra fields from dataset non_tensor columns
+   ) -> None:
+       """Run agent logic against the gateway session."""
+       import httpx
+
+       async with httpx.AsyncClient() as client:
+           response = await client.post(
+               f"{session.base_url}/chat/completions",
+               json={"model": "any", "messages": raw_prompt},
+           )
+           # ... tool calls, multi-turn loops, etc.
+
+       # Signal that the session is complete (triggers trajectory finalization)
+       await client.post(session.base_url.removesuffix("/v1") + "/complete")
+
+The framework handles session creation, trajectory collection, reward scoring,
+and TQ writes. The agent runner only needs to make HTTP requests and signal
+completion.
+
+
+Configuration Reference
+-----------------------
+
+All fields live under ``actor_rollout_ref.rollout.custom.agent_framework``:
+
+.. code:: yaml
+
+   actor_rollout_ref:
+     rollout:
+       agent:
+         agent_loop_manager_class: verl.agent.framework.entry.AgentFrameworkRolloutAdapter
+       custom:
+         agent_framework:
+           # Required: FQN of your agent runner function
+           agent_runner_fqn: my_package.my_module.my_agent_runner
+
+           # Number of gateway actors (HTTP servers). 0 = no gateway, passthrough only.
+           gateway_count: 8
+
+           # Optional: kwargs passed to agent_runner via functools.partial
+           agent_runner_kwargs:
+             max_turns: 5
+
+           # Optional: tool config yaml for tool initialization
+           tool_config_path: path/to/tool_config.yaml
+
+           # Optional: timeout for session completion (seconds). null = no wait.
+           completion_timeout_seconds: 30
+
+           # Optional: max concurrent sessions (0 = unlimited)
+           max_concurrent_sessions: 0
+
+           # Optional: FQN of framework subclass (default: OpenAICompatibleAgentFramework)
+           framework_class_fqn: verl.agent.framework.framework.OpenAICompatibleAgentFramework
+
+
+Usage Example
+-------------
+
+**Full training run** (requires GPU cluster + judge model):
+
+.. code:: bash
+
+   bash examples/grpo_trainer/run_deepeyes_gateway_grpo.sh
+
+**Minimal CPU-only tutorial** (no GPU required):
+
+.. code:: bash
+
+   python examples/tutorial/agent_framework_get_started/minimal_e2e.py
+
+The tutorial demonstrates the runtime → framework → generate_sequences path
+with a fake rollout server, real gateway actor, and real framework orchestration.
+
+
+See Also
+--------
+
+- :doc:`Agent Loop <agent_loop>` — legacy single/multi-turn rollout path
+- :doc:`Agentic RL overview <../start/agentic_rl>` — high-level introduction
+- :doc:`Reward Loop <reward_loop>` — reward worker integration
@@ -144,6 +144,7 @@ verl is fast with:
    advance/rollout_trace.rst
    advance/rollout_skip.rst
    advance/agent_loop
+   advance/agent_framework
    advance/reward_loop
    data/transfer_queue.md
    advance/grafana_prometheus.md

@@ -109,18 +109,28 @@ Follow :doc:`Rollout trace<../advance/rollout_trace>` to known more about trace
 Agent Framework
 ---------------
 
+For the session-based Agent Framework (``verl.agent.framework``), which provides
+per-session HTTP isolation and structured reward dispatch for agentic RL, see
+:doc:`Agent Framework <../advance/agent_framework>`.
+
+The LangGraph-based agent path below is a separate recipe that uses LangChain
+abstractions on top of the same inference backend.
+
+LangGraph Agent
+~~~~~~~~~~~~~~~
+
 System Architecture
-~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^
 
 .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/langgraph_agent.png?raw=true
 
 System Components
-~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^
 
 +--------------------------+-----------------------------------------------------------------------------------------------+
 | Component                | Role                                                                                          |
 +==========================+===============================================================================================+
-| ChatModel                | LLM object of LangChain, used to adapt to the “generate” api provided by LLMServerClient      |
+| ChatModel                | LLM object of LangChain, used to adapt to the "generate" api provided by LLMServerClient      |
 +--------------------------+-----------------------------------------------------------------------------------------------+
 | ReactAgentLoop           | Agent adaptation layer, which by default supports a naive LangGraph Agentic.                  |
 |                          | New classes can be derived to support user-defined Agents, and the run function needs to be   |

@@ -0,0 +1,84 @@
+#!/usr/bin/env bash
+# GRPO | Agent Framework + Gateway | DeepEyes multimodal tool-use
+#
+# This script trains a vision-language model with agentic tool-use rollouts
+# using the Agent Framework + Gateway stack. Each rollout sample gets its own
+# HTTP session where the agent runner can make multi-turn chat completions
+# requests and invoke tools (e.g., image zoom).
+#
+# Prerequisites:
+#   - A judge/reward model serving at LLM_AS_A_JUDGE_BASE (default: localhost:18901)
+#   - DeepEyes dataset parquet file at TRAIN_FILE
+#   - Model checkpoint at MODEL_PATH (or HuggingFace model ID)
+#
+# See docs/advance/agent_framework.rst for architecture details.
+
+set -xeuo pipefail
+
+########################### user-adjustable ###########################
+MODEL_PATH=${MODEL_PATH:-/data1/models/Qwen/Qwen3.5-4B}
+TRAIN_FILE=${TRAIN_FILE:-/data1/datasets/deepeyes/data/data_0.1.2_visual_toolbox_v2.parquet}
+VAL_FILE=${VAL_FILE:-${TRAIN_FILE}}
+
+NGPUS_PER_NODE=${NGPUS_PER_NODE:-7}
+TOTAL_TRAINING_STEPS=${TOTAL_TRAINING_STEPS:-50}
+TRAIN_BATCH_SIZE=${TRAIN_BATCH_SIZE:-14}
+
+# Agent Framework specific
+GATEWAY_COUNT=${GATEWAY_COUNT:-7}
+MAX_TURNS=${MAX_TURNS:-5}
+COMPLETION_TIMEOUT=${COMPLETION_TIMEOUT:-}
+
+# Reward judge endpoint
+LLM_AS_A_JUDGE_BASE=${LLM_AS_A_JUDGE_BASE:-http://127.0.0.1:18901/v1}
+
+PROJECT_NAME=${PROJECT_NAME:-deepeyes_gateway_grpo}
+EXPERIMENT_NAME=${EXPERIMENT_NAME:-qwen35_4b_deepeyes_gateway_grpo}
+########################### end user-adjustable ###########################
+
+VERL_REPO_ROOT=$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." && pwd)
+cd "${VERL_REPO_ROOT}"
+
+export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:-0,1,2,3,4,5,6}"
+export VERL_FORCE_TQ_NESTED_READBACK="${VERL_FORCE_TQ_NESTED_READBACK:-1}"
+export LLM_AS_A_JUDGE_BASE
+export WANDB_MODE="${WANDB_MODE:-offline}"
+export NCCL_P2P_DISABLE="${NCCL_P2P_DISABLE:-1}"
+export NCCL_SHM_DISABLE="${NCCL_SHM_DISABLE:-1}"
+export PYTHONUNBUFFERED=1
+export HYDRA_FULL_ERROR=1
+
+python3 -m verl.trainer.main_ppo_sync \
+  --config-path="${VERL_REPO_ROOT}/recipe/deepeyes_with_gateway/configs" \
+  --config-name=deepeyes_gateway_grpo \
+  data.train_files="${TRAIN_FILE}" \
+  "data.val_files=[${VAL_FILE}]" \
+  data.train_batch_size="${TRAIN_BATCH_SIZE}" \
+  data.max_prompt_length=4096 \
+  data.max_response_length=1024 \
+  trainer.total_training_steps="${TOTAL_TRAINING_STEPS}" \
+  trainer.val_before_train=False \
+  trainer.n_gpus_per_node="${NGPUS_PER_NODE}" \
+  trainer.nnodes=1 \
+  'trainer.logger=[console,wandb,tensorboard]' \
+  trainer.project_name="${PROJECT_NAME}" \
+  trainer.experiment_name="${EXPERIMENT_NAME}" \
+  trainer.save_freq=-1 \
+  trainer.test_freq=-1 \
+  actor_rollout_ref.model.use_fused_kernels=False \
+  actor_rollout_ref.model.use_remove_padding=True \
+  '+actor_rollout_ref.model.override_config.attn_implementation=eager' \
+  actor_rollout_ref.actor.ppo_mini_batch_size="${TRAIN_BATCH_SIZE}" \
+  actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
+  actor_rollout_ref.rollout.name=vllm \
+  actor_rollout_ref.rollout.n=4 \
+  actor_rollout_ref.rollout.response_length=1024 \
+  actor_rollout_ref.rollout.max_model_len=8192 \
+  actor_rollout_ref.rollout.max_num_seqs=4 \
+  actor_rollout_ref.rollout.max_num_batched_tokens=16384 \
+  actor_rollout_ref.rollout.gpu_memory_utilization=0.55 \
+  actor_rollout_ref.rollout.enforce_eager=True \
+  actor_rollout_ref.rollout.dtype=float16 \
+  actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
+  actor_rollout_ref.rollout.custom.agent_framework.gateway_count="${GATEWAY_COUNT}" \
+  actor_rollout_ref.rollout.custom.agent_framework.agent_runner_kwargs.max_turns="${MAX_TURNS}"
diff --git a/examples/tutorial/agent_framework_get_started/README.md b/examples/tutorial/agent_framework_get_started/README.md
@@ -0,0 +1,60 @@
+# Agent Framework Get Started
+
+Minimal runnable entry for the `verl.agent.framework` + `verl.agent.gateway`
+stack (PR #6299).
+
+It demonstrates three boundaries:
+
+1. The caller creates `GatewayServingRuntime` externally (entry.py does this
+   in production; here we do it manually for visibility).
+2. `GatewayServingRuntime` is injected into `OpenAICompatibleAgentFramework`.
+3. The framework is exercised with one `generate_sequences(...)` call on a
+   minimal `TensorDict`.
+
+Inside the script, the agent side is split into two layers:
+
+- `agent_runner(...)`: the framework-facing adapter that receives a
+  `SessionHandle` and extracts `session.base_url`
+- `run_mock_agent(base_url, raw_prompt)`: an external-agent-style function
+  that only knows an OpenAI-compatible backend URL plus prompt messages
+
+That keeps the gateway-specific lifecycle shim visible, while showing how a
+normal agent can treat the gateway as its backend URL.
+
+This is intentionally **not** a trainer integration example. It uses:
+
+- a tiny fake rollout server actor (Ray remote),
+- the real `GlobalRequestLoadBalancer`,
+- the real `GatewayServingRuntime` with `gateway_count=1`,
+- the real `GatewayActor` (HTTP server),
+- the real `OpenAICompatibleAgentFramework`.
+
+The example runs CPU-only and requires no GPU. `reward_loop_worker_handles=None`
+means reward scoring is skipped; `rm_scores` is zero-filled in the TQ output
+(matching the framework's default behavior when no reward workers are available).
+
+## Run
+
+```bash
+python examples/tutorial/agent_framework_get_started/minimal_e2e.py
+```
+
+The script will:
+
+1. Start Ray (local mode).
+2. Start one fake rollout server actor.
+3. Create a `GlobalRequestLoadBalancer`.
+4. Create a `GatewayServingRuntime` with one gateway actor.
+5. Construct `OpenAICompatibleAgentFramework` with the runtime.
+6. Send one chat-completions request through the gateway.
+7. Call `generate_sequences(...)` which writes to a fake TransferQueue.
+8. Print a JSON summary of the output.
+9. Shut down the runtime and Ray.
+
+## Architecture Reference
+
+For the full architecture, configuration reference, and production usage, see
+[docs/advance/agent_framework.rst](../../../docs/advance/agent_framework.rst).
+
+For a full training run with GPU cluster, see
+[examples/grpo_trainer/run_deepeyes_gateway_grpo.sh](../../grpo_trainer/run_deepeyes_gateway_grpo.sh).