Skip to content

[rollout, tool] feat: add experimental agent framework and gateway runtime#6299

Draft
zackcxb wants to merge 10 commits into
verl-project:mainfrom
zackcxb:gateway-framework-pr-source
Draft

[rollout, tool] feat: add experimental agent framework and gateway runtime#6299
zackcxb wants to merge 10 commits into
verl-project:mainfrom
zackcxb:gateway-framework-pr-source

Conversation

@zackcxb
Copy link
Copy Markdown

@zackcxb zackcxb commented May 9, 2026

What does this PR do?

This PR adds an experimental agent framework and gateway runtime for multi-turn agent-style rollout in VERL, according to #5790.

Specifically, it:

  • adds verl.experimental.agent_framework for a new abstraction for agent systems, with an example implementation that is compatible with TransferQueue,
  • adds verl.experimental.agent_gateway for OpenAI-compatible session serving and sticky session routing, and tool-parser wiring, with a GatewayServingRuntimethat delegates backend routing to LLMServerClient,
  • adds CPU tests covering framework contract, gateway actor / manager behavior, session runtime lifecycle, and multimodal postprocess,

Note: the PR currently also includes an opt-in TransferQueue nested-readback bridge in
verl.utils.transferqueue_utils (gated byVERL_FORCE_TQ_NESTED_READBACK=1). This is a temporary workaround for a known issue #6261. The commit is not part of the PR and will be removed once the bug is fixed.

Related: RFC #5790. Supersedes draft PR #5931 (will be closed).

WIP:

  • CLIAgentFramework base + reference external-agent recipe (Deepeyes and SWE-agent)
  • RewardLoopWorker / NaiveRewardManager integration (dict rewards, sandbox fusion, reward router)
  • GatewayActor default placement strategy (e.g. at least one per node) once multi-node validation is in
  • Remove the VERL_FORCE_TQ_NESTED_READBACK bridge when the upstream TransferQueue fix lands
  • Documentation under docs/ once the CLI framework direction is committed

Checklist Before Starting

Test

pytest -q tests/agent/tests

Result: 60 passed, 4 warnings (framework, gateway, runtime).

The tutorial doubles as a runnable smoke:

python examples/tutorial/agent_framework_get_started/minimal_e2e.py

Runs one full generate_sequences() through a CPU-only fake rollout
server and prints a summary JSON.

Real-rollout evidence from a downstream branch that carries a Deepeyes_with_gateway recipe (not part of this PR): a 24-step GRPO run on multi-turn multimodal data (Qwen3.5-4B, 7× RTX 3090 train + 1× local judge) produced a real learning curve — critic/rewards/mean moved from ~0.31 at step 1 to ~1.45 by step 24, with non-zero advantages throughout and actor/grad_norm stable in the 3–20 range.

API and Usage Example

Public APIs added:

  • verl.agent.frameworkAgentFramework,
    OpenAICompatibleAgentFramework, TrajectoryAssembler,
    apply_multi_modal_postprocess
  • verl.agent.gatewayGatewayServingRuntime, GatewayManager,
    GatewayActor

Minimum viable wiring (see
examples/tutorial/agent_framework_get_started/minimal_e2e.py for the
full runnable example):

from verl.agent.framework import OpenAICompatibleAgentFramework
from verl.agent.gateway.runtime import GatewayServingRuntime
from verl.workers.rollout.llm_server import LLMServerClient

runtime = GatewayServingRuntime(
    llm_client=llm_client,  # typically from LLMServerManager.get_client()
    gateway_count=1,
    gateway_actor_kwargs={"tokenizer": tokenizer, "host": "127.0.0.1"},
)

async def agent_runner(*, raw_prompt, session, sample_index):
    # The agent only needs the gateway base URL. In production this
    # would typically call a remote service or spawn an external agent.
    await run_external_agent(base_url=session.base_url, raw_prompt=raw_prompt)

framework = OpenAICompatibleAgentFramework(
    session_runtime=runtime,
    agent_runner=agent_runner,
    reward_fn=my_reward_fn,  # SessionRewardContext -> list[float]
)

stats = await framework.generate_sequences(
    prompts,
    global_steps=global_steps,
    partition_id="train",
)

generate_sequences() writes finalized trajectories directly to
TransferQueue with key "{uid}_{session_id}_{index}", matching
AgentLoopWorkerTQ._agent_loop_postprocess()'s field / tag layout, and
returns a stats dict with success / failure counts and short failure
reasons.

Design & Code Changes

High-level changes:

  • AgentFramework base class + OpenAICompatibleAgentFramework
    concrete implementation own session orchestration (create_session
    agent_runnerfinalize_session), trajectory assembly,
    multimodal post-processing, reward scoring, and TransferQueue writes.
    Per-session failures are isolated via
    asyncio.gather(..., return_exceptions=True) so one bad session
    does not cancel the rest of the batch.
  • GatewayActor provides OpenAI Chat Completions over sticky sessions
    with prefix-consistency checks, tool-parser decoding, and multimodal
    media accumulation. GatewayManager routes new sessions by
    least-active count. GatewayServingRuntime owns gateway actor
    lifecycle and delegates backend routing to LLMServerClient
    (no duplicate routing surface).
  • Multimodal trajectory post-process builds trainer-consumable
    multi_modal_inputs and (4, seq_len) position ids inside the
    framework, so VLM sessions do not need per-recipe glue.
  • Observability stays in standard logging; the trainer's metric dict
    is intentionally not touched in this PR.

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks:
    pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
  • Add / Update the documentation. —
    deferred to a follow-up doc PR; inline docstrings and the tutorial
    README ship with this PR.
  • Add unit or end-to-end test(s) to the CI workflow
    to cover all the code. Focused framework / gateway CPU tests are
    included and routed to cpu_unit_tests.yml via the *_on_cpu.py
    naming convention.
  • Once your PR is ready for CI, send a message in
    the ci-request channel
    in the verl Slack workspace.
    (If not accessible, please try the Feishu group (飞书群).)
  • If your PR is related to the recipe submodule, please also
    update the reference to the submodule commit via
    git submodule update --remote or cd recipe && git pull origin main. —
    Not applicable: this PR does not include the recipe submodule.

zackcxb and others added 3 commits May 8, 2026 15:16
…ntime -- make the new public agent surface own session orchestration and backend routing

This reconstructs the framework and gateway surface directly on upstream/main, without carrying the old experimental-agent migration history.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… boundary, replay output API, and tutorial path reviewable

These tests exercise the framework/gateway contract on CPU and keep the minimal tutorial aligned with the unified generate_sequences entry.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llouts -- FSDP no-padding workers require readback tensors to keep jagged semantics

The upstream TransferQueue readback can densify equal-shaped sequence fields; the gateway sync smoke still needs an opt-in bridge to normalize token fields before compute_log_prob.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the verl.agent package, establishing a framework and an OpenAI-compatible gateway for agent-based reinforcement learning. Key additions include the OpenAICompatibleAgentFramework for session orchestration and trajectory scoring, and the GatewayActor for managing agent interactions and trajectory collection. The PR also includes TransferQueue utilities to ensure nested tensor consistency and a comprehensive suite of CPU-only unit tests. Review feedback identified a critical issue in the gateway where response_logprobs could become misaligned with response_ids if the backend returns inconsistent logprob data, potentially causing validation failures during trajectory assembly.

Comment on lines +580 to +581
if output.log_probs is not None:
active_trajectory.response_logprobs.extend(list(output.log_probs))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The response_logprobs list in TrajectoryBuffer can become misaligned with response_ids if the backend (rollout server) does not consistently return logprobs for every turn, or if logprobs tracking starts after the first turn.

Specifically:

  1. If Turn 1 returns no logprobs but Turn 2 does, response_logprobs will only contain Turn 2's logprobs, while response_ids contains tokens from both turns.
  2. If Turn 1 had logprobs but Turn 2 returns None, response_logprobs will not be extended for the new tokens, causing a length mismatch.

This mismatch will cause a ValueError in validate_trajectory during assembly. The gateway should ensure length alignment by padding with zeros when logprobs are partially missing or when tracking begins mid-session.

            if output.log_probs is not None:
                # Ensure alignment if we just started receiving logprobs for this trajectory
                if not active_trajectory.response_logprobs and active_trajectory.response_ids:
                    active_trajectory.response_logprobs = [0.0] * len(active_trajectory.response_ids)
                active_trajectory.response_logprobs.extend(list(output.log_probs))
            elif active_trajectory.response_logprobs:
                # Pad with zeros if we were tracking logprobs but this turn has none
                active_trajectory.response_logprobs.extend([0.0] * len(response_ids))

@wzhgba
Copy link
Copy Markdown

wzhgba commented May 14, 2026

Per-sample agent dispatch and compatibility with existing agent loops

First, thanks for this PR — the gateway + framework design is clean and the TQ output alignment means downstream trainers can adopt it with minimal changes. A question about agent compatibility:

The issue

In the current AgentLoopManager, the agent_name field in the RL dataset allows per-sample agent dispatch: some samples can use single_turn_agent (math reasoning), others can use tool_agent (code, tool-use tasks), and they are all handled within the same training batch.

# Current behavior: each sample picks its agent loop
batch.non_tensor_batch["agent_name"] = np.array([
    "single_turn_agent",   # sample 0: math
    "tool_agent",          # sample 1: coding
    ...
])

In OpenAICompatibleAgentFramework, the agent_runner is a single callable injected at construction time — there is currently no mechanism for the dataset's agent_name to dispatch to different runners. This means mixed-training setups (math RL + tool-use RL + sandbox RL) cannot coexist within one training run without a dispatcher.

What's already in place

The framework already has _runner_kwargs_for_sample() (line 326 in framework.py), which passes per-sample fields from the dataset to the runner. Adding agent_name to these kwargs would be a one-line, backward-compatible change:

def _runner_kwargs_for_sample(self, sample_fields):
    kwargs = {}
    if "tools_kwargs" in sample_fields:
        kwargs["tools_kwargs"] = sample_fields["tools_kwargs"]
    if "agent_name" in sample_fields:
        kwargs["agent_name"] = sample_fields["agent_name"]
    return kwargs

With that, a thin dispatcher wrapper (DispatchedAgentRunner) can route to different runner backends per sample — exactly the same semantics as the current AgentLoopManager.

Question

Are there plans to adapt the existing single_turn_agent and tool_agent as agent_runner implementations under the new framework, so that users can drop in a dispatcher and keep mixed-training workflows without regression? If not, I'm happy to contribute them as a follow-up.

session.request_tools = tools
self._touch_session(session)

return JSONResponse(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope to add compatibility of stream mode chat completion for the openclaw.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing the agent_name problem, I’ll resolve it soon.
We know stream output support is needed, yet it won’t be added in this initial PR.
Do you have existing use cases requiring seamless fully online RL with no impact on user experience?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick response!
On agent dispatch — looking forward to the update, happy to help test when ready.
On stream mode — completely understandable this sits outside the initial PR. OpenClaw doesn't currently support disabling streaming via config; doing so would require non-trivial changes to the upstream pi-agent dependency. We've cherry-picked this PR onto our branch and are doing light adaptation + experiments on our side. Stream support can be a follow-up and we'd be glad to collaborate on it then.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds great! I have resolved the agent_name issue and also completed a series of other code refactors. Hopefully these adjustments won’t disrupt your established workflow.
We recognize the unique requirements of online RL scenarios and intend to add streaming mode support in future iterations. You are more than welcome to submit related contributions once this initial PR is finalized and merged.
Additionally, I’d like to mention that we plan to gradually migrate this PR to the uni-agent repository instead of keeping it within VERL: verl-project/uni-agent#25
Since the Gateway + Framework works as an independent standalone module, this migration should require little extra adaptation work.

zackcxb and others added 7 commits May 18, 2026 11:57
Remove the unused padded assembler surface and trajectory identity placeholders, then make OpenAICompatibleAgentFramework satisfy the sync trainer rollout-manager contract directly.

Add the transitional build_agent_framework entry factory so recipe adapters can shrink without adding a RolloutManager wrapper or hydra namespace fallback.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…traction

Switch framework reward path to direct RewardLoopWorker dispatch with
score-last + broadcast strategy, matching legacy AgentLoopWorkerTQ.
Delete _build_reward_dispatcher / RewardFn / SessionRewardContext —
extension via subclass override of _score_trajectories.

- entry.py: self-load HFModelConfig; promote adapter from recipe to core
- framework.py: inline _score_trajectories with reward_loop_worker_handles
- types.py: add complete_session to SessionRuntime Protocol
- helpers.py: delete (dead code after Phase A cleanup)
- __init__.py: remove stale re-exports

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arse tolerance

- runtime.py: spread GatewayActors across CPU nodes via
  NodeAffinitySchedulingStrategy round-robin (mirrors AgentLoopWorker)
- runtime.py + manager.py: expose complete_session for explicit
  session lifecycle signaling
- gateway.py: add _FINISH_REASON_MAP normalizing vLLM stop reasons to
  OpenAI spec values; document vLLM information loss
- gateway.py: loosen _validate_tools to accept non-dict tool schemas
- gateway.py: pass parsed_tools to extract_tool_calls with tolerant
  pydantic parse (try/except fallback to None)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ool parse

- test_generate_sequences: migrate from reward_fn lambda to
  monkeypatch pattern; add score-last broadcast + no-handles tests
- test_gateway_actor: parametrize finish_reason normalization;
  add tool parse tolerance tests
- test_session_runtime: add round-robin placement assertion
- support.py: delete (inlined into individual test files)
- minimal_e2e.py: align with entry.py adapter promotion

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…sion concurrency cap

Always write rollout_log_probs and rm_scores (zero-filled when absent)
so the sync trainer's select_fields never hits a missing field across a
mixed batch where some sessions lack logprobs/reward. Without this,
bypass-mode _compute_old_log_prob KeyErrors on rollout_log_probs.

Add opt-in max_concurrent_sessions (0 = unlimited) backed by a
lazy-initialized asyncio.Semaphore that rebinds if the running loop
changes, since Ray actors may run sessions on a different loop than
__init__.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ferences

Add docs/advance/agent_framework.rst covering architecture, components,
agent runner authoring guide, and configuration reference. Link from
docs/start/agentic_rl.rst and docs/index.rst toctree.

Add examples/grpo_trainer/run_deepeyes_gateway_grpo.sh as a parameterized
training script following the existing run_qwen2_5_vl_7b_fsdp.sh pattern.

Update tutorial README to reflect current architecture (reward dispatch,
zero-fill fields, session concurrency).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wzhgba
Copy link
Copy Markdown

wzhgba commented May 28, 2026

The Agent Gateway data model is a 1:1 mapping: one session = one linear conversation = one trajectory. All state is stored as singleton fields on GatewaySessionState:

@dataclass
class GatewaySessionState:
    message_history: list[dict] = field(...)           # single conversation
    active_trajectory: TrajectoryBuffer | None = None  # single token buffer
    trajectories: list[Trajectory] = field(...)        # sequential history

In a sub-agent scenario, the parent agent invokes multiple tools that each run independent conversations through the same session's /v1/chat/completions endpoint. Because all sub-agents share the same singleton state, the gateway cannot maintain correct, isolated trajectories.

Limitation 1: Shared message_history Causes Context Cross-Contamination

After sub-agent A updates session.message_history, sub-agent B's next request is prefix-matched against A's modified history. Since B's conversation is logically independent from A's, _is_request_context_prefix produces a false negative — B's request is treated as a context divergence, triggering a full re-materialization and re-encoding. The KV cache optimization is lost, and worse, the resulting trajectories contain interleaved tokens from unrelated sub-conversations.

Limitation 2: Single active_trajectory Interleaves Unrelated Token Sequences

With only one TrajectoryBuffer, the response tokens from sub-agent A and sub-agent B are appended to the same buffer in arrival order. This produces semantically corrupted training data where a single trajectory jumps between unrelated dialogues.

Proposed Approach: Internal Multi-Trajectory Prefix Detection

The API contract does not need to change. The existing /v1/chat/completions endpoint can remain unchanged. Instead, the gateway should internally maintain a set of active sub-contexts — each with its own message_history, active_trajectory, and supporting metadata.

When a request arrives:

  1. Perform prefix detection against all active sub-contexts (not just the last one).
  2. Select the sub-context whose history is the best prefix match for the incoming messages.
  3. Append the assistant response to that sub-context's trajectory.
  4. On session finalization, collect trajectories from all sub-contexts.

Minimal Change Plan

# Change Location
1 Add SubContext holding message_history, active_trajectory, request_tools, image_data, video_data types.py
2 Replace singleton fields in GatewaySessionState with sub_contexts: list[SubContext] + `current_sub: SubContext None` for backward compatibility
3 In _handle_chat_completions, prefix-match incoming messages against all SubContexts; pick the best match (longest prefix) gateway.py
4 On finalization, materialize and collect trajectories from all sub-contexts gateway.py

The generation_lock and request serialization remain as-is — sequential state mutation is fine. The key insight is that the gateway can match each incoming request to the correct conversation branch internally, without any API-level change.


当前 Agent Gateway 的数据模型是 1:1 映射:一个 session = 一条线性对话 = 一个 trajectory。所有状态以单例字段形式存储于 GatewaySessionState

@dataclass
class GatewaySessionState:
    message_history: list[dict] = field(...)           # 单条对话
    active_trajectory: TrajectoryBuffer | None = None  # 单个 token 缓冲区
    trajectories: list[Trajectory] = field(...)        # 顺序历史

在 sub-agent 场景中,主 agent 调用多个工具,每个工具通过同一个 session 的 /v1/chat/completions 端点展开独立对话。由于所有子 agent 共享同一份单例状态,gateway 无法为每个子对话维护正确、隔离的轨迹。

限制 1:共享 message_history 导致上下文交叉污染

子 agent A 更新 session.message_history 后,子 agent B 的下一次请求会以 A 修改后的历史进行前缀匹配。由于 B 的对话逻辑上与 A 无关,_is_request_context_prefix 会产生误判——B 的请求被当作上下文发散,触发全量重编码。KV cache 优化丢失,且最终生成的轨迹中混入了不相关子对话的 token。

限制 2:单一 active_trajectory 交错写入无关 token 序列

由于只有一个 TrajectoryBuffer,子 agent A 和子 agent B 的响应 token 会按到达顺序追加到同一个缓冲区。结果是一条 trajectory 在互不相关的对话之间跳跃,产生语义混乱的训练数据。

建议方案:内部多轨迹前缀检测

API 契约无需变更。现有 /v1/chat/completions 端点保持不变。Gateway 内部维护一组活跃子上下文——每个子上下文拥有独立的 message_historyactive_trajectory 及相关元数据。

请求到达时的处理流程:

  1. 所有活跃子上下文做前缀检测(而非仅对最新的一个)。
  2. 选择与入站 messages 前缀匹配最长的那个子上下文。
  3. 将 assistant 响应追加到该子上下文的 trajectory。
  4. Session 最终化时,收集所有子上下文的 trajectories。

最小改动方案

# 改动 位置
1 新增 SubContext,包含 message_historyactive_trajectoryrequest_toolsimage_datavideo_data types.py
2 GatewaySessionState 中的单例字段替换为 sub_contexts: list[SubContext] + `current_sub: SubContext None`(向后兼容)
3 _handle_chat_completions 中,将入站 messages 与所有 SubContext 做前缀匹配,选最佳匹配(最长前缀) gateway.py
4 最终化时,从所有子上下文物化并收集 trajectories gateway.py

generation_lock 和请求串行化保持不变——顺序执行状态变更是合理的。核心思路是 gateway 能在内部将每个请求匹配到正确的对话分支,无需任何 API 层面的改动。

@zackcxb
Copy link
Copy Markdown
Author

zackcxb commented May 28, 2026

The Agent Gateway data model is a 1:1 mapping: one session = one linear conversation = one trajectory. All state is stored as singleton fields on GatewaySessionState:

@dataclass
class GatewaySessionState:
    message_history: list[dict] = field(...)           # single conversation
    active_trajectory: TrajectoryBuffer | None = None  # single token buffer
    trajectories: list[Trajectory] = field(...)        # sequential history

In a sub-agent scenario, the parent agent invokes multiple tools that each run independent conversations through the same session's /v1/chat/completions endpoint. Because all sub-agents share the same singleton state, the gateway cannot maintain correct, isolated trajectories.

Limitation 1: Shared message_history Causes Context Cross-Contamination

After sub-agent A updates session.message_history, sub-agent B's next request is prefix-matched against A's modified history. Since B's conversation is logically independent from A's, _is_request_context_prefix produces a false negative — B's request is treated as a context divergence, triggering a full re-materialization and re-encoding. The KV cache optimization is lost, and worse, the resulting trajectories contain interleaved tokens from unrelated sub-conversations.

Limitation 2: Single active_trajectory Interleaves Unrelated Token Sequences

With only one TrajectoryBuffer, the response tokens from sub-agent A and sub-agent B are appended to the same buffer in arrival order. This produces semantically corrupted training data where a single trajectory jumps between unrelated dialogues.

Proposed Approach: Internal Multi-Trajectory Prefix Detection

The API contract does not need to change. The existing /v1/chat/completions endpoint can remain unchanged. Instead, the gateway should internally maintain a set of active sub-contexts — each with its own message_history, active_trajectory, and supporting metadata.

When a request arrives:

  1. Perform prefix detection against all active sub-contexts (not just the last one).
  2. Select the sub-context whose history is the best prefix match for the incoming messages.
  3. Append the assistant response to that sub-context's trajectory.
  4. On session finalization, collect trajectories from all sub-contexts.

Minimal Change Plan

Change Location

1 Add SubContext holding message_history, active_trajectory, request_tools, image_data, video_data types.py
2 Replace singleton fields in GatewaySessionState with sub_contexts: list[SubContext] + current_sub: SubContext None for backward compatibility
3 In _handle_chat_completions, prefix-match incoming messages against all SubContexts; pick the best match (longest prefix) gateway.py
4 On finalization, materialize and collect trajectories from all sub-contexts gateway.py
The generation_lock and request serialization remain as-is — sequential state mutation is fine. The key insight is that the gateway can match each incoming request to the correct conversation branch internally, without any API-level change.

当前 Agent Gateway 的数据模型是 1:1 映射:一个 session = 一条线性对话 = 一个 trajectory。所有状态以单例字段形式存储于 GatewaySessionState

@dataclass
class GatewaySessionState:
    message_history: list[dict] = field(...)           # 单条对话
    active_trajectory: TrajectoryBuffer | None = None  # 单个 token 缓冲区
    trajectories: list[Trajectory] = field(...)        # 顺序历史

在 sub-agent 场景中,主 agent 调用多个工具,每个工具通过同一个 session 的 /v1/chat/completions 端点展开独立对话。由于所有子 agent 共享同一份单例状态,gateway 无法为每个子对话维护正确、隔离的轨迹。

限制 1:共享 message_history 导致上下文交叉污染

子 agent A 更新 session.message_history 后,子 agent B 的下一次请求会以 A 修改后的历史进行前缀匹配。由于 B 的对话逻辑上与 A 无关,_is_request_context_prefix 会产生误判——B 的请求被当作上下文发散,触发全量重编码。KV cache 优化丢失,且最终生成的轨迹中混入了不相关子对话的 token。

限制 2:单一 active_trajectory 交错写入无关 token 序列

由于只有一个 TrajectoryBuffer,子 agent A 和子 agent B 的响应 token 会按到达顺序追加到同一个缓冲区。结果是一条 trajectory 在互不相关的对话之间跳跃,产生语义混乱的训练数据。

建议方案:内部多轨迹前缀检测

API 契约无需变更。现有 /v1/chat/completions 端点保持不变。Gateway 内部维护一组活跃子上下文——每个子上下文拥有独立的 message_historyactive_trajectory 及相关元数据。

请求到达时的处理流程:

  1. 所有活跃子上下文做前缀检测(而非仅对最新的一个)。
  2. 选择与入站 messages 前缀匹配最长的那个子上下文。
  3. 将 assistant 响应追加到该子上下文的 trajectory。
  4. Session 最终化时,收集所有子上下文的 trajectories。

最小改动方案

改动 位置

1 新增 SubContext,包含 message_historyactive_trajectoryrequest_toolsimage_datavideo_data types.py
2 GatewaySessionState 中的单例字段替换为 sub_contexts: list[SubContext] + current_sub: SubContext None(向后兼容)
3 _handle_chat_completions 中,将入站 messages 与所有 SubContext 做前缀匹配,选最佳匹配(最长前缀) gateway.py
4 最终化时,从所有子上下文物化并收集 trajectories gateway.py
generation_lock 和请求串行化保持不变——顺序执行状态变更是合理的。核心思路是 gateway 能在内部将每个请求匹配到正确的对话分支,无需任何 API 层面的改动。

这个诉求很合理,我们最近确实在考虑对subagent场景的支持,允许维护多个active trajectories确实有意义。不过关于子上下文的匹配这里我有疑问,“选择与入站 messages 前缀匹配最长的那个子上下文”这个标准合理吗?subagent发出的openAI request和普通的请求是否有其他更稳固的办法做区分,还是说这个取决于agent的行为? 如果有具体的agent例子会很有帮助。

@gxlvera
Copy link
Copy Markdown

gxlvera commented May 28, 2026

The Agent Gateway data model is a 1:1 mapping: one session = one linear conversation = one trajectory. All state is stored as singleton fields on GatewaySessionState:

@dataclass
class GatewaySessionState:
    message_history: list[dict] = field(...)           # single conversation
    active_trajectory: TrajectoryBuffer | None = None  # single token buffer
    trajectories: list[Trajectory] = field(...)        # sequential history

In a sub-agent scenario, the parent agent invokes multiple tools that each run independent conversations through the same session's /v1/chat/completions endpoint. Because all sub-agents share the same singleton state, the gateway cannot maintain correct, isolated trajectories.

Limitation 1: Shared message_history Causes Context Cross-Contamination

After sub-agent A updates session.message_history, sub-agent B's next request is prefix-matched against A's modified history. Since B's conversation is logically independent from A's, _is_request_context_prefix produces a false negative — B's request is treated as a context divergence, triggering a full re-materialization and re-encoding. The KV cache optimization is lost, and worse, the resulting trajectories contain interleaved tokens from unrelated sub-conversations.

Limitation 2: Single active_trajectory Interleaves Unrelated Token Sequences

With only one TrajectoryBuffer, the response tokens from sub-agent A and sub-agent B are appended to the same buffer in arrival order. This produces semantically corrupted training data where a single trajectory jumps between unrelated dialogues.

Proposed Approach: Internal Multi-Trajectory Prefix Detection

The API contract does not need to change. The existing /v1/chat/completions endpoint can remain unchanged. Instead, the gateway should internally maintain a set of active sub-contexts — each with its own message_history, active_trajectory, and supporting metadata.

When a request arrives:

  1. Perform prefix detection against all active sub-contexts (not just the last one).
  2. Select the sub-context whose history is the best prefix match for the incoming messages.
  3. Append the assistant response to that sub-context's trajectory.
  4. On session finalization, collect trajectories from all sub-contexts.

Minimal Change Plan

Change Location

1 Add SubContext holding message_history, active_trajectory, request_tools, image_data, video_data types.py
2 Replace singleton fields in GatewaySessionState with sub_contexts: list[SubContext] + current_sub: SubContext None for backward compatibility
3 In _handle_chat_completions, prefix-match incoming messages against all SubContexts; pick the best match (longest prefix) gateway.py
4 On finalization, materialize and collect trajectories from all sub-contexts gateway.py
The generation_lock and request serialization remain as-is — sequential state mutation is fine. The key insight is that the gateway can match each incoming request to the correct conversation branch internally, without any API-level change.

当前 Agent Gateway 的数据模型是 1:1 映射:一个 session = 一条线性对话 = 一个 trajectory。所有状态以单例字段形式存储于 GatewaySessionState

@dataclass
class GatewaySessionState:
    message_history: list[dict] = field(...)           # 单条对话
    active_trajectory: TrajectoryBuffer | None = None  # 单个 token 缓冲区
    trajectories: list[Trajectory] = field(...)        # 顺序历史

在 sub-agent 场景中,主 agent 调用多个工具,每个工具通过同一个 session 的 /v1/chat/completions 端点展开独立对话。由于所有子 agent 共享同一份单例状态,gateway 无法为每个子对话维护正确、隔离的轨迹。

限制 1:共享 message_history 导致上下文交叉污染

子 agent A 更新 session.message_history 后,子 agent B 的下一次请求会以 A 修改后的历史进行前缀匹配。由于 B 的对话逻辑上与 A 无关,_is_request_context_prefix 会产生误判——B 的请求被当作上下文发散,触发全量重编码。KV cache 优化丢失,且最终生成的轨迹中混入了不相关子对话的 token。

限制 2:单一 active_trajectory 交错写入无关 token 序列

由于只有一个 TrajectoryBuffer,子 agent A 和子 agent B 的响应 token 会按到达顺序追加到同一个缓冲区。结果是一条 trajectory 在互不相关的对话之间跳跃,产生语义混乱的训练数据。

建议方案:内部多轨迹前缀检测

API 契约无需变更。现有 /v1/chat/completions 端点保持不变。Gateway 内部维护一组活跃子上下文——每个子上下文拥有独立的 message_historyactive_trajectory 及相关元数据。

请求到达时的处理流程:

  1. 所有活跃子上下文做前缀检测(而非仅对最新的一个)。
  2. 选择与入站 messages 前缀匹配最长的那个子上下文。
  3. 将 assistant 响应追加到该子上下文的 trajectory。
  4. Session 最终化时,收集所有子上下文的 trajectories。

最小改动方案

改动 位置

1 新增 SubContext,包含 message_historyactive_trajectoryrequest_toolsimage_datavideo_data types.py
2 GatewaySessionState 中的单例字段替换为 sub_contexts: list[SubContext] + current_sub: SubContext None(向后兼容)
3 _handle_chat_completions 中,将入站 messages 与所有 SubContext 做前缀匹配,选最佳匹配(最长前缀) gateway.py
4 最终化时,从所有子上下文物化并收集 trajectories gateway.py
generation_lock 和请求串行化保持不变——顺序执行状态变更是合理的。核心思路是 gateway 能在内部将每个请求匹配到正确的对话分支,无需任何 API 层面的改动。

Hi, thank you so much for your suggestion 😊 Your design is similar to mine. But I think the limitation you mentioned is not very accurate. Also, I want to add something about the real benefit of using prefix trie.

Limitation 1: Shared message_history Causes Context Cross-Contamination

After sub-agent A updates session.message_history, sub-agent B's next request is prefix-matched against A's modified history. Since B's conversation is logically independent from A's, _is_request_context_prefix produces a false negative — B's request is treated as a context divergence, triggering a full re-materialization and re-encoding. The KV cache optimization is lost, and worse, the resulting trajectories contain interleaved tokens from unrelated sub-conversations.

Limitation 2: Single active_trajectory Interleaves Unrelated Token Sequences

With only one TrajectoryBuffer, the response tokens from sub-agent A and sub-agent B are appended to the same buffer in arrival order. This produces semantically corrupted training data where a single trajectory jumps between unrelated dialogues.

In response to the above limitation, it is correct that session.message_history is shared at the session level, so sub-agent B's request is compared against sub-agent A's current history. If B is logically independent from A, _is_request_context_prefix(...) will usually return False, which triggers a full re-materialization and re-encoding path.

However, the claim that "the resulting trajectories contain interleaved tokens from unrelated sub-conversations" is not accurate for the current prefix-mismatch path. In that path, gateway does not append B's generated tokens into A's active trajectory. Instead, it materializes A's active trajectory, starts a fresh TrajectoryBuffer for B, and later installs B's buffer as the new session.active_trajectory.

The gateway checks:

elif _is_request_context_prefix(session=session, messages=messages, tools=tools):
    ...
else:
    materialized_trajectory = self._build_materialized_trajectory(
        session=session,
        active=session.active_trajectory,
    )
    ...   request_chat_template_kwargs=request_chat_template_kwargs,
    )
    active_trajectory = TrajectoryBuffer(prompt_ids=prompt_ids)

Because B's messages are not a prefix continuation of A's session.message_history, gateway enters the else branch.

That branch creates a fresh token buffer for B:

active_trajectory = TrajectoryBuffer(prompt_ids=prompt_ids)

This is the key point: B does not reuse A's active_trajectory. B starts from a newly created TrajectoryBuffer.

So the real limitation is not token-level interleaving inside one trajectory. The real limitation is that a session only keeps one current message_history, so when B update session state, A's raw message_history is no longer retained as the session's current message list. Plus, since the current design keeps a single shared active trajectory and messages, we have to use generation lock. But if we use a prefix trie to store the messages, we don't need to maintain a single active trajectory or materialize other trajectories, which allows concurrency for LLM generation to improve throughput.

The Real Problem and Gain of Prefix Trie Storage

The real problem of current limitation is

  1. A's raw messages list is lost so branch reattachment is not support under current implementation;
  2. Retokenization and duplicated storage for shared prefix for multiple trajectotires;
  3. Concurrency for LLM generation is not supported

I will explain why these three problems exist and why a Prefix Trie Storage approach could fix those problems.

1. Support branch reattachment

Current limitation only compares each new request against the current active trajectory, which is also the most recently committed branch. It does not compare against older materialized trajectories or a branch tree. Therefore, if an agent backtracks to an older branch and continues from there, gateway cannot recognize that continuation.

Consider this branching pattern:

root
└─ [sys] You are helpful. Reply in 1 short sentence.
    └─ [usr] Pick any random English word and reply with just that word.
        ├─ [asst] Luminous. 
        ├─ [asst] Serendipity. (winner) 
        │   └─ [usr] Now pick another.
        │       └─ [asst] Whimsical. 
        └─ [asst] Ephemeral. 

The agent sends the same parent messages multiple times and gets multiple candidate assistant responses:

[sys, usr] -> Luminous.
[sys, usr] -> Serendipity.
[sys, usr] -> Ephemeral.

Then it selects Serendipity. as the winner and continues:

[sys, usr, assistant Serendipity., user Now pick another.] -> Whimsical.

A branch-aware trajectory builder should produce 3 trajectories:

T1: [sys, usr] -> Luminous.
T2: [sys, usr] -> Serendipity. -> [user Now pick another.] -> Whimsical.
T3: [sys, usr] -> Ephemeral.

Current gateway instead only checks the continuation request against the most recent active history. If the most recent active branch is Ephemeral., then the Serendipity. continuation does not prefix-match. Gateway materializes Ephemeral. and starts a new full-encoded trajectory for:

[sys, usr, assistant Serendipity., user Now pick another.] 

So the session ends up with 4 trajectory segments:

T1: [sys, usr] -> Luminous.
T2: [sys, usr] -> Serendipity.
T3: [sys, usr] -> Ephemeral.
T4: [sys, usr, assistant Serendipity., user Now pick another.] -> Whimsical.

Therefore, the real limitation is:

current implementation:
only current active branch can be continued

needed for branching agents:
older branches must be findable and reattachable

Best-of-n, rejection sampling, and similar agents often sample multiple sibling candidates and later continue from one selected older candidate. Under the current implementation, we may collect trajectories that are duplicated but incomplete, which reduces trajectory effectiveness and training throughput. In addition, according to https://arxiv.org/pdf/2605.24220, the existence of such trajectories may potentially lead to reward hacking, although this depends on the reward function design and is outside the scope of this proposal for now.

2. Less Retokenization and Storage

A trajectory trie would also help reduce unnecessary retokenization.

In the current code, if the request does not prefix-match the current active history, gateway takes the full-encode path:

else:
    materialized_trajectory = self._build_materialized_trajectory(
        session=session,
        active=session.active_trajectory,
    )
    image_data, video_data = await self._extract_multi_modal_data(messages)
    prompt_ids = self._encode_full(
        messages,
        tools=tools,
        image_data=image_data,
        video_data=video_data,
        request_chat_template_kwargs=request_chat_template_kwargs,
    )
    active_trajectory = TrajectoryBuffer(prompt_ids=prompt_ids)

So on prefix mismatch, the current implementation re-renders and re-tokenizes the entire request messages, even if most of those messages overlap with an older branch.

A prefix trie could improve this by:

  • finding the longest reusable message/token prefix across existing branches;
  • continuing from the matched trie node instead of only from the latest active branch;
  • encoding only the incremental suffix when a reusable branch prefix exists;
  • reducing chat-template rendering and tokenizer CPU time;
  • avoiding duplicated full prompt token storage if the trie stores shared prefixes structurally.

3. Allow Concurrency for LLM Generation

The current implementation uses both generation_lock and request_lock.

generation_lock wraps the whole LLM generation flow for one session:

async with session.generation_lock:
    async with session.request_lock:
        # read session.message_history / session.active_trajectory
        # decide prefix-match vs split
        # build generation_context_ids

    output = await self._backend.generate(...)

    async with session.request_lock:
        # write the result back to the single active trajectory
        session.active_trajectory = active_trajectory
        session.message_history = messages + [assistant_msg]

Because _backend.generate(...) runs while holding generation_lock, two requests in the same session cannot call LLM generation concurrently. The second request must wait for the first request to finish generation and commit its session state.

This is mainly because the current state model has only one mutable session.active_trajectory and one mutable session.message_history. If two requests generated concurrently under this model, they could both read the same old active trajectory and then race to overwrite session.active_trajectory at commit time.

With a prefix trie / branch store, we do not need a single global active trajectory. Each request can match a trie node, generate using a request-local buffer, and then attach its assistant message back to that node:

  • before generation, briefly take request_lock to find the matching trie node and register the request;
  • during generation, do not hold a per-session generation_lock;
  • after generation, briefly take request_lock to attach the assistant message under the matched parent;
  • if two requests share the same parent, they commit as sibling branches instead of overwriting each other;

Hello,非常感谢你的设计。你的设计和我的想法比较相似。不过我认为你提到的 limitation 并不完全准确。另外,我也想补充一下使用 prefix trie 的真正收益。

Limitation 1: Shared message_history Causes Context Cross-Contamination

After sub-agent A updates session.message_history, sub-agent B's next request is prefix-matched against A's modified history. Since B's conversation is logically independent from A's, _is_request_context_prefix produces a false negative — B's request is treated as a context divergence, triggering a full re-materialization and re-encoding. The KV cache optimization is lost, and worse, the resulting trajectories contain interleaved tokens from unrelated sub-conversations.

Limitation 2: Single active_trajectory Interleaves Unrelated Token Sequences

With only one TrajectoryBuffer, the response tokens from sub-agent A and sub-agent B are appended to the same buffer in arrival order. This produces semantically corrupted training data where a single trajectory jumps between unrelated dialogues.

对于上面的 limitation,我认为 session.message_history 确实是在 session 级别共享的,所以 sub-agent B 的请求会和 sub-agent A 当前的 history 做比较。如果 B 在逻辑上和 A 是独立的,_is_request_context_prefix(...) 通常会返回 False,从而触发完整的 re-materialization 和 re-encoding 路径。

但是,“resulting trajectories contain interleaved tokens from unrelated sub-conversations” 这个说法,对于当前 prefix mismatch 路径来说并不准确。在这条路径里,gateway 不会把 B 生成的 tokens append 到 A 的 active trajectory 里。相反,它会先 materialize A 的 active trajectory,然后为 B 创建一个新的 TrajectoryBuffer,之后再把 B 的 buffer 设置为新的 session.active_trajectory

gateway 会做如下判断:

elif _is_request_context_prefix(session=session, messages=messages, tools=tools):
    ...
else:
    materialized_trajectory = self._build_materialized_trajectory(
        session=session,
        active=session.active_trajectory,
    )
    ...   request_chat_template_kwargs=request_chat_template_kwargs,
    )
    active_trajectory = TrajectoryBuffer(prompt_ids=prompt_ids)

因为 B 的 messages 不是 A 的 session.message_history 的 prefix continuation,所以 gateway 会进入 else 分支。

这个分支会为 B 创建一个新的 token buffer:

active_trajectory = TrajectoryBuffer(prompt_ids=prompt_ids)

关键点是:B 不会复用 A 的 active_trajectory。B 会从一个新创建的 TrajectoryBuffer 开始。

所以真正的问题不是一条 trajectory 内部发生 token-level interleaving。真正的问题是,一个 session 只保存一份当前的 message_history。因此,当 B 更新 session state 后,A 的原始 message_history 不再作为当前 session message list 被保留。另外,由于当前设计只维护一份共享的 active trajectory 和 messages,我们不得不使用 generation lock。但如果使用 prefix trie 来存储 messages,就不需要维护单一 active trajectory,也不需要为了切换分支而 materialize 其他 trajectory,从而可以支持 LLM generation 并发并提升吞吐。

Prefix Trie Storage 真正解决的问题和收益

当前实现真正的问题是:

  1. A 的原始 messages list 会丢失,所以当前实现不支持 branch reattachment;
  2. 对多条 trajectories 的共享 prefix,会发生重复 retokenization 和重复存储;
  3. 不支持 LLM generation 并发。

下面我会解释为什么这三个问题存在,以及为什么 Prefix Trie Storage 可以修复这些问题。

1. 支持 Branch Reattachment

当前 limitation 的核心在于:每个新请求只会和当前 active trajectory 比较,而当前 active trajectory 也是最近一次提交的 branch。它不会和更早 materialized 的 trajectories 比较,也不会和一棵 branch tree 比较。因此,如果 agent 回溯到一个较早的 branch 并从那里继续生长,gateway 无法识别出这个 continuation。

考虑下面这种 branching pattern:

root
└─ [sys] You are helpful. Reply in 1 short sentence.
    └─ [usr] Pick any random English word and reply with just that word.
        ├─ [asst] Luminous.
        ├─ [asst] Serendipity. (winner)
        │   └─ [usr] Now pick another.
        │       └─ [asst] Whimsical.
        └─ [asst] Ephemeral.

agent 对同一个 parent messages 发起多次请求,并得到多个候选 assistant response:

[sys, usr] -> Luminous.
[sys, usr] -> Serendipity.
[sys, usr] -> Ephemeral.

然后它选择 Serendipity. 作为 winner 并继续:

[sys, usr, assistant Serendipity., user Now pick another.] -> Whimsical.

一个 branch-aware trajectory builder 应该产出 3 条 trajectories:

T1: [sys, usr] -> Luminous.
T2: [sys, usr] -> Serendipity. -> [user Now pick another.] -> Whimsical.
T3: [sys, usr] -> Ephemeral.

而当前 gateway 只会把 continuation request 和最近的 active history 比较。如果最近的 active branch 是 Ephemeral.,那么 Serendipity. 这条 continuation 就不会 prefix-match。Gateway 会 materialize Ephemeral.,然后为下面这个上下文重新开启一条 full-encoded trajectory:

[sys, usr, assistant Serendipity., user Now pick another.]

所以这个 session 最终会得到 4 个 trajectory segments:

T1: [sys, usr] -> Luminous.
T2: [sys, usr] -> Serendipity.
T3: [sys, usr] -> Ephemeral.
T4: [sys, usr, assistant Serendipity., user Now pick another.] -> Whimsical.

因此,真正的 limitation 是:

current implementation:
only current active branch can be continued

needed for branching agents:
older branches must be findable and reattachable

这种 best-of-n、rejection sampling 等 agent 会采样多个 sibling candidates,然后稍后从某个被选中的旧 candidate 继续生长。如果按照目前的实现,我们会collect到一些重复但又不完整的trajectory,这会降低的 trajectory 有效性和训练的througput。同时根据https://arxiv.org/pdf/2605.24220,这类trajectory的存在有可能会导致reward hacking(不过这个涉及到reward function的设计,暂时不在本proposal的讨论范围内)。

2. 更少的 Retokenization 和存储

trajectory trie 也可以减少不必要的 retokenization。

在当前代码里,如果请求无法 prefix-match 当前 active history,gateway 会走 full-encode 路径:

else:
    materialized_trajectory = self._build_materialized_trajectory(
        session=session,
        active=session.active_trajectory,
    )
    image_data, video_data = await self._extract_multi_modal_data(messages)
    prompt_ids = self._encode_full(
        messages,
        tools=tools,
        image_data=image_data,
        video_data=video_data,
        request_chat_template_kwargs=request_chat_template_kwargs,
    )
    active_trajectory = TrajectoryBuffer(prompt_ids=prompt_ids)

所以在 prefix mismatch 时,当前实现会重新 render 并重新 tokenize 整个 request messages,即使这些 messages 中的大部分内容和某个旧 branch 重叠。

prefix trie 可以通过以下方式改进:

  • 在已有 branches 中找到最长可复用的 message/token prefix;
  • 从匹配到的 trie node 继续,而不是只能从最新的 active branch 继续;
  • 当存在可复用 branch prefix 时,只 encode incremental suffix;
  • 减少 chat-template rendering 和 tokenizer 的 CPU 时间;
  • 如果 trie 以结构化方式存储共享 prefix,也可以避免重复存储完整 prompt tokens。

3. 支持 LLM Generation 并发

当前实现同时使用 generation_lockrequest_lock

generation_lock 会包住一个 session 的整个 LLM generation 流程:

async with session.generation_lock:
    async with session.request_lock:
        # read session.message_history / session.active_trajectory
        # decide prefix-match vs split
        # build generation_context_ids

    output = await self._backend.generate(...)

    async with session.request_lock:
        # write the result back to the single active trajectory
        session.active_trajectory = active_trajectory
        session.message_history = messages + [assistant_msg]

因为 _backend.generate(...) 运行时仍然持有 generation_lock,同一个 session 内的两个请求无法并发调用 LLM generation。第二个请求必须等待第一个请求完成 generation 并提交 session state。

这主要是因为当前 state model 只有一个 mutable 的 session.active_trajectory 和一个 mutable 的 session.message_history。如果在这种模型下允许两个请求并发 generation,它们可能都会读到同一个旧的 active trajectory,然后在 commit 时竞争覆盖 session.active_trajectory

使用 prefix trie / branch store 后,我们不再需要一个全局唯一的 active trajectory。每个请求都可以匹配到一个 trie node,使用 request-local buffer 做 generation,然后把 assistant message 挂回到对应 node 上:

  • generation 前,短暂持有 request_lock,找到匹配的 trie node 并注册该请求;
  • generation 期间,不持有 session 级别的 generation_lock
  • generation 后,短暂持有 request_lock,把 assistant message 挂到匹配的 parent 下面;
  • 如果两个请求共享同一个 parent,它们会作为 sibling branches 提交,而不是互相覆盖。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants