Prefix Trie for Multi-Trajectory Storage
Why a prefix trie (vs. the Agent Gateway’s linear trajectory today)
- Single active branch only: A session keeps one
message_history and one active trajectory. When switching sub-agents, picking a resample path, or returning to an older branch, new requests cannot reattach to historical branches. A trie keeps every branch; incoming messages longest-prefix-match against any path and continue from there.
- Repeated encoding of shared prefixes: Message/token prefixes shared across trajectories are re-materialized and re-tokenized on every branch switch. A trie stores checkpoints on shared nodes; later calls clone from the matched node and tokenize incrementally.
- No concurrent inference: One shared state requires a generation lock and serial LLM calls. With a trie, each call owns a cloned branch state; tokenize and commit can interleave—supporting sub-agents, best-of-n, etc.
Core design
Idea: Maintain a prefix trie per session. Each message sequence is a path from root to some node; assistant nodes may hold a Checkpoint (token state at that prefix). On a new request: longest-prefix DFS → clone nearest checkpoint → after inference, commit to the new assistant leaf.
One generation lifecycle: prepare → tokenize → commit.
| Phase |
Role |
| prepare |
Walk the trie on incoming turns; deep-copy the matched checkpoint into this call’s working_state |
| tokenize |
Incrementally build this round’s prompt_ids from working_state |
| commit |
Write LLM output tokens; attach or refresh the assistant child and checkpoint at the leaf |
Prefix match key: Same edge if (role, text, tool metadata) match. Forking behavior:
| Example |
Trie behavior |
| Different system prompt |
Multiple system edges under root (multi-role) |
| Context condensation (recap / splice) |
New sibling at the fork point |
| Same input, different output |
Multiple assistant children under the same user (best-of-N) |
| Same input, same output |
Reuse existing assistant node (idempotent retry) |
Concurrency: Each generation’s working_state is an independent clone; tokenize, remote inference, and commit may overlap.
Relation to Agent Gateway: Replace “one active trajectory + one message_history” with a trie internally; external APIs can stay the same—only in-session branch routing changes.
Data structures (sketch)
@dataclass(frozen=True)
class MessageKey:
role: str
content: Any
name: str | None = None
tool_calls: tuple[Any, ...] | None = None
tool_call_id: str | None = None
reasoning_content: str | None = None
@dataclass
class BranchCheckpoint:
trajectory_buffer: TrajectoryBuffer # flattened token state up to this prefix
request_tools: list[dict[str, Any]] | None = None
chat_template_kwargs_key: tuple[Any, ...] | None = None
messages: list[dict[str, Any]] | None = None # optional; can be rebuilt from node path
@dataclass(frozen=True)
class BranchHandle:
generation_id: str
# Opaque to Agent Gateway; internally points to the pending attach node.
@dataclass
class PrepareResult:
trajectory_buffer: TrajectoryBuffer | None # cloned checkpoint buffer; None means full encode
checkpoint_messages: list[dict[str, Any]] # prefix covered by trajectory_buffer
branch_handle: BranchHandle # passed back to commit, not inspected by Gateway
class TrieNode:
key: MessageKey | None
message: dict[str, Any] | None # only the current message for this node
parent: TrieNode | None
checkpoint: BranchCheckpoint | None # usually only on committed assistant nodes
children: dict[MessageKey, TrieNode]
class PrefixTrie:
root: TrieNode
def match(self, messages) -> TrieNode:
"""Longest-prefix walk on MessageKey(message); return deepest node."""
...
def materialize_prompt_suffix(self, node, messages) -> TrieNode:
"""Attach prompt-side nodes before generation; no checkpoint yet."""
...
def upsert_assistant(self, parent, assistant_msg, ckpt) -> TrieNode:
"""Find or create assistant child; write ckpt (reuse if same message, sibling if not)."""
...
def prepare(self, messages, *, tools=None, chat_template_kwargs=None) -> PrepareResult:
node = self.match(messages)
attach_node = self.materialize_prompt_suffix(node, messages)
checkpoint_node, checkpoint = nearest_ckpt(attach_node)
return PrepareResult(
trajectory_buffer=clone(checkpoint.trajectory_buffer) if checkpoint else None,
checkpoint_messages=(checkpoint.messages or self.rebuild_messages(checkpoint_node))
if checkpoint
else [],
branch_handle=self.make_branch_handle(attach_node),
)
def commit(self, branch_handle, trajectory_buffer, assistant_msg, **checkpoint_metadata):
attach_node = self.resolve_branch_handle(branch_handle)
ckpt = BranchCheckpoint(trajectory_buffer=trajectory_buffer, **checkpoint_metadata)
self.upsert_assistant(attach_node, assistant_msg, ckpt)
@dataclass
class GatewaySessionState:
handle: SessionHandle
trie: PrefixTrie = field(default_factory=PrefixTrie)
reward_info: dict[str, Any] = field(default_factory=dict)
completed: asyncio.Event = field(default_factory=asyncio.Event)
phase: SessionPhase = SessionPhase.ACTIVE
request_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
Each trie node stores one message only. It does not duplicate the full path. A full transcript can be rebuilt by following parent pointers from the node to root. For current phase, we don't use per-node token ids. Each node will save the full concated token_id along the path. Note that in the future, we may only store token id of the incremental content of current node to further save memory but we need to think of a way to merge all token id on the path. We can't simply concate all token id, we need to deal with some boundary issue (e.g. the ambiguous bos and eos case of GLM 4.7) This could be integrated with the ContinuousTokenBuilder[WIP] imported from verl.
PrepareResult is the public contract between the trie and Agent Gateway. The Gateway reads and mutates only the request-local trajectory_buffer copy: prompt_ids, response_ids, response_mask, and response_logprobs. The trie node is hidden behind branch_handle and is only passed back at commit time.
** Gateway routing **: Compare incoming turns to each branch in the session by longest common prefix; continue on hit, otherwise start a new branch.
Fork types → trie ops (walk-through in appendix):
| Fork type |
Trie behavior |
| Sequential extension |
Single chain; append user → assistant |
| System-keyed split |
New system turn → new edge from root |
| Context condensation |
New user text at fork → sibling branch; ckpt cloned at splice point |
| Output-keyed split (best-of-N) |
Same parent user, multiple assistant children |
| Idempotent retry |
Identical assistant text → refresh ckpt on existing node |
| Warm-start |
Seeded assistant nodes exist but carry no ckpt |
Integrating with the current Agent Gateway
Current gateway behavior
The current GatewaySessionState is effectively a singleton branch:
request_tools: list[dict] | None
message_history: list[dict]
image_data: list[Any] | None
video_data: list[Any] | None
active_trajectory: TrajectoryBuffer | None
trajectories: list[Trajectory]
request_lock: asyncio.Lock
generation_lock: asyncio.Lock
_handle_chat_completions has three paths:
| Path |
Current behavior |
Limitation |
active_trajectory is None |
Full-encode incoming messages; create TrajectoryBuffer(prompt_ids) |
Good first-call behavior |
Prefix matches current message_history |
Copy active_trajectory; encode only messages[len(message_history):]; append those tokens with response_mask=0 |
Works only for the latest branch |
| Prefix mismatch |
Materialize current active buffer into trajectories; full-encode the new request; replace active_trajectory |
Produces linear segments, not branch reattachment |
Important nuance: on prefix mismatch, the current gateway does not append the new request's generated tokens into the old active buffer. It materializes the old buffer and starts a fresh TrajectoryBuffer. Therefore, the core issue is not token-level interleaving inside one buffer. The real issue is that only the latest branch remains attachable; older branches are stored as finalized linear segments and cannot be continued.
Why a trie, not only a list of BranchStates
A minimal implementation could introduce BranchState(message_history, active_trajectory, request_tools, image_data, video_data) and run longest-prefix matching across all branch states. This is a reasonable first step because it restores multiple attachable branches without changing the OpenAI-compatible API.
However, a prefix trie is the better steady-state representation:
BranchState still duplicates shared message prefixes and prompt tokens across siblings; trie nodes share them structurally.
BranchState matching is O(num_branches * history_len); trie matching walks the incoming messages once, with child lookup by canonical message key.
- Best-of-N naturally becomes several assistant children under one user node, instead of several full histories that happen to share a prefix.
- Context condensation and warm-start can be represented as ordinary fork / no-checkpoint cases.
- The trie gives a single place to store checkpoints and branch metadata for safe concurrent generation.
Request flow
The gateway request flow becomes the same prepare → generate → commit lifecycle described above.
1. Prepare under request_lock
messages, tools, request_chat_template_kwargs = normalize(payload)
prepare_result = session.trie.prepare(
messages,
tools=tools,
chat_template_kwargs=request_chat_template_kwargs,
)
branch_handle = prepare_result.branch_handle
working_buffer = prepare_result.trajectory_buffer
generation_id = branch_handle.generation_id
This is intentionally a two-level state transition. During prepare, the gateway may attach the incremental request messages first:
matched assistant checkpoint [ckpt]
└─ new user/tool message # pending, no checkpoint
Internally, the pending node becomes the attach parent for this generation. The LLM call then runs with a request-local working_buffer; the pending node itself does not own generated tokens yet. After decoding the LLM output, commit attaches the assistant child and writes the new checkpoint there:
matched assistant checkpoint [ckpt]
└─ new user/tool message # structural prompt node, no checkpoint
└─ generated assistant message [ckpt]
These newly attached prompt-side nodes are structural/pending nodes only. They should not be exported as trajectories and should not become reusable checkpoints until the LLM response is committed under an assistant child. This also makes concurrent sibling generation natural: multiple requests can materialize their prompt suffixes under the same matched parent, run generation with independent buffers, and later commit different assistant children without overwriting each other.
From the Agent Gateway's perspective, the pending node is not exposed directly. It is represented by branch_handle, while all token-level work happens on working_buffer.
2. Tokenize request-local state
If no checkpoint is available, prepare_result.trajectory_buffer is None; keep current full-encode path:
prompt_ids = _encode_full(messages, tools=tools, ...)
working_buffer = TrajectoryBuffer(prompt_ids=prompt_ids)
If a checkpoint is available, prepare_result.trajectory_buffer is already a request-local clone. Keep today's incremental behavior but anchor it to the matched branch, not to session.message_history:
working_buffer = prepare_result.trajectory_buffer
incremental_messages = messages[len(prepare_result.checkpoint_messages):]
incremental_ids = _encode_incremental(incremental_messages, ...)
working_buffer.response_ids.extend(incremental_ids)
working_buffer.response_mask.extend([0] * len(incremental_ids))
(note that the tokenization process will be replaced by ContinuousTokenBuilder from verl [WIP])
Then call the backend with generation_context_ids = working_buffer.prompt_ids + working_buffer.response_ids. For concurrent requests, use a per-generation backend request id such as f"{session_id}:{generation_id}", not only session_id.
3. Generate outside the session mutation lock
The current code holds generation_lock across _backend.generate(...) to protect one mutable active_trajectory. With trie-backed request-local buffers, the backend call does not need to hold a session-level generation lock:
output = await self._backend.generate(
request_id=f"{session_id}:{generation_id}",
prompt_ids=generation_context_ids,
sampling_params=sampling_params,
image_data=image_data,
video_data=video_data,
)
As a migration strategy, we can first keep generation_lock while replacing the state model, then remove or feature-flag it once prepare/commit invariants are covered by tests.
4. Commit under request_lock
After decoding the assistant response, append generated tokens to the request-local buffer and attach the result to the trie:
working_buffer.response_ids.extend(response_ids)
working_buffer.response_mask.extend([1] * len(response_ids))
append_or_pad_logprobs(working_buffer, output.log_probs, len(response_ids))
session.trie.commit(
branch_handle=branch_handle,
assistant_msg=assistant_msg,
trajectory_buffer=working_buffer,
request_tools=tools,
chat_template_kwargs_key=template_key,
image_data=image_data,
video_data=video_data,
)
Internally, commit(...) resolves branch_handle back to the pending attach node and upserts the assistant child. If two generations share the same parent and produce different assistant text, they become sibling assistant nodes. If they produce the same assistant text, commit refreshes the same node instead of creating a duplicate sibling.
Finalization
finalize_session should traverse the trie and emit branch-aware trajectories instead of only materializing one active_trajectory.
Default export policy:
- Emit terminal assistant checkpoints by default: rejected siblings remain short leaves, while a continued branch is represented by its deepest assistant checkpoint.
- Do not separately emit an intermediate assistant checkpoint when it has been continued into a longer branch, unless
export_all_checkpoints is explicitly enabled.
- Apply
reward_info to every emitted Trajectory, matching the current framework contract.
- Preserve
multi_modal_data, num_turns, finish_reason, and logprob alignment per branch checkpoint.
This produces the expected best-of-N shape: rejected siblings are emitted as short trajectories, while the selected winner is emitted as one continuous longer trajectory.
Compatibility and rollout
- Keep the external
/v1/chat/completions, /complete, create_session, and finalize_session APIs unchanged.
- Gate the new storage with a config flag such as
gateway_trie_enabled=true during rollout.
- Preserve current behavior for strictly linear sessions: token ids, masks, tool parsing, multimodal extraction, and response budgeting should match existing tests.
- Keep a read-only compatibility view for debugging (
current_head, num_branches, num_inflight_generations) to replace has_active_trajectory.
- Add tests for linear continuation, prefix mismatch segmentation parity, best-of-N reattachment, idempotent retry, concurrent sibling commit, tool-call key canonicalization, and multimodal branch reuse.
Observability
Instrument prepare / tokenize / commit on gateway or trainer: incoming turns, prompt length, trie topology summary, generation id. Wire to your deployment’s tracing stack.
多轨迹 Prefix Trie 存储
为何需要 prefix trie(相对当前 Agent Gateway 的 Linear Trajectory)
- 只认当前活跃分支:session 仅维护一条
message_history 与 active trajectory,sub-agent 切换、resample 选路或回退到旧分支时,无法把新请求挂回历史分支;trie 保留全部分支,incoming messages 可与任意路径做最长前缀匹配并续写。
- 共享前缀重复编码:多轨迹共用的 message/token 前缀在分支切换时要反复 materialize、retokenize;trie 在共享节点挂 checkpoint,后续 call 从命中节点 clone 后增量 tokenize 即可。
- 推理无法并发:单一共享状态需 generation lock 串行 LLM call;trie 每条 call 独占分支 state,tokenize / commit 可交错,支持 sub-agent、best-of-n 等并发场景。
核心实现
核心思路:session 内维护一棵 prefix trie——每条 message 序列对应 root 到某节点的路径;assistant 节点可挂 Checkpoint(该前缀上的 token 状态)。新请求 DFS 最长前缀匹配 → clone 最近 checkpoint → 推理后 commit 到 assistant 叶节点。
单次 generation 生命周期:prepare → tokenize → commit。
| 阶段 |
作用 |
| prepare |
按 incoming turns 在 trie 上 walk,深拷贝匹配到的 checkpoint → 本次 working_state |
| tokenize |
在 working_state 上增量拼出本轮 prompt_ids |
| commit |
写入 LLM output token,在叶节点 attach / refresh assistant 子节点与 checkpoint |
前缀匹配键:(role, text, tool metadata) 相等即视为同一边;由此产生各类分叉:
| 现象例子 |
trie 行为 |
| 不同 system prompt |
root 下多条 system 边(多角色) |
| 上下文 condensation(recap / splice) |
fork 点新增 sibling |
| 同输入、不同输出 |
同一 user 下多个 assistant 子节点(best-of-N) |
| 同输入、同输出 |
复用已有 assistant 节点(idempotent retry) |
并发:各 generation 的 working_state 均为独立 clone;tokenize / 远端推理 / commit 可交错进行。
与当前 Agent Gateway 的关系:Gateway 侧可用 trie 替换「单条 active trajectory + 单份 message_history」;整体接口可不变,仅在 session 内部做分支路由。
数据结构伪代码
@dataclass(frozen=True)
class MessageKey:
role: str
content: Any # canonical, hashable representation
name: str | None = None
tool_calls: tuple[Any, ...] | None = None
tool_call_id: str | None = None
reasoning_content: str | None = None
@dataclass
class BranchCheckpoint:
trajectory_buffer: TrajectoryBuffer # 截止该前缀的 flattened token state
request_tools: list[dict[str, Any]] | None = None
chat_template_kwargs_key: tuple[Any, ...] | None = None
messages: list[dict[str, Any]] | None = None # 可选;可由 node path 重建
@dataclass(frozen=True)
class BranchHandle:
generation_id: str
# 对 Agent Gateway 不透明;内部指向 pending attach node。
@dataclass
class PrepareResult:
trajectory_buffer: TrajectoryBuffer | None # clone 出来的 checkpoint buffer;None 表示需要 full encode
checkpoint_messages: list[dict[str, Any]] # trajectory_buffer 已覆盖的 prefix
branch_handle: BranchHandle # commit 时原样传回,Gateway 不解析
class TrieNode:
key: MessageKey | None
message: dict[str, Any] | None # 当前 node 只保存这一条 message
parent: TrieNode | None
checkpoint: BranchCheckpoint | None # 通常只有 committed assistant 节点有
children: dict[MessageKey, TrieNode]
class PrefixTrie:
root: TrieNode
def match(self, messages) -> TrieNode:
"""沿 MessageKey(message) 走最长前缀,返回最深节点。"""
...
def materialize_prompt_suffix(self, node, messages) -> TrieNode:
"""generation 前挂载缺失的 prompt-side nodes;此时不写 checkpoint。"""
...
def upsert_assistant(self, parent, assistant_msg, ckpt) -> TrieNode:
"""找或建 assistant 子节点,写入 ckpt(同 message 复用,不同 message 并列)。"""
...
def prepare(self, messages, *, tools=None, chat_template_kwargs=None) -> PrepareResult:
node = self.match(messages)
attach_node = self.materialize_prompt_suffix(node, messages)
checkpoint_node, checkpoint = nearest_ckpt(attach_node)
return PrepareResult(
trajectory_buffer=clone(checkpoint.trajectory_buffer) if checkpoint else None,
checkpoint_messages=(checkpoint.messages or self.rebuild_messages(checkpoint_node))
if checkpoint
else [],
branch_handle=self.make_branch_handle(attach_node),
)
def commit(self, branch_handle, trajectory_buffer, assistant_msg, **checkpoint_metadata):
attach_node = self.resolve_branch_handle(branch_handle)
ckpt = BranchCheckpoint(trajectory_buffer=trajectory_buffer, **checkpoint_metadata)
self.upsert_assistant(attach_node, assistant_msg, ckpt)
每个 trie node 只保存当前这一条 message,不在每个 node 上重复存整条 path。完整 transcript 可通过 parent 指针从当前 node 回溯到 root 重建。当前阶段我们不使用 per-node token ids;每个 node 会保存沿 path 拼接后的完整 token_id。未来为了进一步节省内存,可以只保存当前 node 的 incremental content token id,但这需要设计 path 上 token id 的合并方式。这里不能简单 concat,必须处理边界问题,例如 GLM 4.7 中 ambiguous BOS / EOS 的情况。这个方向后续可以和 verl 中引入的 ContinuousTokenBuilder[WIP] 集成。
PrepareResult 是 trie 与 Agent Gateway 之间的公开契约。Gateway 只读取和修改 request-local 的 trajectory_buffer copy:prompt_ids、response_ids、response_mask、response_logprobs。trie node 被隐藏在 branch_handle 后面,只在 commit 时原样传回。
Gateway 路由(同一思路):incoming turns 与 session 内各分支比最长公共前缀,命中则续写,否则新建分支。
分叉类型 → trie 操作(walk-through 见附录):
| 分叉类型 |
trie 行为 |
| 顺序延伸 |
单链追加 user → assistant |
| system 键分裂 |
新 system turn → root 下新边 |
| 上下文 condensation |
fork 处 user 文本变化 → 兄弟分支;splice 点 clone ckpt |
| 输出分叉(best-of-N) |
同一 user 下多个 assistant 子节点 |
| idempotent retry |
assistant 文本相同 → 刷新已有节点 ckpt |
| warm-start |
预设 assistant 有节点、无 ckpt |
与当前 Agent Gateway 集成
当前 gateway 行为
当前 GatewaySessionState 本质上是一条单例分支:
request_tools: list[dict] | None
message_history: list[dict]
image_data: list[Any] | None
video_data: list[Any] | None
active_trajectory: TrajectoryBuffer | None
trajectories: list[Trajectory]
request_lock: asyncio.Lock
generation_lock: asyncio.Lock
_handle_chat_completions 主要有三条路径:
| 路径 |
当前行为 |
局限 |
active_trajectory is None |
对 incoming messages 做 full encode,创建 TrajectoryBuffer(prompt_ids) |
首次调用合理 |
与当前 message_history prefix match |
copy 当前 active_trajectory,只 encode messages[len(message_history):],并以 response_mask=0 追加 |
只能续写最新分支 |
| prefix mismatch |
先把当前 active buffer materialize 到 trajectories,再 full encode 新请求并替换 active_trajectory |
得到的是线性切段,不是 branch reattachment |
这里有一个重要细节:prefix mismatch 时,当前 gateway 不会把新请求生成的 tokens append 到旧 active buffer 里。它会先 materialize 旧 buffer,再创建新的 TrajectoryBuffer。因此核心问题不是单个 buffer 内部发生 token-level interleaving;真正的问题是只有最新分支仍然可回接,旧分支会变成 finalized linear segment,后续无法继续增长。
为什么不是只维护 BranchState list
最小实现可以新增 BranchState(message_history, active_trajectory, request_tools, image_data, video_data),然后对所有 branch state 做最长前缀匹配。这个方向是合理的,因为它在不改变 OpenAI-compatible API 的前提下,让多个分支同时保持可回接。
但长期看 prefix trie 是更好的状态表达:
BranchState 仍会在 sibling 之间重复存储共享 message prefix 和 prompt token;trie 可以结构化共享。
BranchState 匹配复杂度是 O(num_branches * history_len);trie 基于 canonical message key 沿 incoming messages 走一遍即可。
- best-of-N 天然表达为同一 user 节点下的多个 assistant 子节点,而不是多份拥有相同前缀的完整 history。
- context condensation 和 warm-start 都可以作为普通 fork / no-checkpoint 节点表示。
- trie 为 checkpoint 和 branch metadata 提供统一落点,便于安全支持并发 generation。
请求处理流程
Gateway 请求流程变成上文的 prepare → generate → commit。
1. 在 request_lock 下 prepare
messages, tools, request_chat_template_kwargs = normalize(payload)
prepare_result = session.trie.prepare(
messages,
tools=tools,
chat_template_kwargs=request_chat_template_kwargs,
)
branch_handle = prepare_result.branch_handle
working_buffer = prepare_result.trajectory_buffer
generation_id = branch_handle.generation_id
这里刻意拆成两层状态转换。在 prepare 阶段,gateway 可以先把本次 request 里缺失的 incremental messages 挂上去:
matched assistant checkpoint [ckpt]
└─ new user/tool message # pending, no checkpoint
内部实现里,这个 pending node 会成为本次 generation 的 attach parent。随后 LLM call 使用 request-local working_buffer 运行;pending node 本身还不拥有模型生成 tokens。等 LLM output decode 完后,commit 再挂载 assistant child,并把新的 checkpoint 写在 assistant 节点上:
matched assistant checkpoint [ckpt]
└─ new user/tool message # structural prompt node, no checkpoint
└─ generated assistant message [ckpt]
这些新挂载的 prompt-side nodes 只是 structural / pending nodes。它们不应被导出为 trajectory,也不应成为可复用 checkpoint;只有 LLM response commit 到 assistant child 后,才形成新的 checkpoint。这样也更自然地支持并发 sibling generation:多个请求可以在同一个 matched parent 下 materialize 各自的 prompt suffix,用独立 buffer 并发 generation,最后再 commit 成不同 assistant children,而不会互相覆盖。
从 Agent Gateway 的视角看,pending node 不会直接暴露出来。它被封装在 branch_handle 里;所有 token-level 工作都发生在 working_buffer 上。
2. 对 request-local state 做 tokenize
如果没有可复用 checkpoint,prepare_result.trajectory_buffer 为 None,则保留当前 full-encode 路径:
prompt_ids = _encode_full(messages, tools=tools, ...)
working_buffer = TrajectoryBuffer(prompt_ids=prompt_ids)
如果有 checkpoint,prepare_result.trajectory_buffer 已经是 request-local clone。此时保留当前 incremental 行为,但 anchor 到命中的 branch,而不是 session.message_history:
working_buffer = prepare_result.trajectory_buffer
incremental_messages = messages[len(prepare_result.checkpoint_messages):]
incremental_ids = _encode_incremental(incremental_messages, ...)
working_buffer.response_ids.extend(incremental_ids)
working_buffer.response_mask.extend([0] * len(incremental_ids))
(注意:tokenization 过程后续会替换为 verl 的 ContinuousTokenBuilder[WIP]。)
随后用 generation_context_ids = working_buffer.prompt_ids + working_buffer.response_ids 调 backend。为了支持并发请求,backend request id 应使用每次 generation 唯一的 id,例如 f"{session_id}:{generation_id}",而不是只有 session_id。
3. 在 session mutation lock 之外 generate
当前代码用 generation_lock 包住 _backend.generate(...),是为了保护唯一可变的 active_trajectory。当 trie 使用 request-local buffer 后,backend call 不需要持有 session 级别的 generation lock:
output = await self._backend.generate(
request_id=f"{session_id}:{generation_id}",
prompt_ids=generation_context_ids,
sampling_params=sampling_params,
image_data=image_data,
video_data=video_data,
)
迁移时可以先保留 generation_lock,只替换状态模型;等 prepare/commit 不变量由测试覆盖后,再移除或用 feature flag 控制。
4. 在 request_lock 下 commit
decode assistant response 后,把模型生成 tokens 追加到 request-local buffer,并挂回 trie:
working_buffer.response_ids.extend(response_ids)
working_buffer.response_mask.extend([1] * len(response_ids))
append_or_pad_logprobs(working_buffer, output.log_probs, len(response_ids))
session.trie.commit(
branch_handle=branch_handle,
assistant_msg=assistant_msg,
trajectory_buffer=working_buffer,
request_tools=tools,
chat_template_kwargs_key=template_key,
image_data=image_data,
video_data=video_data,
)
内部实现里,commit(...) 会把 branch_handle 解析回 pending attach node,再 upsert assistant child。如果两个 generation 共享同一个 parent 且生成不同 assistant 文本,它们会成为 sibling assistant nodes;如果生成文本相同,commit 会刷新同一节点,避免重复 sibling。
Finalization
finalize_session 应遍历 trie,产出 branch-aware trajectories,而不是只 materialize 一个 active_trajectory。
默认导出策略:
- 默认导出 terminal assistant checkpoints:未被选中的 sibling 保持为短叶子;已继续生长的分支则由最深的 assistant checkpoint 表示。
- 如果某个中间 assistant checkpoint 已经被继续成长为更长分支,默认不单独导出;除非显式打开
export_all_checkpoints。
- 对每条导出的
Trajectory 都应用 reward_info,保持当前 framework contract。
- 每个 branch checkpoint 独立保留
multi_modal_data、num_turns、finish_reason 和 logprob alignment。
这样 best-of-N 会得到预期形态:未被选中的 sibling 作为短 trajectory 导出,被选中的 winner 则作为一条连续的长 trajectory 导出。
兼容性与 rollout
- 对外保持
/v1/chat/completions、/complete、create_session、finalize_session API 不变。
- 通过类似
gateway_trie_enabled=true 的配置灰度启用新存储。
- 对严格线性的 session,token ids、mask、tool parsing、multimodal extraction、response budget 行为应与现有测试一致。
- 为调试保留只读兼容视图,例如
current_head、num_branches、num_inflight_generations,替代 has_active_trajectory。
- 补充测试:linear continuation、prefix mismatch segmentation parity、best-of-N reattachment、idempotent retry、concurrent sibling commit、tool-call key canonicalization、multimodal branch reuse。
可观测性
建议在 gateway / trainer 侧对 prepare / tokenize / commit 打点:incoming turns、prompt token 长度、trie 拓扑摘要、generation id。具体上报通道按部署环境接入即可。
附录:按分叉类型 walk-through
树节点约定:[sys] / [usr] / [asst] 为角色;[ckpt] 为模型实际生成并 commit 后的检查点。
分类依据是 trie 要表达的结构,而非 agent 框架里的场景标签。树形示例见上文英文 Appendix。
适用性:以上分叉类型在 coding agent 里尤其常见——多轮 tool/edit 循环、planner / worker 子 agent、超长上下文 condensation、best-of-N 选 patch 或 plan、超时重试、从半截 run 恢复等。示例为便于阅读用了中性领域,结构与 SWE / code-agent 工作流一一对应。
| 类型 |
子例 |
要点 |
| A. 顺序延伸 |
— |
单角色单链,无分叉 |
| B. system 键分裂 |
B1 并行 sub-agent / B2 顺序 handoff |
不同 system → root 多叉;不建模调用栈 |
| C. 上下文 condensation |
C1 全量 recap / C2 中段 splice |
共享前缀后开兄弟分支;中段 fork 时 clone ckpt |
| D. best-of-N / retry |
D1 best-of-N / D2 idempotent retry |
同 user 多 assistant 兄弟;同文复用节点 |
| E. warm-start |
— |
导入历史 turn 有节点无 ckpt,首次 live 生成才 commit |
Prefix Trie for Multi-Trajectory Storage
Why a prefix trie (vs. the Agent Gateway’s linear trajectory today)
message_historyand one active trajectory. When switching sub-agents, picking a resample path, or returning to an older branch, new requests cannot reattach to historical branches. A trie keeps every branch; incoming messages longest-prefix-match against any path and continue from there.Core design
Idea: Maintain a prefix trie per session. Each message sequence is a path from root to some node; assistant nodes may hold a Checkpoint (token state at that prefix). On a new request: longest-prefix DFS → clone nearest checkpoint → after inference, commit to the new assistant leaf.
One generation lifecycle:
prepare → tokenize → commit.working_stateprompt_idsfromworking_statePrefix match key: Same edge if
(role, text, tool metadata)match. Forking behavior:Concurrency: Each generation’s
working_stateis an independent clone; tokenize, remote inference, and commit may overlap.Relation to Agent Gateway: Replace “one active trajectory + one message_history” with a trie internally; external APIs can stay the same—only in-session branch routing changes.
Data structures (sketch)
Each trie node stores one message only. It does not duplicate the full path. A full transcript can be rebuilt by following
parentpointers from the node to root. For current phase, we don't use per-node token ids. Each node will save the full concated token_id along the path. Note that in the future, we may only store token id of the incremental content of current node to further save memory but we need to think of a way to merge all token id on the path. We can't simply concate all token id, we need to deal with some boundary issue (e.g. the ambiguous bos and eos case of GLM 4.7) This could be integrated with the ContinuousTokenBuilder[WIP] imported from verl.PrepareResultis the public contract between the trie and Agent Gateway. The Gateway reads and mutates only the request-localtrajectory_buffercopy:prompt_ids,response_ids,response_mask, andresponse_logprobs. The trie node is hidden behindbranch_handleand is only passed back at commit time.** Gateway routing **: Compare incoming
turnsto each branch in the session by longest common prefix; continue on hit, otherwise start a new branch.Fork types → trie ops (walk-through in appendix):
systemturn → new edge from rootIntegrating with the current Agent Gateway
Current gateway behavior
The current
GatewaySessionStateis effectively a singleton branch:_handle_chat_completionshas three paths:active_trajectory is Nonemessages; createTrajectoryBuffer(prompt_ids)message_historyactive_trajectory; encode onlymessages[len(message_history):]; append those tokens withresponse_mask=0trajectories; full-encode the new request; replaceactive_trajectoryImportant nuance: on prefix mismatch, the current gateway does not append the new request's generated tokens into the old active buffer. It materializes the old buffer and starts a fresh
TrajectoryBuffer. Therefore, the core issue is not token-level interleaving inside one buffer. The real issue is that only the latest branch remains attachable; older branches are stored as finalized linear segments and cannot be continued.Why a trie, not only a list of
BranchStatesA minimal implementation could introduce
BranchState(message_history, active_trajectory, request_tools, image_data, video_data)and run longest-prefix matching across all branch states. This is a reasonable first step because it restores multiple attachable branches without changing the OpenAI-compatible API.However, a prefix trie is the better steady-state representation:
BranchStatestill duplicates shared message prefixes and prompt tokens across siblings; trie nodes share them structurally.BranchStatematching isO(num_branches * history_len); trie matching walks the incoming messages once, with child lookup by canonical message key.Request flow
The gateway request flow becomes the same
prepare → generate → commitlifecycle described above.1. Prepare under
request_lockThis is intentionally a two-level state transition. During
prepare, the gateway may attach the incremental request messages first:Internally, the pending node becomes the attach parent for this generation. The LLM call then runs with a request-local
working_buffer; the pending node itself does not own generated tokens yet. After decoding the LLM output,commitattaches the assistant child and writes the new checkpoint there:These newly attached prompt-side nodes are structural/pending nodes only. They should not be exported as trajectories and should not become reusable checkpoints until the LLM response is committed under an assistant child. This also makes concurrent sibling generation natural: multiple requests can materialize their prompt suffixes under the same matched parent, run generation with independent buffers, and later commit different assistant children without overwriting each other.
From the Agent Gateway's perspective, the pending node is not exposed directly. It is represented by
branch_handle, while all token-level work happens onworking_buffer.2. Tokenize request-local state
If no checkpoint is available,
prepare_result.trajectory_bufferisNone; keep current full-encode path:If a checkpoint is available,
prepare_result.trajectory_bufferis already a request-local clone. Keep today's incremental behavior but anchor it to the matched branch, not tosession.message_history:(note that the tokenization process will be replaced by ContinuousTokenBuilder from verl [WIP])
Then call the backend with
generation_context_ids = working_buffer.prompt_ids + working_buffer.response_ids. For concurrent requests, use a per-generation backend request id such asf"{session_id}:{generation_id}", not onlysession_id.3. Generate outside the session mutation lock
The current code holds
generation_lockacross_backend.generate(...)to protect one mutableactive_trajectory. With trie-backed request-local buffers, the backend call does not need to hold a session-level generation lock:As a migration strategy, we can first keep
generation_lockwhile replacing the state model, then remove or feature-flag it once prepare/commit invariants are covered by tests.4. Commit under
request_lockAfter decoding the assistant response, append generated tokens to the request-local buffer and attach the result to the trie:
Internally,
commit(...)resolvesbranch_handleback to the pending attach node and upserts the assistant child. If two generations share the same parent and produce different assistant text, they become sibling assistant nodes. If they produce the same assistant text, commit refreshes the same node instead of creating a duplicate sibling.Finalization
finalize_sessionshould traverse the trie and emit branch-aware trajectories instead of only materializing oneactive_trajectory.Default export policy:
export_all_checkpointsis explicitly enabled.reward_infoto every emittedTrajectory, matching the current framework contract.multi_modal_data,num_turns,finish_reason, and logprob alignment per branch checkpoint.This produces the expected best-of-N shape: rejected siblings are emitted as short trajectories, while the selected winner is emitted as one continuous longer trajectory.
Compatibility and rollout
/v1/chat/completions,/complete,create_session, andfinalize_sessionAPIs unchanged.gateway_trie_enabled=trueduring rollout.current_head,num_branches,num_inflight_generations) to replacehas_active_trajectory.Observability
Instrument
prepare / tokenize / commiton gateway or trainer: incoming turns, prompt length, trie topology summary, generation id. Wire to your deployment’s tracing stack.多轨迹 Prefix Trie 存储
为何需要 prefix trie(相对当前 Agent Gateway 的 Linear Trajectory)
message_history与 active trajectory,sub-agent 切换、resample 选路或回退到旧分支时,无法把新请求挂回历史分支;trie 保留全部分支,incoming messages 可与任意路径做最长前缀匹配并续写。核心实现
核心思路:session 内维护一棵 prefix trie——每条 message 序列对应 root 到某节点的路径;assistant 节点可挂 Checkpoint(该前缀上的 token 状态)。新请求 DFS 最长前缀匹配 → clone 最近 checkpoint → 推理后 commit 到 assistant 叶节点。
单次 generation 生命周期:
prepare → tokenize → commit。working_stateworking_state上增量拼出本轮prompt_ids前缀匹配键:
(role, text, tool metadata)相等即视为同一边;由此产生各类分叉:并发:各 generation 的
working_state均为独立 clone;tokenize / 远端推理 / commit 可交错进行。与当前 Agent Gateway 的关系:Gateway 侧可用 trie 替换「单条 active trajectory + 单份 message_history」;整体接口可不变,仅在 session 内部做分支路由。
数据结构伪代码
每个 trie node 只保存当前这一条 message,不在每个 node 上重复存整条 path。完整 transcript 可通过
parent指针从当前 node 回溯到 root 重建。当前阶段我们不使用 per-node token ids;每个 node 会保存沿 path 拼接后的完整 token_id。未来为了进一步节省内存,可以只保存当前 node 的 incremental content token id,但这需要设计 path 上 token id 的合并方式。这里不能简单 concat,必须处理边界问题,例如 GLM 4.7 中 ambiguous BOS / EOS 的情况。这个方向后续可以和 verl 中引入的 ContinuousTokenBuilder[WIP] 集成。PrepareResult是 trie 与 Agent Gateway 之间的公开契约。Gateway 只读取和修改 request-local 的trajectory_buffercopy:prompt_ids、response_ids、response_mask、response_logprobs。trie node 被隐藏在branch_handle后面,只在 commit 时原样传回。Gateway 路由(同一思路):incoming
turns与 session 内各分支比最长公共前缀,命中则续写,否则新建分支。分叉类型 → trie 操作(walk-through 见附录):
与当前 Agent Gateway 集成
当前 gateway 行为
当前
GatewaySessionState本质上是一条单例分支:_handle_chat_completions主要有三条路径:active_trajectory is Nonemessages做 full encode,创建TrajectoryBuffer(prompt_ids)message_historyprefix matchactive_trajectory,只 encodemessages[len(message_history):],并以response_mask=0追加trajectories,再 full encode 新请求并替换active_trajectory这里有一个重要细节:prefix mismatch 时,当前 gateway 不会把新请求生成的 tokens append 到旧 active buffer 里。它会先 materialize 旧 buffer,再创建新的
TrajectoryBuffer。因此核心问题不是单个 buffer 内部发生 token-level interleaving;真正的问题是只有最新分支仍然可回接,旧分支会变成 finalized linear segment,后续无法继续增长。为什么不是只维护
BranchStatelist最小实现可以新增
BranchState(message_history, active_trajectory, request_tools, image_data, video_data),然后对所有 branch state 做最长前缀匹配。这个方向是合理的,因为它在不改变 OpenAI-compatible API 的前提下,让多个分支同时保持可回接。但长期看 prefix trie 是更好的状态表达:
BranchState仍会在 sibling 之间重复存储共享 message prefix 和 prompt token;trie 可以结构化共享。BranchState匹配复杂度是O(num_branches * history_len);trie 基于 canonical message key 沿 incoming messages 走一遍即可。请求处理流程
Gateway 请求流程变成上文的
prepare → generate → commit。1. 在
request_lock下 prepare这里刻意拆成两层状态转换。在
prepare阶段,gateway 可以先把本次 request 里缺失的 incremental messages 挂上去:内部实现里,这个 pending node 会成为本次 generation 的 attach parent。随后 LLM call 使用 request-local
working_buffer运行;pending node 本身还不拥有模型生成 tokens。等 LLM output decode 完后,commit再挂载 assistant child,并把新的 checkpoint 写在 assistant 节点上:这些新挂载的 prompt-side nodes 只是 structural / pending nodes。它们不应被导出为 trajectory,也不应成为可复用 checkpoint;只有 LLM response commit 到 assistant child 后,才形成新的 checkpoint。这样也更自然地支持并发 sibling generation:多个请求可以在同一个 matched parent 下 materialize 各自的 prompt suffix,用独立 buffer 并发 generation,最后再 commit 成不同 assistant children,而不会互相覆盖。
从 Agent Gateway 的视角看,pending node 不会直接暴露出来。它被封装在
branch_handle里;所有 token-level 工作都发生在working_buffer上。2. 对 request-local state 做 tokenize
如果没有可复用 checkpoint,
prepare_result.trajectory_buffer为None,则保留当前 full-encode 路径:如果有 checkpoint,
prepare_result.trajectory_buffer已经是 request-local clone。此时保留当前 incremental 行为,但 anchor 到命中的 branch,而不是session.message_history:(注意:tokenization 过程后续会替换为 verl 的 ContinuousTokenBuilder[WIP]。)
随后用
generation_context_ids = working_buffer.prompt_ids + working_buffer.response_ids调 backend。为了支持并发请求,backend request id 应使用每次 generation 唯一的 id,例如f"{session_id}:{generation_id}",而不是只有session_id。3. 在 session mutation lock 之外 generate
当前代码用
generation_lock包住_backend.generate(...),是为了保护唯一可变的active_trajectory。当 trie 使用 request-local buffer 后,backend call 不需要持有 session 级别的 generation lock:迁移时可以先保留
generation_lock,只替换状态模型;等 prepare/commit 不变量由测试覆盖后,再移除或用 feature flag 控制。4. 在
request_lock下 commitdecode assistant response 后,把模型生成 tokens 追加到 request-local buffer,并挂回 trie:
内部实现里,
commit(...)会把branch_handle解析回 pending attach node,再 upsert assistant child。如果两个 generation 共享同一个 parent 且生成不同 assistant 文本,它们会成为 sibling assistant nodes;如果生成文本相同,commit 会刷新同一节点,避免重复 sibling。Finalization
finalize_session应遍历 trie,产出 branch-aware trajectories,而不是只 materialize 一个active_trajectory。默认导出策略:
export_all_checkpoints。Trajectory都应用reward_info,保持当前 framework contract。multi_modal_data、num_turns、finish_reason和 logprob alignment。这样 best-of-N 会得到预期形态:未被选中的 sibling 作为短 trajectory 导出,被选中的 winner 则作为一条连续的长 trajectory 导出。
兼容性与 rollout
/v1/chat/completions、/complete、create_session、finalize_sessionAPI 不变。gateway_trie_enabled=true的配置灰度启用新存储。current_head、num_branches、num_inflight_generations,替代has_active_trajectory。可观测性
建议在 gateway / trainer 侧对
prepare / tokenize / commit打点:incoming turns、prompt token 长度、trie 拓扑摘要、generation id。具体上报通道按部署环境接入即可。附录:按分叉类型 walk-through
树节点约定:
[sys]/[usr]/[asst]为角色;[ckpt]为模型实际生成并 commit 后的检查点。分类依据是 trie 要表达的结构,而非 agent 框架里的场景标签。树形示例见上文英文 Appendix。
适用性:以上分叉类型在 coding agent 里尤其常见——多轮 tool/edit 循环、planner / worker 子 agent、超长上下文 condensation、best-of-N 选 patch 或 plan、超时重试、从半截 run 恢复等。示例为便于阅读用了中性领域,结构与 SWE / code-agent 工作流一一对应。