Skip to content

[RFC] Prefix Trie for Multi-trajectories Storage for AgentGateway #51

@gxlvera

Description

@gxlvera

Prefix Trie for Multi-Trajectory Storage

Why a prefix trie (vs. the Agent Gateway’s linear trajectory today)

  • Single active branch only: A session keeps one message_history and one active trajectory. When switching sub-agents, picking a resample path, or returning to an older branch, new requests cannot reattach to historical branches. A trie keeps every branch; incoming messages longest-prefix-match against any path and continue from there.
  • Repeated encoding of shared prefixes: Message/token prefixes shared across trajectories are re-materialized and re-tokenized on every branch switch. A trie stores checkpoints on shared nodes; later calls clone from the matched node and tokenize incrementally.
  • No concurrent inference: One shared state requires a generation lock and serial LLM calls. With a trie, each call owns a cloned branch state; tokenize and commit can interleave—supporting sub-agents, best-of-n, etc.

Core design

Idea: Maintain a prefix trie per session. Each message sequence is a path from root to some node; assistant nodes may hold a Checkpoint (token state at that prefix). On a new request: longest-prefix DFS → clone nearest checkpoint → after inference, commit to the new assistant leaf.

One generation lifecycle: prepare → tokenize → commit.

Phase Role
prepare Walk the trie on incoming turns; deep-copy the matched checkpoint into this call’s working_state
tokenize Incrementally build this round’s prompt_ids from working_state
commit Write LLM output tokens; attach or refresh the assistant child and checkpoint at the leaf

Prefix match key: Same edge if (role, text, tool metadata) match. Forking behavior:

Example Trie behavior
Different system prompt Multiple system edges under root (multi-role)
Context condensation (recap / splice) New sibling at the fork point
Same input, different output Multiple assistant children under the same user (best-of-N)
Same input, same output Reuse existing assistant node (idempotent retry)

Concurrency: Each generation’s working_state is an independent clone; tokenize, remote inference, and commit may overlap.

Relation to Agent Gateway: Replace “one active trajectory + one message_history” with a trie internally; external APIs can stay the same—only in-session branch routing changes.


Data structures (sketch)

@dataclass(frozen=True)
class MessageKey:
    role: str
    content: Any                  
    name: str | None = None
    tool_calls: tuple[Any, ...] | None = None
    tool_call_id: str | None = None 
    reasoning_content: str | None = None

@dataclass
class BranchCheckpoint:
    trajectory_buffer: TrajectoryBuffer      # flattened token state up to this prefix
    request_tools: list[dict[str, Any]] | None = None
    chat_template_kwargs_key: tuple[Any, ...] | None = None
    messages: list[dict[str, Any]] | None = None      # optional; can be rebuilt from node path

@dataclass(frozen=True)
class BranchHandle:
    generation_id: str
    # Opaque to Agent Gateway; internally points to the pending attach node.

@dataclass
class PrepareResult:
    trajectory_buffer: TrajectoryBuffer | None         # cloned checkpoint buffer; None means full encode
    checkpoint_messages: list[dict[str, Any]]          # prefix covered by trajectory_buffer
    branch_handle: BranchHandle                        # passed back to commit, not inspected by Gateway

class TrieNode:
    key: MessageKey | None
    message: dict[str, Any] | None           # only the current message for this node
    parent: TrieNode | None
    checkpoint: BranchCheckpoint | None      # usually only on committed assistant nodes
    children: dict[MessageKey, TrieNode]

class PrefixTrie:
    root: TrieNode

    def match(self, messages) -> TrieNode:
        """Longest-prefix walk on MessageKey(message); return deepest node."""
        ...

    def materialize_prompt_suffix(self, node, messages) -> TrieNode:
        """Attach prompt-side nodes before generation; no checkpoint yet."""
        ...

    def upsert_assistant(self, parent, assistant_msg, ckpt) -> TrieNode:
        """Find or create assistant child; write ckpt (reuse if same message, sibling if not)."""
        ...

    def prepare(self, messages, *, tools=None, chat_template_kwargs=None) -> PrepareResult:
        node = self.match(messages)
        attach_node = self.materialize_prompt_suffix(node, messages)
        checkpoint_node, checkpoint = nearest_ckpt(attach_node)
        return PrepareResult(
            trajectory_buffer=clone(checkpoint.trajectory_buffer) if checkpoint else None,
            checkpoint_messages=(checkpoint.messages or self.rebuild_messages(checkpoint_node))
            if checkpoint
            else [],
            branch_handle=self.make_branch_handle(attach_node),
        )

    def commit(self, branch_handle, trajectory_buffer, assistant_msg, **checkpoint_metadata):
        attach_node = self.resolve_branch_handle(branch_handle)
        ckpt = BranchCheckpoint(trajectory_buffer=trajectory_buffer, **checkpoint_metadata)
        self.upsert_assistant(attach_node, assistant_msg, ckpt)


@dataclass
class GatewaySessionState:
    handle: SessionHandle
    trie: PrefixTrie = field(default_factory=PrefixTrie)
    reward_info: dict[str, Any] = field(default_factory=dict)
    completed: asyncio.Event = field(default_factory=asyncio.Event)
    phase: SessionPhase = SessionPhase.ACTIVE
    request_lock: asyncio.Lock = field(default_factory=asyncio.Lock)

Each trie node stores one message only. It does not duplicate the full path. A full transcript can be rebuilt by following parent pointers from the node to root. For current phase, we don't use per-node token ids. Each node will save the full concated token_id along the path. Note that in the future, we may only store token id of the incremental content of current node to further save memory but we need to think of a way to merge all token id on the path. We can't simply concate all token id, we need to deal with some boundary issue (e.g. the ambiguous bos and eos case of GLM 4.7) This could be integrated with the ContinuousTokenBuilder[WIP] imported from verl.

PrepareResult is the public contract between the trie and Agent Gateway. The Gateway reads and mutates only the request-local trajectory_buffer copy: prompt_ids, response_ids, response_mask, and response_logprobs. The trie node is hidden behind branch_handle and is only passed back at commit time.

** Gateway routing **: Compare incoming turns to each branch in the session by longest common prefix; continue on hit, otherwise start a new branch.

Fork types → trie ops (walk-through in appendix):

Fork type Trie behavior
Sequential extension Single chain; append user → assistant
System-keyed split New system turn → new edge from root
Context condensation New user text at fork → sibling branch; ckpt cloned at splice point
Output-keyed split (best-of-N) Same parent user, multiple assistant children
Idempotent retry Identical assistant text → refresh ckpt on existing node
Warm-start Seeded assistant nodes exist but carry no ckpt

Integrating with the current Agent Gateway

Current gateway behavior

The current GatewaySessionState is effectively a singleton branch:

request_tools: list[dict] | None
message_history: list[dict]
image_data: list[Any] | None
video_data: list[Any] | None
active_trajectory: TrajectoryBuffer | None
trajectories: list[Trajectory]
request_lock: asyncio.Lock
generation_lock: asyncio.Lock

_handle_chat_completions has three paths:

Path Current behavior Limitation
active_trajectory is None Full-encode incoming messages; create TrajectoryBuffer(prompt_ids) Good first-call behavior
Prefix matches current message_history Copy active_trajectory; encode only messages[len(message_history):]; append those tokens with response_mask=0 Works only for the latest branch
Prefix mismatch Materialize current active buffer into trajectories; full-encode the new request; replace active_trajectory Produces linear segments, not branch reattachment

Important nuance: on prefix mismatch, the current gateway does not append the new request's generated tokens into the old active buffer. It materializes the old buffer and starts a fresh TrajectoryBuffer. Therefore, the core issue is not token-level interleaving inside one buffer. The real issue is that only the latest branch remains attachable; older branches are stored as finalized linear segments and cannot be continued.

Why a trie, not only a list of BranchStates

A minimal implementation could introduce BranchState(message_history, active_trajectory, request_tools, image_data, video_data) and run longest-prefix matching across all branch states. This is a reasonable first step because it restores multiple attachable branches without changing the OpenAI-compatible API.

However, a prefix trie is the better steady-state representation:

  • BranchState still duplicates shared message prefixes and prompt tokens across siblings; trie nodes share them structurally.
  • BranchState matching is O(num_branches * history_len); trie matching walks the incoming messages once, with child lookup by canonical message key.
  • Best-of-N naturally becomes several assistant children under one user node, instead of several full histories that happen to share a prefix.
  • Context condensation and warm-start can be represented as ordinary fork / no-checkpoint cases.
  • The trie gives a single place to store checkpoints and branch metadata for safe concurrent generation.

Request flow

The gateway request flow becomes the same prepare → generate → commit lifecycle described above.

1. Prepare under request_lock

messages, tools, request_chat_template_kwargs = normalize(payload)
prepare_result = session.trie.prepare(
    messages,
    tools=tools,
    chat_template_kwargs=request_chat_template_kwargs,
)
branch_handle = prepare_result.branch_handle
working_buffer = prepare_result.trajectory_buffer
generation_id = branch_handle.generation_id

This is intentionally a two-level state transition. During prepare, the gateway may attach the incremental request messages first:

matched assistant checkpoint [ckpt]
└─ new user/tool message      # pending, no checkpoint

Internally, the pending node becomes the attach parent for this generation. The LLM call then runs with a request-local working_buffer; the pending node itself does not own generated tokens yet. After decoding the LLM output, commit attaches the assistant child and writes the new checkpoint there:

matched assistant checkpoint [ckpt]
└─ new user/tool message      # structural prompt node, no checkpoint
   └─ generated assistant message [ckpt]

These newly attached prompt-side nodes are structural/pending nodes only. They should not be exported as trajectories and should not become reusable checkpoints until the LLM response is committed under an assistant child. This also makes concurrent sibling generation natural: multiple requests can materialize their prompt suffixes under the same matched parent, run generation with independent buffers, and later commit different assistant children without overwriting each other.

From the Agent Gateway's perspective, the pending node is not exposed directly. It is represented by branch_handle, while all token-level work happens on working_buffer.

2. Tokenize request-local state

If no checkpoint is available, prepare_result.trajectory_buffer is None; keep current full-encode path:

prompt_ids = _encode_full(messages, tools=tools, ...)
working_buffer = TrajectoryBuffer(prompt_ids=prompt_ids)

If a checkpoint is available, prepare_result.trajectory_buffer is already a request-local clone. Keep today's incremental behavior but anchor it to the matched branch, not to session.message_history:

working_buffer = prepare_result.trajectory_buffer
incremental_messages = messages[len(prepare_result.checkpoint_messages):]
incremental_ids = _encode_incremental(incremental_messages, ...)
working_buffer.response_ids.extend(incremental_ids)
working_buffer.response_mask.extend([0] * len(incremental_ids))

(note that the tokenization process will be replaced by ContinuousTokenBuilder from verl [WIP])

Then call the backend with generation_context_ids = working_buffer.prompt_ids + working_buffer.response_ids. For concurrent requests, use a per-generation backend request id such as f"{session_id}:{generation_id}", not only session_id.

3. Generate outside the session mutation lock

The current code holds generation_lock across _backend.generate(...) to protect one mutable active_trajectory. With trie-backed request-local buffers, the backend call does not need to hold a session-level generation lock:

output = await self._backend.generate(
    request_id=f"{session_id}:{generation_id}",
    prompt_ids=generation_context_ids,
    sampling_params=sampling_params,
    image_data=image_data,
    video_data=video_data,
)

As a migration strategy, we can first keep generation_lock while replacing the state model, then remove or feature-flag it once prepare/commit invariants are covered by tests.

4. Commit under request_lock

After decoding the assistant response, append generated tokens to the request-local buffer and attach the result to the trie:

working_buffer.response_ids.extend(response_ids)
working_buffer.response_mask.extend([1] * len(response_ids))
append_or_pad_logprobs(working_buffer, output.log_probs, len(response_ids))

session.trie.commit(
    branch_handle=branch_handle,
    assistant_msg=assistant_msg,
    trajectory_buffer=working_buffer,
    request_tools=tools,
    chat_template_kwargs_key=template_key,
    image_data=image_data,
    video_data=video_data,
)

Internally, commit(...) resolves branch_handle back to the pending attach node and upserts the assistant child. If two generations share the same parent and produce different assistant text, they become sibling assistant nodes. If they produce the same assistant text, commit refreshes the same node instead of creating a duplicate sibling.

Finalization

finalize_session should traverse the trie and emit branch-aware trajectories instead of only materializing one active_trajectory.

Default export policy:

  • Emit terminal assistant checkpoints by default: rejected siblings remain short leaves, while a continued branch is represented by its deepest assistant checkpoint.
  • Do not separately emit an intermediate assistant checkpoint when it has been continued into a longer branch, unless export_all_checkpoints is explicitly enabled.
  • Apply reward_info to every emitted Trajectory, matching the current framework contract.
  • Preserve multi_modal_data, num_turns, finish_reason, and logprob alignment per branch checkpoint.

This produces the expected best-of-N shape: rejected siblings are emitted as short trajectories, while the selected winner is emitted as one continuous longer trajectory.

Compatibility and rollout

  • Keep the external /v1/chat/completions, /complete, create_session, and finalize_session APIs unchanged.
  • Gate the new storage with a config flag such as gateway_trie_enabled=true during rollout.
  • Preserve current behavior for strictly linear sessions: token ids, masks, tool parsing, multimodal extraction, and response budgeting should match existing tests.
  • Keep a read-only compatibility view for debugging (current_head, num_branches, num_inflight_generations) to replace has_active_trajectory.
  • Add tests for linear continuation, prefix mismatch segmentation parity, best-of-N reattachment, idempotent retry, concurrent sibling commit, tool-call key canonicalization, and multimodal branch reuse.

Observability

Instrument prepare / tokenize / commit on gateway or trainer: incoming turns, prompt length, trie topology summary, generation id. Wire to your deployment’s tracing stack.


多轨迹 Prefix Trie 存储

为何需要 prefix trie(相对当前 Agent Gateway 的 Linear Trajectory)

  • 只认当前活跃分支:session 仅维护一条 message_history 与 active trajectory,sub-agent 切换、resample 选路或回退到旧分支时,无法把新请求挂回历史分支;trie 保留全部分支,incoming messages 可与任意路径做最长前缀匹配并续写。
  • 共享前缀重复编码:多轨迹共用的 message/token 前缀在分支切换时要反复 materialize、retokenize;trie 在共享节点挂 checkpoint,后续 call 从命中节点 clone 后增量 tokenize 即可。
  • 推理无法并发:单一共享状态需 generation lock 串行 LLM call;trie 每条 call 独占分支 state,tokenize / commit 可交错,支持 sub-agent、best-of-n 等并发场景。

核心实现

核心思路:session 内维护一棵 prefix trie——每条 message 序列对应 root 到某节点的路径;assistant 节点可挂 Checkpoint(该前缀上的 token 状态)。新请求 DFS 最长前缀匹配 → clone 最近 checkpoint → 推理后 commit 到 assistant 叶节点。

单次 generation 生命周期prepare → tokenize → commit

阶段 作用
prepare 按 incoming turns 在 trie 上 walk,深拷贝匹配到的 checkpoint → 本次 working_state
tokenize working_state 上增量拼出本轮 prompt_ids
commit 写入 LLM output token,在叶节点 attach / refresh assistant 子节点与 checkpoint

前缀匹配键(role, text, tool metadata) 相等即视为同一边;由此产生各类分叉:

现象例子 trie 行为
不同 system prompt root 下多条 system 边(多角色)
上下文 condensation(recap / splice) fork 点新增 sibling
同输入、不同输出 同一 user 下多个 assistant 子节点(best-of-N
同输入、同输出 复用已有 assistant 节点(idempotent retry

并发:各 generation 的 working_state 均为独立 clone;tokenize / 远端推理 / commit 可交错进行。

与当前 Agent Gateway 的关系:Gateway 侧可用 trie 替换「单条 active trajectory + 单份 message_history」;整体接口可不变,仅在 session 内部做分支路由。


数据结构伪代码

@dataclass(frozen=True)
class MessageKey:
    role: str
    content: Any                  # canonical, hashable representation
    name: str | None = None
    tool_calls: tuple[Any, ...] | None = None
    tool_call_id: str | None = None
    reasoning_content: str | None = None

@dataclass
class BranchCheckpoint:
    trajectory_buffer: TrajectoryBuffer      # 截止该前缀的 flattened token state
    request_tools: list[dict[str, Any]] | None = None
    chat_template_kwargs_key: tuple[Any, ...] | None = None
    messages: list[dict[str, Any]] | None = None      # 可选;可由 node path 重建

@dataclass(frozen=True)
class BranchHandle:
    generation_id: str
    # 对 Agent Gateway 不透明;内部指向 pending attach node。

@dataclass
class PrepareResult:
    trajectory_buffer: TrajectoryBuffer | None         # clone 出来的 checkpoint buffer;None 表示需要 full encode
    checkpoint_messages: list[dict[str, Any]]          # trajectory_buffer 已覆盖的 prefix
    branch_handle: BranchHandle                        # commit 时原样传回,Gateway 不解析

class TrieNode:
    key: MessageKey | None
    message: dict[str, Any] | None           # 当前 node 只保存这一条 message
    parent: TrieNode | None
    checkpoint: BranchCheckpoint | None      # 通常只有 committed assistant 节点有
    children: dict[MessageKey, TrieNode]

class PrefixTrie:
    root: TrieNode

    def match(self, messages) -> TrieNode:
        """沿 MessageKey(message) 走最长前缀,返回最深节点。"""
        ...

    def materialize_prompt_suffix(self, node, messages) -> TrieNode:
        """generation 前挂载缺失的 prompt-side nodes;此时不写 checkpoint。"""
        ...

    def upsert_assistant(self, parent, assistant_msg, ckpt) -> TrieNode:
        """找或建 assistant 子节点,写入 ckpt(同 message 复用,不同 message 并列)。"""
        ...

    def prepare(self, messages, *, tools=None, chat_template_kwargs=None) -> PrepareResult:
        node = self.match(messages)
        attach_node = self.materialize_prompt_suffix(node, messages)
        checkpoint_node, checkpoint = nearest_ckpt(attach_node)
        return PrepareResult(
            trajectory_buffer=clone(checkpoint.trajectory_buffer) if checkpoint else None,
            checkpoint_messages=(checkpoint.messages or self.rebuild_messages(checkpoint_node))
            if checkpoint
            else [],
            branch_handle=self.make_branch_handle(attach_node),
        )

    def commit(self, branch_handle, trajectory_buffer, assistant_msg, **checkpoint_metadata):
        attach_node = self.resolve_branch_handle(branch_handle)
        ckpt = BranchCheckpoint(trajectory_buffer=trajectory_buffer, **checkpoint_metadata)
        self.upsert_assistant(attach_node, assistant_msg, ckpt)

每个 trie node 只保存当前这一条 message,不在每个 node 上重复存整条 path。完整 transcript 可通过 parent 指针从当前 node 回溯到 root 重建。当前阶段我们不使用 per-node token ids;每个 node 会保存沿 path 拼接后的完整 token_id。未来为了进一步节省内存,可以只保存当前 node 的 incremental content token id,但这需要设计 path 上 token id 的合并方式。这里不能简单 concat,必须处理边界问题,例如 GLM 4.7 中 ambiguous BOS / EOS 的情况。这个方向后续可以和 verl 中引入的 ContinuousTokenBuilder[WIP] 集成。

PrepareResult 是 trie 与 Agent Gateway 之间的公开契约。Gateway 只读取和修改 request-local 的 trajectory_buffer copy:prompt_idsresponse_idsresponse_maskresponse_logprobs。trie node 被隐藏在 branch_handle 后面,只在 commit 时原样传回。

Gateway 路由(同一思路):incoming turns 与 session 内各分支比最长公共前缀,命中则续写,否则新建分支。

分叉类型 → trie 操作(walk-through 见附录):

分叉类型 trie 行为
顺序延伸 单链追加 user → assistant
system 键分裂 新 system turn → root 下新边
上下文 condensation fork 处 user 文本变化 → 兄弟分支;splice 点 clone ckpt
输出分叉(best-of-N) 同一 user 下多个 assistant 子节点
idempotent retry assistant 文本相同 → 刷新已有节点 ckpt
warm-start 预设 assistant 有节点、无 ckpt

与当前 Agent Gateway 集成

当前 gateway 行为

当前 GatewaySessionState 本质上是一条单例分支:

request_tools: list[dict] | None
message_history: list[dict]
image_data: list[Any] | None
video_data: list[Any] | None
active_trajectory: TrajectoryBuffer | None
trajectories: list[Trajectory]
request_lock: asyncio.Lock
generation_lock: asyncio.Lock

_handle_chat_completions 主要有三条路径:

路径 当前行为 局限
active_trajectory is None 对 incoming messages 做 full encode,创建 TrajectoryBuffer(prompt_ids) 首次调用合理
与当前 message_history prefix match copy 当前 active_trajectory,只 encode messages[len(message_history):],并以 response_mask=0 追加 只能续写最新分支
prefix mismatch 先把当前 active buffer materialize 到 trajectories,再 full encode 新请求并替换 active_trajectory 得到的是线性切段,不是 branch reattachment

这里有一个重要细节:prefix mismatch 时,当前 gateway 不会把新请求生成的 tokens append 到旧 active buffer 里。它会先 materialize 旧 buffer,再创建新的 TrajectoryBuffer。因此核心问题不是单个 buffer 内部发生 token-level interleaving;真正的问题是只有最新分支仍然可回接,旧分支会变成 finalized linear segment,后续无法继续增长。

为什么不是只维护 BranchState list

最小实现可以新增 BranchState(message_history, active_trajectory, request_tools, image_data, video_data),然后对所有 branch state 做最长前缀匹配。这个方向是合理的,因为它在不改变 OpenAI-compatible API 的前提下,让多个分支同时保持可回接。

但长期看 prefix trie 是更好的状态表达:

  • BranchState 仍会在 sibling 之间重复存储共享 message prefix 和 prompt token;trie 可以结构化共享。
  • BranchState 匹配复杂度是 O(num_branches * history_len);trie 基于 canonical message key 沿 incoming messages 走一遍即可。
  • best-of-N 天然表达为同一 user 节点下的多个 assistant 子节点,而不是多份拥有相同前缀的完整 history。
  • context condensation 和 warm-start 都可以作为普通 fork / no-checkpoint 节点表示。
  • trie 为 checkpoint 和 branch metadata 提供统一落点,便于安全支持并发 generation。

请求处理流程

Gateway 请求流程变成上文的 prepare → generate → commit

1. 在 request_lock 下 prepare

messages, tools, request_chat_template_kwargs = normalize(payload)
prepare_result = session.trie.prepare(
    messages,
    tools=tools,
    chat_template_kwargs=request_chat_template_kwargs,
)
branch_handle = prepare_result.branch_handle
working_buffer = prepare_result.trajectory_buffer
generation_id = branch_handle.generation_id

这里刻意拆成两层状态转换。在 prepare 阶段,gateway 可以先把本次 request 里缺失的 incremental messages 挂上去:

matched assistant checkpoint [ckpt]
└─ new user/tool message      # pending, no checkpoint

内部实现里,这个 pending node 会成为本次 generation 的 attach parent。随后 LLM call 使用 request-local working_buffer 运行;pending node 本身还不拥有模型生成 tokens。等 LLM output decode 完后,commit 再挂载 assistant child,并把新的 checkpoint 写在 assistant 节点上:

matched assistant checkpoint [ckpt]
└─ new user/tool message      # structural prompt node, no checkpoint
   └─ generated assistant message [ckpt]

这些新挂载的 prompt-side nodes 只是 structural / pending nodes。它们不应被导出为 trajectory,也不应成为可复用 checkpoint;只有 LLM response commit 到 assistant child 后,才形成新的 checkpoint。这样也更自然地支持并发 sibling generation:多个请求可以在同一个 matched parent 下 materialize 各自的 prompt suffix,用独立 buffer 并发 generation,最后再 commit 成不同 assistant children,而不会互相覆盖。

从 Agent Gateway 的视角看,pending node 不会直接暴露出来。它被封装在 branch_handle 里;所有 token-level 工作都发生在 working_buffer 上。

2. 对 request-local state 做 tokenize

如果没有可复用 checkpoint,prepare_result.trajectory_bufferNone,则保留当前 full-encode 路径:

prompt_ids = _encode_full(messages, tools=tools, ...)
working_buffer = TrajectoryBuffer(prompt_ids=prompt_ids)

如果有 checkpoint,prepare_result.trajectory_buffer 已经是 request-local clone。此时保留当前 incremental 行为,但 anchor 到命中的 branch,而不是 session.message_history

working_buffer = prepare_result.trajectory_buffer
incremental_messages = messages[len(prepare_result.checkpoint_messages):]
incremental_ids = _encode_incremental(incremental_messages, ...)
working_buffer.response_ids.extend(incremental_ids)
working_buffer.response_mask.extend([0] * len(incremental_ids))

(注意:tokenization 过程后续会替换为 verl 的 ContinuousTokenBuilder[WIP]。)

随后用 generation_context_ids = working_buffer.prompt_ids + working_buffer.response_ids 调 backend。为了支持并发请求,backend request id 应使用每次 generation 唯一的 id,例如 f"{session_id}:{generation_id}",而不是只有 session_id

3. 在 session mutation lock 之外 generate

当前代码用 generation_lock 包住 _backend.generate(...),是为了保护唯一可变的 active_trajectory。当 trie 使用 request-local buffer 后,backend call 不需要持有 session 级别的 generation lock:

output = await self._backend.generate(
    request_id=f"{session_id}:{generation_id}",
    prompt_ids=generation_context_ids,
    sampling_params=sampling_params,
    image_data=image_data,
    video_data=video_data,
)

迁移时可以先保留 generation_lock,只替换状态模型;等 prepare/commit 不变量由测试覆盖后,再移除或用 feature flag 控制。

4. 在 request_lock 下 commit

decode assistant response 后,把模型生成 tokens 追加到 request-local buffer,并挂回 trie:

working_buffer.response_ids.extend(response_ids)
working_buffer.response_mask.extend([1] * len(response_ids))
append_or_pad_logprobs(working_buffer, output.log_probs, len(response_ids))

session.trie.commit(
    branch_handle=branch_handle,
    assistant_msg=assistant_msg,
    trajectory_buffer=working_buffer,
    request_tools=tools,
    chat_template_kwargs_key=template_key,
    image_data=image_data,
    video_data=video_data,
)

内部实现里,commit(...) 会把 branch_handle 解析回 pending attach node,再 upsert assistant child。如果两个 generation 共享同一个 parent 且生成不同 assistant 文本,它们会成为 sibling assistant nodes;如果生成文本相同,commit 会刷新同一节点,避免重复 sibling。

Finalization

finalize_session 应遍历 trie,产出 branch-aware trajectories,而不是只 materialize 一个 active_trajectory

默认导出策略:

  • 默认导出 terminal assistant checkpoints:未被选中的 sibling 保持为短叶子;已继续生长的分支则由最深的 assistant checkpoint 表示。
  • 如果某个中间 assistant checkpoint 已经被继续成长为更长分支,默认不单独导出;除非显式打开 export_all_checkpoints
  • 对每条导出的 Trajectory 都应用 reward_info,保持当前 framework contract。
  • 每个 branch checkpoint 独立保留 multi_modal_datanum_turnsfinish_reason 和 logprob alignment。

这样 best-of-N 会得到预期形态:未被选中的 sibling 作为短 trajectory 导出,被选中的 winner 则作为一条连续的长 trajectory 导出。

兼容性与 rollout

  • 对外保持 /v1/chat/completions/completecreate_sessionfinalize_session API 不变。
  • 通过类似 gateway_trie_enabled=true 的配置灰度启用新存储。
  • 对严格线性的 session,token ids、mask、tool parsing、multimodal extraction、response budget 行为应与现有测试一致。
  • 为调试保留只读兼容视图,例如 current_headnum_branchesnum_inflight_generations,替代 has_active_trajectory
  • 补充测试:linear continuation、prefix mismatch segmentation parity、best-of-N reattachment、idempotent retry、concurrent sibling commit、tool-call key canonicalization、multimodal branch reuse。

可观测性

建议在 gateway / trainer 侧对 prepare / tokenize / commit 打点:incoming turns、prompt token 长度、trie 拓扑摘要、generation id。具体上报通道按部署环境接入即可。


附录:按分叉类型 walk-through

树节点约定:[sys] / [usr] / [asst] 为角色;[ckpt] 为模型实际生成并 commit 后的检查点。

分类依据是 trie 要表达的结构,而非 agent 框架里的场景标签。树形示例见上文英文 Appendix

适用性:以上分叉类型在 coding agent 里尤其常见——多轮 tool/edit 循环、planner / worker 子 agent、超长上下文 condensation、best-of-N 选 patch 或 plan、超时重试、从半截 run 恢复等。示例为便于阅读用了中性领域,结构与 SWE / code-agent 工作流一一对应。

类型 子例 要点
A. 顺序延伸 单角色单链,无分叉
B. system 键分裂 B1 并行 sub-agent / B2 顺序 handoff 不同 system → root 多叉;不建模调用栈
C. 上下文 condensation C1 全量 recap / C2 中段 splice 共享前缀后开兄弟分支;中段 fork 时 clone ckpt
D. best-of-N / retry D1 best-of-N / D2 idempotent retry 同 user 多 assistant 兄弟;同文复用节点
E. warm-start 导入历史 turn 有节点无 ckpt,首次 live 生成才 commit

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions