[RFC] Prefix Trie for Multi-trajectories Storage for AgentGateway

# Prefix Trie for Multi-Trajectory Storage

**Why a prefix trie (vs. the Agent Gateway’s linear trajectory today)**

- **Single active branch only**: A session keeps one `message_history` and one active trajectory. When switching sub-agents, picking a resample path, or returning to an older branch, new requests cannot reattach to historical branches. A trie keeps every branch; incoming messages longest-prefix-match against any path and continue from there.
- **Repeated encoding of shared prefixes**: Message/token prefixes shared across trajectories are re-materialized and re-tokenized on every branch switch. A trie stores checkpoints on shared nodes; later calls clone from the matched node and tokenize incrementally.
- **No concurrent inference**: One shared state requires a generation lock and serial LLM calls. With a trie, each call owns a cloned branch state; tokenize and commit can interleave—supporting sub-agents, best-of-n, etc.

---

## Core design

**Idea**: Maintain a **prefix trie** per session. Each message sequence is a path from root to some node; assistant nodes may hold a **Checkpoint** (token state at that prefix). On a new request: longest-prefix DFS → clone nearest checkpoint → after inference, commit to the new assistant leaf.

**One generation lifecycle**: `prepare → tokenize → commit`.

| Phase | Role |
|-------|------|
| **prepare** | Walk the trie on incoming turns; deep-copy the matched checkpoint into this call’s `working_state` |
| **tokenize** | Incrementally build this round’s `prompt_ids` from `working_state` |
| **commit** | Write LLM output tokens; attach or refresh the assistant child and checkpoint at the leaf |

**Prefix match key**: Same edge if `(role, text, tool metadata)` match. Forking behavior:

| Example | Trie behavior |
|---------|---------------|
| Different system prompt | Multiple system edges under root (multi-role) |
| Context condensation (recap / splice) | New sibling at the fork point |
| Same input, different output | Multiple assistant children under the same user (**best-of-N**) |
| Same input, same output | Reuse existing assistant node (**idempotent retry**) |

**Concurrency**: Each generation’s `working_state` is an independent clone; tokenize, remote inference, and commit may overlap.

**Relation to Agent Gateway**: Replace “one active trajectory + one message_history” with a trie internally; external APIs can stay the same—only in-session branch routing changes.

---

## Data structures (sketch)

```python
@dataclass(frozen=True)
class MessageKey:
    role: str
    content: Any                  
    name: str | None = None
    tool_calls: tuple[Any, ...] | None = None
    tool_call_id: str | None = None 
    reasoning_content: str | None = None

@dataclass
class BranchCheckpoint:
    trajectory_buffer: TrajectoryBuffer      # flattened token state up to this prefix
    request_tools: list[dict[str, Any]] | None = None
    chat_template_kwargs_key: tuple[Any, ...] | None = None
    messages: list[dict[str, Any]] | None = None      # optional; can be rebuilt from node path

@dataclass(frozen=True)
class BranchHandle:
    generation_id: str
    # Opaque to Agent Gateway; internally points to the pending attach node.

@dataclass
class PrepareResult:
    trajectory_buffer: TrajectoryBuffer | None         # cloned checkpoint buffer; None means full encode
    checkpoint_messages: list[dict[str, Any]]          # prefix covered by trajectory_buffer
    branch_handle: BranchHandle                        # passed back to commit, not inspected by Gateway

class TrieNode:
    key: MessageKey | None
    message: dict[str, Any] | None           # only the current message for this node
    parent: TrieNode | None
    checkpoint: BranchCheckpoint | None      # usually only on committed assistant nodes
    children: dict[MessageKey, TrieNode]

class PrefixTrie:
    root: TrieNode

    def match(self, messages) -> TrieNode:
        """Longest-prefix walk on MessageKey(message); return deepest node."""
        ...

    def materialize_prompt_suffix(self, node, messages) -> TrieNode:
        """Attach prompt-side nodes before generation; no checkpoint yet."""
        ...

    def upsert_assistant(self, parent, assistant_msg, ckpt) -> TrieNode:
        """Find or create assistant child; write ckpt (reuse if same message, sibling if not)."""
        ...

    def prepare(self, messages, *, tools=None, chat_template_kwargs=None) -> PrepareResult:
        node = self.match(messages)
        attach_node = self.materialize_prompt_suffix(node, messages)
        checkpoint_node, checkpoint = nearest_ckpt(attach_node)
        return PrepareResult(
            trajectory_buffer=clone(checkpoint.trajectory_buffer) if checkpoint else None,
            checkpoint_messages=(checkpoint.messages or self.rebuild_messages(checkpoint_node))
            if checkpoint
            else [],
            branch_handle=self.make_branch_handle(attach_node),
        )

    def commit(self, branch_handle, trajectory_buffer, assistant_msg, **checkpoint_metadata):
        attach_node = self.resolve_branch_handle(branch_handle)
        ckpt = BranchCheckpoint(trajectory_buffer=trajectory_buffer, **checkpoint_metadata)
        self.upsert_assistant(attach_node, assistant_msg, ckpt)


@dataclass
class GatewaySessionState:
    handle: SessionHandle
    trie: PrefixTrie = field(default_factory=PrefixTrie)
    reward_info: dict[str, Any] = field(default_factory=dict)
    completed: asyncio.Event = field(default_factory=asyncio.Event)
    phase: SessionPhase = SessionPhase.ACTIVE
    request_lock: asyncio.Lock = field(default_factory=asyncio.Lock)
```

Each trie node stores **one message only**. It does not duplicate the full path. A full transcript can be rebuilt by following `parent` pointers from the node to root. For current phase, we don't use per-node token ids. Each node will save the full concated token_id along the path. Note that in the future, we may only store token id of the incremental content of current node to further save memory but we need to think of a way to merge all token id on the path. We can't simply concate all token id, we need to deal with some boundary issue (e.g. the ambiguous bos and eos case of GLM 4.7) This could be integrated with the ContinuousTokenBuilder[WIP] imported from verl.

`PrepareResult` is the public contract between the trie and Agent Gateway. The Gateway reads and mutates only the request-local `trajectory_buffer` copy: `prompt_ids`, `response_ids`, `response_mask`, and `response_logprobs`. The trie node is hidden behind `branch_handle` and is only passed back at commit time.

** Gateway routing **: Compare incoming `turns` to each branch in the session by longest common prefix; continue on hit, otherwise start a new branch.

**Fork types → trie ops** (walk-through in appendix):

| Fork type | Trie behavior |
|-----------|---------------|
| Sequential extension | Single chain; append user → assistant |
| System-keyed split | New `system` turn → new edge from root |
| Context condensation | New user text at fork → sibling branch; ckpt cloned at splice point |
| Output-keyed split (best-of-N) | Same parent user, multiple assistant children |
| Idempotent retry | Identical assistant text → refresh ckpt on existing node |
| Warm-start | Seeded assistant nodes exist but carry no ckpt |

---

## Integrating with the current Agent Gateway

### Current gateway behavior

The current `GatewaySessionState` is effectively a singleton branch:

```python
request_tools: list[dict] | None
message_history: list[dict]
image_data: list[Any] | None
video_data: list[Any] | None
active_trajectory: TrajectoryBuffer | None
trajectories: list[Trajectory]
request_lock: asyncio.Lock
generation_lock: asyncio.Lock
```

`_handle_chat_completions` has three paths:

| Path | Current behavior | Limitation |
|------|------------------|------------|
| `active_trajectory is None` | Full-encode incoming `messages`; create `TrajectoryBuffer(prompt_ids)` | Good first-call behavior |
| Prefix matches current `message_history` | Copy `active_trajectory`; encode only `messages[len(message_history):]`; append those tokens with `response_mask=0` | Works only for the latest branch |
| Prefix mismatch | Materialize current active buffer into `trajectories`; full-encode the new request; replace `active_trajectory` | Produces linear segments, not branch reattachment |

Important nuance: on prefix mismatch, the current gateway does **not** append the new request's generated tokens into the old active buffer. It materializes the old buffer and starts a fresh `TrajectoryBuffer`. Therefore, the core issue is not token-level interleaving inside one buffer. The real issue is that only the latest branch remains attachable; older branches are stored as finalized linear segments and cannot be continued.

### Why a trie, not only a list of `BranchState`s

A minimal implementation could introduce `BranchState(message_history, active_trajectory, request_tools, image_data, video_data)` and run longest-prefix matching across all branch states. This is a reasonable first step because it restores multiple attachable branches without changing the OpenAI-compatible API.

However, a prefix trie is the better steady-state representation:

- `BranchState` still duplicates shared message prefixes and prompt tokens across siblings; trie nodes share them structurally.
- `BranchState` matching is `O(num_branches * history_len)`; trie matching walks the incoming messages once, with child lookup by canonical message key.
- Best-of-N naturally becomes several assistant children under one user node, instead of several full histories that happen to share a prefix.
- Context condensation and warm-start can be represented as ordinary fork / no-checkpoint cases.
- The trie gives a single place to store checkpoints and branch metadata for safe concurrent generation.


### Request flow

The gateway request flow becomes the same `prepare → generate → commit` lifecycle described above.

**1. Prepare under `request_lock`**

```python
messages, tools, request_chat_template_kwargs = normalize(payload)
prepare_result = session.trie.prepare(
    messages,
    tools=tools,
    chat_template_kwargs=request_chat_template_kwargs,
)
branch_handle = prepare_result.branch_handle
working_buffer = prepare_result.trajectory_buffer
generation_id = branch_handle.generation_id
```

This is intentionally a two-level state transition. During `prepare`, the gateway may attach the incremental request messages first:

```plaintext
matched assistant checkpoint [ckpt]
└─ new user/tool message      # pending, no checkpoint
```

Internally, the pending node becomes the attach parent for this generation. The LLM call then runs with a request-local `working_buffer`; the pending node itself does not own generated tokens yet. After decoding the LLM output, `commit` attaches the assistant child and writes the new checkpoint there:

```plaintext
matched assistant checkpoint [ckpt]
└─ new user/tool message      # structural prompt node, no checkpoint
   └─ generated assistant message [ckpt]
```

These newly attached prompt-side nodes are structural/pending nodes only. They should not be exported as trajectories and should not become reusable checkpoints until the LLM response is committed under an assistant child. This also makes concurrent sibling generation natural: multiple requests can materialize their prompt suffixes under the same matched parent, run generation with independent buffers, and later commit different assistant children without overwriting each other.

From the Agent Gateway's perspective, the pending node is not exposed directly. It is represented by `branch_handle`, while all token-level work happens on `working_buffer`.

**2. Tokenize request-local state**

If no checkpoint is available, `prepare_result.trajectory_buffer` is `None`; keep current full-encode path:

```python
prompt_ids = _encode_full(messages, tools=tools, ...)
working_buffer = TrajectoryBuffer(prompt_ids=prompt_ids)
```

If a checkpoint is available, `prepare_result.trajectory_buffer` is already a request-local clone. Keep today's incremental behavior but anchor it to the matched branch, not to `session.message_history`:

```python
working_buffer = prepare_result.trajectory_buffer
incremental_messages = messages[len(prepare_result.checkpoint_messages):]
incremental_ids = _encode_incremental(incremental_messages, ...)
working_buffer.response_ids.extend(incremental_ids)
working_buffer.response_mask.extend([0] * len(incremental_ids))
```
(note that the tokenization process will be replaced by ContinuousTokenBuilder from verl [WIP])

Then call the backend with `generation_context_ids = working_buffer.prompt_ids + working_buffer.response_ids`. For concurrent requests, use a per-generation backend request id such as `f"{session_id}:{generation_id}"`, not only `session_id`.

**3. Generate outside the session mutation lock**

The current code holds `generation_lock` across `_backend.generate(...)` to protect one mutable `active_trajectory`. With trie-backed request-local buffers, the backend call does not need to hold a session-level generation lock:

```python
output = await self._backend.generate(
    request_id=f"{session_id}:{generation_id}",
    prompt_ids=generation_context_ids,
    sampling_params=sampling_params,
    image_data=image_data,
    video_data=video_data,
)
```

As a migration strategy, we can first keep `generation_lock` while replacing the state model, then remove or feature-flag it once prepare/commit invariants are covered by tests.

**4. Commit under `request_lock`**

After decoding the assistant response, append generated tokens to the request-local buffer and attach the result to the trie:

```python
working_buffer.response_ids.extend(response_ids)
working_buffer.response_mask.extend([1] * len(response_ids))
append_or_pad_logprobs(working_buffer, output.log_probs, len(response_ids))

session.trie.commit(
    branch_handle=branch_handle,
    assistant_msg=assistant_msg,
    trajectory_buffer=working_buffer,
    request_tools=tools,
    chat_template_kwargs_key=template_key,
    image_data=image_data,
    video_data=video_data,
)
```

Internally, `commit(...)` resolves `branch_handle` back to the pending attach node and upserts the assistant child. If two generations share the same parent and produce different assistant text, they become sibling assistant nodes. If they produce the same assistant text, commit refreshes the same node instead of creating a duplicate sibling.

### Finalization

`finalize_session` should traverse the trie and emit branch-aware trajectories instead of only materializing one `active_trajectory`.

Default export policy:

- Emit terminal assistant checkpoints by default: rejected siblings remain short leaves, while a continued branch is represented by its deepest assistant checkpoint.
- Do not separately emit an intermediate assistant checkpoint when it has been continued into a longer branch, unless `export_all_checkpoints` is explicitly enabled.
- Apply `reward_info` to every emitted `Trajectory`, matching the current framework contract.
- Preserve `multi_modal_data`, `num_turns`, `finish_reason`, and logprob alignment per branch checkpoint.

This produces the expected best-of-N shape: rejected siblings are emitted as short trajectories, while the selected winner is emitted as one continuous longer trajectory.

### Compatibility and rollout

- Keep the external `/v1/chat/completions`, `/complete`, `create_session`, and `finalize_session` APIs unchanged.
- Gate the new storage with a config flag such as `gateway_trie_enabled=true` during rollout.
- Preserve current behavior for strictly linear sessions: token ids, masks, tool parsing, multimodal extraction, and response budgeting should match existing tests.
- Keep a read-only compatibility view for debugging (`current_head`, `num_branches`, `num_inflight_generations`) to replace `has_active_trajectory`.
- Add tests for linear continuation, prefix mismatch segmentation parity, best-of-N reattachment, idempotent retry, concurrent sibling commit, tool-call key canonicalization, and multimodal branch reuse.

---

## Observability

Instrument `prepare / tokenize / commit` on gateway or trainer: incoming turns, prompt length, trie topology summary, generation id. Wire to your deployment’s tracing stack.

---

# 多轨迹 Prefix Trie 存储

**为何需要 prefix trie（相对当前 Agent Gateway 的 Linear Trajectory）**

- **只认当前活跃分支**：session 仅维护一条 `message_history` 与 active trajectory，sub-agent 切换、resample 选路或回退到旧分支时，无法把新请求挂回历史分支；trie 保留全部分支，incoming messages 可与任意路径做最长前缀匹配并续写。
- **共享前缀重复编码**：多轨迹共用的 message/token 前缀在分支切换时要反复 materialize、retokenize；trie 在共享节点挂 checkpoint，后续 call 从命中节点 clone 后增量 tokenize 即可。
- **推理无法并发**：单一共享状态需 generation lock 串行 LLM call；trie 每条 call 独占分支 state，tokenize / commit 可交错，支持 sub-agent、best-of-n 等并发场景。

---

## 核心实现

**核心思路**：session 内维护一棵 **prefix trie**——每条 message 序列对应 root 到某节点的路径；assistant 节点可挂 **Checkpoint**（该前缀上的 token 状态）。新请求 DFS 最长前缀匹配 → clone 最近 checkpoint → 推理后 commit 到 assistant 叶节点。

**单次 generation 生命周期**：`prepare → tokenize → commit`。

| 阶段 | 作用 |
|------|------|
| **prepare** | 按 incoming turns 在 trie 上 walk，*深拷贝*匹配到的 checkpoint → 本次 `working_state` |
| **tokenize** | 在 `working_state` 上增量拼出本轮 `prompt_ids` |
| **commit** | 写入 LLM output token，在叶节点 attach / refresh assistant 子节点与 checkpoint |

**前缀匹配键**：`(role, text, tool metadata)` 相等即视为同一边；由此产生各类分叉：

| 现象例子 | trie 行为 |
|------|-----------|
| 不同 system prompt | root 下多条 system 边（多角色） |
| 上下文 condensation（recap / splice） | fork 点新增 sibling |
| 同输入、不同输出 | 同一 user 下多个 assistant 子节点（**best-of-N**） |
| 同输入、同输出 | 复用已有 assistant 节点（**idempotent retry**） |

**并发**：各 generation 的 `working_state` 均为独立 clone；tokenize / 远端推理 / commit 可交错进行。

**与当前 Agent Gateway 的关系**：Gateway 侧可用 trie 替换「单条 active trajectory + 单份 message_history」；整体接口可不变，仅在 session 内部做分支路由。

---

## 数据结构伪代码

```python
@dataclass(frozen=True)
class MessageKey:
    role: str
    content: Any                  # canonical, hashable representation
    name: str | None = None
    tool_calls: tuple[Any, ...] | None = None
    tool_call_id: str | None = None
    reasoning_content: str | None = None

@dataclass
class BranchCheckpoint:
    trajectory_buffer: TrajectoryBuffer      # 截止该前缀的 flattened token state
    request_tools: list[dict[str, Any]] | None = None
    chat_template_kwargs_key: tuple[Any, ...] | None = None
    messages: list[dict[str, Any]] | None = None      # 可选；可由 node path 重建

@dataclass(frozen=True)
class BranchHandle:
    generation_id: str
    # 对 Agent Gateway 不透明；内部指向 pending attach node。

@dataclass
class PrepareResult:
    trajectory_buffer: TrajectoryBuffer | None         # clone 出来的 checkpoint buffer；None 表示需要 full encode
    checkpoint_messages: list[dict[str, Any]]          # trajectory_buffer 已覆盖的 prefix
    branch_handle: BranchHandle                        # commit 时原样传回，Gateway 不解析

class TrieNode:
    key: MessageKey | None
    message: dict[str, Any] | None           # 当前 node 只保存这一条 message
    parent: TrieNode | None
    checkpoint: BranchCheckpoint | None      # 通常只有 committed assistant 节点有
    children: dict[MessageKey, TrieNode]

class PrefixTrie:
    root: TrieNode

    def match(self, messages) -> TrieNode:
        """沿 MessageKey(message) 走最长前缀，返回最深节点。"""
        ...

    def materialize_prompt_suffix(self, node, messages) -> TrieNode:
        """generation 前挂载缺失的 prompt-side nodes；此时不写 checkpoint。"""
        ...

    def upsert_assistant(self, parent, assistant_msg, ckpt) -> TrieNode:
        """找或建 assistant 子节点，写入 ckpt（同 message 复用，不同 message 并列）。"""
        ...

    def prepare(self, messages, *, tools=None, chat_template_kwargs=None) -> PrepareResult:
        node = self.match(messages)
        attach_node = self.materialize_prompt_suffix(node, messages)
        checkpoint_node, checkpoint = nearest_ckpt(attach_node)
        return PrepareResult(
            trajectory_buffer=clone(checkpoint.trajectory_buffer) if checkpoint else None,
            checkpoint_messages=(checkpoint.messages or self.rebuild_messages(checkpoint_node))
            if checkpoint
            else [],
            branch_handle=self.make_branch_handle(attach_node),
        )

    def commit(self, branch_handle, trajectory_buffer, assistant_msg, **checkpoint_metadata):
        attach_node = self.resolve_branch_handle(branch_handle)
        ckpt = BranchCheckpoint(trajectory_buffer=trajectory_buffer, **checkpoint_metadata)
        self.upsert_assistant(attach_node, assistant_msg, ckpt)
```

每个 trie node **只保存当前这一条 message**，不在每个 node 上重复存整条 path。完整 transcript 可通过 `parent` 指针从当前 node 回溯到 root 重建。当前阶段我们不使用 per-node token ids；每个 node 会保存沿 path 拼接后的完整 token_id。未来为了进一步节省内存，可以只保存当前 node 的 incremental content token id，但这需要设计 path 上 token id 的合并方式。这里不能简单 concat，必须处理边界问题，例如 GLM 4.7 中 ambiguous BOS / EOS 的情况。这个方向后续可以和 verl 中引入的 ContinuousTokenBuilder[WIP] 集成。

`PrepareResult` 是 trie 与 Agent Gateway 之间的公开契约。Gateway 只读取和修改 request-local 的 `trajectory_buffer` copy：`prompt_ids`、`response_ids`、`response_mask`、`response_logprobs`。trie node 被隐藏在 `branch_handle` 后面，只在 commit 时原样传回。

**Gateway 路由（同一思路）**：incoming `turns` 与 session 内各分支比最长公共前缀，命中则续写，否则新建分支。

**分叉类型 → trie 操作**（walk-through 见附录）：

| 分叉类型 | trie 行为 |
|----------|-----------|
| 顺序延伸 | 单链追加 user → assistant |
| system 键分裂 | 新 system turn → root 下新边 |
| 上下文 condensation | fork 处 user 文本变化 → 兄弟分支；splice 点 clone ckpt |
| 输出分叉（best-of-N） | 同一 user 下多个 assistant 子节点 |
| idempotent retry | assistant 文本相同 → 刷新已有节点 ckpt |
| warm-start | 预设 assistant 有节点、无 ckpt |

---

## 与当前 Agent Gateway 集成

### 当前 gateway 行为

当前 `GatewaySessionState` 本质上是一条单例分支：

```python
request_tools: list[dict] | None
message_history: list[dict]
image_data: list[Any] | None
video_data: list[Any] | None
active_trajectory: TrajectoryBuffer | None
trajectories: list[Trajectory]
request_lock: asyncio.Lock
generation_lock: asyncio.Lock
```

`_handle_chat_completions` 主要有三条路径：

| 路径 | 当前行为 | 局限 |
|------|----------|------|
| `active_trajectory is None` | 对 incoming `messages` 做 full encode，创建 `TrajectoryBuffer(prompt_ids)` | 首次调用合理 |
| 与当前 `message_history` prefix match | copy 当前 `active_trajectory`，只 encode `messages[len(message_history):]`，并以 `response_mask=0` 追加 | 只能续写最新分支 |
| prefix mismatch | 先把当前 active buffer materialize 到 `trajectories`，再 full encode 新请求并替换 `active_trajectory` | 得到的是线性切段，不是 branch reattachment |

这里有一个重要细节：prefix mismatch 时，当前 gateway **不会**把新请求生成的 tokens append 到旧 active buffer 里。它会先 materialize 旧 buffer，再创建新的 `TrajectoryBuffer`。因此核心问题不是单个 buffer 内部发生 token-level interleaving；真正的问题是只有最新分支仍然可回接，旧分支会变成 finalized linear segment，后续无法继续增长。

### 为什么不是只维护 `BranchState` list

最小实现可以新增 `BranchState(message_history, active_trajectory, request_tools, image_data, video_data)`，然后对所有 branch state 做最长前缀匹配。这个方向是合理的，因为它在不改变 OpenAI-compatible API 的前提下，让多个分支同时保持可回接。

但长期看 prefix trie 是更好的状态表达：

- `BranchState` 仍会在 sibling 之间重复存储共享 message prefix 和 prompt token；trie 可以结构化共享。
- `BranchState` 匹配复杂度是 `O(num_branches * history_len)`；trie 基于 canonical message key 沿 incoming messages 走一遍即可。
- best-of-N 天然表达为同一 user 节点下的多个 assistant 子节点，而不是多份拥有相同前缀的完整 history。
- context condensation 和 warm-start 都可以作为普通 fork / no-checkpoint 节点表示。
- trie 为 checkpoint 和 branch metadata 提供统一落点，便于安全支持并发 generation。

### 请求处理流程

Gateway 请求流程变成上文的 `prepare → generate → commit`。

**1. 在 `request_lock` 下 prepare**

```python
messages, tools, request_chat_template_kwargs = normalize(payload)
prepare_result = session.trie.prepare(
    messages,
    tools=tools,
    chat_template_kwargs=request_chat_template_kwargs,
)
branch_handle = prepare_result.branch_handle
working_buffer = prepare_result.trajectory_buffer
generation_id = branch_handle.generation_id
```

这里刻意拆成两层状态转换。在 `prepare` 阶段，gateway 可以先把本次 request 里缺失的 incremental messages 挂上去：

```plaintext
matched assistant checkpoint [ckpt]
└─ new user/tool message      # pending, no checkpoint
```

内部实现里，这个 pending node 会成为本次 generation 的 attach parent。随后 LLM call 使用 request-local `working_buffer` 运行；pending node 本身还不拥有模型生成 tokens。等 LLM output decode 完后，`commit` 再挂载 assistant child，并把新的 checkpoint 写在 assistant 节点上：

```plaintext
matched assistant checkpoint [ckpt]
└─ new user/tool message      # structural prompt node, no checkpoint
   └─ generated assistant message [ckpt]
```

这些新挂载的 prompt-side nodes 只是 structural / pending nodes。它们不应被导出为 trajectory，也不应成为可复用 checkpoint；只有 LLM response commit 到 assistant child 后，才形成新的 checkpoint。这样也更自然地支持并发 sibling generation：多个请求可以在同一个 matched parent 下 materialize 各自的 prompt suffix，用独立 buffer 并发 generation，最后再 commit 成不同 assistant children，而不会互相覆盖。

从 Agent Gateway 的视角看，pending node 不会直接暴露出来。它被封装在 `branch_handle` 里；所有 token-level 工作都发生在 `working_buffer` 上。

**2. 对 request-local state 做 tokenize**

如果没有可复用 checkpoint，`prepare_result.trajectory_buffer` 为 `None`，则保留当前 full-encode 路径：

```python
prompt_ids = _encode_full(messages, tools=tools, ...)
working_buffer = TrajectoryBuffer(prompt_ids=prompt_ids)
```

如果有 checkpoint，`prepare_result.trajectory_buffer` 已经是 request-local clone。此时保留当前 incremental 行为，但 anchor 到命中的 branch，而不是 `session.message_history`：

```python
working_buffer = prepare_result.trajectory_buffer
incremental_messages = messages[len(prepare_result.checkpoint_messages):]
incremental_ids = _encode_incremental(incremental_messages, ...)
working_buffer.response_ids.extend(incremental_ids)
working_buffer.response_mask.extend([0] * len(incremental_ids))
```

（注意：tokenization 过程后续会替换为 verl 的 ContinuousTokenBuilder[WIP]。）

随后用 `generation_context_ids = working_buffer.prompt_ids + working_buffer.response_ids` 调 backend。为了支持并发请求，backend request id 应使用每次 generation 唯一的 id，例如 `f"{session_id}:{generation_id}"`，而不是只有 `session_id`。

**3. 在 session mutation lock 之外 generate**

当前代码用 `generation_lock` 包住 `_backend.generate(...)`，是为了保护唯一可变的 `active_trajectory`。当 trie 使用 request-local buffer 后，backend call 不需要持有 session 级别的 generation lock：

```python
output = await self._backend.generate(
    request_id=f"{session_id}:{generation_id}",
    prompt_ids=generation_context_ids,
    sampling_params=sampling_params,
    image_data=image_data,
    video_data=video_data,
)
```

迁移时可以先保留 `generation_lock`，只替换状态模型；等 prepare/commit 不变量由测试覆盖后，再移除或用 feature flag 控制。

**4. 在 `request_lock` 下 commit**

decode assistant response 后，把模型生成 tokens 追加到 request-local buffer，并挂回 trie：

```python
working_buffer.response_ids.extend(response_ids)
working_buffer.response_mask.extend([1] * len(response_ids))
append_or_pad_logprobs(working_buffer, output.log_probs, len(response_ids))

session.trie.commit(
    branch_handle=branch_handle,
    assistant_msg=assistant_msg,
    trajectory_buffer=working_buffer,
    request_tools=tools,
    chat_template_kwargs_key=template_key,
    image_data=image_data,
    video_data=video_data,
)
```

内部实现里，`commit(...)` 会把 `branch_handle` 解析回 pending attach node，再 upsert assistant child。如果两个 generation 共享同一个 parent 且生成不同 assistant 文本，它们会成为 sibling assistant nodes；如果生成文本相同，commit 会刷新同一节点，避免重复 sibling。

### Finalization

`finalize_session` 应遍历 trie，产出 branch-aware trajectories，而不是只 materialize 一个 `active_trajectory`。

默认导出策略：

- 默认导出 terminal assistant checkpoints：未被选中的 sibling 保持为短叶子；已继续生长的分支则由最深的 assistant checkpoint 表示。
- 如果某个中间 assistant checkpoint 已经被继续成长为更长分支，默认不单独导出；除非显式打开 `export_all_checkpoints`。
- 对每条导出的 `Trajectory` 都应用 `reward_info`，保持当前 framework contract。
- 每个 branch checkpoint 独立保留 `multi_modal_data`、`num_turns`、`finish_reason` 和 logprob alignment。

这样 best-of-N 会得到预期形态：未被选中的 sibling 作为短 trajectory 导出，被选中的 winner 则作为一条连续的长 trajectory 导出。

### 兼容性与 rollout

- 对外保持 `/v1/chat/completions`、`/complete`、`create_session`、`finalize_session` API 不变。
- 通过类似 `gateway_trie_enabled=true` 的配置灰度启用新存储。
- 对严格线性的 session，token ids、mask、tool parsing、multimodal extraction、response budget 行为应与现有测试一致。
- 为调试保留只读兼容视图，例如 `current_head`、`num_branches`、`num_inflight_generations`，替代 `has_active_trajectory`。
- 补充测试：linear continuation、prefix mismatch segmentation parity、best-of-N reattachment、idempotent retry、concurrent sibling commit、tool-call key canonicalization、multimodal branch reuse。

---

## 可观测性

建议在 gateway / trainer 侧对 `prepare / tokenize / commit` 打点：incoming turns、prompt token 长度、trie 拓扑摘要、generation id。具体上报通道按部署环境接入即可。

---

# 附录：按分叉类型 walk-through

树节点约定：`[sys]` / `[usr]` / `[asst]` 为角色；`[ckpt]` 为模型**实际生成并 commit** 后的检查点。

分类依据是 **trie 要表达的结构**，而非 agent 框架里的场景标签。树形示例见上文英文 **Appendix**。

**适用性**：以上分叉类型在 **coding agent** 里尤其常见——多轮 tool/edit 循环、planner / worker 子 agent、超长上下文 condensation、best-of-N 选 patch 或 plan、超时重试、从半截 run 恢复等。示例为便于阅读用了中性领域，结构与 SWE / code-agent 工作流一一对应。

| 类型 | 子例 | 要点 |
|------|------|------|
| **A. 顺序延伸** | — | 单角色单链，无分叉 |
| **B. system 键分裂** | B1 并行 sub-agent / B2 顺序 handoff | 不同 system → root 多叉；不建模调用栈 |
| **C. 上下文 condensation** | C1 全量 recap / C2 中段 splice | 共享前缀后开兄弟分支；中段 fork 时 clone ckpt |
| **D. best-of-N / retry** | D1 best-of-N / D2 idempotent retry | 同 user 多 assistant 兄弟；同文复用节点 |
| **E. warm-start** | — | 导入历史 turn 有节点无 ckpt，首次 live 生成才 commit |


Phase	Role
prepare	Walk the trie on incoming turns; deep-copy the matched checkpoint into this call’s `working_state`
tokenize	Incrementally build this round’s `prompt_ids` from `working_state`
commit	Write LLM output tokens; attach or refresh the assistant child and checkpoint at the leaf

阶段	作用
prepare	按 incoming turns 在 trie 上 walk，深拷贝匹配到的 checkpoint → 本次 `working_state`
tokenize	在 `working_state` 上增量拼出本轮 `prompt_ids`
commit	写入 LLM output token，在叶节点 attach / refresh assistant 子节点与 checkpoint

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Prefix Trie for Multi-trajectories Storage for AgentGateway #51

Prefix Trie for Multi-Trajectory Storage

Core design

Data structures (sketch)

Integrating with the current Agent Gateway

Current gateway behavior

Why a trie, not only a list of `BranchState`s

Request flow

Finalization

Compatibility and rollout

Observability

多轨迹 Prefix Trie 存储

核心实现

数据结构伪代码

与当前 Agent Gateway 集成

当前 gateway 行为

为什么不是只维护 `BranchState` list

请求处理流程

Finalization

兼容性与 rollout

可观测性

附录：按分叉类型 walk-through

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Example	Trie behavior
Different system prompt	Multiple system edges under root (multi-role)
Context condensation (recap / splice)	New sibling at the fork point
Same input, different output	Multiple assistant children under the same user (best-of-N)
Same input, same output	Reuse existing assistant node (idempotent retry)

Fork type	Trie behavior
Sequential extension	Single chain; append user → assistant
System-keyed split	New `system` turn → new edge from root
Context condensation	New user text at fork → sibling branch; ckpt cloned at splice point
Output-keyed split (best-of-N)	Same parent user, multiple assistant children
Idempotent retry	Identical assistant text → refresh ckpt on existing node
Warm-start	Seeded assistant nodes exist but carry no ckpt

Path	Current behavior	Limitation
`active_trajectory is None`	Full-encode incoming `messages`; create `TrajectoryBuffer(prompt_ids)`	Good first-call behavior
Prefix matches current `message_history`	Copy `active_trajectory`; encode only `messages[len(message_history):]`; append those tokens with `response_mask=0`	Works only for the latest branch
Prefix mismatch	Materialize current active buffer into `trajectories`; full-encode the new request; replace `active_trajectory`	Produces linear segments, not branch reattachment

现象例子	trie 行为
不同 system prompt	root 下多条 system 边（多角色）
上下文 condensation（recap / splice）	fork 点新增 sibling
同输入、不同输出	同一 user 下多个 assistant 子节点（best-of-N）
同输入、同输出	复用已有 assistant 节点（idempotent retry）

分叉类型	trie 行为
顺序延伸	单链追加 user → assistant
system 键分裂	新 system turn → root 下新边
上下文 condensation	fork 处 user 文本变化 → 兄弟分支；splice 点 clone ckpt
输出分叉（best-of-N）	同一 user 下多个 assistant 子节点
idempotent retry	assistant 文本相同 → 刷新已有节点 ckpt
warm-start	预设 assistant 有节点、无 ckpt

路径	当前行为	局限
`active_trajectory is None`	对 incoming `messages` 做 full encode，创建 `TrajectoryBuffer(prompt_ids)`	首次调用合理
与当前 `message_history` prefix match	copy 当前 `active_trajectory`，只 encode `messages[len(message_history):]`，并以 `response_mask=0` 追加	只能续写最新分支
prefix mismatch	先把当前 active buffer materialize 到 `trajectories`，再 full encode 新请求并替换 `active_trajectory`	得到的是线性切段，不是 branch reattachment

类型	子例	要点
A. 顺序延伸	—	单角色单链，无分叉
B. system 键分裂	B1 并行 sub-agent / B2 顺序 handoff	不同 system → root 多叉；不建模调用栈
C. 上下文 condensation	C1 全量 recap / C2 中段 splice	共享前缀后开兄弟分支；中段 fork 时 clone ckpt
D. best-of-N / retry	D1 best-of-N / D2 idempotent retry	同 user 多 assistant 兄弟；同文复用节点
E. warm-start	—	导入历史 turn 有节点无 ckpt，首次 live 生成才 commit

[RFC] Prefix Trie for Multi-trajectories Storage for AgentGateway #51

Description

Prefix Trie for Multi-Trajectory Storage

Core design

Data structures (sketch)

Integrating with the current Agent Gateway

Current gateway behavior

Why a trie, not only a list of BranchStates

Request flow

Finalization

Compatibility and rollout

Observability

多轨迹 Prefix Trie 存储

核心实现

数据结构伪代码

与当前 Agent Gateway 集成

当前 gateway 行为

为什么不是只维护 BranchState list

请求处理流程

Finalization

兼容性与 rollout

可观测性

附录：按分叉类型 walk-through

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Why a trie, not only a list of `BranchState`s

为什么不是只维护 `BranchState` list