Simplify CPU KV context protocol and deduplicate server key-resolution path by Copilot · Pull Request #253 · hlin99/LMCache

Copilot · 2026-05-13T05:42:24Z

PR #252 introduced a CPU KV transfer path but expanded server-side surface area with a verbose registration payload and duplicated store/retrieve flow logic. This change keeps the CPU path intact while removing unnecessary protocol complexity and reusing existing server infrastructure.

Protocol: compact CPU registration payload
- REGISTER_KV_CACHE_CPU_CONTEXT now sends serialized layout metadata instead of decomposed scalar fields.
- Payload is reduced to:
  - instance_id, model_name, world_size, layout_desc_bytes, block_size, use_mla
- EngineType / LayoutHints and reconstructed layout scalars are removed from the CPU registration protocol.

Transfer context: send full MemoryLayoutDesc directly
- CPUTransferContext.register() now serializes the computed MemoryLayoutDesc once and sends it to server as bytes.
- Eliminates server-side shape/dtype reconstruction from redundant scalar params.

Server: remove repeated session/hash/object-key derivation
- Added shared helper:
  - _resolve_obj_keys(self, key: IPCCacheEngineKey) -> list[ObjectKey]
- Reused by:
  - store
  - retrieve
  - store_cpu_chunks
  - retrieve_cpu_chunks
- CPU chunk methods are now focused on CPU payload handling plus storage read/write orchestration.

Tests updated for API/payload changes
- Updated CPU-context and adapter tests to match the new registration contract and assert serialized layout payload behavior.

# New CPU registration payload shape
[
    instance_id,
    model_name,
    world_size,
    pickle.dumps(layout_desc),  # MemoryLayoutDesc bytes
    block_size,
    use_mla,
]
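
The round trip implied by this payload can be sketched as follows. This is a minimal illustration, not the code in this PR: the real MemoryLayoutDesc fields are not shown here, so a plain dict with made-up fields stands in for it, and the helper names `build_cpu_registration_payload` and `unpack_cpu_registration_payload`, as well as the string type of `instance_id`, are assumptions.

```python
import pickle

# Hypothetical stand-in for MemoryLayoutDesc (its real fields are not shown in this PR).
layout_desc = {"shapes": [(2, 16, 8, 64)], "dtypes": ["float16"]}


def build_cpu_registration_payload(instance_id, model_name, world_size,
                                   layout_desc, block_size, use_mla):
    """Worker side: serialize the layout once and pack the six fields."""
    return [
        instance_id,
        model_name,
        world_size,
        pickle.dumps(layout_desc),  # layout travels as opaque bytes
        block_size,
        use_mla,
    ]


def unpack_cpu_registration_payload(payload):
    """Server side: recover the layout directly instead of rebuilding it from scalars."""
    instance_id, model_name, world_size, layout_bytes, block_size, use_mla = payload
    return {
        "instance_id": instance_id,
        "model_name": model_name,
        "world_size": world_size,
        "layout_desc": pickle.loads(layout_bytes),
        "block_size": block_size,
        "use_mla": use_mla,
    }


payload = build_cpu_registration_payload("inst-0", "demo-model", 1, layout_desc, 16, False)
ctx = unpack_cpu_registration_payload(payload)
```

Because `pickle.loads` can execute arbitrary code, this shape only makes sense when the worker and server trust each other, as in a local multiprocess setup.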

Original prompt

Context

Branch ww20_PR_cpu_context_pickle (PR #252) adds a CPU-based KV transfer path for non-CUDA devices. However, the server-side changes are overly invasive and introduce large amounts of code duplication. This cleanup PR should remove the unnecessary server-side additions and make the CPU path reuse existing infrastructure as much as possible.

Problem

The current implementation adds 3 new RequestType enums (REGISTER_KV_CACHE_CPU_CONTEXT, STORE_CPU_CHUNKS, RETRIEVE_CPU_CHUNKS), 3 new server methods (register_kv_cache_cpu_context, store_cpu_chunks, retrieve_cpu_chunks), 2 new server dicts (cpu_contexts, cpu_context_meta), and ~29 lines of protocol definitions — all of which largely duplicate the existing CUDA path logic in server.py.

Specifically:

- register_kv_cache_cpu_context receives 10 params, immediately deletes 2 (engine_type, layout_hints), and reconstructs a MemoryLayoutDesc from scalars, even though the worker side already has the full MemoryLayoutDesc.
- store_cpu_chunks and retrieve_cpu_chunks duplicate the entire session/hash/ipc_key_to_object_keys/reserve_write/finish_write flow from the existing store/retrieve methods. The only real difference is the data copy mechanism (pickle vs CUDA IPC memcpy).

Required Changes

1. Simplify server registration for CPU context

Instead of a separate register_kv_cache_cpu_context with 10 parameters, create a minimal registration path. Options:

- Have the worker send the serialized MemoryLayoutDesc (pickle) plus (model_name, world_size, block_size, use_mla) via a single simplified request.
- Or even better: have the REGISTER_KV_CACHE_CPU_CONTEXT request type take just (instance_id, model_name, world_size, layout_desc_bytes, block_size, use_mla) where layout_desc_bytes = pickle.dumps(layout_desc). This eliminates the need to pass num_layers, hidden_dim_size, dtype_str as separate params and reconstruct them server-side.
- Remove unused engine_type and layout_hints from the CPU registration protocol definition entirely.
- Update protocols/engine.py accordingly to match the simplified payload.

2. Reduce duplication in store/retrieve

Extract the common session/hash/key computation into a shared helper method on MPCacheEngine:

def _resolve_obj_keys(self, key: IPCCacheEngineKey) -> list[ObjectKey]:
    """Common session/hash/key resolution used by both CUDA and CPU paths."""
    session = self.session_manager.get_or_create(key.request_id)
    session.set_tokens(list(key.token_ids))
    chunk_hashes = [
        TokenHasher.hash_to_bytes(h) for h in session.get_hashes(key.start, key.end)
    ]
    if key.worker_id is None:
        raise ValueError("Must operate with worker_id != None")
    return ipc_key_to_object_keys(key, chunk_hashes)

Then have store, retrieve, store_cpu_chunks, and retrieve_cpu_chunks all call this helper instead of duplicating those ~6 lines each.
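
The reuse pattern can be shown with a toy model that runs on its own. Everything here is a hypothetical stand-in: `ToyKey` replaces IPCCacheEngineKey, `ToyEngine` replaces MPCacheEngine, and a per-token SHA-256 replaces the session manager and TokenHasher. Only the idea that both paths derive object keys through one helper comes from the text above.

```python
import hashlib
from collections import namedtuple

# Hypothetical stand-in for IPCCacheEngineKey.
ToyKey = namedtuple("ToyKey", "request_id token_ids start end worker_id")


class ToyEngine:
    """Hypothetical stand-in for MPCacheEngine; only key resolution is modeled."""

    def __init__(self):
        self.objects = {}

    def _resolve_obj_keys(self, key):
        # Stand-in for the session/hash steps: one hash per token in [start, end).
        chunk_hashes = [
            hashlib.sha256(str(token).encode()).digest()
            for token in key.token_ids[key.start:key.end]
        ]
        if key.worker_id is None:
            raise ValueError("Must operate with worker_id != None")
        return [(key.worker_id, h) for h in chunk_hashes]

    def store(self, key, chunks):
        # Existing path: resolve keys, then write each chunk.
        for obj_key, chunk in zip(self._resolve_obj_keys(key), chunks):
            self.objects[obj_key] = chunk

    def store_cpu_chunks(self, key, chunks):
        # CPU path: same key derivation; only the payload handling differs.
        for obj_key, chunk in zip(self._resolve_obj_keys(key), chunks):
            self.objects[obj_key] = bytes(chunk)


engine = ToyEngine()
key = ToyKey("req-1", (11, 12, 13, 14), 0, 2, worker_id=0)
engine.store(key, [b"a", b"b"])
engine.store_cpu_chunks(key, [bytearray(b"c"), bytearray(b"d")])
```

Since both methods resolve the same key to the same object keys, the second call overwrites the two entries written by the first rather than creating new ones.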

3. Simplify `store_cpu_chunks` and `retrieve_cpu_chunks`

After extracting the helper, these methods should be much shorter — just the pickle deserialization + tensor.copy_ (store) or tensor.cpu().clone() + pickle serialization (retrieve), plus reserve_write/finish_write calls.
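
A rough sketch of how short the two CPU methods could become is shown below. It is written against assumptions, not the real server: `ToyStorage` is a hypothetical stand-in for the storage manager, bytearrays play the role of tensors (so a slice assignment stands in for tensor.copy_ and `bytes(...)` for tensor.cpu().clone()), and the function signatures take already-resolved object keys rather than matching the real methods.

```python
import pickle


class ToyStorage:
    """Hypothetical stand-in for the reserve_write/finish_write flow."""

    def __init__(self):
        self._pending = {}
        self._committed = {}

    def reserve_write(self, obj_key, nbytes):
        # Hand out a writable buffer that is not yet visible to readers.
        buf = bytearray(nbytes)
        self._pending[obj_key] = buf
        return buf

    def finish_write(self, obj_key):
        # Publish the buffer once the copy is complete.
        self._committed[obj_key] = self._pending.pop(obj_key)

    def read(self, obj_key):
        return self._committed[obj_key]


def store_cpu_chunks(storage, obj_keys, payload):
    """Deserialize the chunks, copy each into reserved memory, then commit."""
    chunks = pickle.loads(payload)  # assumed: list of bytes, one per object key
    for obj_key, chunk in zip(obj_keys, chunks):
        buf = storage.reserve_write(obj_key, len(chunk))
        buf[:] = chunk  # stands in for tensor.copy_
        storage.finish_write(obj_key)


def retrieve_cpu_chunks(storage, obj_keys):
    """Copy each stored chunk out and serialize the list for the worker."""
    return pickle.dumps([bytes(storage.read(k)) for k in obj_keys])


storage = ToyStorage()
obj_keys = ["chunk-0", "chunk-1"]
store_cpu_chunks(storage, obj_keys, pickle.dumps([b"\x01\x02", b"\x03\x04\x05"]))
restored = pickle.loads(retrieve_cpu_chunks(storage, obj_keys))
```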

4. Clean up `transfer_context.py` — `CPUTransferContext.register()`

Update the registration call in CPUTransferContext.register() to match the simplified server protocol. Specifically:

- Serialize layout_desc with pickle and send as bytes
- Remove engine_type and layout_hints from the registration payload
- Update protocols/engine.py REGISTER_KV_CACHE_CPU_CONTEXT definition accordingly

5. Update protocol definitions

In protocols/engine.py, update the REGISTER_KV_CACHE_CPU_CONTEXT protocol definition to match the simplified parameter list. Remove EngineType and LayoutHints from the payload classes.
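
One way to picture the simplified definition is as an ordered list of field names and types that a payload can be checked against. The sketch below is purely illustrative: it does not reflect how protocols/engine.py actually declares protocols, the `CPU_REGISTRATION_FIELDS` table and `validate_cpu_registration` helper are hypothetical, and the string type of `instance_id` is an assumption.

```python
# Hypothetical field table for the simplified CPU registration payload.
CPU_REGISTRATION_FIELDS = [
    ("instance_id", str),  # assumed type
    ("model_name", str),
    ("world_size", int),
    ("layout_desc_bytes", bytes),
    ("block_size", int),
    ("use_mla", bool),
]


def validate_cpu_registration(payload):
    """Check arity and exact field types, then return the payload as a dict."""
    if len(payload) != len(CPU_REGISTRATION_FIELDS):
        raise ValueError(
            f"expected {len(CPU_REGISTRATION_FIELDS)} fields, got {len(payload)}"
        )
    for (name, expected), value in zip(CPU_REGISTRATION_FIELDS, payload):
        # Exact type match so that a bool is not accepted where an int is expected.
        if type(value) is not expected:
            raise TypeError(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return {name: value for (name, _), value in zip(CPU_REGISTRATION_FIELDS, payload)}


fields = validate_cpu_registration(["inst-0", "demo-model", 1, b"\x80", 16, False])
```

A ten-field payload in the old style would fail the arity check immediately, which makes a mismatch between worker and server versions easy to spot.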

6. Update tests

Update tests/v1/multiprocess/test_cpu_context.py and tests/v1/test_vllm_mp_adapter.py to match any API changes.

Files to modify

lmcache/v1/multiprocess/server.py — main cleanup target
lmcache/v1/multiprocess/transfer_context.py — update registration call
lmcache/v1/multiprocess/protocols/engine.py — simplify protocol definition
tests/v1/multiprocess/test_cpu_context.py — update tests
tests/v1/test_vllm_mp_adapter.py — update tests if needed

Goal

Minimize the diff in server.py so the PR is clean and easy to review, while keeping the CPU context functionality fully working.

The following is the prior conversation context from the user's chat exploration (may be truncated):

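User: @hlin99/LMCache/pull/252 Take a look at this PR. Are the changes to the adapter still too extensive? Are they reasonable?

A: OK, let me take a close look at this PR's changes to the adapter.

PR overview

Metric	Value
Files changed	11 (4 added, 7 modified)
Lines added	+2,164
Lines deleted	-115
Commits	7

The core purpose of this PR is to add a KV cache transfer path for non-CUDA devices (CPU/XPU/HPU), using pickle serialization to store/retrieve KV data on the CPU side.

Adapter change ana...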

This pull request was created from Copilot chat.

Initial plan

7914c1f

Copilot AI assigned Copilot and hlin99 May 13, 2026

Copilot started work on behalf of hlin99 May 13, 2026 05:42 View session

Copilot AI and others added 2 commits May 13, 2026 05:51

refactor: simplify CPU context registration payload and key resolution reuse

fab2fdf

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/ff2f6066-9a28-465c-b57b-513765cf04fa Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

test: align imports and docstrings after validation feedback

2af1ef6

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/ff2f6066-9a28-465c-b57b-513765cf04fa Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot AI changed the title from "[WIP] Refactor server-side CPU KV transfer path for better reuse" to "Simplify CPU KV context protocol and deduplicate server key-resolution path" May 13, 2026

Copilot AI requested a review from hlin99 May 13, 2026 05:56

Copilot finished work on behalf of hlin99 May 13, 2026 05:56

Copilot AI mentioned this pull request May 13, 2026

Minimize CPU-context integration diff by restoring adapter-owned futures and reverting CPU registration to scalar payloads #254

Closed

hlin99 closed this May 28, 2026

hlin99 deleted the copilot/cleanup-cpu-kv-transfer-path branch May 28, 2026 12:22

Copilot wants to merge 3 commits into ww20_PR_cpu_context_pickle from copilot/cleanup-cpu-kv-transfer-path
