Simplify CPU KV context protocol and deduplicate server key-resolution path#253

Closed
Copilot wants to merge 3 commits into ww20_PR_cpu_context_pickle from copilot/cleanup-cpu-kv-transfer-path

Copilot AI commented May 13, 2026

PR #252 introduced a CPU KV transfer path but expanded server-side surface area with a verbose registration payload and duplicated store/retrieve flow logic. This change keeps the CPU path intact while removing unnecessary protocol complexity and reusing existing server infrastructure.

  • Protocol: compact CPU registration payload

    • REGISTER_KV_CACHE_CPU_CONTEXT now sends serialized layout metadata instead of decomposed scalar fields.
    • Payload is reduced to:
      • instance_id, model_name, world_size, layout_desc_bytes, block_size, use_mla
    • EngineType / LayoutHints and reconstructed layout scalars are removed from the CPU registration protocol.
  • Transfer context: send full MemoryLayoutDesc directly

    • CPUTransferContext.register() now serializes the computed MemoryLayoutDesc once and sends it to server as bytes.
    • Eliminates server-side shape/dtype reconstruction from redundant scalar params.
  • Server: remove repeated session/hash/object-key derivation

    • Added shared helper:
      • _resolve_obj_keys(self, key: IPCCacheEngineKey) -> list[ObjectKey]
    • Reused by:
      • store
      • retrieve
      • store_cpu_chunks
      • retrieve_cpu_chunks
    • CPU chunk methods are now focused on CPU payload handling plus storage read/write orchestration.
  • Tests updated for API/payload changes

    • Updated CPU-context and adapter tests to match the new registration contract and assert serialized layout payload behavior.
# New CPU registration payload shape
[
    instance_id,
    model_name,
    world_size,
    pickle.dumps(layout_desc),  # MemoryLayoutDesc bytes
    block_size,
    use_mla,
]
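As a rough illustration of the payload above, the sketch below packs the six fields on the worker side and unpacks them on the server side. The function names are hypothetical, and a plain dict stands in for MemoryLayoutDesc, whose real fields are not shown in this PR description.

```python
import pickle


def build_cpu_registration_payload(instance_id, model_name, world_size,
                                   layout_desc, block_size, use_mla):
    """Worker side (hypothetical helper): serialize the layout once, send bytes."""
    return [
        instance_id,
        model_name,
        world_size,
        pickle.dumps(layout_desc),  # layout_desc_bytes
        block_size,
        use_mla,
    ]


def unpack_cpu_registration_payload(payload):
    """Server side (hypothetical helper): recover the layout without scalar reconstruction."""
    (instance_id, model_name, world_size,
     layout_desc_bytes, block_size, use_mla) = payload
    layout_desc = pickle.loads(layout_desc_bytes)
    return instance_id, model_name, world_size, layout_desc, block_size, use_mla


# A dict stands in for MemoryLayoutDesc here; the field names are assumptions.
layout = {"shapes": [(2, 16, 128)], "dtypes": ["bfloat16"]}
payload = build_cpu_registration_payload(0, "demo-model", 1, layout, 16, False)
unpacked = unpack_cpu_registration_payload(payload)
```

Note that pickle.loads can execute arbitrary code from its input, so this shape only makes sense when the worker and server trust each other.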
Original prompt

Context

Branch ww20_PR_cpu_context_pickle (PR #252) adds a CPU-based KV transfer path for non-CUDA devices. However, the server-side changes are overly invasive and introduce large amounts of code duplication. This cleanup PR should remove the unnecessary server-side additions and make the CPU path reuse existing infrastructure as much as possible.

Problem

The current implementation adds 3 new RequestType enums (REGISTER_KV_CACHE_CPU_CONTEXT, STORE_CPU_CHUNKS, RETRIEVE_CPU_CHUNKS), 3 new server methods (register_kv_cache_cpu_context, store_cpu_chunks, retrieve_cpu_chunks), 2 new server dicts (cpu_contexts, cpu_context_meta), and ~29 lines of protocol definitions — all of which largely duplicate the existing CUDA path logic in server.py.

Specifically:

  • register_kv_cache_cpu_context receives 10 params, immediately deletes 2 (engine_type, layout_hints), and reconstructs a MemoryLayoutDesc from scalars — even though the worker side already has the full MemoryLayoutDesc.
  • store_cpu_chunks and retrieve_cpu_chunks duplicate the entire session/hash/ipc_key_to_object_keys/reserve_write/finish_write flow from the existing store/retrieve methods. The only real difference is the data copy mechanism (pickle vs CUDA IPC memcpy).

Required Changes

1. Simplify server registration for CPU context

Instead of a separate register_kv_cache_cpu_context with 10 parameters, create a minimal registration path. Options:

  • Have the worker send the serialized MemoryLayoutDesc (pickle) plus (model_name, world_size, block_size, use_mla) via a single simplified request.
  • Or even better: have the REGISTER_KV_CACHE_CPU_CONTEXT request type take just (instance_id, model_name, world_size, layout_desc_bytes, block_size, use_mla) where layout_desc_bytes = pickle.dumps(layout_desc). This eliminates the need to pass num_layers, hidden_dim_size, dtype_str as separate params and reconstruct them server-side.
  • Remove unused engine_type and layout_hints from the CPU registration protocol definition entirely.
  • Update protocols/engine.py accordingly to match the simplified payload.

2. Reduce duplication in store/retrieve

Extract the common session/hash/key computation into a shared helper method on MPCacheEngine:

def _resolve_obj_keys(self, key: IPCCacheEngineKey) -> list[ObjectKey]:
    """Common session/hash/key resolution used by both CUDA and CPU paths."""
    session = self.session_manager.get_or_create(key.request_id)
    session.set_tokens(list(key.token_ids))
    chunk_hashes = [
        TokenHasher.hash_to_bytes(h) for h in session.get_hashes(key.start, key.end)
    ]
    if key.worker_id is None:
        raise ValueError("Must operate with worker_id != None")
    return ipc_key_to_object_keys(key, chunk_hashes)

Then have store, retrieve, store_cpu_chunks, and retrieve_cpu_chunks all call this helper instead of duplicating those ~6 lines each.
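To make the deduplication concrete, here is a self-contained sketch in which a CUDA-style store and a CPU-style store both go through one key-resolution helper. Every class here (FakeSession, Key, EngineSketch) is a hypothetical stand-in for the real session manager, IPCCacheEngineKey, and MPCacheEngine, and the hashing is illustrative only. Unlike the snippet above, this sketch checks worker_id before touching the session, which is a design choice of the sketch rather than something the PR specifies.

```python
import hashlib
import pickle


class FakeSession:
    """Hypothetical session: hashes fixed-size token chunks."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.tokens = []

    def set_tokens(self, tokens):
        self.tokens = tokens

    def get_hashes(self, start, end):
        return [
            hashlib.sha256(repr(self.tokens[i:min(i + self.chunk_size, end)]).encode()).digest()
            for i in range(start, end, self.chunk_size)
        ]


class Key:
    """Hypothetical stand-in for IPCCacheEngineKey."""

    def __init__(self, request_id, token_ids, start, end, worker_id):
        self.request_id = request_id
        self.token_ids = token_ids
        self.start = start
        self.end = end
        self.worker_id = worker_id


class EngineSketch:
    def __init__(self):
        self.sessions = {}
        self.storage = {}

    def _resolve_obj_keys(self, key):
        """Shared session/hash/key resolution used by every store path."""
        if key.worker_id is None:
            raise ValueError("Must operate with worker_id != None")
        session = self.sessions.setdefault(key.request_id, FakeSession())
        session.set_tokens(list(key.token_ids))
        return [(key.worker_id, h) for h in session.get_hashes(key.start, key.end)]

    def store(self, key, blocks):
        # CUDA-style path: only the data handling differs.
        for obj_key, block in zip(self._resolve_obj_keys(key), blocks):
            self.storage[obj_key] = block

    def store_cpu_chunks(self, key, pickled_chunks):
        # CPU path: same keys, payload arrives pickled.
        for obj_key, raw in zip(self._resolve_obj_keys(key), pickled_chunks):
            self.storage[obj_key] = pickle.loads(raw)
```

Because both methods resolve keys through the same helper, a chunk stored via one path lands under the same object key as it would via the other.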

3. Simplify store_cpu_chunks and retrieve_cpu_chunks

After extracting the helper, these methods should be much shorter — just the pickle deserialization + tensor.copy_ (store) or tensor.cpu().clone() + pickle serialization (retrieve), plus reserve_write/finish_write calls.
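A minimal sketch of what the slimmed-down CPU chunk methods might look like once key resolution lives elsewhere. FakeStorage and its read method are hypothetical, bytearray buffers stand in for tensors, and the slice assignment and bytes() copy stand in for tensor.copy_ and tensor.cpu().clone() respectively. The real storage manager's reserve_write/finish_write signatures may differ.

```python
import pickle


class FakeStorage:
    """Hypothetical two-phase-write storage, loosely modelled on reserve_write/finish_write."""

    def __init__(self):
        self._pending = {}
        self._committed = {}

    def reserve_write(self, obj_key, nbytes):
        buf = bytearray(nbytes)
        self._pending[obj_key] = buf
        return buf

    def finish_write(self, obj_key):
        # Only committed objects become visible to readers.
        self._committed[obj_key] = self._pending.pop(obj_key)

    def read(self, obj_key):
        return self._committed[obj_key]


def store_cpu_chunks(storage, obj_keys, pickled_chunks):
    """Deserialize each chunk and copy it into reserved storage."""
    for obj_key, raw in zip(obj_keys, pickled_chunks):
        chunk = pickle.loads(raw)  # raw bytes in this sketch
        buf = storage.reserve_write(obj_key, len(chunk))
        buf[:] = chunk  # stands in for tensor.copy_
        storage.finish_write(obj_key)


def retrieve_cpu_chunks(storage, obj_keys):
    """Copy each stored chunk out and serialize it for the worker."""
    # bytes(...) makes a copy, standing in for tensor.cpu().clone()
    return [pickle.dumps(bytes(storage.read(k))) for k in obj_keys]
```

With the keys passed in already resolved, each method is reduced to the copy mechanism plus the write bookkeeping.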

4. Clean up transfer_context.py: CPUTransferContext.register()

Update the registration call in CPUTransferContext.register() to match the simplified server protocol. Specifically:

  • Serialize layout_desc with pickle and send as bytes
  • Remove engine_type and layout_hints from the registration payload
  • Update protocols/engine.py REGISTER_KV_CACHE_CPU_CONTEXT definition accordingly

5. Update protocol definitions

In protocols/engine.py, update the REGISTER_KV_CACHE_CPU_CONTEXT protocol definition to match the simplified parameter list. Remove EngineType and LayoutHints from the payload classes.
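One way to picture the simplified definition is as a schema of six named fields. The sketch below is not how protocols/engine.py is actually structured; the schema table, the validate_payload helper, and the field types (for example instance_id as int) are all assumptions made for illustration.

```python
from enum import Enum, auto


class RequestType(Enum):
    # Only the member relevant to this sketch; the real enum has more.
    REGISTER_KV_CACHE_CPU_CONTEXT = auto()


# Hypothetical schema: (field name, expected Python type) in payload order.
PAYLOAD_SCHEMAS = {
    RequestType.REGISTER_KV_CACHE_CPU_CONTEXT: [
        ("instance_id", int),
        ("model_name", str),
        ("world_size", int),
        ("layout_desc_bytes", bytes),
        ("block_size", int),
        ("use_mla", bool),
    ],
}


def validate_payload(request_type, payload):
    """Check arity and field types, returning the payload as a name -> value dict."""
    schema = PAYLOAD_SCHEMAS[request_type]
    if len(payload) != len(schema):
        raise TypeError(
            f"{request_type.name}: expected {len(schema)} fields, got {len(payload)}"
        )
    for (name, expected), value in zip(schema, payload):
        if not isinstance(value, expected):
            raise TypeError(f"{name}: expected {expected.__name__}, got {type(value).__name__}")
    return {name: value for (name, _), value in zip(schema, payload)}
```

A payload in the older 10-parameter shape would fail the arity check, which is one cheap way to catch a worker and server running mismatched protocol versions.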

6. Update tests

Update tests/v1/multiprocess/test_cpu_context.py and tests/v1/test_vllm_mp_adapter.py to match any API changes.
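A test for the new contract could look roughly like the following. register_cpu_context is a hypothetical free-function stand-in for CPUTransferContext.register(), the request type is a plain string placeholder, and a dict again stands in for MemoryLayoutDesc. The actual tests in this repository are not shown here and may be organised differently.

```python
import pickle
from unittest.mock import Mock

# Placeholder for the real RequestType enum member.
REGISTER_KV_CACHE_CPU_CONTEXT = "REGISTER_KV_CACHE_CPU_CONTEXT"


def register_cpu_context(send, instance_id, model_name, world_size,
                         layout_desc, block_size, use_mla):
    """Hypothetical stand-in for CPUTransferContext.register()."""
    layout_desc_bytes = pickle.dumps(layout_desc)  # serialized exactly once
    return send(
        REGISTER_KV_CACHE_CPU_CONTEXT,
        [instance_id, model_name, world_size, layout_desc_bytes, block_size, use_mla],
    )


def test_register_sends_serialized_layout():
    send = Mock(return_value="ok")
    layout = {"shapes": [(2, 16, 128)], "dtypes": ["bfloat16"]}

    assert register_cpu_context(send, 0, "demo-model", 1, layout, 16, False) == "ok"

    send.assert_called_once()
    request_type, payload = send.call_args.args
    assert request_type == REGISTER_KV_CACHE_CPU_CONTEXT
    # Six fields only: no engine_type or layout_hints in the payload.
    assert len(payload) == 6
    assert isinstance(payload[3], bytes)
    assert pickle.loads(payload[3]) == layout
```

Asserting on the deserialized layout rather than on the raw bytes keeps the test independent of the pickle protocol version in use.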

Files to modify

  • lmcache/v1/multiprocess/server.py — main cleanup target
  • lmcache/v1/multiprocess/transfer_context.py — update registration call
  • lmcache/v1/multiprocess/protocols/engine.py — simplify protocol definition
  • tests/v1/multiprocess/test_cpu_context.py — update tests
  • tests/v1/test_vllm_mp_adapter.py — update tests if needed

Goal

Minimize the diff in server.py so the PR is clean and easy to review, while keeping the CPU context functionality fully working.

The following is the prior conversation context from the user's chat exploration (may be truncated):

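User: @hlin99/LMCache/pull/252 Take a look at this PR. Are the changes to the adapter still too extensive? Are they reasonable?

A: Sure, let me take a careful look at this PR's changes to the adapter.

PR overview

Metric            Value
Files changed     11 (4 added, 7 modified)
Lines added       +2,164
Lines deleted     -115
Commits           7

The core purpose of this PR is to add a KV cache transfer path for non-CUDA devices (CPU/XPU/HPU), using pickle serialization to store/retrieve KV data on the CPU side.


Adapter changes...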

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Refactor server-side CPU KV transfer path for better reuse" to "Simplify CPU KV context protocol and deduplicate server key-resolution path" May 13, 2026
Copilot AI requested a review from hlin99 May 13, 2026 05:56
Copilot finished work on behalf of hlin99 May 13, 2026 05:56
@hlin99 hlin99 closed this May 28, 2026
@hlin99 hlin99 deleted the copilot/cleanup-cpu-kv-transfer-path branch May 28, 2026 12:22