Simplify CPU KV context protocol and deduplicate server key-resolution path #253
Closed
Copilot wants to merge 3 commits into
…n reuse Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/ff2f6066-9a28-465c-b57b-513765cf04fa Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Refactor server-side CPU KV transfer path for better reuse" to "Simplify CPU KV context protocol and deduplicate server key-resolution path" on May 13, 2026
PR #252 introduced a CPU KV transfer path but expanded server-side surface area with a verbose registration payload and duplicated store/retrieve flow logic. This change keeps the CPU path intact while removing unnecessary protocol complexity and reusing existing server infrastructure.
Protocol: compact CPU registration payload
- `REGISTER_KV_CACHE_CPU_CONTEXT` now sends serialized layout metadata instead of decomposed scalar fields.
  - `instance_id, model_name, world_size, layout_desc_bytes, block_size, use_mla`
- `EngineType`/`LayoutHints` and reconstructed layout scalars are removed from the CPU registration protocol.

Transfer context: send full `MemoryLayoutDesc` directly
- `CPUTransferContext.register()` now serializes the computed `MemoryLayoutDesc` once and sends it to the server as bytes.

Server: remove repeated session/hash/object-key derivation
- `_resolve_obj_keys(self, key: IPCCacheEngineKey) -> list[ObjectKey]`
  - `store`
  - `retrieve`
  - `store_cpu_chunks`
  - `retrieve_cpu_chunks`

Tests updated for API/payload changes
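The two main changes summarized above can be illustrated with a small sketch. It is not the LMCache implementation: `ToyServer`, `build_cpu_registration_payload`, the dict used in place of `MemoryLayoutDesc`, and the key derivation inside `_resolve_obj_keys` are all hypothetical stand-ins. Only the six-field payload shape and the idea of one shared key-resolution helper come from the description.

```python
import pickle

# Hypothetical stand-in for a MemoryLayoutDesc: a plain dict, so the example
# needs only the standard library. The real class has its own fields.
layout_desc = {"num_layers": 32, "hidden_dim_size": 4096, "dtype_str": "bfloat16"}


def build_cpu_registration_payload(instance_id, model_name, world_size,
                                   layout_desc, block_size, use_mla):
    """Worker side: serialize the layout once and ship it as bytes."""
    layout_desc_bytes = pickle.dumps(layout_desc)
    return (instance_id, model_name, world_size,
            layout_desc_bytes, block_size, use_mla)


class ToyServer:
    """Toy stand-in for the server; not the real MPCacheEngine."""

    def __init__(self):
        self.cpu_contexts = {}
        self.objects = {}

    def register_kv_cache_cpu_context(self, instance_id, model_name, world_size,
                                      layout_desc_bytes, block_size, use_mla):
        # pickle.loads can run arbitrary code, so only accept trusted peers.
        desc = pickle.loads(layout_desc_bytes)
        self.cpu_contexts[instance_id] = (model_name, world_size, desc,
                                          block_size, use_mla)

    def _resolve_obj_keys(self, key):
        # Placeholder derivation (assumption): one object key per chunk hash.
        model_name, chunk_hashes = key
        return [f"{model_name}/{h}" for h in chunk_hashes]

    def store_cpu_chunks(self, key, pickled_chunks):
        # Only the copy mechanism is path-specific; key resolution is shared.
        for obj_key, blob in zip(self._resolve_obj_keys(key), pickled_chunks):
            self.objects[obj_key] = pickle.loads(blob)

    def retrieve_cpu_chunks(self, key):
        # Same helper on the read path, then re-serialize each chunk.
        return [pickle.dumps(self.objects[k]) for k in self._resolve_obj_keys(key)]


server = ToyServer()
payload = build_cpu_registration_payload("inst-0", "toy-model", 1,
                                         layout_desc, 16, False)
server.register_kv_cache_cpu_context(*payload)
```

Because both CPU methods go through `_resolve_obj_keys`, a change to how keys are derived would need to be made in one place only, which is the point of the deduplication.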
Original prompt
Context
Branch `ww20_PR_cpu_context_pickle` (PR #252) adds a CPU-based KV transfer path for non-CUDA devices. However, the server-side changes are overly invasive and introduce large amounts of code duplication. This cleanup PR should remove the unnecessary server-side additions and make the CPU path reuse existing infrastructure as much as possible.
Problem
The current implementation adds 3 new `RequestType` enums (`REGISTER_KV_CACHE_CPU_CONTEXT`, `STORE_CPU_CHUNKS`, `RETRIEVE_CPU_CHUNKS`), 3 new server methods (`register_kv_cache_cpu_context`, `store_cpu_chunks`, `retrieve_cpu_chunks`), 2 new server dicts (`cpu_contexts`, `cpu_context_meta`), and ~29 lines of protocol definitions, all of which largely duplicate the existing CUDA path logic in `server.py`.
Specifically:
- `register_kv_cache_cpu_context` receives 10 params, immediately deletes 2 (`engine_type`, `layout_hints`), and reconstructs a `MemoryLayoutDesc` from scalars, even though the worker side already has the full `MemoryLayoutDesc`.
- `store_cpu_chunks` and `retrieve_cpu_chunks` duplicate the entire session/hash/`ipc_key_to_object_keys`/`reserve_write`/`finish_write` flow from the existing `store`/`retrieve` methods. The only real difference is the data copy mechanism (pickle vs CUDA IPC memcpy).
Required Changes
1. Simplify server registration for CPU context
Instead of a separate `register_kv_cache_cpu_context` with 10 parameters, create a minimal registration path. Options:
- Send the serialized `MemoryLayoutDesc` (pickle) plus `(model_name, world_size, block_size, use_mla)` via a single simplified request.
- Have the `REGISTER_KV_CACHE_CPU_CONTEXT` request type take just `(instance_id, model_name, world_size, layout_desc_bytes, block_size, use_mla)` where `layout_desc_bytes = pickle.dumps(layout_desc)`. This eliminates the need to pass `num_layers`, `hidden_dim_size`, `dtype_str` as separate params and reconstruct them server-side.
- Remove `engine_type` and `layout_hints` from the CPU registration protocol definition entirely.
- Update `protocols/engine.py` accordingly to match the simplified payload.
2. Reduce duplication in store/retrieve
Extract the common session/hash/key computation into a shared helper method on `MPCacheEngine`.
Then have `store`, `retrieve`, `store_cpu_chunks`, and `retrieve_cpu_chunks` all call this helper instead of duplicating those ~6 lines each.
3. Simplify `store_cpu_chunks` and `retrieve_cpu_chunks`
After extracting the helper, these methods should be much shorter: just the pickle deserialization + `tensor.copy_` (store) or `tensor.cpu().clone()` + pickle serialization (retrieve), plus `reserve_write`/`finish_write` calls.
4. Clean up `transfer_context.py`: `CPUTransferContext.register()`
Update the registration call in `CPUTransferContext.register()` to match the simplified server protocol. Specifically:
- Serialize `layout_desc` with pickle and send as bytes
- Remove `engine_type` and `layout_hints` from the registration payload
- Update the `protocols/engine.py` `REGISTER_KV_CACHE_CPU_CONTEXT` definition accordingly
5. Update protocol definitions
In `protocols/engine.py`, update the `REGISTER_KV_CACHE_CPU_CONTEXT` protocol definition to match the simplified parameter list. Remove `EngineType` and `LayoutHints` from the payload classes.
6. Update tests
Update `tests/v1/multiprocess/test_cpu_context.py` and `tests/v1/test_vllm_mp_adapter.py` to match any API changes.
Files to modify
- `lmcache/v1/multiprocess/server.py`: main cleanup target
- `lmcache/v1/multiprocess/transfer_context.py`: update registration call
- `lmcache/v1/multiprocess/protocols/engine.py`: simplify protocol definition
- `tests/v1/multiprocess/test_cpu_context.py`: update tests
- `tests/v1/test_vllm_mp_adapter.py`: update tests if needed
Goal
Minimize the diff in `server.py` so the PR is clean and easy to review, while keeping the CPU context functionality fully working.
The following is the prior conversation context from the user's chat exploration (may be truncated):
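User: @hlin99/LMCache/pull/252 Take a look at this PR. Are the changes to the adapter still too extensive? Is that reasonable?
Assistant: Sure, let me go through this PR's changes to the adapter in detail.
PR overview
The core purpose of this PR is to add a KV cache transfer path for non-CUDA devices (CPU/XPU/HPU), implementing store/retrieve of CPU-side KV data via pickle serialization.
Adapter changes...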
This pull request was created from Copilot chat.