Minimize CPU-context integration diff by restoring adapter-owned futures and reverting CPU registration to scalar payloads#254
Conversation
Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/b277cbef-7a95-47da-b1b5-93a69834a331 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/b277cbef-7a95-47da-b1b5-93a69834a331 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
hlin99
left a comment
There was a problem hiding this comment.
Change Request: Revert pickle-based CPU registration to scalar parameters
The pickle.loads approach introduces security risk (arbitrary code execution) and unnecessary coupling. Revert to PR #252's original scalar parameter design.
-
lmcache/v1/multiprocess/protocols/engine.py — Restore REGISTER_KV_CACHE_CPU_CONTEXT payload to [int, str, int, int, int, int, str, bool], i.e. [instance_id, model_name, world_size, block_size, num_layers, hidden_dim_size, dtype_str, use_mla].
-
lmcache/v1/multiprocess/server.py — Replace register_kv_cache_cpu_context: remove layout_desc_bytes: bytes param and pickle.loads, restore scalar params (block_size, num_layers, hidden_dim_size, dtype_str, use_mla). Reconstruct layout_desc from scalars using torch.Size and getattr(torch, dtype_str), same as PR #252. Remove import pickle if no longer used.
-
lmcache/v1/multiprocess/transfer_context.py — In CPUTransferContext.register(), replace pickle.dumps(layout_desc) with scalar params [instance_id, model_name, world_size, block_size, num_layers, hidden_dim_size, dtype_str, use_mla_flag]. Remove import pickle.
-
Tests — Update test_cpu_context.py and test_vllm_mp_adapter.py to pass scalar params instead of pickle.dumps(layout_desc). Change assertion len(args[2]) == 6 to len(args[2]) == 8.
Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/8efcb792-420d-460c-98c8-97775dcee47e Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
Implemented as requested in commit |
hlin99
left a comment
There was a problem hiding this comment.
update docs/design/v1/multiprocess/cpu_context_design.md according to implementation
Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/11d24763-3f7b-4dda-946b-e0c7b35c73d2 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
Updated in commit Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
hlin99
left a comment
There was a problem hiding this comment.
NO need to show too much details in doc, especially code, otherwise it's hard to maintain on each code change. please streamline the doc. high level design is sufficient
Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/c045a593-284a-42f6-b68f-83ebe9bfad8b Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
Updated in commit Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
This PR aligns the CPU-context path with the existing adapter control flow by making
TransferContextreturn adapter-compatible futures instead of internally managing completion state. Based on review feedback, CPU registration was reverted from pickled layout bytes to scalar parameters to match the original PR #252 design.TransferContext contract: return futures, no internal polling
TransferContext.submit_store()/submit_retrieve()now returnMessagingFuture.poll_finished()anddrain_all()from the interface and implementations.CudaTransferContextreturnssend_request(...).to_cuda_future().CPUTransferContextperforms sync gather/scatter + MQ and returns already-resolved futures (query() -> True,result() -> bool).Adapter flow restored to pre-refactor semantics
wrap_kv_cachesrestored to original signature/behavior (nouse_cpu_contextparameter).store_futures/retrieve_futuresdicts and tracks returned futures directly._pending_store_request_ids.get_finished()and_process_finished_stores()were reverted to adapter-owned future polling flow (no transfer-context draining path).transfer_ctxand delegates register/store/retrieve through it.CPU registration payload reverted to scalar parameters
REGISTER_KV_CACHE_CPU_CONTEXTpayload is now:[instance_id, model_name, world_size, block_size, num_layers, hidden_dim_size, dtype_str, use_mla]Server cleanup with shared key-resolution helper
register_kv_cache_cpu_contextreconstructslayout_descfrom scalar params (torch.Size+getattr(torch, dtype_str)).cpu_contexts,cpu_context_meta), plus lookup fallback._resolve_obj_keysand reused it acrossstore,retrieve,store_cpu_chunks,retrieve_cpu_chunksto remove duplicated session/hash/key logic.Tests updated for restored adapter contract and scalar CPU protocol
len(args[2]) == 8).Original prompt
Goal
Minimize adapter and server changes in the CPU context PR by having
TransferContextreturn the same future interface that the adapter already uses. This wayget_finished()and the futures tracking logic in the adapter stay completely untouched.Current state (branch
ww20_PR_cpu_context_pickle)The current PR introduces
TransferContextwithpoll_finished()/drain_all()methods, which forces the adapter to delete its existingstore_futures/retrieve_futuresdicts and completely rewriteget_finished(). This creates a huge diff (~+83/-100 lines in adapter).Required Changes
1.
TransferContextshould return futures, not manage themChange
TransferContext.submit_store()andsubmit_retrieve()to return futures instead of storing results internally. Removepoll_finished()anddrain_all()from the interface.For
CudaTransferContext: basically the same as current — callsend_request(...).to_cuda_future()and return it.For
CPUTransferContext: do the gather+pickle+send synchronously, then return an already-resolved future (or a thin wrapper that behaves like one). The key is thatfuture.query()returnsTrueimmediately andfuture.result()returns the store/retrieve result.2. Adapter changes should be minimal (~+15/-10 lines)
The adapter should keep its existing
store_futuresandretrieve_futuresdicts.get_finished()should NOT be modified at all.Changes needed in
lmcache/integration/vllm/vllm_multi_process_adapter.py:_send_register_kv_caches_request: create transfer_ctx and calltransfer_ctx.register()instead of directly building the MQ request. Keepself.kv_caches = kv_caches.submit_store_request: callself.transfer_ctx.submit_store(...)to get the future, then store it inself.store_futures[request_id] = future(same as before).submit_retrieve_request: callself.transfer_ctx.submit_retrieve(...)to get the future, then store it inself.retrieve_futures[request_id] = (future, list(op.block_ids))(same as before).shutdown: addself.transfer_ctx.close().get_finished()must NOT be changed at all._process_finished_stores()must NOT be changed at all._pending_store_request_idsset that was added — not needed.3.
wrap_kv_cachesmust be restored to originalRemove the
use_cpu_contextparameter. The function should be exactly as it was before PR #252:4. Server cleanup — remove unnecessary methods and code
In
lmcache/v1/multiprocess/server.py:register_kv_cache_cpu_context(simplified with pickle.loads for layout_desc — as done in PR Simplify CPU KV context protocol and deduplicate server key-resolution path #253)store_cpu_chunksandretrieve_cpu_chunks(using_resolve_obj_keyshelper)_resolve_obj_keyshelper (extracts common session/hash/key logic from store/retrieve/store_cpu_chunks/retrieve_cpu_chunks)cpu_contextsandcpu_context_metadicts_find_layout_descCPU context fallbackThe
register_kv_cache_cpu_contextshould use the simplified protocol from PR #253:(instance_id, model_name, world_size, layout_desc_bytes, block_size, use_mla)engine_typeorlayout_hintsparams (those were deleted on arrival anyway)5. Protocol definitions
In
lmcache/v1/multiprocess/protocols/engine.py,REGISTER_KV_CACHE_CPU_CONTEXTpayload should match the simplified params:6.
transfer_context.pyrewriteRemove
poll_finished(),drain_all()from theTransferContextABC.CudaTransferContext:register(): same as currentsubmit_store(): callsend_request().to_cuda_future(), return the futuresubmit_retrieve(): callsend_request().to_cuda_future(), return the futureCPUTransferContext:register(): same as cur...This pull request was created from Copilot chat.