Minimize CPU-context integration diff by restoring adapter-owned futures and reverting CPU registration to scalar payloads by Copilot · Pull Request #254 · hlin99/LMCache

Copilot · 2026-05-13T06:16:24Z

This PR aligns the CPU-context path with the existing adapter control flow by making TransferContext return adapter-compatible futures instead of internally managing completion state. Based on review feedback, CPU registration was reverted from pickled layout bytes to scalar parameters to match the original PR #252 design.

TransferContext contract: return futures, no internal polling
- TransferContext.submit_store() / submit_retrieve() now return MessagingFuture.
- Removed poll_finished() and drain_all() from the interface and implementations.
- CudaTransferContext returns send_request(...).to_cuda_future().
- CPUTransferContext performs sync gather/scatter + MQ and returns already-resolved futures (query() -> True, result() -> bool).

Adapter flow restored to pre-refactor semantics
- wrap_kv_caches restored to original signature/behavior (no use_cpu_context parameter).
- Adapter now keeps store_futures / retrieve_futures dicts and tracks returned futures directly.
- Removed _pending_store_request_ids.
- get_finished() and _process_finished_stores() were reverted to adapter-owned future polling flow (no transfer-context draining path).
- Registration path now creates transfer_ctx and delegates register/store/retrieve through it.

CPU registration payload reverted to scalar parameters
- REGISTER_KV_CACHE_CPU_CONTEXT payload is now:
  - [instance_id, model_name, world_size, block_size, num_layers, hidden_dim_size, dtype_str, use_mla]
- This replaces the intermediate pickled-layout payload variant.

Server cleanup with shared key-resolution helper
- register_kv_cache_cpu_context reconstructs layout_desc from scalar params (torch.Size + getattr(torch, dtype_str)).
- Kept CPU-context handlers and metadata maps (cpu_contexts, cpu_context_meta), plus lookup fallback.
- Added _resolve_obj_keys and reused it across store, retrieve, store_cpu_chunks, retrieve_cpu_chunks to remove duplicated session/hash/key logic.

Tests updated for restored adapter contract and scalar CPU protocol
- Updated adapter tests to assert returned futures are stored in adapter tracking dicts.
- Updated CPU registration assertions and server registration tests to use scalar payload parameters (len(args[2]) == 8).

class TransferContext(ABC):
    @abstractmethod
    def submit_store(...) -> MessagingFuture: ...
    @abstractmethod
    def submit_retrieve(...) -> MessagingFuture: ...

# Adapter keeps ownership of future tracking.
future = self.transfer_ctx.submit_store(...)
self.store_futures[request_id] = future
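
To make the "already-resolved future" idea concrete, here is a minimal sketch. It assumes the adapter only ever calls query() and result() on the object returned by submit_store(). ResolvedFuture and submit_store_sync are hypothetical names used for illustration; they are not taken from LMCache, and the real MessagingFuture may expose more methods than shown here.

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class ResolvedFuture(Generic[T]):
    """Hypothetical future whose work finished before it was returned."""

    def __init__(self, value: T) -> None:
        self._value = value

    def query(self) -> bool:
        # The synchronous work is already done, so the future is always ready.
        return True

    def result(self) -> T:
        # Nothing to wait for; hand back the stored outcome.
        return self._value


def submit_store_sync(payload: bytes) -> ResolvedFuture[bool]:
    # Placeholder for the synchronous gather + MQ send (an assumption):
    # here "success" simply means the payload was non-empty.
    ok = len(payload) > 0
    return ResolvedFuture(ok)


future = submit_store_sync(b"kv-bytes")
assert future.query() is True
assert future.result() is True
```

Because the wrapper answers query() with True straight away, polling code written for asynchronous futures can treat it like any other future without a CPU-specific branch.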

Original prompt

Goal

Minimize adapter and server changes in the CPU context PR by having TransferContext return the same future interface that the adapter already uses. This way get_finished() and the futures tracking logic in the adapter stay completely untouched.

Current state (branch `ww20_PR_cpu_context_pickle`)

The current PR introduces TransferContext with poll_finished() / drain_all() methods, which forces the adapter to delete its existing store_futures / retrieve_futures dicts and completely rewrite get_finished(). This creates a huge diff (~+83/-100 lines in adapter).

Required Changes

1. `TransferContext` should return futures, not manage them

Change TransferContext.submit_store() and submit_retrieve() to return futures instead of storing results internally. Remove poll_finished() and drain_all() from the interface.

class TransferContext(ABC):
    @abstractmethod
    def register(...) -> None

    @abstractmethod
    def submit_store(request_id, key, instance_id, kv_caches, block_ids, event, blocks_in_chunk) -> MessagingFuture
        """Return a future compatible with the existing adapter futures tracking."""

    @abstractmethod
    def submit_retrieve(request_id, key, instance_id, kv_caches, block_ids, event, blocks_in_chunk, skip_first_n_tokens=0) -> MessagingFuture
        """Return a future compatible with the existing adapter futures tracking."""

    @abstractmethod
    def close() -> None

For CudaTransferContext: basically the same as current — call send_request(...).to_cuda_future() and return it.

For CPUTransferContext: do the gather+pickle+send synchronously, then return an already-resolved future (or a thin wrapper that behaves like one). The key is that future.query() returns True immediately and future.result() returns the store/retrieve result.

2. Adapter changes should be minimal (~+15/-10 lines)

The adapter should keep its existing store_futures and retrieve_futures dicts. get_finished() should NOT be modified at all.

Changes needed in lmcache/integration/vllm/vllm_multi_process_adapter.py:

_send_register_kv_caches_request: create transfer_ctx and call transfer_ctx.register() instead of directly building the MQ request. Keep self.kv_caches = kv_caches.
submit_store_request: call self.transfer_ctx.submit_store(...) to get the future, then store it in self.store_futures[request_id] = future (same as before).
submit_retrieve_request: call self.transfer_ctx.submit_retrieve(...) to get the future, then store it in self.retrieve_futures[request_id] = (future, list(op.block_ids)) (same as before).
shutdown: add self.transfer_ctx.close().
get_finished() must NOT be changed at all.
_process_finished_stores() must NOT be changed at all.
Remove the _pending_store_request_ids set that was added — not needed.
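
The adapter-owned tracking described above can be sketched roughly as follows. This is an illustration only, not the adapter's actual get_finished(): DoneFuture, PendingFuture, and pop_finished are hypothetical names, and the only assumption about the future interface is that it offers query() and result().

```python
from typing import Dict, Protocol


class FutureLike(Protocol):
    def query(self) -> bool: ...
    def result(self) -> bool: ...


class DoneFuture:
    """Hypothetical future that is complete on creation (CPU-style)."""

    def __init__(self, ok: bool) -> None:
        self._ok = ok

    def query(self) -> bool:
        return True

    def result(self) -> bool:
        return self._ok


class PendingFuture:
    """Hypothetical future that becomes ready after a number of polls."""

    def __init__(self, polls_until_done: int, ok: bool = True) -> None:
        self._remaining = polls_until_done
        self._ok = ok

    def query(self) -> bool:
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True

    def result(self) -> bool:
        return self._ok


def pop_finished(futures: Dict[str, FutureLike]) -> Dict[str, bool]:
    """Remove every ready future from the dict and return its result."""
    ready = [rid for rid, fut in futures.items() if fut.query()]
    return {rid: futures.pop(rid).result() for rid in ready}


store_futures: Dict[str, FutureLike] = {
    "req-cpu": DoneFuture(True),
    "req-cuda": PendingFuture(polls_until_done=1),
}
first = pop_finished(store_futures)   # only the resolved future is ready
second = pop_finished(store_futures)  # the pending one completes on the next poll
```

The polling loop is identical for both kinds of future, which is why returning futures from the transfer context lets the existing tracking code stay as it is.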

3. `wrap_kv_caches` must be restored to original

Remove the use_cpu_context parameter. The function should be exactly as it was before PR #252:

def wrap_kv_caches(kv_caches: dict[str, torch.Tensor]) -> KVCache:
    logger.info("KV caches keys are %s", list(kv_caches.keys()))
    return [CudaIPCWrapper(tensor) for tensor in kv_caches.values()]

4. Server cleanup — remove unnecessary methods and code

In lmcache/v1/multiprocess/server.py:

Keep register_kv_cache_cpu_context (simplified with pickle.loads for layout_desc, as done in PR #253, "Simplify CPU KV context protocol and deduplicate server key-resolution path")
Keep store_cpu_chunks and retrieve_cpu_chunks (using _resolve_obj_keys helper)
Keep _resolve_obj_keys helper (extracts common session/hash/key logic from store/retrieve/store_cpu_chunks/retrieve_cpu_chunks)
Keep cpu_contexts and cpu_context_meta dicts
Keep _find_layout_desc CPU context fallback
Keep handler registrations and thread pool config for the 3 new request types
Remove any dead code or unused imports that were added

The register_kv_cache_cpu_context should use the simplified protocol from PR #253:

Parameters: (instance_id, model_name, world_size, layout_desc_bytes, block_size, use_mla)
No engine_type or layout_hints params (those were deleted on arrival anyway)

5. Protocol definitions

In lmcache/v1/multiprocess/protocols/engine.py, REGISTER_KV_CACHE_CPU_CONTEXT payload should match the simplified params:

"REGISTER_KV_CACHE_CPU_CONTEXT": ProtocolDefinition(
    payload_classes=[int, str, int, bytes, int, bool],
    response_class=None,
    handler_type=HandlerType.SYNC,
),

6. `transfer_context.py` rewrite

Remove poll_finished(), drain_all() from the TransferContext ABC.

CudaTransferContext:

register(): same as current
submit_store(): call send_request().to_cuda_future(), return the future
submit_retrieve(): call send_request().to_cuda_future(), return the future
No internal futures tracking dicts

CPUTransferContext:

register(): same as cur...

This pull request was created from Copilot chat.

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/b277cbef-7a95-47da-b1b5-93a69834a331 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

hlin99

Change Request: Revert pickle-based CPU registration to scalar parameters
The pickle.loads approach introduces security risk (arbitrary code execution) and unnecessary coupling. Revert to PR #252's original scalar parameter design.
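
For readers unfamiliar with the risk, the following harmless sketch shows why calling pickle.loads on bytes from another process is dangerous. The Malicious class is hypothetical and unrelated to LMCache; the expression passed to eval is deliberately benign, but a sender could substitute any callable and arguments.

```python
import pickle


class Malicious:
    """Hypothetical sender-controlled object used only for this demonstration."""

    def __reduce__(self):
        # Instead of object state, pickle records "call eval with this string".
        return (eval, ("6 * 7",))


untrusted_bytes = pickle.dumps(Malicious())

# The receiver expected a layout description, but unpickling runs the
# sender's expression and returns whatever it evaluates to.
loaded = pickle.loads(untrusted_bytes)
assert loaded == 42
```

Scalar fields such as int, str, and bool carry no callable to invoke on the receiving side, which is the motivation for the revert requested here.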

lmcache/v1/multiprocess/protocols/engine.py — Restore REGISTER_KV_CACHE_CPU_CONTEXT payload to [int, str, int, int, int, int, str, bool], i.e. [instance_id, model_name, world_size, block_size, num_layers, hidden_dim_size, dtype_str, use_mla].
lmcache/v1/multiprocess/server.py — Replace register_kv_cache_cpu_context: remove layout_desc_bytes: bytes param and pickle.loads, restore scalar params (block_size, num_layers, hidden_dim_size, dtype_str, use_mla). Reconstruct layout_desc from scalars using torch.Size and getattr(torch, dtype_str), same as PR #252. Remove import pickle if no longer used.
lmcache/v1/multiprocess/transfer_context.py — In CPUTransferContext.register(), replace pickle.dumps(layout_desc) with scalar params [instance_id, model_name, world_size, block_size, num_layers, hidden_dim_size, dtype_str, use_mla_flag]. Remove import pickle.
Tests — Update test_cpu_context.py and test_vllm_mp_adapter.py to pass scalar params instead of pickle.dumps(layout_desc). Change assertion len(args[2]) == 6 to len(args[2]) == 8.
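
As an illustration of the eight-field scalar payload, the sketch below checks length and element types before use. The field order and types come from the change request above; validate_cpu_context_payload and the sample values are hypothetical and are not part of the LMCache code base.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Field order and types as listed in the change request.
CPU_CONTEXT_FIELDS: List[Tuple[str, type]] = [
    ("instance_id", int),
    ("model_name", str),
    ("world_size", int),
    ("block_size", int),
    ("num_layers", int),
    ("hidden_dim_size", int),
    ("dtype_str", str),
    ("use_mla", bool),
]


def validate_cpu_context_payload(payload: Sequence[Any]) -> Dict[str, Any]:
    """Hypothetical helper: verify the payload shape and return named fields."""
    if len(payload) != len(CPU_CONTEXT_FIELDS):
        raise ValueError(
            f"expected {len(CPU_CONTEXT_FIELDS)} fields, got {len(payload)}"
        )
    named: Dict[str, Any] = {}
    for (name, expected), value in zip(CPU_CONTEXT_FIELDS, payload):
        # Compare exact types: bool is a subclass of int, so isinstance()
        # would accept True where an int field is expected.
        if type(value) is not expected:
            raise TypeError(f"{name}: expected {expected.__name__}")
        named[name] = value
    return named


# Illustrative values only; they do not describe a real model.
fields = validate_cpu_context_payload(
    [0, "example-model", 1, 16, 32, 4096, "bfloat16", False]
)
assert fields["dtype_str"] == "bfloat16"
```

A check like this fails fast if a caller still sends the six-element pickled-layout variant.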

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/8efcb792-420d-460c-98c8-97775dcee47e Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot · 2026-05-13T07:01:02Z

Change Request: Revert pickle-based CPU registration to scalar parameters
The pickle.loads approach introduces security risk (arbitrary code execution) and unnecessary coupling. Revert to PR #252's or...

Implemented as requested in commit 8dd3558. CPU registration is reverted to scalar params across protocol/server/transfer_context, and tests were updated accordingly (len(args[2]) == 8).

hlin99

update docs/design/v1/multiprocess/cpu_context_design.md according to implementation

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/11d24763-3f7b-4dda-946b-e0c7b35c73d2 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot · 2026-05-13T07:29:40Z

update docs/design/v1/multiprocess/cpu_context_design.md according to implementation

Updated in commit fbba72e. docs/design/v1/multiprocess/cpu_context_design.md now matches implementation (scalar REGISTER_KV_CACHE_CPU_CONTEXT payload, adapter-owned future tracking, and resolved futures for CPU submit paths). No UI changes in this PR, so no screenshot applies.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

docs.python.org
- Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
numpy.org
- Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
psutil.readthedocs.io
- Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
pytorch.org
- Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
typing-extensions.readthedocs.io
- Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

hlin99

No need to show too much detail in the doc, especially code; otherwise it's hard to maintain on each code change. Please streamline the doc. A high-level design is sufficient.

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/c045a593-284a-42f6-b68f-83ebe9bfad8b Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot · 2026-05-13T07:47:32Z

No need to show too much detail in the doc, especially code; otherwise it's hard to maintain on each code change. Please streamline the doc. A high-level design is sufficient.

Updated in commit 95a009c. The doc is now streamlined to high-level design, with most code-level details removed and only concise protocol/behavior summaries retained. No UI changes were made in this update.

Initial plan

7bb4157

Copilot AI assigned Copilot and hlin99 May 13, 2026

Copilot started work on behalf of hlin99 May 13, 2026 06:16

Copilot AI and others added 2 commits May 13, 2026 06:32

Refactor transfer context to return adapter-compatible futures

ace643d

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/b277cbef-7a95-47da-b1b5-93a69834a331 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Improve docstrings for transfer context and key resolver

22f3dbc

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/b277cbef-7a95-47da-b1b5-93a69834a331 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Refactor TransferContext to return futures directly~~ Minimize CPU-context integration diff by restoring adapter-owned futures and simplifying CPU registration payload May 13, 2026

Copilot AI requested a review from hlin99 May 13, 2026 06:36

Copilot finished work on behalf of hlin99 May 13, 2026 06:36

hlin99 requested changes May 13, 2026

Copilot started work on behalf of hlin99 May 13, 2026 06:54

Revert CPU registration payload from pickle bytes to scalar fields

8dd3558

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/8efcb792-420d-460c-98c8-97775dcee47e Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot AI changed the title ~~Minimize CPU-context integration diff by restoring adapter-owned futures and simplifying CPU registration payload~~ Minimize CPU-context integration diff by restoring adapter-owned futures and reverting CPU registration to scalar payloads May 13, 2026

Copilot AI requested a review from hlin99 May 13, 2026 07:02

Copilot finished work on behalf of hlin99 May 13, 2026 07:02

hlin99 requested changes May 13, 2026

Copilot started work on behalf of hlin99 May 13, 2026 07:26

Sync CPU context design doc with scalar registration and future flow

fbba72e

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/11d24763-3f7b-4dda-946b-e0c7b35c73d2 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot finished work on behalf of hlin99 May 13, 2026 07:31

Copilot AI requested a review from hlin99 May 13, 2026 07:31

hlin99 requested changes May 13, 2026

Copilot started work on behalf of hlin99 May 13, 2026 07:44

Streamline CPU context design doc to high-level guidance

95a009c

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/c045a593-284a-42f6-b68f-83ebe9bfad8b Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot finished work on behalf of hlin99 May 13, 2026 07:48

Copilot AI requested a review from hlin99 May 13, 2026 07:48

hlin99 closed this May 28, 2026

hlin99 deleted the copilot/minimize-adapter-server-changes branch May 28, 2026 12:22

Copilot wants to merge 6 commits into ww20_PR_cpu_context_pickle from copilot/minimize-adapter-server-changes