
Minimize CPU-context integration diff by restoring adapter-owned futures and reverting CPU registration to scalar payloads #254

Closed
Copilot wants to merge 6 commits into ww20_PR_cpu_context_pickle from copilot/minimize-adapter-server-changes

Conversation


Copilot AI commented May 13, 2026

This PR aligns the CPU-context path with the existing adapter control flow by making TransferContext return adapter-compatible futures instead of internally managing completion state. Based on review feedback, CPU registration was reverted from pickled layout bytes to scalar parameters to match the original PR #252 design.

  • TransferContext contract: return futures, no internal polling

    • TransferContext.submit_store() / submit_retrieve() now return MessagingFuture.
    • Removed poll_finished() and drain_all() from the interface and implementations.
    • CudaTransferContext returns send_request(...).to_cuda_future().
    • CPUTransferContext performs sync gather/scatter + MQ and returns already-resolved futures (query() -> True, result() -> bool).
  • Adapter flow restored to pre-refactor semantics

    • wrap_kv_caches restored to original signature/behavior (no use_cpu_context parameter).
    • Adapter now keeps store_futures / retrieve_futures dicts and tracks returned futures directly.
    • Removed _pending_store_request_ids.
    • get_finished() and _process_finished_stores() were reverted to adapter-owned future polling flow (no transfer-context draining path).
    • Registration path now creates transfer_ctx and delegates register/store/retrieve through it.
  • CPU registration payload reverted to scalar parameters

    • REGISTER_KV_CACHE_CPU_CONTEXT payload is now:
      • [instance_id, model_name, world_size, block_size, num_layers, hidden_dim_size, dtype_str, use_mla]
    • This replaces the intermediate pickled-layout payload variant.
  • Server cleanup with shared key-resolution helper

    • register_kv_cache_cpu_context reconstructs layout_desc from scalar params (torch.Size + getattr(torch, dtype_str)).
    • Kept CPU-context handlers and metadata maps (cpu_contexts, cpu_context_meta), plus lookup fallback.
    • Added _resolve_obj_keys and reused it across store, retrieve, store_cpu_chunks, retrieve_cpu_chunks to remove duplicated session/hash/key logic.
  • Tests updated for restored adapter contract and scalar CPU protocol

    • Updated adapter tests to assert returned futures are stored in adapter tracking dicts.
    • Updated CPU registration assertions and server registration tests to use scalar payload parameters (len(args[2]) == 8).
class TransferContext(ABC):
    @abstractmethod
    def submit_store(...) -> MessagingFuture: ...
    @abstractmethod
    def submit_retrieve(...) -> MessagingFuture: ...

# Adapter keeps ownership of future tracking.
future = self.transfer_ctx.submit_store(...)
self.store_futures[request_id] = future
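
The eight-element scalar payload described above can be pictured as a typed record. The following is a minimal sketch, not the PR's code: `CpuContextRegistration`, `to_payload`, and `from_payload` are hypothetical names introduced for illustration, and only the field order and types are taken from the description.

```python
from typing import NamedTuple


class CpuContextRegistration(NamedTuple):
    """Hypothetical record mirroring the scalar payload field order."""

    instance_id: int
    model_name: str
    world_size: int
    block_size: int
    num_layers: int
    hidden_dim_size: int
    dtype_str: str
    use_mla: bool

    def to_payload(self) -> list:
        # Flatten to the positional list that would be sent as the payload.
        return list(self)

    @classmethod
    def from_payload(cls, payload: list) -> "CpuContextRegistration":
        # Reject payloads of the wrong length before unpacking.
        if len(payload) != len(cls._fields):
            raise ValueError(
                f"expected {len(cls._fields)} fields, got {len(payload)}"
            )
        return cls(*payload)


# Example values are made up for the sketch.
reg = CpuContextRegistration(0, "example-model", 1, 16, 32, 4096, "bfloat16", False)
payload = reg.to_payload()
restored = CpuContextRegistration.from_payload(payload)
```

Keeping the payload positional matches the `len(args[2]) == 8` assertion mentioned in the tests, while the named fields make the order harder to get wrong on the sending side.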
Original prompt

Goal

Minimize adapter and server changes in the CPU context PR by having TransferContext return the same future interface that the adapter already uses. This way get_finished() and the futures tracking logic in the adapter stay completely untouched.

Current state (branch ww20_PR_cpu_context_pickle)

The current PR introduces TransferContext with poll_finished() / drain_all() methods, which forces the adapter to delete its existing store_futures / retrieve_futures dicts and completely rewrite get_finished(). This creates a huge diff (~+83/-100 lines in adapter).

Required Changes

1. TransferContext should return futures, not manage them

Change TransferContext.submit_store() and submit_retrieve() to return futures instead of storing results internally. Remove poll_finished() and drain_all() from the interface.

class TransferContext(ABC):
    @abstractmethod
    def register(...) -> None: ...

    @abstractmethod
    def submit_store(self, request_id, key, instance_id, kv_caches, block_ids, event, blocks_in_chunk) -> MessagingFuture:
        """Return a future compatible with the existing adapter futures tracking."""

    @abstractmethod
    def submit_retrieve(self, request_id, key, instance_id, kv_caches, block_ids, event, blocks_in_chunk, skip_first_n_tokens=0) -> MessagingFuture:
        """Return a future compatible with the existing adapter futures tracking."""

    @abstractmethod
    def close(self) -> None: ...

For CudaTransferContext: basically the same as current — call send_request(...).to_cuda_future() and return it.

For CPUTransferContext: do the gather+pickle+send synchronously, then return an already-resolved future (or a thin wrapper that behaves like one). The key is that future.query() returns True immediately and future.result() returns the store/retrieve result.
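
An already-resolved future of the kind described here might look like the sketch below. It is an assumption-laden illustration: `ResolvedFuture` and `submit_store_sync` are hypothetical names, and only the `query()` / `result()` behaviour comes from the text; the real `MessagingFuture` interface may require more methods.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class ResolvedFuture(Generic[T]):
    """Hypothetical future whose value is known at construction time."""

    def __init__(self, value: T) -> None:
        self._value = value

    def query(self) -> bool:
        # The work already ran synchronously, so the future is always done.
        return True

    def result(self) -> T:
        # Nothing to wait for; hand back the stored outcome.
        return self._value


def submit_store_sync(do_store: Callable[[], object]) -> ResolvedFuture[bool]:
    """Run a synchronous store callable and wrap its boolean outcome."""
    ok = bool(do_store())
    return ResolvedFuture(ok)


# The lambda stands in for the synchronous gather + send step.
future = submit_store_sync(lambda: True)
```

Because the wrapper exposes the same two calls the adapter already polls, the CPU path would not need a separate completion mechanism.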

2. Adapter changes should be minimal (~+15/-10 lines)

The adapter should keep its existing store_futures and retrieve_futures dicts. get_finished() should NOT be modified at all.

Changes needed in lmcache/integration/vllm/vllm_multi_process_adapter.py:

  • _send_register_kv_caches_request: create transfer_ctx and call transfer_ctx.register() instead of directly building the MQ request. Keep self.kv_caches = kv_caches.
  • submit_store_request: call self.transfer_ctx.submit_store(...) to get the future, then store it in self.store_futures[request_id] = future (same as before).
  • submit_retrieve_request: call self.transfer_ctx.submit_retrieve(...) to get the future, then store it in self.retrieve_futures[request_id] = (future, list(op.block_ids)) (same as before).
  • shutdown: add self.transfer_ctx.close().
  • get_finished() must NOT be changed at all.
  • _process_finished_stores() must NOT be changed at all.
  • Remove the _pending_store_request_ids set that was added — not needed.
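
To illustrate why adapter-owned tracking lets resolved and asynchronous futures share one polling loop, here is a small self-contained sketch. `FakeFuture` and `AdapterSketch` are hypothetical stand-ins; the real `get_finished()` in the adapter does more than this and is intentionally left unchanged by the PR.

```python
class FakeFuture:
    """Stand-in future for the sketch; becomes ready after `delay` polls."""

    def __init__(self, value, delay: int = 0) -> None:
        self._value = value
        self._remaining = delay

    def query(self) -> bool:
        # Report "not ready" until the configured number of polls has passed.
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True

    def result(self):
        return self._value


class AdapterSketch:
    def __init__(self) -> None:
        # The adapter, not the transfer context, owns the tracking dict.
        self.store_futures: dict[str, FakeFuture] = {}

    def submit_store_request(self, request_id: str, future: FakeFuture) -> None:
        self.store_futures[request_id] = future

    def get_finished(self) -> set[str]:
        # Collect ready requests first, then drop them from the dict.
        done = {rid for rid, fut in self.store_futures.items() if fut.query()}
        for rid in done:
            self.store_futures.pop(rid)
        return done


adapter = AdapterSketch()
adapter.submit_store_request("cpu-req", FakeFuture(True))            # already resolved
adapter.submit_store_request("cuda-req", FakeFuture(True, delay=1))  # ready on 2nd poll
first = adapter.get_finished()
second = adapter.get_finished()
```

In this sketch the resolved request is reported on the first poll and the delayed one on the second, with no branch that distinguishes the two kinds of future.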

3. wrap_kv_caches must be restored to original

Remove the use_cpu_context parameter. The function should be exactly as it was before PR #252:

def wrap_kv_caches(kv_caches: dict[str, torch.Tensor]) -> KVCache:
    logger.info("KV caches keys are %s", list(kv_caches.keys()))
    return [CudaIPCWrapper(tensor) for tensor in kv_caches.values()]

4. Server cleanup — remove unnecessary methods and code

In lmcache/v1/multiprocess/server.py:

  • Keep register_kv_cache_cpu_context (simplified with pickle.loads for layout_desc, as done in PR #253, "Simplify CPU KV context protocol and deduplicate server key-resolution path")
  • Keep store_cpu_chunks and retrieve_cpu_chunks (using _resolve_obj_keys helper)
  • Keep _resolve_obj_keys helper (extracts common session/hash/key logic from store/retrieve/store_cpu_chunks/retrieve_cpu_chunks)
  • Keep cpu_contexts and cpu_context_meta dicts
  • Keep _find_layout_desc CPU context fallback
  • Keep handler registrations and thread pool config for the 3 new request types
  • Remove any dead code or unused imports that were added

The register_kv_cache_cpu_context should use the simplified protocol from PR #253:

  • Parameters: (instance_id, model_name, world_size, layout_desc_bytes, block_size, use_mla)
  • No engine_type or layout_hints params (those were deleted on arrival anyway)

5. Protocol definitions

In lmcache/v1/multiprocess/protocols/engine.py, REGISTER_KV_CACHE_CPU_CONTEXT payload should match the simplified params:

"REGISTER_KV_CACHE_CPU_CONTEXT": ProtocolDefinition(
    payload_classes=[int, str, int, bytes, int, bool],
    response_class=None,
    handler_type=HandlerType.SYNC,
),
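
A positional type list such as `payload_classes` lends itself to a simple check on the receiving side. The sketch below is hypothetical: `check_payload` is not part of the protocol code shown in this thread, and how `ProtocolDefinition` actually validates payloads is not stated here.

```python
def check_payload(payload: list, payload_classes: list[type]) -> None:
    """Raise if the payload does not match the expected types positionally."""
    if len(payload) != len(payload_classes):
        raise ValueError(
            f"expected {len(payload_classes)} items, got {len(payload)}"
        )
    for index, (value, expected) in enumerate(zip(payload, payload_classes)):
        # Exact type match: bool is a subclass of int, so isinstance()
        # would let True pass where an int is expected.
        if type(value) is not expected:
            raise TypeError(
                f"item {index}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )


# The six-element variant from the prompt; the values are made up.
PICKLED_LAYOUT_CLASSES = [int, str, int, bytes, int, bool]
check_payload([0, "example-model", 1, b"\x80\x04", 16, False], PICKLED_LAYOUT_CLASSES)
```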

6. transfer_context.py rewrite

Remove poll_finished(), drain_all() from the TransferContext ABC.

CudaTransferContext:

  • register(): same as current
  • submit_store(): call send_request().to_cuda_future(), return the future
  • submit_retrieve(): call send_request().to_cuda_future(), return the future
  • No internal futures tracking dicts

CPUTransferContext:

  • register(): same as cur...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Refactor TransferContext to return futures directly" to "Minimize CPU-context integration diff by restoring adapter-owned futures and simplifying CPU registration payload" May 13, 2026
Copilot AI requested a review from hlin99 May 13, 2026 06:36
Copilot finished work on behalf of hlin99 May 13, 2026 06:36
Owner

@hlin99 hlin99 left a comment


Change Request: Revert pickle-based CPU registration to scalar parameters
The pickle.loads approach introduces a security risk (arbitrary code execution) and unnecessary coupling. Revert to PR #252's original scalar parameter design.

  1. lmcache/v1/multiprocess/protocols/engine.py — Restore REGISTER_KV_CACHE_CPU_CONTEXT payload to [int, str, int, int, int, int, str, bool], i.e. [instance_id, model_name, world_size, block_size, num_layers, hidden_dim_size, dtype_str, use_mla].

  2. lmcache/v1/multiprocess/server.py — Replace register_kv_cache_cpu_context: remove layout_desc_bytes: bytes param and pickle.loads, restore scalar params (block_size, num_layers, hidden_dim_size, dtype_str, use_mla). Reconstruct layout_desc from scalars using torch.Size and getattr(torch, dtype_str), same as PR #252. Remove import pickle if no longer used.

  3. lmcache/v1/multiprocess/transfer_context.py — In CPUTransferContext.register(), replace pickle.dumps(layout_desc) with scalar params [instance_id, model_name, world_size, block_size, num_layers, hidden_dim_size, dtype_str, use_mla_flag]. Remove import pickle.

  4. Tests — Update test_cpu_context.py and test_vllm_mp_adapter.py to pass scalar params instead of pickle.dumps(layout_desc). Change assertion len(args[2]) == 6 to len(args[2]) == 8.
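
The arbitrary-code-execution concern can be shown with a harmless standalone example. This sketch is not taken from the PR: `Malicious`, `ALLOWED_DTYPES`, and `check_dtype_str` are hypothetical, and the allow-list is only one possible way to guard the `getattr(torch, dtype_str)` lookup, which on its own would also accept names that are not dtypes.

```python
import pickle


class Malicious:
    """Pickles to an instruction to call eval() when loaded."""

    def __reduce__(self):
        # pickle stores (callable, args) and invokes the callable in loads().
        return (eval, ("40 + 2",))


blob = pickle.dumps(Malicious())
# loads() runs eval("40 + 2") instead of rebuilding a Malicious instance.
loaded = pickle.loads(blob)

# Hypothetical allow-list of dtype names a server might accept.
ALLOWED_DTYPES = {"float16", "bfloat16", "float32"}


def check_dtype_str(dtype_str: str) -> str:
    """Return dtype_str unchanged if it is on the allow-list."""
    # Without a check like this, a name such as "load" would resolve to
    # torch.load (a function) rather than to a dtype.
    if dtype_str not in ALLOWED_DTYPES:
        raise ValueError(f"unsupported dtype string: {dtype_str!r}")
    return dtype_str
```

With scalar parameters the receiver only ever sees ints, strings, and a bool, so nothing in the payload can name a callable for the deserializer to invoke.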

Author

Copilot AI commented May 13, 2026

Change Request: Revert pickle-based CPU registration to scalar parameters
The pickle.loads approach introduces security risk (arbitrary code execution) and unnecessary coupling. Revert to PR #252's or...

Implemented as requested in commit 8dd3558. CPU registration is reverted to scalar params across protocol/server/transfer_context, and tests were updated accordingly (len(args[2]) == 8).

Copilot AI changed the title from "Minimize CPU-context integration diff by restoring adapter-owned futures and simplifying CPU registration payload" to "Minimize CPU-context integration diff by restoring adapter-owned futures and reverting CPU registration to scalar payloads" May 13, 2026
Copilot AI requested a review from hlin99 May 13, 2026 07:02
Copilot finished work on behalf of hlin99 May 13, 2026 07:02
Owner

@hlin99 hlin99 left a comment


update docs/design/v1/multiprocess/cpu_context_design.md according to implementation

Author

Copilot AI commented May 13, 2026

update docs/design/v1/multiprocess/cpu_context_design.md according to implementation

Updated in commit fbba72e. docs/design/v1/multiprocess/cpu_context_design.md now matches implementation (scalar REGISTER_KV_CACHE_CPU_CONTEXT payload, adapter-owned future tracking, and resolved futures for CPU submit paths). No UI changes in this PR, so no screenshot applies.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • docs.python.org
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
  • numpy.org
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
  • psutil.readthedocs.io
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
  • pytorch.org
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
  • typing-extensions.readthedocs.io
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)


Copilot finished work on behalf of hlin99 May 13, 2026 07:31
Copilot AI requested a review from hlin99 May 13, 2026 07:31
Owner

@hlin99 hlin99 left a comment


No need to show too much detail in the doc, especially code; otherwise it's hard to maintain on each code change. Please streamline the doc. A high-level design is sufficient.

Author

Copilot AI commented May 13, 2026

No need to show too much detail in the doc, especially code; otherwise it's hard to maintain on each code change. Please streamline the doc. A high-level design is sufficient.

Updated in commit 95a009c. The doc is now streamlined to high-level design, with most code-level details removed and only concise protocol/behavior summaries retained. No UI changes were made in this update.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • docs.python.org
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build ock (dns block)
  • numpy.org
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build ock (dns block)
  • psutil.readthedocs.io
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build ock (dns block)
  • pytorch.org
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build ock (dns block)
  • typing-extensions.readthedocs.io
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build (dns block)
    • Triggering command: /home/REDACTED/.local/bin/sphinx-build sphinx-build -M html source build ock (dns block)


Copilot finished work on behalf of hlin99 May 13, 2026 07:48
Copilot AI requested a review from hlin99 May 13, 2026 07:48
@hlin99 hlin99 closed this May 28, 2026
@hlin99 hlin99 deleted the copilot/minimize-adapter-server-changes branch May 28, 2026 12:22