hlin99 · hlin99 · May 12, 2026 · May 12, 2026 · May 12, 2026 · May 12, 2026
diff --git a/docs/design/v1/multiprocess/non_gpu_context_design.md b/docs/design/v1/multiprocess/non_gpu_context_design.md
@@ -0,0 +1,217 @@
+# Non-GPU Context Design (Multiprocess Mode)
+
+## 1. Motivation
+
+LMCache multiprocess mode originally depended on CUDA IPC: workers send IPC handles,
+and the server reads/writes worker GPU memory directly. That path works well on
+CUDA, but the required primitives are CUDA-specific (IPC memory handles,
+interprocess CUDA events, CUDA stream semantics).
+
+For **CPU, XPU, HPU, and other non-CUDA devices**, those primitives do not exist.
+The non-GPU context design introduces a device-agnostic path where workers move KV
+data through CPU chunks instead of CUDA IPC handles.
+
+Goal: keep the existing CUDA path unchanged while adding a second path that works
+across non-CUDA backends.
+
+## 2. Design
+
+### 2.1 Architecture Overview
+
+```text
+Worker adapter (vLLM MP adapter)
+  └─ TransferContext
+      ├─ HandleTransferContext  (CUDA IPC path)
+      └─ DataTransferContext    (non-CUDA data path)
+          └─ NonGpuContext
+             ├─ NonGpuContextPickle
+             └─ NonGpuContextShm (TODO)
+```
+
+State machine overview (worker-side):
+
+```text
+                       create_transfer_context()
+                                 |
+                 +---------------+---------------+
+                 |                               |
+                 v                               v
+      HandleTransferContext            DataTransferContext
+          (device == CUDA)            (device != CUDA)
+                 |                               |
+                 v                               v
+              register()                      register()
+                 |                               |
+                 +---------------+---------------+
+                                 |
+                                 v
+                                READY
+                                 |
+                 +---------------+-------------------------------+
+                 |                                               |
+                 v                                               v
+    submit_store (handle path)                  submit_store (data path)
+    -> STORE request (async)                    -> prepare_store -> gather -> commit_store
+                 |                                               |
+                 +---------------+-------------------------------+
+                                 |
+                                 v
+                                READY
+                                 |
+                 +---------------+-------------------------------+
+                 |                                               |
+                 v                                               v
+  submit_retrieve (handle path)               submit_retrieve (data path)
+  -> RETRIEVE request (async)                 -> prepare_retrieve -> scatter -> commit_retrieve
+                 |                                               |
+                 +---------------+-------------------------------+
+                                 |
+                                 v
+                                READY
+                                 |
+                                 v
+                               close()
+```
+
+Overall data flow:
+- **CUDA path**: worker sends a handle, server pulls/pushes data directly.
+- **Non-CUDA path**: worker gathers/scatters paged KV and exchanges CPU-side data
+  via a transport-specific `NonGpuContext` implementation.
+
+### 2.2 Worker Side: TransferContext
+
+`TransferContext` is the worker-side transport abstraction with four methods:
+`register`, `submit_store`, `submit_retrieve`, and `close`.
+The contract is intentionally minimal so worker adapters only depend on these
+four lifecycle and transfer operations.
+
+- **HandleTransferContext** keeps the original CUDA IPC behavior:
+  worker sends a handle and server performs direct GPU-side transfer.
+- **DataTransferContext** is the non-CUDA path:
+  worker transfers actual data chunks through `NonGpuContext`.
+
+`DataTransferContext` flows:
+- **submit_store**: `prepare_store` → `gather_paged_kv_to_cpu` → `commit_store`
+- **submit_retrieve**: `prepare_retrieve` → `scatter_cpu_to_paged_kv` → `commit_retrieve`
+
+Why `prepare → data operation → commit`:
+- `prepare_*`: set up transport state (for SHM this allocates/returns shared buffers;
+  for pickle it is a protocol RPC that does not allocate transfer buffers).
+- gather/scatter: worker-local data movement between paged KV and contiguous
+  CPU chunks, performed between protocol phases.
+- `commit_*`: finalize and notify server to consume or release transfer state.
+
+`create_transfer_context()` selects the implementation once based on device type
+(CUDA → `HandleTransferContext`, otherwise → `DataTransferContext`).
+It also validates that all KV cache tensors share one device type and rejects
+mixed-device configurations by raising an error.
+
+| Context | What is transferred | Who performs copy work | Completion style |
+|---|---|---|---|
+| HandleTransferContext | Device handle/reference | Server pulls/pushes via IPC | Async MQ future |
+| DataTransferContext | Actual CPU chunk data | Worker gather/scatter + transport commit | Synchronous worker-side flow |
+
+### 2.3 Server Side: GPU Context vs Non-GPU Context
+
+- **GPU Context (existing path):** server uses CUDA IPC handles to access worker
+  device memory directly.
+- **Non-GPU Context:** server participates in two separate two-phase protocols
+  exposed by `NonGpuContext`: `prepare_store/commit_store` for store, and
+  `prepare_retrieve/commit_retrieve` for retrieve, plus lifecycle cleanup via
+  `close`.
+
+`NonGpuContext` implementations:
+- **NonGpuContextPickle**: serialize/deserialize chunk payloads with pickle.
+- **NonGpuContextShm**: shared-memory transport (planned/TODO).
+
+This split keeps server protocol stable while allowing transport-specific behavior
+behind one interface contract.
+
+### 2.4 Transport Comparison
+
+**Store (worker → server storage):**
+
+| Transport | Copies | Data flow |
+|---|---|---|
+| Handle (CUDA IPC) | 2 | GPU KV → GPU staging buffer → CPU memory object |
+| Pickle | 4 | GPU KV → CPU chunk → serialize → deserialize → CPU memory object |
+| SHM (TODO) | 1 | GPU KV → CPU memory object (SHM mapped) |
+
+**Retrieve (server storage → worker):**
+
+| Transport | Copies | Data flow |
+|---|---|---|
+| Handle (CUDA IPC) | 2 | CPU memory object → GPU staging buffer → GPU KV |
+| Pickle | 4 | CPU memory object → serialize → deserialize → CPU chunk → GPU KV |
+| SHM (TODO) | 1 | CPU memory object (SHM mapped) → GPU KV |
+
+| Transport | Pros | Cons | Best fit |
+|---|---|---|---|
+| Handle (CUDA IPC) | Mature path, good async overlap | CUDA-only | NVIDIA CUDA deployments |
+| Pickle | Works everywhere, no SHM setup | Extra serialization + copy overhead | Universal fallback |
+| SHM (TODO) | Lowest copy count, no serialization | Requires enough `/dev/shm` and synchronization | High-throughput non-CUDA setups |
+
+## 3. Protocol & Data Flow
+
+### 3.1 MQ Request Types Used by Non-GPU Path
+
+The non-GPU path uses five request types:
+
+1. `REGISTER_KV_CACHE_NON_GPU_CONTEXT`  
+   Worker registers non-CUDA KV layout metadata so the server can reconstruct
+   the worker KV memory layout for store/retrieve operations.
+
+2. `PREPARE_STORE`  
+   Worker asks server/transport to prepare store-side transfer state.
+
+3. `COMMIT_STORE`  
+   Worker commits store data so server can persist it into storage.
+
+4. `PREPARE_RETRIEVE`  
+   Worker asks server to prepare retrieval payload/state for a key.
+
+5. `COMMIT_RETRIEVE`  
+   Worker acknowledges retrieval completion so transport state can be finalized.
+
+### 3.2 Data Flow: Pickle Path
+
+Store:
+1. Worker `prepare_store` RPC.
+2. Worker gathers paged KV into CPU chunks.
+3. Worker `commit_store` sends serialized bytes.
+4. Server deserializes and writes to storage.
+
+Retrieve:
+1. Worker `prepare_retrieve` RPC.
+2. Server reads from storage and returns serialized bytes.
+3. Worker deserializes to CPU chunks.
+4. Worker scatters chunks back to paged KV.
+5. Worker `commit_retrieve` finalizes protocol state.
+
+```text
+Store (pickle)
+Worker: prepare_store --> Server
+Worker: gather paged KV -> CPU chunks
+Worker: commit_store(serialized bytes) --> Server
+Server: deserialize -> storage write
+
+Retrieve (pickle)
+Worker: prepare_retrieve --> Server
+Server: read storage -> serialize bytes
+Server: serialized bytes --> Worker
+Worker: deserialize -> scatter to paged KV
+Worker: commit_retrieve --> Server
+```
+
+### 3.3 Data Flow: SHM Path (TODO)
+
+Store:
+1. Worker `prepare_store` obtains SHM slot/offset.
+2. Worker gathers directly into SHM-backed buffers.
+3. Worker `commit_store` notifies server to consume SHM data.
+
+Retrieve:
+1. Worker `prepare_retrieve` asks server to populate SHM.
+2. Server writes retrieved chunks into SHM.
+3. Worker scatters from SHM-backed buffers into paged KV.
+4. Worker `commit_retrieve` releases/read-completes SHM state.
@@ -13,7 +13,8 @@
 
 # First Party
 from lmcache.integration.request_telemetry.factory import RequestTelemetryFactory
-from lmcache.utils import EngineType, _lmcache_nvtx_annotate, init_logger
+from lmcache.integration.vllm.utils import vllm_layout_hints
+from lmcache.utils import _lmcache_nvtx_annotate, init_logger
 from lmcache.v1.multiprocess.custom_types import (
     BlockAllocationRecord,
     CudaIPCWrapper,
@@ -22,6 +23,10 @@
 )
 from lmcache.v1.multiprocess.mq import MessageQueueClient, MessagingFuture
 from lmcache.v1.multiprocess.protocol import RequestType, get_response_class
+from lmcache.v1.multiprocess.transfer_context import (
+    TransferContext,
+    create_transfer_context,
+)
 from lmcache.v1.periodic_thread import PeriodicThread, ThreadLevel, ThreadRunSummary
 
 logger = init_logger(__name__)
@@ -803,6 +808,9 @@ def __init__(
         # Registered kv caches from vLLM
         self.kv_caches: dict[str, torch.Tensor] = {}
 
+        # Transport context for transfer operations.
+        self.transfer_ctx: TransferContext | None = None
+
         # Request futures
         self.store_futures: dict[str, MessagingFuture[StoreResult]] = {}
         # request_id -> (future, block_ids)
@@ -939,27 +947,24 @@ def _send_register_kv_caches_request(
             ConnectionError: if the server does not respond within
                 mq_timeout.
         """
-        # First Party
-        from lmcache.integration.vllm.utils import vllm_layout_hints
-
+        self.kv_caches = kv_caches
+        self.transfer_ctx = create_transfer_context(kv_caches)
         layout_hints = vllm_layout_hints()
         layout_hints["inference_engine_logical_block_size"] = (
             self.vllm_logical_block_size
         )
-        future = send_lmcache_request(
-            self.mq_client,
-            RequestType.REGISTER_KV_CACHE,
-            [
+        try:
+            self.transfer_ctx.register(
                 self.instance_id,
-                wrap_kv_caches(kv_caches),
+                kv_caches,
                 self.model_name,
                 self.world_size,
-                EngineType.VLLM,
-                layout_hints,
-            ],
-        )
-        try:
-            future.result(timeout=self._mq_timeout)
+                self.blocks_in_chunk,
+                self.mq_client,
+                self._mq_timeout,
+                send_request=send_lmcache_request,
+                layout_hints=layout_hints,
+            )
         except TimeoutError:
             raise ConnectionError(
                 "LMCache server did not respond to "
@@ -1049,11 +1054,20 @@ def submit_store_request(
             request_id=request_id,
             cache_salt=cache_salt,
         )
-        future = send_lmcache_request(
-            self.mq_client,
-            RequestType.STORE,
-            [key, self.instance_id, op.block_ids, event.ipc_handle()],
-        ).to_cuda_future()
+        if self.transfer_ctx is None:
+            raise RuntimeError(
+                "Transfer context is not initialized. "
+                "Call register_kv_caches() before submitting store requests."
+            )
+        future = self.transfer_ctx.submit_store(
+            request_id,
+            key,
+            self.instance_id,
+            self.kv_caches,
+            op.block_ids,
+            event,
+            self.blocks_in_chunk,
+        )
         self.store_futures[request_id] = future
 
     @_lmcache_nvtx_annotate
@@ -1088,17 +1102,21 @@ def submit_retrieve_request(
             request_id=request_id,
             cache_salt=cache_salt,
         )
-        future = send_lmcache_request(
-            self.mq_client,
-            RequestType.RETRIEVE,
-            [
-                key,
-                self.instance_id,
-                op.block_ids,
-                event.ipc_handle(),
-                op.skip_first_n_tokens,
-            ],
-        ).to_cuda_future()
+        if self.transfer_ctx is None:
+            raise RuntimeError(
+                "Transfer context is not initialized. "
+                "Call register_kv_caches() before submitting retrieve requests."
+            )
+        future = self.transfer_ctx.submit_retrieve(
+            request_id,
+            key,
+            self.instance_id,
+            self.kv_caches,
+            op.block_ids,
+            event,
+            self.blocks_in_chunk,
+            skip_first_n_tokens=op.skip_first_n_tokens,
+        )
         self.retrieve_futures[request_id] = (future, list(op.block_ids))
 
     @_lmcache_nvtx_annotate
@@ -1309,6 +1327,10 @@ def shutdown(self):
                 self._mq_timeout,
             )
 
+        if self.transfer_ctx is not None:
+            self.transfer_ctx.close()
+            self.transfer_ctx = None
+
         self.mq_client.close()
         self.request_telemetry.close()
 

diff --git a/lmcache/python_ops_fallback.py b/lmcache/python_ops_fallback.py
@@ -17,7 +17,7 @@
 import torch
 
 # First Party
-from lmcache import torch_dev, torch_device_type
+from lmcache import torch_dev
 
 # Store the tensor objects in memory so that they can be accessed
 # outside the scope of this file
@@ -327,10 +327,7 @@ def alloc_pinned_numa_ptr(size: int, numa_id: int = 0) -> int:
     Note: NUMA node selection is not supported on non-CUDA."""
 
     # Create a 1D uint8 CPU tensor, as uint8 == 1 byte
-    # On XPU (Intel GPU), PyTorch 2.4+ supports pin_memory=True via SYCL USM
-    # host allocation, enabling fast DMA for XPU<->CPU transfers.
-    pin_memory = torch_device_type == "xpu"
-    tensor = torch.empty(size, dtype=torch.uint8, pin_memory=pin_memory)
+    tensor = torch.empty(size, dtype=torch.uint8, pin_memory=False)
 
     # First-touch initialization (forces physical allocation)
     tensor.fill_(0)
@@ -358,10 +355,7 @@ def alloc_pinned_ptr(size: int, device_id: int = 0) -> int:
     fast DMA transfers. On other non-CUDA platforms, pinning is not supported."""
 
     # Create a 1D uint8 CPU tensor, as uint8 == 1 byte
-    # On XPU (Intel GPU), PyTorch 2.4+ supports pin_memory=True via SYCL USM
-    # host allocation, enabling fast DMA for XPU<->CPU transfers.
-    pin_memory = torch_device_type == "xpu"
-    tensor = torch.empty(size, dtype=torch.uint8, pin_memory=pin_memory)
+    tensor = torch.empty(size, dtype=torch.uint8, pin_memory=False)
 
     # First-touch initialization (forces physical allocation)
     tensor.fill_(0)