Closed
27 commits
99119a2
feat(mp): CPU Context by pickle
hlin99 May 12, 2026
9a12d7a
renaming bounce keyword to cpu context
hlin99 May 12, 2026
08a4d10
fix unit test failures
hlin99 May 12, 2026
715ac57
address bot review comment
hlin99 May 12, 2026
96adc94
refactor: standardize cpu context naming conventions
hlin99 May 13, 2026
e888620
small fix on error handling
hlin99 May 13, 2026
cf70122
refactor: polymorphic TransferContext for MP adapter transport layer
hlin99 May 13, 2026
b665d41
restore unnecessary changes
hlin99 May 13, 2026
1b5fa31
Revert CPU registration payload from pickle bytes to scalar fields
hlin99 May 13, 2026
8008ed3
update design doc
hlin99 May 14, 2026
10349ba
Merge branch 'dev' into ww20_PR_cpu_context_pickle
hlin99 May 14, 2026
f1f1824
rebase dsv4: propagate vllm_logical_block_size through TransferContex…
hlin99 May 15, 2026
f8d93b9
rename to more general names: non_gpu_context & non_cuda_transfer_con…
hlin99 May 15, 2026
598c30f
add todo note for deepseek v4 on non-cuda path
hlin99 May 15, 2026
b4371e3
Consolidate MPCacheEngine context state into unified registry
hlin99 May 15, 2026
a129e9d
use dataclass for payload
hlin99 May 15, 2026
f030c2b
Auto-disable l1-use-lazy on non-CUDA backends
hlin99 May 15, 2026
2009b41
Lift layout hints to the caller layer to avoid redundant computation …
hlin99 May 15, 2026
a9c5ec3
Merge branch 'dev' into ww20_PR_cpu_context_pickle
hlin99 May 17, 2026
ec68ee3
[refactor] reserve zero-copy buffer allocation interface to NonGpuCon…
hlin99 May 19, 2026
bb60eef
fix: update test to use new unified protocol methods
hlin99 May 19, 2026
91ed22b
refactor: rename transfer context classes to handle/data semantics
hlin99 May 20, 2026
6d8ff15
update test
hlin99 May 20, 2026
b07b708
update docs
hlin99 May 20, 2026
633e6eb
Merge remote-tracking branch 'origin/dev' into ww20_PR_cpu_context_pi…
hlin99 May 20, 2026
09415fd
rename test file
hlin99 May 20, 2026
3b3f569
Merge branch 'dev' into ww20_PR_cpu_context_pickle
hlin99 May 20, 2026
217 changes: 217 additions & 0 deletions docs/design/v1/multiprocess/non_gpu_context_design.md
@@ -0,0 +1,217 @@
# Non-GPU Context Design (Multiprocess Mode)

## 1. Motivation

LMCache multiprocess mode originally depended on CUDA IPC: workers send IPC handles,
and the server reads/writes worker GPU memory directly. That path works well on
CUDA, but the required primitives are CUDA-specific (IPC memory handles,
interprocess CUDA events, CUDA stream semantics).

For **CPU, XPU, HPU, and other non-CUDA devices**, those primitives do not exist.
The non-GPU context design introduces a device-agnostic path where workers move KV
data through CPU chunks instead of CUDA IPC handles.

Goal: keep the existing CUDA path unchanged while adding a second path that works
across non-CUDA backends.

## 2. Design

### 2.1 Architecture Overview

```text
Worker adapter (vLLM MP adapter)
└─ TransferContext
   ├─ HandleTransferContext   (CUDA IPC path)
   └─ DataTransferContext     (non-CUDA data path)
      └─ NonGpuContext
         ├─ NonGpuContextPickle
         └─ NonGpuContextShm  (TODO)
```

State machine overview (worker-side):

```text
                create_transfer_context()
                            |
            +---------------+---------------+
            |                               |
            v                               v
  HandleTransferContext            DataTransferContext
    (device == CUDA)                (device != CUDA)
            |                               |
            v                               v
       register()                      register()
            |                               |
            +---------------+---------------+
                            |
                            v
                          READY
                            |
            +---------------+-------------------------------+
            |                                               |
            v                                               v
submit_store (handle path)                      submit_store (data path)
-> STORE request (async)               -> prepare_store -> gather -> commit_store
            |                                               |
            +---------------+-------------------------------+
                            |
                            v
                          READY
                            |
            +---------------+-------------------------------+
            |                                               |
            v                                               v
submit_retrieve (handle path)                  submit_retrieve (data path)
-> RETRIEVE request (async)         -> prepare_retrieve -> scatter -> commit_retrieve
            |                                               |
            +---------------+-------------------------------+
                            |
                            v
                          READY
                            |
                            v
                         close()
```

Overall data flow:
- **CUDA path**: worker sends a handle, server pulls/pushes data directly.
- **Non-CUDA path**: worker gathers/scatters paged KV and exchanges CPU-side data
via a transport-specific `NonGpuContext` implementation.

### 2.2 Worker Side: TransferContext

`TransferContext` is the worker-side transport abstraction. It exposes four methods:
`register`, `submit_store`, `submit_retrieve`, and `close`.
The contract is intentionally minimal, so worker adapters depend only on these
lifecycle and transfer operations.

- **HandleTransferContext** keeps the original CUDA IPC behavior:
worker sends a handle and server performs direct GPU-side transfer.
- **DataTransferContext** is the non-CUDA path:
worker transfers actual data chunks through `NonGpuContext`.

`DataTransferContext` flows:
- **submit_store**: `prepare_store` → `gather_paged_kv_to_cpu` → `commit_store`
- **submit_retrieve**: `prepare_retrieve` → `scatter_cpu_to_paged_kv` → `commit_retrieve`
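
The gather and scatter steps in these flows can be pictured with a minimal pure-Python sketch. It assumes a paged KV cache modeled as a list of fixed-size `bytearray` blocks. The function names mirror the ones above, but the signatures and the flat-byte layout are illustrative assumptions, not the actual LMCache implementation.

```python
def gather_paged_kv_to_cpu(paged_kv: list[bytearray], block_ids: list[int]) -> bytes:
    """Copy the selected (possibly non-adjacent) blocks into one contiguous chunk."""
    return b"".join(bytes(paged_kv[i]) for i in block_ids)


def scatter_cpu_to_paged_kv(
    chunk: bytes, paged_kv: list[bytearray], block_ids: list[int]
) -> None:
    """Write a contiguous chunk back into the selected blocks, in order."""
    block_size = len(paged_kv[0])
    if len(chunk) != block_size * len(block_ids):
        raise ValueError("chunk size does not match the requested blocks")
    for n, block_id in enumerate(block_ids):
        paged_kv[block_id][:] = chunk[n * block_size:(n + 1) * block_size]


# Hypothetical 4-block cache with 2 bytes per block.
src = [bytearray([i, i]) for i in range(4)]
chunk = gather_paged_kv_to_cpu(src, [3, 1])

# Scatter the same chunk into different block slots of an empty cache.
dst = [bytearray(2) for _ in range(4)]
scatter_cpu_to_paged_kv(chunk, dst, [0, 2])
```

The chunk is contiguous even though the source blocks are not, which is what lets a transport treat it as a single payload.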

Why `prepare → data operation → commit`:
- `prepare_*`: set up transport state (for SHM this allocates/returns shared buffers;
for pickle it is a protocol RPC that does not allocate transfer buffers).
- gather/scatter: worker-local data movement between paged KV and contiguous
CPU chunks, performed between protocol phases.
- `commit_*`: finalize and notify server to consume or release transfer state.
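
The three-phase ordering can be sketched as follows. `RecordingContext` is a hypothetical stand-in for a `NonGpuContext` transport, and the `submit_store` signature is simplified for illustration; neither is taken from the PR.

```python
trace: list[str] = []  # records the order of operations for illustration


class RecordingContext:
    """Hypothetical stand-in for a NonGpuContext transport."""

    def __init__(self) -> None:
        self.stored: dict[str, bytes] = {}

    def prepare_store(self, key: str) -> None:
        # Pickle-style transport: a protocol round trip, no buffer is allocated.
        trace.append("prepare_store")

    def commit_store(self, key: str, chunk: bytes) -> None:
        # Hand the gathered chunk to the server side and finalize the transfer.
        trace.append("commit_store")
        self.stored[key] = chunk


def submit_store(
    ctx: RecordingContext, key: str, paged_kv: list[bytes], block_ids: list[int]
) -> None:
    ctx.prepare_store(key)                            # phase 1: set up transport state
    trace.append("gather")
    chunk = b"".join(paged_kv[i] for i in block_ids)  # worker-local data movement
    ctx.commit_store(key, chunk)                      # phase 2: notify the server


ctx = RecordingContext()
submit_store(ctx, "req-0", [b"\x00", b"\x01", b"\x02"], [2, 0])
```

Keeping the gather between the two protocol calls means an SHM-style transport could return a buffer from `prepare_store` and have the worker gather straight into it, without changing the caller.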

`create_transfer_context()` selects the implementation once based on device type
(CUDA → `HandleTransferContext`, otherwise → `DataTransferContext`).
It also validates that all KV cache tensors share one device type and rejects
mixed-device configurations by raising an error.
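
A minimal sketch of that selection logic, assuming each KV cache tensor exposes `.device.type` as a `torch.Tensor` does. The two context classes are empty placeholders, `fake_tensor` is a hypothetical helper so the example runs without PyTorch, and `ValueError` is an assumed exception type (the design only says an error is raised).

```python
from types import SimpleNamespace


class HandleTransferContext:
    """Placeholder for the CUDA IPC path."""


class DataTransferContext:
    """Placeholder for the non-CUDA data path."""


def create_transfer_context(kv_caches: dict):
    """Pick a context from the single device type shared by all KV cache tensors."""
    device_types = {tensor.device.type for tensor in kv_caches.values()}
    if len(device_types) != 1:
        raise ValueError(
            f"expected exactly one device type, got {sorted(device_types)}"
        )
    (device_type,) = device_types
    if device_type == "cuda":
        return HandleTransferContext()
    return DataTransferContext()


def fake_tensor(device_type: str) -> SimpleNamespace:
    # Stand-in exposing only `.device.type`.
    return SimpleNamespace(device=SimpleNamespace(type=device_type))


cuda_ctx = create_transfer_context({"layer0": fake_tensor("cuda")})
xpu_ctx = create_transfer_context(
    {"layer0": fake_tensor("xpu"), "layer1": fake_tensor("xpu")}
)
```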

| Context | What is transferred | Who performs copy work | Completion style |
|---|---|---|---|
| HandleTransferContext | Device handle/reference | Server pulls/pushes via IPC | Async MQ future |
| DataTransferContext | Actual CPU chunk data | Worker gather/scatter + transport commit | Synchronous worker-side flow |

### 2.3 Server Side: GPU Context vs Non-GPU Context

- **GPU Context (existing path):** server uses CUDA IPC handles to access worker
device memory directly.
- **Non-GPU Context:** server participates in two separate two-phase protocols
exposed by `NonGpuContext`: `prepare_store/commit_store` for store, and
`prepare_retrieve/commit_retrieve` for retrieve, plus lifecycle cleanup via
`close`.

`NonGpuContext` implementations:
- **NonGpuContextPickle**: serialize/deserialize chunk payloads with pickle.
- **NonGpuContextShm**: shared-memory transport (planned/TODO).

This split keeps server protocol stable while allowing transport-specific behavior
behind one interface contract.
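
One way to express such an interface contract in Python is an abstract base class. The sketch below is illustrative only: the method names come from this document, while the argument and return types are assumptions, and `IncompleteContext` is a hypothetical transport used to show what the contract enforces.

```python
from abc import ABC, abstractmethod
from typing import Any


class NonGpuContext(ABC):
    """Illustrative interface; argument and return types are assumptions."""

    @abstractmethod
    def prepare_store(self, key: str) -> Any: ...

    @abstractmethod
    def commit_store(self, key: str, payload: Any) -> None: ...

    @abstractmethod
    def prepare_retrieve(self, key: str) -> Any: ...

    @abstractmethod
    def commit_retrieve(self, key: str) -> None: ...

    @abstractmethod
    def close(self) -> None: ...


class IncompleteContext(NonGpuContext):
    """Hypothetical transport that omits the retrieve half of the contract."""

    def prepare_store(self, key: str) -> Any:
        return None

    def commit_store(self, key: str, payload: Any) -> None:
        pass

    def close(self) -> None:
        pass


try:
    IncompleteContext()
    instantiated = True
except TypeError:
    instantiated = False  # abstract methods are still missing
```

With this shape, a new transport that skips part of either two-phase protocol fails at construction time rather than in the middle of a transfer.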

### 2.4 Transport Comparison

**Store (worker → server storage):**

| Transport | Copies | Data flow |
|---|---|---|
| Handle (CUDA IPC) | 2 | GPU KV → GPU staging buffer → CPU memory object |
| Pickle | 4 | GPU KV → CPU chunk → serialize → deserialize → CPU memory object |
| SHM (TODO) | 1 | GPU KV → CPU memory object (SHM mapped) |

**Retrieve (server storage → worker):**

| Transport | Copies | Data flow |
|---|---|---|
| Handle (CUDA IPC) | 2 | CPU memory object → GPU staging buffer → GPU KV |
| Pickle | 4 | CPU memory object → serialize → deserialize → CPU chunk → GPU KV |
| SHM (TODO) | 1 | CPU memory object (SHM mapped) → GPU KV |

| Transport | Pros | Cons | Best fit |
|---|---|---|---|
| Handle (CUDA IPC) | Mature path, good async overlap | CUDA-only | NVIDIA CUDA deployments |
| Pickle | Works everywhere, no SHM setup | Extra serialization + copy overhead | Universal fallback |
| SHM (TODO) | Lowest copy count, no serialization | Requires sufficient `/dev/shm` capacity and explicit synchronization | High-throughput non-CUDA setups |

## 3. Protocol & Data Flow

### 3.1 MQ Request Types Used by Non-GPU Path

The non-GPU path uses five request types:

1. `REGISTER_KV_CACHE_NON_GPU_CONTEXT`
Worker registers non-CUDA KV layout metadata so the server can reconstruct
the worker KV memory layout for store/retrieve operations.

2. `PREPARE_STORE`
Worker asks server/transport to prepare store-side transfer state.

3. `COMMIT_STORE`
Worker commits store data so server can persist it into storage.

4. `PREPARE_RETRIEVE`
Worker asks server to prepare retrieval payload/state for a key.

5. `COMMIT_RETRIEVE`
Worker acknowledges retrieval completion so transport state can be finalized.
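
For orientation, the five request types and the order in which each operation uses them can be written down as a small enum. This is an illustrative sketch: `NonGpuRequestType` and the two sequence tuples are hypothetical names, not the definitions in `lmcache.v1.multiprocess.protocol`.

```python
from enum import Enum, auto


class NonGpuRequestType(Enum):
    """Hypothetical enum mirroring the request names listed above."""

    REGISTER_KV_CACHE_NON_GPU_CONTEXT = auto()
    PREPARE_STORE = auto()
    COMMIT_STORE = auto()
    PREPARE_RETRIEVE = auto()
    COMMIT_RETRIEVE = auto()


# Registration happens once; each store or retrieve is then a two-phase exchange.
STORE_SEQUENCE = (NonGpuRequestType.PREPARE_STORE, NonGpuRequestType.COMMIT_STORE)
RETRIEVE_SEQUENCE = (
    NonGpuRequestType.PREPARE_RETRIEVE,
    NonGpuRequestType.COMMIT_RETRIEVE,
)
```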

### 3.2 Data Flow: Pickle Path

Store:
1. Worker `prepare_store` RPC.
2. Worker gathers paged KV into CPU chunks.
3. Worker `commit_store` sends serialized bytes.
4. Server deserializes and writes to storage.

Retrieve:
1. Worker `prepare_retrieve` RPC.
2. Server reads from storage and returns serialized bytes.
3. Worker deserializes to CPU chunks.
4. Worker scatters chunks back to paged KV.
5. Worker `commit_retrieve` finalizes protocol state.

```text
Store (pickle)
Worker: prepare_store --> Server
Worker: gather paged KV -> CPU chunks
Worker: commit_store(serialized bytes) --> Server
Server: deserialize -> storage write

Retrieve (pickle)
Worker: prepare_retrieve --> Server
Server: read storage -> serialize bytes
Server: serialized bytes --> Worker
Worker: deserialize -> scatter to paged KV
Worker: commit_retrieve --> Server
```
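
The serialization round trip in the pickle path can be sketched with the standard library alone. The `storage` dict and the two `server_*` functions are hypothetical stand-ins for the server side, with the MQ transport replaced by direct function calls; the chunks are plain `bytes` rather than tensors.

```python
import pickle

storage: dict[str, list[bytes]] = {}  # hypothetical server-side storage


def server_commit_store(key: str, payload: bytes) -> None:
    """Server: deserialize the committed payload and write it to storage."""
    storage[key] = pickle.loads(payload)


def server_prepare_retrieve(key: str) -> bytes:
    """Server: read from storage and return serialized bytes."""
    return pickle.dumps(storage[key], protocol=pickle.HIGHEST_PROTOCOL)


# Worker, store: gathered CPU chunks are serialized and committed.
chunks = [b"\x01\x02", b"\x03\x04"]
server_commit_store("req-0", pickle.dumps(chunks, protocol=pickle.HIGHEST_PROTOCOL))

# Worker, retrieve: serialized bytes come back and are deserialized before scatter.
restored = pickle.loads(server_prepare_retrieve("req-0"))
```

Each direction pays for one serialize and one deserialize step, which is the overhead the transport comparison attributes to this path. Since `pickle.loads` can execute arbitrary code, a transport like this is only appropriate between trusted processes.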

### 3.3 Data Flow: SHM Path (TODO)

Store:
1. Worker `prepare_store` obtains SHM slot/offset.
2. Worker gathers directly into SHM-backed buffers.
3. Worker `commit_store` notifies server to consume SHM data.

Retrieve:
1. Worker `prepare_retrieve` asks server to populate SHM.
2. Server writes retrieved chunks into SHM.
3. Worker scatters from SHM-backed buffers into paged KV.
4. Worker `commit_retrieve` signals read completion so the server can release SHM state.
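
Because the SHM transport is still a TODO, the following is only a speculative sketch of the store flow, using Python's standard `multiprocessing.shared_memory` module. The `prepare_store` helper and the single-process setup are assumptions made so the example is self-contained; a real implementation would split the two sides across processes and add synchronization.

```python
from multiprocessing import shared_memory


def prepare_store(size: int) -> shared_memory.SharedMemory:
    """Server side (hypothetical): allocate a shared segment for the worker."""
    return shared_memory.SharedMemory(create=True, size=size)


payload = b"\x03\x03\x01\x01"
segment = prepare_store(len(payload))
try:
    # Worker side: attach by name and gather directly into the shared buffer.
    worker_view = shared_memory.SharedMemory(name=segment.name)
    worker_view.buf[: len(payload)] = payload
    worker_view.close()

    # commit_store: the server consumes the data already sitting in the segment.
    stored = bytes(segment.buf[: len(payload)])
finally:
    segment.close()
    segment.unlink()  # release the segment once the transfer is finalized
```

The worker writes once into memory the server can already see, which is where the single-copy figure in the transport comparison comes from.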
84 changes: 53 additions & 31 deletions lmcache/integration/vllm/vllm_multi_process_adapter.py
@@ -13,7 +13,8 @@
 
 # First Party
 from lmcache.integration.request_telemetry.factory import RequestTelemetryFactory
-from lmcache.utils import EngineType, _lmcache_nvtx_annotate, init_logger
+from lmcache.integration.vllm.utils import vllm_layout_hints
+from lmcache.utils import _lmcache_nvtx_annotate, init_logger
 from lmcache.v1.multiprocess.custom_types import (
     BlockAllocationRecord,
     CudaIPCWrapper,
@@ -22,6 +23,10 @@
 )
 from lmcache.v1.multiprocess.mq import MessageQueueClient, MessagingFuture
 from lmcache.v1.multiprocess.protocol import RequestType, get_response_class
+from lmcache.v1.multiprocess.transfer_context import (
+    TransferContext,
+    create_transfer_context,
+)
 from lmcache.v1.periodic_thread import PeriodicThread, ThreadLevel, ThreadRunSummary
 
 logger = init_logger(__name__)
@@ -803,6 +808,9 @@ def __init__(
         # Registered kv caches from vLLM
         self.kv_caches: dict[str, torch.Tensor] = {}
 
+        # Transport context for transfer operations.
+        self.transfer_ctx: TransferContext | None = None
+
         # Request futures
         self.store_futures: dict[str, MessagingFuture[StoreResult]] = {}
         # request_id -> (future, block_ids)
@@ -939,27 +947,24 @@ def _send_register_kv_caches_request(
             ConnectionError: if the server does not respond within
                 mq_timeout.
         """
-        # First Party
-        from lmcache.integration.vllm.utils import vllm_layout_hints
-
+        self.kv_caches = kv_caches
+        self.transfer_ctx = create_transfer_context(kv_caches)
         layout_hints = vllm_layout_hints()
         layout_hints["inference_engine_logical_block_size"] = (
             self.vllm_logical_block_size
         )
-        future = send_lmcache_request(
-            self.mq_client,
-            RequestType.REGISTER_KV_CACHE,
-            [
+        try:
+            self.transfer_ctx.register(
                 self.instance_id,
-                wrap_kv_caches(kv_caches),
+                kv_caches,
                 self.model_name,
                 self.world_size,
-                EngineType.VLLM,
-                layout_hints,
-            ],
-        )
-        try:
-            future.result(timeout=self._mq_timeout)
+                self.blocks_in_chunk,
+                self.mq_client,
+                self._mq_timeout,
+                send_request=send_lmcache_request,
+                layout_hints=layout_hints,
+            )
         except TimeoutError:
             raise ConnectionError(
                 "LMCache server did not respond to "
@@ -1049,11 +1054,20 @@ def submit_store_request(
             request_id=request_id,
             cache_salt=cache_salt,
         )
-        future = send_lmcache_request(
-            self.mq_client,
-            RequestType.STORE,
-            [key, self.instance_id, op.block_ids, event.ipc_handle()],
-        ).to_cuda_future()
+        if self.transfer_ctx is None:
+            raise RuntimeError(
+                "Transfer context is not initialized. "
+                "Call register_kv_caches() before submitting store requests."
+            )
+        future = self.transfer_ctx.submit_store(
+            request_id,
+            key,
+            self.instance_id,
+            self.kv_caches,
+            op.block_ids,
+            event,
+            self.blocks_in_chunk,
+        )
         self.store_futures[request_id] = future
 
     @_lmcache_nvtx_annotate
@@ -1088,17 +1102,21 @@ def submit_retrieve_request(
             request_id=request_id,
             cache_salt=cache_salt,
         )
-        future = send_lmcache_request(
-            self.mq_client,
-            RequestType.RETRIEVE,
-            [
-                key,
-                self.instance_id,
-                op.block_ids,
-                event.ipc_handle(),
-                op.skip_first_n_tokens,
-            ],
-        ).to_cuda_future()
+        if self.transfer_ctx is None:
+            raise RuntimeError(
+                "Transfer context is not initialized. "
+                "Call register_kv_caches() before submitting retrieve requests."
+            )
+        future = self.transfer_ctx.submit_retrieve(
+            request_id,
+            key,
+            self.instance_id,
+            self.kv_caches,
+            op.block_ids,
+            event,
+            self.blocks_in_chunk,
+            skip_first_n_tokens=op.skip_first_n_tokens,
+        )
         self.retrieve_futures[request_id] = (future, list(op.block_ids))
 
     @_lmcache_nvtx_annotate
@@ -1309,6 +1327,10 @@ def shutdown(self):
             self._mq_timeout,
         )
 
+        if self.transfer_ctx is not None:
+            self.transfer_ctx.close()
+            self.transfer_ctx = None
+
         self.mq_client.close()
         self.request_telemetry.close()
 
12 changes: 3 additions & 9 deletions lmcache/python_ops_fallback.py
@@ -17,7 +17,7 @@
 import torch
 
 # First Party
-from lmcache import torch_dev, torch_device_type
+from lmcache import torch_dev
 
 # Store the tensor objects in memory so that they can be accessed
 # outside the scope of this file
@@ -327,10 +327,7 @@ def alloc_pinned_numa_ptr(size: int, numa_id: int = 0) -> int:
     Note: NUMA node selection is not supported on non-CUDA."""
 
     # Create a 1D uint8 CPU tensor, as uint8 == 1 byte
-    # On XPU (Intel GPU), PyTorch 2.4+ supports pin_memory=True via SYCL USM
-    # host allocation, enabling fast DMA for XPU<->CPU transfers.
-    pin_memory = torch_device_type == "xpu"
-    tensor = torch.empty(size, dtype=torch.uint8, pin_memory=pin_memory)
+    tensor = torch.empty(size, dtype=torch.uint8, pin_memory=False)
 
     # First-touch initialization (forces physical allocation)
     tensor.fill_(0)
@@ -358,10 +355,7 @@ def alloc_pinned_ptr(size: int, device_id: int = 0) -> int:
     fast DMA transfers. On other non-CUDA platforms, pinning is not supported."""
 
     # Create a 1D uint8 CPU tensor, as uint8 == 1 byte
-    # On XPU (Intel GPU), PyTorch 2.4+ supports pin_memory=True via SYCL USM
-    # host allocation, enabling fast DMA for XPU<->CPU transfers.
-    pin_memory = torch_device_type == "xpu"
-    tensor = torch.empty(size, dtype=torch.uint8, pin_memory=pin_memory)
+    tensor = torch.empty(size, dtype=torch.uint8, pin_memory=False)
 
     # First-touch initialization (forces physical allocation)
     tensor.fill_(0)