Closed
316 changes: 63 additions & 253 deletions docs/design/v1/multiprocess/cpu_context_design.md
@@ -2,288 +2,98 @@

## Scope

This document describes the non-CUDA CPU-based KV transfer path for LMCache
multiprocess mode.
This document describes the high-level design of the non-CUDA KV transfer path
for LMCache multiprocess mode.

The goal is to support KV transfer for non-CUDA devices (for example CPU,
XPU, HPU) without changing the existing CUDA IPC path, while providing a
clean abstraction layer that makes it easy to add alternative transport
mechanisms (e.g. shared memory) in a future PR.
The purpose of this path is to support KV transfer on non-CUDA devices (for
example CPU, XPU, and HPU) while preserving existing CUDA IPC behavior.

## Why this path exists

The CUDA path uses IPC wrappers around GPU tensors and the existing
`REGISTER_KV_CACHE` / `STORE` / `RETRIEVE` request flow.
CUDA IPC is only available for CUDA tensors. For non-CUDA tensors, workers use
a CPU-context path that:

For non-CUDA tensors, CUDA IPC is not available. The CPU context path
provides a generic protocol where workers:
1. gathers KV blocks into CPU chunks,
2. transfers those chunks to/from the server through `CPUContext`,
3. scatters retrieved chunks back to worker KV tensors.

1. Gather KV blocks into CPU chunk (or memory object) tensors.
2. Transport those CPU chunks to the server storage through a concrete
`CPUContext` implementation.
3. Retrieve CPU chunks (or memory objects) from the server and scatter them
back into device KV tensors.
## Protocol overview

## Protocol additions

Three request types are used for non-CUDA mode (unchanged from the original
cpu context design):
Non-CUDA mode adds three request types:

- `REGISTER_KV_CACHE_CPU_CONTEXT`
- `STORE_CPU_CHUNKS`
- `RETRIEVE_CPU_CHUNKS`

These are registered in the MP server dispatch and have corresponding
payload/response contracts in the multiprocess protocol definitions.

## File structure

```
lmcache/v1/multiprocess/
├── cpu_context.py # CPUContextMetadata, CPUContext(ABC), factory, gather/scatter utils
└── cpu_context_pickle.py # CPUContextPickle — pickle-based concrete implementation
```

### `cpu_context.py`

Provides:

- **`CPUContextMetadata`** dataclass — layout metadata:

```python
@dataclass
class CPUContextMetadata:
    layout_desc: MemoryLayoutDesc
    block_size: int
    use_mla: bool
```

- **`CPUContext(ABC)`** — abstract base class with `mq_client` as a common
dependency. All concrete implementations share the same two-phase
`prepare/commit` interface:

```python
class CPUContext(ABC):
    def __init__(self, metadata: CPUContextMetadata, mq_client, mq_timeout: float): ...

    @abstractmethod
    def prepare_store(self, key, instance_id, chunks: list[torch.Tensor]) -> Any: ...
    @abstractmethod
    def commit_store(self, handle: Any) -> bool: ...
    @abstractmethod
    def prepare_retrieve(self, key, instance_id) -> tuple[Any, list[torch.Tensor] | None]: ...
    @abstractmethod
    def commit_retrieve(self, handle: Any) -> None: ...
    @abstractmethod
    def close(self) -> None: ...
```

- **`create_cpu_context()`** factory — currently always returns a
`CPUContextPickle` instance; a future SHM-capable PR can extend this to
probe for shared-memory availability and fall back to pickle.

- **Shared utility functions** used by all concrete implementations:
- `compute_kv_layout` — extract block size, layer count, hidden dim and
dtype from live KV tensors.
- `gather_paged_kv_to_cpu` — gather paged KV blocks into a list of CPU
tensors (one per LMCache chunk).
- `scatter_cpu_to_paged_kv` — scatter CPU chunk tensors back into paged
KV tensors.

### `cpu_context_pickle.py`

Provides **`CPUContextPickle(CPUContext)`**:

| Phase | What happens |
|---|---|
| `prepare_store` | `pickle.dumps(chunks)` → returns `(key, instance_id, bytes)` as opaque handle |
| `commit_store` | sends `STORE_CPU_CHUNKS` via `mq_client`, blocks for server ack, returns `bool` |
| `prepare_retrieve` | sends `RETRIEVE_CPU_CHUNKS` via `mq_client`, blocks for response, `pickle.loads` → returns `(None, chunks)` or `(None, None)` on miss |
| `commit_retrieve` | no-op (pickle path holds no server-side locks) |
| `close` | no-op |
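
The phases in the table above can be sketched with the standard library alone. This is a minimal illustration, not the LMCache implementation: `FakeMQClient`, its `store`/`retrieve` methods, and `PickleContextSketch` are hypothetical stand-ins, and plain Python lists take the place of `torch.Tensor` chunks.

```python
import pickle


class FakeMQClient:
    """Hypothetical in-process stand-in for the real message-queue client."""

    def __init__(self):
        self._storage = {}

    def store(self, key, instance_id, payload):
        # Stands in for a STORE_CPU_CHUNKS request followed by a server ack.
        self._storage[(key, instance_id)] = payload
        return True

    def retrieve(self, key, instance_id):
        # Stands in for a RETRIEVE_CPU_CHUNKS request; None models a miss.
        return self._storage.get((key, instance_id))


class PickleContextSketch:
    """Assumed shape of a pickle-based two-phase context (illustrative only)."""

    def __init__(self, mq_client):
        self._mq = mq_client

    def prepare_store(self, key, instance_id, chunks):
        # Serialize up front; the tuple is the opaque handle.
        return (key, instance_id, pickle.dumps(chunks))

    def commit_store(self, handle):
        key, instance_id, payload = handle
        return self._mq.store(key, instance_id, payload)

    def prepare_retrieve(self, key, instance_id):
        payload = self._mq.retrieve(key, instance_id)
        if payload is None:
            return None, None  # miss
        return None, pickle.loads(payload)

    def commit_retrieve(self, handle):
        return None  # nothing to release on the pickle path

    def close(self):
        return None


ctx = PickleContextSketch(FakeMQClient())
chunks = [[1.0, 2.0], [3.0, 4.0]]
stored = ctx.commit_store(ctx.prepare_store("req-0", 7, chunks))
_, hit = ctx.prepare_retrieve("req-0", 7)
_, miss = ctx.prepare_retrieve("unknown", 7)
```

Keeping serialization in `prepare_store` and the network round trip in `commit_store` is what lets another transport reuse the same call sequence.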

## Tensor/chunk contracts

Chunk formats are unchanged:
CPU-context registration uses scalar metadata (for example: `instance_id`,
`model_name`, `world_size`, `block_size`, `num_layers`, `hidden_dim_size`,
`dtype_str`, and `use_mla`) so server-side layout can be reconstructed without
transmitting pickled layout objects, reducing serialization coupling and
allowing server-side validation from explicit fields.
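
As a rough sketch of what registration from explicit scalar fields could look like, the dataclass below carries the fields listed above and derives a chunk shape from them. The class name `CPUContextRegistration` and the `validate` and `chunk_shape` helpers are hypothetical, not part of the LMCache protocol definitions; the shapes follow the non-MLA and MLA chunk formats stated in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CPUContextRegistration:
    """Hypothetical container for the scalar registration fields."""

    instance_id: int
    model_name: str
    world_size: int
    block_size: int
    num_layers: int
    hidden_dim_size: int
    dtype_str: str
    use_mla: bool

    def validate(self):
        # Explicit fields allow simple server-side checks before use.
        for name in ("world_size", "block_size", "num_layers", "hidden_dim_size"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")
        if not self.dtype_str:
            raise ValueError("dtype_str must be non-empty")

    def chunk_shape(self, chunk_tokens):
        # MLA: [num_layers, chunk_tokens, hidden_dim]
        # non-MLA: [2, num_layers, chunk_tokens, hidden_dim]
        base = (self.num_layers, chunk_tokens, self.hidden_dim_size)
        return base if self.use_mla else (2, *base)


reg = CPUContextRegistration(
    instance_id=0, model_name="example-model", world_size=1, block_size=16,
    num_layers=32, hidden_dim_size=1024, dtype_str="bfloat16", use_mla=False,
)
reg.validate()
shape = reg.chunk_shape(chunk_tokens=256)
```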

- non-MLA: `[2, num_layers, chunk_tokens, hidden_dim]`
- MLA: `[num_layers, chunk_tokens, hidden_dim]`
## Main components

Internal gather/scatter uses block-level indexing to avoid token-level slot
expansion and token-wise select/copy operations.
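
To illustrate block-level indexing, here is a deliberately simplified sketch: a single layer, nested Python lists instead of tensors, and hypothetical helper names (`gather_blocks`, `scatter_blocks`) that are not the real `gather_paged_kv_to_cpu` / `scatter_cpu_to_paged_kv` signatures. Each block is copied as a unit, so no per-token slot list is ever built.

```python
def gather_blocks(paged, block_ids, blocks_in_chunk):
    """Copy whole blocks into chunks; each chunk covers up to blocks_in_chunk blocks."""
    chunks = []
    for start in range(0, len(block_ids), blocks_in_chunk):
        chunk = []
        for block_id in block_ids[start:start + blocks_in_chunk]:
            # One block-sized copy, indexed by block id rather than by token slot.
            chunk.extend(list(row) for row in paged[block_id])
        chunks.append(chunk)
    return chunks


def scatter_blocks(paged, block_ids, chunks, blocks_in_chunk, block_size):
    """Inverse of gather_blocks: write each block-sized slice back to its block."""
    for chunk_idx, chunk in enumerate(chunks):
        first = chunk_idx * blocks_in_chunk
        for i, block_id in enumerate(block_ids[first:first + blocks_in_chunk]):
            rows = chunk[i * block_size:(i + 1) * block_size]
            paged[block_id] = [list(row) for row in rows]


# Four blocks of two tokens each, hidden dimension 1.
src = [[[0], [1]], [[2], [3]], [[4], [5]], [[6], [7]]]
dst = [[[0], [0]] for _ in range(4)]
block_ids = [3, 1, 0]

chunks = gather_blocks(src, block_ids, blocks_in_chunk=2)
scatter_blocks(dst, block_ids, chunks, blocks_in_chunk=2, block_size=2)
```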
- `cpu_context.py`
- defines `CPUContextMetadata`, the `CPUContext` abstraction, and shared
gather/scatter helpers.
- `cpu_context_pickle.py`
- current concrete `CPUContext` implementation.
- `transfer_context.py`
- dispatches between CUDA and CPU transfer paths.
- `vllm_multi_process_adapter.py`
- owns request lifecycle and future polling.
- `server.py`
- stores per-instance CPU metadata and handles CPU chunk store/retrieve
requests.

## Layout handling
## Worker-side behavior

Supported KV formats in CPU gather/scatter:
`create_transfer_context(kv_caches)` selects transport by device type:

- `NL_X_TWO_NB_BS_NH_HS` (NHD)
- `NL_X_NB_TWO_BS_NH_HS` (NHD flashinfer)
- `NL_X_TWO_NB_NH_BS_HS` (HND)
- `NL_X_NB_TWO_NH_BS_HS` (HND flashinfer)
- `NL_X_NB_BS_HS` (MLA)
- CUDA tensors -> `CudaTransferContext`
- non-CUDA tensors -> `CPUTransferContext`
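
The dispatch rule can be sketched as a small function over device-type strings. `select_transfer_path` and its return labels are hypothetical, and rejecting a mix of CUDA and non-CUDA tensors is an assumption; the document only describes the all-CUDA and all-non-CUDA cases.

```python
def select_transfer_path(device_types):
    """Pick a transfer path from the device types of the KV cache tensors."""
    kinds = set(device_types)
    if not kinds:
        raise ValueError("no KV caches provided")
    if kinds == {"cuda"}:
        return "cuda_ipc"      # would map to CudaTransferContext
    if "cuda" not in kinds:
        return "cpu_context"   # would map to CPUTransferContext
    # Assumption: mixed device sets are treated as an error.
    raise ValueError(f"mixed CUDA and non-CUDA devices: {sorted(kinds)}")


cuda_path = select_transfer_path(["cuda", "cuda"])
xpu_path = select_transfer_path(["xpu"])
```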

## Worker adapter integration
The adapter keeps ownership of request completion tracking via
`store_futures` and `retrieve_futures`.

The adapter now delegates transport behavior to
`lmcache/v1/multiprocess/transfer_context.py`.
For CPU mode, store/retrieve execution is synchronous inside
`CPUTransferContext` (gather/scatter plus MQ interaction), and the transfer
methods return resolved futures so adapter-side completion flow stays uniform
across CUDA and non-CUDA modes. Here, "resolved futures" means the futures are
already completed when returned (no background async work pending in the CPU
path).
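
A resolved future can be illustrated with `concurrent.futures.Future`. Which future type the adapter actually uses is not stated here, so treat that choice, along with the helper names `resolved_future` and `submit_store_sync`, as assumptions.

```python
from concurrent.futures import Future


def resolved_future(value):
    """Return a future that is already completed with value."""
    fut = Future()
    fut.set_result(value)
    return fut


def submit_store_sync(do_store):
    """Run the store synchronously, then hand back a completed future."""
    try:
        ok = bool(do_store())
    except Exception:
        # In this sketch a failed store is reported as False, not raised.
        ok = False
    return resolved_future(ok)


def failing_store():
    raise RuntimeError("simulated transport failure")


fut_ok = submit_store_sync(lambda: True)
fut_failed = submit_store_sync(failing_store)
```

Because the future is complete on return, a caller that polls or waits on it sees the result immediately, which keeps the completion flow identical to the asynchronous CUDA path.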

`create_transfer_context(kv_caches)` centralizes device dispatch:
## Server-side behavior

- all CUDA → existing CUDA IPC registration and store/retrieve path
- all non-CUDA → `CPUTransferContext` with cpu context registration and CPU context store/retrieve path
`MPCacheEngine` maintains CPU-context metadata per worker instance and uses that
metadata to resolve layout for CPU chunk writes/reads.

`LMCacheMPSchedulerAdapter` now holds `self.transfer_ctx: TransferContext | None`
and calls:
Server handlers:

- `transfer_ctx.register(...)`
- `transfer_ctx.submit_store(...)`
- `transfer_ctx.submit_retrieve(...)`
- `transfer_ctx.poll_finished()` (healthy) or `transfer_ctx.drain_all()` (unhealthy)
- register CPU-context metadata,
- store worker-provided CPU chunks into storage,
- retrieve CPU chunks from storage and return them to workers.

### Store path (non-CUDA)
Cleanup removes CPU-context state on unregister.

```python
# CPUTransferContext.submit_store
cpu_chunks = gather_paged_kv_to_cpu(kv_caches, block_ids, blocks_in_chunk, ...)
handle = self._cpu_context.prepare_store(key, instance_id, cpu_chunks)
ok = self._cpu_context.commit_store(handle) # synchronous; blocks for server ack
self._store_done[request_id] = ok
```
## Format and compatibility notes

`CPUTransferContext.poll_finished()` drains `_store_done` on each call.

### Retrieve path (non-CUDA)

```python
# CPUTransferContext.submit_retrieve
handle, chunks = self._cpu_context.prepare_retrieve(key, instance_id)  # synchronous
ok = chunks is not None
if chunks is not None:
    try:
        scatter_cpu_to_paged_kv(kv_caches, block_ids, chunks, blocks_in_chunk,
                                skip_first_n_tokens=skip_first_n_tokens, ...)
    except (RuntimeError, ValueError, TypeError, IndexError):
        ok = False
    self._cpu_context.commit_retrieve(handle)
self._retrieve_done[request_id] = (ok, block_ids)
```

`CPUTransferContext.poll_finished()` drains `_retrieve_done` on each call.
The adapter passes `op.skip_first_n_tokens` into
`transfer_ctx.submit_retrieve(..., skip_first_n_tokens=...)`.

The retrieve is **synchronous inside `CPUTransferContext.submit_retrieve`**;
`poll_finished()` just drains request ids recorded by submit methods.
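
The drain-on-poll behaviour can be sketched as follows. `DoneTracker` is a hypothetical class, and the pair of dicts returned by `poll_finished` is an assumed return shape, not the actual `CPUTransferContext` interface.

```python
class DoneTracker:
    """Records results of synchronous submits and drains them on poll."""

    def __init__(self):
        self._store_done = {}
        self._retrieve_done = {}

    def record_store(self, request_id, ok):
        self._store_done[request_id] = ok

    def record_retrieve(self, request_id, ok, block_ids):
        self._retrieve_done[request_id] = (ok, list(block_ids))

    def poll_finished(self):
        # Swap in fresh dicts so each result is reported exactly once.
        stores, self._store_done = self._store_done, {}
        retrieves, self._retrieve_done = self._retrieve_done, {}
        return stores, retrieves


tracker = DoneTracker()
tracker.record_store("req-1", True)
tracker.record_retrieve("req-2", False, [4, 5])
first_poll = tracker.poll_finished()
second_poll = tracker.poll_finished()
```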

## Server integration

`MPCacheEngine` holds:

- `cpu_contexts: dict[int, CPUContextMetadata]` — per-instance metadata.
- `cpu_context_meta: dict[int, tuple[str, int]]` — per-instance
`(model_name, world_size)` for layout resolution.

Server-side handler methods are unchanged:
- `register_kv_cache_cpu_context` — stores `CPUContextMetadata` in `cpu_contexts`.
- `store_cpu_chunks` — unpickles payload, copies tensors into storage.
- `retrieve_cpu_chunks` — reads from storage, pickles tensors, returns bytes.

Additional integration points:

- Unregister cleanup removes both `cpu_contexts` and `cpu_context_meta`.
- Layout lookup via `_find_layout_desc` resolves both GPU and CPU context
registrations.
- Status reporting (`report_status`) includes `registered_cpu_instance_ids`
and `cpu_context_meta`.

## CUDA vs non-CUDA state machine

```text
                    register_kv_caches()
                              |
                              v
             create_transfer_context(kv_caches)
                              |
             +----------------+----------------+
             |                                 |
             v                                 v
     [device == cuda]                  [device != cuda]
             |                                 |
             v                                 v
CudaTransferContext.register()    CPUTransferContext.register()
REGISTER_KV_CACHE (CUDA IPC)      REGISTER_KV_CACHE_CPU_CONTEXT (CPU metadata)
             |                      + create_cpu_context()
             +----------------+----------------+
                              |
                              v
                      [READY / SERVING]
                              |
             +----------------+----------------+
             |                                 |
             v                                 v
transfer_ctx.submit_store()       transfer_ctx.submit_store()
             |                                 |
             v                                 v
     STORE (GPU -> L1)            gather_paged_kv_to_cpu()
             |                      + _cpu_context.prepare_store()
             v                      + _cpu_context.commit_store() [sync]
          [READY]                 _store_done[id] = ok
             |                                 |
             +----------------+----------------+
                              |
                              v
       transfer_ctx.submit_retrieve() + get_finished()
                              |
             +----------------+----------------+
             |                                 |
             v                                 v
   RETRIEVE (L1 -> GPU)           _cpu_context.prepare_retrieve() [sync]
      [async future]                + scatter_cpu_to_paged_kv()
                                    + _cpu_context.commit_retrieve()
                                  _retrieve_done[id] = (ok, block_ids)
             |                                 |
             +----------------+----------------+
                              |
                              v
                      [READY / SERVING]
                              |
                              v
                    unregister_kv_cache()
                              |
                              v
                        [TERMINATED]
```

## Future extension: CPUContextShm

The `CPUContext` base class is designed to accommodate a shared-memory
implementation in a future PR with minimal changes:

| Phase | Pickle | SHM (future) |
|---|---|---|
| `prepare_store` | `pickle.dumps` | MQ `PREPARE_STORE` → slot metadata → memcpy |
| `commit_store` | MQ `STORE_CPU_CHUNKS` | MQ `COMMIT_STORE` |
| `prepare_retrieve` | MQ `RETRIEVE_CPU_CHUNKS` + `pickle.loads` | MQ `PREPARE_RETRIEVE` → tensor views from SHM |
| `commit_retrieve` | no-op | MQ `FINISH_READ` (release read lock) |

The `create_cpu_context()` factory will probe for SHM availability and fall
back to pickle when SHM is unavailable.
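
One way such a probe-and-fall-back factory might look is sketched below. Everything here is an assumption: `shm_available` and `choose_cpu_context_kind` are hypothetical names, the probe simply tries to create and release a small segment with the standard `multiprocessing.shared_memory` module, and the real SHM design may detect availability differently.

```python
def shm_available(size=1024):
    """Probe shared memory by creating and releasing a small segment."""
    try:
        from multiprocessing import shared_memory
        segment = shared_memory.SharedMemory(create=True, size=size)
    except (ImportError, OSError, ValueError):
        return False
    segment.close()
    segment.unlink()
    return True


def choose_cpu_context_kind(probe=shm_available):
    """Return the transport label a factory could use to pick an implementation."""
    return "shm" if probe() else "pickle"


forced_fallback = choose_cpu_context_kind(probe=lambda: False)
detected = choose_cpu_context_kind()
```

Passing the probe as a parameter keeps the fallback path testable on hosts where shared memory happens to be available.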
- Chunk tensor layout remains consistent with gather/scatter contracts:
non-MLA chunks are 4D (`[2, num_layers, chunk_tokens, hidden_dim]`) and MLA
chunks are 3D (`[num_layers, chunk_tokens, hidden_dim]`).
- Existing CUDA IPC semantics are unchanged.
- CPU-context logic remains isolated from shared GPU connector utilities.

## Validation coverage

`tests/v1/multiprocess/test_cpu_context.py` covers:

- CPU wrapper behavior (`wrap_kv_caches` with cpu context mode)
- NHD and MLA gather/scatter round-trip
- HND round-trip for both HND formats
- `skip_first_n_tokens` behavior
- Server-side register/store/retrieve flow
Tests cover:

`tests/v1/test_vllm_mp_adapter.py` covers transfer-context integration,
including CPU registration path (`REGISTER_KV_CACHE_CPU_CONTEXT`) and
store/retrieve submit delegation.
- CPU gather/scatter correctness across supported layouts,
- CPU registration and server store/retrieve flow,
- adapter integration with transfer-context submit/get-finished behavior.

## Non-goals
## Future extension

- No change to existing CUDA IPC path semantics.
- No CPU-specific logic added to shared `gpu_connector/utils.py`.
The `CPUContext` abstraction is designed to support additional transports
(e.g. shared-memory-based implementations) with minimal adapter/server flow
changes.