Add SHM-based non-GPU KV transport without changing existing NonGpuContext interfaces by Copilot · Pull Request #278 · hlin99/LMCache

Copilot · 2026-05-20T05:43:59Z

This PR adds a shared-memory (SHM) transport path on top of the existing pickle-based non-GPU context flow, while keeping the current NonGpuContext ABC/protocol contracts intact. It introduces SHM slot-based prepare/commit semantics with backward-compatible fallback to pickle mode.

Worker-side non-GPU context
- Added NonGpuContextShm in lmcache/v1/multiprocess/non_gpu_context_shm.py.
- Uses POSIX SHM via os.open("/dev/shm/...") + mmap (no multiprocessing.shared_memory.SharedMemory).
- The SHM attach path follows the mmap fix pattern from PR #273 ("Add shared-memory (SHM) transport for non-GPU context"): open the shm fd, attach with mmap, then close the fd immediately to avoid lifecycle and resource-tracker issues.
- Implements:
  - prepare_store: requests SHM slots and returns tensor views backed by SHM.
  - commit_store: sends COMMIT_STORE with empty bytes (b"") as SHM commit signal.
  - prepare_retrieve: requests SHM slots for retrieval and returns SHM-backed tensor views.
  - commit_retrieve: notifies server to release read locks.
- Added _make_tensor_view(...) built on torch.frombuffer.
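
The attach sequence in the list above (open the segment file, map it, close the descriptor right away) can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's code: `attach_segment` and `make_view` are hypothetical names, a temporary file stands in for a `/dev/shm/...` path, and a `memoryview` slice stands in for the `torch.frombuffer` view that `_make_tensor_view` builds.

```python
import mmap
import os
import tempfile


def attach_segment(path: str, size: int) -> mmap.mmap:
    """Map `size` bytes of an existing file and close the fd immediately."""
    fd = os.open(path, os.O_RDWR)
    try:
        # The mapping stays valid after the descriptor is closed.
        return mmap.mmap(fd, size)
    finally:
        os.close(fd)


def make_view(buf: mmap.mmap, offset: int, length: int) -> memoryview:
    """Return a zero-copy window over buf[offset:offset + length]."""
    if offset < 0 or length < 0 or offset + length > len(buf):
        raise ValueError("slot lies outside the mapped pool")
    return memoryview(buf)[offset:offset + length]


# Demo: a temporary file plays the role of the shared segment.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(4096)
    demo_path = f.name

pool = attach_segment(demo_path, 4096)
view = make_view(pool, 128, 4)
view[:] = b"\x01\x02\x03\x04"       # write through the view
first_bytes = bytes(pool[128:132])  # read back through the mapping
view.release()                      # views must be released before close()
pool.close()
os.unlink(demo_path)
```

The bounds check in `make_view` is there because slot offsets arrive from another process; a view that silently ran past the pool would be hard to debug.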
Context factory + registration handshake
- Updated create_non_gpu_context(...) to accept shm_name and pool_size; chooses SHM context only when both are valid, otherwise keeps pickle path.
- Added RegisterNonGpuContextResponse in protocol definitions and wired REGISTER_KV_CACHE_NON_GPU_CONTEXT to return it.
- Updated DataTransferContext.register to consume registration response and instantiate either SHM or pickle context accordingly.
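
The selection rule described above can be pictured as follows. This is a minimal sketch, assuming "valid" means a non-empty name and a positive pool size; `choose_context`, `PickleContextStub` and `ShmContextStub` are hypothetical stand-ins for the real factory and context classes.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class RegisterNonGpuContextResponse:
    shm_name: str = ""
    pool_size: int = 0


class PickleContextStub:
    """Placeholder for the pickle-based context (hypothetical)."""


class ShmContextStub:
    """Placeholder for the SHM-based context (hypothetical)."""

    def __init__(self, shm_name: str, pool_size: int) -> None:
        self.shm_name = shm_name
        self.pool_size = pool_size


def choose_context(
    resp: Optional[RegisterNonGpuContextResponse],
) -> Union[PickleContextStub, ShmContextStub]:
    """Pick SHM only when the server reported a usable pool."""
    if resp is not None and resp.shm_name and resp.pool_size > 0:
        return ShmContextStub(resp.shm_name, resp.pool_size)
    # Missing response or default fields: keep the pickle path.
    return PickleContextStub()
```

Because the dataclass defaults are an empty name and a zero size, a server with no SHM pool produces a response that falls through to the pickle path without any extra flag.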
Server SHM protocol behavior
- register_kv_cache_non_gpu_context now returns SHM pool metadata from storage manager.
- prepare_store (SHM mode): reserves write objects and returns slot metadata (offset, length, shape, dtype) in PrepareStoreResponse.context["slots"].
- commit_store (SHM mode): treats empty payload as SHM commit and finalizes finish_write without deserialization.
- prepare_retrieve (SHM mode): reads already-prefetched objects and returns slot metadata in PrepareRetrieveResponse.context["slots"].
- commit_retrieve (SHM mode): releases prefetched read locks.
- Added internal pending SHM transfer tracking for write/read commit coordination.
Transport-path visibility (logging)
- Added explicit logger.info messages on both worker and server sides indicating whether non-GPU transfer is using SHM or pickle transport.
- SHM logs include pool metadata (shm_name, pool_size) when active.
SHM pool plumbing in memory/storage managers
- L1MemoryManagerConfig now includes shm_name (default lmcache_l1_pool_<pid>).
- create_memory_allocator now attempts SHM pool setup for non-lazy L1 allocator with graceful fallback to regular pickle-compatible allocator on failure.
- Added SHM helpers in memory manager:
  - _check_shm_capacity(required_bytes)
  - _unlink_stale_shm(shm_name) (with name/path safety checks)
- Added get_shm_pool_info() delegation chain:
  - L1MemoryManager → L1Manager → StorageManager.
- Added StorageManager.unsafe_read(keys) for SHM retrieve path (no additional read-lock acquisition).
- Added MemoryObj properties:
  - shm_offset (from meta.address)
  - shm_byte_length (from get_size())
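
The two SHM helpers listed above could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, and the exact name pattern accepted by `_unlink_stale_shm` is not given in this PR, so the regular expression below is only a guess at what a "name/path safety check" might enforce.

```python
import os
import re
import shutil

# Assumed rule: the pool prefix followed by plain word characters,
# which rules out path separators and "..".
_SAFE_NAME = re.compile(r"lmcache_l1_pool_[A-Za-z0-9_]+")


def is_safe_shm_name(shm_name: str) -> bool:
    """Accept only names that cannot escape the SHM directory."""
    return _SAFE_NAME.fullmatch(shm_name) is not None


def has_capacity(required_bytes: int, root: str = "/dev/shm") -> bool:
    """Report whether `root` exists and has enough free space."""
    if not os.path.isdir(root):
        return False
    return shutil.disk_usage(root).free >= required_bytes
```

Checking capacity before creating the pool lets the allocator fall back to the regular path up front rather than fail later when pages are first touched.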
New focused tests
- Added tests/v1/distributed/test_shm_l1_pool.py covering:
  - SHM segment creation and cleanup.
  - Tensor view reconstruction from SHM buffer.
  - Basic NonGpuContextShm prepare/commit store/retrieve flow with mocked MQ.

# registration response now carries SHM pool info
@dataclass
class RegisterNonGpuContextResponse:
    shm_name: str = ""
    pool_size: int = 0

# context selection remains backward-compatible
ctx = create_non_gpu_context(
    metadata,
    mq_client,
    mq_timeout,
    shm_name=shm_name,
    pool_size=pool_size,
)

Original prompt

Goal

Add a shared-memory (SHM) based NonGpuContext implementation on top of the existing ww20_PR_cpu_context_pickle branch. The existing ABC interface and protocol definitions MUST NOT be changed — only new code should be added.

Existing Interface (DO NOT CHANGE)

The ABC in lmcache/v1/multiprocess/non_gpu_context.py already defines:

prepare_store(key, instance_id) -> list[torch.Tensor] | None
commit_store(key, instance_id, chunks) -> bool
prepare_retrieve(key, instance_id) -> list[torch.Tensor] | None
commit_retrieve(key, instance_id) -> bool

The protocols in lmcache/v1/multiprocess/protocols/engine.py already have PrepareStoreResponse (with context: dict) and PrepareRetrieveResponse (with success, data, context).

The server in lmcache/v1/multiprocess/server.py already has prepare_store, commit_store, prepare_retrieve, commit_retrieve handlers (pickle-only).

Changes Required

1. New file: `lmcache/v1/multiprocess/non_gpu_context_shm.py`

Create NonGpuContextShm(NonGpuContext) that:

Attaches to a named POSIX shared-memory segment via mmap (NOT multiprocessing.shared_memory.SharedMemory to avoid resource_tracker unlinking on worker exit)
prepare_store: sends PREPARE_STORE RPC, parses response.context["slots"] to create tensor views into shared memory, returns them as out-buffers
commit_store: sends COMMIT_STORE with empty bytes (data already in SHM), notifies server
prepare_retrieve: sends PREPARE_RETRIEVE RPC, parses response.context["slots"] to create tensor views
commit_retrieve: sends COMMIT_RETRIEVE to release read locks on server
Helper _make_tensor_view(offset, length, shape, dtype_str) -> torch.Tensor using torch.frombuffer

2. Update `lmcache/v1/multiprocess/non_gpu_context.py`

Update create_non_gpu_context factory to accept optional shm_name: str = "" and pool_size: int = 0 parameters. If both are provided and valid, return NonGpuContextShm; otherwise return NonGpuContextPickle.

3. Add `RegisterNonGpuContextResponse` to `lmcache/v1/multiprocess/protocols/engine.py`

Add a dataclass:

@dataclass
class RegisterNonGpuContextResponse:
    shm_name: str = ""
    pool_size: int = 0

Update the REGISTER_KV_CACHE_NON_GPU_CONTEXT protocol definition's response_class from None to RegisterNonGpuContextResponse.

4. Update server: `lmcache/v1/multiprocess/server.py`

Change register_kv_cache_non_gpu_context return type to RegisterNonGpuContextResponse
It should return SHM pool info from self.storage_manager.get_shm_pool_info() (if available)
Update prepare_store to check if SHM is active; if so, resolve obj_keys, call storage_manager.reserve_write, and return slot metadata (offset, length, shape, dtype) in PrepareStoreResponse.context["slots"]
Update commit_store: if cpu_data is empty bytes, treat as SHM mode (data already written), just call storage_manager.finish_write
Update prepare_retrieve: if SHM active, return slot metadata in PrepareRetrieveResponse.context["slots"] instead of serialized data
Update commit_retrieve: if SHM active, release read locks via storage_manager.finish_read_prefetched

5. Update `lmcache/v1/multiprocess/transfer_context.py`

In DataTransferContext.register:

After future.result(), parse RegisterNonGpuContextResponse to extract shm_name and pool_size
Pass them to create_non_gpu_context(..., shm_name=shm_name, pool_size=pool_size)

6. SHM pool infrastructure in memory/storage managers

In lmcache/v1/distributed/memory_manager.py:

Add helper _check_shm_capacity(required_bytes) that checks /dev/shm free space
Add helper _unlink_stale_shm(shm_name) that removes stale lmcache_l1_pool_* segments
In create_memory_allocator, if config.shm_name is set and not lazy mode, try to set up SHM (with graceful fallback to pickle on failure)
Pass shm_name to MixedMemoryAllocator constructor (add as optional kwarg)
Add get_shm_pool_info() -> dict to MemoryManager class

In lmcache/v1/distributed/config.py:

Add shm_name: str field to L1MemoryManagerConfig with default f"lmcache_l1_pool_{os.getpid()}"

In lmcache/v1/distributed/storage_manager.py:

Add get_shm_pool_info() -> dict delegating to l1_manager
Add unsafe_read(keys) -> tuple[list[ObjectKey], list[MemoryObj]] for SHM retrieve without re-locking

In lmcache/v1/distributed/l1_manager.py:

Add get_shm_pool_info() -> dict delegating to memory_manager

In lmcache/v1/memory_management.py:

Add properties shm_offset -> int (returns self.meta.address) and shm_byte_length -> int (returns self.get_size()) to the MemoryObj base class

7. Tests: `tests/v1/distributed/test_shm_l1_pool.py`

Add unit tests verifying:

SHM segment creation and cleanup
Tensor view creation from SHM buffer
Basic store/retrieve flow with NonGpuContextShm (mocked MQ)

Key Design Princip...

This pull request was created from Copilot chat.

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/07c7d0ab-d21a-4245-9109-006f91352b6c Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

hlin99

Can you follow the last commit in https://github.com/hlin99/LMCache/pull/273/commits and use mmap?
Also, have the worker and server sides print a logger.info line saying whether the pickle path or the SHM path is in use.

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/6bb8fb82-c368-43a5-a4b1-83bfedecc1a6 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot · 2026-05-20T06:21:06Z

Can you follow the last commit in https://github.com/hlin99/LMCache/pull/273/commits and use mmap?

Also, have the worker and server sides print a logger.info line saying whether the pickle path or the SHM path is in use.

Done in 1972767 (plus 2b1f1d9). I aligned the SHM attach logic with PR #273’s mmap-style approach (open /dev/shm/..., mmap, then close fd immediately), and added explicit logger.info on both worker and server for transport selection (pickle vs SHM, including SHM name/pool size when active).

hlin99

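The server is started once and left running; the worker is started twice, one run after the other, and the exact same prompt is sent each time. The first run works fine, but after the second worker start the following error appears.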
(EngineCore pid=28702) [2026-05-20 06:32:33,134] LMCache INFO: Currently used by:
(EngineCore pid=28702) - vLLM non-MLA flash attention (utils.py:413:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:33,135] LMCache INFO: Creating NonGpuContextShm (shm_name=lmcache_l1_pool_26917, pool_size=21474836480) (non_gpu_context.py:128:lmcache.v1.multiprocess.non_gpu_context)
(EngineCore pid=28702) INFO 05-20 06:32:33 [gpu_worker.py:578] Compile and warming up model for size 2048
(EngineCore pid=28702) WARNING 05-20 06:32:33 [gpu_model_runner.py:5965] Skipping CUDA graph capture. To turn on CUDA graph capture, ensure cudagraph_mode was not manually set to NONE
(EngineCore pid=28702) INFO 05-20 06:32:33 [core.py:283] init engine (profile, create kv cache, warmup model) took 5.00 seconds
(EngineCore pid=28702) INFO 05-20 06:32:34 [factory.py:64] Creating v1 connector with name: LMCacheMPConnector and engine_id: 37ebcb6d-db18-410f-86ab-df39ef06e456
(EngineCore pid=28702) WARNING 05-20 06:32:34 [base.py:189] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
(EngineCore pid=28702) INFO 05-20 06:32:34 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=28702) WARNING 05-20 06:32:34 [xpu.py:181] XPU Graph is not supported in the current PyTorch version, disabling cudagraph_mode.
(APIServer pid=28433) INFO 05-20 06:32:34 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=28433) WARNING 05-20 06:32:34 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's generation_config.json: {'temperature': 0.6, 'top_p': 0.9}. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=28433) INFO 05-20 06:32:35 [hf.py:314] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=28433) INFO 05-20 06:32:35 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:37] Available routes are:
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=28433) INFO: Started server process [28433]
(APIServer pid=28433) INFO: Waiting for application startup.
(APIServer pid=28433) INFO: Application startup complete.
(EngineCore pid=28702) [2026-05-20 06:32:49,047] LMCache INFO: PeriodicThread lmcache-heartbeat entering main loop (interval=10.0s) (periodic_thread.py:279:lmcache.v1.periodic_thread)
(EngineCore pid=28702) [2026-05-20 06:32:49,047] LMCache INFO: Started PeriodicThread: lmcache-heartbeat (level=critical, interval=10.0s, init_wait=0.0s) (periodic_thread.py:239:lmcache.v1.periodic_thread)
(EngineCore pid=28702) [2026-05-20 06:32:49,050] LMCache INFO: PeriodicThread lmcache-heartbeat entering main loop (interval=10.0s) (periodic_thread.py:279:lmcache.v1.periodic_thread)
(EngineCore pid=28702) [2026-05-20 06:32:49,050] LMCache INFO: Started PeriodicThread: lmcache-heartbeat (level=critical, interval=10.0s, init_wait=0.0s) (periodic_thread.py:239:lmcache.v1.periodic_thread)
(EngineCore pid=28702) [2026-05-20 06:32:49,052] LMCache INFO: list_depth: 1, tensor_dim: 5 (utils.py:509:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:49,052] LMCache INFO: GPU KV Cache Dimensions: [32][2, 372, 64, 8, 128] (utils.py:520:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:49,052] LMCache INFO: vLLM KV cache layout: NHD (utils.py:534:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:49,052] LMCache INFO: GPU KV Format: NL x [2, NB, BS, NH, HS] (utils.py:412:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:49,052] LMCache INFO: Currently used by:
(EngineCore pid=28702) - vLLM non-MLA flash attention (utils.py:413:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,916] LMCache INFO: list_depth: 1, tensor_dim: 5 (utils.py:509:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,917] LMCache INFO: GPU KV Cache Dimensions: [32][2, 372, 64, 8, 128] (utils.py:520:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,917] LMCache INFO: vLLM KV cache layout: NHD (utils.py:534:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,917] LMCache INFO: GPU KV Format: NL x [2, NB, BS, NH, HS] (utils.py:412:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,917] LMCache INFO: Currently used by:
(EngineCore pid=28702) - vLLM non-MLA flash attention (utils.py:413:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,926] LMCache ERROR: Something went wrong when processing the store request for request_id=cmpl-a2ea2c8d4c493a4e-0-807a58c8 (vllm_multi_process_adapter.py:1155:lmcache.integration.vllm.vllm_multi_process_adapter)
(APIServer pid=28433) INFO 05-20 06:32:55 [loggers.py:259] Engine 000: Avg prompt throughput: 1.6 tokens/s, Avg generation throughput: 12.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.5%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 97.6%

hlin99

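Task: port the SHM transport from PR #278 to the `ww20_PR_cpu_context_pickle` branch and fix known issues

Background

PR #278 (copilot/add-nongpucontext-shm-implementation) adds an SHM (shared memory) transport mode on top of the pickle path of the non-GPU context. The changes in that PR need to be cherry-picked / ported to the hlin99:ww20_PR_cpu_context_pickle branch, and all of the issues below, found during code review, must be fixed at the same time.

🔴 Critical Issues (must fix)

1. Idempotency bug: a store request for a repeated prompt fails

How to reproduce: keep the server running without a restart, start the worker twice in succession, and send the exact same prompt both times. The second store fails with "Something went wrong when processing the store request".

Root cause: on the second store the key already exists, so reserve_write(..., "new") returns an empty dict. prepare_store then stores an empty reserved_keys list in _pending_shm_writes, and commit_store subsequently hits `not reserved_keys` and returns False.

Fix:

server.py prepare_store: if reserved is empty (not reserved_keys), do not store anything in _pending_shm_writes; return PrepareStoreResponse(context={}) directly (without a "slots" key).
server.py commit_store: when cpu_data == b"" and SHM is active, return True if not reserved_keys (idempotent semantics: key already exists = store succeeded) rather than False.
Worker-side NonGpuContextShm.prepare_store: if the response context has no "slots" key or slots is an empty list, return None (so the caller skips the SHM write).

2. Calling `close()` twice raises `OSError: Bad file descriptor`

File: the close() method in non_gpu_context_shm.py

Fix: add a guard against double close: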

def close(self) -> None:
    if self._shm_fd < 0:
        return
    try:
        self._mmap_obj.close()
    finally:
        fd = self._shm_fd
        self._shm_fd = -1
        os.close(fd)
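
To see the guard in action, the same close() logic can be exercised on a small stand-in class. This is a sketch under assumptions: `MappedSegment` is a hypothetical class that keeps its descriptor open for its whole lifetime, and a temporary file replaces the real SHM segment.

```python
import mmap
import os
import tempfile


class MappedSegment:
    """Hypothetical holder of an fd plus its mapping."""

    def __init__(self, path: str, size: int) -> None:
        self._shm_fd = os.open(path, os.O_RDWR)
        try:
            self._mmap_obj = mmap.mmap(self._shm_fd, size)
        except Exception:
            # Do not leak the descriptor if mapping fails.
            os.close(self._shm_fd)
            self._shm_fd = -1
            raise

    def close(self) -> None:
        if self._shm_fd < 0:
            return  # already closed: nothing to do
        try:
            self._mmap_obj.close()
        finally:
            # Mark as closed first so a failure in os.close cannot
            # leave a stale descriptor number behind.
            fd = self._shm_fd
            self._shm_fd = -1
            os.close(fd)


with tempfile.NamedTemporaryFile(delete=False) as f:
    f.truncate(4096)
    demo_path = f.name

seg = MappedSegment(demo_path, 4096)
seg.close()
seg.close()  # second call returns early instead of raising OSError
os.unlink(demo_path)
```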

3. _pending_shm_writes / _pending_shm_reads are not protected by a lock, so there is a race condition

File: server.py L262-264 and every read or write of these two dicts

Fix: add a threading.Lock (for example self._pending_shm_lock) and hold it in every code block that accesses _pending_shm_writes or _pending_shm_reads:

The write in prepare_store
The pop in commit_store
The write in prepare_retrieve
The pop in commit_retrieve
The filtering cleanup in unregister_kv_cache

4. _is_shm_active() queries the storage manager on every call, a hot-path performance problem

File: the _is_shm_active() method in server.py

Fix: initialize self._shm_active: bool = False in __init__, and update the flag after register_kv_cache_non_gpu_context registers successfully (based on the result of get_shm_pool_info()). All later _is_shm_active() calls return self._shm_active directly.

5. Memory leak when the worker times out after prepare_store has reserved memory

File: prepare_store + commit_store + unregister_kv_cache in server.py

Fix: in unregister_kv_cache, call self.storage_manager.finish_write(v) on each _pending_shm_writes entry being cleaned up, so that the reserved memory is released (rather than only deleting the entry from the dict):

# In unregister_kv_cache
stale_writes = {k: v for k, v in self._pending_shm_writes.items() if k[0] == instance_id}
for k, v in stale_writes.items():
    if v:
        self.storage_manager.finish_write(v)
    del self._pending_shm_writes[k]
# Handle _pending_shm_reads the same way (calling finish_read_prefetched)
stale_reads = {k: v for k, v in self._pending_shm_reads.items() if k[0] == instance_id}
for k, v in stale_reads.items():
    if v:
        self.storage_manager.finish_read_prefetched(v)
    del self._pending_shm_reads[k]
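
Items 3 and 5 interact: the cleanup has to run under the same lock as the prepare/commit handlers. One way to combine them is sketched below, assuming a hypothetical `PendingTransfers` helper and a `release` callback that plays the role of `finish_write` or `finish_read_prefetched`; it is not the server's actual structure.

```python
import threading
from typing import Any, Callable, Dict, Hashable, List, Optional, Tuple


class PendingTransfers:
    """Lock-protected map of (instance_id, key) -> reserved objects."""

    def __init__(self, release: Callable[[List[Any]], None]) -> None:
        self._lock = threading.Lock()
        self._pending: Dict[Tuple[int, Hashable], List[Any]] = {}
        self._release = release

    def add(self, instance_id: int, key: Hashable, reserved: List[Any]) -> None:
        with self._lock:
            self._pending[(instance_id, key)] = reserved

    def pop(self, instance_id: int, key: Hashable) -> Optional[List[Any]]:
        with self._lock:
            return self._pending.pop((instance_id, key), None)

    def drop_instance(self, instance_id: int) -> int:
        """Remove and release every entry owned by one instance."""
        with self._lock:
            stale = [k for k in self._pending if k[0] == instance_id]
            entries = [self._pending.pop(k) for k in stale]
        # Release outside the lock so the callback cannot block other handlers.
        for reserved in entries:
            if reserved:
                self._release(reserved)
        return len(entries)
```

Collecting the stale entries under the lock and releasing them afterwards keeps the critical section short, at the cost of a brief window in which an entry is gone from the map but not yet released.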
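
6. The SHM path of prepare_retrieve uses unsafe_read without validating its precondition

File: the SHM branch of prepare_retrieve in server.py

Fix: before the unsafe_read call, add a comment stating the precondition (the read lock was already acquired during the lookup phase), and add a defensive check: if unsafe_read returns fewer shm_memory_objs than obj_keys, make sure the keys already acquired are released before returning failure (the current code already has this logic; confirm it is kept).

🟡 Medium Issues (recommended)

7. _make_non_gpu_transfer_key includes key.token_ids (a large tuple), which makes hashing expensive

Fix: if IPCCacheEngineKey is hashable, use (instance_id, key) as the dict key. Otherwise consider (instance_id, key.request_id, key.start, key.end) as a lighter unique identifier.

8. Dead code: _check_shm_capacity in memory_manager.py is never called

Fix: delete the _check_shm_capacity function, or call it from create_memory_allocator in place of the inline logic.

9. Confirm that MixedMemoryAllocator supports the shm_name parameter

Fix: confirm that MixedMemoryAllocator.__init__ accepts a shm_name kwarg. If it does not, handling for that parameter must be added in this change; otherwise a TypeError is raised at runtime.

🟢 Minor Issues (optional)

10. Insufficient test coverage

Suggested additions to test_shm_l1_pool.py:

A test that NonGpuContextShm.__init__ raises when the SHM file does not exist
A test that create_non_gpu_context(shm_name="", pool_size=0) falls back to pickle
A test for storing a repeated key (idempotency)
A test that calling close() twice does not crash

11. non_gpu_context_shm.py uses torch.empty((), dtype=dtype).element_size() to get the itemsize

Style suggestion: this could be changed to torch.tensor([], dtype=dtype).element_size(), or left as is; either is fine.

Implementation requirements

Create the PR based on the hlin99:ww20_PR_cpu_context_pickle branch
Port all file changes from PR #278 (the copilot/add-nongpucontext-shm-implementation branch)
While porting, apply all of the 🔴 Critical and 🟡 Medium fixes above
Make sure the existing tests pass and the new SHM tests pass too
Suggested PR title: Port SHM transport to cpu_context_pickle branch with bug fixes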

Copilot · 2026-05-20T06:32:33Z

@hlin99 I've opened a new pull request, #280, to work on those changes. Once the pull request is ready, I'll request review from you.

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/0b3178a9-b4e2-411e-9e1d-1c8617b05893 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot · 2026-05-20T06:38:27Z

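The server is started once and left running; the worker is started twice, one run after the other, and the exact same prompt is sent each time. The first run works fine, but after the second worker start the following error appears.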
(EngineCore pid=28702) [2026-05-20 06:32...

Fixed in ee5eed6 (plus test follow-ups 26a0cc7, 067f3e4). Root cause was SHM commit_store returning False when prepare_store reserved zero keys (valid case when KV already exists, e.g. second identical prompt after worker restart). I changed server logic so: matching prepare + empty reservation => success no-op; missing prepare state => failure. Added regression coverage for both paths in test_server_shm_commit_store_allows_noop_when_all_keys_exist. UI screenshot: N/A (no UI changes).
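
The commit rule described in this reply (matching prepare with an empty reservation succeeds as a no-op; no matching prepare fails) can be written as a small decision function. This is a sketch only: `commit_store_shm`, the plain `dict` of pending writes and the `finish_write` callback are hypothetical stand-ins for the server's handler and storage manager.

```python
from typing import Any, Callable, Dict, Hashable, List

_MISSING = object()  # sentinel: distinguishes "no entry" from "empty entry"


def commit_store_shm(
    pending: Dict[Hashable, List[Any]],
    transfer_key: Hashable,
    finish_write: Callable[[List[Any]], None],
) -> bool:
    """Finalize an SHM write that was announced by prepare_store."""
    reserved = pending.pop(transfer_key, _MISSING)
    if reserved is _MISSING:
        # commit without a matching prepare: report failure.
        return False
    if reserved:
        finish_write(reserved)
    # An empty reservation means every key was already cached,
    # so the commit is a successful no-op.
    return True
```

The sentinel matters here: using `pending.pop(key, None)` followed by `if not reserved` would merge the two cases again, which is the shape of the original bug.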

…ctness gaps from #278 review (#280)

* Initial plan
* Fix SHM non-GPU transport idempotency, locking, and cleanup issues
  Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/661cbeee-d0d4-40ef-9312-4044e4696a51 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
* Polish SHM feedback fixes and align validation comments
  Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/661cbeee-d0d4-40ef-9312-4044e4696a51 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
* Fix memory leak: early-return from prepare_store when all keys exist
  When reserve_write returns empty (all object keys already cached), return PrepareStoreResponse(context={}) immediately without storing an entry in _pending_shm_writes. This prevents leaked entries that would never be popped since the worker won't call commit_store.
  Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/182111d5-1737-49c0-be65-0287d5b9d6c5 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
---------
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Initial plan

cf82347

Copilot AI assigned Copilot and hlin99 May 20, 2026

Copilot started work on behalf of hlin99 May 20, 2026 05:44 View session

Copilot AI and others added 2 commits May 20, 2026 05:57

Add SHM NonGpuContext and server/storage plumbing

0403c5b

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/07c7d0ab-d21a-4245-9109-006f91352b6c Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Harden SHM validation and address review feedback

18eeae1

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/07c7d0ab-d21a-4245-9109-006f91352b6c Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Add shared-memory based NonGpuContext implementation~~ Add SHM-based non-GPU KV transport without changing existing NonGpuContext interfaces May 20, 2026

Copilot finished work on behalf of hlin99 May 20, 2026 06:03

Copilot AI requested a review from hlin99 May 20, 2026 06:03

hlin99 requested changes May 20, 2026

Copilot started work on behalf of hlin99 May 20, 2026 06:10 View session

Copilot AI and others added 2 commits May 20, 2026 06:19

Align SHM mmap usage and add pickle/shm transport logs

2b1f1d9

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/6bb8fb82-c368-43a5-a4b1-83bfedecc1a6 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Simplify SHM mmap buffer handling

1972767

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/6bb8fb82-c368-43a5-a4b1-83bfedecc1a6 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot finished work on behalf of hlin99 May 20, 2026 06:21

Copilot AI requested a review from hlin99 May 20, 2026 06:21

hlin99 requested changes May 20, 2026

Copilot started work on behalf of hlin99 May 20, 2026 06:28 View session

hlin99 requested changes May 20, 2026

Copilot AI mentioned this pull request May 20, 2026

Port SHM non-GPU transport to cpu_context_pickle branch and fix correctness gaps from #278 review #280

Merged

1 task

Copilot AI and others added 3 commits May 20, 2026 06:33

Handle SHM no-op commit on repeated prompt store

ee5eed6

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/0b3178a9-b4e2-411e-9e1d-1c8617b05893 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Clarify SHM commit semantics for no-op writes

26a0cc7

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/0b3178a9-b4e2-411e-9e1d-1c8617b05893 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Expand SHM no-op regression test coverage

067f3e4

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/0b3178a9-b4e2-411e-9e1d-1c8617b05893 Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot finished work on behalf of hlin99 May 20, 2026 06:38

Copilot AI requested a review from hlin99 May 20, 2026 06:38

hlin99 closed this May 28, 2026

hlin99 deleted the copilot/add-nongpucontext-shm-implementation branch May 28, 2026 12:22

Copilot wants to merge 9 commits into ww20_PR_cpu_context_pickle from copilot/add-nongpucontext-shm-implementation
