Add SHM-based non-GPU KV transport without changing existing NonGpuContext interfaces #278

Closed
Copilot wants to merge 9 commits into ww20_PR_cpu_context_pickle from copilot/add-nongpucontext-shm-implementation


Copilot AI commented May 20, 2026

This PR adds a shared-memory (SHM) transport path on top of the existing pickle-based non-GPU context flow, while keeping the current NonGpuContext ABC/protocol contracts intact. It introduces SHM slot-based prepare/commit semantics with backward-compatible fallback to pickle mode.

  • Worker-side non-GPU context

    • Added NonGpuContextShm in lmcache/v1/multiprocess/non_gpu_context_shm.py.
    • Uses POSIX SHM via os.open("/dev/shm/...") + mmap (no multiprocessing.shared_memory.SharedMemory).
    • SHM attachment path follows the mmap fix pattern from PR #273 (Add shared-memory (SHM) transport for non-GPU context): open the shm fd, attach via mmap, then close the fd immediately to avoid lifecycle/resource-tracker issues.
    • Implements:
      • prepare_store: requests SHM slots and returns tensor views backed by SHM.
      • commit_store: sends COMMIT_STORE with empty bytes (b"") as SHM commit signal.
      • prepare_retrieve: requests SHM slots for retrieval and returns SHM-backed tensor views.
      • commit_retrieve: notifies server to release read locks.
    • Added _make_tensor_view(...) built on torch.frombuffer.
  • Context factory + registration handshake

    • Updated create_non_gpu_context(...) to accept shm_name and pool_size; chooses SHM context only when both are valid, otherwise keeps pickle path.
    • Added RegisterNonGpuContextResponse in protocol definitions and wired REGISTER_KV_CACHE_NON_GPU_CONTEXT to return it.
    • Updated DataTransferContext.register to consume registration response and instantiate either SHM or pickle context accordingly.
  • Server SHM protocol behavior

    • register_kv_cache_non_gpu_context now returns SHM pool metadata from storage manager.
    • prepare_store (SHM mode): reserves write objects and returns slot metadata (offset, length, shape, dtype) in PrepareStoreResponse.context["slots"].
    • commit_store (SHM mode): treats empty payload as SHM commit and finalizes finish_write without deserialization.
    • prepare_retrieve (SHM mode): reads already-prefetched objects and returns slot metadata in PrepareRetrieveResponse.context["slots"].
    • commit_retrieve (SHM mode): releases prefetched read locks.
    • Added internal pending SHM transfer tracking for write/read commit coordination.
  • Transport-path visibility (logging)

    • Added explicit logger.info messages on both worker and server sides indicating whether non-GPU transfer is using SHM or pickle transport.
    • SHM logs include pool metadata (shm_name, pool_size) when active.
  • SHM pool plumbing in memory/storage managers

    • L1MemoryManagerConfig now includes shm_name (default lmcache_l1_pool_<pid>).
    • create_memory_allocator now attempts SHM pool setup for non-lazy L1 allocator with graceful fallback to regular pickle-compatible allocator on failure.
    • Added SHM helpers in memory manager:
      • _check_shm_capacity(required_bytes)
      • _unlink_stale_shm(shm_name) (with name/path safety checks)
    • Added get_shm_pool_info() delegation chain:
      • L1MemoryManager → L1Manager → StorageManager.
    • Added StorageManager.unsafe_read(keys) for SHM retrieve path (no additional read-lock acquisition).
    • Added MemoryObj properties:
      • shm_offset (from meta.address)
      • shm_byte_length (from get_size())
  • New focused tests

    • Added tests/v1/distributed/test_shm_l1_pool.py covering:
      • SHM segment creation and cleanup.
      • Tensor view reconstruction from SHM buffer.
      • Basic NonGpuContextShm prepare/commit store/retrieve flow with mocked MQ.
# registration response now carries SHM pool info
@dataclass
class RegisterNonGpuContextResponse:
    shm_name: str = ""
    pool_size: int = 0

# context selection remains backward-compatible
ctx = create_non_gpu_context(
    metadata,
    mq_client,
    mq_timeout,
    shm_name=shm_name,
    pool_size=pool_size,
)
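
To make the fallback rule concrete, here is a minimal sketch of how the registration response could drive transport selection. `select_transport` is a hypothetical helper, and the validity test (non-empty name, positive size) is an assumption; the description only states that SHM is chosen when both values are valid.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegisterNonGpuContextResponse:
    """Same shape as the dataclass above; the defaults mean "no SHM pool"."""

    shm_name: str = ""
    pool_size: int = 0


def select_transport(resp: Optional[RegisterNonGpuContextResponse]) -> str:
    """Hypothetical helper: return "shm" or "pickle" for a registration reply."""
    # An older server may send no response object at all.
    if resp is None:
        return "pickle"
    # Assumed validity rule: a non-empty name and a positive pool size.
    if resp.shm_name and resp.pool_size > 0:
        return "shm"
    return "pickle"
```

Under these assumptions, a server that predates the handshake, or one whose SHM setup failed, lands on the pickle branch, which matches the backward-compatible behavior the description promises.
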
Original prompt

Goal

Add a shared-memory (SHM) based NonGpuContext implementation on top of the existing ww20_PR_cpu_context_pickle branch. The existing ABC interface and protocol definitions MUST NOT be changed — only new code should be added.

Existing Interface (DO NOT CHANGE)

The ABC in lmcache/v1/multiprocess/non_gpu_context.py already defines:

  • prepare_store(key, instance_id) -> list[torch.Tensor] | None
  • commit_store(key, instance_id, chunks) -> bool
  • prepare_retrieve(key, instance_id) -> list[torch.Tensor] | None
  • commit_retrieve(key, instance_id) -> bool

The protocols in lmcache/v1/multiprocess/protocols/engine.py already have PrepareStoreResponse (with context: dict) and PrepareRetrieveResponse (with success, data, context).

The server in lmcache/v1/multiprocess/server.py already has prepare_store, commit_store, prepare_retrieve, commit_retrieve handlers (pickle-only).

Changes Required

1. New file: lmcache/v1/multiprocess/non_gpu_context_shm.py

Create NonGpuContextShm(NonGpuContext) that:

  • Attaches to a named POSIX shared-memory segment via mmap (NOT multiprocessing.shared_memory.SharedMemory to avoid resource_tracker unlinking on worker exit)
  • prepare_store: sends PREPARE_STORE RPC, parses response.context["slots"] to create tensor views into shared memory, returns them as out-buffers
  • commit_store: sends COMMIT_STORE with empty bytes (data already in SHM), notifies server
  • prepare_retrieve: sends PREPARE_RETRIEVE RPC, parses response.context["slots"] to create tensor views
  • commit_retrieve: sends COMMIT_RETRIEVE to release read locks on server
  • Helper _make_tensor_view(offset, length, shape, dtype_str) -> torch.Tensor using torch.frombuffer
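
The attach pattern requested above (open the file, mmap it, close the descriptor right away, then hand out zero-copy views by offset and length) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the PR's code: `attach_segment` and `slot_view` are hypothetical names, a temporary file stands in for the `/dev/shm/...` segment, and a `memoryview` stands in for the tensor that `torch.frombuffer` would produce.

```python
import mmap
import os
import tempfile


def attach_segment(path: str, size: int) -> mmap.mmap:
    """Map `size` bytes of an existing file read/write, then drop the fd."""
    fd = os.open(path, os.O_RDWR)
    try:
        # The mapping does not depend on this descriptor once it is created,
        # so the fd can be closed as soon as mmap() returns.
        return mmap.mmap(fd, size)
    finally:
        os.close(fd)


def slot_view(buf: mmap.mmap, offset: int, length: int) -> memoryview:
    """Zero-copy window over one slot, bounds-checked against the mapping."""
    if offset < 0 or length < 0 or offset + length > len(buf):
        raise ValueError("slot lies outside the mapped pool")
    return memoryview(buf)[offset:offset + length]


def demo() -> bytes:
    """Write through one mapping and read the bytes back through another."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "pool")  # stand-in for the SHM segment
        with open(path, "wb") as f:
            f.truncate(4096)
        writer = attach_segment(path, 4096)
        reader = attach_segment(path, 4096)
        view = slot_view(writer, 128, 4)
        view[:] = b"KVKV"
        view.release()  # views must be released before the mapping closes
        seen = bytes(reader[128:132])
        writer.close()
        reader.close()
        return seen
```

The bounds check matters because the offsets arrive over RPC: a slot that reaches past the mapped pool should be rejected before any view is created.
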

2. Update lmcache/v1/multiprocess/non_gpu_context.py

Update create_non_gpu_context factory to accept optional shm_name: str = "" and pool_size: int = 0 parameters. If both are provided and valid, return NonGpuContextShm; otherwise return NonGpuContextPickle.

3. Add RegisterNonGpuContextResponse to lmcache/v1/multiprocess/protocols/engine.py

Add a dataclass:

@dataclass
class RegisterNonGpuContextResponse:
    shm_name: str = ""
    pool_size: int = 0

Update the REGISTER_KV_CACHE_NON_GPU_CONTEXT protocol definition's response_class from None to RegisterNonGpuContextResponse.

4. Update server: lmcache/v1/multiprocess/server.py

  • Change register_kv_cache_non_gpu_context return type to RegisterNonGpuContextResponse
  • It should return SHM pool info from self.storage_manager.get_shm_pool_info() (if available)
  • Update prepare_store to check if SHM is active; if so, resolve obj_keys, call storage_manager.reserve_write, and return slot metadata (offset, length, shape, dtype) in PrepareStoreResponse.context["slots"]
  • Update commit_store: if cpu_data is empty bytes, treat as SHM mode (data already written), just call storage_manager.finish_write
  • Update prepare_retrieve: if SHM active, return slot metadata in PrepareRetrieveResponse.context["slots"] instead of serialized data
  • Update commit_retrieve: if SHM active, release read locks via storage_manager.finish_read_prefetched

5. Update lmcache/v1/multiprocess/transfer_context.py

In DataTransferContext.register:

  • After future.result(), parse RegisterNonGpuContextResponse to extract shm_name and pool_size
  • Pass them to create_non_gpu_context(..., shm_name=shm_name, pool_size=pool_size)

6. SHM pool infrastructure in memory/storage managers

In lmcache/v1/distributed/memory_manager.py:

  • Add helper _check_shm_capacity(required_bytes) that checks /dev/shm free space
  • Add helper _unlink_stale_shm(shm_name) that removes stale lmcache_l1_pool_* segments
  • In create_memory_allocator, if config.shm_name is set and not lazy mode, try to set up SHM (with graceful fallback to pickle on failure)
  • Pass shm_name to MixedMemoryAllocator constructor (add as optional kwarg)
  • Add get_shm_pool_info() -> dict to MemoryManager class
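
As a rough illustration of the two helpers listed above, the sketch below checks free space on a mount and validates a segment name before anything is unlinked. The function names, the use of `shutil.disk_usage`, and the exact name pattern are assumptions made for this example; the prompt only specifies a `/dev/shm` free-space check and removal of stale `lmcache_l1_pool_*` segments.

```python
import re
import shutil

# Assumed pattern, based on the default name lmcache_l1_pool_<pid>.
_POOL_NAME = re.compile(r"lmcache_l1_pool_\d+")


def has_capacity(required_bytes: int, mount: str = "/dev/shm") -> bool:
    """Return True if `mount` reports at least `required_bytes` free."""
    return shutil.disk_usage(mount).free >= required_bytes


def is_safe_pool_name(shm_name: str) -> bool:
    """Accept only bare pool names, so no path separators or '..' slip through."""
    return _POOL_NAME.fullmatch(shm_name) is not None
```

Using `fullmatch` rather than a prefix test means a name such as `lmcache_l1_pool_1/../x` is rejected, which is the kind of path-safety property an unlink helper would want before touching the filesystem.
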

In lmcache/v1/distributed/config.py:

  • Add shm_name: str field to L1MemoryManagerConfig with default f"lmcache_l1_pool_{os.getpid()}"

In lmcache/v1/distributed/storage_manager.py:

  • Add get_shm_pool_info() -> dict delegating to l1_manager
  • Add unsafe_read(keys) -> tuple[list[ObjectKey], list[MemoryObj]] for SHM retrieve without re-locking

In lmcache/v1/distributed/l1_manager.py:

  • Add get_shm_pool_info() -> dict delegating to memory_manager

In lmcache/v1/memory_management.py:

  • Add properties shm_offset -> int (returns self.meta.address) and shm_byte_length -> int (returns self.get_size()) to the MemoryObj base class

7. Tests: tests/v1/distributed/test_shm_l1_pool.py

Add unit tests verifying:

  • SHM segment creation and cleanup
  • Tensor view creation from SHM buffer
  • Basic store/retrieve flow with NonGpuContextShm (mocked MQ)

Key Design Princip...

This pull request was created from Copilot chat.

Copilot AI and others added 2 commits May 20, 2026 05:57
Copilot AI changed the title from "[WIP] Add shared-memory based NonGpuContext implementation" to "Add SHM-based non-GPU KV transport without changing existing NonGpuContext interfaces" May 20, 2026
Copilot finished work on behalf of hlin99 May 20, 2026 06:03
Copilot AI requested a review from hlin99 May 20, 2026 06:03
Owner

@hlin99 hlin99 left a comment

  1. Can you follow the last commit in https://github.com/hlin99/LMCache/pull/273/commits and use mmap?
  2. On both the worker and server sides, print via logger.info whether the pickle path or the SHM path is in use.

Copilot AI and others added 2 commits May 20, 2026 06:19
Author

Copilot AI commented May 20, 2026

  1. can you refer to https://github.com/hlin99/LMCache/pull/273/commits the last commit to use mmap?
  2. worker/server side print out logger.info pickle path or shm path.

Done in 1972767 (plus 2b1f1d9). I aligned the SHM attach logic with PR #273’s mmap-style approach (open /dev/shm/..., mmap, then close fd immediately), and added explicit logger.info on both worker and server for transport selection (pickle vs SHM, including SHM name/pool size when active).

Copilot finished work on behalf of hlin99 May 20, 2026 06:21
Copilot AI requested a review from hlin99 May 20, 2026 06:21
Owner

@hlin99 hlin99 left a comment

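The server is started once and left running; the worker is started twice in succession, and the same prompt is sent each time. The first run is fine, but after the worker is started the second time the following error appears.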
(EngineCore pid=28702) [2026-05-20 06:32:33,134] LMCache INFO: Currently used by:
(EngineCore pid=28702) - vLLM non-MLA flash attention (utils.py:413:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:33,135] LMCache INFO: Creating NonGpuContextShm (shm_name=lmcache_l1_pool_26917, pool_size=21474836480) (non_gpu_context.py:128:lmcache.v1.multiprocess.non_gpu_context)
(EngineCore pid=28702) INFO 05-20 06:32:33 [gpu_worker.py:578] Compile and warming up model for size 2048
(EngineCore pid=28702) WARNING 05-20 06:32:33 [gpu_model_runner.py:5965] Skipping CUDA graph capture. To turn on CUDA graph capture, ensure cudagraph_mode was not manually set to NONE
(EngineCore pid=28702) INFO 05-20 06:32:33 [core.py:283] init engine (profile, create kv cache, warmup model) took 5.00 seconds
(EngineCore pid=28702) INFO 05-20 06:32:34 [factory.py:64] Creating v1 connector with name: LMCacheMPConnector and engine_id: 37ebcb6d-db18-410f-86ab-df39ef06e456
(EngineCore pid=28702) WARNING 05-20 06:32:34 [base.py:189] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
(EngineCore pid=28702) INFO 05-20 06:32:34 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=28702) WARNING 05-20 06:32:34 [xpu.py:181] XPU Graph is not supported in the current PyTorch version, disabling cudagraph_mode.
(APIServer pid=28433) INFO 05-20 06:32:34 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=28433) WARNING 05-20 06:32:34 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's generation_config.json: {'temperature': 0.6, 'top_p': 0.9}. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=28433) INFO 05-20 06:32:35 [hf.py:314] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=28433) INFO 05-20 06:32:35 [api_server.py:594] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:37] Available routes are:
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=28433) INFO 05-20 06:32:35 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=28433) INFO: Started server process [28433]
(APIServer pid=28433) INFO: Waiting for application startup.
(APIServer pid=28433) INFO: Application startup complete.
(EngineCore pid=28702) [2026-05-20 06:32:49,047] LMCache INFO: PeriodicThread lmcache-heartbeat entering main loop (interval=10.0s) (periodic_thread.py:279:lmcache.v1.periodic_thread)
(EngineCore pid=28702) [2026-05-20 06:32:49,047] LMCache INFO: Started PeriodicThread: lmcache-heartbeat (level=critical, interval=10.0s, init_wait=0.0s) (periodic_thread.py:239:lmcache.v1.periodic_thread)
(EngineCore pid=28702) [2026-05-20 06:32:49,050] LMCache INFO: PeriodicThread lmcache-heartbeat entering main loop (interval=10.0s) (periodic_thread.py:279:lmcache.v1.periodic_thread)
(EngineCore pid=28702) [2026-05-20 06:32:49,050] LMCache INFO: Started PeriodicThread: lmcache-heartbeat (level=critical, interval=10.0s, init_wait=0.0s) (periodic_thread.py:239:lmcache.v1.periodic_thread)
(EngineCore pid=28702) [2026-05-20 06:32:49,052] LMCache INFO: list_depth: 1, tensor_dim: 5 (utils.py:509:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:49,052] LMCache INFO: GPU KV Cache Dimensions: [32][2, 372, 64, 8, 128] (utils.py:520:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:49,052] LMCache INFO: vLLM KV cache layout: NHD (utils.py:534:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:49,052] LMCache INFO: GPU KV Format: NL x [2, NB, BS, NH, HS] (utils.py:412:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:49,052] LMCache INFO: Currently used by:
(EngineCore pid=28702) - vLLM non-MLA flash attention (utils.py:413:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,916] LMCache INFO: list_depth: 1, tensor_dim: 5 (utils.py:509:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,917] LMCache INFO: GPU KV Cache Dimensions: [32][2, 372, 64, 8, 128] (utils.py:520:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,917] LMCache INFO: vLLM KV cache layout: NHD (utils.py:534:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,917] LMCache INFO: GPU KV Format: NL x [2, NB, BS, NH, HS] (utils.py:412:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,917] LMCache INFO: Currently used by:
(EngineCore pid=28702) - vLLM non-MLA flash attention (utils.py:413:lmcache.v1.gpu_connector.utils)
(EngineCore pid=28702) [2026-05-20 06:32:54,926] LMCache ERROR: Something went wrong when processing the store request for request_id=cmpl-a2ea2c8d4c493a4e-0-807a58c8 (vllm_multi_process_adapter.py:1155:lmcache.integration.vllm.vllm_multi_process_adapter)
(APIServer pid=28433) INFO 05-20 06:32:55 [loggers.py:259] Engine 000: Avg prompt throughput: 1.6 tokens/s, Avg generation throughput: 12.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.5%, Prefix cache hit rate: 0.0%, External prefix cache hit rate: 97.6%

Owner

@hlin99 hlin99 left a comment

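Task: port the SHM transport from PR #278 to the ww20_PR_cpu_context_pickle branch and fix the known issues

Background

PR #278 (copilot/add-nongpucontext-shm-implementation) adds an SHM (shared memory) transport mode on top of the pickle path of the non-GPU context. The changes in that PR need to be cherry-picked / ported to the hlin99:ww20_PR_cpu_context_pickle branch, and every issue found in the code review below must be fixed at the same time.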


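🔴 Critical Issues (must fix)

1. Idempotency bug: the store request for a repeated prompt fails

Steps to reproduce: keep the server running without a restart, start the worker twice in succession, and send the identical prompt both times. The second store fails with Something went wrong when processing the store request.

Root cause: on the second store the key already exists, so reserve_write(..., "new") returns an empty dict. prepare_store then records an empty reserved_keys list in _pending_shm_writes, and commit_store later hits not reserved_keys and returns False.

Fix:

  • server.py prepare_store: if the reservation is empty (not reserved_keys), do not record it in _pending_shm_writes; return PrepareStoreResponse(context={}) directly (without a "slots" key).
  • server.py commit_store: when cpu_data == b"" and SHM is active, return True rather than False if not reserved_keys (idempotent semantics: key already exists = store succeeded).
  • Worker-side NonGpuContextShm.prepare_store: if the response context has no "slots" key, or slots is an empty list, return None (so the caller skips the SHM write).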
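
The three fixes above amount to one rule: an empty reservation is a successful no-op, not an error. A minimal sketch of that bookkeeping follows; `PendingShmWrites` and its method names are hypothetical, and the real server keeps this state in `_pending_shm_writes` alongside its storage manager calls.

```python
class PendingShmWrites:
    """Hypothetical stand-in for the server's _pending_shm_writes bookkeeping."""

    def __init__(self) -> None:
        self._pending: dict = {}

    def prepare(self, transfer_key, reserved_keys) -> bool:
        """Record a reservation; return False when there is nothing to write."""
        if not reserved_keys:
            # Every key already exists: store nothing, so no entry can leak.
            return False
        self._pending[transfer_key] = list(reserved_keys)
        return True

    def commit(self, transfer_key):
        """Return (ok, keys_to_finish) for an SHM commit with empty payload."""
        reserved = self._pending.pop(transfer_key, [])
        # Idempotent semantics: no reserved keys means the data was already
        # stored, so the commit reports success with nothing to finalize.
        return True, reserved
```

With this shape, the second identical prompt takes the `prepare` early return, and a `commit` that arrives anyway still reports success instead of `False`.
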

2. Calling close() twice raises OSError: Bad file descriptor

File: non_gpu_context_shm.py, close() method

Fix: add a guard against double close:

def close(self) -> None:
    if self._shm_fd < 0:
        return
    try:
        self._mmap_obj.close()
    finally:
        fd = self._shm_fd
        self._shm_fd = -1
        os.close(fd)
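
3. _pending_shm_writes / _pending_shm_reads are not protected by a lock, so there is a race condition

File: server.py L262-264 and every read or write of these two dicts

Fix: add a threading.Lock (for example self._pending_shm_lock) and hold it in every code block that accesses _pending_shm_writes or _pending_shm_reads:

  • the write in prepare_store
  • the pop in commit_store
  • the write in prepare_retrieve
  • the pop in commit_retrieve
  • the filtered cleanup in unregister_kv_cache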
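
One way to apply the lock consistently is to route every access through a small wrapper, so no call site can forget it. The sketch below is an assumption-laden illustration: `PendingTransfers` and its methods are hypothetical names, and only the write-side dict is shown (the read side would mirror it).

```python
import threading


class PendingTransfers:
    """Hypothetical lock-guarded replacement for a bare pending-transfer dict."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._writes: dict = {}

    def add_write(self, instance_id, request_id, keys) -> None:
        with self._lock:
            self._writes[(instance_id, request_id)] = list(keys)

    def pop_write(self, instance_id, request_id):
        with self._lock:
            return self._writes.pop((instance_id, request_id), None)

    def drop_instance(self, instance_id) -> dict:
        """Remove and return every pending entry owned by `instance_id`."""
        with self._lock:
            stale = {k: v for k, v in self._writes.items() if k[0] == instance_id}
            for k in stale:
                del self._writes[k]
        # Returned outside the lock so the caller can release the reserved
        # memory without blocking other handlers.
        return stale

    def __len__(self) -> int:
        with self._lock:
            return len(self._writes)
```

Returning the stale entries from `drop_instance` lets the unregister path release their reservations after the lock is dropped, which keeps the critical section short.
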
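
4. _is_shm_active() queries the storage manager on every call (hot-path performance issue)

File: server.py, _is_shm_active() method

Fix: initialize self._shm_active: bool = False in __init__, and update the flag in register_kv_cache_non_gpu_context after a successful registration (based on the get_shm_pool_info() result). All later _is_shm_active() calls then return self._shm_active directly.

5. Memory leak when the worker times out after prepare_store has reserved memory

File: server.py, prepare_store + commit_store + unregister_kv_cache

Fix: in unregister_kv_cache, call self.storage_manager.finish_write(v) on each _pending_shm_writes entry being cleaned up, so the reserved memory is released (instead of only deleting the entry from the dict):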

# In unregister_kv_cache
stale_writes = {k: v for k, v in self._pending_shm_writes.items() if k[0] == instance_id}
for k, v in stale_writes.items():
    if v:
        self.storage_manager.finish_write(v)
    del self._pending_shm_writes[k]
# Handle _pending_shm_reads the same way (calling finish_read_prefetched)
stale_reads = {k: v for k, v in self._pending_shm_reads.items() if k[0] == instance_id}
for k, v in stale_reads.items():
    if v:
        self.storage_manager.finish_read_prefetched(v)
    del self._pending_shm_reads[k]
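
6. The prepare_retrieve SHM path uses unsafe_read without validating its precondition

File: server.py, SHM branch of prepare_retrieve

Fix: before the unsafe_read call, add a comment stating the precondition (the read lock was already acquired during the lookup phase), and add a defensive check: if unsafe_read returns fewer shm_memory_objs than obj_keys, release the keys already acquired and then return failure (the current code already has this logic; confirm that it is kept).

🟡 Medium Issues (recommended fixes)

7. _make_non_gpu_transfer_key includes the key.token_ids tuple, which makes hashing expensive

Fix: if IPCCacheEngineKey is hashable, use (instance_id, key) as the dict key; otherwise consider (instance_id, key.request_id, key.start, key.end) as a lighter unique identifier.

8. Dead code: _check_shm_capacity in memory_manager.py is never called

Fix: delete the _check_shm_capacity function, or call it in create_memory_allocator in place of the inline logic.

9. Confirm that MixedMemoryAllocator supports the shm_name parameter

Fix: confirm that MixedMemoryAllocator.__init__ accepts the shm_name kwarg. If it does not, add handling for the parameter in this change; otherwise a TypeError is raised at runtime.

🟢 Minor Issues (optional fixes)

10. Insufficient test coverage

Suggested additions to test_shm_l1_pool.py:

  • A test that NonGpuContextShm.__init__ raises when the SHM file does not exist
  • A test that create_non_gpu_context(shm_name="", pool_size=0) falls back to pickle
  • A test for storing a duplicate key (idempotency)
  • A test that calling close() twice does not crash

11. non_gpu_context_shm.py uses torch.empty((), dtype=dtype).element_size() to obtain the itemsize

Style suggestion: this could be changed to torch.tensor([], dtype=dtype).element_size(), or left as is; either is fine.

Implementation requirements

  • Create the PR from the hlin99:ww20_PR_cpu_context_pickle branch
  • Port over all file changes from PR #278 (the copilot/add-nongpucontext-shm-implementation branch)
  • While porting, apply all of the 🔴 Critical fixes and 🟡 Medium fixes above
  • Make sure the existing tests pass and the new SHM tests pass as well
  • Suggested PR title: Port SHM transport to cpu_context_pickle branch with bug fixes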

Author

Copilot AI commented May 20, 2026

@hlin99 I've opened a new pull request, #280, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 3 commits May 20, 2026 06:33
Author

Copilot AI commented May 20, 2026

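The server is started once and left running; the worker is started twice in succession, and the same prompt is sent each time. The first run is fine, but after the worker is started the second time the following error appears.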
(EngineCore pid=28702) [2026-05-20 06:32...

Fixed in ee5eed6 (plus test follow-ups 26a0cc7, 067f3e4). Root cause was SHM commit_store returning False when prepare_store reserved zero keys (valid case when KV already exists, e.g. second identical prompt after worker restart). I changed server logic so: matching prepare + empty reservation => success no-op; missing prepare state => failure. Added regression coverage for both paths in test_server_shm_commit_store_allows_noop_when_all_keys_exist. UI screenshot: N/A (no UI changes).

Copilot finished work on behalf of hlin99 May 20, 2026 06:38
Copilot AI requested a review from hlin99 May 20, 2026 06:38
…ctness gaps from #278 review (#280)

* Initial plan

* Fix SHM non-GPU transport idempotency, locking, and cleanup issues

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/661cbeee-d0d4-40ef-9312-4044e4696a51

Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

* Polish SHM feedback fixes and align validation comments

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/661cbeee-d0d4-40ef-9312-4044e4696a51

Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

* Fix memory leak: early-return from prepare_store when all keys exist

When reserve_write returns empty (all object keys already cached),
return PrepareStoreResponse(context={}) immediately without storing
an entry in _pending_shm_writes. This prevents leaked entries that
would never be popped since the worker won't call commit_store.

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/182111d5-1737-49c0-be65-0287d5b9d6c5

Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>
@hlin99 hlin99 closed this May 28, 2026
@hlin99 hlin99 deleted the copilot/add-nongpucontext-shm-implementation branch May 28, 2026 12:22

2 participants