Use `multiprocessing.shared_memory` for cross-platform SHM transport by Copilot · Pull Request #291 · hlin99/LMCache

Copilot · 2026-05-21T12:13:47Z

The non-GPU SHM transport assumed Linux /dev/shm paths, which broke attachment and stale-segment cleanup on macOS. This updates the transport and allocator paths to use Python’s cross-platform shared memory API while preserving the existing Linux behavior.

Worker SHM attachment
- Replace os.open("/dev/shm/...") + mmap.mmap(...) in NonGpuContextShm with multiprocessing.shared_memory.SharedMemory.
- Keep tensor view construction on top of the shared buffer via torch.frombuffer(...).
- Simplify shutdown to close the shared memory handle instead of managing raw fds.
```python
from multiprocessing import shared_memory

self._shm = shared_memory.SharedMemory(name=shm_name.lstrip("/"), create=False)
self._shm_buffer = self._shm.buf

tensor = torch.frombuffer(self._shm_buffer, dtype=dtype, count=count, offset=offset)
```
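
The attach-by-name pattern above can be tried without torch. The following is a minimal sketch using only the standard library: `struct.unpack_from` stands in for `torch.frombuffer`, and the helper names `attach_segment` and `read_floats` are hypothetical, not part of LMCache.

```python
import struct
from multiprocessing import shared_memory


def attach_segment(shm_name: str) -> shared_memory.SharedMemory:
    """Attach to an existing segment; a POSIX-style leading '/' is stripped."""
    return shared_memory.SharedMemory(name=shm_name.lstrip("/"), create=False)


def read_floats(shm: shared_memory.SharedMemory, count: int, offset: int = 0):
    """Read `count` little-endian float32 values starting at byte `offset`."""
    return list(struct.unpack_from(f"<{count}f", shm.buf, offset))


# Demo: the "server" side creates the segment, the "worker" side attaches by name.
owner = shared_memory.SharedMemory(create=True, size=64)
try:
    struct.pack_into("<4f", owner.buf, 8, 1.0, 2.0, 3.0, 4.0)
    worker = attach_segment("/" + owner.name)
    try:
        values = read_floats(worker, count=4, offset=8)
    finally:
        worker.close()  # closes this handle only; the segment stays alive
finally:
    owner.close()
    owner.unlink()  # only the owner removes the segment
```

Both handles map the same memory, so the worker sees the values the owner wrote at byte offset 8.
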
Stale SHM cleanup
- Replace /dev/shm/<name> unlink logic in _unlink_stale_shm with SharedMemory(...).unlink().
- Retain the existing name validation/prefix guard before attempting cleanup.
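
A rough sketch of how the guarded cleanup might look is shown here. The function name `unlink_stale_segment` and its boolean return value are illustrative assumptions; the `lmcache_l1_pool_` prefix is the one discussed later in this PR.

```python
from multiprocessing import shared_memory

SHM_PREFIX = "lmcache_l1_pool_"  # prefix taken from the discussion in this PR


def unlink_stale_segment(shm_name: str) -> bool:
    """Unlink a leftover segment; return True if one was actually removed."""
    normalized = shm_name.lstrip("/")
    if not normalized.startswith(SHM_PREFIX):
        return False  # guard: never touch segments this project did not create
    try:
        shm = shared_memory.SharedMemory(name=normalized, create=False)
    except FileNotFoundError:
        return False  # nothing stale to clean up
    shm.close()
    shm.unlink()
    return True
```

Because the lookup goes through `SharedMemory` rather than a filesystem path, the same code works where no `/dev/shm` mount exists.
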
Allocator platform guard
- Keep the /dev/shm free-space check on Linux only.
- Skip that tmpfs-specific capacity check on non-Linux platforms, where POSIX shared memory is not exposed through a fixed /dev/shm mount.
Small robustness follow-ups
- Initialize SHM state defensively so failed attachment does not leave partial state behind.
- Make the buffer-not-initialized error message include the SHM name for easier diagnosis.
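
One way to read the two robustness points is sketched below. The class `ShmAttachment`, its `is_attached` property, and the choice to swallow the attach error (rather than re-raise it) are assumptions made for illustration; the actual `NonGpuContextShm` may differ.

```python
from multiprocessing import shared_memory


class ShmAttachment:
    """Hypothetical holder for a worker-side SHM attachment."""

    def __init__(self, shm_name: str) -> None:
        # Assign every attribute before attaching, so a failed attach
        # still leaves a fully formed (detached) object behind.
        self._shm_name = shm_name.lstrip("/")
        self._shm = None
        try:
            self._shm = shared_memory.SharedMemory(
                name=self._shm_name, create=False
            )
        except OSError:  # includes FileNotFoundError
            self._shm = None  # stay detached; the caller may fall back

    @property
    def is_attached(self) -> bool:
        return self._shm is not None

    @property
    def buffer(self) -> memoryview:
        if self._shm is None:
            # Include the segment name so the failure is easy to diagnose.
            raise RuntimeError(
                f"SHM buffer not initialized (shm_name={self._shm_name!r})"
            )
        return self._shm.buf

    def close(self) -> None:
        if self._shm is not None:
            self._shm.close()
            self._shm = None
```
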

Original prompt

Replace Linux-specific /dev/shm hardcoded paths with cross-platform multiprocessing.shared_memory APIs so the SHM transport works on both Linux and macOS.

Do NOT modify any test files.

Changes needed:

1. `lmcache/v1/multiprocess/non_gpu_context_shm.py`

Replace the os.open("/dev/shm/...", os.O_RDWR) + mmap.mmap(fd, ...) pattern with multiprocessing.shared_memory.SharedMemory:

```python
from multiprocessing import shared_memory

# In __init__:
self._shm = shared_memory.SharedMemory(name=shm_name.lstrip("/"), create=False)
self._mmap_obj = self._shm.buf  # memoryview works with torch.frombuffer
```

Remove the os.open/os.close/mmap.mmap logic and the INVALID_SHM_FD sentinel. Update close() to call self._shm.close() instead of os.close(fd) + mmap.close().

Note: torch.frombuffer accepts a memoryview (which is what SharedMemory.buf returns), so the _make_tensor_view method should still work with self._shm.buf as the buffer — just pass it directly to torch.frombuffer(self._shm.buf, ...).

2. `lmcache/v1/distributed/memory_manager.py` — `_unlink_stale_shm`

Replace os.unlink(os.path.join("/dev/shm", normalized)) with:

```python
from multiprocessing import shared_memory

try:
    shm = shared_memory.SharedMemory(name=normalized, create=False)
    shm.close()
    shm.unlink()
except FileNotFoundError:
    return
```

3. `lmcache/v1/distributed/memory_manager.py` — `create_memory_allocator`

The shutil.disk_usage("/dev/shm") call only makes sense on Linux. Guard it:

```python
import sys

if sys.platform == "linux":
    free_bytes = shutil.disk_usage("/dev/shm").free
    if free_bytes < config.size_in_bytes:
        raise RuntimeError(...)
```

On non-Linux platforms, skip the capacity check (macOS shm is not backed by a fixed-size tmpfs).
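
As a self-contained illustration of the guard, the sketch below wraps the check in a function. The name `check_shm_capacity`, its `platform` and `shm_dir` parameters (added so the behavior can be exercised on any machine), and the error wording are assumptions, not the PR's code.

```python
import shutil
import sys


def check_shm_capacity(required_bytes, platform=sys.platform, shm_dir="/dev/shm"):
    """Raise if the Linux tmpfs behind POSIX SHM cannot hold `required_bytes`.

    Returns the free byte count on Linux, or None where the check is skipped.
    """
    if platform != "linux":
        return None  # no fixed tmpfs mount to inspect; skip the check
    free_bytes = shutil.disk_usage(shm_dir).free
    if free_bytes < required_bytes:
        raise RuntimeError(
            f"{shm_dir} has {free_bytes} bytes free, need {required_bytes}"
        )
    return free_bytes
```
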

IMPORTANT: Do NOT modify any files under tests/.

The following is the prior conversation context from the user's chat exploration (may be truncated):

Assistant:

# Fix SHM transport correctness issues in non-GPU multiprocess path

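Fix 3 issues in the SHM transport path on the `ww21_PR_shm` branch, **without modifying any test files**.

## 1. prepare_store should raise on timeout

**File**: `lmcache/v1/multiprocess/non_gpu_context_shm.py`

Currently `prepare_store` returns `None` on timeout, which leads to silent data corruption on the subsequent SHM path. A timeout by itself indicates that the server is misbehaving, so the call should fail fast.

Change: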
```python
try:
    response = future.result(timeout=self.mq_timeout)
except TimeoutError:
    return None
```

to:
```python
try:
    response = future.result(timeout=self.mq_timeout)
except TimeoutError:
    raise TimeoutError(
        f"PREPARE_STORE timed out for instance_id={instance_id} "
        f"after {self.mq_timeout}s"
    )
```
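
The fail-fast behavior can be exercised in isolation. This sketch assumes the future is a `concurrent.futures.Future` (the source does not say which future type LMCache uses) and introduces a hypothetical helper, `wait_for_response`. It catches `concurrent.futures.TimeoutError` explicitly because, before Python 3.11, that class is distinct from the builtin `TimeoutError`.

```python
import concurrent.futures


def wait_for_response(future, timeout_s, instance_id):
    """Return the future's result, or raise TimeoutError instead of returning None."""
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as exc:
        # Re-raise as the builtin TimeoutError with context for diagnosis.
        raise TimeoutError(
            f"PREPARE_STORE timed out for instance_id={instance_id} "
            f"after {timeout_s}s"
        ) from exc
```
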

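## 2. Both the store and retrieve paths need a sync

**File**: `lmcache/v1/multiprocess/transfer_context.py`

### 2a. submit_store: sync after gather, before commit

`gather_paged_kv_to_cpu` performs an asynchronous copy with `non_blocking=True`. When `out_buffers is not None` (the SHM path), the code must synchronize before `commit_store` so that the server cannot read partially written data.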

```python
cpu_chunks = gather_paged_kv_to_cpu(
    ...
    out=out_buffers,
    chunk_indices=chunk_indices,
)
if out_buffers is not None:
    torch_dev.synchronize()
ok = self._non_gpu_context.commit_store(key, instance_id, cpu_chunks)
```

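### 2b. submit_retrieve: sync after scatter, before commit_retrieve

`scatter_cpu_to_paged_kv` copies data from SHM to the device. `commit_retrieve` releases the SHM slot, which the server may reuse immediately, so the data must have fully left SHM before the slot is released.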

```python
if src_buffers is not None:
    scatter_cpu_to_paged_kv(...)
    torch_dev.synchronize()
self._non_gpu_context.commit_retrieve(key, instance_id)
```
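
The ordering hazard in 2a and 2b can be illustrated without a GPU. The sketch below is a thread-based analogy, not the torch code path: a background thread plays the role of the non-blocking copy, and `threading.Event.wait()` plays the role of `torch_dev.synchronize()`. All names are hypothetical.

```python
import threading
import time


def async_copy(src: bytes, dst: bytearray, done: threading.Event) -> None:
    """Simulate a non-blocking copy that completes some time after launch."""

    def _run() -> None:
        time.sleep(0.05)  # the copy is still in flight when the caller returns
        dst[:] = src
        done.set()

    threading.Thread(target=_run, daemon=True).start()


def store_then_commit(src: bytes, synchronize: bool) -> bytes:
    """Return what a reader of the shared slot would see at commit time."""
    shared = bytearray(len(src))  # stands in for the SHM slot
    done = threading.Event()
    async_copy(src, shared, done)
    if synchronize:
        done.wait()  # analogous to torch_dev.synchronize()
    return bytes(shared)  # snapshot taken at "commit"


committed = store_then_commit(b"kv-chunk", synchronize=True)
```

With `synchronize=False` the snapshot is taken while the copy may still be in flight, which is the kind of partial read the sync is meant to rule out.
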

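## 3. Apply a uniform prefix to SHM names

**File**: `lmcache/v1/distributed/memory_manager.py` and the related places where shm_name is constructed

Prepend the `lmcache_l1_pool_` prefix to any user-specified `shm_name`:

| Scenario | Result |
|------|------|
| Not specified | `lmcache_l1_pool_{pid}` (existing behavior preserved) |
| User specifies `foo` | `lmcache_l1_pool_foo` |

This way `_unlink_stale_shm` only needs to check `startswith("lmcache_l1_pool_")`, with no special case for a `configured_shm_name` parameter, which simplifies the cleanup logic.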
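
The naming rule in the table could be captured roughly as follows. The helper names `resolve_shm_name` and `is_lmcache_segment` are hypothetical, and stripping a leading `/` from an already prefixed name is an assumption of this sketch.

```python
import os
from typing import Optional

SHM_PREFIX = "lmcache_l1_pool_"


def resolve_shm_name(configured: Optional[str], pid: Optional[int] = None) -> str:
    """Map a configured name (or None) to the prefixed segment name."""
    if not configured:
        # Default: one pool per server process.
        return f"{SHM_PREFIX}{os.getpid() if pid is None else pid}"
    bare = configured.lstrip("/")
    return bare if bare.startswith(SHM_PREFIX) else f"{SHM_PREFIX}{bare}"


def is_lmcache_segment(name: str) -> bool:
    """Single check a cleanup routine needs once every name carries the prefix."""
    return name.lstrip("/").startswith(SHM_PREFIX)
```

Whichever name this returns is the one that has to reach both the allocator and the worker, otherwise the two sides refer to different segments.
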

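Should I open an issue with this, or submit a PR directly?

User: Take a look at the latest @hlin99/LMCache/pull/290. I get the feeling that when the shm-name parameter is passed in, the prefixed name is not actually being applied?

A: You're right. Looking at the changes to memory_manager.py: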

shm_name = config.shm_name
if shm_name:
    bare = shm_name.lstrip("/")
    if not bare.startswith("lmcache_l1_pool_"):
        shm_name = f"lmcache_l1_pool_{bare}"

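This only changes the local variable shm_name. The prefixed name is passed to MixedMemoryAllocator, but the shm_name returned to the worker in the registration response comes from storage_manager.get_shm_pool_info(). If get_shm_pool_info() still returns the original config.shm_name (without the prefix), the name the worker receives will not match the SHM segment that was actually created.

Things to confirm:

Is the shm_name recorded inside MixedMemoryAllocator the one passed in (already prefixed) or the original?
Which one does get_shm_pool_info() return?

If MixedMemoryAllocator simply passes through the shm_name from its constructor arguments, this should be fine, because the value passed in is already prefixed. But config.shm_name itself is not modified, so any other code that reads config.shm_name directly (for example the call site of _unlink_stale_shm) may still be using the unprefixed version.

Want me to check the implementations of MixedMemoryAllocator and get_shm_pool_info to confirm?

User: Go check it. It's the hlin99:ww21_PR_shm branch. Are you waiting for me to look it up for you?

A: Found it. See lines 109-115: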

self._shm_pool_info = {"shm_name": "", "pool_size": 0}
if isinstance(self._allocator, MixedMemoryAllocator):
    if self._allocator.shm_name:
        self._shm_pool_info = {
      ...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

*This pull request was created from Copilot chat.*

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/5885bfad-28e5-4d4e-8ef5-c6fe1d4b89ae Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Initial plan

2e3009c

Copilot AI assigned Copilot and hlin99 May 21, 2026

Copilot started work on behalf of hlin99 May 21, 2026 12:13

Copilot AI and others added 3 commits May 21, 2026 12:24

Use SharedMemory APIs for SHM transport

c3d074e

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/5885bfad-28e5-4d4e-8ef5-c6fe1d4b89ae Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Polish SharedMemory transport follow-ups

55c61b1

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/5885bfad-28e5-4d4e-8ef5-c6fe1d4b89ae Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Address SharedMemory review feedback

e2f3717

Agent-Logs-Url: https://github.com/hlin99/LMCache/sessions/5885bfad-28e5-4d4e-8ef5-c6fe1d4b89ae Co-authored-by: hlin99 <73271530+hlin99@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Replace Linux-specific SHM paths with multiprocessing.shared_memory~~ Use multiprocessing.shared_memory for cross-platform SHM transport May 21, 2026

Copilot finished work on behalf of hlin99 May 21, 2026 12:31

Copilot AI requested a review from hlin99 May 21, 2026 12:31

hlin99 added 2 commits May 21, 2026 20:42

Add fallback to pickle transport when SHM context initialization fails

2bdb6ae

fix: unregister SHM from resource tracker to prevent premature unlink

d15ff6a

Workers do not own the SHM segment (server does), so we must prevent Python's resource_tracker from unlinking it when the worker exits. Without this, the second worker startup fails with FileNotFoundError.
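
A commonly used way to achieve this is sketched below; it is not necessarily what commit d15ff6a does. The helper name `attach_untracked` is hypothetical. The sketch relies on the `track` parameter added to `SharedMemory` in Python 3.13, and on older versions falls back to `resource_tracker.unregister`, which needs the private `_name` attribute.

```python
import os
import sys
from multiprocessing import resource_tracker, shared_memory


def attach_untracked(name: str) -> shared_memory.SharedMemory:
    """Attach to a segment owned by another process without tracking it."""
    if sys.version_info >= (3, 13):
        # Python 3.13 added `track`; False means this handle is never registered.
        return shared_memory.SharedMemory(name=name, create=False, track=False)
    shm = shared_memory.SharedMemory(name=name, create=False)
    if os.name == "posix":
        # Older versions register every attach on POSIX. Undo that so this
        # process's tracker does not unlink the server's segment at exit.
        # `_name` is a private attribute and may change between versions.
        resource_tracker.unregister(shm._name, "shared_memory")
    return shm
```

The worker should still call `close()` on the returned handle; only the owning server calls `unlink()`.
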

hlin99 deleted the branch copilot/fix-shm-transport-issues May 28, 2026 12:20

hlin99 closed this May 28, 2026

hlin99 deleted the copilot/replace-linux-dev-shm-paths branch May 28, 2026 12:20

Copilot wants to merge 6 commits into copilot/fix-shm-transport-issues from copilot/replace-linux-dev-shm-paths
