Restore SHM safety fallback and transport selection observability in NonGPUTransferModule #309

Closed
Copilot wants to merge 4 commits into ww21_PR_shm from copilot/add-shm-capacity-check-logging

Copilot AI commented May 27, 2026

This PR restores two behaviors that were present before the compositor refactor: proactive /dev/shm capacity gating for SHM transport, and explicit per-instance transport selection logs during non-GPU context registration. Without these, SHM can fail late at allocation time and operators cannot easily tell whether SHM or pickle was selected.

  • SHM capacity guard in _compute_shm_pool_info

    • Added Linux-only free-space check for /dev/shm after SHM name/pool resolution and before returning active SHM config.
    • If free_bytes < pool_size, the module now logs a warning and returns disabled SHM config (shm_name="", pool_size=0) to force pickle fallback.
    • If /dev/shm capacity cannot be checked (OSError), the module now logs and disables SHM conservatively.
  • Transport selection logging in register_kv_cache_non_gpu_context

    • Added explicit info logs for final strategy choice per instance:
      • SHM path logs instance_id, shm_name, and pool_size.
      • Pickle path logs instance_id with pickle selection.
  • Focused regression coverage

    • Added tests for:
      • insufficient /dev/shm capacity disables SHM;
      • non-Linux path skips /dev/shm probing;
      • registration logs for pickle selection;
      • registration logs for SHM selection.
# non_gpu_transfer.py (new behavior)
if pool_size > 0 and sys.platform.startswith("linux"):
    free_bytes = shutil.disk_usage("/dev/shm").free
    if free_bytes < pool_size:
        logger.warning("Insufficient /dev/shm capacity ... Disabling SHM transport.")
        return {"shm_name": "", "pool_size": 0}

if isinstance(strategy, ShmTransferStrategy):
    logger.info(
        "Instance %s non-GPU context using SHM transport (shm_name=%s, pool_size=%d)",
        payload.instance_id, shm_name, pool_size
    )
else:
    logger.info("Instance %s non-GPU context using pickle transport", payload.instance_id)
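
The regression cases listed above could be exercised roughly as follows. This is only a sketch: `shm_allowed` is a hypothetical stand-in for the guard in `_compute_shm_pool_info`, not the module's real function, and the actual tests in this PR may be structured differently. It patches `sys.platform` and `shutil.disk_usage` so the cases run on any host.

```python
import shutil
import sys
from collections import namedtuple
from unittest import mock

# Mimics the (total, used, free) shape returned by shutil.disk_usage.
Usage = namedtuple("Usage", "total used free")


def shm_allowed(pool_size):
    """Hypothetical stand-in for the guard: True if SHM may be used."""
    if pool_size > 0 and sys.platform.startswith("linux"):
        try:
            return shutil.disk_usage("/dev/shm").free >= pool_size
        except OSError:
            # Capacity cannot be verified: disable SHM conservatively.
            return False
    return pool_size > 0


def test_insufficient_capacity_disables_shm():
    with mock.patch.object(sys, "platform", "linux"), \
         mock.patch.object(shutil, "disk_usage", return_value=Usage(100, 90, 10)):
        assert shm_allowed(pool_size=64) is False


def test_probe_error_disables_shm():
    with mock.patch.object(sys, "platform", "linux"), \
         mock.patch.object(shutil, "disk_usage", side_effect=OSError("no /dev/shm")):
        assert shm_allowed(pool_size=64) is False


def test_non_linux_skips_probe():
    with mock.patch.object(sys, "platform", "darwin"), \
         mock.patch.object(shutil, "disk_usage") as probe:
        assert shm_allowed(pool_size=64) is True
        probe.assert_not_called()
```

Patching the probe rather than filling a real tmpfs keeps the tests fast and independent of how much shared memory the CI host has.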
Original prompt

Add missing /dev/shm capacity check and transport selection logging to NonGPUTransferModule.

These were present in the old monolithic server.py (commit c61c215) but lost during the compositor pattern refactor.

Changes needed in lmcache/v1/multiprocess/modules/non_gpu_transfer.py:

1. Add /dev/shm capacity check in _compute_shm_pool_info

Before returning a valid SHM pool info, check that /dev/shm has enough free space. If not, log a warning and return empty (disable SHM, fallback to pickle).

Original logic from old _resolve_shm_config:

import shutil
import sys

if sys.platform.startswith("linux"):
    try:
        free_bytes = shutil.disk_usage("/dev/shm").free
        if free_bytes < mem_cfg.size_in_bytes:
            logger.warning(
                "Insufficient /dev/shm capacity: need %d bytes, have %d bytes. "
                "Disabling SHM transport.",
                mem_cfg.size_in_bytes,
                free_bytes,
            )
            return {"shm_name": "", "pool_size": 0}
    except OSError:
        logger.warning(
            "Cannot verify /dev/shm capacity required for SHM transport; "
            "disabling SHM mode.",
            exc_info=True,
        )
        return {"shm_name": "", "pool_size": 0}

This check should happen AFTER determining that shm_name is non-empty and pool_size > 0, but BEFORE returning the valid SHM pool info.
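
The ordering described above can be sketched as a standalone function. The function name, the `shm_dir` and `platform` parameters, and the `SHM_DISABLED` constant are assumptions made so the example runs in isolation; the real `_compute_shm_pool_info` reads its inputs from the module's configuration.

```python
import logging
import shutil
import sys

logger = logging.getLogger(__name__)

# Hypothetical constant: the "SHM disabled, use pickle" result.
SHM_DISABLED = {"shm_name": "", "pool_size": 0}


def compute_shm_pool_info(shm_name, pool_size, shm_dir="/dev/shm", platform=None):
    """Return active SHM pool info, or the disabled config if SHM is unsafe."""
    platform = sys.platform if platform is None else platform

    # Step 1: SHM is not configured, so there is nothing to probe.
    if not shm_name or pool_size <= 0:
        return dict(SHM_DISABLED)

    # Step 2: on Linux only, gate on free space before committing to SHM.
    if platform.startswith("linux"):
        try:
            free_bytes = shutil.disk_usage(shm_dir).free
        except OSError:
            logger.warning(
                "Cannot verify %s capacity; disabling SHM mode.",
                shm_dir,
                exc_info=True,
            )
            return dict(SHM_DISABLED)
        if free_bytes < pool_size:
            logger.warning(
                "Insufficient %s capacity: need %d bytes, have %d bytes. "
                "Disabling SHM transport.",
                shm_dir,
                pool_size,
                free_bytes,
            )
            return dict(SHM_DISABLED)

    # Step 3: only now return the active SHM config.
    return {"shm_name": shm_name, "pool_size": pool_size}
```

Keeping the `try` block around the probe alone means an `OSError` from `disk_usage` is never confused with a failure elsewhere in the function.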

2. Add transport selection logging in register_kv_cache_non_gpu_context

After deciding the strategy (SHM vs pickle), log which transport was selected:

if shm_active:
    logger.info(
        "Instance %s non-GPU context using SHM transport "
        "(shm_name=%s, pool_size=%d)",
        payload.instance_id,
        shm_name,
        pool_size,
    )
else:
    logger.info(
        "Instance %s non-GPU context using pickle transport",
        payload.instance_id,
    )

Why:

  • Without the capacity check, if /dev/shm is too small the SHM allocator will fail at runtime instead of gracefully falling back to pickle.
  • Without the logging, operators cannot tell which transport mode was selected for each worker instance.
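
A minimal sketch of the per-instance selection log, assuming the decision reduces to "non-empty `shm_name` and positive `pool_size`". The `select_transport` helper, its string return values, and the logger name are hypothetical; the real method constructs strategy objects rather than returning labels.

```python
import logging

logger = logging.getLogger("non_gpu_transfer_sketch")


def select_transport(instance_id, shm_name, pool_size):
    """Pick a transport label and log the choice once per instance."""
    shm_active = bool(shm_name) and pool_size > 0
    if shm_active:
        logger.info(
            "Instance %s non-GPU context using SHM transport "
            "(shm_name=%s, pool_size=%d)",
            instance_id,
            shm_name,
            pool_size,
        )
        return "shm"
    logger.info(
        "Instance %s non-GPU context using pickle transport",
        instance_id,
    )
    return "pickle"
```

Because a disabled SHM config is returned as an empty name and zero size, the pickle branch also covers the case where the capacity guard rejected SHM.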

The following is the prior conversation context from the user's chat exploration (may be truncated):

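Assistant: PR #308's implementation is correct; no logic was dropped:

  1. server.py: only one line changed, NonGPUTransferModule(ctx) → NonGPUTransferModule(ctx, mp_config)
  2. non_gpu_transfer.py: __init__ accepts mp_config and stores _shm_name_override; _compute_shm_pool_info() prefers the override (empty string → SHM disabled; non-empty → normalized and returned with pool_size)
  3. Tests: cover both the empty-string (disabled) case and the custom-name case

When you say "the logic is all gone", do you mean this PR #308, or the earlier PR #307 (the storage manager approach)? If it is #307, you can simply close that one. The #308 diff looks complete and correct.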

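User: On the hlin99:ww21_PR_shm branch, can you check whether all the logic in this commit (c61c2156cb445afbd467205d6e92f1d33367f2d9) has been carried over to the latest code?

A: This commit is only a rename: it renames the transport/ directory to worker_transfer/ and renames transfer_strategy.py to server_transfer.py. There are no business-logic changes.

Let me check whether these files are all in the right place in the latest code:

A: Commit c61c2156 only renames transport/ → worker_transfer/ and changes the import path from lmcache.v1.multiprocess.transport to lmcache.v1.multiprocess.worker_transfer.

The current ww21_PR_shm branch fully retains these changes:

| Change in the commit | Current branch state |
| --- | --- |
| transport/ → worker_transfer/ directory rename | worker_transfer/ exists and contains __init__.py, base.py, pickle.py, shm.py, shm_types.py; all SHAs match |
| server.py import changed from transport.base → worker_transfer.base | ✅ However, server.py was later refactored into modules (compositor pattern); the non-GPU logic moved to modules/non_gpu_transfer.py, and its import already uses worker_transfer.base |
| server_transfer.py import update | ✅ SHA 075ccfb7 matches the commit |
| transfer_context.py import update | ✅ Already uses worker_transfer.base |
| Test import updates | ✅ All changed to worker_transfer.* |

Conclusion: all of the logic in commit c61c2156 is already present in the latest code. The commit is purely a path rename with no functional change.

Which file or function do you mean by "the logic is all gone"? Could you paste the code snippet you found missing?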

User: Is there logic to fall back to pickle when SHM < L1?

A: Yes. ShmTransferStrategy receives fallback_strategy: PickleTransferStrategy at construction (line 224), and commit_store already contains fallback logic (lines 306-313):

if cpu_data != b"":
    return self._fallback_strategy.commit_store(...)

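This is a per-request fallback: when the worker sends non-empty cpu_data (meaning the worker used the pickle path), the server side automatically downgrades to a pickle commit.

But what you are asking about, "fall back to pickle when SHM < L1", is different: at registration time, when /dev/shm is too small to hold the L1 pool, the whole instance should use the pickle strategy directly. That logic does not exist.

Currently, at line 236 of non_gpu_transfer.py: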

if shm_name and pool_size > 0:
    strategy = ShmTransferStrategy(...)

As long as the config has a shm_name and pool_size > 0, SHM is selected; nothing checks whether the actual free space in /dev/shm is ≥ pool_size.

Should a check like this be added?

import shutil
shm_avail = shutil.disk_usage("/dev/shm").free
if shm_name and pool_size > 0 and shm_avail >= pool_size:
    strategy = ShmTransferStrategy(...)
else:
    strategy = PickleTransferStrategy(...)
    logger.warning("SHM unavailable or too small, falling back to pickle")

Do you want me to open a PR to add this?

User: Go back to commit c61c2156cb445afbd467205d6e92f1d33367f2d9; it does check whether the SHM size is larger than L1.

Assistan...

This pull request was created from Copilot chat.

Copilot AI changed the title from "[WIP] Add missing /dev/shm capacity check and logging to NonGPUTransferModule" to "Restore SHM safety fallback and transport selection observability in NonGPUTransferModule" May 27, 2026
Copilot finished work on behalf of hlin99 May 27, 2026 07:15
Copilot AI requested a review from hlin99 May 27, 2026 07:15
@hlin99 hlin99 closed this May 28, 2026
@hlin99 hlin99 deleted the copilot/add-shm-capacity-check-logging branch May 28, 2026 12:19