Restore SHM safety fallback and transport selection observability in NonGPUTransferModule #309
Closed
Copilot wants to merge 4 commits into
Copilot AI changed the title from "[WIP] Add missing /dev/shm capacity check and logging to NonGPUTransferModule" to "Restore SHM safety fallback and transport selection observability in NonGPUTransferModule" on May 27, 2026
This PR restores two behaviors that were present before the compositor refactor: proactive `/dev/shm` capacity gating for SHM transport, and explicit per-instance transport selection logs during non-GPU context registration. Without these, SHM can fail late at allocation time and operators cannot easily tell whether SHM or pickle was selected.

SHM capacity guard in `_compute_shm_pool_info`
- Checks `/dev/shm` after SHM name/pool resolution and before returning active SHM config.
- If `free_bytes < pool_size`, the module now logs a warning and returns disabled SHM config (`shm_name=""`, `pool_size=0`) to force pickle fallback.
- If `/dev/shm` capacity cannot be checked (`OSError`), the module now logs and disables SHM conservatively.

Transport selection logging in `register_kv_cache_non_gpu_context`
- Logs `instance_id`, `shm_name`, and `pool_size`.
- Logs `instance_id` with pickle selection.

Focused regression coverage
- insufficient `/dev/shm` capacity disables SHM;
- […] `/dev/shm` probing;

Original prompt
Add missing `/dev/shm` capacity check and transport selection logging to `NonGPUTransferModule`.

These were present in the old monolithic `server.py` (commit c61c215) but lost during the compositor pattern refactor.

Changes needed in `lmcache/v1/multiprocess/modules/non_gpu_transfer.py`:

1. Add `/dev/shm` capacity check in `_compute_shm_pool_info`

Before returning a valid SHM pool info, check that `/dev/shm` has enough free space. If not, log a warning and return empty (disable SHM, fall back to pickle).

Original logic from old `_resolve_shm_config`:

This check should happen AFTER determining that `shm_name` is non-empty and `pool_size > 0`, but BEFORE returning the valid SHM pool info.

2. Add transport selection logging in `register_kv_cache_non_gpu_context`

After deciding the strategy (SHM vs pickle), log which transport was selected:
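A minimal sketch of what such logging could look like, under stated assumptions: it uses the standard `logging` module, the helper name `log_transport_selection` and the message layout are hypothetical, and only the logged fields (`instance_id`, `shm_name`, `pool_size`) come from the PR summary. It is not the LMCache implementation.

```python
import logging

logger = logging.getLogger(__name__)


def log_transport_selection(instance_id: int, shm_name: str, pool_size: int) -> str:
    """Log which transport an instance will use and return the message.

    Hypothetical helper for illustration only. SHM is treated as selected
    when shm_name is non-empty and pool_size is positive; anything else
    is treated as the pickle fallback.
    """
    if shm_name and pool_size > 0:
        message = (
            f"instance_id={instance_id} transport=shm "
            f"shm_name={shm_name} pool_size={pool_size}"
        )
    else:
        message = f"instance_id={instance_id} transport=pickle"
    logger.info(message)
    return message
```

Returning the message as well as logging it is a convenience of this sketch so that the selection can be asserted in a test without capturing log output.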
Why:
- If `/dev/shm` is too small, the SHM allocator will fail at runtime instead of gracefully falling back to pickle.

The following is the prior conversation context from the user's chat exploration (may be truncated):
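Assistant: The implementation in PR #308 is correct; no logic was lost:
- `server.py`: only one line changed: `NonGPUTransferModule(ctx)` → `NonGPUTransferModule(ctx, mp_config)`
- `non_gpu_transfer.py`: `__init__` accepts `mp_config` and stores `_shm_name_override`; `_compute_shm_pool_info()` prefers the override (empty string → SHM disabled; non-empty → normalized, then returned with pool_size)

When you say "the logic is all gone", do you mean this PR #308 or the earlier PR #307 (the storage manager approach)? If it is #307, you can simply close that one. The diff for #308 looks complete and correct.

User: On the `hlin99:ww21_PR_shm` branch, can you check whether all the logic in this commit (c61c2156cb445afbd467205d6e92f1d33367f2d9) has been moved to the latest code?

A: This commit is only a rename: it renames the `transport/` directory to `worker_transfer/` and renames `transfer_strategy.py` to `server_transfer.py`. There are no business logic changes. Let me check whether these files are all in the right place in the latest code:

A: Commit `c61c2156` only renames `transport/` → `worker_transfer/` and changes the import path from `lmcache.v1.multiprocess.transport` to `lmcache.v1.multiprocess.worker_transfer`.

The current `ww21_PR_shm` branch fully retains these changes:
- `transport/` → `worker_transfer/` directory rename: `worker_transfer/` exists and contains `__init__.py`, `base.py`, `pickle.py`, `shm.py`, `shm_types.py`; all SHAs match
- `server.py` import changed from `transport.base` → `worker_transfer.base`: `modules/non_gpu_transfer.py`, import already uses `worker_transfer.base`
- `server_transfer.py` import update: `075ccfb7`, matches the commit
- `transfer_context.py` import update: `worker_transfer.base`, `worker_transfer.*`

Conclusion: all the logic from commit `c61c2156` is already in the latest code. This commit is purely a path rename with no functional logic changes. Which file or function do you mean by "the logic is all gone"? Could you paste the code snippet you found missing?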
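User: Is there logic to fall back to pickle when SHM < L1?

A: Yes. `ShmTransferStrategy` receives `fallback_strategy: PickleTransferStrategy` at construction (line 224), and `commit_store` already has fallback logic (lines 306-313):

This is a per-request fallback: when the worker sends non-empty `cpu_data` (meaning the worker used the pickle path), the server side automatically downgrades to a pickle commit.

But what you are asking about is "fall back to pickle when SHM < L1", that is, at registration time, when `/dev/shm` does not have enough capacity to hold the L1 pool, the whole instance should use the pickle strategy directly. That logic is missing.

Currently, at line 236 of `non_gpu_transfer.py`:

As long as the config has `shm_name` and `pool_size > 0`, SHM is selected; there is no check that the free space actually available in `/dev/shm` is ≥ pool_size.

Should a check like this be added?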
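A check along these lines might be sketched as follows. This is an illustration under assumptions, not the LMCache code: the helper names `probe_free_bytes` and `gate_shm_on_capacity` are hypothetical, free space is read with `os.statvfs` (POSIX only), and the disabled result `("", 0)` mirrors the `shm_name=""`, `pool_size=0` convention described in the PR summary.

```python
import logging
import os
from typing import Callable, Tuple

logger = logging.getLogger(__name__)


def probe_free_bytes(path: str = "/dev/shm") -> int:
    """Return the bytes available to unprivileged users on the filesystem at path.

    Raises OSError if the path cannot be inspected.
    """
    stats = os.statvfs(path)
    return stats.f_bavail * stats.f_frsize


def gate_shm_on_capacity(
    shm_name: str,
    pool_size: int,
    probe: Callable[[], int] = probe_free_bytes,
) -> Tuple[str, int]:
    """Return (shm_name, pool_size) if SHM can hold the pool, else ("", 0).

    Hypothetical helper: an empty name, a non-positive pool size, a failed
    probe, or insufficient free space all disable SHM so that the caller
    can fall back to pickle.
    """
    if not shm_name or pool_size <= 0:
        return "", 0
    try:
        free_bytes = probe()
    except OSError as exc:
        # Capacity is unknown, so disable SHM conservatively.
        logger.warning("Cannot check /dev/shm capacity (%s); disabling SHM", exc)
        return "", 0
    if free_bytes < pool_size:
        logger.warning(
            "/dev/shm has %d free bytes but the pool needs %d; falling back to pickle",
            free_bytes,
            pool_size,
        )
        return "", 0
    return shm_name, pool_size
```

Passing the probe as a parameter is a design choice of this sketch: it lets the insufficient-space and `OSError` paths be exercised in tests without depending on the real state of `/dev/shm`.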
Want me to open a PR to add this?

User: Go back to commit c61c2156cb445afbd467205d6e92f1d33367f2d9; it does check whether the SHM size is larger than L1.
Assistan...
This pull request was created from Copilot chat.