fix: fail fast on NIXL region layout mismatch in recv transfer#246
fix: fail fast on NIXL region layout mismatch in recv transfer#246the-david-oy wants to merge 1 commit into
Conversation
… pairs
When MX_CONTIGUOUS_REG=1 is set on both source and target, each side
groups adjacent tensors into contiguous memory regions and registers
those regions with NIXL instead of individual tensors. The region
layout is computed from each process's PyTorch CUDA allocator state,
which is NOT deterministic across processes: fragmentation, temporary
buffers, and pinned-memory state vary between runs. Two workers loading
the same model weights can therefore produce different region
groupings.
Previously the recv path paired source regions with local regions by
index, logged per-region WARNINGs when sizes disagreed, and then still
built `(src, local)` pairs of mismatched length. NIXL rejected these:
makeXferReq: length mismatch at index pair 52 with local index 52
and remote index 52
NIXL_ERR_INVALID_PARAM
aborting the whole transfer and falling back to an HF download.
Fix: validate that source and local region layouts have the same count
and the same per-region sizes BEFORE building transfer descriptors. If
they disagree, raise a new typed exception `RegionLayoutMismatchError`.
The RDMA strategy maps this to `ManifestMismatchError` so the caller
simply tries the next source candidate (without marking this source
STALE, since the source itself is healthy — the mismatch is a property
of the local allocator state). If no candidate's layout matches, the
loader falls through to a non-RDMA strategy as before.
The `[Contiguous Registration]` optimization is preserved unchanged
when layouts do match; only the failure mode is fixed.
Added unit tests for the new validation helper that run without NIXL,
CUDA, or a GPU (pure Python inputs), including a regression test for
the observed 223 vs 219 Llama-3.1-8B mismatch.
WalkthroughThe changes introduce stricter validation for RDMA region layout compatibility. A new exception type Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
modelexpress_client/python/modelexpress/nixl_transfer.py (1)
58-60: Redundant alias with a misleading justification — consider removing.The comment claims
_RegionLayoutMismatchErroris "kept short for the raise site," but the private name (_RegionLayoutMismatchError, 27 chars) is actually longer than the public one (RegionLayoutMismatchError, 26 chars). There is only one raise site (line 460) and it could use the public name directly.♻️ Proposed cleanup
-# Internal alias kept short for the raise site; re-exported via the public -# name above for callers that want to catch it. -_RegionLayoutMismatchError = RegionLayoutMismatchError - - class NixlTransferManager:And update line 460:
- raise _RegionLayoutMismatchError(mismatch_summary) + raise RegionLayoutMismatchError(mismatch_summary)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelexpress_client/python/modelexpress/nixl_transfer.py` around lines 58 - 60, The private alias _RegionLayoutMismatchError and its comment are redundant and misleading; remove the alias declaration and its comment, then update the single raise site that currently uses _RegionLayoutMismatchError to raise RegionLayoutMismatchError directly (ensure any re-export logic still exposes RegionLayoutMismatchError if needed).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelexpress_client/python/modelexpress/nixl_transfer.py`:
- Around line 323-346: The comprehension uses a single-letter variable name `l`
which Ruff flags as ambiguous; change that to a descriptive name (e.g.,
`local_sz`) so the unpacking and f-string are clear: update the generator in
`sample = ", ".join(f"region {i}: source={s} local={l}" for i, s, l in head)` to
use `for i, s, local_sz in head` and the f-string to `local={local_sz}` (no
other logic changes needed—`size_mismatches`, `head`, `sample`,
`source_regions`, and `local_regions` stay the same).
---
Nitpick comments:
In `@modelexpress_client/python/modelexpress/nixl_transfer.py`:
- Around line 58-60: The private alias _RegionLayoutMismatchError and its
comment are redundant and misleading; remove the alias declaration and its
comment, then update the single raise site that currently uses
_RegionLayoutMismatchError to raise RegionLayoutMismatchError directly (ensure
any re-export logic still exposes RegionLayoutMismatchError if needed).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 77de96ca-25a0-4162-a758-8d7e1a27e8bb
📒 Files selected for processing (3)
modelexpress_client/python/modelexpress/load_strategy/rdma_strategy.pymodelexpress_client/python/modelexpress/nixl_transfer.pymodelexpress_client/python/tests/test_nixl_transfer.py
| size_mismatches: list[tuple[int, int, int]] = [] | ||
| for i, (src_region, (_local_addr, local_size)) in enumerate( | ||
| zip(source_regions, local_regions, strict=True) | ||
| ): | ||
| if src_region.size != local_size: | ||
| size_mismatches.append((i, src_region.size, local_size)) | ||
|
|
||
| if size_mismatches: | ||
| # Log just the first few so errors stay readable. | ||
| head = size_mismatches[:5] | ||
| sample = ", ".join( | ||
| f"region {i}: source={s} local={l}" for i, s, l in head | ||
| ) | ||
| suffix = ( | ||
| f" (+{len(size_mismatches) - len(head)} more)" | ||
| if len(size_mismatches) > len(head) | ||
| else "" | ||
| ) | ||
| return False, ( | ||
| f"{len(size_mismatches)} region size mismatch(es) " | ||
| f"out of {len(source_regions)}: {sample}{suffix} " | ||
| "(PyTorch CUDA allocator non-determinism produced different " | ||
| "contiguous-region groupings across processes)" | ||
| ) |
There was a problem hiding this comment.
Rename single-letter l — Ruff E741.
Static analysis flags l as ambiguous (easily confused with 1 / I). Rename to something like loc or local_sz in the unpacking and f-string.
🔧 Proposed fix
- size_mismatches: list[tuple[int, int, int]] = []
+ size_mismatches: list[tuple[int, int, int]] = []
for i, (src_region, (_local_addr, local_size)) in enumerate(
zip(source_regions, local_regions, strict=True)
):
if src_region.size != local_size:
size_mismatches.append((i, src_region.size, local_size))
if size_mismatches:
- # Log just the first few so errors stay readable.
head = size_mismatches[:5]
sample = ", ".join(
- f"region {i}: source={s} local={l}" for i, s, l in head
+ f"region {idx}: source={src} local={loc}" for idx, src, loc in head
)🧰 Tools
🪛 Ruff (0.15.10)
[error] 334-334: Ambiguous variable name: l
(E741)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelexpress_client/python/modelexpress/nixl_transfer.py` around lines 323 -
346, The comprehension uses a single-letter variable name `l` which Ruff flags
as ambiguous; change that to a descriptive name (e.g., `local_sz`) so the
unpacking and f-string are clear: update the generator in `sample = ",
".join(f"region {i}: source={s} local={l}" for i, s, l in head)` to use `for i,
s, local_sz in head` and the f-string to `local={local_sz}` (no other logic
changes needed—`size_mismatches`, `head`, `sample`, `source_regions`, and
`local_regions` stay the same).
Problem
When
MX_CONTIGUOUS_REG=1is set on both source and target, each side groups adjacent tensors into contiguous memory regions and registers those regions with NIXL instead of individual tensors (for Llama-3.1-8B this reduces 324 tensors -> 223 regions, a 31.2% reduction in descriptors).The recv path in
nixl_transfer.pypaired source regions with local regions by index. When the layouts disagreed it logged per-regionWARNINGs but still built(src, local)pairs of mismatched length and passed them to NIXL, which rejected the entire transfer:After NIXL rejected the transfer the worker fell back to a full HuggingFace download, defeating the purpose of the P2P path.
Root cause
The contiguous-region layout is computed from each process's PyTorch CUDA caching allocator state, which is not deterministic across processes. Fragmentation pattern, temporary buffer lifetimes, and pinned-memory state all vary between runs. Two workers that load the same model weights can therefore produce different region groupings (e.g. the observed 223 source regions vs 219 local regions). Index-based pairing then inevitably produces mismatched pairs.
Fix
Validate that source and local region layouts have the same count and the same per-region sizes before building transfer descriptors. If they disagree, raise a new typed exception
RegionLayoutMismatchErrorfromnixl_transfer.RdmaStrategytranslates this toManifestMismatchError, which its retry loop already handles: it tries the next source candidate without marking the current source asSTALE(because the source itself is healthy — the mismatch is a property of the local allocator state). If no candidate's layout matches, the loader falls through to a non-RDMA strategy exactly as before when no source is usable.The
[Contiguous Registration]optimization is preserved unchanged for the common case where layouts do match; only the failure mode is fixed. The old warning-spam-and-proceed behavior is gone.Chose the fail-fast / typed-exception approach over a per-tensor fallback because: once the source publishes only region descriptors (
__region_N__), the recv side has no per-tensor source info to fall back to; a per-tensor transfer would require a metadata-schema change. AddingRegionLayoutMismatchErrorgives clean fallback to the next source now and leaves a natural place to hook in a layout fingerprint later if desired.Summary by CodeRabbit
Bug Fixes
Tests