Skip to content

fix: fail fast on NIXL region layout mismatch in recv transfer#246

Open
the-david-oy wants to merge 1 commit into
ai-dynamo:mainfrom
the-david-oy:fix/region-transfer-size-mismatch
Open

fix: fail fast on NIXL region layout mismatch in recv transfer#246
the-david-oy wants to merge 1 commit into
ai-dynamo:mainfrom
the-david-oy:fix/region-transfer-size-mismatch

Conversation

@the-david-oy
Copy link
Copy Markdown

@the-david-oy the-david-oy commented Apr 20, 2026

Problem

When MX_CONTIGUOUS_REG=1 is set on both source and target, each side groups adjacent tensors into contiguous memory regions and registers those regions with NIXL instead of individual tensors (for Llama-3.1-8B this reduces 324 tensors -> 223 regions, a 31.2% reduction in descriptors).

The recv path in nixl_transfer.py paired source regions with local regions by index. When the layouts disagreed it logged per-region WARNINGs but still built (src, local) pairs of mismatched length and passed them to NIXL, which rejected the entire transfer:

WARNING nixl_transfer.py:383 Region 84 size mismatch: source=33554432, local=83886080
WARNING nixl_transfer.py:383 Region 85 size mismatch: source=33554432, local=83886080
WARNING nixl_transfer.py:383 Region 88 size mismatch: source=33554432, local=4
WARNING nixl_transfer.py:383 Region 89 size mismatch: source=33554432, local=4
WARNING nixl_transfer.py:383 Region 90 size mismatch: source=33554432, local=4
WARNING nixl_transfer.py:383 Region 91 size mismatch: source=33554432, local=4
WARNING nixl_transfer.py:383 Region 92 size mismatch: source=4, local=16388
WARNING nixl_transfer.py:383 Region 216 size mismatch: source=16388, local=24576
WARNING nixl_transfer.py:383 Region 217 size mismatch: source=4, local=234881024
WARNING nixl_transfer.py:383 Region 218 size mismatch: source=4, local=167772160
INFO nixl_transfer.py:393 [Region Transfer] Matched 219 regions, 15.69 GB
INFO nixl_transfer.py:429 [Region Transfer] Skipping coalesce - already optimized with 219 regions
E nixl_agent.cpp:752] makeXferReq: length mismatch at index pair 52 with local index 52 and remote index 52
WARNING rdma_strategy.py:89 Source transfer failed: RDMA receive failed: NIXL_ERR_INVALID_PARAM

After NIXL rejected the transfer the worker fell back to a full HuggingFace download, defeating the purpose of the P2P path.

Root cause

The contiguous-region layout is computed from each process's PyTorch CUDA caching allocator state, which is not deterministic across processes. Fragmentation pattern, temporary buffer lifetimes, and pinned-memory state all vary between runs. Two workers that load the same model weights can therefore produce different region groupings (e.g. the observed 223 source regions vs 219 local regions). Index-based pairing then inevitably produces mismatched pairs.

Fix

Validate that source and local region layouts have the same count and the same per-region sizes before building transfer descriptors. If they disagree, raise a new typed exception RegionLayoutMismatchError from nixl_transfer.

RdmaStrategy translates this to ManifestMismatchError, which its retry loop already handles: it tries the next source candidate without marking the current source as STALE (because the source itself is healthy — the mismatch is a property of the local allocator state). If no candidate's layout matches, the loader falls through to a non-RDMA strategy exactly as before when no source is usable.

The [Contiguous Registration] optimization is preserved unchanged for the common case where layouts do match; only the failure mode is fixed. The old warning-spam-and-proceed behavior is gone.

Chose the fail-fast / typed-exception approach over a per-tensor fallback because: once the source publishes only region descriptors (__region_N__), the recv side has no per-tensor source info to fall back to; a per-tensor transfer would require a metadata-schema change. Adding RegionLayoutMismatchError gives clean fallback to the next source now and leaves a natural place to hook in a layout fingerprint later if desired.

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced region layout validation to detect mismatches upfront, preventing transfers with incompatible source-destination layouts; replaced silent warnings with clear error messages.
  • Tests

    • Added comprehensive test coverage for region layout validation scenarios.

… pairs

When MX_CONTIGUOUS_REG=1 is set on both source and target, each side
groups adjacent tensors into contiguous memory regions and registers
those regions with NIXL instead of individual tensors. The region
layout is computed from each process's PyTorch CUDA allocator state,
which is NOT deterministic across processes: fragmentation, temporary
buffers, and pinned-memory state vary between runs. Two workers loading
the same model weights can therefore produce different region
groupings.

Previously the recv path paired source regions with local regions by
index, logged per-region WARNINGs when sizes disagreed, and then still
built `(src, local)` pairs of mismatched length. NIXL rejected these:

    makeXferReq: length mismatch at index pair 52 with local index 52
    and remote index 52
    NIXL_ERR_INVALID_PARAM

aborting the whole transfer and falling back to an HF download.

Fix: validate that source and local region layouts have the same count
and the same per-region sizes BEFORE building transfer descriptors. If
they disagree, raise a new typed exception `RegionLayoutMismatchError`.
The RDMA strategy maps this to `ManifestMismatchError` so the caller
simply tries the next source candidate (without marking this source
STALE, since the source itself is healthy — the mismatch is a property
of the local allocator state). If no candidate's layout matches, the
loader falls through to a non-RDMA strategy as before.

The `[Contiguous Registration]` optimization is preserved unchanged
when layouts do match; only the failure mode is fixed.

Added unit tests for the new validation helper that run without NIXL,
CUDA, or a GPU (pure Python inputs), including a regression test for
the observed 223 vs 219 Llama-3.1-8B mismatch.
@the-david-oy the-david-oy changed the title Fail fast on NIXL region layout mismatch in recv transfer fix: fail fast on NIXL region layout mismatch in recv transfer Apr 20, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 20, 2026

Walkthrough

The changes introduce stricter validation for RDMA region layout compatibility. A new exception type RegionLayoutMismatchError is added to detect layout mismatches early in the transfer process. Error handling in RDMA strategy is updated to recognize this exception and propagate it upstream as a manifest mismatch rather than a generic transfer error. Region layout validation now fails fast instead of issuing warnings.

Changes

Cohort / File(s) Summary
Region Layout Validation
modelexpress_client/python/modelexpress/nixl_transfer.py
Added new exception RegionLayoutMismatchError; introduced _validate_region_layout_match() method to compare source and local region layouts by count and per-index size. Updated receive_from_source() to perform fail-fast validation instead of warning on mismatches, with human-readable mismatch reporting.
RDMA Strategy Error Handling
modelexpress_client/python/modelexpress/load_strategy/rdma_strategy.py
Added dedicated exception handler for RegionLayoutMismatchError in RDMA receive path; re-raises as ManifestMismatchError to allow upstream candidate selection to treat region incompatibilities as manifest issues.
Unit Tests
modelexpress_client/python/tests/test_nixl_transfer.py
New test module validating _validate_region_layout_match() behavior; covers success cases with matching layouts, failure cases with region count or size mismatches, bounded mismatch messaging, regression scenarios, and exception type verification.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 With whiskers twitching, I validate with care,
Region layouts must align, beyond compare!
No more warnings whispered soft—we fail quite fast,
When mismatches are found, exceptions are cast!
Hop-hop, the transfer knows its place,
Safety and strictness win the race!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 92.86% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title 'fix: fail fast on NIXL region layout mismatch in recv transfer' directly describes the main change: adding fail-fast validation for region layout mismatches in the NIXL recv transfer path, which is the core objective of this PR.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
modelexpress_client/python/modelexpress/nixl_transfer.py (1)

58-60: Redundant alias with a misleading justification — consider removing.

The comment claims _RegionLayoutMismatchError is "kept short for the raise site," but the private name (_RegionLayoutMismatchError, 27 chars) is actually longer than the public one (RegionLayoutMismatchError, 26 chars). There is only one raise site (line 460) and it could use the public name directly.

♻️ Proposed cleanup
-# Internal alias kept short for the raise site; re-exported via the public
-# name above for callers that want to catch it.
-_RegionLayoutMismatchError = RegionLayoutMismatchError
-
-
 class NixlTransferManager:

And update line 460:

-                raise _RegionLayoutMismatchError(mismatch_summary)
+                raise RegionLayoutMismatchError(mismatch_summary)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelexpress_client/python/modelexpress/nixl_transfer.py` around lines 58 -
60, The private alias _RegionLayoutMismatchError and its comment are redundant
and misleading; remove the alias declaration and its comment, then update the
single raise site that currently uses _RegionLayoutMismatchError to raise
RegionLayoutMismatchError directly (ensure any re-export logic still exposes
RegionLayoutMismatchError if needed).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelexpress_client/python/modelexpress/nixl_transfer.py`:
- Around line 323-346: The comprehension uses a single-letter variable name `l`
which Ruff flags as ambiguous; change that to a descriptive name (e.g.,
`local_sz`) so the unpacking and f-string are clear: update the generator in
`sample = ", ".join(f"region {i}: source={s} local={l}" for i, s, l in head)` to
use `for i, s, local_sz in head` and the f-string to `local={local_sz}` (no
other logic changes needed—`size_mismatches`, `head`, `sample`,
`source_regions`, and `local_regions` stay the same).

---

Nitpick comments:
In `@modelexpress_client/python/modelexpress/nixl_transfer.py`:
- Around line 58-60: The private alias _RegionLayoutMismatchError and its
comment are redundant and misleading; remove the alias declaration and its
comment, then update the single raise site that currently uses
_RegionLayoutMismatchError to raise RegionLayoutMismatchError directly (ensure
any re-export logic still exposes RegionLayoutMismatchError if needed).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 77de96ca-25a0-4162-a758-8d7e1a27e8bb

📥 Commits

Reviewing files that changed from the base of the PR and between 5566a62 and 19c8048.

📒 Files selected for processing (3)
  • modelexpress_client/python/modelexpress/load_strategy/rdma_strategy.py
  • modelexpress_client/python/modelexpress/nixl_transfer.py
  • modelexpress_client/python/tests/test_nixl_transfer.py

Comment on lines +323 to +346
size_mismatches: list[tuple[int, int, int]] = []
for i, (src_region, (_local_addr, local_size)) in enumerate(
zip(source_regions, local_regions, strict=True)
):
if src_region.size != local_size:
size_mismatches.append((i, src_region.size, local_size))

if size_mismatches:
# Log just the first few so errors stay readable.
head = size_mismatches[:5]
sample = ", ".join(
f"region {i}: source={s} local={l}" for i, s, l in head
)
suffix = (
f" (+{len(size_mismatches) - len(head)} more)"
if len(size_mismatches) > len(head)
else ""
)
return False, (
f"{len(size_mismatches)} region size mismatch(es) "
f"out of {len(source_regions)}: {sample}{suffix} "
"(PyTorch CUDA allocator non-determinism produced different "
"contiguous-region groupings across processes)"
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Rename single-letter l — Ruff E741.

Static analysis flags l as ambiguous (easily confused with 1 / I). Rename to something like loc or local_sz in the unpacking and f-string.

🔧 Proposed fix
-        size_mismatches: list[tuple[int, int, int]] = []
+        size_mismatches: list[tuple[int, int, int]] = []
         for i, (src_region, (_local_addr, local_size)) in enumerate(
             zip(source_regions, local_regions, strict=True)
         ):
             if src_region.size != local_size:
                 size_mismatches.append((i, src_region.size, local_size))

         if size_mismatches:
-            # Log just the first few so errors stay readable.
             head = size_mismatches[:5]
             sample = ", ".join(
-                f"region {i}: source={s} local={l}" for i, s, l in head
+                f"region {idx}: source={src} local={loc}" for idx, src, loc in head
             )
🧰 Tools
🪛 Ruff (0.15.10)

[error] 334-334: Ambiguous variable name: l

(E741)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelexpress_client/python/modelexpress/nixl_transfer.py` around lines 323 -
346, The comprehension uses a single-letter variable name `l` which Ruff flags
as ambiguous; change that to a descriptive name (e.g., `local_sz`) so the
unpacking and f-string are clear: update the generator in `sample = ",
".join(f"region {i}: source={s} local={l}" for i, s, l in head)` to use `for i,
s, local_sz in head` and the f-string to `local={local_sz}` (no other logic
changes needed—`size_mismatches`, `head`, `sample`, `source_regions`, and
`local_regions` stay the same).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant