Skip to content

TL/CUDA: fix alltoall(v) race#1268

Open
ikryukov wants to merge 1 commit intoopenucx:masterfrom
ikryukov:a2a_fix
Open

TL/CUDA: fix alltoall(v) race#1268
ikryukov wants to merge 1 commit intoopenucx:masterfrom
ikryukov:a2a_fix

Conversation

@ikryukov
Copy link
Collaborator

What

The issue is a use-after-release race condition in the alltoallv unmap path. Here's the sequence that causes it:

  1. All peers finish their copies and enter the completion barrier (ALLTOALL_CE_STAGE_BAR).
  2. The barrier completes — every peer has signaled it is done.
  3. A fast peer proceeds to ucc_tl_cuda_put_sync(task), which releases its shared-memory sync slot, and immediately starts a new collective that reuses the same slot. This overwrites peer_sync->mem_info_src.ptr with a completely different device address.
  4. A slow peer is still in ucc_tl_cuda_alltoallv_unmap. It reads peer_sync->mem_info_src.ptr from the volatile shared-memory sync slot — but that value now belongs to the fast peer's next collective. It passes this stale key to ucc_tl_cuda_unmap_memhandle, which looks it up in the IPC cache, finds no matching entry, and dereferences a NULL page-table pointer, causing a segfault.

How ?

The fix snapshots each peer's device pointer (mem_info_src.ptr / mem_info_dst.ptr) into task-local arrays (peer_src_d_ptr[i] / peer_dst_d_ptr[dst]) at map time — before the completion barrier — so the unmap path reads a stable, private copy that cannot be overwritten by a peer that has moved on to a new collective.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@ikryukov ikryukov changed the title TL/CUDA: fix alltoallv race TL/CUDA: fix alltoall(v) race Feb 26, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 26, 2026

Greptile Summary

This PR fixes a critical use-after-release race condition in the CUDA alltoallv unmap path. The issue occurred when a fast peer completed the collective, released its shared-memory sync slot, and immediately reused it for a new collective—overwriting the device pointer before a slow peer could unmap it. The slow peer would then read the stale pointer, causing a NULL dereference during cache lookup.

The fix elegantly addresses this by snapshotting each peer's device pointers (mem_info_src.ptr and mem_info_dst.ptr) into task-local arrays during the map phase (before the completion barrier). The unmap phase then uses these stable copies instead of reading from volatile shared memory. This ensures consistency even when peers move on to new collectives.

  • Adds peer_src_d_ptr[] and peer_dst_d_ptr[] arrays to the task structure
  • Captures device pointers at lines 155-156 and 182-183 during ucc_tl_cuda_alltoallv_setup_test()
  • Uses captured values at lines 457 and 471 in ucc_tl_cuda_alltoallv_unmap()
  • Removes the now-unnecessary volatile peer_sync variable from the unmap function
  • Includes various formatting improvements (whitespace, line breaks, comment style) that don't affect functionality

The timing is correct: snapshots are taken during the SETUP stage before any copies execute, and used after the completion barrier where the race can occur. Array indexing is consistent between map and unmap phases.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix directly addresses a well-understood race condition with a clean, minimal solution. The implementation correctly snapshots device pointers at map time (before the race window) and uses these stable copies during unmap. Array indexing is consistent, data types are appropriate, and the timing is correct. No logical errors or edge cases were found. The extensive formatting changes are purely cosmetic and don't affect functionality.
  • No files require special attention

Important Files Changed

Filename Overview
src/components/tl/cuda/tl_cuda.h Adds two arrays to snapshot device pointers at map time, preventing race condition during unmap
src/components/tl/cuda/alltoallv/alltoallv_ce.c Implements fix by capturing device pointers during setup and using snapshots during unmap; includes formatting improvements

Last reviewed commit: b08ce8a

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant