TL/CUDA: fix alltoall(v) race by ikryukov · Pull Request #1268 · openucx/ucc

ikryukov · 2026-02-26T15:42:25Z

What

The issue is a use-after-release race condition in the alltoallv unmap path. Here's the sequence that causes it:

All peers finish their copies and enter the completion barrier (ALLTOALL_CE_STAGE_BAR).
The barrier completes — every peer has signaled it is done.
A fast peer proceeds to ucc_tl_cuda_put_sync(task), which releases its shared-memory sync slot, and immediately starts a new collective that reuses the same slot. This overwrites peer_sync->mem_info_src.ptr with a completely different device address.
A slow peer is still in ucc_tl_cuda_alltoallv_unmap. It reads peer_sync->mem_info_src.ptr from the volatile shared-memory sync slot — but that value now belongs to the fast peer's next collective. It passes this stale key to ucc_tl_cuda_unmap_memhandle, which looks it up in the IPC cache, finds no matching entry, and dereferences a NULL page-table pointer, causing a segfault.

How ?

The fix snapshots each peer's device pointer (mem_info_src.ptr / mem_info_dst.ptr) into task-local arrays (peer_src_d_ptr[i] / peer_dst_d_ptr[dst]) at map time — before the completion barrier — so the unmap path reads a stable, private copy that cannot be overwritten by a peer that has moved on to a new collective.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

greptile-apps · 2026-02-26T15:47:24Z

Greptile Summary

This PR fixes a critical use-after-release race condition in the CUDA alltoallv unmap path. The issue occurred when a fast peer completed the collective, released its shared-memory sync slot, and immediately reused it for a new collective—overwriting the device pointer before a slow peer could unmap it. The slow peer would then read the stale pointer, causing a NULL dereference during cache lookup.

The fix elegantly addresses this by snapshotting each peer's device pointers (mem_info_src.ptr and mem_info_dst.ptr) into task-local arrays during the map phase (before the completion barrier). The unmap phase then uses these stable copies instead of reading from volatile shared memory. This ensures consistency even when peers move on to new collectives.

Adds peer_src_d_ptr[] and peer_dst_d_ptr[] arrays to the task structure
Captures device pointers at lines 155-156 and 182-183 during ucc_tl_cuda_alltoallv_setup_test()
Uses captured values at lines 457 and 471 in ucc_tl_cuda_alltoallv_unmap()
Removes the now-unnecessary volatile peer_sync variable from the unmap function
Includes various formatting improvements (whitespace, line breaks, comment style) that don't affect functionality

The timing is correct: snapshots are taken during the SETUP stage before any copies execute, and used after the completion barrier where the race can occur. Array indexing is consistent between map and unmap phases.

Confidence Score: 5/5

This PR is safe to merge with minimal risk
The fix directly addresses a well-understood race condition with a clean, minimal solution. The implementation correctly snapshots device pointers at map time (before the race window) and uses these stable copies during unmap. Array indexing is consistent, data types are appropriate, and the timing is correct. No logical errors or edge cases were found. The extensive formatting changes are purely cosmetic and don't affect functionality.
No files require special attention

Important Files Changed

Filename	Overview
src/components/tl/cuda/tl_cuda.h	Adds two arrays to snapshot device pointers at map time, preventing race condition during unmap
src/components/tl/cuda/alltoallv/alltoallv_ce.c	Implements fix by capturing device pointers during setup and using snapshots during unmap; includes formatting improvements

_{Last reviewed commit: b08ce8a}

greptile-apps

_{2 files reviewed, no comments}

_{Edit Code Review Agent Settings | Greptile}

ikryukov self-assigned this Feb 26, 2026

ikryukov added the Ready-for-Review label Feb 26, 2026

ikryukov requested review from Sergei-Lebedev, janjust and nsarka February 26, 2026 15:43

TL/CUDA: fix alltoallv race

b08ce8a

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

ikryukov force-pushed the a2a_fix branch from dccfed7 to b08ce8a Compare February 26, 2026 15:44

ikryukov changed the title ~~TL/CUDA: fix alltoallv race~~ TL/CUDA: fix alltoall(v) race Feb 26, 2026

greptile-apps bot reviewed Feb 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TL/CUDA: fix alltoall(v) race#1268

TL/CUDA: fix alltoall(v) race#1268
ikryukov wants to merge 1 commit intoopenucx:masterfrom
ikryukov:a2a_fix

ikryukov commented Feb 26, 2026

Uh oh!

greptile-apps bot commented Feb 26, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

greptile-apps bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ikryukov commented Feb 26, 2026

What

How ?

Uh oh!

greptile-apps bot commented Feb 26, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant