Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
| Filename | Overview |
|---|---|
| src/components/tl/cuda/tl_cuda.h | Adds two arrays that snapshot device pointers at map time, preventing a race condition during unmap |
| src/components/tl/cuda/alltoallv/alltoallv_ce.c | Implements the fix by capturing device pointers during setup and using the snapshots during unmap; includes formatting improvements |
Last reviewed commit: b08ce8a
What
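The issue is a use-after-release race condition in the alltoallv unmap path. After the completion barrier, a peer that has moved on to a new collective can overwrite its mem_info_src.ptr / mem_info_dst.ptr while this rank's unmap path is still reading them, so the unmap no longer sees the pointer that was used at map time.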
How?
The fix snapshots each peer's device pointer (`mem_info_src.ptr` / `mem_info_dst.ptr`) into task-local arrays (`peer_src_d_ptr[i]` / `peer_dst_d_ptr[dst]`) at map time, before the completion barrier, so the unmap path reads a stable, private copy that cannot be overwritten by a peer that has moved on to a new collective.
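
The snapshot pattern can be sketched in isolation as follows. This is a minimal illustration and not the code from this PR: `peer_slot_t`, `task_t`, `task_map`, `task_unmap_src`, and `MAX_PEERS` are hypothetical names, and plain host pointers stand in for the device pointers. Only the array names `peer_src_d_ptr` / `peer_dst_d_ptr` are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PEERS 8 /* hypothetical upper bound on team size */

/* Hypothetical stand-in for a peer-visible slot. The owning peer may
 * rewrite these fields as soon as it starts its next collective. */
typedef struct {
    void *src_ptr; /* plays the role of mem_info_src.ptr */
    void *dst_ptr; /* plays the role of mem_info_dst.ptr */
} peer_slot_t;

/* Task-local state: private copies taken at map time. */
typedef struct {
    void  *peer_src_d_ptr[MAX_PEERS];
    void  *peer_dst_d_ptr[MAX_PEERS];
    size_t n_peers;
} task_t;

/* Map time (before the completion barrier): copy every peer pointer
 * into the task so later reads never touch the peer-visible slots. */
void task_map(task_t *task, const peer_slot_t *slots, size_t n_peers)
{
    assert(n_peers <= MAX_PEERS);
    task->n_peers = n_peers;
    for (size_t i = 0; i < n_peers; i++) {
        task->peer_src_d_ptr[i] = slots[i].src_ptr;
        task->peer_dst_d_ptr[i] = slots[i].dst_ptr;
    }
}

/* Unmap time (after the barrier): return the source pointer to release
 * for one peer, reading only the snapshot, then clear the entry. */
void *task_unmap_src(task_t *task, size_t peer)
{
    assert(peer < task->n_peers);
    void *p = task->peer_src_d_ptr[peer];
    task->peer_src_d_ptr[peer] = NULL;
    return p;
}
```

In this sketch, a peer overwriting `slots[i].src_ptr` after `task_map` has run has no effect on what `task_unmap_src` returns, which is the property the fix relies on.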