[ROCCLR] Mark VirtualMapCommand event complete after synchronous submit #6594

Open
BertanDogancay wants to merge 1 commit into develop from users/bdoganca/rccl-rocclr-patch

@BertanDogancay
Contributor

Motivation

Fixes a distributed deadlock in hipMemMap() / hipMemUnmap() that hangs RCCL multi-segment IPC registration at ≥ 8 ranks.

Technical Details

The fix is in VirtualGPU::submitVirtualMap(): the command's event is now marked CL_COMPLETE immediately after Hsa::vmem_map / Hsa::vmem_unmap returns HSA_STATUS_SUCCESS, or CL_INVALID_OPERATION if the call fails. Because the event status is then no longer above CL_COMPLETE, the subsequent Event::awaitCompletion() short-circuits at its status() > CL_COMPLETE check, so the Marker is never enqueued and the entire deadlock-forming code path disappears for both APIs.
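
To illustrate the short-circuit described above, here is a minimal standalone model. It is a sketch, not ROCclr code: `MockEvent`, `submitVirtualMapSketch`, and `demoShortCircuit` are hypothetical names, the marker enqueue is reduced to a counter, and only the numeric status values are taken from the OpenCL headers.

```cpp
#include <cassert>
#include <cstdint>

// Execution-status values as defined in CL/cl.h; error codes are negative.
constexpr int32_t kClComplete = 0;            // CL_COMPLETE
constexpr int32_t kClQueued = 3;              // CL_QUEUED
constexpr int32_t kClInvalidOperation = -59;  // CL_INVALID_OPERATION

// Hypothetical stand-in for an event; not the real amd::Event.
struct MockEvent {
  int32_t status = kClQueued;
  int markersEnqueued = 0;

  void setStatus(int32_t s) { status = s; }

  // Models the check named in this PR: only a status above CL_COMPLETE
  // takes the slow path that enqueues a Marker and waits on it.
  bool awaitCompletion() {
    if (status > kClComplete) {
      ++markersEnqueued;     // assumption: stands in for the Marker enqueue
      status = kClComplete;  // assumption: the Marker eventually completes
    }
    return status == kClComplete;
  }
};

// Hypothetical model of the patched submit path: the map/unmap call is
// synchronous, so its result can be written to the event right away.
void submitVirtualMapSketch(MockEvent& ev, bool hsaSucceeded) {
  ev.setStatus(hsaSucceeded ? kClComplete : kClInvalidOperation);
}

// Returns the number of Markers enqueued after a successful patched submit.
int demoShortCircuit() {
  MockEvent ev;
  submitVirtualMapSketch(ev, /*hsaSucceeded=*/true);
  bool done = ev.awaitCompletion();
  assert(done);
  return ev.markersEnqueued;
}
```

In this model a failed call also skips the Marker, because a negative status is not above CL_COMPLETE; awaitCompletion() then simply reports that the event did not complete.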

JIRA ID

ROCM-25387

Test Plan

rccl-UnitTestsMPI UMR_MultiSegment tests, run at up to 8 ranks.

Test Result

Pass / No hangs

Submission Checklist
