Skip to content

TL/UCP: use memcpy instead of sendrecv in agv#1255

Open
Sergei-Lebedev wants to merge 1 commit intoopenucx:masterfrom
Sergei-Lebedev:topic/allgatherv_memcpy
Open

TL/UCP: use memcpy instead of sendrecv in agv#1255
Sergei-Lebedev wants to merge 1 commit intoopenucx:masterfrom
Sergei-Lebedev:topic/allgatherv_memcpy

Conversation

@Sergei-Lebedev
Copy link
Contributor

What

Use ucc_mc_memcpy instead of sendrecv in allgatherv ring

Why ?

UCP doesn't support efficient copy of CUDA buffers in self transport. It does it through host staging buffer

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Jan 14, 2026

Greptile Summary

Replaced inefficient UCP self-transport sendrecv with ucc_mc_memcpy for local buffer copy in allgatherv ring algorithm. UCP doesn't efficiently copy CUDA buffers through self-transport and uses host staging buffers, while ucc_mc_memcpy provides direct GPU-to-GPU copies.

  • Replaced local sendrecv pair with single ucc_mc_memcpy call in non-inplace case
  • Adjusted ring algorithm loop from tsize to tsize - 1 iterations to account for local copy being done upfront
  • Updated send_idx calculation to remove + 1 offset (now trank - send_posted)
  • Updated recv_idx calculation to add - 1 offset (now trank - recv_posted - 1)
  • Removed inplace branch that manually set counters, as loop now handles all cases uniformly
  • Added profiling events for start and completion
  • Simplified error handling by removing separate error label

The index adjustments align with the pattern used in allgather_ring.c and correctly implement the ring algorithm with upfront local copy.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The changes follow an established pattern from allgather_ring.c, with correct mathematical adjustments to loop bounds and index calculations. The replacement of sendrecv with memcpy is a straightforward performance optimization that addresses UCP's inefficient handling of CUDA self-transport. Error handling has been appropriately simplified.
  • No files require special attention

Important Files Changed

Filename Overview
src/components/tl/ucp/allgatherv/allgatherv_ring.c Replaced inefficient UCP self-transport sendrecv with direct memcpy for CUDA buffers, adjusted ring algorithm loop bounds and indices accordingly

Copy link
Collaborator

@wfaderhold21 wfaderhold21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

data_size = ucc_coll_args_get_count(
args, args->dst.info_v.counts, grank) *
rdt_size;
status = ucc_mc_memcpy(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! question though: would ucc_mc_memcpy potentially cause a deadlock if using cuda buffers for a scenario of NCCL + UCC? should this be a local copy instead?

@janjust
Copy link
Collaborator

janjust commented Jan 17, 2026

/build

1 similar comment
@janjust
Copy link
Collaborator

janjust commented Jan 23, 2026

/build

@janjust janjust force-pushed the topic/allgatherv_memcpy branch from 89eb485 to b5282de Compare January 23, 2026 15:43
@janjust
Copy link
Collaborator

janjust commented Jan 23, 2026

/build

1 similar comment
@janjust
Copy link
Collaborator

janjust commented Jan 29, 2026

/build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants