Allow shuffle extraction before protocol drain #1019

Open
pentschev wants to merge 2 commits into rapidsai:main from pentschev:shuffle-extraction-before-protocol-drain

@pentschev
Member

#927 made `Shuffler::wait()` block until all internal protocol/send cleanup had drained before returning. That introduced a 10-20% regression in `bench_shuffle` on an 8-GPU node, because it delayed output extraction and unpack/concat work that could otherwise overlap with the drain. For the reported case of 16 input partitions and 4 output partitions, restoring the overlap brings mean per-rank elapsed time from about 207 ms back down to about 172 ms.

This change makes `wait()` return as soon as local shuffle output is ready to extract, and adds `wait_reusable()` for callers that need the stronger full-drain guarantee before reusing an `op_id`.
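To illustrate the two guarantees described above, here is a minimal sketch of a two-stage wait. `ToyShuffler`, `mark_output_ready()`, `mark_drained()`, `extract()`, `drained()`, and `run_demo()` are hypothetical names invented for this example; only the names `wait()` and `wait_reusable()` come from this PR, and the sketch makes no claim about how the actual `Shuffler` implements them.

```cpp
#include <cassert>
#include <condition_variable>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-in for illustration only; this is NOT the real Shuffler.
class ToyShuffler {
  public:
    // Called by the simulated progress thread: local output can be extracted.
    void mark_output_ready() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            output_ready_ = true;
        }
        cv_.notify_all();
    }

    // Called by the simulated progress thread: protocol/send cleanup finished.
    void mark_drained() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            drained_ = true;
        }
        cv_.notify_all();
    }

    // Weaker guarantee: returns once local output is ready to extract.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return output_ready_; });
    }

    // Stronger guarantee: returns only after cleanup has drained as well.
    void wait_reusable() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return output_ready_ && drained_; });
    }

    // Extraction is only valid once wait() has returned.
    int extract() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(output_ready_);
        return 4;  // pretend four output partitions were extracted
    }

    bool drained() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return drained_;
    }

  private:
    mutable std::mutex mutex_;
    std::condition_variable cv_;
    bool output_ready_ = false;
    bool drained_ = false;
};

struct DemoResult {
    bool drained_after_wait;
    int partitions_extracted;
    bool drained_after_wait_reusable;
};

DemoResult run_demo() {
    ToyShuffler shuffler;
    std::promise<void> extraction_done;
    std::future<void> extraction_done_future = extraction_done.get_future();

    // Simulated progress thread. The future is a demo-only handshake that
    // holds the drain back until extraction has happened, so the ordering
    // below is deterministic. A real drain would not wait for the caller.
    std::thread progress([&shuffler, &extraction_done_future] {
        shuffler.mark_output_ready();
        extraction_done_future.wait();
        shuffler.mark_drained();
    });

    DemoResult result{};
    shuffler.wait();  // returns before the drain completes
    result.drained_after_wait = shuffler.drained();
    result.partitions_extracted = shuffler.extract();  // overlaps the drain
    extraction_done.set_value();
    shuffler.wait_reusable();  // returns only after the drain completes
    result.drained_after_wait_reusable = shuffler.drained();
    progress.join();
    return result;
}
```

Calling `run_demo()` from a `main()` shows extraction happening between the two waits. In this toy model, only the point after `wait_reusable()` would be safe for reusing an identifier such as `op_id`.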
pentschev self-assigned this May 8, 2026
pentschev requested a review from a team as a code owner May 8, 2026 12:06
pentschev added the bug (Something isn't working) and non-breaking (Introduces a non-breaking change) labels May 8, 2026
Member

madsbk left a comment

Do we need to update the Python bindings as well?

What about renaming to wait_drained()? Or maybe wait_for_drain()?
