
Address some performance regressions in shuffle #1018

Open
wence- wants to merge 7 commits into rapidsai:main from wence-:wence/fea/shuffle-perf


wence- (Contributor) commented May 8, 2026

After #927, we lost about 10% performance in the shuffle benchmarks when using CUDA async memory.

From my benchmarking, this PR recovers most of that loss with the following updates:

  • Switch back to a "wake" scheme closer to the one used before #927 (Support reuse of op_ids in shuffles). We now wake a waiter once all data is ready to be extracted and all sends have been posted (but not necessarily completed).
  • Rather than stopping at the first ready buffer for each rank, post receives for all buffers up to the first non-ready one (TODO: confirm this is safe with respect to message ordering; I think it is).
  • Apply a circulant shift to the metadata polling order, so that every rank does not poll rank 0 first, then rank 1, and so on.
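A minimal Python sketch of the last two changes, for illustration only. It assumes peers are indexed `0..nranks-1` and that each rank's pending buffers are tracked as a list of ready flags in message order. The function names `circulant_poll_order` and `ready_prefix_length` are hypothetical and are not taken from the code in this PR.

```python
"""Hypothetical sketch of the polling and posting orders described above.

Names and data layout are assumptions, not the PR's actual implementation.
"""


def circulant_poll_order(rank: int, nranks: int) -> list[int]:
    """Return the order in which `rank` polls peers for metadata.

    Each rank starts at its own index and wraps around (a circulant shift),
    so at step i rank r polls peer (r + i) % nranks. For a fixed step i,
    distinct ranks map to distinct peers, so no single peer is polled by
    every rank at the same step.
    """
    if not 0 <= rank < nranks:
        raise ValueError("rank must be in [0, nranks)")
    return [(rank + i) % nranks for i in range(nranks)]


def ready_prefix_length(ready: list[bool]) -> int:
    """Return how many leading buffers (in message order) are ready.

    Receives would be posted for all of these, stopping only at the first
    buffer that is not yet ready, rather than after the first ready one.
    """
    count = 0
    for is_ready in ready:
        if not is_ready:
            break
        count += 1
    return count
```

For example, with four ranks, rank 2 polls peers in the order 2, 3, 0, 1, and `ready_prefix_length([True, True, False, True])` is 2, so receives would be posted for the first two buffers only. Stopping at the first gap keeps the posted receives in message order, which is what the TODO above is about.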

In addition, to allow benchmarking just the communication part of the shuffle, this adds a mode to the shuffle benchmark that discards the output without even concatenating it.

copy-pr-bot (bot) commented May 8, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

wence- force-pushed the wence/fea/shuffle-perf branch 6 times, most recently from 92e2244 to fbebd32 (May 8, 2026 13:21)
This allows benchmarking just the data movement part of the shuffle without
the unspilling and concatenation of the results.

Additionally, remove the unnecessary stream sync: the contract is that
downstream data is made available in stream order, so rely on that.
wence- force-pushed the wence/fea/shuffle-perf branch 2 times, most recently from 8e200d0 to 8be0963 (May 8, 2026 14:24)
wence- force-pushed the wence/fea/shuffle-perf branch from 8be0963 to a583ce0 (May 8, 2026 16:29)
wence- changed the title from "Wence/fea/shuffle perf" to "Address some performance regressions in shuffle" (May 8, 2026)
wence- added the improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) labels (May 8, 2026)
wence- marked this pull request as ready for review (May 11, 2026 07:55)
wence- requested a review from a team as a code owner (May 11, 2026 07:55)
wence- force-pushed the wence/fea/shuffle-perf branch 3 times, most recently from 0b25269 to a3fff0b (May 11, 2026 13:50)
wence- force-pushed the wence/fea/shuffle-perf branch from a3fff0b to 1c763bb (May 11, 2026 14:00)
wence- added the DO NOT MERGE (Hold off on merging; see PR for details) label (May 11, 2026)
wence- (Contributor, Author) commented May 11, 2026

The waking is not safe yet.

wence- added 2 commits May 11, 2026 17:17
We can wake once the MPE has finished polling for new metadata; we don't
need to wait for it to be completely idle.
From benchmarking, it seems this won't be worth it.
wence- force-pushed the wence/fea/shuffle-perf branch from 1c763bb to 011ed99 (May 11, 2026)

Labels

DO NOT MERGE (Hold off on merging; see PR for details), improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)

1 participant