
TL/UCP: add allreduce ring algorithm#1258

Open
wfaderhold21 wants to merge 5 commits into openucx:master from wfaderhold21:topic/allreduce-ring

Conversation

@wfaderhold21
Collaborator

What

This is a reproduction of PR #1082 with some changes. The original algorithm implemented a ring-based reduce-scatter + allgather to perform allreduce. This version does the same, but has been converted to a schedule-based approach that reuses the existing reduce-scatter and allgather ring algorithms, which improves performance over the original approach.

Allreduce Performance Comparison

Configuration: Thor cluster, 16 Nodes, 32 PPN (512 processes total)

| Size | UCC Ring (μs) | UCC Default (μs) | OMPI Tuned (μs) | Ring vs Default | Ring vs OMPI |
| --- | --- | --- | --- | --- | --- |
| 4B | 20.09 | 17.41 | 30.04 | 0.87x | 1.49x |
| 8B | 14.33 | 12.18 | 12.34 | 0.85x | 0.86x |
| 16B | 14.96 | 12.96 | 6.93 | 0.87x | 0.46x |
| 32B | 11.88 | 12.11 | 5.46 | 1.02x | 0.46x |
| 64B | 12.60 | 12.12 | 7.06 | 0.96x | 0.56x |
| 128B | 18.31 | 18.88 | 5.73 | 1.03x | 0.31x |
| 256B | 19.06 | 18.43 | 8.41 | 0.97x | 0.44x |
| 512B | 27.81 | 27.37 | 7.48 | 0.98x | 0.27x |
| 1KB | 35.48 | 35.84 | 19.34 | 1.01x | 0.55x |
| 2KB | 401.68 | 51.39 | 22.06 | 0.13x | 0.05x |
| 4KB | 971.32 | 36.01 | 28.00 | 0.04x | 0.03x |
| 8KB | 940.93 | 44.18 | 43.39 | 0.05x | 0.05x |
| 16KB | 979.73 | 56.61 | 103.19 | 0.06x | 0.11x |
| 32KB | 1120.04 | 71.34 | 118.56 | 0.06x | 0.11x |
| 64KB | 1220.42 | 97.15 | 108.31 | 0.08x | 0.09x |
| 128KB | 1489.83 | 138.17 | 149.70 | 0.09x | 0.10x |
| 256KB | 3164.63 | 218.73 | 231.86 | 0.07x | 0.07x |
| 512KB | 3519.75 | 367.45 | 378.86 | 0.10x | 0.11x |
| 1MB | 4429.24 | 708.90 | 737.39 | 0.16x | 0.17x |
| 2MB | 5701.08 | 2113.65 | 2231.48 | 0.37x | 0.39x |
| 4MB | 7557.97 | 6639.26 | 6621.17 | 0.88x | 0.88x |
| 8MB | 12714.40 | 16802.76 | 13050.44 | 1.32x | 1.03x |
| 16MB | 19043.60 | 36955.76 | 25359.61 | 1.94x | 1.33x |
| 32MB | 32388.59 | 73831.34 | 51227.33 | 2.28x | 1.58x |
| 64MB | 60050.21 | 149227.26 | 135101.16 | 2.49x | 2.25x |
| 128MB | 139286.78 | 300387.02 | 278375.47 | 2.16x | 2.00x |
| 256MB | 287431.78 | 597718.12 | 572264.84 | 2.08x | 1.99x |

See the attached PDFs for plots of the table above.
allreduce_comparison.pdf
allreduce_comparison_large.pdf

@greptile-apps
Contributor

greptile-apps bot commented Jan 22, 2026

Greptile Overview

Greptile Summary

This PR adds a ring-based allreduce algorithm that composes reduce-scatter and allgather ring operations using UCC schedules. The implementation shows significant performance improvements for large messages (>4MB), with 2-2.5x speedup over the default algorithm.

Key changes:

  • Schedule-based composition cleanly reuses existing reduce-scatter and allgather ring implementations
  • Proper handling of in-place and out-of-place modes with memcpy when needed
  • Count validation ensures count % team_size == 0 as required by ring algorithm
  • Tests properly use counts divisible by team sizes (e.g., 15, 273, 65535, 123570)
  • Expanded test coverage with multiple data types and operations

The implementation correctly passes total count to sub-algorithms which internally handle per-rank block size calculations when in-place flag is set.

Confidence Score: 5/5

  • This PR is safe to merge with no identified issues
  • Clean schedule-based implementation that properly reuses existing tested components, correct count handling verified through code inspection, comprehensive test coverage with appropriate constraints
  • No files require special attention

Important Files Changed

Filename Overview
src/components/tl/ucp/allreduce/allreduce_ring.c New ring allreduce implementation using schedule-based composition of reduce-scatter and allgather ring algorithms
test/gtest/coll/test_allreduce.cc Test suite updated with ring algorithm tests using counts divisible by team sizes, expanded type coverage

Last reviewed commit: e1229b6

@greptile-apps bot left a comment


3 files reviewed, 3 comments


```cpp
}

// Test with various data sizes: small, medium, large
for (auto count : {8, 65536, 123567}) {
```

test counts not divisible by team size (n_procs=15)

The ring algorithm requires count % tsize == 0 (enforced at allreduce_ring.c:101-105). With n_procs=15:

  • count 8: 8 % 15 = 8 (fails)
  • count 65536: 65536 % 15 = 1 (fails)
  • count 123567: 123567 % 15 = 12 (fails)

Use counts divisible by 15, e.g. {15, 65520, 123570}

Suggested change:

```diff
- for (auto count : {8, 65536, 123567}) {
+ for (auto count : {15, 65520, 123570}) {
```

```cpp
UccTeam_h team = job.create_team(team_size);
UccCollCtxVec ctxs;

for (auto count : {0, 1, 3, 17}) {
```

most test counts not divisible by team sizes

The ring algorithm requires count % tsize == 0. Most combinations will fail:

  • team_size=3: only count 0 and 3 work (1, 17 fail)
  • team_size=7: only count 0 works (1, 3, 17 fail)
  • team_size=13: only count 0 works (1, 3, 17 fail)

Use counts that are multiples of all team sizes (e.g., LCM(3,7,13)=273):

Suggested change:

```diff
- for (auto count : {0, 1, 3, 17}) {
+ for (auto count : {0, 273, 546, 819}) {
```

Comment on lines +222 to +231
```c
struct {
    void                   *scratch;
    ucc_mc_buffer_header_t *scratch_mc_header;
    ucc_ee_executor_task_t *etask;
    ucc_ee_executor_t      *executor;
    ucc_tl_ucp_copy_task_t *copy_task;
    int                     phase;
    int                     step;
    size_t                  chunk_size;
    int                     p2p_posted;
```

unused struct - schedule-based implementation doesn't use task state

The schedule-based ring allreduce (allreduce_ring.c) uses ucc_schedule_t and doesn't access task->allreduce_ring. This struct appears to be leftover from the original non-schedule implementation (PR #1082).

Suggested change (delete the unused struct):

```diff
- struct {
-     void                   *scratch;
-     ucc_mc_buffer_header_t *scratch_mc_header;
-     ucc_ee_executor_task_t *etask;
-     ucc_ee_executor_t      *executor;
-     ucc_tl_ucp_copy_task_t *copy_task;
-     int                     phase;
-     int                     step;
-     size_t                  chunk_size;
-     int                     p2p_posted;
```

@Sergei-Lebedev left a comment


Looks good to me, but the `count % tsize` requirement is too strict imho. We can modify ring reduce-scatter and ring allgather so that they handle the remainder correctly, similar to the knomial algorithms. This can be improved in a follow-up PR, though.

Pls fix greptile comments

@@ -0,0 +1,196 @@

```c
/**
 * Copyright (c) 2021-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```

2026

@greptile-apps bot left a comment


2 files reviewed, no comments


@greptile-apps bot left a comment


2 files reviewed, 1 comment


Comment on lines +124 to +126
```c
 * For reduce_scatter_ring:
 * - dst.info.count should be the per-rank output count (count/tsize)
 * - The algorithm internally computes total = dst.info.count * tsize
```

comment is misleading - states count should be per-rank (count/tsize), but code passes total count at line 132. This is actually correct because IN_PLACE flag is set at line 134, and reduce_scatter_ring expects total count when in-place. Update comment to clarify this.

@greptile-apps bot left a comment


6 files reviewed, 2 comments


Comment on lines +95 to +107
```c
if (UCC_TL_UCP_TEAM_LIB(tl_team)->cfg.reduce_avg_pre_op &&
    coll_args->args.op == UCC_OP_AVG) {
    return UCC_ERR_NOT_SUPPORTED;
}

/* Check that count is divisible by team size for ring algorithm */
if (count % tsize != 0) {
    tl_debug(team->context->lib,
             "ring requires count (%zu) divisible by team size (%u)",
             count, tsize);
    return UCC_ERR_NOT_SUPPORTED;
}
```


Wrong count for ring

ucc_tl_ucp_allreduce_ring_init enforces count % tsize == 0 (where count is dst.info.count, i.e., total elements per rank), but then passes that same total count as rs_args.args.dst.info.count while also forcing UCC_COLL_ARGS_FLAG_IN_PLACE. In reduce_scatter_ring the in-place path interprets dst.info.count as per-rank block count and internally computes total count as dst.info.count * size (reduce_scatter/reduce_scatter_ring.c:95-114, 213-231, 364-366), so this will run with an effective total of count*tsize and produce incorrect offsets/results for normal allreduce inputs.

Also note that count=0 passes the count % tsize check (0 % tsize is 0), and reduce_scatter_ring supports it (it simply results in no work), but the composed schedule must handle it end-to-end for zero-count edge-case tests to pass as written.

Comment on lines +426 to +434
```cpp
for (auto count : {0, 273, 546, 819}) {
    SET_MEM_TYPE(UCC_MEMORY_TYPE_HOST);
    this->set_inplace(TEST_NO_INPLACE);
    this->data_init(team_size, TypeParam::dt, count, ctxs, false);
    UccReq req(team, ctxs);

    req.start();
    req.wait();
    EXPECT_EQ(true, this->data_validate(ctxs));
```

Zero-count ring test fails

ring_edge_cases includes count=0, but the ring allreduce init currently returns UCC_ERR_NOT_SUPPORTED when count % team_size != 0 (and 0 % team_size is 0, but earlier validation paths in data_init/coll selection can still make this fail depending on how count=0 is handled). More importantly, with the current allreduce_ring wiring, even if init succeeded, the reduce-scatter/allgather composition assumes a meaningful block size; it’s safer to avoid count=0 here unless the algorithm explicitly supports it end-to-end.

@greptile-apps bot left a comment


6 files reviewed, no comments

