TL/UCP: add allreduce ring algorithm#1258
wfaderhold21 wants to merge 5 commits into openucx:master
Conversation
Greptile Overview
| Filename | Overview |
|---|---|
| src/components/tl/ucp/allreduce/allreduce_ring.c | New ring allreduce implementation using schedule-based composition of reduce-scatter and allgather ring algorithms |
| test/gtest/coll/test_allreduce.cc | Test suite updated with ring algorithm tests using counts divisible by team sizes, expanded type coverage |
Last reviewed commit: e1229b6
test/gtest/coll/test_allreduce.cc
Outdated
```cpp
    }

    // Test with various data sizes: small, medium, large
    for (auto count : {8, 65536, 123567}) {
```
test counts not divisible by team size (n_procs=15)
The ring algorithm requires count % tsize == 0 (enforced at allreduce_ring.c:101-105). With n_procs=15:
- count 8: 8 % 15 = 8 (fails)
- count 65536: 65536 % 15 = 1 (fails)
- count 123567: 123567 % 15 = 12 (fails)
Use counts divisible by 15, e.g. {15, 65520, 123570}
```diff
-    for (auto count : {8, 65536, 123567}) {
+    for (auto count : {15, 65520, 123570}) {
```
test/gtest/coll/test_allreduce.cc
Outdated
```cpp
    UccTeam_h team = job.create_team(team_size);
    UccCollCtxVec ctxs;

    for (auto count : {0, 1, 3, 17}) {
```
most test counts not divisible by team sizes
The ring algorithm requires count % tsize == 0. Most combinations will fail:
- team_size=3: only count 0 and 3 work (1, 17 fail)
- team_size=7: only count 0 works (1, 3, 17 fail)
- team_size=13: only count 0 works (1, 3, 17 fail)
Use counts that are multiples of all team sizes (e.g., LCM(3,7,13)=273):
```diff
-    for (auto count : {0, 1, 3, 17}) {
+    for (auto count : {0, 273, 546, 819}) {
```
src/components/tl/ucp/tl_ucp_task.h
Outdated
```cpp
    struct {
        void                   *scratch;
        ucc_mc_buffer_header_t *scratch_mc_header;
        ucc_ee_executor_task_t *etask;
        ucc_ee_executor_t      *executor;
        ucc_tl_ucp_copy_task_t *copy_task;
        int                     phase;
        int                     step;
        size_t                  chunk_size;
        int                     p2p_posted;
```
unused struct - schedule-based implementation doesn't use task state
The schedule-based ring allreduce (allreduce_ring.c) uses ucc_schedule_t and doesn't access task->allreduce_ring. This struct appears to be leftover from the original non-schedule implementation (PR #1082).
```diff
-    struct {
-        void                   *scratch;
-        ucc_mc_buffer_header_t *scratch_mc_header;
-        ucc_ee_executor_task_t *etask;
-        ucc_ee_executor_t      *executor;
-        ucc_tl_ucp_copy_task_t *copy_task;
-        int                     phase;
-        int                     step;
-        size_t                  chunk_size;
-        int                     p2p_posted;
```
Sergei-Lebedev
left a comment
looks good to me, but the requirement of count%tsize is too strict imho. We can modify ring reduce scatter and ring allgather so that they can handle remainder correctly similar to knomial algorithms. This can be improved in next PR though
Pls fix greptile comments
```diff
@@ -0,0 +1,196 @@
+/**
+ * Copyright (c) 2021-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
TEST: update and reorder gtest allreduce tests (force-pushed 6cb06b1 to ce2cde7)
```cpp
 * For reduce_scatter_ring:
 * - dst.info.count should be the per-rank output count (count/tsize)
 * - The algorithm internally computes total = dst.info.count * tsize
```
The comment is misleading: it states the count should be per-rank (count/tsize), but the code passes the total count at line 132. This is actually correct because the IN_PLACE flag is set at line 134, and reduce_scatter_ring expects the total count when in-place. Update the comment to clarify this.
```cpp
    if (UCC_TL_UCP_TEAM_LIB(tl_team)->cfg.reduce_avg_pre_op &&
        coll_args->args.op == UCC_OP_AVG) {
        return UCC_ERR_NOT_SUPPORTED;
    }

    /* Check that count is divisible by team size for ring algorithm */
    if (count % tsize != 0) {
        tl_debug(team->context->lib,
                 "ring requires count (%zu) divisible by team size (%u)",
                 count, tsize);
        return UCC_ERR_NOT_SUPPORTED;
    }
```
Wrong count for ring
ucc_tl_ucp_allreduce_ring_init enforces count % tsize == 0 (where count is dst.info.count, i.e., total elements per rank), but then passes that same total count as rs_args.args.dst.info.count while also forcing UCC_COLL_ARGS_FLAG_IN_PLACE. In reduce_scatter_ring the in-place path interprets dst.info.count as per-rank block count and internally computes total count as dst.info.count * size (reduce_scatter/reduce_scatter_ring.c:95-114, 213-231, 364-366), so this will run with an effective total of count*tsize and produce incorrect offsets/results for normal allreduce inputs.
Also note that count=0 passes the count % tsize check (0 % tsize == 0), and reduce_scatter_ring supports it (it just results in no work), but the composition should still be verified end-to-end for the empty case before edge-case tests can rely on it.
test/gtest/coll/test_allreduce.cc
Outdated
```cpp
    for (auto count : {0, 273, 546, 819}) {
        SET_MEM_TYPE(UCC_MEMORY_TYPE_HOST);
        this->set_inplace(TEST_NO_INPLACE);
        this->data_init(team_size, TypeParam::dt, count, ctxs, false);
        UccReq req(team, ctxs);

        req.start();
        req.wait();
        EXPECT_EQ(true, this->data_validate(ctxs));
```
Zero-count ring test fails
ring_edge_cases includes count=0, but the ring allreduce init currently returns UCC_ERR_NOT_SUPPORTED when count % team_size != 0 (and 0 % team_size is 0, but earlier validation paths in data_init/coll selection can still make this fail depending on how count=0 is handled). More importantly, with the current allreduce_ring wiring, even if init succeeded, the reduce-scatter/allgather composition assumes a meaningful block size; it’s safer to avoid count=0 here unless the algorithm explicitly supports it end-to-end.
What
This is a reproduction of PR #1082 with some changes. The original algorithm implemented a ring-based reduce-scatter + allgather to perform allreduce. This version does the same, but has been converted to a schedule-based approach that reuses the existing reduce-scatter and allgather ring algorithms, which improves performance compared to the original approach.
Allreduce Performance Comparison
Configuration: Thor cluster, 16 Nodes, 32 PPN (512 processes total)
See the attached PDFs for performance comparison graphs.
allreduce_comparison.pdf
allreduce_comparison_large.pdf