TL/CUDA: guard team size by ikryukov · Pull Request #1274 · openucx/ucc

ikryukov · 2026-03-05T17:10:55Z

What

Add a team size check in TL/CUDA UCC_CLASS_INIT_FUNC that returns
UCC_ERR_NOT_SUPPORTED when the number of ranks exceeds UCC_TL_CUDA_MAX_PEERS (8).

Why ?

Running ucc_test_mpi with more than 8 ranks on a single node causes a
segmentation fault inside ucc_tl_cuda_team_create_test. Several fixed-size
arrays (scratch.rem, scratch.rem_info, barrier state/local_sense) are
bounded by UCC_TL_CUDA_MAX_PEERS but loops iterate over UCC_TL_TEAM_SIZE,
resulting in out-of-bounds writes when team size > 8.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

greptile-apps · 2026-03-05T17:17:23Z

Greptile Summary

This PR introduces a targeted guard in UCC_CLASS_INIT_FUNC for ucc_tl_cuda_team_t that returns UCC_ERR_NOT_SUPPORTED when the team size exceeds UCC_TL_CUDA_MAX_PEERS (8), preventing a segmentation fault caused by out-of-bounds writes into fixed-size arrays (scratch.rem, scratch.rem_info, barrier state, and local_sense) in subsequent initialization and collective code.

The check is placed correctly — after the multinode early-return (so the multinode NVLS path, which has its own peer limit of 144, is unaffected) but before any memory allocation, avoiding any resource-leak on the error path.
The boundary condition (> rather than >=) is correct: arrays are sized exactly UCC_TL_CUDA_MAX_PEERS, so a team of exactly 8 is valid.
The UCC_CLASS_CLEANUP_FUNC iterates over UCC_TL_TEAM_SIZE for scratch.rem[i], but since a team whose init returned an error will never have its cleanup function invoked by the UCC framework, there is no secondary out-of-bounds access.
One minor style improvement: the debug message should clarify that UCC_TL_CUDA_MAX_PEERS is a compile-time constant requiring recompilation, not a runtime parameter.

Confidence Score: 5/5

This PR is safe to merge — the fix is minimal, correctly placed, and directly addresses the reported segfault without introducing new risks.
The change is a single, focused guard that returns an appropriate error code before any unsafe memory operations. The boundary condition and code placement are both correct, and the fix aligns with how other unsupported configurations (e.g. multinode without NVLS) are already handled in the same function. One minor style suggestion improves user clarity in the debug message.
No files require special attention.

_{Last reviewed commit: ba40374}

src/components/tl/cuda/tl_cuda_team.c

ikryukov · 2026-03-05T17:31:42Z

/build

Sergei-Lebedev · 2026-03-05T20:30:52Z

we already have this check in class init super, see

    if (attr.max_team_size < params->size) {
        tl_debug(lib, "team size %d is too big, max supported %d",
                 params->size, attr.max_team_size);
        return UCC_ERR_NOT_SUPPORTED;
    }

janjust · 2026-03-05T20:36:45Z

@ikryukov here too - please force a rebase, for some reason this and your previous PR show as "problem with creating rebase commit"

TL/CUDA: guard team size > MAX_PEERS

ba40374

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>

ikryukov self-assigned this Mar 5, 2026

ikryukov requested review from Sergei-Lebedev, janjust and nsarka March 5, 2026 17:14

ikryukov added the Ready-for-Review label Mar 5, 2026

greptile-apps bot reviewed Mar 5, 2026

View reviewed changes

src/components/tl/cuda/tl_cuda_team.c Show resolved Hide resolved

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TL/CUDA: guard team size#1274

TL/CUDA: guard team size#1274
ikryukov wants to merge 1 commit intoopenucx:masterfrom
ikryukov:fix/tl-cuda-team-size-check

ikryukov commented Mar 5, 2026

Uh oh!

greptile-apps bot commented Mar 5, 2026

Greptile Summary

Confidence Score: 5/5

Uh oh!

Uh oh!

ikryukov commented Mar 5, 2026

Uh oh!

Sergei-Lebedev commented Mar 5, 2026

Uh oh!

janjust commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ikryukov commented Mar 5, 2026

What

Why ?

Uh oh!

greptile-apps bot commented Mar 5, 2026

Greptile Summary

Confidence Score: 5/5

Uh oh!

Uh oh!

ikryukov commented Mar 5, 2026

Uh oh!

Sergei-Lebedev commented Mar 5, 2026

Uh oh!

janjust commented Mar 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants