Skip to content

TL/CUDA: guard team size#1274

Open
ikryukov wants to merge 1 commit intoopenucx:masterfrom
ikryukov:fix/tl-cuda-team-size-check
Open

TL/CUDA: guard team size#1274
ikryukov wants to merge 1 commit intoopenucx:masterfrom
ikryukov:fix/tl-cuda-team-size-check

Conversation

@ikryukov
Copy link
Collaborator

@ikryukov ikryukov commented Mar 5, 2026

What

Add a team size check in TL/CUDA UCC_CLASS_INIT_FUNC that returns
UCC_ERR_NOT_SUPPORTED when the number of ranks exceeds UCC_TL_CUDA_MAX_PEERS (8).

Why ?

Running ucc_test_mpi with more than 8 ranks on a single node causes a
segmentation fault inside ucc_tl_cuda_team_create_test. Several fixed-size
arrays (scratch.rem, scratch.rem_info, barrier state/local_sense) are
bounded by UCC_TL_CUDA_MAX_PEERS but loops iterate over UCC_TL_TEAM_SIZE,
resulting in out-of-bounds writes when team size > 8.

Signed-off-by: Ilya Kryukov <ikryukov@nvidia.com>
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR introduces a targeted guard in UCC_CLASS_INIT_FUNC for ucc_tl_cuda_team_t that returns UCC_ERR_NOT_SUPPORTED when the team size exceeds UCC_TL_CUDA_MAX_PEERS (8), preventing a segmentation fault caused by out-of-bounds writes into fixed-size arrays (scratch.rem, scratch.rem_info, barrier state, and local_sense) in subsequent initialization and collective code.

  • The check is placed correctly — after the multinode early-return (so the multinode NVLS path, which has its own peer limit of 144, is unaffected) but before any memory allocation, avoiding any resource-leak on the error path.
  • The boundary condition (> rather than >=) is correct: arrays are sized exactly UCC_TL_CUDA_MAX_PEERS, so a team of exactly 8 is valid.
  • The UCC_CLASS_CLEANUP_FUNC iterates over UCC_TL_TEAM_SIZE for scratch.rem[i], but since a team whose init returned an error will never have its cleanup function invoked by the UCC framework, there is no secondary out-of-bounds access.
  • One minor style improvement: the debug message should clarify that UCC_TL_CUDA_MAX_PEERS is a compile-time constant requiring recompilation, not a runtime parameter.

Confidence Score: 5/5

  • This PR is safe to merge — the fix is minimal, correctly placed, and directly addresses the reported segfault without introducing new risks.
  • The change is a single, focused guard that returns an appropriate error code before any unsafe memory operations. The boundary condition and code placement are both correct, and the fix aligns with how other unsupported configurations (e.g. multinode without NVLS) are already handled in the same function. One minor style suggestion improves user clarity in the debug message.
  • No files require special attention.

Last reviewed commit: ba40374

@ikryukov
Copy link
Collaborator Author

ikryukov commented Mar 5, 2026

/build

@Sergei-Lebedev
Copy link
Contributor

we already have this check in class init super, see

    if (attr.max_team_size < params->size) {
        tl_debug(lib, "team size %d is too big, max supported %d",
                 params->size, attr.max_team_size);
        return UCC_ERR_NOT_SUPPORTED;
    }

@janjust
Copy link
Collaborator

janjust commented Mar 5, 2026

@ikryukov here too - please force a rebase, for some reason this and your previous PR show as "problem with creating rebase commit"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants