Add env SegmentedReduce#7795

Open
gonidelis wants to merge 9 commits into NVIDIA:main from gonidelis:segmented_redude_env
Conversation

@gonidelis
Member

@gonidelis gonidelis commented Feb 25, 2026

Adds env-based overloads for all DeviceSegmentedReduce::* algorithms.

Closes #7550

Segmented reduce is inherently run_to_run deterministic, so this is the strongest determinism guarantee allowed. If you believe a future performance optimization could break this contract, let me know and we will remove this promise from this PR. Otherwise we stay bound to it.
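
To illustrate what a run-to-run guarantee means for floating-point input, here is a minimal host-side sketch. It is not CUB code: the helper names are hypothetical, and it assumes each segment is always reduced in the same fixed order. Floating-point addition is not associative, so two different orders can produce different sums, while repeating the same order reproduces the same result.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: reduce one segment strictly left to right.
inline float sum_left_to_right(const std::vector<float>& segment)
{
  float acc = 0.0f;
  for (float x : segment)
  {
    acc += x;
  }
  return acc;
}

// Hypothetical helper: reduce the same segment strictly right to left.
inline float sum_right_to_left(const std::vector<float>& segment)
{
  float acc = 0.0f;
  for (auto it = segment.rbegin(); it != segment.rend(); ++it)
  {
    acc += *it;
  }
  return acc;
}

// Example segment where the order matters: the 1.0f is absorbed when it is
// added to -1e20f first, but survives when the two large values cancel first.
inline std::vector<float> order_sensitive_segment()
{
  return {1e20f, -1e20f, 1.0f};
}
```

Calling either helper twice on `order_sensitive_segment()` returns the same value each time, which is the property the promise above is about; the two helpers disagree with each other only because their orders differ.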

@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Feb 25, 2026
@gonidelis gonidelis force-pushed the segmented_redude_env branch from 2aa7447 to d442948 Compare February 25, 2026 20:29
@copy-pr-bot
Contributor

copy-pr-bot bot commented Feb 25, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@gonidelis gonidelis marked this pull request as ready for review February 26, 2026 01:49
@gonidelis gonidelis requested a review from a team as a code owner February 26, 2026 01:49
@gonidelis gonidelis requested a review from pauleonix February 26, 2026 01:49
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Feb 26, 2026
@gonidelis gonidelis force-pushed the segmented_redude_env branch from 83c6791 to 535da7d Compare March 5, 2026 14:47
@gonidelis gonidelis force-pushed the segmented_redude_env branch from 535da7d to 7a15fdf Compare March 9, 2026 21:57
@gonidelis
Member Author

I removed the underlying helper implementation function for the fixed-segment-size overloads, since it required prior knowledge of AccumT and added unnecessary logic. The non-fixed-size overloads still use the *_impl function, though.

@gonidelis
Member Author

adding missing unit tests just now

@gonidelis gonidelis enabled auto-merge (squash) March 10, 2026 03:18
Contributor

@miscco miscco left a comment

Looks good.

@bernhardmgruber I observe that we are really loose with the naming conventions. We have InitValueT, init_value_t, init_t, or no alias at all. Same for AccumT and so on.

We really should be more consistent.

d_out,
num_segments,
segment_size,
::cuda::std::plus{},
Contributor

Question: This uses plus<void>, and we have observed performance issues with it because it promotes smaller integer types. Should this rather be:

Suggested change
- ::cuda::std::plus{},
+ ::cuda::std::plus<detail::it_value_t<InputIteratorT>>{},

Contributor

Such changes should definitely go to separate PRs, since they change the status quo. AFAIK @gonidelis copies the setup for the dispatch call from the other non-env overloads.

Member Author

true ☝🏼 why do they change status quo?

Contributor

For integer types, plus<> introduces integer promotion, which e.g. plus<short> does not.

So, depending on the tested types, this can have considerable performance implications.
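
The promotion can be checked at compile time. The sketch below uses the standard library's std::plus as a stand-in for ::cuda::std::plus, on the assumption that the two behave the same here; the alias and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <type_traits>

// Transparent functor (plus<> is plus<void>): the operands undergo the usual
// integer promotions, so short + short is computed and returned as int.
using transparent_result_t = decltype(std::plus<>{}(short{1}, short{2}));
static_assert(std::is_same<transparent_result_t, int>::value,
              "plus<> on two shorts yields int");

// Typed functor: operands and result are both short, so the promoted sum is
// converted back to short on return.
using typed_result_t = decltype(std::plus<short>{}(short{1}, short{2}));
static_assert(std::is_same<typed_result_t, short>::value,
              "plus<short> on two shorts yields short");

// Hypothetical helpers exposing the size of each functor's result type.
constexpr std::size_t transparent_result_size()
{
  return sizeof(transparent_result_t);
}

constexpr std::size_t typed_result_size()
{
  return sizeof(typed_result_t);
}
```

If the accumulator type is deduced from the operator's result, the transparent functor makes it int rather than short, which is one plausible source of the performance difference described above.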

using OffsetT = detail::common_iterator_value_t<BeginOffsetIteratorT, EndOffsetIteratorT>;
using InputT = detail::it_value_t<InputIteratorT>;
using init_t = InputT;
using op_t = ::cuda::minimum<>;
Contributor

Ditto: Should this rather be:

Suggested change
- using op_t = ::cuda::minimum<>;
+ using op_t = ::cuda::minimum<InputT>;

d_out,
num_segments,
segment_size,
::cuda::minimum<>{},
Contributor

Ditto: make the type explicit.

Suggested change
- ::cuda::minimum<>{},
+ ::cuda::minimum<input_t>{},

Member Author

Resolving these per Bernhard's suggestion; I will handle them in a separate PR.

@gonidelis
Member Author

@miscco #7974 (comment) ok?

…r to common impl

  - Add private segmented_reduce_impl that centralizes determinism
    validation (static_assert rejecting gpu_to_gpu), dispatch_with_env,
    and tuning extraction, eliminating boilerplate across all env overloads
  - Refactor Reduce, Sum, Min, Max env overloads to delegate to
    segmented_reduce_impl
  - Add new env overloads for ArgMin and ArgMax with full documentation
    including literalinclude snippet tags
  - Rewrite env_api tests covering all 6 APIs (Reduce, Sum, Min, Max,
    ArgMin, ArgMax) with determinism and stream_ref acceptance tests
  - Unify _env.cu and _env_launch.cu into a single _env.cu test file
    with default env, launch wrapper, custom stream, and tuning tests
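
The shape of that refactor can be sketched on the host as follows. This is not the PR's code: the tag types, function names, and sequential loop are hypothetical stand-ins, chosen only to show one shared implementation that rejects gpu_to_gpu with a static_assert while thin wrappers delegate to it.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for determinism requirement tags (not the CCCL types).
struct not_guaranteed_t {};
struct run_to_run_t {};
struct gpu_to_gpu_t {};

// Hypothetical shared implementation: validates the requirement in one place,
// then reduces each segment [begin_offsets[s], end_offsets[s]) in index order.
template <class Determinism, class T, class Op>
std::vector<T> segmented_reduce_impl_sketch(
  const std::vector<T>& in,
  const std::vector<std::size_t>& begin_offsets,
  const std::vector<std::size_t>& end_offsets,
  Op op,
  T init)
{
  static_assert(!std::is_same<Determinism, gpu_to_gpu_t>::value,
                "gpu_to_gpu determinism is rejected in this sketch");
  assert(begin_offsets.size() == end_offsets.size());

  std::vector<T> out(begin_offsets.size(), init);
  for (std::size_t s = 0; s < begin_offsets.size(); ++s)
  {
    T acc = init; // an empty segment yields init
    for (std::size_t i = begin_offsets[s]; i < end_offsets[s]; ++i)
    {
      acc = op(acc, in[i]);
    }
    out[s] = acc;
  }
  return out;
}

// Thin wrappers only pick the operator and initial value, then delegate.
template <class Determinism = run_to_run_t>
std::vector<int> segmented_sum_sketch(
  const std::vector<int>& in,
  const std::vector<std::size_t>& begin_offsets,
  const std::vector<std::size_t>& end_offsets)
{
  return segmented_reduce_impl_sketch<Determinism>(
    in, begin_offsets, end_offsets, std::plus<int>{}, 0);
}

template <class Determinism = run_to_run_t>
std::vector<int> segmented_min_sketch(
  const std::vector<int>& in,
  const std::vector<std::size_t>& begin_offsets,
  const std::vector<std::size_t>& end_offsets)
{
  return segmented_reduce_impl_sketch<Determinism>(
    in, begin_offsets, end_offsets,
    [](int a, int b) { return b < a ? b : a; },
    std::numeric_limits<int>::max());
}
```

Instantiating either wrapper with gpu_to_gpu_t fails to compile, so the check lives in exactly one function instead of being repeated in every overload.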
@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 1h 22m: Pass: 14%/249 | Total: 3d 07h | Max: 1h 04m | Hits: 88%/44193

See results here.

REQUIRE(error == cudaSuccess);
}

C2H_TEST("cub::DeviceSegmentedReduce::Reduce env-based API", "[segmented_reduce][env]")
Contributor

Where is env used in the env API tests? Is the focus here just to show the single-phase API with the default env?

We use streams or memory resources in the other algorithms' env API tests to show the usage. Do we want to do the same here as well?

Labels

None yet

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Add env-based API for cub::DeviceSegmentedReduce

4 participants