Accumulate becomes slow for very large input sizes

For input sizes of about 2^27 elements or larger, the `gpu__accumulate_previous_coupled_preblocks_` kernel starts to heavily dominate the runtime of the scan, according to `CUDA.@profile` on my system (tested with Int32 and Float32).

For example, with 2^27 elements:
- 3.37 ms are spent in the top-level `gpu__accumulate_block_`
- 13.11 µs are spent in the recursed `gpu__accumulate_block_`
- **11.16 ms** are spent in the `gpu__accumulate_previous_coupled_preblocks_`

Since `accumulate_previous_coupled_preblocks` is essentially just a vectorized add, it should not be this slow. The problem gets much worse for larger vectors: with 2^30 elements, `gpu__accumulate_previous_coupled_preblocks_` takes 580 ms, which is 92% of the total accumulate time on my system.
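To put these timings in perspective, here is a rough back-of-envelope sketch of the effective memory throughput they imply. It assumes 4-byte elements (Int32/Float32) and that the kernel reads and writes each element exactly once. That traffic model is my simplification, not something measured, and the helper name is made up for illustration.

```python
# Back-of-envelope check using only the timings quoted in this issue.
BYTES_PER_ELEM = 4  # Int32 / Float32


def effective_bandwidth_gb_s(n_elems, seconds, bytes_per_elem=BYTES_PER_ELEM):
    """Effective throughput in GB/s for a kernel over n_elems elements.

    Assumption (not measured): each element is read once and written once.
    """
    traffic_bytes = 2 * n_elems * bytes_per_elem
    return traffic_bytes / seconds / 1e9


# gpu__accumulate_previous_coupled_preblocks_ timings from above
bw_27 = effective_bandwidth_gb_s(2**27, 11.16e-3)  # 2^27 elements, 11.16 ms
bw_30 = effective_bandwidth_gb_s(2**30, 580e-3)    # 2^30 elements, 580 ms

size_ratio = 2**30 // 2**27       # how much larger the input is
time_ratio = 580e-3 / 11.16e-3    # how much longer the kernel takes

print(f"2^27: {bw_27:.1f} GB/s, 2^30: {bw_30:.1f} GB/s")
print(f"{size_ratio}x the elements, {time_ratio:.0f}x the time")
```

Under these assumptions the kernel reaches roughly 96 GB/s at 2^27 elements and only about 15 GB/s at 2^30. The input grows 8x while the kernel time grows about 52x, so the slowdown is clearly superlinear rather than just a constant-factor inefficiency.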

Compared against a simple C++ CUB reference implementation, AcceleratedKernels.jl keeps up with CUB performance very well for smaller inputs, but then suddenly falls off a cliff at these larger sizes. The CUB reference takes ~12 ms for 2^30 elements, and an alpaka3 implementation of the same coupled lookback takes ~27 ms for this size.

Accumulate becomes slow for very large input sizes #75
