Accumulate becomes slow for very large input sizes #75

@AntonReinhard

For input sizes of about 2^27 elements or larger, the gpu__accumulate_previous_coupled_preblocks_ call starts to dominate the runtime of the scan, according to CUDA.@profile on my system (tested with int32 and float32).

For example, using 2^27 elements,

  • 3.37 ms are spent in the top-level gpu__accumulate_block_
  • 13.11 µs are spent in the recursed gpu__accumulate_block_
  • 11.16 ms are spent in the gpu__accumulate_previous_coupled_preblocks_
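
As a rough cross-check of the breakdown above, the share of each kernel can be computed directly from the reported timings. This is only a sketch: it assumes the three listed kernels account for the whole scan, which the profile output above does not explicitly confirm.

```python
# Kernel timings reported above for 2^27 elements, converted to milliseconds.
timings_ms = {
    "gpu__accumulate_block_ (top level)": 3.37,
    "gpu__accumulate_block_ (recursed)": 13.11e-3,  # 13.11 µs
    "gpu__accumulate_previous_coupled_preblocks_": 11.16,
}

# Assumption: these three kernels make up the entire scan time.
total_ms = sum(timings_ms.values())
shares = {name: t / total_ms for name, t in timings_ms.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Under that assumption, the preblocks kernel already accounts for roughly three quarters of the scan at 2^27 elements.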

Since accumulate_previous_coupled_preblocks is essentially just a vectorized add, it should not be this slow. The problem gets much worse for larger vectors: for 2^30 elements, gpu__accumulate_previous_coupled_preblocks_ takes 580 ms, which is 92% of the total accumulate time on my system.
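
To put "should not be this slow" into numbers, here is a back-of-the-envelope estimate of the effective memory bandwidth this kernel achieves. It is a sketch under two assumptions that are not measured here: each element is read once and written once, and elements are 4 bytes (int32/float32, as tested). The helper name `effective_bandwidth_gb_s` is made up for this illustration.

```python
BYTES_PER_ELEMENT = 4  # int32 / float32, the types tested above


def effective_bandwidth_gb_s(n_elements, kernel_time_ms):
    """Estimate effective bandwidth in GB/s for an in-place vectorized add.

    Assumption: every element is read once and written once, so the kernel
    moves 2 * n_elements * BYTES_PER_ELEMENT bytes in total.
    """
    bytes_moved = 2 * n_elements * BYTES_PER_ELEMENT
    return bytes_moved / (kernel_time_ms * 1e-3) / 1e9


bw_27 = effective_bandwidth_gb_s(2**27, 11.16)  # timing reported for 2^27
bw_30 = effective_bandwidth_gb_s(2**30, 580.0)  # timing reported for 2^30

# How much the kernel time grows versus how much the input grows.
time_growth = 580.0 / 11.16
size_growth = 2**30 / 2**27

print(f"2^27: {bw_27:.1f} GB/s, 2^30: {bw_30:.1f} GB/s")
print(f"input grows {size_growth:.0f}x, kernel time grows {time_growth:.0f}x")
```

With these assumptions the effective bandwidth drops from roughly 96 GB/s at 2^27 to roughly 15 GB/s at 2^30, so the kernel time grows far faster than the input size.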

For comparison, I measured against a simple C++ CUB reference implementation. AcceleratedKernels.jl keeps up with CUB very well for smaller inputs, but then suddenly falls off a cliff at these larger sizes. The CUB reference takes ~12 ms for 2^30 elements, and an alpaka3 implementation of the same coupled lookback takes ~27 ms for this size.
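
The size of the gap at 2^30 elements can be estimated from the figures above. This is an approximate sketch: the total accumulate time is inferred from the statement that 580 ms is 92% of it, and the CUB and alpaka3 timings are the approximate values quoted above.

```python
# Figures quoted in this issue for 2^30 elements.
preblocks_ms = 580.0     # gpu__accumulate_previous_coupled_preblocks_
preblocks_share = 0.92   # stated share of the total accumulate time
cub_ms = 12.0            # approximate CUB reference time
alpaka3_ms = 27.0        # approximate alpaka3 time

# Inferred total accumulate time (an estimate, not a direct measurement).
total_ms = preblocks_ms / preblocks_share

slowdown_vs_cub = total_ms / cub_ms
slowdown_vs_alpaka3 = total_ms / alpaka3_ms

print(f"estimated total: {total_ms:.0f} ms")
print(f"vs CUB: {slowdown_vs_cub:.0f}x, vs alpaka3: {slowdown_vs_alpaka3:.0f}x")
```

By this estimate the accumulate takes about 630 ms in total, on the order of 50x slower than the CUB reference and a bit over 20x slower than the alpaka3 implementation.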
