Skip to content

Latest commit

 

History

History
13 lines (8 loc) · 439 Bytes

File metadata and controls

13 lines (8 loc) · 439 Bytes

Tradeoffs for scattering within the momintegrals kernel vs in a separate kernel:

+Separate kernel: . Could maybe use shared memory tiling? . scheduling flexibility (?adv)

Using CUDA.jl's handy launch_configuration function

to auto-select block size/count based on detected hardware

Optimization: while loops instead of steprange in both kernels

Reduces register usage per thread running the kernel, directly increases occupancy