Tradeoffs for scattering within the momintegrals kernel vs in a separate kernel:

+Separate kernel: . Could maybe use shared memory tiling? . scheduling flexibility (?adv)

Using CUDA.jl's handy launch_configuration function

to auto-select block size/count based on detected hardware

Reduces register usage per thread running the kernel, directly increases occupancy