That way, we won't need to call `cudaStreamSynchronize()` in each step and can just call `cudaDeviceSynchronize()` once at the very end.
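
To make the idea concrete, here is a minimal sketch of what this could look like. It assumes every step only queues work on a single stream and that the host does not need to read results between steps; the kernel `step_kernel`, the sizes, and the step count are made up for illustration and are not from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical per-step kernel (illustration only): adds 1.0f to every element.
__global__ void step_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;      // assumed problem size
    const int num_steps = 10;   // assumed number of steps
    const size_t bytes = n * sizeof(float);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemsetAsync(d_data, 0, bytes, stream);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int step = 0; step < num_steps; ++step) {
        // Work queued on the same stream executes in issue order, so step k+1
        // sees the results of step k without a cudaStreamSynchronize() here.
        step_kernel<<<blocks, threads, 0, stream>>>(d_data, n);
    }

    // One blocking wait at the very end, covering all work queued on the device.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy one element back; after num_steps increments it should equal num_steps.
    float first = 0.0f;
    cudaMemcpy(&first, d_data, sizeof(float), cudaMemcpyDeviceToHost);
    printf("first element = %f\n", first);

    cudaFree(d_data);
    cudaStreamDestroy(stream);
    return 0;
}
```

Note that this only works if nothing on the host depends on intermediate results. If a step has to read data back on the CPU before the next one starts, that step still needs its own synchronization.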