for a block size of 128: each block loads 128 features into shared memory, every thread iterates over those 128, then the block loads the next 128 features and repeats until all features have been processed
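A minimal CUDA sketch of this tiling loop, under assumptions the note does not state: features are single floats, each thread owns one query value, and the per-feature work is "keep the smallest absolute difference". The names `nearest_feature`, `run_nearest_feature`, and `TILE` are hypothetical and only illustrate the load / sync / iterate / sync pattern.

```cuda
#include <cfloat>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 128;  // assumed equal to the block size (blockDim.x)

// Each thread handles one query; the block streams features through shared
// memory in tiles of TILE. The per-feature work (min |query - feature|) is
// an illustrative stand-in for whatever the real kernel computes.
__global__ void nearest_feature(const float* queries, int num_queries,
                                const float* features, int num_features,
                                float* best_dist)
{
    __shared__ float tile[TILE];

    int q = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads still take part in loading and syncing,
    // so there is no early return before __syncthreads().
    float my_query = (q < num_queries) ? queries[q] : 0.0f;
    float best = FLT_MAX;

    for (int base = 0; base < num_features; base += TILE) {
        // Cooperative load: thread i fetches feature base + i.
        int f = base + threadIdx.x;
        if (f < num_features) tile[threadIdx.x] = features[f];
        __syncthreads();  // tile fully loaded before anyone reads it

        // The last tile may be partial.
        int count = min(TILE, num_features - base);
        for (int i = 0; i < count; ++i)
            best = fminf(best, fabsf(my_query - tile[i]));

        __syncthreads();  // everyone done reading before the next load
    }

    if (q < num_queries) best_dist[q] = best;
}

// Hypothetical host wrapper: copies inputs to the device, launches one
// thread per query with TILE threads per block, and copies results back.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_nearest_feature(const std::vector<float>& queries,
                                       const std::vector<float>& features)
{
    int nq = static_cast<int>(queries.size());
    int nf = static_cast<int>(features.size());
    std::vector<float> out(nq);
    if (nq == 0) return out;

    float *d_q = nullptr, *d_f = nullptr, *d_out = nullptr;
    cudaMalloc(&d_q, nq * sizeof(float));
    cudaMalloc(&d_f, (nf > 0 ? nf : 1) * sizeof(float));
    cudaMalloc(&d_out, nq * sizeof(float));
    cudaMemcpy(d_q, queries.data(), nq * sizeof(float), cudaMemcpyHostToDevice);
    if (nf > 0)
        cudaMemcpy(d_f, features.data(), nf * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (nq + TILE - 1) / TILE;
    nearest_feature<<<blocks, TILE>>>(d_q, nq, d_f, nf, d_out);

    cudaMemcpy(out.data(), d_out, nq * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_q);
    cudaFree(d_f);
    cudaFree(d_out);
    return out;
}
```

The second `__syncthreads()` matters as much as the first: without it, a fast thread could overwrite `tile` with the next 128 features while slower threads are still reading the current ones. Each feature is read from global memory once per block instead of once per thread, which is the point of staging it in shared memory.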