Here's sample code.
https://gist.github.com/raver119/3988237c9bb2376b0cd745120c5bc38e
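Both the gather and load functions do exactly the same thing, but because their inner loops are written slightly differently, the compiler emits either gather or load instructions. On Aurora the gather variant is about 3.5x slower than the load variant (210 us vs 60 us), while on x86 the two take roughly the same time.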
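For readers who can't open the gist, here is a minimal sketch of the kind of loop pair that can trigger this difference. It is not the gist's code: the names `scale_load`, `scale_gather` and `identity_indices` are made up for illustration, and whether a given compiler actually emits gather instructions for the second loop is an assumption that depends on the compiler and flags.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical "load" variant: unit-stride reads of x, which a vectorizer
// can usually turn into plain contiguous vector loads.
void scale_load(const float* x, float* z, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) {
        z[i] = x[i] * factor;
    }
}

// Hypothetical "gather" variant: same arithmetic, but x is read through an
// index array. The compiler cannot prove the indices are contiguous, so a
// vectorized version of this loop typically needs gather instructions.
void scale_gather(const float* x, const int* idx, float* z,
                  std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) {
        z[i] = x[idx[i]] * factor;
    }
}

// Helper (also hypothetical): builds idx[i] == i, so that both variants
// read exactly the same elements and must produce the same output.
std::vector<int> identity_indices(std::size_t n) {
    std::vector<int> idx(n);
    for (std::size_t i = 0; i < n; ++i) {
        idx[i] = static_cast<int>(i);
    }
    return idx;
}
```

With identity indices the two functions return identical results, so any timing gap between them comes from the instructions the compiler chose for the memory access rather than from the work being done.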
Output on Aurora:
/opt/nec/ve/bin/nc++ -O3 -fopenmp bug_gather.cpp
Time gather: [210 us]; Time load: [60 us]
Output on x86:
g++ -O3 -mmmx -msse -msse2 -msse3 -msse4.1 -msse4.2 -mavx -mavx2 -mfma -mf16c -mprefetchwt1 -fopenmp bug_gather.cpp
Time gather: [209 us]; Time load: [215 us]