I posted an earlier version of this problem on the QUDA page, where I found that a lot of time was wasted when calculating the propagators, and that the run time stays essentially the same no matter how I set `OMP_NUM_THREADS`. This suggests that OpenMP is not working at some stage. @SaltyChiang pointed out on the QUDA page that `CMakeLists.txt` in the devel branch of qdpxx does not actually set `QDP_USE_OMP_THREADS`. I think this can be fixed in later versions.
I use the latest versions: QMP 2-5-4, QDP++ 1-46-0, QUDA 1.1.0, and Chroma 3-44-0, each checked out to its development branch and built with CMake, using `cc=mpicc`, the `-fopenmp` flag, and `-DQDP_USE_OPENMP=ON`, `-DQUDA_OPENMP=ON`, `-DChroma_ENABLE_OPENMP=ON`. The log of a typical propagator calculation shows a very low invertQuda / initQuda-endQuda ratio. When I watch the process with `top`, I can clearly see that the Chroma program uses only one thread.
However, after I modified the `CMakeLists.txt` of qdpxx, the output of the Chroma program prints `QDP use OpenMP threading. We have x threads` as expected (it did not do so before the change), but the program still uses only one thread. What could be going wrong here? I have checked with a simple C++ program that OpenMP works on the cluster.
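
A check of this kind could look like the sketch below. It is an illustration only, not necessarily the exact program used, and the function names `count_active_threads`, `compiled_with_openmp`, and `report_threads` are made up for this example. It counts how many threads actually enter a parallel region, which is a stronger test than reading `OMP_NUM_THREADS` back.

```cpp
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: true if the compiler enabled OpenMP (e.g. via -fopenmp).
bool compiled_with_openmp() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: number of threads that really executed the parallel
// region. Each thread adds 1 to its private copy of `count`, and the
// reduction sums the copies. Without OpenMP the pragma is ignored and the
// result is 1.
int count_active_threads() {
    int count = 0;
    #pragma omp parallel reduction(+:count)
    {
        count += 1;
    }
    return count;
}

// Hypothetical helper: print a one-line summary for the job log.
void report_threads() {
    std::printf("compiled with OpenMP: %s, active threads: %d\n",
                compiled_with_openmp() ? "yes" : "no",
                count_active_threads());
}
```

Calling `report_threads()` from `main`, building with `-fopenmp`, and running with different values of `OMP_NUM_THREADS` should change the reported thread count if OpenMP works on the node.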