I noticed that the SHA256 implementation sha_256_x86_sha() for x86-64 with the SHA extension is over 10x slower than the other implementation, sha_256_x86_bmi().
evmone-statetest fixtures_static/fixtures/state_tests --gtest_filter=static/state_tests/stQuadraticComplexityTest.Call50000_sha256
This takes 29.4s with sha_256_x86_sha() and 2.5s with sha_256_x86_bmi().