benchmarks: No way to get reproducible results

Run-to-run variance in the benchmark timings is far too high for results to be comparable.

I guess we could wrap the function in a for loop and keep the lowest measurement. That would certainly improve things.
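
A minimal sketch of that loop, assuming a POSIX system with `clock_gettime()`. The names `bench_fn`, `now_ns`, `bench_min_ns` and `example_workload` are made up for illustration and are not part of the existing benchmark code:

```c
#define _POSIX_C_SOURCE 200809L /* expose clock_gettime() */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature for the function under test. */
typedef void (*bench_fn)(void *arg);

/* Monotonic clock reading in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call fn(arg) `runs` times and return the shortest elapsed time in ns.
 * The minimum discards runs that were disturbed by interrupts,
 * scheduling or cold caches. */
uint64_t bench_min_ns(bench_fn fn, void *arg, unsigned runs)
{
    uint64_t best = UINT64_MAX;

    assert(fn != NULL && runs > 0);
    for (unsigned i = 0; i < runs; i++) {
        uint64_t start = now_ns();
        fn(arg);
        uint64_t elapsed = now_ns() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}

/* Hypothetical workload: sums 0..99999 and stores the result in *arg.
 * The volatile accumulator keeps the compiler from removing the loop. */
void example_workload(void *arg)
{
    uint64_t *out = arg;
    volatile uint64_t sum = 0;

    for (uint64_t i = 0; i < 100000; i++)
        sum += i;
    *out = sum;
}
```

Calling `bench_min_ns(example_workload, &result, 100)` with a `uint64_t result` would return the fastest of 100 runs. This only filters out slow outliers; it still measures wall-clock time, so it does not replace cycle-accurate counters.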

But I think we had better use performance counters instead.
Epiphany isn't affected by this since we already use CTIMERs there.

PAPI seems to have the cross-platform support we need:
http://icl.cs.utk.edu/papi/