Provoni wrote: Instead of using "asm nop" it is better to do "x += rnd" and print x after the measurement, because then the compiler can still do loop optimizations, resulting in a higher and more realistic MHz.
I am using 'asm nop' in the dummy loop. You are not suggesting we use 'x += rnd' in the dummy loop, are you?
Although 'asm nop' tells the compiler to 'back off', it introduces its own distortion. We should also include 'asm nop' in the rnd loop so that its cost cancels out in the 'dt1 - dt0' calculation.
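To pin down the arithmetic, here is a rough sketch in Python rather than the compiled code under discussion. The names timed, net_mhz, dt_dummy and dt_rnd are mine, and I am taking MHz to mean millions of values per second. The call to step() plays the part of the shared 'asm nop': it sits in both loops, so its cost drops out of the subtraction.

```python
import random
import time

def timed(n, step):
    """Time n calls of step(); the call overhead is common to both loops."""
    t0 = time.perf_counter()
    for _ in range(n):
        step()
    return time.perf_counter() - t0

def net_mhz(n, dt_dummy, dt_rnd):
    """Millions of values per second once the dummy-loop time is subtracted."""
    net = dt_rnd - dt_dummy
    if net <= 0:
        # The dummy loop took at least as long as the rnd loop,
        # so the overhead estimate is unusable for this run.
        return None
    return n / net / 1e6

if __name__ == "__main__":
    n = 10**6
    dt_dummy = timed(n, lambda: 0)       # stand-in for the 'x = 0' loop
    dt_rnd = timed(n, random.random)     # stand-in for the generator loop
    print(net_mhz(n, dt_dummy, dt_rnd))
```

Returning None instead of a negative figure at least makes a bad run stand out instead of quietly feeding nonsense into the results.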
All seemed to be going well, with most of the generators showing higher 'true' speeds, until I tested CryptoS in CryptoRNDII, where I got a negative MHz. It transpired that the rnd loop was running faster than the dummy loop, which is obviously nonsense. There wasn't much in it, but the result was negative nonetheless.
I then swapped the order of execution: rnd loop first and then dummy loop. This time the rnd loop was far too slow compared to the dummy loop, giving a much smaller MHz. I increased the loop count from 10^8 to 10^9; 10^8 is too small considering the speed of the generators. I then increased the number of tests from 1 to 9 so as to use the median, and included a small Sleep between them so that the 9 tests do not run back to back without some idle time.
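The median-of-9 arrangement looks roughly like this, again as a Python sketch rather than the actual test code. The name median_of_runs and the 50 ms pause are my own assumptions; run_once stands for whatever performs one complete measurement and returns a number.

```python
import statistics
import time

def median_of_runs(run_once, runs=9, pause_s=0.05):
    """Call run_once() several times, idling briefly between runs,
    and return the median of the results."""
    results = []
    for i in range(runs):
        results.append(run_once())
        if i < runs - 1:
            time.sleep(pause_s)  # short idle period between tests
    return statistics.median(results)
```

With an odd number of runs the median is one of the actual measurements, so a single freak result at either end does not drag the figure about the way it would with a mean.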
I sometimes think that, with all the shenanigans going on when code executes in the CPU cache compared to years gone by, and with optimizing compilers sometimes being too clever for their own good when it comes to our timing code, we are on a hiding to nothing.
Why 'x = CryptoS' seems to be faster than 'x = 0' is testing me somewhat.
This is why I favor giving generators a specific task and then comparing how much work each gets through in the same amount of time. The question then is: do we test them all in one application, or test them individually to avoid the order of execution anomaly?
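A minimal sketch of the 'same time, count the work' idea, in Python and under my own assumptions: work_done, budget_s and batch are hypothetical names, and task stands for one unit of whatever job the generator is given. The clock is only read once per batch so that reading it does not swamp the work being counted.

```python
import time

def work_done(task, budget_s, batch=1000, clock=time.perf_counter):
    """Run task() in batches until budget_s seconds have elapsed;
    return how many calls were completed."""
    deadline = clock() + budget_s
    count = 0
    while clock() < deadline:
        for _ in range(batch):
            task()
        count += batch
    return count
```

Each generator gets the same budget_s and the counts are compared directly, so no dummy loop is subtracted. It does not settle the one-application-or-separate question, though.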
BTW, Provoni, did you get the PCG32II Help file?