The linked code expects the floating-point value to be in ST(0), so it’s apparently intended to be called from FPU code. To keep it simple I tested a function version that takes the value as an argument. That adds the overhead of pushing the value onto the stack (two PUSHes) and loading it into the FPU (FLD).
Code:
''===================================================================================
#include "counter.bas"
#include "crt.bi"
''===================================================================================
''
'' The newer cycle count macros are available here:
''
'' http://www.freebasic.net/forum/viewtopic.php?f=7&t=20003
''
''===================================================================================
dim as double d = 12345.6789
dim as integer i
''===================================================================================
function ftol2 naked( byval x as double ) as integer
    asm
        fld QWORD PTR [esp+4]       '' load x
        fnstcw [esp-2]              '' get old cw
        mov ax, [esp-2]             '' copy to ax
        or ax, 0x0c00               '' set RC bits for truncate
        mov [esp-4], ax             '' copy new cw to memory
        fldcw [esp-4]               '' load new cw
        fistp QWORD PTR [esp-12]    '' store value rounded to 64-bit integer
        fldcw [esp-2]               '' load old cw
        mov eax, [esp-12]           '' return in EDX:EAX
        mov edx, [esp-8]
        ret 8
    end asm
end function
''===================================================================================
i = int(d)
print i
i = ftol2(d)
print i
print
SetProcessAffinityMask( GetCurrentProcess(), 1)
sleep 5000
for j as integer = 1 to 4
    counter_begin( 10000000, REALTIME_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL )
    counter_end()
    print counter_cycles;" cycles, empty"
    counter_begin( 10000000, REALTIME_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL )
        i = int(d)
    counter_end()
    print counter_cycles;" cycles, int()"
    counter_begin( 10000000, REALTIME_PRIORITY_CLASS, THREAD_PRIORITY_TIME_CRITICAL )
        i = ftol2(d)
    counter_end()
    print counter_cycles;" cycles, ftol2()"
    print
next
sleep
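As an aside on the `or ax, 0x0c00` step in ftol2: bits 10 and 11 of the x87 control word form the rounding-control (RC) field, and setting both selects truncation. Here is a minimal C sketch of just that bit manipulation, for illustration only. The helper names are made up for this example, and it only models the 16-bit value; it does not touch the real FPU.

```c
#include <stdint.h>

/* Rounding-control (RC) field of the x87 control word: bits 10 and 11. */
#define X87_RC_MASK 0x0C00u

/* Hypothetical helper: extract the RC field.
   0 = round to nearest, 1 = round down, 2 = round up, 3 = truncate. */
unsigned x87_rounding_control(uint16_t cw)
{
    return (cw & X87_RC_MASK) >> 10;
}

/* Hypothetical helper: the C equivalent of `or ax, 0x0c00`.
   Setting both RC bits selects truncation (round toward zero). */
uint16_t x87_cw_with_truncation(uint16_t cw)
{
    return (uint16_t)(cw | X87_RC_MASK);
}
```

For example, starting from 0x037F (the control word that FINIT installs, with RC = round to nearest), x87_cw_with_truncation gives 0x0F7F. Because the OR only sets bits, the result has RC = 3 whatever the old rounding mode was, so the routine does not need to clear the field first.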
Results from running the test program on a P3:
Code:
12345
12345
0 cycles, empty
65 cycles, int()
47 cycles, ftol2()
0 cycles, empty
65 cycles, int()
47 cycles, ftol2()
0 cycles, empty
65 cycles, int()
47 cycles, ftol2()
0 cycles, empty
65 cycles, int()
47 cycles, ftol2()
That’s 13 cycles more than I measured for the Microsoft version, and I think the additional overhead could plausibly account for that many cycles.
Edit:
What I timed for the Microsoft version included the FLD, and the pushes alone would not have taken 13 cycles, so I suspect the Microsoft version is different from what I tested here. My attempts to optimize the code by eliminating the partial register accesses did not make it faster.
Edit2:
The Microsoft code is nothing like what I tested here. It does not change the FPU rounding mode, and since it contains several conditional jumps, the cycle count may vary with the input value.
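To illustrate how a truncating conversion can leave the rounding mode alone and use branches instead, here is a rough C sketch. It is not the Microsoft code, only one way the general idea could work: convert using whatever rounding mode is currently in effect, then step back toward zero if the result overshot. The function names are made up for this example. The rounding step uses the well-known "add 1.5 * 2^52" trick, and the sketch assumes IEEE-754 doubles and |x| < 2^31.

```c
#include <stdint.h>
#include <string.h>

/* Round x to an integer in whatever rounding mode is currently set.
   Adding 1.5 * 2^52 pushes the value into a range where doubles have
   a spacing of exactly 1, so the FPU rounds away the fraction and the
   integer ends up in the low bits of the mantissa.
   Assumes IEEE-754 doubles and |x| < 2^31. */
static int32_t round_current_mode(double x)
{
    volatile double biased = x + 6755399441055744.0; /* 1.5 * 2^52 */
    double tmp = biased;       /* volatile forces a real double store */
    uint64_t bits;
    memcpy(&bits, &tmp, sizeof bits);
    return (int32_t)(uint32_t)bits; /* low 32 bits hold the integer */
}

/* Truncate toward zero without changing the rounding mode.
   The rounded value is either floor(x) or ceil(x); if it lies further
   from zero than x, move it one step back. */
int32_t trunc_keep_mode(double x)
{
    int32_t r = round_current_mode(x);
    if (x >= 0.0) {
        if ((double)r > x)
            r -= 1;            /* rounded up past x */
    } else {
        if ((double)r < x)
            r += 1;            /* rounded down past x */
    }
    return r;
}
```

With the value from the test above, trunc_keep_mode(12345.6789) first rounds to 12346 and then takes the correction branch to return 12345, while an input such as 12345.25 rounds straight to 12345 and skips it. Which branch runs depends on the input, which is the kind of data dependence that can make a cycle count vary.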