LeapYear Function (Boolean eval.)

jj2007 · Post by **jj2007** » Nov 15, 2017 2:29

deltarho[1859] wrote:Your latest results bear little resemblance to earlier results where R2 was faster than Asm(function).

You mean this post?
Yes indeed; as mentioned earlier, GCC inserts 2x2x2=8 calls before and after the asm code. With "naked", that crap is gone, and of course it increases speed dramatically. But still, the inlined asm is twice as fast as the function asm. The reason is that in the inlined version, I replaced also the FB for ... next with the asm equivalent. The LeapYear code itself is so fast that the little overhead of a for ... next loop has already a significant share of total time.

deltarho[1859] · Post by **deltarho[1859]** » Nov 15, 2017 6:26

@jj2007

Yes, HERE

It is a pity that the forum software does not show post numbers.

From that post we have ( -gen gcc -s console )

Code: Select all

elapsed Asm: 172ms           24250000 leap years found (inline)
elapsed Asm: 175ms           24250000 leap years found (inline)
elapsed Asm: 174ms           24250000 leap years found (inline) (repeated 5x to make sure timings are reliable)
elapsed Asm: 173ms           24250000 leap years found (inline)
elapsed Asm: 175ms           24250000 leap years found (inline)
elapsed Asm: 699ms           24250000 leap years found (function)
elapsed D:   878ms           24250000 leap years found
elapsed C:   560ms           24250000 leap years found
elapsed B:   534ms           24250000 leap years found
elapsed R2:  625ms           24250000 leap years found (Roland Chastain 2)
elapsed AL:  904ms           24250000 leap years found (ULong)
elapsed AI:  2639ms          24250000 leap years found (ULongInt)

From your latest figures we have

Code: Select all

elapsed Asm: 184 ms          24250000 leap years found (inline)
elapsed Asm: 187 ms          24250000 leap years found (inline)
elapsed Asm: 178 ms          24250000 leap years found (inline)
elapsed Asm: 358 ms          24250000 leap years found (function)  <<< naked cdecl
elapsed Asm: 361 ms          24250000 leap years found (function)
elapsed Asm: 361 ms          24250000 leap years found (function)
elapsed D:   1737 ms         24250000 leap years found
elapsed C:   1627 ms         24250000 leap years found
elapsed B:   1615 ms         24250000 leap years found
elapsed R2:  1618 ms         24250000 leap years found (Roland Chastain 2)
elapsed AL:  1782 ms         24250000 leap years found (ULong)
elapsed AI:  3317 ms         24250000 leap years found (ULongInt)

The differences with the inline are less than 7% which we can get simply by running code at different times. For some years now I have never regarded code A as being faster than code B, for example, unless the difference is greater than 7%.

My point about "bear little resemblance" is with regard D, ..., AI. R2, for example, is showing 625ms in the first list and 1618 ms in the second list.

jj2007 wrote:With "naked", that crap is gone, and of course it increases speed dramatically.

It does if no compiler optimisation is employed. I invariably use -gen gcc -Wc -O3 and in this case Naked has no effect.

For -gen gcc -Wc -O3 fans the following uses just that with jj2007's code, without alteration, used to produce the first list above. Bear in mind that I am using an Intel i7 so we cannot just multiply the first list by a factor and get my list - there will be variations.

Code: Select all

elapsed Asm: 143ms           24250000 leap years found (inline)
elapsed Asm: 145ms           24250000 leap years found (inline)
elapsed Asm: 144ms           24250000 leap years found (inline)
elapsed Asm: 144ms           24250000 leap years found (inline)
elapsed Asm: 143ms           24250000 leap years found (inline)
elapsed Asm: 261ms           24250000 leap years found (function)
elapsed D:   123ms           24250000 leap years found
elapsed C:   71ms            24250000 leap years found
elapsed B:   71ms            24250000 leap years found
elapsed R2:  70ms            24250000 leap years found (Roland Chastain 2)
elapsed AL:  199ms           24250000 leap years found (ULong)
elapsed AI:  1176ms          24250000 leap years found (ULongInt)

What we have here is proof, on my machine anyway, that sometimes compiler optimised FreeBASIC BASIC will be faster than our attempts to replace BASIC by asm. I found this when I was developing my fast random number generators. With PowerBASIC the BASIC I replaced with asm saw a speed improvement in each case. With FreeBASIC I had to check that was still the case and, in some cases, it was not true so I reverted to BASIC. In theory, asm should be faster than BASIC, even when optimised, but if it is not then it is not. In this case do we choose asm because it should be faster or optimised BASIC because it is faster. The choice is clear to me. <smile>

From my list Roland Chastain's second contribution as posted takes the crown.

jj2007 · Post by **jj2007** » Nov 15, 2017 10:04

deltarho[1859] wrote:sometimes compiler optimised FreeBASIC BASIC will be faster than our attempts to replace BASIC by asm.

Now I am really curious what your compiler does! Can you please post the executable somewhere?

deltarho[1859] · Post by **deltarho[1859]** » Nov 15, 2017 13:52

http://deltarho.org.uk/Downloads/jj2007 ... 122017.zip

jj2007 · Post by **jj2007** » Nov 15, 2017 18:14

deltarho[1859] wrote:http://deltarho.org.uk/Downloads/jj2007 ... 122017.zip

Thanks a lot. This is an interesting piece of code indeed. The compiler replaces the division (very slow) with a multiplication by a magic number. The technique is well-known in assembler, but I didn't use it out of laziness ;-)

Code: Select all

Address         Hex dump              Command                     Comments
00406952        ³.  31DB              xor ebx, ebx                ; leap years=0
00406954        ³.  B9 01000000       mov ecx, 1                  ; loop counter, initial value
00406959        ³.  BF 1F85EB51       mov edi, 51EB851F           ; magic number for fake division
0040695E        ³.  66:90             nop
00406960        ³>  41                inc ecx                     ; FOR loop: inc counter
00406961        ³.  81F9 01E1F505     cmp ecx, 5F5E101            ; check boundary
00406967        ³. 74 3A             je short 004069A3
00406969        ³>  F6C1 03           Útest cl, 03                ; mod 4 test
0040696C        ³. 75 F2             ³jnz short 00406960
0040696E        ³.  89C8              ³mov eax, ecx               ; simulate
00406970        ³.  F7E7              ³mul edi                    ; division
00406972        ³.  89D0              ³mov eax, edx
00406974        ³.  C1E8 05           ³shr eax, 5                 ; divide by 32
00406977        ³.  8D0480            ³lea eax, [eax*4+eax]       ; mul 5
0040697A        ³.  8D0480            ³lea eax, [eax*4+eax]       ; mul 5
0040697D        ³.  C1E0 02           ³shl eax, 2                 ; mul 4 -> 5*5*4=100
00406980        ³.  39C1              ³cmp ecx, eax
00406982        ³. 75 10             ³jne short 00406994         ; 100x?
00406984        ³.  C1EA 07           ³shr edx, 7                 ; div 128
00406987        ³.  8D0492            ³lea eax, [edx*4+edx]       ; mul 5
0040698A        ³.  8D0480            ³lea eax, [eax*4+eax]       ; mul 5
0040698D        ³.  C1E0 04           ³shl eax, 4                 ; mul 16 -> 5*5*16=400
00406990        ³.  39C1              ³cmp ecx, eax
00406992        ³. 75 CC             ³jne short 00406960         ; 400x?
00406994        ³>  43                ³inc ebx                    ; inc leap years
00406995        ³.  BE 01000000       ³mov esi, 1                 ; ???
0040699A        ³.  41                ³inc ecx                    ; inc counter
0040699B        ³.  81F9 01E1F505     ³cmp ecx, 5F5E101           ; check boundary
004069A1        ³. 75 C6             Àjne short 00406969         ; NEXT

Comments are mine. I have not yet fully understood what it does, but I have to admit that these are clever tricks.

deltarho[1859] · Post by **deltarho[1859]** » Nov 15, 2017 18:52

So, it would appear then that on those occasions when compiler optimised FreeBASIC BASIC is faster than my attempts to replace BASIC by asm is simply because on those occasions the compiler is smarter than I am. <HaHaHa>

That makes sense because the number of opcodes that I am au fait with is a small percentage of those available. In fact, even with your leap year asm there was an opcode I had to look up.

In future when the compiler is beating me perhaps I should post my asm and ask if anyone can improve on it. <smile>

jj2007 · Post by **jj2007** » Nov 15, 2017 20:45

deltarho[1859] wrote:In future when the compiler is beating me perhaps I should post my asm and ask if anyone can improve on it. <smile>

We have played that game many times in the Masm32 forum. It is fun to beat a C compiler, or the C Runtime library. And often you can indeed beat the CRT, for example by using SIMD instructions. But as the example above shows, C compiler know some good tricks, too. Guess who taught them these tricks. Hint: You can't use C for teaching them ;-)

LeapYear Function (Boolean eval.)

Re: LeapYear Function (Boolean eval.)

Re: LeapYear Function (Boolean eval.)

Re: LeapYear Function (Boolean eval.)

Re: LeapYear Function (Boolean eval.)

Re: LeapYear Function (Boolean eval.)

Re: LeapYear Function (Boolean eval.)

Re: LeapYear Function (Boolean eval.)