Thanks greenink and Stonemonkey,
Some of the data is loaded with unaligned moves because the aligned variants crash the program there. I tried putting ".align 16" before the FreeBASIC initialization of the arrays, but it made no difference.
Here's my assembler code. It is around 15 to 20% faster than the code FreeBASIC generates with -O max. I have gone through various optimization guides and wonder if anyone could offer further advice.
Code:
'64-bit
'- the following asm block is a small part of an inner loop
asm
'- would it be worthwhile to store the following memory lookups
'in an xmm# register or similar earlier on?
movapd xmm0,[mul0] 'can use movapd here instead of movupd (aligned)
xor rax,rax
mov r8,[map2_ptr]
lea r9,[sol] 'can use lea here
mov r11,[g5_ptr] 'can't use lea here, why?
mov r12,[ngrams_ptr] 'can't use lea here, why?
movsx rsi,word ptr[r8+2]
movsx rcx,word ptr[r8]
l1:
'- various operations have been moved around to break dependencies
'- sse operations are used to calculate 5-dim array lookup, is that
'actually a worthwhile optimization to consider?
movupd xmm1,[4+r9+rsi*4] 'must use movupd here (unaligned)
movsx rbx,dword ptr[r9+rsi*4]
pmulld xmm1,xmm0 'multiply the 4 dwords element-wise by xmm0
add r8,2
movsx r13,word ptr[r12+rsi*2]
phaddd xmm1,xmm1 'horizontal addition a+b,c+d
sub rax,r13
phaddd xmm1,xmm1 'horizontal addition a+b+c+d
movsx rsi,word ptr[r8+2] 'get rsi for next loop iteration and break dependencies
movd r10d,xmm1 'get a+b+c+d
add rbx,r10
movsx rdx,word ptr[r11+rbx*2]
add rax,rdx
dec rcx
jnz l1
add [new_ngram_score],rax
end asm
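
For reference, this is roughly what the loop computes, written as plain FreeBASIC. It is only a sketch: ngram_delta and its parameter names are made up for illustration, the element types (Short for map2, g5 and ngrams; Long for sol and mul0) are inferred from the word and dword loads in the asm, and it assumes map2[0] is at least 1, since the asm loop always runs once before testing the counter. The sample data at the bottom is invented and only shows the call shape.

```freebasic
' Scalar sketch of the asm inner loop. All names here are placeholders.
' map2[0] holds the number of entries, map2[1..n] hold positions into sol.
' mul0 holds the 4 multipliers that flatten the 5-dim lookup into one index.
Function ngram_delta( _
    ByVal map2   As Short Ptr, _
    ByVal sol    As Long Ptr, _
    ByVal mul0   As Long Ptr, _
    ByVal g5     As Short Ptr, _
    ByVal ngrams As Short Ptr) As LongInt

    Dim delta As LongInt = 0
    Dim n As Integer = map2[0]

    For i As Integer = 1 To n
        Dim p As Integer = map2[i]

        ' sol[p] is added as-is; the next four values are scaled by mul0
        ' (this is the part done with pmulld + 2x phaddd in the asm).
        Dim idx As LongInt = sol[p]
        For k As Integer = 0 To 3
            idx += sol[p + 1 + k] * mul0[k]
        Next

        ' add the looked-up score, subtract the old one for this position
        delta += g5[idx] - ngrams[p]
    Next

    Return delta
End Function

' Invented sample data, small enough to check by hand.
Dim map2(0 To 2) As Short = {2, 0, 1}
Dim sol(0 To 5) As Long = {1, 0, 2, 0, 1, 3}
Dim mul0(0 To 3) As Long = {2, 4, 8, 16}
Dim g5(0 To 63) As Short
Dim ngrams(0 To 1) As Short = {5, 7}
g5(25) = 10
g5(60) = 20

Print ngram_delta(@map2(0), @sol(0), @mul0(0), @g5(0), @ngrams(0))
```

One difference to keep in mind: pmulld keeps only the low 32 bits of each product, so the two versions only agree as long as the products and their sum fit in 32 bits.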