Inline assembler

Provoni · Post by **Provoni** » Apr 02, 2017 12:09

That was an error on my part, but it seems to be working as intended, since the following behaviour is the same for both code examples.

When the line "movd [a],xmm1" is changed to "movd [a],xmm0", the value 26 is returned, which indicates that at least one value of the mula and mulb arrays is moved into the xmm# registers. After the multiplication step, however, the product register returns 0.

What I wanted to do:

Code: Select all

mov r8,[mula_ptr] 'memory offset of mula into r8
mov r9,[mulb_ptr] 'memory offset of mulb into r9
	
movaps xmm0,[r8] 'move mula into xmm0
movaps xmm1,[r9] 'move mulb into xmm1
	
mulps xmm1,xmm0 'multiply xmm1 by xmm0 (packed single-precision floats)
movd [a],xmm1 'get result

which seems to be equivalent to:

Code: Select all

movaps xmm0,[mula] 'move mula into xmm0
movaps xmm1,[mulb] 'move mulb into xmm1
	
mulps xmm1,xmm0 'multiply xmm1 by xmm0 (packed single-precision floats)
movd [a],xmm1 'get result

Provoni · Post by **Provoni** » Apr 03, 2017 10:30

Problem solved. I was using the wrong multiplication instruction: mulps multiplies packed single-precision floats, while my data are 32-bit integers (dwords), which need pmulld. Here is a good list of instructions that helped me sort it out: https://cmpsb.net/asm/x86/instr/
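To illustrate why the mulps version returned 0, here is a minimal Python sketch (not FreeBASIC; the helper names are made up for illustration). It reinterprets the integer bit patterns as single-precision floats, which is how mulps sees them, and multiplies them. Small integers read as floats are tiny denormal values, so their product underflows to zero.

```python
import struct

def int_bits_as_float(n):
    """Reinterpret the 32 bits of a signed integer as an IEEE-754 single."""
    return struct.unpack("<f", struct.pack("<i", n))[0]

def float_as_int_bits(x):
    """Round x to single precision and return its 32 bits as a signed integer."""
    return struct.unpack("<i", struct.pack("<f", x))[0]

def mulps_lane(a, b):
    """Model one lane of mulps applied to integer data (hypothetical helper)."""
    return float_as_int_bits(int_bits_as_float(a) * int_bits_as_float(b))

mula = [26, 676, 17576, 456976]
mulb = [7, 13, 8, 24]

# What mulps leaves in each lane when fed these integers
print([mulps_lane(a, b) for a, b in zip(mula, mulb)])
# What an integer multiply would give instead
print([a * b for a, b in zip(mula, mulb)])
```

This only models the arithmetic of a single lane; it says nothing about timing or about how the registers are loaded.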

Here is an example of SSE multiplication and horizontal addition with 32-bit integers.

Code: Select all

'64-bit

screenres 800,600

dim as long a
dim as long mula(3)
dim as long mulb(3)

'xmm0
mula(0)=26
mula(1)=676
mula(2)=17576
mula(3)=456976

'xmm1
mulb(0)=7
mulb(1)=13
mulb(2)=8
mulb(3)=24

asm
		
	movupd xmm0,[mula] 'copy mula array into xmm0
	movupd xmm1,[mulb] 'copy mulb array into xmm1
	
	pmulld xmm1,xmm0 'xmm1 = xmm1 * xmm0, four packed 32-bit integers (SSE4.1)
	
	'sum all 4 dwords in the xmm1 register
	phaddd xmm1,xmm1 'horizontal addition: a+b, c+d
	phaddd xmm1,xmm1 'horizontal addition: a+b+c+d
	
	movd [a],xmm1
	
end asm

print "value check: ";(7*26)+(13*676)+(8*17576)+(24*456976)
print "sse value  : ";a

sleep
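The lane-by-lane behaviour of the asm block above can be sketched in plain Python. This is a model under the assumption that each xmm register is a list of four unsigned 32-bit lanes; the function names simply mirror the instruction mnemonics and are not a real API.

```python
MASK32 = 0xFFFFFFFF

def pmulld(dst, src):
    """Multiply matching lanes and keep the low 32 bits of each product."""
    return [(d * s) & MASK32 for d, s in zip(dst, src)]

def phaddd(dst, src):
    """Lanes 0-1 hold the pair sums of dst, lanes 2-3 the pair sums of src."""
    return [
        (dst[0] + dst[1]) & MASK32,
        (dst[2] + dst[3]) & MASK32,
        (src[0] + src[1]) & MASK32,
        (src[2] + src[3]) & MASK32,
    ]

def movd(reg):
    """Return the lowest 32-bit lane."""
    return reg[0]

xmm0 = [26, 676, 17576, 456976]   # mula
xmm1 = [7, 13, 8, 24]             # mulb

xmm1 = pmulld(xmm1, xmm0)
xmm1 = phaddd(xmm1, xmm1)         # a+b, c+d, a+b, c+d
xmm1 = phaddd(xmm1, xmm1)         # a+b+c+d in every lane
a = movd(xmm1)

print("value check:", (7*26) + (13*676) + (8*17576) + (24*456976))
print("model value:", a)
```

After the second phaddd every lane holds the full sum, which is why reading only the lowest lane with movd is enough.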

Provoni · Post by **Provoni** » Apr 05, 2017 8:08

How do you work with mixed variable sizes (16-bit, 32-bit and 64-bit) in assembler? The following program is supposed to return "123" but does not.

Thanks

Code: Select all

'64-bit

screenres 800,600

dim as short word1=123
dim as long dword1=123456
dim as integer qword1=1234567

dim as any ptr word1_ptr=@word1
dim as any ptr dword1_ptr=@dword1
dim as any ptr qword1_ptr=@qword1

dim as short w
dim as long dw
dim as integer qw

asm
	
	'following code needs to return 123
	mov rax,[word1_ptr] 'copy memory offset to rax
	mov rax,[rax] 'copy value at memory offset to rax
	mov [qw],rax 'copy rax to qword variable qw

end asm

print qw

sleep

Stonemonkey · Post by **Stonemonkey** » Apr 05, 2017 10:25

On x86 you need to use registers of the appropriate size:
For 8 bit use AL, AH, BL, BH etc.
For 16 bit use AX, BX etc.
For 32 bit use EAX, EBX etc.

Not sure how that translates to 64 bit processors though.
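As a rough picture of how those register names relate on x86-64, here is a toy Python model (the class and method names are invented for illustration). AL, AH, AX and EAX are views onto the low bits of the 64-bit RAX. In 64-bit mode a write to the 32-bit view clears the upper 32 bits of RAX, while a write to the 16-bit view leaves the upper bits untouched.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

class Rax:
    """Toy model of RAX and its narrower views (illustrative only)."""

    def __init__(self, value=0):
        self.rax = value & MASK64

    @property
    def eax(self):
        return self.rax & 0xFFFFFFFF      # low 32 bits

    @property
    def ax(self):
        return self.rax & 0xFFFF          # low 16 bits

    @property
    def al(self):
        return self.rax & 0xFF            # low 8 bits

    @property
    def ah(self):
        return (self.rax >> 8) & 0xFF     # bits 8..15

    def write_ax(self, value):
        """A 16-bit write keeps the upper 48 bits of RAX."""
        self.rax = (self.rax & ~0xFFFF) | (value & 0xFFFF)

    def write_eax(self, value):
        """A 32-bit write zeroes the upper 32 bits of RAX."""
        self.rax = value & 0xFFFFFFFF

reg = Rax(0x1122334455667788)
print(hex(reg.al), hex(reg.ah), hex(reg.ax), hex(reg.eax))
```

This is why stale upper bits can survive a 16-bit load into AX, and why explicit extension instructions such as movsx exist.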

adele · Post by **adele** » Apr 05, 2017 11:08

Hi Provoni,

Provoni wrote: The following program is supposed to return "123" but does not.

123 is decimal; let us try with hex encoding (quick and dirty).

You'll have to define some kind of Union to access the lower parts of qword values. Maybe later, but this should help at least a bit:

Code: Select all

'64-bit

'screenres 800,600

dim as short word1=&h123
dim as long dword1=&h123456
dim as integer qword1=&h1234567

dim as any ptr word1_ptr=@word1
dim as any ptr dword1_ptr=@dword1
dim as any ptr qword1_ptr=@qword1

dim as short w
dim as long dw
' dim as integer qw ' OS / CPU dependent; LongInt is safer/more distinct 
Dim As LongInt qw
asm
   ' this is _just_ a demonstration, _not_  good code! .adi
   'following code needs to return 123
   mov rax,[word1_ptr] 'copy memory offset to rax
   Xor rdx,rdx	' later, we'll write back _all_ 64 bits, so clear them first (poor coding by myself!)
   mov dx,word Ptr [rax] 'copy value at memory offset to 16-bit register
   mov qword Ptr [qw],rdx 'copy all 64 bits of rdx to qword variable qw

end asm

print Hex(qw,16)
print Hex(qw)

sleep

I won't go too deep, but it seems you are still playing around with the code, which IMO is the best way to learn ASM.

adi

Provoni · Post by **Provoni** » Apr 05, 2017 15:11

Thanks for the help Stonemonkey and adele,

The FreeBASIC generated code handles the conversion with the "movsx" instruction. The following now works:

Code: Select all

	mov rax,[word1_ptr] 'copy memory offset to rax
	mov rax,[rax] 'copy value at memory offset to rax
	movsx rax,ax '<---
	mov [qw],rax 'copy rax to qword variable qw

Is it possible to shorten the following, which gets the value from the pointer word1_ptr?

Code: Select all

	mov rax,[word1_ptr] 'copy memory offset to rax
	mov rax,[rax] 'copy value at memory offset to rax

The brackets around word1_ptr are not needed; the result is the same.

Stonemonkey · Post by **Stonemonkey** » Apr 05, 2017 16:24

I don't know how words (16 bit) are stored in 64 bit, but in 32-bit x86 reading a full register's worth of data from a 16-bit variable could result in a memory violation, and writing that way could also cause a violation or overwrite other data.

You'd do something like:

Mov eax,dword ptr[word1_ptr]
Movsx eax,word ptr[eax]

So in 64 bit it might be

Mov rax,qword ptr[word1_ptr]
Movsx rax,word ptr[rax]
Mov qword ptr[qw],rax

But I'm not sure tbh.
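The difference between a full 64-bit load and a sign-extending 16-bit load can be sketched in Python. The memory layout here is hypothetical: a 16-bit value of 123 followed by six unrelated bytes, standing in for whatever happens to sit next to word1.

```python
import struct

# Hypothetical layout: word1 (2 bytes) followed by unrelated neighbouring data
memory = struct.pack("<h", 123) + bytes([0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF])

def load_qword(mem, addr):
    """Model of `mov rax, qword ptr [addr]`: reads 8 bytes."""
    return int.from_bytes(mem[addr:addr + 8], "little")

def movsx_word(mem, addr):
    """Model of `movsx rax, word ptr [addr]`: reads 2 bytes and sign-extends."""
    return int.from_bytes(mem[addr:addr + 2], "little", signed=True)

print(hex(load_qword(memory, 0)))   # the neighbours end up in the upper bits
print(movsx_word(memory, 0))        # only the 16-bit value is read
```

The low 16 bits of the wide load still hold the right value, which is why extending from ax afterwards also works, but the wide load touches six bytes that do not belong to the variable.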

The pointer stores an address which points to a memory location where data is stored. A pointer is a variable and can be altered to point to different locations, so its value has to be loaded into a register first before the location it points to can be accessed. Variables in functions are stored on the stack and the assembler addresses them relative to the base pointer, so if you have a function with variable a as integer you could write

Mov eax,dword ptr[a]

To move the value in variable a into eax
But it might assemble to something like

Mov eax,dword ptr[ebp-12]

Again, I'm talking about 32 bit and not really sure about 64 bit.
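A small Python model may make the two ideas above concrete: locals addressed at a fixed offset from the base pointer, and a pointer needing two loads. Everything here is hypothetical, including the stack size, the value of ebp and the offsets; a real compiler picks its own layout.

```python
import struct

stack = bytearray(64)    # toy stack memory
ebp = 48                 # hypothetical base pointer (an index into `stack`)

# Hypothetical frame layout: variable name -> offset from ebp
offsets = {"a": -12, "a_ptr": -8}

def store_dword(addr, value):
    struct.pack_into("<i", stack, addr, value)

def load_dword(addr):
    return struct.unpack_from("<i", stack, addr)[0]

store_dword(ebp + offsets["a"], 123)                     # a = 123
store_dword(ebp + offsets["a_ptr"], ebp + offsets["a"])  # a_ptr = @a

# mov eax, dword ptr [a]  assembles to one load from [ebp-12]
direct = load_dword(ebp + offsets["a"])

# mov eax, dword ptr [a_ptr]  then  mov eax, dword ptr [eax]: two loads
address = load_dword(ebp + offsets["a_ptr"])
indirect = load_dword(address)

print(direct, address, indirect)
```

The first load of the pointer case only fetches an address; the value itself arrives with the second load, so the pair cannot be folded into a single memory operand.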

Provoni · Post by **Provoni** » Apr 06, 2017 9:11

Stonemonkey wrote: Mov eax,dword ptr[a]

To move the value in variable a into eax
But it might assemble to something like

Mov eax,dword ptr[ebp-12]

[ebp-12] is the memory offset on the stack where the variable a is stored, right?

Provoni · Post by **Provoni** » Apr 06, 2017 9:27

What baffles me is that reducing the number of instructions does not always make code faster. With my FreeBASIC programs I often noticed that using 16-bit arrays was faster than using 64-bit arrays. With the compiler option -rr I found out why.

If only 64-bit arrays are used a piece of code looks like this:

Code: Select all

add	rbp, QWORD PTR [rsi+rax*8]

When a 16-bit array is used it has to use the movsx instruction and the code becomes faster!

Code: Select all

movsx	rax, WORD PTR [rsi+rax*2]
add	rbp, rax

Why is this faster?

- Specific to the CPU? Mine is i7 930.
- Fewer bits have to be moved?
- Offset calculation with the add instruction is not as efficient as with movsx?

Stonemonkey · Post by **Stonemonkey** » Apr 06, 2017 16:22

Provoni wrote:
Stonemonkey wrote: Mov eax,dword ptr[a]

To move the value in variable a into eax
But it might assemble to something like

Mov eax,dword ptr[ebp-12]
[ebp-12] is the memory offset on the stack where the variable a is stored, right?

Yes, the variables in a function/sub are created on the stack and ebp points to them, so avoid modifying that register or be very careful when you do. The assembler knows the offset of each variable from ebp, so if ebp is altered it no longer knows where to find them.
Variables declared within different scopes in a function can share the same location on the stack too.

Stonemonkey · Post by **Stonemonkey** » Apr 06, 2017 16:30

Provoni wrote: What baffles me is that reducing the number of instructions does not always make code faster. With my FreeBASIC programs I often noticed that using 16-bit arrays was faster than using 64-bit arrays. With the compiler option -rr I found out why.

If only 64-bit arrays are used a piece of code looks like this:
Code: Select all
add	rbp, QWORD PTR [rsi+rax*8]
When a 16-bit array is used it has to use the movsx instruction and the code becomes faster!
Code: Select all
movsx	rax, WORD PTR [rsi+rax*2]
add	rbp, rax
Why is this faster?

- Specific to the CPU? Mine is i7 930.
- Fewer bits have to be moved?
- Offset calculation with the add instruction is not as efficient as with movsx?

It's possible that the qword isn't aligned and crosses an 8 byte boundary in memory, so the CPU has to do 2 loads from memory (assuming it's a 64 bit data bus) to load the value to add.
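Whether an access straddles a boundary is simple arithmetic on the address, sketched below in Python. The function name and the example addresses are made up; the 8-byte default mirrors the assumption in the post, and whether a straddling access actually costs extra on a given CPU is not something this sketch can tell you.

```python
def crosses_boundary(addr, size, boundary=8):
    """True if the bytes addr .. addr+size-1 span more than one boundary-sized block."""
    first_block = addr // boundary
    last_block = (addr + size - 1) // boundary
    return first_block != last_block

# An aligned qword stays inside one 8-byte block
print(crosses_boundary(0x1000, 8))
# A qword starting 4 bytes in spans two blocks
print(crosses_boundary(0x1004, 8))
# A 16-bit word only straddles when it starts on the last byte of a block
print(crosses_boundary(0x1006, 2), crosses_boundary(0x1007, 2))
```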

Provoni · Post by **Provoni** » Apr 06, 2017 17:32

Thanks for the feedback Stonemonkey.

How can you align data? Does FreeBASIC support this yet?

I've done a search and found: http://freebasic.net/forum/viewtopic.php?t=22975
And: https://sourceforge.net/p/fbc/bugs/659/

greenink · Post by **greenink** » Apr 06, 2017 23:34

Code: Select all

	.align 16

You can also put static data in, e.g.:

Code: Select all

	lea rdi,[rdi+64]
	jnz flipAlp
	ret
 flipshift:   .int 1,2,4,8,16,32,64,128
	        .int 256,512,1024,2048,4096,8192,16384,32768
 flipmask:	   .int 0x80000000,0x80000000,0x80000000,0x80000000
 rndphi:	   .quad 0x9E3779B97F4A7C15
 rndsqr3:	   .quad 0xBB67AE8584CAA73B

I forget if you can still use .text .data .bss sections with the current version of the compiler
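For reference, the effect of ".align 16" on x86 targets is to pad the current position up to the next multiple of 16. The arithmetic can be sketched in Python (function names are made up for illustration):

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)

def padding_for(offset, alignment):
    """Number of filler bytes the directive would have to insert."""
    return align_up(offset, alignment) - offset

# Data that would otherwise start at offset 37 gets pushed to offset 48
print(align_up(37, 16), padding_for(37, 16))
```

Note that this only describes where the assembler places what follows the directive in the assembly output; it does not change where FreeBASIC puts variables it allocates itself.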

Stonemonkey · Post by **Stonemonkey** » Apr 07, 2017 6:10

If you print the hex address of a variable you can see if it is aligned or not, for 64 bit it should end in either 0 or 8.
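The "ends in 0 or 8" rule is the same as asking whether the address is a multiple of 8. A quick Python check of that equivalence (the addresses are arbitrary examples, not real variable addresses):

```python
def is_aligned(addr, alignment=8):
    """True if addr is a multiple of alignment."""
    return addr % alignment == 0

def last_hex_digit(addr):
    """Last digit of the address as it would appear when printed in hex."""
    return format(addr, "X")[-1]

for addr in (0x7FFD1230, 0x7FFD1234, 0x7FFD1238):
    print(format(addr, "X"), last_hex_digit(addr), is_aligned(addr))
```

For 16-byte alignment, as needed by movaps or movapd, the last hex digit has to be 0.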

Provoni · Post by **Provoni** » Apr 08, 2017 8:28

Thanks greenink and Stonemonkey,

Some data is not aligned because instructions that work with aligned data crash the program. I tried using ".align 16" before the FreeBASIC initialization of the arrays but it didn't work.

Here's my assembler code. It is around 15 to 20% faster than the -O max generated FreeBASIC code. I went through various optimization guides and wonder if anyone could offer further advice.

Code: Select all

'64-bit

'- the following asm block is a small part of an inner loop

asm
		
	'- would it be worthwhile to store the following memory lookups
	'in an xmm# register or similar earlier on?
						
	movapd xmm0,[mul0] 'can use movapd here instead of movupd (aligned)
	xor rax,rax
  	mov r8,[map2_ptr]
  	lea r9,[sol] 'can use lea here
  	mov r11,[g5_ptr] 'can't use lea here, why?
  	mov r12,[ngrams_ptr] 'can't use lea here, why?
  	movsx rsi,word ptr[r8+2]
  	movsx rcx,word ptr[r8]
  	
	l1:
	
		'- various operations have been moved around to break dependencies
		'- sse operations are used to calculate 5-dim array lookup, is that
		'actually a worthwhile optimization to consider?
	
		movupd xmm1,[4+r9+rsi*4] 'must use movupd here (unaligned)
		movsx rbx,dword ptr[r9+rsi*4]
		pmulld xmm1,xmm0 'multiply 4 values by xmm0
		add r8,2
		movsx r13,word ptr[r12+rsi*2]
		phaddd xmm1,xmm1 'horizontal addition a+b,c+d
		sub rax,r13
		phaddd xmm1,xmm1 'horizontal addition a+b+c+d
		movsx rsi,word ptr[r8+2] 'get rsi for next loop iteration and break dependencies
		movd r10d,xmm1 'get a+b+c+d
		add rbx,r10
		movsx rdx,word ptr[r11+rbx*2]
		add rax,rdx
		dec rcx
		jnz l1
			
	add [new_ngram_score],rax

end asm
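The index calculation inside the loop (pmulld, two phaddd, then add rbx,r10) amounts to a dot product of four neighbouring sol values with the multipliers in mul0, plus the first sol value. Here is a Python sketch of just that step. The contents of mul0 are not shown in the post, so the strides below are an assumption borrowed from the mula values of the earlier example (powers of 26), and the sample data is invented.

```python
# Assumed multipliers (not taken from the post): 26^1 .. 26^4
STRIDES = [26, 676, 17576, 456976]

def flat_index(sol, pos, strides=STRIDES):
    """sol[pos] + sum of sol[pos+1+k] * strides[k], mirroring the asm steps."""
    products = [sol[pos + 1 + k] * strides[k] for k in range(4)]      # pmulld
    pairs = [products[0] + products[1], products[2] + products[3]]   # first phaddd
    total = pairs[0] + pairs[1]                                       # second phaddd
    return sol[pos] + total                                           # add rbx,r10

def flat_index_reference(sol, pos):
    """Same index written as a plain 5-dimensional base-26 lookup."""
    i0, i1, i2, i3, i4 = sol[pos:pos + 5]
    return (((i4 * 26 + i3) * 26 + i2) * 26 + i1) * 26 + i0

sol = [3, 1, 4, 1, 5, 9, 2, 6]   # made-up sample data
print(flat_index(sol, 0), flat_index_reference(sol, 0))
```

Whether the SSE route beats four scalar multiply-adds depends on the CPU and the surrounding code, so it is worth timing both variants rather than assuming.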
