How to use SSE2 in FreeBASIC?

Windows specific questions.
srvaldez
Posts: 3385
Joined: Sep 25, 2005 21:54

Re: How to use SSE2 in freebasic ?

Post by srvaldez »

It's an i9-9900K CPU @ 3.60 GHz (3600 MHz).
Total Cores 8.
Total Threads 16.
Max Turbo Frequency 5.00 GHz.
Intel® Turbo Boost Technology 2.0 Frequency‡ 5.00 GHz.
Processor Base Frequency 3.60 GHz.
Cache 16 MB Intel® Smart Cache.
Bus Speed 8 GT/s.
TDP 95 W.
I maxed out the RAM because I frequently run VMs.
D.J.Peters
Posts: 8586
Joined: May 28, 2005 3:28
Contact:

Post by D.J.Peters »

Cool, 8 cores with 16 threads! Two of these PCs (32 threads) would be ideal for the Mitsuba 3D renderer I use and love so much :-)

To get 24 hardware threads for the renderer, I run Mitsuba render nodes on my local network across six quad-core PCs :lol:

Joshy
quickbbbb
Posts: 11
Joined: Dec 07, 2021 14:04

Post by quickbbbb »

CPU: i7-3770 8MB

[08:43:13.11]time v1: 2.66459666513823
[08:43:13.11]time v2: 2.67936478230213
[08:43:13.11]time v3: 2.68704450507995
[08:43:13.11]time v4: 2.680799485646958

I cannot compile v5, so there is no data for v5.
Last edited by quickbbbb on Dec 09, 2021 1:01, edited 1 time in total.
quickbbbb
Post by quickbbbb »

srvaldez wrote: It's an i9-9900K CPU @ 3.60 GHz (3600 MHz).
Total Cores 8.
Total Threads 16.
Max Turbo Frequency 5.00 GHz.
Intel® Turbo Boost Technology 2.0 Frequency‡ 5.00 GHz.
Processor Base Frequency 3.60 GHz.
Cache 16 MB Intel® Smart Cache.
Bus Speed 8 GT/s.
TDP 95 W.
I maxed out the RAM because I frequently run VMs.

When you run D.J.Peters' example, does your CPU run at 5.00 GHz?


2.68 s (my test) / 0.812 s (your test) = 3.3 times faster

My CPU runs at 3.7 GHz.

If your CPU runs at 5.00 GHz: 5.0 GHz / 3.7 GHz = 1.35135 times my clock speed.

3.3 / 1.35135 = 2.44 times (normalizing the i7-3770 and the i9-9900K to the same frequency)

So at the same frequency, the i9-9900K is about 2.44 times faster than the i7-3770 in this test.
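
The arithmetic above can be reproduced with a few lines of FreeBASIC. This is only a sketch: the variable names are made up for illustration, and it assumes both CPUs held the stated clock for the whole run and that both binaries were built with the same options, which the thread has not confirmed.

```freebasic
' Per-clock speedup estimate from the timings quoted above.
' Assumption: each CPU ran at the stated clock for the entire test.
dim as double t_i7 = 2.68    ' seconds, i7-3770 run
dim as double t_i9 = 0.812   ' seconds, i9-9900K run
dim as double f_i7 = 3.7     ' GHz, assumed sustained clock
dim as double f_i9 = 5.0     ' GHz, assumed sustained clock

dim as double raw_speedup = t_i7 / t_i9                ' wall-clock ratio
dim as double clock_ratio = f_i9 / f_i7                ' frequency ratio
dim as double per_clock   = raw_speedup / clock_ratio  ' speedup at equal clocks

print "raw speedup:       "; raw_speedup
print "clock ratio:       "; clock_ratio
print "per-clock speedup: "; per_clock
```

If the two binaries were compiled with different optimization switches, the per-clock figure measures the compiler settings as much as the CPUs.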
srvaldez
Post by srvaldez »

quickbbbb wrote: When you run D.J.Peters' example, does your CPU run at 5.00 GHz?
Yes, the CPU clocks up and down all the time, usually between 800 and 5000 MHz.
But I am suspicious about some of the reported test speeds. What command-line options are you using?
They can have a huge impact on performance. My compile command is:
fbc64 -t 4096 -w all -arch native -gen gcc -Wc -O2,-fno-builtin -v "%f"
where "%f" is the filename.
D.J.Peters
Post by D.J.Peters »

quickbbbb wrote: I cannot compile v5, so there is no data for v5.
Do you need SSE on 32-bit Windows?

Joshy
D.J.Peters
Post by D.J.Peters »

@quickbbbb try again, I added 32-bit SSE as well.

Joshy
quickbbbb
Post by quickbbbb »

D.J.Peters wrote: @quickbbbb try again, I added 32-bit SSE as well.

Joshy

Sorry!
I was using the VisualFreeBasic IDE, and it could not compile v5 (an error occurred).

So I downloaded FBIde and used it to compile.

Now it compiles successfully (FBIde)!

The test results follow:

i7-3770 + win7_64bit
================================= first run
time v1: 1.920214737634524
time v2: 2.544125353175332
time v3: 2.45014096495288
time v4: 2.449346490073367
time v5: 0.8082955011923332
================================= second run
time v1: 1.967784827691503
time v2: 2.472062978427857
time v3: 2.495674760328257
time v4: 2.464804641276714
time v5: 0.8106885851593688
Last edited by quickbbbb on Dec 09, 2021 4:03, edited 3 times in total.
quickbbbb
Post by quickbbbb »

srvaldez wrote: they can have a huge impact on performance, my compile command is fbc64 -t 4096 -w all -arch native -gen gcc -Wc -O2,-fno-builtin -v "%f"
where "%f" is the filename

I still do not know how to use that command in FBIde; I will study it.


D.J.Peters wrote: @quickbbbb try again, I added 32-bit SSE as well.

Joshy
Thank you very much.

I will study how to compile from the command line.

=============



movsd xmm1, QWORD PTR [rbx+rax*8]
mulsd xmm1, QWORD PTR [rdx+rax*8]
addsd xmm0, xmm1

WOW!
So now I can write functions for (1) ar1 * ar2, (2) ar1 + ar2, and (3) ar1 - ar2:

mulsd xmm1, QWORD PTR [rdx+rax*8] -> math *
addsd xmm1, QWORD PTR [rdx+rax*8] -> math +
subsd xmm1, QWORD PTR [rdx+rax*8] -> math -
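
As a rough sketch of those three functions, the same element-wise operations can be written in plain FreeBASIC and left to the compiler to turn into SSE instructions. The names arr_mul, arr_add and arr_sub are hypothetical and do not appear anywhere in the thread; whether the generated code actually uses mulsd/addsd/subsd depends on the compiler switches used.

```freebasic
' Element-wise helpers: dst[i] = l[i] op r[i] for i = 0 .. n-1.
' arr_mul / arr_add / arr_sub are hypothetical names, not from the thread.
sub arr_mul(dst as double ptr, l as double ptr, r as double ptr, n as uinteger)
  if n = 0 then exit sub          ' n - 1 would wrap around for an unsigned 0
  for i as uinteger = 0 to n - 1
    dst[i] = l[i] * r[i]          ' the multiply that mulsd performs
  next
end sub

sub arr_add(dst as double ptr, l as double ptr, r as double ptr, n as uinteger)
  if n = 0 then exit sub
  for i as uinteger = 0 to n - 1
    dst[i] = l[i] + r[i]          ' the add that addsd performs
  next
end sub

sub arr_sub(dst as double ptr, l as double ptr, r as double ptr, n as uinteger)
  if n = 0 then exit sub
  for i as uinteger = 0 to n - 1
    dst[i] = l[i] - r[i]          ' the subtract that subsd performs
  next
end sub

' Small usage example with four elements per array.
dim as double a(0 to 3) = {1, 2, 3, 4}
dim as double b(0 to 3) = {10, 20, 30, 40}
dim as double res_mul(0 to 3), res_add(0 to 3), res_sub(0 to 3)

arr_mul(@res_mul(0), @a(0), @b(0), 4)   ' 10, 40, 90, 160
arr_add(@res_add(0), @a(0), @b(0), 4)   ' 11, 22, 33, 44
arr_sub(@res_sub(0), @a(0), @b(0), 4)   ' -9, -18, -27, -36

for i as integer = 0 to 3
  print res_mul(i), res_add(i), res_sub(i)
next
```

Unlike the dot-product loop, each result here has to be stored back to memory, so a hand-written assembler version would need a movsd store after the arithmetic instruction.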
D.J.Peters
Post by D.J.Peters »

@quickbbbb there is no need to use another compiler for 32-bit. If you use the right compiler switches, the optimized SSE code is really fast.
Here you can see that the BASIC function v4() is faster than the hand-written naked SSE assembler code in v5() :-)

on 32-bit use this:
fbc -gen gcc -arch pentium4-sse3 -Wc -O3 -fpu sse -O 3 -fpmode fast -asm intel ssetest.bas

on 64-bit I use:
fbc -arch x86-64 -Wc -O3 -fpmode fast -fpu sse -O 3 -asm intel ssetest.bas
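
When an IDE builds the program, it is not always obvious which of these switches actually reached fbc. A small sketch like the following, compiled with the same settings, prints what the compiler reports about itself. It relies on the intrinsic defines `__FB_SSE__`, `__FB_FPU__`, `__FB_FPMODE__`, `__FB_BACKEND__` and `__FB_64BIT__`; their exact values are assumed from the FreeBASIC documentation and may differ between compiler versions.

```freebasic
' Report which code-generation settings this binary was compiled with.
' Assumption: the intrinsic defines below behave as documented for fbc 1.x.
#ifdef __FB_SSE__
  print "SSE floating point: enabled"
#else
  print "SSE floating point: not reported (probably x87)"
#endif

print "backend: " & __FB_BACKEND__   ' e.g. gas or gcc
print "fpu:     " & __FB_FPU__       ' e.g. x87 or sse
print "fpmode:  " & __FB_FPMODE__    ' e.g. precise or fast

#ifdef __FB_64BIT__
  print "target:  64-bit"
#else
  print "target:  32-bit"
#endif
```

If the output still shows x87 or the gas backend after changing the IDE settings, the options are probably not being passed on the command line.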

Joshy

file: "ssetest.bas"

Code:

function v1(l as double ptr,r as double ptr,s as uinteger,e as uinteger) as double
  dim as double result
  for i as uinteger = s to e
    result += l[i] * r[i]
  next
  return result
end function  


function v2(l as double ptr,r as double ptr,n as uinteger) as double
  dim as double result
  for i as uinteger = 0 to n
    result += l[i] * r[i]
  next
  return result
end function  

function v3(l as double ptr,r as double ptr,n as uinteger) as double
  dim as double result
  for i as uinteger = 0 to n
    result += *l * *r : l+=1 : r+=1
  next
  return result
end function  

function v4(l as double ptr,r as double ptr,n as uinteger) as double
  dim as double result
  dim as double ptr e=l+n+1
  while l<e : result += *l * *r : l+=1 : r+=1 : wend
  return result
end function


#ifndef __FB_64BIT__
' 32-bit: all four parameters are passed on the stack
' (after push ebp: a=[ebp+8], b=[ebp+12], n=[ebp+16], r=[ebp+20])
sub v5 naked (byval a as double ptr, _
              byval b as double ptr, _
              byval n as uinteger, _
              byval r as double ptr)
#define BASIS 8             
  asm
  push ebp  
  mov ebp,esp  
  push ebx
  push edx
  mov ebx,[ebp+BASIS]
  lea ebx,[ebx]
  mov edx,[ebp+BASIS+4]
  lea edx,[edx]
  mov ecx,[ebp+BASIS+8]
  inc ecx
  xorpd xmm0, xmm0
  xor eax,eax
loop_x86_v5:
  movsd xmm1, QWORD PTR [ebx+eax*8]
  mulsd xmm1, QWORD PTR [edx+eax*8]
  addsd xmm0, xmm1
  inc eax
  dec ecx
  jnz loop_x86_v5
  mov edx,[ebp+BASIS+12]
  lea edx,[edx]
  movsd QWORD PTR [edx],xmm0
  pop edx
  pop ebx
  pop ebp
  ret 16
  end asm
end sub

#else

' 64-bit params: rcx=@a, rdx=@b, r8=n
function v5 (byval a as double ptr, _
             byval b as double ptr, _
             byval n as uinteger) as double
  asm
  push rbx
  push rdx
  lea rbx,[rcx]  
  lea rdx,[rdx]
  mov rcx,r8
  inc rcx
  xorpd xmm0, xmm0
  xor rax,rax

loop_x86_64_v5:
  movsd xmm1, QWORD PTR [rbx+rax*8]
  mulsd xmm1, QWORD PTR [rdx+rax*8]
  addsd xmm0, xmm1
  inc rax
  dec rcx
  jnz loop_x86_64_v5
  pop rdx
  pop rbx
  movsd [function],xmm0
  end asm
end function

#endif  


const as uinteger N = 100000
const as uinteger S =   1000 ' first item
const as uinteger E =  99000 ' last item


dim shared as double a(N-1),b(N-1)
for i as uinteger = 0 to N-1
  a(i)=i:b(i)=i
next

const as uinteger NLOOPS = 5000
print "please wait while run 5 tests ..."

dim as double result

var t1 = timer()
for i as uinteger = 1 to NLOOPS
  result = v1(@a(0),@b(0),S,E)
next  
t1=timer()-t1
print "result v1: " & result

var t2 = timer()
for i as uinteger = 1 to NLOOPS
  result = v2(@a(s),@b(s),E-S)  
next    
t2=timer()-t2
print "result v2: " & result

var t3 = timer()
for i as uinteger = 1 to NLOOPS
  result = v3(@a(s),@b(s),E-S)
next    
t3=timer()-t3
print "result v3: " & result

var t4 = timer()
for i as uinteger = 1 to NLOOPS
  result = v4(@a(s),@b(s),E-S)
next    
t4=timer()-t4
print "result v4: " & result
result=0
var t5=timer()
for i as uinteger = 1 to NLOOPS
  #ifndef __FB_64BIT__
  ' on 32-bit, v5 is implemented as a sub
  v5(@a(s),@b(s),E-S,@result)
  #else
  result = v5(@a(s),@b(s),E-S)
  #endif
next
t5=timer()-t5
print "result v5: " & result 
print 
print "time v1: " & t1 
print "time v2: " & t2
print "time v3: " & t3
print "time v4: " & t4
print "time v5: " & t5
sleep
quickbbbb
Post by quickbbbb »

D.J.Peters wrote: @quickbbbb there is no need to use another compiler for 32-bit. If you use the right compiler switches, the optimized SSE code is really fast.
Here you can see that the BASIC function v4() is faster than the hand-written naked SSE assembler code in v5() :-)

on 32-bit use this:
fbc -gen gcc -arch pentium4-sse3 -Wc -O3 -fpu sse -O 3 -fpmode fast -asm intel ssetest.bas

on 64-bit I use:
fbc -arch x86-64 -Wc -O3 -fpmode fast -fpu sse -O 3 -asm intel ssetest.bas

Joshy

OK! Thank you very much!
quickbbbb
Post by quickbbbb »

D.J.Peters wrote: @quickbbbb there is no need to use another compiler for 32-bit. If you use the right compiler switches, the optimized SSE code is really fast.
Here you can see that the BASIC function v4() is faster than the hand-written naked SSE assembler code in v5() :-)

on 32-bit use this:
fbc -gen gcc -arch pentium4-sse3 -Wc -O3 -fpu sse -O 3 -fpmode fast -asm intel ssetest.bas

on 64-bit I use:
fbc -arch x86-64 -Wc -O3 -fpmode fast -fpu sse -O 3 -asm intel ssetest.bas

Joshy



WOW, my God!

Old test results:
====================================
time v1: 1.920214737634524
time v2: 2.544125353175332
time v3: 2.45014096495288
time v4: 2.449346490073367

New test results, compiled with: -gen gcc -arch pentium4-sse3 -Wc -O3 -fpu sse -O 3 -fpmode fast
====================================
time v1: 0.8113254932250129
time v2: 0.8090549612388713
time v3: 0.8096571563073667
time v4: 0.8080199101677863



Thanks again, everyone.
D.J.Peters
Post by D.J.Peters »

Why didn't you post the result of v5()?

Joshy
quickbbbb
Post by quickbbbb »

D.J.Peters wrote: Why didn't you post the result of v5()?


Joshy
(1)
When I include function v5 and use the options -s console -gen gcc -arch pentium4-sse3 -Wc -O3 -fpu sse -O 3 -fpmode fast, I get a compile error.

Compiling as 32-bit, WinPE options = -s console -gen gcc -arch pentium4-sse3 -Wc -O3 -fpu sse -O 3 -fpmode fast -----> compile error
Compiling as 64-bit, WinPE options = -s console -gen gcc -arch pentium4-sse3 -Wc -O3 -fpu sse -O 3 -fpmode fast -----> compile error

WinPE shows the following command:
G:\FreeBasic\Compile\fbc32.exe -m "D:\QQQ.bas" -v -s console -gen gcc -arch pentium4-sse3 -Wc -O3 -fpu sse -O 3 -fpmode fast -x "D:\QQQ.exe"


(2)
When I include function v5 and use only the option -s console, it compiles successfully.

WinPE shows the following command:
G:\FreeBasic\Compile\fbc64.exe -m "D:\QQQ.bas" -v -s console -x "D:\QQQ.exe"
D.J.Peters
Post by D.J.Peters »

quickbbbb wrote: WOW, my God!
I'm very impressed myself :-)

I compiled an old program I wrote in 2008 in which over 4 GB of 3D vectors are calculated and a Jacobi matrix is solved (for radiosity).
With the command-line switches for SSE, both the 32-bit and 64-bit binaries are 2 times faster than without :-)

Joshy