64-bit asm woes

General FreeBASIC programming questions.
caseih
Posts: 1340
Joined: Feb 26, 2007 5:32

Re: 64-bit asm woes

Postby caseih » Oct 02, 2018 21:55

I'm sure GCC 8 does produce faster code than GCC 5. However, I'd expect it to be only a slight improvement. Optimizations are continually getting better; better than my assembly code, SSE2 notwithstanding!

On modern processors, dipping out of a compiled language like FB into assembly has some hazards that can negatively impact performance. The compiler generally knows more about the CPU's state than you do; I expect ad-hoc assembly code can sometimes touch the CPU caches in a way that invalidates the compiler's model of the cache state, which could lead to some slowdowns. Could be very wrong, though. EDIT: this says it better than I did, and I'm not far off. Inline assembly automatically disables certain optimizations, as the compiler cannot reason about it (if compilers are capable of reasoning...).
deltarho[1859]
Posts: 1765
Joined: Jan 02, 2017 0:34
Location: UK

Re: 64-bit asm woes

Postby deltarho[1859] » Oct 02, 2018 22:21

caseih wrote:I'm sure GCC 8 does produce faster code than GCC 5. However, I'd expect it to only be a slight improvement.

Here is one example where the improvement is more than slight.

I found 'DontUseInlineAsm' a good read. I would only add to that with: The benefit of inline asm is inversely proportional to the improvement of a compiler's optimization.
jj2007
Posts: 1161
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: 64-bit asm woes

Postby jj2007 » Oct 02, 2018 22:48

Compilers can outperform hand-written assembler code - in rare cases. And in these rare cases, you can learn from the compiler and rewrite the assembler code.

In practice, we can beat the C runtime library typically by a factor of 2-3. There are a few cases where the CRT is equally fast, though, mostly because the C code has by accident hit the fastest option (under the hood, the choice of instructions is limited).

If there is enough interest, we could set up a test case. For example, here is a free 20MB database, a csv file. We could measure the time needed to load the database and calculate the World average for the indicator "Population undernourished, percentage" in the year 2000. 234 of 44k lines contain this indicator, the values are in the "2000" column. M$ Excel loads the file in about 3 seconds. FB can do it much faster, in about 0.2 seconds... and then the fun begins. My best guess is that calculating the average can be done in less than 0.1 seconds.
deltarho[1859] wrote:I found 'DontUseInlineAsm' a good read.
Partly true. One should distinguish, though, short sequences of inline asm vs subroutines written entirely in asm. The latter are quite a different story.
dodicat
Posts: 5771
Joined: Jan 10, 2006 20:30
Location: Scotland

Re: 64-bit asm woes

Postby dodicat » Oct 03, 2018 0:46

What is your average?

Code: Select all



#include "file.bi"     'to use Fileexists()

Function loadfile(file As String) As String
   If Fileexists(file)=0 Then Print file;" not found":Sleep:End
    Dim As Long  f=Freefile
    Open file For Binary Access Read As #f
    Dim As String text
    If Lof(f) > 0 Then
        text = String(Lof(f), 0)
        Get #f, , text
    End If
    Close #f
    Return text
End Function

Sub savefile(filename As String,p As String) 'unused
    Dim As long n
    n=Freefile
    If Open (filename For Binary Access Write As #n)=0 Then
        Put #n,,p
        Close
    Else
        Print "Unable to save " + filename
    End If
End Sub

Function _TALLY(SomeString As String,PartString As String) As Long
    Dim As Long LenP=Len(PartString),count
    Dim As Long position=Instr(SomeString,PartString)
    If position=0 Then Return 0
    While position>0
        count+=1
        position=Instr(position+LenP,SomeString,PartString)
    Wend
    Return count
End Function

Function StringSplit(s_in As String,chars As String,result() As String) As Long
    Dim As Long ctr,ctr2,k,n,LC=len(chars)
    dim As boolean tally(Len(s_in))
    #macro check_instring()
        n=0
        while n<Lc
        If chars[n]=s_in[k] Then
        tally(k)=true
        If (ctr2-1) Then ctr+=1
        ctr2=0
        exit while
        end if
        n+=1
       wend
    #endmacro
   
    #macro split()
    If tally(k) Then
        If (ctr2-1) Then ctr+=1:result(ctr)=Mid(s_in,k+2-ctr2,ctr2-1)
        ctr2=0
    End If
    #endmacro
    '==================  LOOP TWICE =======================
    For k  =0 To Len(s_in)-1
        ctr2+=1:check_instring()
    Next k
    If ctr Then Redim result(1 To ctr): ctr=0:ctr2=0 Else  Return 0
    For k  =0 To Len(s_in)-1
        ctr2+=1:split()
    Next k
    '===================== Last one ========================
    If ctr2>0 Then
        Redim Preserve result(1 To ctr+1)
        result(ctr+1)=Mid(s_in,k+1-ctr2,ctr2)
    End If
    Return Ubound(result)
End Function


Dim As Double t=Timer
dim as string text=LoadFile("MDG_Export_20181003_012728491.csv")
redim as string s(),s2()
StringSplit(text,chr(10),s())

var t1=_tally(s(1),",") 'total commas in  line one (containing the header titles)

dim as string yr="2000" '<------ YEAR WANTED

var g= mid(s(1),instr(s(1),yr)) 'string after yr
var t2=_tally( g,",")    'commas in line after year

var column= t1-t2 +2     'difference with 2 commas  offset

dim as long i
dim as double d
for n as long=lbound(s) to ubound(s)
    if instr(s(n),"Population undernourished, percentage") then
        stringsplit(s(n),",",s2())
        i+=1
        d+=val(trim(s2(column),chr(34)))
    end if
next
print "average ";d/i,i;" readings"
Print "Time taken ";Timer-t
print "done"

Sleep

 

I hope i got the correct column.

Code: Select all

 average  11.5059829059829    234 readings
Time taken  0.1977459168992937
done
deltarho[1859]
Posts: 1765
Joined: Jan 02, 2017 0:34
Location: UK

Re: 64-bit asm woes

Postby deltarho[1859] » Oct 03, 2018 3:11

average 11.5059829059829 234 readings
Time taken 0.1281105909909925
done

What is the point of this exercise?
caseih
Posts: 1340
Joined: Feb 26, 2007 5:32

Re: 64-bit asm woes

Postby caseih » Oct 03, 2018 3:30

jj2007 wrote:Compilers can outperform hand-written assembler code - in rare cases. And in these rare cases, you can learn from the compiler and re-write the assembler code.
Sure, if you're an expert programmer who has a thorough understanding of the architecture and the chip family you are targeting. For everyone else... not so much. I keep harping on this but I'll say it again: it's far, far better to optimize your algorithm before you resort to assembly. Sure, you can speed up FB code 2-3 times by going to straight assembly. But if you could speed up the FB code by 10 times through proper profiling and algorithmic optimization, I think I know which is more effective. Brute force in assembly is still brute force: it doesn't change the dominant big-O() run time factor. In other words, O(n!) is still O(n!) even if each iteration is 2-3 times faster[1]. After you've squeezed every drop of performance out of the FB code, and you know exactly where the CPU time is being spent, then you can try assembly and get your 2-3 times speedup.
jj2007 wrote:In practice, we can beat the C runtime library typically by a factor 2-3. There are a few cases where the CRT is equally fast, though, mostly because the C code has by accident hit the fastest option (under the hood, the choice of instructions is limited).
Again, that's true, but only if you target specific use cases that meet your current needs. The CRT aims to be generic and covers many, many edge cases that your ad-hoc fast assembly replacements simply cannot handle. Assembly wins here, but only up until the point where things change and violate your assumptions down the road.
jj2007 wrote:If there is enough interest, we could set up a test case. For example, here is a free 20MB database, a csv file. We could measure the time needed to load the database and calculate the World average for the indicator "Population undernourished, percentage" in the year 2000. 234 of 44k lines contain this indicator, the values are in the "2000" column. M$ Excel loads the file in about 3 seconds. FB can do it much faster, in about 0.2 seconds... and then the fun begins. My best guess is that calculating the average can be done in less than 0.1 seconds.
Would such a test be even remotely relevant, though? Would it have adequate error checking to deal with unexpected or badly formed CSV files? Unicode? If we're just coding to this one particular file, then I don't see the test as very useful. If I were serious about doing data analysis of this particular file, what's 3 seconds if I can use the rich functionality of Excel to analyze it, or some other high-level tool like a proper database engine and query language?
deltarho[1859] wrote:I found 'DontUseInlineAsm' a good read.
jj2007 wrote:Partly true. One should distinguish, though, short sequences of inline asm vs subroutines written entirely in asm. The latter are quite a different story.
This is an FB forum. Writing something entirely in assembly is fun and all, but not terribly on topic. And here in FB land there are pitfalls, even performance pitfalls, to using inline assembly, which is quite evident from what we can see here. My takeaway is that except for educational and entertainment purposes, inline assembly should be avoided entirely. I question whether it should be in FB at all. If you know enough to use inline assembly, you probably know enough to just use C and inline assembly there in the first place, although the compiler writers (who are by definition experts in assembly language) advise strongly against even that in most cases.

I will forever remember PowerBASIC as a cautionary tale about the perils of using pure assembly for any project. It was once fast and unbeatable, but couldn't transition when the architecture and OS changed. I have to wonder if assembly played a role in Bob Zale's departure from Borland and the subsequent spin-off of Turbo Basic. Could it be there was a change at Borland, moving from the fast assembly-based compilers to ones written in C that were a little more future-proof as the world moved from 16 bits to 32 bits? Would be interesting to find out some day.

[1] O(n!) is about what you get when you try to brute-force the traveling salesman problem, to pick an extreme example, which becomes incredibly time consuming once you exceed about 20 or 30 cities, even if hand-crafted in assembly using SSE2 instructions!
jj2007
Posts: 1161
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: 64-bit asm woes

Postby jj2007 » Oct 03, 2018 8:16

deltarho[1859] wrote:What is the point of this exercise?
caseih wrote:Would such a test be even remotely relevant though? Would it have the adequate error checking to deal with the unexpected or badly formed CSV files? Unicode?

Sure, it's one particular test (and it does deal with badly formed CSV files and Unicode if needed). But this test is close to what, for example, a spreadsheet application has to perform: Load a file, translate it to a string array, check values, etc.; i.e. it tests a variety of standard functions that matter when judging the overall performance of a compiler.
caseih wrote:I keep harping on this but I'll say it again. It's far far better to optimize your algorithm before you resort to assembly
Absolutely! Take dodicat's example:

Code: Select all

for n as long=lbound(s) to ubound(s)
    if instr(s(n),"Population undernourished, percentage") then
        stringsplit(s(n),",",s2())
        i+=1
        d+=val(trim(s2(column),chr(34)))
    end if
next

Clever! Many coders would have checked the "right" cell in the 44,000 rows, a very slow exercise.

Thanks for the examples, deltarho and dodicat, very interesting stuff. With FreeBasic, I still feel like a noob, and I am learning from your code.

Here is my version - apologies that it doesn't look like assembler, but under the hood it is pure assembler:

Code: Select all

include \masm32\MasmBasic\MasmBasic.inc
  SetGlobals double tempVal, sum, DWORD column, rowMatch, valMatch   ; Dim As xyz Shared
  Init
  NanoTimer()      ; start timing
  Recall "MdgData.csv", L$(), csv   ; UN database available at http://mdgs.un.org/unsd/mdg/Handlers/ExportHandler.ashx?Type=Csv
  .Repeat
   .Break .if !StringsDiffer(L$(0, column), "2000")   ; find the year 2000 column
   inc column
  .Until column>99
  For_ ct=0 To L$(?)-1
   .if Instr_(L$(ct), "Population undernourished, percentage")   ; check if row contains the string
      MovVal tempVal, L$(ct, column)   ; load the value to a temporary variable
      .if signed edx>0   ; check if the cell was valid
         inc valMatch   ; count valid matches
         ; Print Str$("line %___i", ct), Str$(" %3f\n", tempVal)   ; for testing, prints the valid matches
         fld tempVal   ; tmp on FPU
         fadd sum   ; add current sum
         fstp sum   ; save as new sum
      .endif
      inc rowMatch
   .endif
  Next
  Inkey Cpu$(), CrLf$, NanoTimer$(), Str$(" for loading the file and processing %i matches", rowMatch), Str$(" with an average of %5f", sum/valMatch)
EndOfCode

Result:
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
30 ms for loading the file and processing 234 matches with an average of 15.996

Excel spits out 15.996 as well. If you don't arrive at that average, it's probably because you didn't verify whether the number was valid (d+=val(trim(s2(column),chr(34)))); see the "lowbyte of edx" part at Val() and MovVal; some months ago, munair posted code for checking whether the number format was valid.

Now this is certainly not the ultimate answer to the question "do compilers produce faster code than hand-written assembly", but it might give you an idea how to react when somebody posts that claim (which I've seen very, very often).
caseih wrote:This is a FB forum. Writing something entirely in assembly is fun and all, but not terribly on topic.
I hope we all agree that FreeBASIC should be as fast as possible. The topic here is, more or less, "can it be made faster with assembly", and I am trying to demonstrate that it could indeed be made faster if, under the hood, assembly were used instead of relying on the C compiler. That short inserts of inline assembly are not very useful is quite a different story.
deltarho[1859]
Posts: 1765
Joined: Jan 02, 2017 0:34
Location: UK

Re: 64-bit asm woes

Postby deltarho[1859] » Oct 03, 2018 12:12

The PowerBASIC compiler is not an optimizing compiler. More often than not there is no need to resort to inline assembler, but some applications do benefit from inline assembly, including short inserts and not just whole procedures. There is no chance of optimization passes interfering with our 'mix' of BASIC and asm instructions, because there aren't any.

I think the moral of our story here, taking into account comments from both caseih and jj2007, is that with FreeBASIC the inline assembler should really only be considered for whole procedures. When we do that, it is worth looking at 'Naked', which may come into its own in this respect.
dodicat
Posts: 5771
Joined: Jan 10, 2006 20:30
Location: Scotland

Re: 64-bit asm woes

Postby dodicat » Oct 03, 2018 12:33

I opened that comma-separated file with LibreOffice (eventually),
but I cannot figure out an average for the 2000 column. (Excuse: never used a spreadsheet before.)

I note that many readings for "Population undernourished, percentage" are duffers: they could be E or "" or " ".
So I tried to weed them out.
My average now is 15.93136094674556
If you uncomment line 98 it will show the valid results left.

Code: Select all

 

#include "file.bi"     'to use Fileexists()

Function loadfile(file As String) As String
   If Fileexists(file)=0 Then Print file;" not found":Sleep:End
    Dim As Long  f=Freefile
    Open file For Binary Access Read As #f
    Dim As String text
    If Lof(f) > 0 Then
        text = String(Lof(f), 0)
        Get #f, , text
    End If
    Close #f
    Return text
End Function

Sub Remove(Text As String,Char As String)
    Var index = 0,asci=Asc(char)
    For i As Integer = 0 To Len(Text) - 1
        If Text[i] <> ASCi Then Text[index] = Text[i] : index =index+ 1
    Next : Text = Left(Text,index)
End Sub

Function isnumber(Byval s As String) As boolean
    static As String * 13 number="-+.0123456789"
    Dim As Long ctr,dot,lnum=13
    For m As Long=0 To Len(s)-1
        If m>0 And (s[m]=45 Or s[m]=43) Then Return 0
        If s[m]=46 Then dot+=1
        For n As Long=0 To lnum-1
            If s[m]=number[n] Then ctr+=1:Exit For
        Next n
    Next
    Return ctr<>0 Andalso ctr=Len(s) And dot<2 
End Function

Function StringSplit(s_in As String,chars As String,result() As String) As Long
    Dim As Long ctr,ctr2,k,n,LC=Len(chars)
    Dim As boolean tally(Len(s_in))
    #macro check_instring()
    n=0
    While n<Lc
        If chars[n]=s_in[k] Then
            tally(k)=true
            If (ctr2-1) Then ctr+=1
            ctr2=0
            Exit While
        End If
        n+=1
    Wend
    #endmacro
   
    #macro split()
    If tally(k) Then
        If (ctr2-1) Then ctr+=1:result(ctr)=Mid(s_in,k+2-ctr2,ctr2-1)
        ctr2=0
    End If
    #endmacro
    '==================  LOOP TWICE =======================
    For k  =0 To Len(s_in)-1
        ctr2+=1:check_instring()
    Next k
    If ctr Then Redim result(1 To ctr): ctr=0:ctr2=0 Else  Return 0
    For k  =0 To Len(s_in)-1
        ctr2+=1:split()
    Next k
    '===================== Last one ========================
    If ctr2>0 Then
        Redim Preserve result(1 To ctr+1)
        result(ctr+1)=Mid(s_in,k+1-ctr2,ctr2)
    End If
    Return Ubound(result)
End Function


Dim As Double t=Timer
Dim As String text=LoadFile("MDG_Export_20181003_012728491.csv")

Redim As String s(),s2()
StringSplit(text,Chr(10),s()) 'split on new line
StringSplit(s(1),",",s2())    'get the header strings

Dim As String yr="2000"       '<------ YEAR WANTED
Dim As Long column
For n As Long=Lbound(s2) To Ubound(s2)
    If s2(n)=yr Then column=n+1:Exit For 'get column from header titles
Next

Dim As Long i,i2
Dim As Double d
For n As Long=Lbound(s) To Ubound(s)
    If Instr(s(n),"Population undernourished, percentage") Then
        i2+=1      'total readings
        stringsplit(s(n),",",s2())
        remove(s2(column),Chr(34))
       
       ' print (s2(column)),isnumber(s2(column)),len(s2(column))
       
        If isnumber(s2(column)) Then  d+=Val(s2(column)):i+=1 'valid readings
    End If
Next
Print "total readings ";i2
Print "average ";d/i,i;"  valid readings"
Print "Time taken ";Timer-t
Print "done"

Sleep 
jj2007
Posts: 1161
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: 64-bit asm woes

Postby jj2007 » Oct 03, 2018 13:01

dodicat wrote:My average now is 15.93136094674556
Interesting! My result matches the M$ Excel result exactly, but I see that some "E" cells are included. Hmmmmm....
marcov
Posts: 2751
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL
Contact:

Re: 64-bit asm woes

Postby marcov » Oct 03, 2018 13:51

caseih wrote:
jj2007 wrote:Compilers can outperform hand-written assembler code - in rare cases. And in these rare cases, you can learn from the compiler and re-write the assembler code.
Sure, if you're an expert programmer who has a thorough understanding of the architecture and the chip family you are targeting. For everyone else... not so much. I keep harping on this but I'll say it again. It's far far better to optimize your algorithm before you resort to assembly. Sure you can speed up FB code 2-3 times by going to straight assembly. But if you could speed up the FB code by 10 times through proper profiling and algorithmic optimizations, I think I know which is more effective. Brute force in assembly is still brute force: it doesn't change the dominant big O() run time factor.


( I hate the O() platitude as much as the mindless assembler references. In reality, the cardinality (number of items) of many problems is never really infinite, if only because of limited memory, so constant factors do have to be weighed against the cardinality factor.

This is especially true because not just memory matters: a simple problem with a more complex algorithm can become slower because the additional data (per item) makes the working set larger and you get cache effects.

So be aware of all oversimplifications, both "assembler is lower level so faster", but also "higher level always wins in the end".
)

caseih wrote:In other words, O(n!) is still O(n!) even if each iteration is 2-3 times faster[1].


That would only hold if your dataset were infinite, your memory were infinite, and memory access were uniformly random (no caching etc.). This is not reality.

That said, I use assembler very sparingly. My experience with Delphi (where, in the past, assembler for simple utility routines was common) is that people tend to keep using assembler designed for older systems (hardware and software) without ever retesting.

caseih wrote:I will forever remember PowerBASIC as a cautionary tale about the perils of using pure assembly for any project. It was once fast and unbeatable, but couldn't transition when the architecture and OS changed. I have to wonder if assembly played a role in Bob Zale's departure from Borland and the subsequent spinning off of Turbo Basic. Could it be there was a change at Borland, moving from the fast assembly-based compilers to ones written in C that were a little more future proof as the world moved from 16-bits to 32-bits? Would be interesting to find out some day.


No, according to the wiki, Zale bought it back from Borland in 1989. 386s were still very, very costly then, and even in 1993 Borland would bring out a 16-bit version (though with both real mode and 16-bit protected mode compilers) of their flagship product (Turbo Pascal).

They cut multiple small products in the late eighties; Turbo Modula-2 had been axed in 1987-1988 (sources differ over the exact date).

Of course, in the long term 32-bit would win out, but a 32-bit product was not sellable in 1989. Too small an audience, and even 16-bit DOS extenders were limited back then (Phar Lap only, afaik; no DOS/4GW etc. yet) or had to be done yourself.
jj2007
Posts: 1161
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: 64-bit asm woes

Postby jj2007 » Oct 03, 2018 17:37

dodicat wrote:My average now is 15.93136094674556
If you uncomment line 98 it will show the valid results left.

Looks unsuspicious, and the "E" cells are not the problem. One explanation is that the UN worked on the database in the last few days. I put my zip here, perhaps you will get identical results with that one. However, with the database downloaded right now, I still get the same result:

Code: Select all

14 ms for loading 44553 rows
17 ms for processing 234 matches (of which 171 valid) with an average of 15.996

In contrast, your program reports this with that latest version:

Code: Select all

total readings  234
average  15.93136094674556   169  valid readings
Time taken  0.5938593397639806
Check e.g. the Korea value in line 18822; it should be 37.9. The problem here is that column 2 is "Korea, Democratic People's Republic of": csv is a tricky format ;-)
MrSwiss
Posts: 3083
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: 64-bit asm woes

Postby MrSwiss » Oct 03, 2018 18:57

jj2007 wrote:One should distinguish, though, short sequences of inline asm vs subroutines written entirely in asm.
Yes, and you should also start to realize that 32-bit assembly is only
a 'good thing', an 'improvement', if used with a 32-bit compiler: FBC 32.

With FBC 64, it more than likely causes a 'slow down'. Reasons are:
1) it disables certain compiler optimizations (cache issues mainly)
2) 32-bit code in and of itself (inside 64-bit code) is likely a 'slow down'
compared to the same FB code.

That is, unless you are able to provide the same ASM routines, optimized
equally for 32 bit as well as for 64 bit!
caseih
Posts: 1340
Joined: Feb 26, 2007 5:32

Re: 64-bit asm woes

Postby caseih » Oct 03, 2018 19:20

marcov wrote:( I as much hate the O() platitude as the mindless assembler references. In reality the cardinality (number of items) of many problems never are really infinite, if only because of limited memory, so constant factors do have to be weighted against the cardinality factor.

This is specially true because not just memory matters. A simple problem with a more complex algorithm can become slower because the additional data (per item) makes the working set larger and you get cache effects.

So be aware of all oversimplifications, both "assembler is lower level so faster", but also "higher level always wins in the end".
)
Sorry to sound like a broken record, but I don't feel these are platitudes at all. Big O and complexity were probably the most important things I was ever taught in computer science.

All I know is that there have been more than a few times recently where simple techniques that trade memory for speed gave me a far greater payoff in terms of optimization. Shrug.
jj2007
Posts: 1161
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: 64-bit asm woes

Postby jj2007 » Oct 03, 2018 21:10

MrSwiss wrote:2) 32 bit code in and of itself (inside 64 bit code)
That's an interesting idea. Can you provide a test case?
