I need someone to defeat my string splitting algo

General FreeBASIC programming questions.
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

I need someone to defeat my string splitting algo

Postby rolliebollocks » Apr 03, 2016 0:13

The algorithm must be able to split a string over multiple delimiters (not just one). It outputs the split strings as an array. The algorithm works well for a standard text of 1 million chars or less. But when it gets around 16 million, well.. Hours of my life are robbed from me.

The call goes like this:

splitOver( massive_string, !" \r\n !?.,:;", array )

The result should be an array of strings carved over each character in the delimiter string. The function should eat all the delimiters. Any takers?

Code: Select all

Sub SplitOver ( byref txt as string, byref delim as string, outArr() as string )
   
    txt &= chr(delim[0])
   
    dim as integer nFields=0, bgn=0, lentxt = len(txt)-1, lendelim = len(delim)-1, nxt10k=0
    dim as string newstring = ""
   
    redim outArr(10000)
    outArr(0) = txt
   
    for i as integer = 0 to lentxt
        for ii as integer = 0 to lendelim
                if txt[i] = delim[ii] OR i = lentxt then
                    newstring = RIGHT ( LEFT ( txt, i ), i-bgn )
                    if newstring <> "" then
                        nFields+=1
                        nxt10k += 1
                        if nxt10k = 10000 then
                            redim preserve outArr(nFields+10000)
                            nxt10k = 0
                        end if
                        outArr(nFields) = newstring
                    endif
                    bgn = i+1
                endif
        next
    next
   
    redim preserve outArr( nFields )
   
end sub
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Postby rolliebollocks » Apr 03, 2016 1:21

This is much faster. Pre processes the spots where delims are then uses mid to carve up the string. Array is then copied to eliminate empty indexes.

Code: Select all

sub splitOver2( byval txt as string, byval delim as string, outArr() as string )
   
   txt &= chr( delim[0] )
   
   dim as integer spots(), lentxt = len(txt)-1, lendelim = len(delim)-1, nfields = 0
   dim as string tmpArr()
   
   redim spots(0)
   spots(0) = 0
   
   for i as integer = 0 to lentxt
      for ii as integer = 0 to lendelim
         if txt[i] = delim[ii] then
            redim preserve spots( nfields )
            spots( nfields ) = i
            nfields += 1
            exit for
         end if
      next ii
   next i
   
   ? "Spots assigned"
   
   var lwords = ubound(spots)
   redim tmpArr( lwords )
   
   for i as integer = 0 to lwords - 1
      var newstring = mid( txt, spots(i)+2, spots(i+1)-spots(i)-1 )
      tmpArr(i) = newstring
      ? i & " of " & lwords
   next i
   
   dim as integer nwords = 0
   
   for i as integer = 0 to lwords
      redim preserve outArr(nwords)
      if tmpArr(i) <> "" then
         outArr(nWords) = tmpArr(i)
         nwords += 1
      end if
   next i

end sub
leopardpm
Posts: 1597
Joined: Feb 28, 2009 20:58

Re: I need someone to defeat my string splitting algo

Postby leopardpm » Apr 03, 2016 1:28

you defeated yourself! Congrats rollie!
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Postby rolliebollocks » Apr 03, 2016 3:14

yay! lol
sancho2
Posts: 547
Joined: May 17, 2015 6:41

Re: I need someone to defeat my string splitting algo

Postby sancho2 » Apr 03, 2016 4:16

You can gain some speed by not redim'ing every iteration. This is a speed test that you can plop into your sub:

Code: Select all

Dim As Double sTime, eTime, aTime, bTime
   
   sTime = Timer
   for i as integer = 0 to lentxt
      for ii as integer = 0 to lendelim
         if txt[i] = delim[ii] then
            redim preserve spots( nfields )
            spots( nfields ) = i
            nfields += 1
            exit for
         end if
      next ii
   next i
   eTime = Timer
   aTime = eTime - sTime
   Print "nfields: "; nfields
   Print "aaa: "; aTime
   Print
   'Sleep
   Dim As Integer n, count
   
   n = 1000
   nfields = 0
   ReDim spots(0 To n)
   sTime = Timer   
   for i as integer = 0 to lentxt
      for ii as integer = 0 to lendelim
         if txt[i] = delim[ii] Then
            'redim preserve spots( nfields )
            spots( nfields ) = i
            nfields += 1
           
            If nfields = n Then
               n += 1000
               ReDim Preserve spots(0 To n)
            EndIf
            exit for
         end if
      next ii
   next i
   ReDim Preserve spots(0 To nfields)   '<---- EDIT: I forgot to down size the array
      eTime = Timer
   bTime = eTime - sTime

   Print "nfields: "; nfields
   Print "bbb: "; bTime
   Sleep
sancho2
Posts: 547
Joined: May 17, 2015 6:41

Re: I need someone to defeat my string splitting algo

Postby sancho2 » Apr 03, 2016 6:34

You can gain further speed by sorting the delimeter in reverse and then short circuiting the test in the for loop. ie if the test character > the delim char there is no need to continue checking delimeters.
Apparantly the insertion sort is best for small data sets so here is an implementation:

Code: Select all

Sub RevInsertionSort(ByRef txt As String)
   '
   Dim As Byte temp    ', j

   For i As Integer = 0 to Len(txt) - 1
      temp = txt[i]
       Dim As Integer j = i - 1
       while j >= 0 and txt[j] <= temp
        txt[j+1] = txt[j]
        j = j - 1
       Wend
       txt[j+1] = temp
   Next
   
End Sub

And here is how to implement with the timing checks (you will have to add dTime to the dim'd doubles:

Code: Select all

   n = 1000
   nfields = 0
   ReDim spots(0 To n)
   sTime = Timer   
   Dim As String sss = delim
   revinsertionsort(sss)
   
   for i as integer = 0 to lentxt
      for ii as integer = 0 to lendelim
         if txt[i] = sss[ii] Then
            spots( nfields ) = i
            nfields += 1
           
            If nfields = n Then
               n += 1000
               ReDim Preserve spots(0 To n)
            EndIf
            exit For
         ElseIf txt[i] > sss[ii] Then   ' short circuit
            Exit For
         end If
         
      next ii
   next i
   ReDim Preserve spots(0 To nfields)
   eTime = Timer
   dTime = eTime - sTime
   
   Print "nfields: "; nfields
   Print "ccc: "; dTime
   Print
   Sleep
fxm
Posts: 8347
Joined: Apr 22, 2009 12:46
Location: Paris (suburb), FRANCE

Re: I need someone to defeat my string splitting algo

Postby fxm » Apr 03, 2016 7:15

Some problems with your original code (first sub-string not taken into account, last element of the output array is empty).
Tested with this main program:

Code: Select all

dim as string txt = "123,456,,789,ABC,,,DEF"
dim as string outArr()
splitOver2( txt, "-,;", outArr())
print
for i as integer = lbound(outArr) to ubound(outArr)
  print "'" & outArr(i) & "'"
next i

Code: Select all

Spots assigned
0 of 7
1 of 7
2 of 7
3 of 7
4 of 7
5 of 7
6 of 7

'456'
'789'
'ABC'
'DEF'
''

The first thing before optimizing is to start with a simplified code which works (and string passed byref to avoid useless copy):

Code: Select all

sub splitOver2( byref txt as string, byref delim as string, outArr() as string )
   
   dim as integer startSubString = 0
   
   erase outArr
   
   for i as integer = 0 to len(txt)
      for ii as integer = 0 to len(delim) - 1
         if i = len(txt) orelse txt[i] = delim[ii] then
            if i > startSubString then
               redim preserve outArr(ubound(outArr)+1)
               outArr(ubound(outArr)) = mid(txt, startSubString+1, i-startSubString)
            end if
            startSubString = i + 1
            exit for
         end if
      next ii
   next i

end sub


dim as string txt = "123,456,-789,ABC,-;DEF"
dim as string outArr()
splitOver2( txt, "-,;", outArr())
for i as integer = lbound(outArr) to ubound(outArr)
  print "'" & outArr(i) & "'"
next i

sleep

Code: Select all

'123'
'456'
'789'
'ABC'
'DEF'
fxm
Posts: 8347
Joined: Apr 22, 2009 12:46
Location: Paris (suburb), FRANCE

Re: I need someone to defeat my string splitting algo

Postby fxm » Apr 03, 2016 8:40

Now, optimization by redim'ing the array by big step:

Code: Select all

sub splitOver2( byref txt as string, byref delim as string, outArr() as string )
   
   dim as integer startSubString = 0
   dim as integer numberSubString = 0
   dim as integer sizeAllocationBlock = 100  '' for example
   
   
   erase outArr
   
   for i as integer = 0 to len(txt)
      for ii as integer = 0 to len(delim) - 1
         if i = len(txt) orelse txt[i] = delim[ii] then
            if i > startSubString then
               if ubound(outArr) < numberSubString then
                  redim preserve outArr(ubound(outArr)+sizeAllocationBlock)
               end if
               outArr(numberSubString) = mid(txt, startSubString+1, i-startSubString)
               numberSubString += 1
            end if
            startSubString = i + 1
            exit for
         end if
      next ii
   next i
   
   if numberSubString > 0 then
      redim preserve outArr(numberSubString-1)
   end if

end sub


dim as string txt = "123,456,-789,ABC,-;DEF"
dim as string outArr()
splitOver2( txt, "-,;", outArr())
for i as integer = lbound(outArr) to ubound(outArr)
  print "'" & outArr(i) & "'"
next i

sleep


[edit]
- add code to take into account the case where txt contains no sub-string
Last edited by fxm on Apr 03, 2016 14:49, edited 2 times in total.
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Postby rolliebollocks » Apr 03, 2016 13:33

I'll take a look through later when I'm back home. But under -EXX (compiled under Linux LXLE) I'm getting an illegal function call on the line:

redim preserve outArr(numberSubString-1)

(Windows too. It does not want to compile under -exx.)
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Postby rolliebollocks » Apr 03, 2016 13:34

Never mind. Text was empty.
fxm
Posts: 8347
Joined: Apr 22, 2009 12:46
Location: Paris (suburb), FRANCE

Re: I need someone to defeat my string splitting algo

Postby fxm » Apr 03, 2016 14:42

See my code update.
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Postby rolliebollocks » Apr 03, 2016 17:53

I'm getting a 33% increase in speed from my second attempt. Excellent. Thank you fxm.
fxm
Posts: 8347
Joined: Apr 22, 2009 12:46
Location: Paris (suburb), FRANCE

Re: I need someone to defeat my string splitting algo

Postby fxm » Apr 03, 2016 18:45

I replaced MID() by a C runtime function.
Can you test if there is speed improvement with this variant:

Code: Select all

#include "crt/string.bi"

sub splitOver2( byref txt as string, byref delim as string, outArr() as string )
   
   dim as integer startSubString = 0
   dim as integer numberSubString = 0
   dim as integer sizeAllocationBlock = 100  '' for example
   
   
   erase outArr
   
   for i as integer = 0 to len(txt)
      for ii as integer = 0 to len(delim) - 1
         if i = len(txt) orelse txt[i] = delim[ii] then
            if i > startSubString then
               if ubound(outArr) < numberSubString then
                  redim preserve outArr(ubound(outArr)+sizeAllocationBlock)
               end if
               outArr(numberSubString) = space(i-startSubString)
               memcpy(strptr(outArr(numberSubString)), strptr(txt)+startSubString, i-startSubString)
               numberSubString += 1
            end if
            startSubString = i + 1
            exit for
         end if
      next ii
   next i
   
   if numberSubString > 0 then
      redim preserve outArr(numberSubString-1)
   end if

end sub


dim as string txt = "123,456,-789,ABC,-;DEF"
dim as string outArr()
splitOver2( txt, "-,;", outArr())
for i as integer = lbound(outArr) to ubound(outArr)
  print "'" & outArr(i) & "'"
next i

sleep
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Postby rolliebollocks » Apr 03, 2016 22:51

There does not appear to be a significant increase. I ran timer test about a dozen times and the results were inconsistent but average out to about even.
fxm
Posts: 8347
Joined: Apr 22, 2009 12:46
Location: Paris (suburb), FRANCE

Re: I need someone to defeat my string splitting algo

Postby fxm » Apr 04, 2016 7:25

Thank you.
This shows that the execution speed of MID() is not so bad as seems to say some users.

Return to “General”

Who is online

Users browsing this forum: deltarho[1859], Gablea and 2 guests