I need someone to defeat my string splitting algo

General FreeBASIC programming questions.
Post Reply
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

I need someone to defeat my string splitting algo

Post by rolliebollocks »

The algorithm must be able to split a string over multiple delimiters (not just one). It outputs the split strings as an array. The algorithm works well for a standard text of 1 million chars or less. But when it gets around 16 million, well.. Hours of my life are robbed from me.

The call goes like this:

splitOver( massive_string, !" \r\n !?.,:;", array )

The result should be an array of strings carved over each character in the delimiter string. The function should eat all the delimiters. Any takers?

Code: Select all

Sub SplitOver ( byref txt as string, byref delim as string, outArr() as string ) 
    
    txt &= chr(delim[0])
    
    dim as integer nFields=0, bgn=0, lentxt = len(txt)-1, lendelim = len(delim)-1, nxt10k=0
    dim as string newstring = ""
    
    redim outArr(10000)
    outArr(0) = txt
    
    for i as integer = 0 to lentxt
        for ii as integer = 0 to lendelim
                if txt[i] = delim[ii] OR i = lentxt then
                    newstring = RIGHT ( LEFT ( txt, i ), i-bgn )
                    if newstring <> "" then
                        nFields+=1
                        nxt10k += 1
                        if nxt10k = 10000 then 
                            redim preserve outArr(nFields+10000)
                            nxt10k = 0
                        end if
                        outArr(nFields) = newstring
                    endif
                    bgn = i+1
                endif
        next
    next
    
    redim preserve outArr( nFields )
    
end sub
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Post by rolliebollocks »

This is much faster. Pre processes the spots where delims are then uses mid to carve up the string. Array is then copied to eliminate empty indexes.

Code: Select all

sub splitOver2( byval txt as string, byval delim as string, outArr() as string )
	
	txt &= chr( delim[0] )
	
	dim as integer spots(), lentxt = len(txt)-1, lendelim = len(delim)-1, nfields = 0
	dim as string tmpArr()
	
	redim spots(0)
	spots(0) = 0
	
	for i as integer = 0 to lentxt
		for ii as integer = 0 to lendelim
			if txt[i] = delim[ii] then
				redim preserve spots( nfields )
				spots( nfields ) = i
				nfields += 1
				exit for
			end if
		next ii
	next i 
	
	? "Spots assigned"
	
	var lwords = ubound(spots)
	redim tmpArr( lwords )
	
	for i as integer = 0 to lwords - 1
		var newstring = mid( txt, spots(i)+2, spots(i+1)-spots(i)-1 )
		tmpArr(i) = newstring
		? i & " of " & lwords
	next i 
	
	dim as integer nwords = 0
	
	for i as integer = 0 to lwords
		redim preserve outArr(nwords)
		if tmpArr(i) <> "" then
			outArr(nWords) = tmpArr(i)
			nwords += 1
		end if
	next i

end sub
leopardpm
Posts: 1795
Joined: Feb 28, 2009 20:58

Re: I need someone to defeat my string splitting algo

Post by leopardpm »

you defeated yourself! Congrats rollie!
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Post by rolliebollocks »

yay! lol
sancho2
Posts: 547
Joined: May 17, 2015 6:41

Re: I need someone to defeat my string splitting algo

Post by sancho2 »

You can gain some speed by not redim'ing every iteration. This is a speed test that you can plop into your sub:

Code: Select all

Dim As Double sTime, eTime, aTime, bTime
   
   sTime = Timer
   for i as integer = 0 to lentxt
      for ii as integer = 0 to lendelim
         if txt[i] = delim[ii] then
            redim preserve spots( nfields )
            spots( nfields ) = i
            nfields += 1
            exit for
         end if
      next ii
   next i
   eTime = Timer
   aTime = eTime - sTime
   Print "nfields: "; nfields
   Print "aaa: "; aTime
   Print
   'Sleep
	Dim As Integer n, count
   
   n = 1000
   nfields = 0
   ReDim spots(0 To n)
	sTime = Timer	
   for i as integer = 0 to lentxt
      for ii as integer = 0 to lendelim
         if txt[i] = delim[ii] Then
            'redim preserve spots( nfields )
            spots( nfields ) = i
            nfields += 1
            
            If nfields = n Then
            	n += 1000
            	ReDim Preserve spots(0 To n)
            EndIf
            exit for
         end if
      next ii
   next i
   ReDim Preserve spots(0 To nfields)   '<---- EDIT: I forgot to down size the array 
   	eTime = Timer
	bTime = eTime - sTime

   Print "nfields: "; nfields
   Print "bbb: "; bTime
   Sleep
sancho2
Posts: 547
Joined: May 17, 2015 6:41

Re: I need someone to defeat my string splitting algo

Post by sancho2 »

You can gain further speed by sorting the delimeter in reverse and then short circuiting the test in the for loop. ie if the test character > the delim char there is no need to continue checking delimeters.
Apparantly the insertion sort is best for small data sets so here is an implementation:

Code: Select all

Sub RevInsertionSort(ByRef txt As String)
	'
	Dim As Byte temp    ', j

	For i As Integer = 0 to Len(txt) - 1
		temp = txt[i]
    	Dim As Integer j = i - 1
    	while j >= 0 and txt[j] <= temp
        txt[j+1] = txt[j]
        j = j - 1
    	Wend
    	txt[j+1] = temp
	Next
	
End Sub
And here is how to implement with the timing checks (you will have to add dTime to the dim'd doubles:

Code: Select all

   n = 1000
   nfields = 0
   ReDim spots(0 To n)
	sTime = Timer	
	Dim As String sss = delim
	revinsertionsort(sss) 
	
   for i as integer = 0 to lentxt
      for ii as integer = 0 to lendelim
         if txt[i] = sss[ii] Then
            spots( nfields ) = i
            nfields += 1
            
            If nfields = n Then
            	n += 1000
            	ReDim Preserve spots(0 To n)
            EndIf
            exit For
         ElseIf txt[i] > sss[ii] Then	' short circuit
         	Exit For
         end If
         
      next ii
   next i
   ReDim Preserve spots(0 To nfields)
	eTime = Timer
	dTime = eTime - sTime
	
   Print "nfields: "; nfields
   Print "ccc: "; dTime
	Print
   Sleep
fxm
Moderator
Posts: 12081
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: I need someone to defeat my string splitting algo

Post by fxm »

Some problems with your original code (first sub-string not taken into account, last element of the output array is empty).
Tested with this main program:

Code: Select all

dim as string txt = "123,456,,789,ABC,,,DEF"
dim as string outArr()
splitOver2( txt, "-,;", outArr())
print
for i as integer = lbound(outArr) to ubound(outArr)
  print "'" & outArr(i) & "'"
next i

Code: Select all

Spots assigned
0 of 7
1 of 7
2 of 7
3 of 7
4 of 7
5 of 7
6 of 7

'456'
'789'
'ABC'
'DEF'
''
The first thing before optimizing is to start with a simplified code which works (and string passed byref to avoid useless copy):

Code: Select all

sub splitOver2( byref txt as string, byref delim as string, outArr() as string )
   
   dim as integer startSubString = 0
   
   erase outArr
   
   for i as integer = 0 to len(txt)
      for ii as integer = 0 to len(delim) - 1
         if i = len(txt) orelse txt[i] = delim[ii] then
            if i > startSubString then
               redim preserve outArr(ubound(outArr)+1)
               outArr(ubound(outArr)) = mid(txt, startSubString+1, i-startSubString)
            end if
            startSubString = i + 1
            exit for
         end if
      next ii
   next i

end sub


dim as string txt = "123,456,-789,ABC,-;DEF"
dim as string outArr()
splitOver2( txt, "-,;", outArr())
for i as integer = lbound(outArr) to ubound(outArr)
  print "'" & outArr(i) & "'"
next i

sleep

Code: Select all

'123'
'456'
'789'
'ABC'
'DEF'
fxm
Moderator
Posts: 12081
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: I need someone to defeat my string splitting algo

Post by fxm »

Now, optimization by redim'ing the array by big step:

Code: Select all

sub splitOver2( byref txt as string, byref delim as string, outArr() as string )
   
   dim as integer startSubString = 0
   dim as integer numberSubString = 0
   dim as integer sizeAllocationBlock = 100  '' for example
   
   
   erase outArr
   
   for i as integer = 0 to len(txt)
      for ii as integer = 0 to len(delim) - 1
         if i = len(txt) orelse txt[i] = delim[ii] then
            if i > startSubString then
               if ubound(outArr) < numberSubString then
                  redim preserve outArr(ubound(outArr)+sizeAllocationBlock)
               end if
               outArr(numberSubString) = mid(txt, startSubString+1, i-startSubString)
               numberSubString += 1
            end if
            startSubString = i + 1
            exit for
         end if
      next ii
   next i
   
   if numberSubString > 0 then
      redim preserve outArr(numberSubString-1)
   end if 

end sub


dim as string txt = "123,456,-789,ABC,-;DEF"
dim as string outArr()
splitOver2( txt, "-,;", outArr())
for i as integer = lbound(outArr) to ubound(outArr)
  print "'" & outArr(i) & "'"
next i

sleep
[edit]
- add code to take into account the case where txt contains no sub-string
Last edited by fxm on Apr 03, 2016 14:49, edited 2 times in total.
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Post by rolliebollocks »

I'll take a look through later when I'm back home. But under -EXX (compiled under Linux LXLE) I'm getting an illegal function call on the line:

redim preserve outArr(numberSubString-1)

(Windows too. It does not want to compile under -exx.)
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Post by rolliebollocks »

Never mind. Text was empty.
fxm
Moderator
Posts: 12081
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: I need someone to defeat my string splitting algo

Post by fxm »

See my code update.
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Post by rolliebollocks »

I'm getting a 33% increase in speed from my second attempt. Excellent. Thank you fxm.
fxm
Moderator
Posts: 12081
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: I need someone to defeat my string splitting algo

Post by fxm »

I replaced MID() by a C runtime function.
Can you test if there is speed improvement with this variant:

Code: Select all

#include "crt/string.bi"

sub splitOver2( byref txt as string, byref delim as string, outArr() as string )
   
   dim as integer startSubString = 0
   dim as integer numberSubString = 0
   dim as integer sizeAllocationBlock = 100  '' for example
   
   
   erase outArr
   
   for i as integer = 0 to len(txt)
      for ii as integer = 0 to len(delim) - 1
         if i = len(txt) orelse txt[i] = delim[ii] then
            if i > startSubString then
               if ubound(outArr) < numberSubString then
                  redim preserve outArr(ubound(outArr)+sizeAllocationBlock)
               end if
               outArr(numberSubString) = space(i-startSubString)
               memcpy(strptr(outArr(numberSubString)), strptr(txt)+startSubString, i-startSubString)
               numberSubString += 1
            end if
            startSubString = i + 1
            exit for
         end if
      next ii
   next i
   
   if numberSubString > 0 then
      redim preserve outArr(numberSubString-1)
   end if 

end sub


dim as string txt = "123,456,-789,ABC,-;DEF"
dim as string outArr()
splitOver2( txt, "-,;", outArr())
for i as integer = lbound(outArr) to ubound(outArr)
  print "'" & outArr(i) & "'"
next i

sleep
rolliebollocks
Posts: 2655
Joined: Aug 28, 2008 10:54
Location: new york

Re: I need someone to defeat my string splitting algo

Post by rolliebollocks »

There does not appear to be a significant increase. I ran timer test about a dozen times and the results were inconsistent but average out to about even.
fxm
Moderator
Posts: 12081
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: I need someone to defeat my string splitting algo

Post by fxm »

Thank you.
This shows that the execution speed of MID() is not so bad as seems to say some users.
Post Reply