I need someone to defeat my string splitting algo

General FreeBASIC programming questions.
dodicat
Posts: 3774
Joined: Jan 10, 2006 20:30
Location: Scotland

Re: I need someone to defeat my string splitting algo

Postby dodicat » Jan 07, 2017 12:56

Thanks Marpon.
I have looked through all the posts here, and nobody had invoked strtok.
(Everything but it, I think)
IMHO, a strtok of genius on your part!
marpon
Posts: 190
Joined: Dec 28, 2012 13:31
Location: Paris - France

Re: I need someone to defeat my string splitting algo

Postby marpon » Jan 08, 2017 10:31

A faster version improved 25% with 1 loop only
and adaptative redim of the array

I think i will win the challenge !

but remember that split function it is not a split function as vb was giving

for me it is a sort of split_any_delimiter with no empty string token
if you want an split function with all the tokens (even empty string) its a different story ( and slower of course)


Code: Select all

#include "crt/string.bi"   'needed for strtok function
'using strtok(s1 as zstring ptr, s2 as zstring ptr) as zstring ptr only 1 loop
'you can adapt to expected number of tokens to have less redim adaptation, changing the initial i1 value (set to 10 here)
Function String_strtok2(s_in As String , chars As String , result() As String) As Long
   Dim As Long          ctr   
   dim as zstring ptr   p, p1, p2
   dim as long            i1 = 10        'initial value to redim result
   dim as string s2 = s_in             'copy because strtok will alter the input zstring
   
   p = strptr(s2)                        'get pointers
   p2= strptr(chars)
   
   p1 = strtok(p , p2)
   while p1 <> NULL          
      ctr +=1
      if ctr = 1  THEN
         Redim result(1 To i1)     'redim the array first time
      elseif ctr > i1 then
         i1 *= 2
         redim preserve result(1 To i1)     'redim the array to give more space (preserved existing content)
      END IF
      result(ctr)= *p1
      p1 = strtok(NULL, p2)
   WEND
   If ctr = 0 Then Return 0
   Redim preserve result(1 To ctr)     'redim the array to fit exact need (preserved existing content)
   Return Ubound(result)
End Function
'====================================================================================
marpon
Posts: 190
Joined: Dec 28, 2012 13:31
Location: Paris - France

Re: I need someone to defeat my string splitting algo

Postby marpon » Jan 10, 2017 16:22

last optimization: with byval s_in As String to avoid to make a copy before using strtok

remind : the strtok function from the clib is very fast but not thread-safe ,
that function incorporates a static value to store the initial pointer and uses it on the consecutive executions,
so that means if you create a multi thread prog using that function it will surely fail.

In most case not a problem, but if you need an other one : thread-safe, contact me,
i 've done 1 some time ago, not as faster as this one (almost 2 x times slower) but even faster than all the versions i've seen here.

Code: Select all

'warning not thread-safe using strtok(s1 as zstring ptr, s2 as zstring ptr) as zstring ptr only 1 loop
Function String_strtok4(byval s_in As String , byref chars As String , result() As String) As Long
   Dim As Long          ctr   
   dim as zstring ptr   p, p1, p2
   dim as long          i1 = 10              'initial value to redim result
      
   p = strptr(s_in)                          'get pointers
   p2= strptr(chars)
   p1 = strtok(p , p2)
   while p1 <> NULL          
      ctr +=1
      if ctr = 1  THEN
         Redim result(1 To i1)              'redim the array first time
      elseif ctr > i1 then
         i1 *= 2
         redim preserve result(1 To i1)     'redim the array to give more space
      END IF
      result(ctr)= *p1
      p1 = strtok(NULL, p2)
   WEND
   If ctr = 0 Then Return 0
   'redim the array to fit exact need
   if Ubound(result) > ctr THEN  Redim preserve result(1 To ctr)        
   Return Ubound(result)
End Function
'====================================================================================
WQ1980
Posts: 23
Joined: Sep 25, 2015 12:04
Location: Russia

Re: I need someone to defeat my string splitting algo

Postby WQ1980 » Jan 10, 2017 16:36

Marpon
remind : the strtok function from the clib is very fast but not thread-safe ,
that function incorporates a static value to store the initial pointer and uses it on the consecutive executions,
so that means if you create a multi thread prog using that function it will surely fail.

1) compiling in dll
2) call from dll
marpon
Posts: 190
Joined: Dec 28, 2012 13:31
Location: Paris - France

Re: I need someone to defeat my string splitting algo

Postby marpon » Jan 12, 2017 10:13

here my thread-safe version

not as fast as the strtok one but faster than all others found here

in fact it mimics the strtok_r existing in some c lib distributions ( _r for re-entrant)

notice also
Function String_strtok_r(Byval s_in As String , Byref chars As String , result() As String, Byval size1 As Long = 1) As Long
Byval is needed because we need a copy of the string , the f_strtok_r function will alter the input string (putting 0 to delimit the elements)

and in function f_strtok_r : Byref ps1 As Byte Ptr , is mandatory because the function will return back the modified ps1 pointer
Function f_strtok_r(Byref ps1 As Byte Ptr, Byref ps2 As Byte Ptr, Byval i As Long )As Byte Ptr

and the adaptative Redim , you can change the size1 optional parameter to try optimizing litle more according your split estimation

'with that version you can avoid dll and any work-arround , it will work safely!

Code: Select all

#include "crt/string.bi"   'needed for memchr function

Declare Function f_strtok_r(Byref ps1 As Byte Ptr, Byref ps2 As Byte Ptr, Byval i As Long )As Byte Ptr

'thread-safe function
Function String_strtok_r(Byval s_in As String , Byref chars As String ,  result() As String, Byval size1 As Long = 1) As Long
   Dim ptemp As Zstring Ptr
   Dim ctr As Long
   Dim As Byte Ptr p = Strptr(s_in)   'get pointers
   Dim As Byte Ptr d = Strptr(chars)
   Dim As Long ilen = Len(chars)
   ptemp = p
   While ptemp <> NULL
      ptemp = f_strtok_r(p, d, ilen)  ' p is sent byref and it will be modified on the f_strtok_r function
      if ptemp <> NULL Then
         ctr += 1
         If ctr = 1  Then
            Redim result(1 To size1)     'redim the array
         Elseif ctr > size1 then
            size1 *= 2
            Redim Preserve result(1 To size1)     'redim the array
         End If
         result(ctr)= *ptemp
      End If
   Wend
   If ctr = 0 Then Return 0
   If size1 > ctr Then Redim Preserve result(1 To ctr)     'redim the array
   Return ctr
End Function

Function f_strtok_r(Byref ps1 As Byte Ptr, Byref ps2 As Byte Ptr, Byval i As Long )As Byte Ptr
   Dim p As Byte Ptr

   While memchr(ps2, *ps1,  i) <> NULL
      ps1 += 1
   Wend
   If *ps1 = 0 Then
      ps1 = NULL
      Return NULL
   End If
   p = ps1
   While memchr(ps2, *ps1, i) = NULL 
      ps1 += 1
      if *ps1 = 0 Then Exit While
   Wend
   If *ps1 <> 0 Then
      *ps1 = 0
      ps1 += 1
   End If
   Return p
End Function
marpon
Posts: 190
Joined: Dec 28, 2012 13:31
Location: Paris - France

Re: I need someone to defeat my string splitting algo

Postby marpon » Jan 13, 2017 17:30

i've collected from my own archives the different versions of split functions

each of them is optimized for speed

Split : vb6 equivalent function , splits the input string using sub-string as delimiter, consecutive delimiters produce empty elements

Split_noempty : variation of Split, but the empty elements are not collected

Split_any : splits the input string using every char as char delimiter, consecutive char delimiters produce empty elements

Split_tok : variation of Split_any, but the empty elements are not collected, using strtok from c lib , not thread-safe but very fast

Split_tok_r : works as Split_tok , but thread-safe and bit slower

some of them do not need any include, but some need
#include "crt/string.bi" 'needed for strtok/memchr functions

i've put the indication on the source

source code and test comparaison

Code: Select all

'====================================================================================
'splits TEXT using DELIMIT as sub-string delimiter,
'1 splitted element for each delimiter found or +1  if delimiter does not finish TEXT
' elements can be empty
' vb6 Split equivalent function
'no include needed and thread-safe
'uses array  on the heap, so no problem of capacity with the number of elements < 2147483647
Private Function Split(byref TEXT As String , byref DELIMIT As String , RET() As String) As long
   Dim As long DMAX = 0
   Dim         As long I1
   Dim         As long I2
   Dim         As ulong size = 1
   Dim         As ZString Ptr p
   Dim         As ZString Ptr p1
   Dim         As ZString Ptr p2
   Dim         As ZString Ptr ptemp
   Dim          As long LDelimit = Len(DELIMIT)
   Dim          As long LT = Len(TEXT)
   Dim         As uinteger ptr Posi
   Dim         As uinteger ptr Posi2

   If LT = 0 or LDelimit = 0 Or LDelimit > LT Then
      ReDim RET(1 to 1)
      RET(1) = TEXT                              ' copy the full TEXT
      Return 1
   End If
   ptemp = allocate(Len(TEXT) + 1)
   If ptemp = 0 Then Print "Error Allocating p": End
   *ptemp = TEXT                             ' copy the full text
   p1 = ptemp
   p = ptemp
   ' counts the number of element and stores the position on the string
   Do While *p
      I2 = 0
      If p[0] = DELIMIT[0] Then
         p2 = p
         If LDelimit > 1 Then                 ' more than 1 character as delimiter
            For I1 = 1 To LDelimit - 1
               I2 = 0
               If p[1] <> DELIMIT[I1] Then Exit For
               p += 1
               I2 = 1
            Next
         Else
            I2 = 1
         End If
         If I2 = 1 Then
            DMAX += 1
            if DMAX = 1 THEN
               Posi = allocate((size + 1) * sizeof(uinteger))
               If Posi = 0 Then Print "Error Allocating Posi": End
            elseif DMAX > size THEN          ' need more space to store the position
               size *= 2
               Posi2 = reallocate(Posi, (size + 1) * sizeof(uinteger))
               If Posi2 = 0 Then Print "Error ReAllocating Posi": End
               Posi = Posi2
            END IF
            Posi[DMAX - 1] = cast(uinteger, p1)
            *p2 = 0                           ' put null byte to finish the string element
            p1 = p2 + LDelimit
         End If
      End If
      p += 1
   Loop
   if p1 < p THEN
      DMAX += 1
      if DMAX = 1 THEN
         size = 1
         Posi = allocate( sizeof(uinteger) )
         If Posi = 0 Then Print "Error Allocating Posi": End
      end if
      Posi[DMAX - 1] = cast(uinteger, p1)
   END IF
   ' dimention the array
   ReDim RET(1 To DMAX)
   ' step through the posi array, setting pointers for each element
   For I1 = 1 To DMAX
      RET(I1) = *cptr(zstring ptr, Posi[I1 - 1])    ' get element, by retrieving the pointer from Posi   array
   Next
   Deallocate (ptemp)
   Deallocate (Posi)
   Return DMAX
End Function
'====================================================================================


'====================================================================================
'splits TEXT using DELIMIT as sub-string delimiter, almost as Split,
' but only the non-empty elements will be on the RET array
'no include needed and thread-safe
'uses array on the heap, so no problem of capacity with the number of elements < 2147483647
Private Function Split_noempty(byref TEXT As String , byref DELIMIT As String , RET() As String) As long
   Dim As long DMAX = 0
   Dim         As long I1
   Dim         As long I2
   Dim         As ulong size = 1
   Dim         As ZString Ptr p
   Dim         As ZString Ptr p1
   Dim         As ZString Ptr p2
   Dim         As ZString Ptr ptemp
   Dim          As long LDelimit = Len(DELIMIT)
   Dim          As long LT = Len(TEXT)
   Dim         As uinteger ptr Posi
   Dim         As uinteger ptr Posi2

   If LT = 0 Then Return 0
   if size < 1 THEN size = 1
   If LDelimit = 0 Or LDelimit > LT Then
      ReDim RET(1 to 1)
      RET(1) = TEXT                              ' copy the full TEXT
      Return 1
   End If
   ptemp = allocate(Len(TEXT) + 1)
   If ptemp = 0 Then Print "Error Allocating p": End
   *ptemp = TEXT                             ' copy the full text
   p1 = ptemp
   p = ptemp
   ' counts the number of element and stores the position on the string
   Do While *p
      I2 = 0
      If p[0] = DELIMIT[0] Then
         p2 = p
         If LDelimit > 1 Then                 ' more than 1 character as delimiter
            For I1 = 1 To LDelimit - 1
               I2 = 0
               If p[1] <> DELIMIT[I1] Then Exit For
               p += 1
               I2 = 1
            Next
         Else
            I2 = 1
         End If
         If I2 = 1 Then
            *p2 = 0                           ' put null byte to finish the string element
            if *p1 <> "" then
               DMAX += 1
               if DMAX = 1 THEN
                  Posi = allocate((size + 1) * sizeof(uinteger))
                  If Posi = 0 Then Print "Error Allocating Posi": End
               elseif DMAX > size THEN          ' need more space to store the position
                  size *= 2
                  Posi2 = reallocate(Posi, (size + 1) * sizeof(uinteger))
                  If Posi2 = 0 Then Print "Error ReAllocating Posi": End
                  Posi = Posi2
               END IF
               Posi[DMAX - 1] = cast(uinteger, p1)
            end if
            p1 = p2 + LDelimit
         End If
      End If
      p += 1
   Loop
   if p1 < p and *p1 <> "" THEN
      DMAX += 1
      if DMAX = 1 THEN
         size = 1
         Posi = allocate( sizeof(uinteger) )
         If Posi = 0 Then Print "Error Allocating Posi": End
      end if
      Posi[DMAX - 1] = cast(uinteger, p1)
   END IF
   ' dimention the array
   ReDim RET(1 To DMAX)
   ' step through the posi array, setting pointers for each element
   For I1 = 1 To DMAX
      RET(I1) = *cptr(zstring ptr, Posi[I1 - 1])    ' get element, by retrieving the pointer from Posi   array
   Next
   Deallocate (ptemp)
   Deallocate (Posi)
   Return DMAX
End Function
'====================================================================================


'====================================================================================
#include "crt/string.bi"   'needed for memchr function
'splits TEXT using every char of DELIMIT as delimiter, 
'1 splitted element for each char delimiter found or +1  if char delimiter does not finish TEXT
' elements can be empty
'uses array on the heap, so no problem of capacity with the number of elements < 2147483647
Private Function Split_any(byref TEXT As String , byref DELIMIT As String , RET() As String) As long
   Dim As long DMAX = 0
   Dim         As long I1
   Dim         As long I2
   Dim         As long L1 = len(TEXT)
   Dim         As long L2 = Len(DELIMIT)
   Dim         As ZString Ptr ptemp
   Dim         As byte Ptr p
   Dim         As byte Ptr p1
   Dim         As byte Ptr p2
   Dim         As ulong size = 1
   Dim         As uinteger ptr Posi
   Dim         As uinteger ptr Posi2
   Dim          as zstring ptr pdelim = strptr(DELIMIT)

   If L2 = 0 or L1 = 0 Then
      ReDim RET(1 to 1)
      RET(1) = TEXT                              ' copy the full TEXT
      Return 1
   End If
   ptemp = allocate(L1 + 1)
   *ptemp = TEXT                              ' copy the full TEXT
   p1 = ptemp
   p = ptemp
   ' counts the number of element and stores the position on the string
   Do While p[0]
      I2 = 0
      p2 = pdelim
      if l2 = 1 THEN
         if p2[0] = p[0] then I2 = 1
      else
         if memchr(pdelim, p[0], L2)<> 0 THEN I2 = 1
      END IF
      If I2 = 1 Then
         DMAX += 1
         if DMAX = 1 THEN
            Posi = allocate( (size + 1) * sizeof(uinteger) )
            If Posi = 0 Then Print "Error Allocating Posi": End
         elseif DMAX > size THEN          ' need more space to store the position
            size *= 2
            Posi2 = reallocate( Posi, (size + 1) * sizeof(uinteger))
            If Posi2 = 0 Then Print "Error ReAllocating Posi": End
            Posi = Posi2
         END IF
         Posi[DMAX - 1] = cast(uinteger, p1)
         *p = 0                           ' put null byte to finish the string element
         p1 = p + 1
      End If
      p += 1
   Loop
   if p1 < p THEN
      DMAX += 1
      if DMAX = 1 THEN
         size = 1
         Posi = allocate( sizeof(uinteger) )
         If Posi = 0 Then Print "Error Allocating Posi": End
      end if
      Posi[DMAX - 1] = cast(uinteger, p1)
   END IF
   ' dimention the array
   ReDim RET(1 To DMAX)
   For I1 = 1 To DMAX
      RET(I1) = *cptr(zstring ptr, Posi[I1 - 1])    ' get element, by retrieving the pointer from Posi   array
   Next
   Deallocate (ptemp)
   Deallocate (Posi)
   Return DMAX
End Function


'====================================================================================
#include "crt/string.bi"   'needed for strtok function

'splits TEXT using every char of DELIMIT as delimiter, 
'works as Split_any but only the non-empty elements will be on the RET array 
'
'warning not thread-safe using strtok(s1 as zstring ptr, s2 as zstring ptr) as zstring ptr only 1 loop
Private Function Split_tok(byref TEXT As String , byref DELIMIT As String , RET() As String) As Long
   Dim As Long          ctr
   dim as zstring ptr   p, p1, p2
   dim as ulong         i1 = 1           'initial value to redim RET
   
   if(len(TEXT) = 0 or len(DELIMIT) = 0) then
      Redim RET(1 To 1)
      RET(1) = TEXT
      return 1
   end if
   dim as string         s1 = TEXT       'copy TEXT to avoid alteration of the input
   p = strptr(s1)                        'get pointers
   p2= strptr(DELIMIT)
   p1 = strtok(p , p2)
   while p1 <> 0
      ctr +=1
      if ctr = 1  THEN
         Redim RET(1 To i1)              'redim the array first time
      elseif ctr > i1 then
         i1 *= 2
         redim preserve RET(1 To i1)     'redim the array to give more space
      END IF
      RET(ctr)= *p1
      p1 = strtok(0, p2)
   WEND
   If ctr = 0 Then Return 0
   'redim the array to fit exact need
   if i1 > ctr THEN  Redim preserve RET(1 To ctr)
   Return ctr
End Function
'====================================================================================


'====================================================================================
#include "crt/string.bi"   'needed for memchr function

'helper function for String_tok_r
Private Function h_tok_r(byref ps1 as byte ptr, byref ps2 as byte ptr, byval i as long )as byte ptr
   dim p as byte ptr
   if i > 1 THEN
      while memchr(ps2, *ps1 ,  i) <> 0
         ps1 += 1
      wend
   else
      while *ps1 = *ps2
         ps1 += 1
      wend   
   END IF   
   if *ps1 = 0 THEN
      ps1 = 0
      return 0
   END IF
   p = ps1
   if i > 1 THEN
      while memchr(ps2, *ps1 , i) = 0 
         ps1 += 1
         if *ps1 = 0 then exit while
      wend
   else
      while *ps1 <> *ps2
         ps1 += 1
         if *ps1 = 0 then exit while
      wend   
   end if      
   if *ps1 <> 0 THEN
      *ps1 = 0
      ps1 += 1
   end if
   return p
end function 

'------------------------------------------------------------------------------------
'#include "crt/string.bi"   'needed for memchr function
'thread-safe alternative to Split_tok
'splits TEXT using every char of DELIMIT as delimiter, 
'works as Split_any but only the non-empty elements will be on the RET array 
Private Function Split_tok_r(byref TEXT As String , byref DELIMIT As String ,  RET() As String) As Long
   Dim ptemp As Zstring Ptr
   Dim ctr As Long
   Dim size1 As uLong
   
   dim as long ilen = len(DELIMIT)
   if(len(TEXT) = 0 or ilen = 0) then
      Redim RET(1 To 1)
      RET(1) = TEXT
      return 1
   end if
   dim as string s1 = TEXT            'copy TEXT to avoid alteration of the input
   dim As byte Ptr p = strptr(s1)   'get pointers
   dim As byte Ptr d = strptr(DELIMIT)
   ptemp = p
   while ptemp <> 0
      ptemp = h_tok_r(p, d, ilen)  ' p is sent byref and it will be modified on the f_strtok_r function
      if ptemp <> 0 THEN
         ctr += 1
         if ctr = 1  THEN
            size1 = 1
            Redim RET(1 To size1)     'redim the array
         elseif ctr > size1 then
            size1 *= 2
            redim preserve RET(1 To size1)     'redim the array
         END IF
         RET(ctr)= *ptemp
      end if
   WEND
   If ctr = 0 Then Return 0
   if size1 > ctr then Redim preserve RET(1 To ctr)     'redim the array
   Return ctr
end function
'====================================================================================


'====================================================================================
   'tests
'====================================================================================

'####################################################################################
dim as string s1 = "fortests fforforfor verifications how it is f"
dim as string s2 = "for"
'####################################################################################


dim as long icount
dim as long x
dim as string a()


icount = Split(s1 , s2 , a())
'====================================================================================
print "Split"
print   "initial = >" & s1 & "<" & chr(10) & "delim = >"& s2 & "<" & chr(10)
If LBound( a ) > UBound( a ) Then
   Print "array is empty"
Else
   if ubound(a) <> - 1 then
      print "icount = " & icount & "   lbound = " & lbound(a) & "   ubound = " & ubound(a)
      for x = lbound(a) to icount
         print " x= " & x , ">"& a(x)&"<"
      NEXT
      erase(a)
   end if
End If

Print : print


icount =  Split_noempty(s1 , s2 , a())
'====================================================================================
print "Split_noempty"
print   "initial = >" & s1 & "<" & chr(10) & "delim = >"& s2 & "<" & chr(10)
If LBound( a ) > UBound( a ) Then
   Print "array is empty"
Else
   if ubound(a) <> - 1 then
      print "icount = " & icount & "   lbound = " & lbound(a) & "   ubound = " & ubound(a)
      for x = lbound(a) to icount
         print " x= " & x , ">"& a(x)&"<"
      NEXT
      erase(a)
   end if
End If
Print : print


icount = Split_any(s1 , s2 , a())
'====================================================================================
print "Split_any"
print   "initial = >" & s1 & "<" & chr(10) & "delim = >"& s2 & "<" & chr(10)
If LBound( a ) > UBound( a ) Then
   Print "array is empty"
Else
   if ubound(a) <> - 1 then
      print "icount = " & icount & "   lbound = " & lbound(a) & "   ubound = " & ubound(a)
      for x = lbound(a) to icount
         print " x= " & x , ">"& a(x)&"<"
      NEXT
      erase(a)
   end if
End If
Print : print


icount = Split_tok(s1 , s2 , a())
'====================================================================================
print "Split_tok"
print   "initial = >" & s1 & "<" & chr(10) & "delim = >"& s2 & "<" & chr(10)
If LBound( a ) > UBound( a ) Then
   Print "array is empty"
Else
   if ubound(a) <> - 1 then
      print "icount = " & icount & "   lbound = " & lbound(a) & "   ubound = " & ubound(a)
      for x = lbound(a) to icount
         print " x= " & x , ">"& a(x)&"<"
      NEXT
      erase(a)
   end if
End If
Print : print


icount = Split_tok_r(s1 , s2 , a())
'====================================================================================
print "Split_tok_r"
print   "initial = >" & s1 & "<" & chr(10) & "delim = >"& s2 & "<" & chr(10)
If LBound( a ) > UBound( a ) Then
   Print "array is empty"
Else
   if ubound(a) <> - 1 then
      print "icount = " & icount & "   lbound = " & lbound(a) & "   ubound = " & ubound(a)
      for x = lbound(a) to icount
         print " x= " & x , ">"& a(x)&"<"
      NEXT
      erase(a)
   end if
End If
Print : print

sleep

feel free to test and use...
waiting for comments/remarks or faster solutions
Tourist Trap
Posts: 1957
Joined: Jun 02, 2015 16:24

Re: I need someone to defeat my string splitting algo

Postby Tourist Trap » Jan 14, 2017 13:50

marpon wrote:i've collected from my own archives the different versions of split functions
each of them is optimized for speed
feel free to test and use...
waiting for comments/remarks or faster solutions

Thanks, that looks like the ultimate splitter collection.

Return to “General”

Who is online

Users browsing this forum: No registered users and 2 guests