Search for Text and Replace

New to FreeBASIC? Post your questions here.
Alexa
Posts: 56
Joined: May 01, 2007 20:22

Search for Text and Replace

Postby Alexa » May 06, 2007 12:20

What's the Fast way for Search and Replace in Huge Files.
for example search for "Good" in whole of the file and replace it with "BAD"
Tigra
Posts: 155
Joined: Jan 07, 2007 17:21

Re: Search for Text and Replace

Postby Tigra » May 06, 2007 13:38

Alexa wrote:What's the Fast way for Search and Replace in Huge Files.
for example search for "Good" in whole of the file and replace it with "BAD"


If you have memory to spare, you could load the entire file into memory, searching & changing the string, and finally writing it out.

If you don't have memory to space, pretty much the same think but read/search/save blocks at a time. Just remember that if "Good" has been split by the block read to rewind the file pointer by the length of "Good" and start the next read from there. As well, the portion saved must exclude that portion skipped.

If you're looking for more of a code sample, let me/us know.
Alexa
Posts: 56
Joined: May 01, 2007 20:22

Postby Alexa » May 06, 2007 13:55

it's not loaded into memory, program should search in file "file.txt" (This file maybe a large file, ~ 600MB )and replace. yes, i need code sample too. :D, but i need fast method and i want to search a list,

like this

Search - Replace
Good -> Bad
Dark -> Light
White -> black
... -> ...
Tigra
Posts: 155
Joined: Jan 07, 2007 17:21

Postby Tigra » May 06, 2007 14:22

Alexa wrote:it's not loaded into memory, program should search in file "file.txt" (This file maybe a large file, ~ 600MB )and replace. yes, i need code sample too. :D, but i need fast method and i want to search a list,

like this

Search - Replace
Good -> Bad
Dark -> Light
White -> black
... -> ...


Well I think you are going to have to load it into memory, and you did not mention the 'list' before - it complicates things but it is not imposible.

Do you have any experience with the regular expression library? It may be the best/fastest way to do the search. There is a example in the regex sub-folder of fbc samples.

Something like:

open output
load file into memory
create regex variable
execute it once
while the regex found something
determine which item of the list it found
write to the output everything before what was found (ie. from last found position to the new position)
write out the replacement string
execute the regex again from the last position+length of found string
loop
close output
Alexa
Posts: 56
Joined: May 01, 2007 20:22

Postby Alexa » May 06, 2007 15:29

Thanks, I'm in Job now, when i back home, i'll try to make it, but Source Code is Much Effective :D
Tigra
Posts: 155
Joined: Jan 07, 2007 17:21

Postby Tigra » May 06, 2007 17:07

Alexa wrote:Thanks, I'm in Job now, when i back home, i'll try to make it, but Source Code is Much Effective :D


(Edited to implement buffer-based search & replace - but it is not working for large files, so for now use this without specifying a buffer size)

Now I'm not helping with homework I hope :-)

I've tested this with a small file and three replacement pairs. The pairs are embedded within the program as Data statements and this can be replaced :-) with an external file or some such. Passing the pairs via two arrays is a choice, it could be done with a delimited string or a read/data block that is reset every pass.

It uses the regex library, and assumes you want case-insensitive searches - this can be changed by changing REG_ICASE to 0 (zero)

I know there are no comments, so sue me.

Tigra

Code: Select all

'   snr.bas

#include once "crt.bi"
#include once "regex.bi"

Declare Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
End main( __FB_ARGC__, __FB_ARGV__ )

Function SearchAndReplace(ByVal Source As String, ByVal Destination As String, ByVal BufferSize As Integer, Searches() As String, Replaces() As String) As Integer
   FUNCTION = EXIT_FAILURE

   Dim As Integer hSource
   Dim As Integer hDestination
   Dim as regex_t re
   Dim as regmatch_t rmm
   Dim as zstring ptr pBuffer
   Dim As String SearchPattern
   Dim As String Buffer
   Dim As Integer re_flag = 0
   Dim As Integer longestSearch

   longestSearch = -1
   For i as integer = 0 to ubound(Searches)
      If i > 0 Then SearchPattern &= "|"
      SearchPattern &= "(" & Searches(i) & ")"
      If len(Searches(i)) > longestSearch Then longestSearch = len(Searches(i))
   Next

   If regcomp( @re, SearchPattern, REG_EXTENDED or REG_ICASE ) <> 0 Then
      print "Could not compile regular expression """ & SearchPattern & """"
      Exit Function
   End If

   hSource = FreeFile()
   Open Source For Binary Access Read As #hSource

   If BufferSize <= 0 Then
      BufferSize = LOF(hSource)
      longestSearch = 0
   Else
   End If

   hDestination = FreeFile()
   Open Destination For Binary Access Write As #hDestination

   Do While Seek(hSource) < LOF(hSource)
      If Seek(hSource) + BufferSize-1 > LOF(hSource) Then BufferSize = LOF(hSource) - Seek(hSource) + 1
      Buffer = Space(BufferSize)
      Get #hSource, , Buffer
      pBuffer = StrPtr(Buffer)
      re_flag = 0
      Do While regexec( @re, pBuffer, 1, @rmm, re_flag ) = 0
         re_flag = REG_NOTBOL
         Dim As String Replacement
         For i as integer = 0 to ubound(Searches)
            If UCase(Searches(i)) = UCase(mid(*pBuffer, 1 + rmm.rm_so, rmm.rm_eo - rmm.rm_so)) Then
               Replacement = Replaces(i)
               Exit For
            End If
         Next
         Put #hDestination, , Left$(*pBuffer, 1 + rmm.rm_so-1)
         Put #hDestination, , Replacement
         pBuffer += rmm.rm_eo
      Loop
      If longestSearch > 0 And Seek(hSource) < LOF(hSource) Then
         Put #hDestination, , Left(*pBuffer, len(*pBuffer) - longestSearch)
         Seek #hSource, Seek(hSource) - longestSearch
      Else
         Put #hDestination, , *pBuffer
      End If
   Loop

   regfree( @re )

   Close #hSource
   Close #hDestination

   FUNCTION = EXIT_SUCCESS

End Function

Data 3 ' number of search/replace pairs
Data "Function", "function"
Data "Declare", "declare"
Data " Sub ", " sub "

Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
   Dim As String Source
   Dim As String Destination
   Dim As Integer BufferSize
   Dim As String Searches()
   Dim As String Replaces()

   If argc < 3 Then
      print *argv[0] & " Source Destination [buffer]"
      print "Source is the file to search"
      print "Destination is the output file"
      print "buffer is the size of the buffer to use; if omitted or zero or less then the entire file is read into memory"
   Else
      Dim As Integer nPairs

      Read nPairs
      ReDim Searches(0 to nPairs-1)
      ReDim Replaces(0 to nPairs-1)

      For i as integer = 0 to nPairs-1
         Read Searches(i)
         Read Replaces(i)
      Next

      Source = *argv[1]
      Destination = *argv[2]
      If argc >= 4 Then BufferSize = val(*argv[3]) Else BufferSize = 0
      Return SearchAndReplace(Source, Destination, BufferSize, Searches(), Replaces())
   End If
   Return EXIT_SUCCESS
End Function
Alexa
Posts: 56
Joined: May 01, 2007 20:22

Postby Alexa » May 07, 2007 10:13

Thanks Tigra =)

your code is working fine, but i have a big problem, i have to search and replace on a big file for ~10MB, and then i have to search for 200 words and replace those with another 200 words, at this case it's working very slow. i split every 10 words to a new code and a new executable to make it work faster, so i have 20 executable right now and each one contain 10 different words to search.

is there a way to make it more faster ?! i think this code don't use full system resources, because my HDD LED is not flashing for a longtime. so maybe we can use more system resources for this job , what's your idea about it ?
Jerry Fielden
Posts: 165
Joined: May 27, 2005 14:14
Location: Marshall, Oklahoma, USA
Contact:

Postby Jerry Fielden » May 07, 2007 13:46

If I wanted to replace a bunch of words in a string with other words, I would probably do something like Tigra mentionded earlier.

I'd load the file into a String using GET and Binary method

Put all my words to be changed into DATA Like

DATA Good, Bad, Night, Day, up, down, left, right, etc

Put something like the following in a loop.

READ A$
pos = INSTR(TEXT$, A$)
READ B$
MID(TEXT$, pos, LEN(A$)) = B$

I haven't tried it but it looks like it would work.

Later: oops
I tried it out later and the lenth of A$ and B$ would have to be the same. So MID statements doesn't work like I thought.
Last edited by Jerry Fielden on May 08, 2007 19:01, edited 2 times in total.
Alexa
Posts: 56
Joined: May 01, 2007 20:22

Postby Alexa » May 07, 2007 15:01

I need a Search and Replace Engine something like Winhex ( http://www.winhex.com/winhex/index-m.html ), it replace millions bytes as a second =)
maximum file size for this project is ~200MB
i don't understand how winhex doing this at this speed !, i forgot to tell you i can pay for this code...
Tigra
Posts: 155
Joined: Jan 07, 2007 17:21

Postby Tigra » May 08, 2007 1:31

Alexa wrote:is there a way to make it more faster ?! i think this code don't use full system resources, because my HDD LED is not flashing for a longtime. so maybe we can use more system resources for this job , what's your idea about it ?


With that many search-replace pairs it may be that the regular expressions are killing it. I know how to use them but I do not know what the "cost" is. I'll see if I can re-write this to use something besides regex.

With such a large file, why is speed such an issue? I would think it would be updated only occasionally. You did not say how long it takes it to process the 10 words in one .exe

Also, if you are, as I suggested, running the app without specifying a block size then the file is being loaded into memory - that can be the slow down. Can you in the mean time add some code to measure the time to load the file and the time to process the file.
axipher
Posts: 891
Joined: Dec 27, 2005 16:37
Location: Sudbury,Ontario
Contact:

Postby axipher » May 08, 2007 2:52

Well if you want speed, instead of using strings, use arrays of bytes and use ASM. That's your best bet for speed.
Antoni
Posts: 1393
Joined: May 27, 2005 15:40
Location: Barcelona, Spain

Postby Antoni » May 08, 2007 6:17

Perhaps the best bet would be a good algorithm.
I have been searching for one an I have found only this one http://en.wikipedia.org/wiki/Rabin-Karp ... _algorithm
It looks promising, unfortunately I have no time to implement it.
Tigra
Posts: 155
Joined: Jan 07, 2007 17:21

Postby Tigra » May 11, 2007 20:01

Alexa, I've attached a revised version at this bottom of this message.

Changes are:
- fixed the buffer-size code
- tweaked some code
- loads search/replace pairs from a file
- displays a running counter on the screen, disabled to improve performance; a final duration is also show - all in milliseconds
- detects and handles utf-8 files - this is still partially tested
- detects unicode big & little endian files - these cannot be processed at this time

So now instead of having 20 versions of the program you can use one version and merely run it with different files of search/replace pairs.

I'd like you to try an experiment - put all of your pairs of words into one file and run the program using the buffer size parameter:
snr source destination file-of-pairs ####
where #### can be any number. Try using a large number such as 1MB (1048576) or medium sized, 131072, or small 4096. I got the best performance using 4096!

On my PC (Windows) I only have files of 3MB in size. It took seven seconds to process it with five pairs without using the buffer parameter and 1.2 seconds with a block size of 4096.

I've examined the website Antoni mentioned and it looks promising, but it'll take some time to put something together.

Code: Select all

'   snr.bas

#include once "crt.bi"
#include once "regex.bi"
#include once "sdl/sdl_timer.bi"

Declare Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
End main( __FB_ARGC__, __FB_ARGV__ )

Function SearchAndReplace(ByVal Source As String, ByVal Destination As String, ByVal BufferSize As Integer, Searches() As String, Replaces() As String) As Integer
   FUNCTION = EXIT_FAILURE

   Dim As Integer hSource
   Dim As Integer hDestination
   Dim as regex_t re
   Dim as regmatch_t rmm
   Dim as zstring ptr pBuffer
   Dim As String SearchPattern
   Dim As String Buffer
   Dim As Integer re_flag = 0
   Dim As Integer longestSearch
   Dim As Integer UBoundSearches = UBound(Searches)
   Dim As Integer RemainingBufferSize
   Dim As Integer IsUCS_little_endian = 0
   Dim As Integer IsUCS_big_endian = 0
   Dim As Integer IsUTF8 = 0

   Const TRUE = Not 0
   Const FALSE = 0

   hSource = FreeFile()
   Open Source For Binary Access Read As #hSource

   Buffer = space(3)
   Get #hSource, , Buffer
   If Buffer[0] = &hFF And Buffer[1] = &hFE Then
      IsUCS_little_endian = TRUE
      Seek hSource, 3
      Buffer = Left(Buffer, 2)
   ElseIf Buffer[0] = &hFE And Buffer[1] = &hFF Then
      IsUCS_big_endian = TRUE
      Seek hSource, 3
      Buffer = Left(Buffer, 2)
   ElseIf Buffer[0] = &hEF And Buffer[1] = &hBB And Buffer[2] = &hBF Then
      IsUTF8 = TRUE
   Else
      Seek hSource, 1
   End If

   If IsUCS_little_endian Or IsUCS_big_endian Then
      If IsUCS_little_endian Then
         print "UCS-2 little endian files cannot be processed"
      ElseIf IsUCS_big_endian Then
         print "UCS-2 big endian files cannot be processed"
      End If
      Close #hSource
      Exit Function
   End If

   hDestination = FreeFile()
   Open Destination For Binary Access Write As #hDestination

   ' in case the program can handle these file types in the future...
   If IsUTF8 Or IsUCS_little_endian Or IsUCS_big_endian Then
      Put #hDestination, , Buffer
   End If

   longestSearch = -1
   For i as integer = 0 to UBoundSearches
      If i > 0 Then SearchPattern &= "|"
      If IsUCS_little_endian Then
         SearchPattern &= "(" & WStr(Searches(i)) & ")"
      Else
         SearchPattern &= "(" & Searches(i) & ")"
      End If
      If len(Searches(i)) > longestSearch Then longestSearch = len(Searches(i))
      Searches(i) = UCase(Searches(i))
   Next

   If BufferSize <= 0 Then
      BufferSize = LOF(hSource)
      longestSearch = 0
   ElseIf BufferSize < longestSearch Then
      BufferSize = longestSearch * 1.5
   End If

   If regcomp( @re, SearchPattern, REG_EXTENDED or REG_ICASE ) <> 0 Then
      print "Could not compile regular expression """ & SearchPattern & """"
      Close #hSource
      Close #hDestination
      Exit Function
   End If

   cls

   Print "Processing "; Source; " to "; Destination; " in "; (LOF(hSource) \ BufferSize + sgn(LOF(hSource) Mod BufferSize)); " blocks of "; BufferSize

   Dim As UInt32 uiOverall = SDL_GetTicks()
   Dim As UInt32 uiNow = uiOverall
   Dim As UInt32 uiThen
   Dim As UInt32 uiTimes(10)

   Do While Seek(hSource) < LOF(hSource)
      If Seek(hSource) + BufferSize-1 > LOF(hSource) Then BufferSize = LOF(hSource) - Seek(hSource) + 1
      Buffer = Space(BufferSize): RemainingBufferSize = BufferSize
      Get #hSource, , Buffer
      uiThen = SDL_GetTicks(): uiTimes(0) = uiThen - uiNow: uiNow = uiThen
      pBuffer = StrPtr(Buffer)
      re_flag = 0
      Do While regexec( @re, pBuffer, 1, @rmm, re_flag ) = 0
         re_flag = REG_NOTBOL
         Dim As String Replacement = ""
         Dim As String RawSource = mid(*pBuffer, 1 + rmm.rm_so, rmm.rm_eo - rmm.rm_so)
         Dim As String UCaseRawSource = UCase(RawSource)
         For i as integer = 0 to UBoundSearches
            If Searches(i) = UCaseRawSource Then
               Replacement = Replaces(i)
               Exit For
            End If
         Next
         Put #hDestination, , Left$(*pBuffer, 1 + rmm.rm_so-1)
         Put #hDestination, , Replacement
         pBuffer += rmm.rm_eo
         RemainingBufferSize -= rmm.rm_eo
      Loop
      If pBuffer = StrPtr(Buffer) Then
         uiTimes(3) += 1
      Else
         uiTimes(4) += 1
      End If
      uiThen = SDL_GetTicks(): uiTimes(1) = uiThen - uiNow: uiNow = uiThen
      If longestSearch > 0 And RemainingBufferSize >= longestSearch And Seek(hSource) < LOF(hSource) Then
         uiTimes(5) += 1
         Put #hDestination, , Left(*pBuffer, len(*pBuffer) - longestSearch)
         Seek #hSource, Seek(hSource) - longestSearch
      Else
         uiTimes(6) += 1
         Put #hDestination, , *pBuffer
      End If
      uiThen = SDL_GetTicks(): uiTimes(2) = uiThen - uiNow: uiNow = uiThen

      ''~ locate 2, 0: print using "Load ####, process ####, output ####, (###### / ######), (###### / ######)"; uiTimes(0), uiTimes(1), uiTimes(2), uiTimes(3), uiTimes(4), uiTimes(5), uiTimes(6)
    Loop

   uiThen = SDL_GetTicks(): print "To process the file " & ( uiThen - uiOverall )

   regfree( @re )

   Close #hSource
   Close #hDestination

   FUNCTION = EXIT_SUCCESS

End Function

Data 44 ' number of search/replace pairs

Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
   Dim As String Source
   Dim As String Destination
   Dim As String Strings
   Dim As Integer BufferSize
   Dim As String Searches()
   Dim As String Replaces()

   If argc < 4 Then
      print *argv[0] & " Source Destination Strings [buffer]"
      print "Source is the file to search"
      print "Destination is the output file"
      print "Strings is the file with search-replace pairs"
      print "buffer is the size of the buffer to use; if omitted or zero or less then the entire file is read into memory"
   Else
      Dim As String Searches()
      Dim As String Replaces()
      Dim As Integer nPairs = 0
      Dim As Integer hPairs = FreeFile()

      Source = *argv[1]
      Destination = *argv[2]
      Strings = *argv[3]

      Open Strings For Input As #hPairs
      Do While Not EOF(hPairs)
         ReDim Preserve Searches(0 To nPairs)
         ReDim Preserve Replaces(0 To nPairs)
         Input #hPairs, Searches(nPairs), Replaces(nPairs)
         nPairs += 1
      Loop
      Close #hPairs

      If argc >= 5 Then BufferSize = val(*argv[4]) Else BufferSize = 0
      Return SearchAndReplace(Source, Destination, BufferSize, Searches(), Replaces())
   End If
   Return EXIT_SUCCESS
End Function
Tigra
Posts: 155
Joined: Jan 07, 2007 17:21

Re: Search for Text and Replace

Postby Tigra » May 11, 2007 22:05

Alexa wrote:What's the Fast way for Search and Replace in Huge Files.
for example search for "Good" in whole of the file and replace it with "BAD"


Alexa, I just had a revelation, usefullness is questionable.

Do you have any experience with linux/unix utilities, specifically sed, the stream editor?

From what I know of sed it is designed to run though a file once, processing it as it goes. I'm confident that it can quickly do what you want. Of course if you have not experience with it, well, sorry.

Tigra
Tigra
Posts: 155
Joined: Jan 07, 2007 17:21

Postby Tigra » May 17, 2007 12:51

Alexa, I've written an alternate version of my program, with a simple list and a 1MB file is runs twice as fast as the previous version.

I've got some ideas that may improve it moreso, but I'd like to hear back from you if you are interested.

I've thought of a short cut it can take if the search word is eight or less characters, and another short cut if all of the search words are more than eight characters.

Tigra

Return to “Beginners”

Who is online

Users browsing this forum: No registered users and 3 guests