Search for Text and Replace
Search for Text and Replace
What's the Fast way for Search and Replace in Huge Files.
for example search for "Good" in whole of the file and replace it with "BAD"
for example search for "Good" in whole of the file and replace it with "BAD"
Re: Search for Text and Replace
If you have memory to spare, you could load the entire file into memory, searching & changing the string, and finally writing it out.Alexa wrote:What's the Fast way for Search and Replace in Huge Files.
for example search for "Good" in whole of the file and replace it with "BAD"
If you don't have memory to space, pretty much the same think but read/search/save blocks at a time. Just remember that if "Good" has been split by the block read to rewind the file pointer by the length of "Good" and start the next read from there. As well, the portion saved must exclude that portion skipped.
If you're looking for more of a code sample, let me/us know.
Well I think you are going to have to load it into memory, and you did not mention the 'list' before - it complicates things but it is not imposible.Alexa wrote:it's not loaded into memory, program should search in file "file.txt" (This file maybe a large file, ~ 600MB )and replace. yes, i need code sample too. :D, but i need fast method and i want to search a list,
like this
Search - Replace
Good -> Bad
Dark -> Light
White -> black
... -> ...
Do you have any experience with the regular expression library? It may be the best/fastest way to do the search. There is a example in the regex sub-folder of fbc samples.
Something like:
open output
load file into memory
create regex variable
execute it once
while the regex found something
determine which item of the list it found
write to the output everything before what was found (ie. from last found position to the new position)
write out the replacement string
execute the regex again from the last position+length of found string
loop
close output
(Edited to implement buffer-based search & replace - but it is not working for large files, so for now use this without specifying a buffer size)Alexa wrote:Thanks, I'm in Job now, when i back home, i'll try to make it, but Source Code is Much Effective :D
Now I'm not helping with homework I hope :-)
I've tested this with a small file and three replacement pairs. The pairs are embedded within the program as Data statements and this can be replaced :-) with an external file or some such. Passing the pairs via two arrays is a choice, it could be done with a delimited string or a read/data block that is reset every pass.
It uses the regex library, and assumes you want case-insensitive searches - this can be changed by changing REG_ICASE to 0 (zero)
I know there are no comments, so sue me.
Tigra
Code: Select all
' snr.bas
#include once "crt.bi"
#include once "regex.bi"
Declare Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
End main( __FB_ARGC__, __FB_ARGV__ )
Function SearchAndReplace(ByVal Source As String, ByVal Destination As String, ByVal BufferSize As Integer, Searches() As String, Replaces() As String) As Integer
FUNCTION = EXIT_FAILURE
Dim As Integer hSource
Dim As Integer hDestination
Dim as regex_t re
Dim as regmatch_t rmm
Dim as zstring ptr pBuffer
Dim As String SearchPattern
Dim As String Buffer
Dim As Integer re_flag = 0
Dim As Integer longestSearch
longestSearch = -1
For i as integer = 0 to ubound(Searches)
If i > 0 Then SearchPattern &= "|"
SearchPattern &= "(" & Searches(i) & ")"
If len(Searches(i)) > longestSearch Then longestSearch = len(Searches(i))
Next
If regcomp( @re, SearchPattern, REG_EXTENDED or REG_ICASE ) <> 0 Then
print "Could not compile regular expression """ & SearchPattern & """"
Exit Function
End If
hSource = FreeFile()
Open Source For Binary Access Read As #hSource
If BufferSize <= 0 Then
BufferSize = LOF(hSource)
longestSearch = 0
Else
End If
hDestination = FreeFile()
Open Destination For Binary Access Write As #hDestination
Do While Seek(hSource) < LOF(hSource)
If Seek(hSource) + BufferSize-1 > LOF(hSource) Then BufferSize = LOF(hSource) - Seek(hSource) + 1
Buffer = Space(BufferSize)
Get #hSource, , Buffer
pBuffer = StrPtr(Buffer)
re_flag = 0
Do While regexec( @re, pBuffer, 1, @rmm, re_flag ) = 0
re_flag = REG_NOTBOL
Dim As String Replacement
For i as integer = 0 to ubound(Searches)
If UCase(Searches(i)) = UCase(mid(*pBuffer, 1 + rmm.rm_so, rmm.rm_eo - rmm.rm_so)) Then
Replacement = Replaces(i)
Exit For
End If
Next
Put #hDestination, , Left$(*pBuffer, 1 + rmm.rm_so-1)
Put #hDestination, , Replacement
pBuffer += rmm.rm_eo
Loop
If longestSearch > 0 And Seek(hSource) < LOF(hSource) Then
Put #hDestination, , Left(*pBuffer, len(*pBuffer) - longestSearch)
Seek #hSource, Seek(hSource) - longestSearch
Else
Put #hDestination, , *pBuffer
End If
Loop
regfree( @re )
Close #hSource
Close #hDestination
FUNCTION = EXIT_SUCCESS
End Function
Data 3 ' number of search/replace pairs
Data "Function", "function"
Data "Declare", "declare"
Data " Sub ", " sub "
Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
Dim As String Source
Dim As String Destination
Dim As Integer BufferSize
Dim As String Searches()
Dim As String Replaces()
If argc < 3 Then
print *argv[0] & " Source Destination [buffer]"
print "Source is the file to search"
print "Destination is the output file"
print "buffer is the size of the buffer to use; if omitted or zero or less then the entire file is read into memory"
Else
Dim As Integer nPairs
Read nPairs
ReDim Searches(0 to nPairs-1)
ReDim Replaces(0 to nPairs-1)
For i as integer = 0 to nPairs-1
Read Searches(i)
Read Replaces(i)
Next
Source = *argv[1]
Destination = *argv[2]
If argc >= 4 Then BufferSize = val(*argv[3]) Else BufferSize = 0
Return SearchAndReplace(Source, Destination, BufferSize, Searches(), Replaces())
End If
Return EXIT_SUCCESS
End Function
Thanks Tigra =)
your code is working fine, but i have a big problem, i have to search and replace on a big file for ~10MB, and then i have to search for 200 words and replace those with another 200 words, at this case it's working very slow. i split every 10 words to a new code and a new executable to make it work faster, so i have 20 executable right now and each one contain 10 different words to search.
is there a way to make it more faster ?! i think this code don't use full system resources, because my HDD LED is not flashing for a longtime. so maybe we can use more system resources for this job , what's your idea about it ?
your code is working fine, but i have a big problem, i have to search and replace on a big file for ~10MB, and then i have to search for 200 words and replace those with another 200 words, at this case it's working very slow. i split every 10 words to a new code and a new executable to make it work faster, so i have 20 executable right now and each one contain 10 different words to search.
is there a way to make it more faster ?! i think this code don't use full system resources, because my HDD LED is not flashing for a longtime. so maybe we can use more system resources for this job , what's your idea about it ?
-
- Posts: 165
- Joined: May 27, 2005 14:14
- Location: Marshall, Oklahoma, USA
- Contact:
If I wanted to replace a bunch of words in a string with other words, I would probably do something like Tigra mentionded earlier.
I'd load the file into a String using GET and Binary method
Put all my words to be changed into DATA Like
DATA Good, Bad, Night, Day, up, down, left, right, etc
Put something like the following in a loop.
READ A$
pos = INSTR(TEXT$, A$)
READ B$
MID(TEXT$, pos, LEN(A$)) = B$
I haven't tried it but it looks like it would work.
Later: oops
I tried it out later and the lenth of A$ and B$ would have to be the same. So MID statements doesn't work like I thought.
I'd load the file into a String using GET and Binary method
Put all my words to be changed into DATA Like
DATA Good, Bad, Night, Day, up, down, left, right, etc
Put something like the following in a loop.
READ A$
pos = INSTR(TEXT$, A$)
READ B$
MID(TEXT$, pos, LEN(A$)) = B$
I haven't tried it but it looks like it would work.
Later: oops
I tried it out later and the lenth of A$ and B$ would have to be the same. So MID statements doesn't work like I thought.
Last edited by Jerry Fielden on May 08, 2007 19:01, edited 2 times in total.
I need a Search and Replace Engine something like Winhex ( http://www.winhex.com/winhex/index-m.html ), it replace millions bytes as a second =)
maximum file size for this project is ~200MB
i don't understand how winhex doing this at this speed !, i forgot to tell you i can pay for this code...
maximum file size for this project is ~200MB
i don't understand how winhex doing this at this speed !, i forgot to tell you i can pay for this code...
With that many search-replace pairs it may be that the regular expressions are killing it. I know how to use them but I do not know what the "cost" is. I'll see if I can re-write this to use something besides regex.Alexa wrote:is there a way to make it more faster ?! i think this code don't use full system resources, because my HDD LED is not flashing for a longtime. so maybe we can use more system resources for this job , what's your idea about it ?
With such a large file, why is speed such an issue? I would think it would be updated only occasionally. You did not say how long it takes it to process the 10 words in one .exe
Also, if you are, as I suggested, running the app without specifying a block size then the file is being loaded into memory - that can be the slow down. Can you in the mean time add some code to measure the time to load the file and the time to process the file.
Perhaps the best bet would be a good algorithm.
I have been searching for one an I have found only this one http://en.wikipedia.org/wiki/Rabin-Karp ... _algorithm
It looks promising, unfortunately I have no time to implement it.
I have been searching for one an I have found only this one http://en.wikipedia.org/wiki/Rabin-Karp ... _algorithm
It looks promising, unfortunately I have no time to implement it.
Alexa, I've attached a revised version at this bottom of this message.
Changes are:
- fixed the buffer-size code
- tweaked some code
- loads search/replace pairs from a file
- displays a running counter on the screen, disabled to improve performance; a final duration is also show - all in milliseconds
- detects and handles utf-8 files - this is still partially tested
- detects unicode big & little endian files - these cannot be processed at this time
So now instead of having 20 versions of the program you can use one version and merely run it with different files of search/replace pairs.
I'd like you to try an experiment - put all of your pairs of words into one file and run the program using the buffer size parameter:
snr source destination file-of-pairs ####
where #### can be any number. Try using a large number such as 1MB (1048576) or medium sized, 131072, or small 4096. I got the best performance using 4096!
On my PC (Windows) I only have files of 3MB in size. It took seven seconds to process it with five pairs without using the buffer parameter and 1.2 seconds with a block size of 4096.
I've examined the website Antoni mentioned and it looks promising, but it'll take some time to put something together.
Changes are:
- fixed the buffer-size code
- tweaked some code
- loads search/replace pairs from a file
- displays a running counter on the screen, disabled to improve performance; a final duration is also show - all in milliseconds
- detects and handles utf-8 files - this is still partially tested
- detects unicode big & little endian files - these cannot be processed at this time
So now instead of having 20 versions of the program you can use one version and merely run it with different files of search/replace pairs.
I'd like you to try an experiment - put all of your pairs of words into one file and run the program using the buffer size parameter:
snr source destination file-of-pairs ####
where #### can be any number. Try using a large number such as 1MB (1048576) or medium sized, 131072, or small 4096. I got the best performance using 4096!
On my PC (Windows) I only have files of 3MB in size. It took seven seconds to process it with five pairs without using the buffer parameter and 1.2 seconds with a block size of 4096.
I've examined the website Antoni mentioned and it looks promising, but it'll take some time to put something together.
Code: Select all
' snr.bas
#include once "crt.bi"
#include once "regex.bi"
#include once "sdl/sdl_timer.bi"
Declare Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
End main( __FB_ARGC__, __FB_ARGV__ )
Function SearchAndReplace(ByVal Source As String, ByVal Destination As String, ByVal BufferSize As Integer, Searches() As String, Replaces() As String) As Integer
FUNCTION = EXIT_FAILURE
Dim As Integer hSource
Dim As Integer hDestination
Dim as regex_t re
Dim as regmatch_t rmm
Dim as zstring ptr pBuffer
Dim As String SearchPattern
Dim As String Buffer
Dim As Integer re_flag = 0
Dim As Integer longestSearch
Dim As Integer UBoundSearches = UBound(Searches)
Dim As Integer RemainingBufferSize
Dim As Integer IsUCS_little_endian = 0
Dim As Integer IsUCS_big_endian = 0
Dim As Integer IsUTF8 = 0
Const TRUE = Not 0
Const FALSE = 0
hSource = FreeFile()
Open Source For Binary Access Read As #hSource
Buffer = space(3)
Get #hSource, , Buffer
If Buffer[0] = &hFF And Buffer[1] = &hFE Then
IsUCS_little_endian = TRUE
Seek hSource, 3
Buffer = Left(Buffer, 2)
ElseIf Buffer[0] = &hFE And Buffer[1] = &hFF Then
IsUCS_big_endian = TRUE
Seek hSource, 3
Buffer = Left(Buffer, 2)
ElseIf Buffer[0] = &hEF And Buffer[1] = &hBB And Buffer[2] = &hBF Then
IsUTF8 = TRUE
Else
Seek hSource, 1
End If
If IsUCS_little_endian Or IsUCS_big_endian Then
If IsUCS_little_endian Then
print "UCS-2 little endian files cannot be processed"
ElseIf IsUCS_big_endian Then
print "UCS-2 big endian files cannot be processed"
End If
Close #hSource
Exit Function
End If
hDestination = FreeFile()
Open Destination For Binary Access Write As #hDestination
' in case the program can handle these file types in the future...
If IsUTF8 Or IsUCS_little_endian Or IsUCS_big_endian Then
Put #hDestination, , Buffer
End If
longestSearch = -1
For i as integer = 0 to UBoundSearches
If i > 0 Then SearchPattern &= "|"
If IsUCS_little_endian Then
SearchPattern &= "(" & WStr(Searches(i)) & ")"
Else
SearchPattern &= "(" & Searches(i) & ")"
End If
If len(Searches(i)) > longestSearch Then longestSearch = len(Searches(i))
Searches(i) = UCase(Searches(i))
Next
If BufferSize <= 0 Then
BufferSize = LOF(hSource)
longestSearch = 0
ElseIf BufferSize < longestSearch Then
BufferSize = longestSearch * 1.5
End If
If regcomp( @re, SearchPattern, REG_EXTENDED or REG_ICASE ) <> 0 Then
print "Could not compile regular expression """ & SearchPattern & """"
Close #hSource
Close #hDestination
Exit Function
End If
cls
Print "Processing "; Source; " to "; Destination; " in "; (LOF(hSource) \ BufferSize + sgn(LOF(hSource) Mod BufferSize)); " blocks of "; BufferSize
Dim As UInt32 uiOverall = SDL_GetTicks()
Dim As UInt32 uiNow = uiOverall
Dim As UInt32 uiThen
Dim As UInt32 uiTimes(10)
Do While Seek(hSource) < LOF(hSource)
If Seek(hSource) + BufferSize-1 > LOF(hSource) Then BufferSize = LOF(hSource) - Seek(hSource) + 1
Buffer = Space(BufferSize): RemainingBufferSize = BufferSize
Get #hSource, , Buffer
uiThen = SDL_GetTicks(): uiTimes(0) = uiThen - uiNow: uiNow = uiThen
pBuffer = StrPtr(Buffer)
re_flag = 0
Do While regexec( @re, pBuffer, 1, @rmm, re_flag ) = 0
re_flag = REG_NOTBOL
Dim As String Replacement = ""
Dim As String RawSource = mid(*pBuffer, 1 + rmm.rm_so, rmm.rm_eo - rmm.rm_so)
Dim As String UCaseRawSource = UCase(RawSource)
For i as integer = 0 to UBoundSearches
If Searches(i) = UCaseRawSource Then
Replacement = Replaces(i)
Exit For
End If
Next
Put #hDestination, , Left$(*pBuffer, 1 + rmm.rm_so-1)
Put #hDestination, , Replacement
pBuffer += rmm.rm_eo
RemainingBufferSize -= rmm.rm_eo
Loop
If pBuffer = StrPtr(Buffer) Then
uiTimes(3) += 1
Else
uiTimes(4) += 1
End If
uiThen = SDL_GetTicks(): uiTimes(1) = uiThen - uiNow: uiNow = uiThen
If longestSearch > 0 And RemainingBufferSize >= longestSearch And Seek(hSource) < LOF(hSource) Then
uiTimes(5) += 1
Put #hDestination, , Left(*pBuffer, len(*pBuffer) - longestSearch)
Seek #hSource, Seek(hSource) - longestSearch
Else
uiTimes(6) += 1
Put #hDestination, , *pBuffer
End If
uiThen = SDL_GetTicks(): uiTimes(2) = uiThen - uiNow: uiNow = uiThen
''~ locate 2, 0: print using "Load ####, process ####, output ####, (###### / ######), (###### / ######)"; uiTimes(0), uiTimes(1), uiTimes(2), uiTimes(3), uiTimes(4), uiTimes(5), uiTimes(6)
Loop
uiThen = SDL_GetTicks(): print "To process the file " & ( uiThen - uiOverall )
regfree( @re )
Close #hSource
Close #hDestination
FUNCTION = EXIT_SUCCESS
End Function
Data 44 ' number of search/replace pairs
Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
Dim As String Source
Dim As String Destination
Dim As String Strings
Dim As Integer BufferSize
Dim As String Searches()
Dim As String Replaces()
If argc < 4 Then
print *argv[0] & " Source Destination Strings [buffer]"
print "Source is the file to search"
print "Destination is the output file"
print "Strings is the file with search-replace pairs"
print "buffer is the size of the buffer to use; if omitted or zero or less then the entire file is read into memory"
Else
Dim As String Searches()
Dim As String Replaces()
Dim As Integer nPairs = 0
Dim As Integer hPairs = FreeFile()
Source = *argv[1]
Destination = *argv[2]
Strings = *argv[3]
Open Strings For Input As #hPairs
Do While Not EOF(hPairs)
ReDim Preserve Searches(0 To nPairs)
ReDim Preserve Replaces(0 To nPairs)
Input #hPairs, Searches(nPairs), Replaces(nPairs)
nPairs += 1
Loop
Close #hPairs
If argc >= 5 Then BufferSize = val(*argv[4]) Else BufferSize = 0
Return SearchAndReplace(Source, Destination, BufferSize, Searches(), Replaces())
End If
Return EXIT_SUCCESS
End Function
Re: Search for Text and Replace
Alexa, I just had a revelation, usefullness is questionable.Alexa wrote:What's the Fast way for Search and Replace in Huge Files.
for example search for "Good" in whole of the file and replace it with "BAD"
Do you have any experience with linux/unix utilities, specifically sed, the stream editor?
From what I know of sed it is designed to run though a file once, processing it as it goes. I'm confident that it can quickly do what you want. Of course if you have not experience with it, well, sorry.
Tigra
Alexa, I've written an alternate version of my program, with a simple list and a 1MB file is runs twice as fast as the previous version.
I've got some ideas that may improve it moreso, but I'd like to hear back from you if you are interested.
I've thought of a short cut it can take if the search word is eight or less characters, and another short cut if all of the search words are more than eight characters.
Tigra
I've got some ideas that may improve it moreso, but I'd like to hear back from you if you are interested.
I've thought of a short cut it can take if the search word is eight or less characters, and another short cut if all of the search words are more than eight characters.
Tigra