Search for Text and Replace

Alexa · Post by **Alexa** » May 06, 2007 12:20

What's the Fast way for Search and Replace in Huge Files.
for example search for "Good" in whole of the file and replace it with "BAD"

Tigra · Post by **Tigra** » May 06, 2007 13:38

Alexa wrote:What's the Fast way for Search and Replace in Huge Files.
for example search for "Good" in whole of the file and replace it with "BAD"

If you have memory to spare, you could load the entire file into memory, searching & changing the string, and finally writing it out.

If you don't have memory to space, pretty much the same think but read/search/save blocks at a time. Just remember that if "Good" has been split by the block read to rewind the file pointer by the length of "Good" and start the next read from there. As well, the portion saved must exclude that portion skipped.

If you're looking for more of a code sample, let me/us know.

Alexa · Post by **Alexa** » May 06, 2007 13:55

it's not loaded into memory, program should search in file "file.txt" (This file maybe a large file, ~ 600MB )and replace. yes, i need code sample too. :D, but i need fast method and i want to search a list,

like this

Search - Replace
Good -> Bad
Dark -> Light
White -> black
... -> ...

Tigra · Post by **Tigra** » May 06, 2007 14:22

Alexa wrote:it's not loaded into memory, program should search in file "file.txt" (This file maybe a large file, ~ 600MB )and replace. yes, i need code sample too. :D, but i need fast method and i want to search a list,

like this

Search - Replace
Good -> Bad
Dark -> Light
White -> black
... -> ...

Well I think you are going to have to load it into memory, and you did not mention the 'list' before - it complicates things but it is not imposible.

Do you have any experience with the regular expression library? It may be the best/fastest way to do the search. There is a example in the regex sub-folder of fbc samples.

Something like:

open output
load file into memory
create regex variable
execute it once
while the regex found something
determine which item of the list it found
write to the output everything before what was found (ie. from last found position to the new position)
write out the replacement string
execute the regex again from the last position+length of found string
loop
close output

Alexa · Post by **Alexa** » May 06, 2007 15:29

Thanks, I'm in Job now, when i back home, i'll try to make it, but Source Code is Much Effective :D

Tigra · Post by **Tigra** » May 06, 2007 17:07

Alexa wrote:Thanks, I'm in Job now, when i back home, i'll try to make it, but Source Code is Much Effective :D

(Edited to implement buffer-based search & replace - but it is not working for large files, so for now use this without specifying a buffer size)

Now I'm not helping with homework I hope :-)

I've tested this with a small file and three replacement pairs. The pairs are embedded within the program as Data statements and this can be replaced :-) with an external file or some such. Passing the pairs via two arrays is a choice, it could be done with a delimited string or a read/data block that is reset every pass.

It uses the regex library, and assumes you want case-insensitive searches - this can be changed by changing REG_ICASE to 0 (zero)

I know there are no comments, so sue me.

Tigra

Code: Select all

'	snr.bas

#include once "crt.bi"
#include once "regex.bi"

Declare Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
End main( __FB_ARGC__, __FB_ARGV__ )

Function SearchAndReplace(ByVal Source As String, ByVal Destination As String, ByVal BufferSize As Integer, Searches() As String, Replaces() As String) As Integer
	FUNCTION = EXIT_FAILURE

	Dim As Integer hSource
	Dim As Integer hDestination
	Dim as regex_t re
	Dim as regmatch_t rmm
	Dim as zstring ptr pBuffer
	Dim As String SearchPattern
	Dim As String Buffer
	Dim As Integer re_flag = 0
	Dim As Integer longestSearch

	longestSearch = -1
	For i as integer = 0 to ubound(Searches)
		If i > 0 Then SearchPattern &= "|"
		SearchPattern &= "(" & Searches(i) & ")"
		If len(Searches(i)) > longestSearch Then longestSearch = len(Searches(i))
	Next

	If regcomp( @re, SearchPattern, REG_EXTENDED or REG_ICASE ) <> 0 Then
		print "Could not compile regular expression """ & SearchPattern & """"
		Exit Function
	End If

	hSource = FreeFile()
	Open Source For Binary Access Read As #hSource

	If BufferSize <= 0 Then
		BufferSize = LOF(hSource)
		longestSearch = 0
	Else
	End If

	hDestination = FreeFile()
	Open Destination For Binary Access Write As #hDestination

	Do While Seek(hSource) < LOF(hSource)
		If Seek(hSource) + BufferSize-1 > LOF(hSource) Then BufferSize = LOF(hSource) - Seek(hSource) + 1
		Buffer = Space(BufferSize)
		Get #hSource, , Buffer
		pBuffer = StrPtr(Buffer)
		re_flag = 0
		Do While regexec( @re, pBuffer, 1, @rmm, re_flag ) = 0
			re_flag = REG_NOTBOL
			Dim As String Replacement
			For i as integer = 0 to ubound(Searches)
				If UCase(Searches(i)) = UCase(mid(*pBuffer, 1 + rmm.rm_so, rmm.rm_eo - rmm.rm_so)) Then
					Replacement = Replaces(i)
					Exit For
				End If
			Next
			Put #hDestination, , Left$(*pBuffer, 1 + rmm.rm_so-1)
			Put #hDestination, , Replacement
			pBuffer += rmm.rm_eo
		Loop
		If longestSearch > 0 And Seek(hSource) < LOF(hSource) Then
			Put #hDestination, , Left(*pBuffer, len(*pBuffer) - longestSearch)
			Seek #hSource, Seek(hSource) - longestSearch
		Else
			Put #hDestination, , *pBuffer
		End If
	Loop

	regfree( @re )

	Close #hSource
	Close #hDestination

	FUNCTION = EXIT_SUCCESS

End Function

Data 3 ' number of search/replace pairs
Data "Function", "function"
Data "Declare", "declare"
Data " Sub ", " sub "

Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
	Dim As String Source
	Dim As String Destination
	Dim As Integer BufferSize
	Dim As String Searches()
	Dim As String Replaces()

	If argc < 3 Then
		print *argv[0] & " Source Destination [buffer]"
		print "Source is the file to search"
		print "Destination is the output file"
		print "buffer is the size of the buffer to use; if omitted or zero or less then the entire file is read into memory"
	Else
		Dim As Integer nPairs

		Read nPairs
		ReDim Searches(0 to nPairs-1)
		ReDim Replaces(0 to nPairs-1)

		For i as integer = 0 to nPairs-1
			Read Searches(i)
			Read Replaces(i)
		Next

		Source = *argv[1]
		Destination = *argv[2]
		If argc >= 4 Then BufferSize = val(*argv[3]) Else BufferSize = 0
		Return SearchAndReplace(Source, Destination, BufferSize, Searches(), Replaces())
	End If
	Return EXIT_SUCCESS
End Function

Alexa · Post by **Alexa** » May 07, 2007 10:13

Thanks Tigra =)

your code is working fine, but i have a big problem, i have to search and replace on a big file for ~10MB, and then i have to search for 200 words and replace those with another 200 words, at this case it's working very slow. i split every 10 words to a new code and a new executable to make it work faster, so i have 20 executable right now and each one contain 10 different words to search.

is there a way to make it more faster ?! i think this code don't use full system resources, because my HDD LED is not flashing for a longtime. so maybe we can use more system resources for this job , what's your idea about it ?

Jerry Fielden · Post by **Jerry Fielden** » May 07, 2007 13:46

If I wanted to replace a bunch of words in a string with other words, I would probably do something like Tigra mentionded earlier.

I'd load the file into a String using GET and Binary method

Put all my words to be changed into DATA Like

DATA Good, Bad, Night, Day, up, down, left, right, etc

Put something like the following in a loop.

READ A$
pos = INSTR(TEXT$, A$)
READ B$
MID(TEXT$, pos, LEN(A$)) = B$

I haven't tried it but it looks like it would work.

Later: oops
I tried it out later and the lenth of A$ and B$ would have to be the same. So MID statements doesn't work like I thought.

Alexa · Post by **Alexa** » May 07, 2007 15:01

I need a Search and Replace Engine something like Winhex ( http://www.winhex.com/winhex/index-m.html ), it replace millions bytes as a second =)
maximum file size for this project is ~200MB
i don't understand how winhex doing this at this speed !, i forgot to tell you i can pay for this code...

Tigra · Post by **Tigra** » May 08, 2007 1:31

Alexa wrote:is there a way to make it more faster ?! i think this code don't use full system resources, because my HDD LED is not flashing for a longtime. so maybe we can use more system resources for this job , what's your idea about it ?

With that many search-replace pairs it may be that the regular expressions are killing it. I know how to use them but I do not know what the "cost" is. I'll see if I can re-write this to use something besides regex.

With such a large file, why is speed such an issue? I would think it would be updated only occasionally. You did not say how long it takes it to process the 10 words in one .exe

Also, if you are, as I suggested, running the app without specifying a block size then the file is being loaded into memory - that can be the slow down. Can you in the mean time add some code to measure the time to load the file and the time to process the file.

axipher · Post by **axipher** » May 08, 2007 2:52

Well if you want speed, instead of using strings, use arrays of bytes and use ASM. That's your best bet for speed.

Antoni · Post by **Antoni** » May 08, 2007 6:17

Perhaps the best bet would be a good algorithm.
I have been searching for one an I have found only this one http://en.wikipedia.org/wiki/Rabin-Karp ... _algorithm
It looks promising, unfortunately I have no time to implement it.

Tigra · Post by **Tigra** » May 11, 2007 20:01

Alexa, I've attached a revised version at this bottom of this message.

Changes are:
- fixed the buffer-size code
- tweaked some code
- loads search/replace pairs from a file
- displays a running counter on the screen, disabled to improve performance; a final duration is also show - all in milliseconds
- detects and handles utf-8 files - this is still partially tested
- detects unicode big & little endian files - these cannot be processed at this time

So now instead of having 20 versions of the program you can use one version and merely run it with different files of search/replace pairs.

I'd like you to try an experiment - put all of your pairs of words into one file and run the program using the buffer size parameter:
snr source destination file-of-pairs ####
where #### can be any number. Try using a large number such as 1MB (1048576) or medium sized, 131072, or small 4096. I got the best performance using 4096!

On my PC (Windows) I only have files of 3MB in size. It took seven seconds to process it with five pairs without using the buffer parameter and 1.2 seconds with a block size of 4096.

I've examined the website Antoni mentioned and it looks promising, but it'll take some time to put something together.

Code: Select all

'	snr.bas

#include once "crt.bi"
#include once "regex.bi"
#include once "sdl/sdl_timer.bi"

Declare Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
End main( __FB_ARGC__, __FB_ARGV__ )

Function SearchAndReplace(ByVal Source As String, ByVal Destination As String, ByVal BufferSize As Integer, Searches() As String, Replaces() As String) As Integer
	FUNCTION = EXIT_FAILURE

	Dim As Integer hSource
	Dim As Integer hDestination
	Dim as regex_t re
	Dim as regmatch_t rmm
	Dim as zstring ptr pBuffer
	Dim As String SearchPattern
	Dim As String Buffer
	Dim As Integer re_flag = 0
	Dim As Integer longestSearch
	Dim As Integer UBoundSearches = UBound(Searches)
	Dim As Integer RemainingBufferSize
	Dim As Integer IsUCS_little_endian = 0
	Dim As Integer IsUCS_big_endian = 0
	Dim As Integer IsUTF8 = 0

	Const TRUE = Not 0
	Const FALSE = 0

	hSource = FreeFile()
	Open Source For Binary Access Read As #hSource

	Buffer = space(3)
	Get #hSource, , Buffer
	If Buffer[0] = &hFF And Buffer[1] = &hFE Then
		IsUCS_little_endian = TRUE
		Seek hSource, 3
		Buffer = Left(Buffer, 2)
	ElseIf Buffer[0] = &hFE And Buffer[1] = &hFF Then
		IsUCS_big_endian = TRUE
		Seek hSource, 3
		Buffer = Left(Buffer, 2)
	ElseIf Buffer[0] = &hEF And Buffer[1] = &hBB And Buffer[2] = &hBF Then
		IsUTF8 = TRUE
	Else
		Seek hSource, 1
	End If

	If IsUCS_little_endian Or IsUCS_big_endian Then
		If IsUCS_little_endian Then
			print "UCS-2 little endian files cannot be processed"
		ElseIf IsUCS_big_endian Then
			print "UCS-2 big endian files cannot be processed"
		End If
		Close #hSource
		Exit Function
	End If

	hDestination = FreeFile()
	Open Destination For Binary Access Write As #hDestination

	' in case the program can handle these file types in the future...
	If IsUTF8 Or IsUCS_little_endian Or IsUCS_big_endian Then
		Put #hDestination, , Buffer
	End If

	longestSearch = -1
	For i as integer = 0 to UBoundSearches
		If i > 0 Then SearchPattern &= "|"
		If IsUCS_little_endian Then
			SearchPattern &= "(" & WStr(Searches(i)) & ")"
		Else
			SearchPattern &= "(" & Searches(i) & ")"
		End If
		If len(Searches(i)) > longestSearch Then longestSearch = len(Searches(i))
		Searches(i) = UCase(Searches(i))
	Next

	If BufferSize <= 0 Then
		BufferSize = LOF(hSource)
		longestSearch = 0
	ElseIf BufferSize < longestSearch Then
		BufferSize = longestSearch * 1.5
	End If

	If regcomp( @re, SearchPattern, REG_EXTENDED or REG_ICASE ) <> 0 Then
		print "Could not compile regular expression """ & SearchPattern & """"
		Close #hSource
		Close #hDestination
		Exit Function
	End If

	cls

	Print "Processing "; Source; " to "; Destination; " in "; (LOF(hSource) \ BufferSize + sgn(LOF(hSource) Mod BufferSize)); " blocks of "; BufferSize

	Dim As UInt32 uiOverall = SDL_GetTicks()
	Dim As UInt32 uiNow = uiOverall
	Dim As UInt32 uiThen
	Dim As UInt32 uiTimes(10)

	Do While Seek(hSource) < LOF(hSource)
		If Seek(hSource) + BufferSize-1 > LOF(hSource) Then BufferSize = LOF(hSource) - Seek(hSource) + 1
		Buffer = Space(BufferSize): RemainingBufferSize = BufferSize
		Get #hSource, , Buffer
		uiThen = SDL_GetTicks(): uiTimes(0) = uiThen - uiNow: uiNow = uiThen
		pBuffer = StrPtr(Buffer)
		re_flag = 0
		Do While regexec( @re, pBuffer, 1, @rmm, re_flag ) = 0
			re_flag = REG_NOTBOL
			Dim As String Replacement = ""
			Dim As String RawSource = mid(*pBuffer, 1 + rmm.rm_so, rmm.rm_eo - rmm.rm_so)
			Dim As String UCaseRawSource = UCase(RawSource)
			For i as integer = 0 to UBoundSearches
				If Searches(i) = UCaseRawSource Then
					Replacement = Replaces(i)
					Exit For
				End If
			Next
			Put #hDestination, , Left$(*pBuffer, 1 + rmm.rm_so-1)
			Put #hDestination, , Replacement
			pBuffer += rmm.rm_eo
			RemainingBufferSize -= rmm.rm_eo
		Loop
		If pBuffer = StrPtr(Buffer) Then
			uiTimes(3) += 1
		Else
			uiTimes(4) += 1
		End If
		uiThen = SDL_GetTicks(): uiTimes(1) = uiThen - uiNow: uiNow = uiThen
		If longestSearch > 0 And RemainingBufferSize >= longestSearch And Seek(hSource) < LOF(hSource) Then
			uiTimes(5) += 1
			Put #hDestination, , Left(*pBuffer, len(*pBuffer) - longestSearch)
			Seek #hSource, Seek(hSource) - longestSearch
		Else
			uiTimes(6) += 1
			Put #hDestination, , *pBuffer
		End If
		uiThen = SDL_GetTicks(): uiTimes(2) = uiThen - uiNow: uiNow = uiThen

		''~ locate 2, 0: print using "Load ####, process ####, output ####, (###### / ######), (###### / ######)"; uiTimes(0), uiTimes(1), uiTimes(2), uiTimes(3), uiTimes(4), uiTimes(5), uiTimes(6)
	 Loop

	uiThen = SDL_GetTicks(): print "To process the file " & ( uiThen - uiOverall )

	regfree( @re )

	Close #hSource
	Close #hDestination

	FUNCTION = EXIT_SUCCESS

End Function

Data 44 ' number of search/replace pairs

Function main( ByVal argc As Integer, ByVal argv As ZString Ptr Ptr ) As Integer
	Dim As String Source
	Dim As String Destination
	Dim As String Strings
	Dim As Integer BufferSize
	Dim As String Searches()
	Dim As String Replaces()

	If argc < 4 Then
		print *argv[0] & " Source Destination Strings [buffer]"
		print "Source is the file to search"
		print "Destination is the output file"
		print "Strings is the file with search-replace pairs"
		print "buffer is the size of the buffer to use; if omitted or zero or less then the entire file is read into memory"
	Else
		Dim As String Searches()
		Dim As String Replaces()
		Dim As Integer nPairs = 0
		Dim As Integer hPairs = FreeFile()

		Source = *argv[1]
		Destination = *argv[2]
		Strings = *argv[3]

		Open Strings For Input As #hPairs
		Do While Not EOF(hPairs)
			ReDim Preserve Searches(0 To nPairs)
			ReDim Preserve Replaces(0 To nPairs)
			Input #hPairs, Searches(nPairs), Replaces(nPairs)
			nPairs += 1
		Loop
		Close #hPairs

		If argc >= 5 Then BufferSize = val(*argv[4]) Else BufferSize = 0
		Return SearchAndReplace(Source, Destination, BufferSize, Searches(), Replaces())
	End If
	Return EXIT_SUCCESS
End Function

Tigra · Post by **Tigra** » May 11, 2007 22:05

Alexa wrote:What's the Fast way for Search and Replace in Huge Files.
for example search for "Good" in whole of the file and replace it with "BAD"

Alexa, I just had a revelation, usefullness is questionable.

Do you have any experience with linux/unix utilities, specifically sed, the stream editor?

From what I know of sed it is designed to run though a file once, processing it as it goes. I'm confident that it can quickly do what you want. Of course if you have not experience with it, well, sorry.

Tigra

Tigra · Post by **Tigra** » May 17, 2007 12:51

Alexa, I've written an alternate version of my program, with a simple list and a 1MB file is runs twice as fast as the previous version.

I've got some ideas that may improve it moreso, but I'd like to hear back from you if you are interested.

I've thought of a short cut it can take if the search word is eight or less characters, and another short cut if all of the search words are more than eight characters.

Tigra

Search for Text and Replace

Search for Text and Replace

Re: Search for Text and Replace

Re: Search for Text and Replace