Duplicates

General FreeBASIC programming questions.
Provoni
Posts: 386
Joined: Jan 05, 2014 12:33
Location: Belgium

Duplicates

Postby Provoni » Jun 15, 2020 16:46

Hey all,

I have a very large text file (nearly 1 TB) and on each line there is some text with a minimum length of twenty bytes and a maximum length of perhaps thousands of bytes. From this file I want to remove the duplicate entries. How would one approach this in FreeBASIC?

Thanks
jj2007
Posts: 1692
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Duplicates

Postby jj2007 » Jun 15, 2020 17:45

Looks like a case for hashing
Provoni
Posts: 386
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: Duplicates

Postby Provoni » Jun 16, 2020 17:48

jj2007 wrote:Looks like a case for hashing

Thanks jj2007. I was thinking of using a checksum for each line of text.

Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value.

A good hash function satisfies two basic properties: 1) it should be very fast to compute; 2) it should minimize duplication of output values (collisions).
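
For illustration, here is a minimal 32-bit FNV-1a hash in FreeBASIC (a sketch; the function is my pick, the thread doesn't settle on one). It has both properties: one XOR and one multiply per byte, with good dispersion:

Code: Select all

'Sketch: 32-bit FNV-1a hash of one line (illustrative; any fast, low-collision hash would do)
Function fnv1a32(ByRef s As String) As ULong
    Dim As ULong h = 2166136261 'FNV offset basis
    For i As Integer = 0 To Len(s) - 1
        h Xor= s[i]   'mix in one byte
        h *= 16777619 'FNV prime (unsigned overflow wraps, as intended)
    Next i
    Return h
End Function

Print Hex(fnv1a32("some line of text"))
Sleep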
badidea
Posts: 2148
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Duplicates

Postby badidea » Jun 16, 2020 22:37

Isn't a checksum just a 'low quality' hash?
See, for example, the first paragraph of https://en.wikipedia.org/wiki/MD5
Anyway, I would also choose a hash function and a smart tree structure to find a hash quickly.
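
As a toy sketch of that combination (a plain, unbalanced binary tree keyed on 32-bit hashes; all names are mine):

Code: Select all

Type HashNode
    h As ULong
    lft As HashNode Ptr
    rgt As HashNode Ptr
End Type

'insert h into the tree; returns false if it was already present (a duplicate candidate)
Function treeInsert(ByRef node As HashNode Ptr, h As ULong) As Boolean
    If node = 0 Then
        node = New HashNode 'fields start zeroed
        node->h = h
        Return True
    ElseIf h < node->h Then
        Return treeInsert(node->lft, h)
    ElseIf h > node->h Then
        Return treeInsert(node->rgt, h)
    Else
        Return False 'hash already seen: compare the full lines to rule out a collision
    End If
End Function

Dim As HashNode Ptr root 'empty tree
Print treeInsert(root, 12345) 'true: new
Print treeInsert(root, 12345) 'false: already seen
Sleep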
counting_pine
Site Admin
Posts: 6225
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: Duplicates

Postby counting_pine » Jun 17, 2020 18:43

So we’re talking roughly a billion lines of text?
One question that occurs to me is, how many unique lines are there likely to be?
I think any hash-based method will need a table with at least that many entries, which could easily be gigabytes in size and might not easily fit in memory. And I’m not sure how easy collision checking will be.

A good strategy might be to sort the array, then duplicate checking is easy.
Instead of creating temporary copies of the entire file, you can create an index of “pointers” into the file for each line. For less than 1TiB, 5-byte pointers will fit nicely.
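
A sketch of building such an index (8-byte LongInts here for clarity; packing each offset into 5 bytes is a straightforward refinement, and the file name and array size are placeholders):

Code: Select all

#define MAXLINES 1000000 'placeholder: sized for a test file, not the real one
Dim As LongInt offs(0 To MAXLINES - 1) 'byte offset of each line's first character
Dim As String ln
Dim As Integer n, f = FreeFile

Open "testfile" For Input As #f
Do Until Eof(f)
    offs(n) = Seek(f) 'position where the next line starts (1-based)
    Line Input #f, ln
    n += 1
Loop
Close #f
Print n; " lines indexed"
'sorting offs() by the lines each offset points at makes duplicates adjacent
Sleep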
grindstone
Posts: 752
Joined: May 05, 2015 5:35
Location: Germany

Re: Duplicates

Postby grindstone » Jun 17, 2020 19:12

From the coding point of view it's not a big deal:

Code: Select all

Dim As String src, dst, oldfile, newfile
Dim As Boolean duplicate

oldfile = "C:\oldfile.txt" 'replace with the desired file names
newfile = "C:\newfile.txt"

Open oldfile For Input As #1
Open newfile For Output As #2

'copy the 1st entry from source to destination file
Line Input #1, src
Print #2, src
Close 2

Do Until Eof(1)
   Line Input #1, src 'read the next entry from the source file
   Open newfile For Input As #2
   duplicate = FALSE
   Do 'scan the destination file if entry already exists
      Line Input #2, dst
      If dst = src Then 'duplication found
         duplicate = TRUE 'set flag
         Exit Do 'terminate scan
      EndIf
   Loop Until Eof(2)
   Close 2
   If duplicate = FALSE Then 'if no duplication then write entry into destination file
      Open newfile For Append As #2
      Print #2, src
      Close 2
   EndIf
Loop

Close
But my guess is that Provoni does not have enough free disk space for a copy of that file and is therefore looking for a way to remove the duplicated data from the existing file. AFAIK there's no trivial way to do that, although it should be possible in principle.
grindstone
Posts: 752
Joined: May 05, 2015 5:35
Location: Germany

Re: Duplicates

Postby grindstone » Jun 17, 2020 21:47

Ok, I think that's a passable way to solve that task (if you find any bugs, please let me know):

Code: Select all

Dim As ZString*2 h
Dim As String src, dst, testfile, g
Dim As Integer filepointer1, filepointer2, readfrom, writeto, x, y, endofdata

testfile = "C:\testfile.txt"

'create testfile
Open testfile For Output As #1
Print #1, "one"
Print #1, "two"
Print #1, "three"
Print #1, "three"
Print #1, "three"
Print #1, "four"
Print #1, "five"
Print #1, "six"
Print #1, "seven"
Print #1, "three"
Print #1, "eight"
Print #1, "two"
Print #1, "nine"
Print #1, "ten"
Print #1, "one"

Close
Print "Open the original testfile in an editor for purpose of comparison"
Print
Print "Then press any key to continue"
Print

Sleep

Open testfile For Binary As #1
Print "Length of original file: "; Lof(1)
endofdata = Lof(1) 'pointer to 'real' end of valid data

Do
   filepointer1 = Seek(1) 'remember the file pointer before reading the source data
   Line Input #1, src
   filepointer2 = Seek(1) 'remember the file pointer after reading the source data

   Do Until Seek(1) >= endofdata 'until the end of valid data
      writeto = Seek(1) 'pointer to the beginning of the area where the remaining data
                        ' will be written if a duplicate is found
      Line Input #1, dst
      readfrom = Seek(1) 'pointer to the beginning of the remaining data if a duplicate is found
      If src = dst Then 'duplicate found
         filepointer2 = filepointer1 'reset the file pointer to check the same entry again in case of multiple duplicates
         'shift the remaining data, overwriting the duplicate (to be improved)
         For x = 0 To endofdata - readfrom
            Get #1, readfrom + x, h
            Put #1, writeto + x, h
         Next
         'overwrite the now-invalid data at the end of the file
         For y = writeto + x To writeto + x + Len(dst) + 1
            Put #1, y, "#"
         Next
         endofdata -= Len(dst) + 2 'set the new end of valid data (line plus CRLF)
      EndIf
   Loop
   Seek 1, filepointer2 'reset the file pointer to the beginning of the next entry to check
Loop Until Seek(1) >= endofdata

Print "Amount of valid data (should be the length of the revised file)"; endofdata
Print
Print "Now open the same file in the editor to see the result of the revision"

Close

Sleep

The only issue that remains is how to cut off and release the empty space at the end of the file and update the file information on the hard disk. If anyone has a clue how to do that, please share your wisdom.
jj2007
Posts: 1692
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Duplicates

Postby jj2007 » Jun 17, 2020 22:11

counting_pine wrote:So we’re talking roughly a billion lines of text?
...
A good strategy might be to sort the array, then duplicate checking is easy.
Have you ever tried to sort a billion lines of text? Any idea how much time it will take?
MrSwiss
Posts: 3605
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: Duplicates

Postby MrSwiss » Jun 17, 2020 22:13

grindstone wrote:The only issue that remains is how to cut off and release the empty space at the end of the file and actualize the file information on the hard disk.
That functionality is implemented in the upcoming FBC 1.08.0 (currently in DEV state).

See entries #567 and #568 in the changelog:
- Add new statement "FLUSH" - flush file buffers to disk (#567)
- rtlib: rename FileTruncate to FileSetEof (#568)

For testing, download the latest build (2020-01-13; last change #601 from 2020-01-12).
Tourist Trap
Posts: 2922
Joined: Jun 02, 2015 16:24

Re: Duplicates

Postby Tourist Trap » Jun 17, 2020 22:25

jj2007 wrote:Have you ever tried to sort a billion lines of text? Any idea how much time it will take?

I think files like this are the reason people use database systems.
dodicat
Posts: 6687
Joined: Jan 10, 2006 20:30
Location: Scotland

Re: Duplicates

Postby dodicat » Jun 17, 2020 23:55

1 TB is a bit big for me.
Here is one of about 3 million characters (Grindstone's list, doubled up a few times):

Code: Select all


Function tally(somestring As String,partstring As String) As Integer
    Dim As Integer i,j,ln,lnp,count,num
    ln=Len(somestring)
    lnp=Len(partstring)
    count=0
    i=-1
    Do
        i+=1
        If somestring[i] <> partstring[0] Then continue do
        If somestring[i] = partstring[0] Then
            For j=0 To lnp-1
                If somestring[j+i]<>partstring[j] then continue do
            Next j
        End If
        count+=1
        i=i+lnp-1
    Loop Until i>=ln-1
    Return count
End Function

Function findAndReplace(original As String , find As String , replace As String) As String
    If Len(find) = 0 Then Return original
    Var t = tally(original,find)               'count occurrences of find
    Dim As long found, n, staid,m
    Var Lf = Len(find) , Lr = Len(replace) , Lo = Len(original)
    Dim As long x = Len(original) - t * Lf + t * Lr 'length of output string
    dim As String res = String(x,0)            'output string
    Do
        st:
        If original[n] = find[0] Then            'got a possible
            For m = 0 To Lf - 1
                If original[n + m] <> find[m] Then Goto lbl 'nope
            Next m
            found = 1                            'Bingo
        End If
        If found Then
            For m = 0 To Lr - 1
                res[staid] = replace[m]          'copy in the replacement
                staid += 1
            Next m
            n += Lf
            found = 0
            goto fin
        End If
        lbl:
        res[staid] = original[n]
        staid += 1
        n += 1
        fin:
    Loop Until n >= Lo
    Return res
End Function

Sub insert(s As String, char As String, position As Long)
    If position > 0 Then
        Var part1 = Mid(s, 1, position - 1)
        Var part2 = Trim(Mid(s, position), Chr(13, 10))
        s = part1 + char + Chr(13, 10) + part2
    End If
End Sub

 #Include "file.bi"

Function loadfile overload(file as string) as String
   If FileExists(file)=0 Then Print file;" not found":Sleep:end
   dim as long  f=freefile
    Open file For Binary Access Read As #f
    Dim As String text
    If Lof(f) > 0 Then
      text = String(Lof(f), 0)
      Get #f, , text
    End If
    Close #f
    return text
end Function

sub savefile overload(filename As String,p As String)
    Dim As long n=freefile
    If Open (filename For Binary Access Write As #n)=0 Then
        Put #n,,p
        Close
    Else
        Print "Unable to save " + filename:sleep:end
    End If
End sub



Open "testfile" For Output As #1
Print #1, "one"
Print #1, "two"
Print #1, "three"
Print #1, "three"
Print #1, "three"
Print #1, "four"
Print #1, "five"
Print #1, "six"
Print #1, "seven"
Print #1, "three"
Print #1, "eight"
Print #1, "two"
Print #1, "nine"
Print #1, "ten"
Print #1, "one"
Close

var L=loadfile("testfile")
for n as long=1 to 15  'increase size and re save
  L+=L
next

savefile("testfile",L)
print "File length = ";filelen("testfile")
L=loadfile("testfile") '' re load bigger file
'print "Sample"
'print mid(L,1,1000)


Open "testfile" For input As #2
dim as string s

while not eof(2)
  line input #2,s
 var p=instr(L,s)
  L=findandreplace(L,s,"")
  insert(L,s,p)
wend
close

print L
sleep
kill "testfile"

 
integer
Posts: 391
Joined: Feb 01, 2007 16:54
Location: usa

Re: Duplicates

Postby integer » Jun 18, 2020 1:52

Provoni wrote:Hey all,
I have a very large text file (nearly 1 TB) and on each line there is some text with a minimum length of twenty bytes and a maximum length of perhaps thousands of bytes. From this file I want to remove the duplicate entries. How would one approach this in FreeBASIC?
Thanks

@Provoni
Thank you for this question.

I do not have an answer, but it meshes with a problem that vexed me: the longest common substring problem.
I used a 4-byte index pointer to the first byte in the string and a 2-byte short for the length of the string.
A "slightly modified" merge short was used to sort and maintain the StringLength, Pointers, (and alphanumeric sequence)
Only those lines with the same byte length were byte-compared for duplication, as in the sketch below.
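
In FreeBASIC terms the record would look roughly like this (a sketch; field names are mine, and Field = 1 removes padding so each entry is exactly 6 bytes):

Code: Select all

Type LineRef Field = 1
    linePos As ULong  '4-byte index pointer to the first byte of the string
    lineLen As UShort '2-byte length of the string
End Type

Print SizeOf(LineRef) 'prints 6
Sleep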
Interesting "little" problem.
grindstone
Posts: 752
Joined: May 05, 2015 5:35
Location: Germany

Re: Duplicates

Postby grindstone » Jun 18, 2020 4:06

MrSwiss wrote:In the upcoming FBC 1.08.0 (currently DEV state) that functionality is implemented.
Thank you, that's good news.
counting_pine
Site Admin
Posts: 6225
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

Re: Duplicates

Postby counting_pine » Jun 18, 2020 14:34

jj2007 wrote:
counting_pine wrote:So we’re talking roughly a billion lines of text?
...
A good strategy might be to sort the array, then duplicate checking is easy.
Have you ever tried to sort a billion lines of text? Any idea how much time it will take?
OK, fair enough.
Running the numbers, I guess a terabyte file of a billion lines would have to be iterated through roughly 30 times (a merge sort needs about log2(1,000,000,000) ≈ 30 passes). That's a lot.

I was weighing it up against the alternative of a very large hash table. But I guess "very large" is basically just gigabytes. Too much to easily fit in RAM, but pretty easy in an era of terabyte hard disks.

So I guess a disk-based hash table of around a billion entries, plus a second table for collisions. Each hash table entry contains an index into the second table. A row of the second table contains the index of the matching line, and also a link to a further row in the same table for any hash collisions.
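
A rough sketch of that layout (names and sizes are placeholders; in practice both tables would live on disk and hold around a billion entries):

Code: Select all

Type CollisionRow
    lineIndex As LongInt '1-based index (or byte offset) of the matching line in the big file
    nextRow As LongInt   'next row sharing the same hash bucket (a collision), 0 = none
End Type

'hash table: bucket(i) holds a 1-based index into rows(), 0 = empty bucket
Dim As LongInt bucket(0 To 1023)    'placeholder size
Dim As CollisionRow rows(1 To 1024) 'placeholder size; one row per unique line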
jj2007
Posts: 1692
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Duplicates

Postby jj2007 » Jun 18, 2020 16:42

You can't load a terabyte file into memory. So I would:
- read 2 MB chunks from source file into memory (roughly what the L2 cache can hold)
- write 32-bit hashes to a file
- write 64-bit values to another file, with length (16 bits) and start positions (48 bits) of the strings in the source file
- reopen the hash file and start with the first hash, checking all others for identity
- when two hashes are identical, compare the actual strings to rule out a collision, then mark the duplicate as bad in the hash file and set that string's length to -1 in the positions file
- repeat this for all hashes
- finally, open a destination file, and (based on positive length in the positions file) write all valid strings to this file.
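
The 64-bit entries in the positions file could be packed like this (a sketch; the exact bit layout is my choice):

Code: Select all

Const As ULongInt MASK48 = (CULngInt(1) Shl 48) - 1

'pack a 16-bit length and a 48-bit start position into one 64-bit value
Function packPos(position As ULongInt, length As UShort) As ULongInt
    Return (CULngInt(length) Shl 48) Or (position And MASK48)
End Function

Sub unpackPos(p As ULongInt, ByRef position As ULongInt, ByRef length As UShort)
    length = p Shr 48       'top 16 bits: string length (a -1 marker stores as 65535)
    position = p And MASK48 'low 48 bits: start position in the source file
End Sub

Dim As ULongInt pos2
Dim As UShort len2
unpackPos(packPos(123456789, 42), pos2, len2)
Print pos2, len2 'prints 123456789  42
Sleep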
