Duplicates

General FreeBASIC programming questions.
Lost Zergling
Posts: 324
Joined: Dec 02, 2011 22:51
Location: France

Re: Duplicates

Postby Lost Zergling » Jun 18, 2020 17:00

I would suggest a procedure: compute a "short" hash on each line and store this short value in a list (it must be short enough to fit in RAM), and write the keys of duplicate candidates to a file, something like this:
While Not Eof(hugefile)
    key = ComputeShortHash(FileLine)
    If MyList.HashTag(key) = 1 Then
        Write key to new file
    End If
Wend
MyList.Recycle
Then load the duplicate candidates into the recycled memory slots:
While Not Eof(new file)
    MyList.HashTag(shorthash)
Wend
At this point, the possible duplicates are stored in a list in memory.
Then parse the huge file again:
While Not Eof(huge file)
    If MyList.HasHashTag(ComputeShortHash(FileLine)) Then
        If MyList2.HashTag(ComputeFullHash) = 1 Then
            ' duplicate, do nothing
        Else ' just a candidate, not a real duplicate
            Write Line to resultFile
        End If
    Else
        Write Line to resultFile
    End If
Wend
Indeed, a second index (MyList2) will require RAM space, but only for the subset of duplicate candidates (first filtered with the first list).
Two cascaded indexes in memory mean the ComputeShortHash function must be very precisely tuned.
Sounds heavy, and the balance between the cascaded indexes is somewhat random...
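
A minimal FreeBASIC sketch of that first pass, with a plain byte array standing in for MyList and an illustrative 24-bit short hash (names, file names and sizes are placeholders):

Code:

' Pass 1: tag every line's short hash; when a hash is seen a second time,
' write the key to a candidates file. The later passes reload those keys and
' re-scan the huge file with a full hash, as described above.
Function ShortHash24(s As String) As Ulong
    ' FNV-1a style hash folded to 24 bits (illustrative choice)
    Dim As Ulong h = &h811C9DC5
    For i As Integer = 0 To Len(s) - 1
        h Xor= s[i]
        h *= &h01000193
    Next
    Return (h Xor (h Shr 24)) And &hFFFFFF
End Function

Dim Shared As Ubyte tagged(&hFFFFFF)   ' 16 MiB: one byte per 24-bit hash slot

Dim As String textline
Open "huge.txt" For Input As #1
Open "candidates.txt" For Output As #2
While Not Eof(1)
    Line Input #1, textline
    Dim As Ulong k = ShortHash24(textline)
    If tagged(k) > 0 Then
        Print #2, k          ' short hash already seen: duplicate candidate
    Else
        tagged(k) = 1
    End If
Wend
Close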
badidea
Posts: 2122
Joined: May 24, 2007 22:10
Location: The Netherlands

Re: Duplicates

Postby badidea » Jun 18, 2020 17:30

How many duplicates do you expect? If there are only 30 duplicate lines in a 1 TB file (the rest unique), it is more complicated (heavier) than if there are only 30 unique lines in a TB. What kind of data does the text file contain? If you can discard certain data directly after reading, that also makes the task more manageable.

In the past I made something similar (in C) for firewall log-files, containing many duplicates (with timestamps and/or certain IP-addresses removed), but that tool processed files of a few GB not TB.
Provoni
Posts: 380
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: Duplicates

Postby Provoni » Jun 18, 2020 19:01

Thanks everyone for all the replies. Still need to catch up.

badidea wrote:Isn't a checksum just a 'low quality' hash?

No clue, it could be.

counting_pine wrote:So we’re talking roughly a billion lines of text?
One question that occurs to me is, how many unique lines are there likely to be?

I never counted the lines, will follow up with a line count. Not sure about the number of unique lines. Each line is a comment from a forum. My machine has 64 GB RAM, so I guess if the line count is manageable then a low-quality hash/checksum could do. Perhaps the lines can be split up into smaller files by length!

grindstone wrote:From the coding point of view it's not a big deal:

How long will that have to run?

integer wrote:Only those lines with same byte length were byte compared for duplication.

Great idea, thanks.
marcov
Posts: 3004
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL
Contact:

Re: Duplicates

Postby marcov » Jun 18, 2020 20:00

Provoni wrote:I have a very large text file (near 1 TB) and on each line there is some text with a minimum length of twenty bytes and a maximum length of perhaps thousands of bytes. From this file I want to remove the duplicate entries. How would one approach this in FreeBASIC?


(Just my first idea, before reading the rest.) Use a machine with lots of memory (32 GB or more) and a 64-bit compiler.

Walk the file, and for every line:
- Store the offset of the line into an index that, for each line length, holds a list of checksums, each with a record of the offsets and a duplicate flag (CRC16 should be fine). (So a map of maps of file records: <linelength, <checksum, offset + duplicate flag>>.)
- Then walk over the line lengths. If, within the list for one constant line length, one of the <checksum, offset> lists has multiple items, seek the file and compare the lines to make sure they really are duplicates. If they are, mark the relevant items as such.
- After this is complete, add all entries that are NOT duplicates to a list ordered by offset. (This is harder than it sounds for large lists.)
- Walk the file again, while maintaining a cursor in the list of offsets.
- If a line's offset is in the list of offsets, write it to a new file.

The new file is then the deduplicated result, built from that list of offsets.
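
A rough FreeBASIC sketch of that per-length index record (the type and helper names are my own illustration, not marcov's actual code, and the one-element-at-a-time Reallocate is only for brevity):

Code:

Type LineRec
    As Ulongint offs       ' byte offset of the line in the huge file
    As Ushort   crc16      ' short checksum of the line
    As Ubyte    dup        ' set once a byte-for-byte compare confirms a duplicate
End Type

Type LenBucket
    As Integer count
    As LineRec Ptr items   ' grown as lines of this length are added
End Type

' One bucket per possible line length (size the array to the longest expected line).
Dim Shared buckets(0 To 32767) As LenBucket

Sub AddLine(length As Integer, offs As Ulongint, crc As Ushort)
    Dim As LenBucket Ptr b = @buckets(length)
    b->items = Reallocate(b->items, (b->count + 1) * SizeOf(LineRec))
    b->items[b->count].offs  = offs
    b->items[b->count].crc16 = crc
    b->items[b->count].dup   = 0
    b->count += 1
End Sub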

If your duplicates are not random (e.g. might appear in clusters), then having a block level caching system for the comparison might help. You can even run multiple sizes in different threads.

If you need extra memory, write the filtered <linelength, <checksum, offset>> list to disk, then read it back while inserting the entries into the <offset> list. This avoids having to keep two lists in memory at once. So in all this means two passes over the file, plus the various seeks and reads for the comparisons (which will depend on your duplicate percentage).

Yes, you can do it with less memory, but it will be more work to program all the merge sorting and much slower, and both by orders of magnitude, not just a bit.

Ideally you want the index to fit into memory twice. If you really are interested, I'm willing to detail it, but in all scenarios you will want to optimize the hardware to be as modern as reasonable first. Borrow it if needed, especially if it is a one-off. :-)

IT is as much soft skills (convincing your boss to buy better hardware) as hard skills (programming) :-)

p.s. I have my own Pascal code for a duplicate file finder based on some of these principles. Since it works on whole files in the multi-megabyte range, it uses MD5, but it builds a list <int64, <filerectlist>> where <filerectlist> is a simple list with checksum and some other params.
p.p.s. When replying, please specify whether this is a one-off case, or whether you have to do this e.g. every month/year for new measurement data or so.
Last edited by marcov on Jun 18, 2020 20:29, edited 7 times in total.
marcov
Posts: 3004
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL
Contact:

Re: Duplicates

Postby marcov » Jun 18, 2020 20:05

badidea wrote:Isn't a checksum just a 'low quality' hash?


No. A checksum is usually a hash designed to detect errors introduced by mis-sampling of bits in hardware, and those errors are often of the kind where bits close to each other differ (since the same disruption causes multiple nearby bits to toggle). So these checksums try to ensure that multiple bits varying close to each other do not lead to the same checksum.

In ye olde days, it also had to be implementable in low-clocked hardware (though that is debatable nowadays, with Intel processors having SHA hardware and microprocessors reaching 100 MIPS and having CRC32 hardware).

Usually any hash has a data length beyond which the number of collisions starts rising. You want to use the hash that _just_ fits the data, so that you stay in the region of relatively few collisions.

badidea wrote:See for example the first paragraph of https://en.wikipedia.org/wiki/MD5
Anyway, I would also choose a hash function and a smart tree structure to find a hash quickly.


MD5 is a hash meant for 32-bit file sizes, so overkill for lines of a few thousand bytes. Note that collisions are never 100% excluded, so for nearly all real-world applications hashing is used to decrease the number of compares, not to do away with them completely.

If a cheap hash removes 99.99% of the compares, using a safer hash just eats CPU time. For the data range up to thousands of bytes (a few sectors' worth) CRC16 is fine, especially since in the above algorithm it is combined with the exact size of the data. So the actual hash is <size, crc16>, not just crc16. But since we keep a list per line size, we don't need to store the size separately for each line.
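
A small FreeBASIC sketch of that combined key: a plain bit-by-bit CRC-16 (CCITT polynomial) plus the line length packed into one 32-bit value. The function names are illustrative, and the packing assumes line lengths fit in 16 bits:

Code:

Function Crc16(s As String) As Ushort
    Dim As Ushort crc = &hFFFF
    For i As Integer = 0 To Len(s) - 1
        crc Xor= (s[i] Shl 8)
        For k As Integer = 1 To 8
            If crc And &h8000 Then
                crc = (crc Shl 1) Xor &h1021
            Else
                crc Shl= 1
            End If
        Next
    Next
    Return crc
End Function

' Pack the line length (upper 16 bits) and the CRC16 (lower 16 bits) into one
' 32-bit key, so equal keys imply equal length AND equal checksum.
Function LineKey(s As String) As Ulong
    Return (Culng(Len(s) And &hFFFF) Shl 16) Or Crc16(s)
End Function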

Tourist Trap wrote:
jj2007 wrote:Have you ever tried to sort a billion lines of text? Any idea how much time it will take?

I think this kind of files is the reason why people use database systems.


No. Database systems are very bad for operations that touch all data, IOW queries must have at least one major condition indexed, and the number of queries must be much larger than the original insertion/preparation/indexing of the data to make such a thing worthwhile.
grindstone
Posts: 744
Joined: May 05, 2015 5:35
Location: Germany

Re: Duplicates

Postby grindstone » Jun 18, 2020 22:42

One (serious) question: Is it really faster to make CRC16 hashes of two strings and compare the hashes than directly comparing the strings for equality?
Tourist Trap
Posts: 2901
Joined: Jun 02, 2015 16:24

Re: Duplicates

Postby Tourist Trap » Jun 18, 2020 22:49

marcov wrote:
Tourist Trap wrote:I think this kind of files is the reason why people use database systems.


No. Database systems are very bad for operations that touch all data, IOW queries must have at least one major condition indexed, and the number of queries must be much larger than the original insertion/preparation/indexing of the data to make such a thing worthwhile.

What I meant was, the data should have been stored in a database system from the very start, so it would have been easier to maintain from the beginning and less prone to some file disaster. I was not suggesting converting the file into a database now that it has become so huge. It would take a while on my machine anyway.

Marcov, by the way, do you know how to quantify the error made when matching 2 photos shot from 2 different angles? I think it's about image registration, but any clue would help me with some business at work.
MrSwiss
Posts: 3573
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: Duplicates

Postby MrSwiss » Jun 18, 2020 23:15

grindstone wrote:One (serious) question: Is it really faster to make CRC16 hashes of two strings and compare the hashes than directly comparing the strings for equality?

It is already faster to compare Len(any_string) instead of comparing the strings themselves
(rather than a hash, for example, as a first step).
Second step: do a full string comparison only if the lengths are equal.
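
A minimal illustration of that two-step check (names are placeholders):

Code:

Function SameLine(a As String, b As String) As Boolean
    ' Cheap length check first; the full compare only runs on equal lengths.
    Return (Len(a) = Len(b)) AndAlso (a = b)
End Function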
jj2007
Posts: 1625
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Duplicates

Postby jj2007 » Jun 19, 2020 0:17

grindstone wrote:One (serious) question: Is it really faster to make CRC16 hashes of two strings and compare the hashes than directly comparing the strings for equality?
The point is that you have to compare string #123 to the one billion strings that follow it. Feasible with a 32-bit integer comparison, but not with a direct string comparison.
Lost Zergling
Posts: 324
Joined: Dec 02, 2011 22:51
Location: France

Re: Duplicates

Postby Lost Zergling » Jun 19, 2020 8:20

The bottlenecks are: 1) the number of unique key values to check at a time (meaning the maximum index size must fit in the available memory), and 2) processing speed. To this purpose, the main data could first be dispatched between several files according to a computed value, so as to be sure there is no possible duplicate between dispatched files (i.e. first ASCII + 30, (second ASCII + 30) * 10, and so on). This just requires time and disk space. Then, on each file, apply a procedure (the custom cascaded index as described, or another) to create a result file. Finally, concatenate the result files.
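
A hedged FreeBASIC sketch of such a dispatch pass, routing lines to bucket files by their leading characters so identical lines always land in the same bucket; the bucket formula, the bucket count and the file names are illustrative, not the exact formula above:

Code:

Const NBUCKETS = 64

Function BucketOf(s As String) As Integer
    ' Derive a bucket number from the first two characters; equal lines always
    ' map to the same bucket, so each bucket can be deduplicated independently.
    If Len(s) = 0 Then Return 0
    Dim As Integer b = s[0]
    If Len(s) > 1 Then b Xor= (s[1] Shl 3)
    Return b Mod NBUCKETS
End Function

Dim As String textline
Dim As Integer infile = Freefile
Open "huge.txt" For Input As #infile

Dim As Integer outfile(NBUCKETS - 1)
For i As Integer = 0 To NBUCKETS - 1
    outfile(i) = Freefile
    Open "bucket" & i & ".txt" For Output As #outfile(i)
Next

While Not Eof(infile)
    Line Input #infile, textline
    Print #outfile(BucketOf(textline)), textline
Wend
Close    ' closes all open files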
Last edited by Lost Zergling on Jun 19, 2020 9:11, edited 2 times in total.
marcov
Posts: 3004
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL
Contact:

Re: Duplicates

Postby marcov » Jun 19, 2020 8:36

grindstone wrote:One (serious) question: Is it really faster to make CRC16 hashes of two strings and compare the hashes than directly comparing the strings for equality?


If you have n lines in a bucket (all lines of the same line size), you need to compare every line to the n-1 other lines, which means (n*(n-1))/2 comparisons (divided by two, since swapping the arguments counts as the same comparison). For a bucket of 10,000 lines that is already about 50 million string comparisons.

Now you compute n hashes and only compare lines that have the same hash, which will be only a few, and rarely quadratic.

In short, unless nearly all line-size buckets have only very few (2-3) strings in them, it is almost certainly faster, even if you could fit them all in memory in the first place (you can store the CRC16 from the initial run, rather than reloading the lines, but then you do CRC16 on all lines).
Last edited by marcov on Jun 19, 2020 8:52, edited 1 time in total.
marcov
Posts: 3004
Joined: Jun 16, 2005 9:45
Location: Eindhoven, NL
Contact:

Re: Duplicates

Postby marcov » Jun 19, 2020 8:51

Tourist Trap wrote:
marcov wrote:
Tourist Trap wrote:I think this kind of files is the reason why people use database systems.


No. Database systems are very bad for operations that touch all data, IOW queries must have at least one major condition indexed, and the number of queries must be much larger than the original insertion/preparation/indexing of the data to make such a thing worthwhile.

What I meant was, the data should be stored in a database system from the very start, so it would have been easier to maintain from the very beginning and less prone to some file disaster. I was not suggesting to convert the file in a database now that it became so huge. It would take a while on my machine anyway.


Still, if the kind of query you want to do will typically touch everything, that is still hopeless. Also be very careful with very large datasets; DBMS costs and hardware have a tendency to increase non-linearly in such cases (indicator: instead of naming a price, they want to make an appointment to discuss a "solution").

Marcov, by the way, do you know how to quantify the error made when matching 2 photos shot from 2 different angles. I think it's about image registration, but any clue would help me for some business at work.


If you can match it, variance or stddev of the pixels on a LAB or HSV basis would be my initial hunch.

If your registration is not 100%, you might do this with a kernel, in other words calculate the difference of the color, but compute an average color with a kernel operation first, e.g. multiply the pixels with a 3x3 matrix of weights centered on a pixel, e.g.

0.025  0.1  0.025
0.1    0.5  0.1
0.025  0.1  0.025

(so the middle pixel counts for 50%, the horizontal/vertical neighbours for a total of 4*0.1 = 40%, and the diagonals for 4*0.025 = 10%)

It also depends on your speed requirements and the magnitude of precision you expect. I did a test of whether a beer-bottle label was well placed (no folds) and didn't have errors (parts were printed on), and for that just summing the channel differences |r-r2| + |g-g2| + |b-b2| was already quite OK. But that was not angled, just comparing to a master with a rough sense of registration, using an AND mask to avoid very big faults near places of dramatic color change due to small variations in registration.
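
A small FreeBASIC sketch of that 3x3 weighted average over a plain 2D pixel array (the array, its bounds and the helper name are assumptions; a real image would be walked via pointers, as in the code below):

Code:

Function WeightedAvg(pix() As Ulong, x As Integer, y As Integer) As Ulong
    ' Weights as in the matrix above: centre 50%, edges 4*10%, corners 4*2.5%.
    Dim As Single w(-1 To 1, -1 To 1)
    w(-1,-1)=0.025: w(-1,0)=0.1: w(-1,1)=0.025
    w( 0,-1)=0.1  : w( 0,0)=0.5: w( 0,1)=0.1
    w( 1,-1)=0.025: w( 1,0)=0.1: w( 1,1)=0.025
    Dim As Single r, g, b
    For dy As Integer = -1 To 1
        For dx As Integer = -1 To 1
            Dim As Ulong c = pix(x + dx, y + dy)   ' caller keeps x, y off the border
            r += w(dx, dy) * ((c Shr 16) And 255)
            g += w(dx, dy) * ((c Shr  8) And 255)
            b += w(dx, dy) * ( c         And 255)
        Next
    Next
    Return Rgb(CInt(r), CInt(g), CInt(b))
End Function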
dodicat
Posts: 6644
Joined: Jan 10, 2006 20:30
Location: Scotland

Re: Duplicates

Postby dodicat » Jun 19, 2020 12:11

Tourist Trap wrote:
marcov wrote:
Tourist Trap wrote:I think this kind of files is the reason why people use database systems.



. . .
Marcov, by the way, do you know how to quantify the error made when matching 2 photos shot from 2 different angles. I think it's about image registration, but any clue would help me for some business at work.

TT
Keeping it as simple as possible. Sorry, I am not marcov; maybe he is involved in this kind of thing.

Code:

Screen 20,32

#define tolerance 20
' Read the image width and height directly from the BMP file header.
Sub getsize(picture As String,Byref dimensionx As Long,Byref dimensiony As Long)
  Open picture For Binary Access Read As #1
  Get #1, 19, dimensionx
  Get #1, 23, dimensiony
  Close #1
End Sub

' Smooth an image with a simple averaging (blur) filter of radius rad, to
' reduce the effect of small pixel-level differences before comparing.
Function Filter(Byref tim As Ulong Pointer,_
    Byval rad As Single,_
    Byval destroy As Long=1,_
    Byval fade As Long=0) As Ulong Pointer
    #define map(a,b,_x_,c,d) ((d)-(c))*((_x_)-(a))/((b)-(a))+(c)
    If fade<0 Then fade=0
    If fade>100 Then fade=100
    Type p2
        As Long x,y
        As Ulong col
    End Type
    #macro _point(_x,_y,colour)
    pixel=row+pitch*(_y)+(_x)*4
    (colour)=*pixel
    #endmacro
    #macro ppset(_x,_y,colour)
    pixel=row+pitch*(_y)+(_x)*4
    *pixel=(colour)
    #endmacro
    #macro average()
    ar=0:ag=0:ab=0:inc=0
    xmin=x:If xmin>rad Then xmin=rad
    xmax=rad:If x>=(_x-1-rad) Then xmax=_x-1-x
    ymin=y:If ymin>rad Then ymin=rad
    ymax=rad:If y>=(_y-1-rad) Then ymax=_y-1-y
    For y1 As Long=-ymin To ymax
        For x1 As Long=-xmin To xmax
            inc=inc+1
            ar=ar+(NewPoints(x+x1,y+y1).col Shr 16 And 255)
            ag=ag+(NewPoints(x+x1,y+y1).col Shr 8 And 255)
            ab=ab+(NewPoints(x+x1,y+y1).col And 255)
        Next x1
    Next y1
    If fade=0 Then
        averagecolour=Rgb(ar/(inc),ag/(inc),ab/(inc))
    Else
        averagecolour=Rgb(fd*ar/(inc),fd*ag/(inc),fd*ab/(inc))
    End If
    #endmacro
    Dim As Single fd=map(0,100,fade,1,0)
    Dim As Integer _x,_y
    Imageinfo tim,_x,_y
    Dim  As Ulong Pointer im=Imagecreate(_x,_y)
    Dim As Integer pitch
    Dim  As Any Pointer row
    Dim As Ulong Pointer pixel
    Dim As Ulong col
    Imageinfo tim,,,,pitch,row
    Dim As p2 NewPoints(_x-1,_y-1)
    For y As Long=0 To (_y)-1
        For x As Long=0 To (_x)-1
            _point(x,y,col)
            NewPoints(x,y)=Type<p2>(x,y,col)
        Next x
    Next y
    Dim As Ulong averagecolour
    Dim As Long ar,ag,ab
    Dim As Long xmin,xmax,ymin,ymax,inc
    Imageinfo im,,,,pitch,row
    For y As Long=0 To _y-1
        For x As Long=0 To _x-1
            average()
            ppset((NewPoints(x,y).x),(NewPoints(x,y).y),averagecolour)
        Next x
    Next y
    If destroy Then Imagedestroy tim: tim = 0
    Function= im
End Function

' Recursive quicksort of an array of pixel values.
Sub sort(array() As Ulong,begin As Long,Finish As Long)
  Dim As Long i=begin,j=finish
  Dim As Ulong x =array(((I+J)\2))
  While I <= J
    While array(I) < X :I+=1:Wend
      While array(J) > X :J-=1:Wend
        If I<=J Then Swap array(I),array(J): I+=1:J-=1
      Wend
      If J >begin Then sort(array(),begin,J)
      If I <Finish Then sort(array(),I,Finish)
    End Sub
   
    ' Sort both pixel arrays, then count the positions where all three colour
    ' channels match within the tolerance; return the matching fraction (0..1).
    Function compare(a1() As Ulong,a2() As Ulong) As Double
      sort(a1(),Lbound(a1),Ubound(a1))
      sort(a2(),Lbound(a2),Ubound(a2))
      Dim As Double x
      For n As Long=Lbound(a1) To Ubound(a1)
        if abs(cast(ubyte ptr,@a1(n))[0]-cast(ubyte ptr,@a2(n))[0]) <tolerance andalso _
           abs(cast(ubyte ptr,@a1(n))[1]-cast(ubyte ptr,@a2(n))[1]) <tolerance andalso _
           abs(cast(ubyte ptr,@a1(n))[2]-cast(ubyte ptr,@a2(n))[2]) <tolerance then  x+=1
      Next n
      Return x/(Ubound(a1)-Lbound(a1)+1)
    End Function
   
'=========================================================
    Dim As Any Ptr i1=Imagecreate(512, 500,Rgb(0,100,255))
    Dim As Any Ptr i2=Imagecreate(512, 500,Rgb(0,100,255))
   
    For n As Long=1 To 100000
      Var x=Rnd*512,y=Rnd*500,r=2+Rnd*5,c=Rnd*Rgb(255,255,255)
      Circle i1,(x,y),r,c,,,,f
      Circle i2,(x-1,y-1),r,c,,,,f ' <<<  slight difference
    Next n
   i1=filter(i1,10)
   i2=filter(i2,10)
    Put(0,0),i1,Pset
    Put(512,0),i2,Pset
    Bsave "bitmap1.bmp",i1
    Bsave "bitmap2.bmp",i2
   
    Dim As Long w,h
    getsize "bitmap1.bmp",w,h  'same as bitmap2
    Redim As Ulong a1((w+1)*(h+1)),a2((w+1)*(h+1))
    Bload "bitmap1.bmp",@a1(0)
    Bload "bitmap2.bmp",@a2(0)
   
    Locate 36
    Print compare(a1(),a2()),"(0 is completely different, 1 is exactly the same)"
    Sleep
   
    Kill "bitmap1.bmp"
    Kill "bitmap2.bmp"
   
     

Sorry it is not very hi tech.
Provoni
Posts: 380
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: Duplicates

Postby Provoni » Jun 20, 2020 7:29

Can't keep up with all the information, but thanks a lot!

marcov wrote:p.p.s. when replying please specify if this is for a one-off case, or that you have to do this e.g. every month/year etc for new measurement data or so.

Maybe about 1-3 times a year. I am creating letter n-gram frequencies for my solver http://zodiackillersite.com/viewtopic.php?f=81&t=3198 and want to test whether removing redundancy from the corpus improves things. So for now it is just a test.

Thanks for your rundown marcov, it helps me a lot. The file is only 0.5 TB, has 4,143,722,588 lines, and the longest line has a length of 28,409.

I don't know enough about programming to put people's examples into practice, so I will try something simple such as the following:

Have a 1-dimensional CRC32 array (2^32 entries). For each line, calculate the CRC32 with the following code https://rosettacode.org/wiki/CRC-32#FreeBASIC and mark the CRC32 value in that array. If the CRC32 already exists, output the line to a collision file; if not, output it to the non-dupe file. Still have to see about the collision file afterwards, and I am curious about its size. I may be able to do a second run on the collision file while reading the lines in a slightly different manner, so that the CRC32 is different each time.
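
A hedged sketch of that marking pass, using one bit per possible CRC32 value (2^32 bits = 512 MiB, which fits easily in 64 GB RAM). The Crc32 below is a plain bit-by-bit version of the standard CRC-32 (the linked Rosetta Code page computes the same CRC); file names are placeholders, and colliding lines still need a real compare afterwards:

Code:

Function Crc32(s As String) As Ulong
    ' Bit-by-bit standard (reflected) CRC-32.
    Dim As Ulong crc = &hFFFFFFFF
    For i As Integer = 0 To Len(s) - 1
        crc Xor= s[i]
        For k As Integer = 1 To 8
            If crc And 1 Then
                crc = (crc Shr 1) Xor &hEDB88320
            Else
                crc Shr= 1
            End If
        Next
    Next
    Return Not crc
End Function

Dim Shared As Ubyte Ptr seen
seen = Callocate(512 * 1024 * 1024)      ' 2^32 bits, zero-initialised

Dim As String textline
Dim As Integer fin   = Freefile : Open "corpus.txt"    For Input  As #fin
Dim As Integer funiq = Freefile : Open "nondupe.txt"   For Output As #funiq
Dim As Integer fcoll = Freefile : Open "collision.txt" For Output As #fcoll

While Not Eof(fin)
    Line Input #fin, textline
    Dim As Ulong c    = Crc32(textline)
    Dim As Ulong byt  = c Shr 3           ' which byte of the bit array
    Dim As Ubyte mask = 1 Shl (c And 7)   ' which bit within that byte
    If (seen[byt] And mask) Then
        Print #fcoll, textline            ' CRC32 seen before: possible duplicate
    Else
        seen[byt] Or= mask
        Print #funiq, textline
    End If
Wend
Close
Deallocate(seen)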
Provoni
Posts: 380
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: Duplicates

Postby Provoni » Jun 20, 2020 7:53

Can anyone make a CRC-n? For example CRC33 or CRC35.
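
There is no standard CRC-33 or CRC-35, but a generic bit-by-bit CRC of any width from 8 to 64 bits can be written the same way as CRC-32; only an n-bit polynomial and a mask are needed. A hedged sketch, with an arbitrary, unvetted example polynomial:

Code:

Function CrcN(s As String, nbits As Integer, poly As Ulongint) As Ulongint
    ' Generic MSB-first CRC of nbits width (8..64) over a string.
    Dim As Ulongint topbit = Culngint(1) Shl (nbits - 1)
    Dim As Ulongint mask
    If nbits = 64 Then
        mask = Not Culngint(0)
    Else
        mask = (Culngint(1) Shl nbits) - 1
    End If
    Dim As Ulongint crc = mask               ' start from all ones, as CRC-32 does
    For i As Integer = 0 To Len(s) - 1
        crc Xor= Culngint(s[i]) Shl (nbits - 8)
        For k As Integer = 1 To 8
            If crc And topbit Then
                crc = ((crc Shl 1) Xor poly) And mask
            Else
                crc = (crc Shl 1) And mask
            End If
        Next
    Next
    Return crc Xor mask                      ' final inversion, as CRC-32 does
End Function

' Example: a 33-bit CRC with an arbitrary, unvetted polynomial.
Print Hex(CrcN("test line", 33, Culngint(&h123456789)))
Sleep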
