yamasita taisuke
Find duplicate files

Duplicate files in this case are files with different names but the same contents.
How do we find them?

TL;DR

  • In most cases, fdupes works well.
  • There is a way to cut the process short (at the cost of exactness).

How to find duplicate files

First, consider files like the following:

$ ls -l
total 20
-rw-r--r-- 1 root root    1 May  6 00:53 a.dat
-rw-r--r-- 1 root root    2 May  6 00:53 b.dat
-rw-r--r-- 1 root root 1024 May  6 01:00 c.dat
-rw-r--r-- 1 root root 1024 May  6 01:00 d.dat
-rw-r--r-- 1 root root 1024 May  6 01:00 e.dat

For a small set of files like this, you could simply hash every file as shown below, but let's try to be a bit more clever.
Any hash function (MD5, SHA-1, etc.) will do; here we use the lightweight MD5, the same as fdupes.

$ md5sum * | perl -nalE 'push(@{$u{$F[0]}},$F[1])}{for(keys %u){say"@{$u{$_}}"if@{$u{$_}}>1}'
c.dat e.dat
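The one-liner above is terse, so here is the same idea as a minimal Python sketch (my own illustration, not the post's actual code; `find_duplicates` is a hypothetical name): group files by the MD5 of their full contents and report groups with more than one member.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(paths):
    """Group files by the MD5 of their full contents; return groups with >1 file."""
    groups = defaultdict(list)
    for p in paths:
        digest = hashlib.md5(Path(p).read_bytes()).hexdigest()
        groups[digest].append(str(p))
    return [names for names in groups.values() if len(names) > 1]
```

This reads every file in full, which is exactly the cost the rest of the post tries to avoid.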

For a.dat and b.dat, the file size is 1 byte and 2 bytes, respectively, so we know they are unique without any calculation.
Next, before hashing the full contents of c.dat, d.dat, and e.dat, let's hash just their first 10 bytes.

$ head -c 10 c.dat | md5sum
f6513d74de14a6c81ef2e7c1c1de4ab1  -
$ head -c 10 d.dat | md5sum
660f54f5a7658cbf1462b2a91fbe7487  -
$ head -c 10 e.dat | md5sum
f6513d74de14a6c81ef2e7c1c1de4ab1  -

This tells us that d.dat is unique without reading it in full!
So it can be faster to hash and compare the first X bytes before hashing the whole file.

For example, if c.dat and d.dat are 10 GB files, computing their full hashes requires reading 20 GB.
If they differ within the first 10 bytes, only 20 bytes need to be read, which is much faster.
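The head-first comparison described above can be sketched as follows (a minimal illustration, not the post's tool; `probably_duplicate` and the 10-byte `HEAD` value are my own choices):

```python
import hashlib

HEAD = 10  # bytes to compare first (the post uses 10; fdupes uses 4096)

def head_hash(path, n=HEAD):
    """Hash only the first n bytes of a file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(n)).hexdigest()

def full_file_hash(path):
    """Hash the entire file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def probably_duplicate(a, b):
    # Cheap pre-check: if the first HEAD bytes differ, the files cannot match.
    if head_hash(a) != head_hash(b):
        return False
    # Heads match, so fall back to hashing the whole files.
    return full_file_hash(a) == full_file_hash(b)
```

For two large files that differ early, this reads only 2 × HEAD bytes instead of both files in full.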

fdupes takes a similar approach: it compares the first 4096 bytes first.

If files differ within those first 4096 bytes, they are ruled out quickly.


However, if two files are the same size and their first ~100 KB match, there is a high probability that they are identical.
I didn't need that level of accuracy; I wanted fast output, so I wrote my own tool.

fdupes appears to have no option to stop processing after comparing only the first X bytes.

https://github.com/adrianlopezroche/fdupes/issues/79

jdupes has an option (-TT) to declare files duplicates based only on their first 4096 bytes, which fdupes cannot do.
However, in my environment there were several distinct files whose first 4096 bytes matched.

Implementation

https://github.com/yaasita/chofuku

The tool uses SQLite's in-memory mode.
First, it records each file's name and size.
Hard links are not handled this time; to support them, you could record each file's inode number and mark files sharing an inode as identical without any hash calculation.
Symbolic links are skipped.
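The hard-link idea above can be sketched like this (my own illustration; `inode_key` is a hypothetical name): two directory entries that share a (device, inode) pair are the same underlying file, so no hashing is needed to call them identical.

```python
import os

def inode_key(path):
    """Identify the underlying file: hard links share (device, inode)."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

def is_symlink(path):
    """Symbolic links are skipped rather than followed."""
    return os.path.islink(path)
```

Grouping paths by `inode_key` before hashing would collapse hard-linked names into one entry for free.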

SELECT * FROM files;
| name  | size    | head100k_hash | full_hash |
| ----- | ------- | ------------- | --------- |
| a.dat | 1       |               |           |
| b.dat | 2       |               |           |
| c.dat | 1024    |               |           |
| d.dat | 1024    |               |           |
| e.dat | 1024    |               |           |
| f.dat | 1150976 |               |           |
| g.dat | 1150976 |               |           |
| h.dat | 1150976 |               |           |

Count duplicates with the following SQL (the same query is used at every stage to list duplicate groups):

SELECT size, head100k_hash, full_hash, json_group_array(name) FROM files GROUP BY size, head100k_hash, full_hash HAVING count(*) > 1;
| size    | head100k_hash | full_hash | json_group_array(name)    |
| ------- | ------------- | --------- | ------------------------- |
| 1024    |               |           | ["c.dat","d.dat","e.dat"] |
| 1150976 |               |           | ["f.dat","g.dat","h.dat"] |

This gives the groups of files that share the same size.
If the -size-only option is specified, the process stops here.

The next step is to hash the first 100 KB of the files that share a size.
Zero-byte files are not hashed.
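This head-100 KB step might look like the following (a sketch under my own assumptions, not the tool's actual code; `head100k_hash` is a hypothetical name):

```python
import hashlib

HEAD = 100 * 1024  # 100 KB

def head100k_hash(path):
    """MD5 of the first 100 KB of a file; zero-byte files are not hashed."""
    with open(path, "rb") as f:
        data = f.read(HEAD)
    return hashlib.md5(data).hexdigest() if data else None
```

For files of 100 KB or less, this hash already covers the entire contents.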

| name  | size    | head100k_hash                    | full_hash |
| ----- | ------- | -------------------------------- | --------- |
| a.dat | 1       |                                  |           |
| b.dat | 2       |                                  |           |
| c.dat | 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |           |
| d.dat | 1024    | fc65c6cb47f6eed0aa6a34448a8bfcec |           |
| e.dat | 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |           |
| f.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 |           |
| g.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 |           |
| h.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 |           |

Counting duplicates.

SELECT size, head100k_hash, full_hash, json_group_array(name) FROM files GROUP BY size, head100k_hash, full_hash HAVING count(*) > 1;
| size    | head100k_hash                    | full_hash | json_group_array(name)    |
| ------- | -------------------------------- | --------- | ------------------------- |
| 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |           | ["c.dat","e.dat"]         |
| 1150976 | 595cd4e40357324cec2737e067d582b1 |           | ["f.dat","g.dat","h.dat"] |

If the -100k-only option is specified, the process stops here.

For files that still appear to be duplicates, compute the hash of the entire file.

Files smaller than 100 KB are skipped, since their head-100 KB hash already covers the whole file.
The hashing step could be parallelized, but you would need to stay within the open-file limit (the value of ulimit -n).
Here it is done sequentially.
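One way to parallelize while respecting the open-file limit is to bound the worker count, since each worker holds at most one file open at a time (a sketch under my own assumptions; `hash_all` and `MAX_OPEN` are hypothetical names, not options of the tool):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

MAX_OPEN = 8  # keep well below `ulimit -n`

def full_hash(path):
    """MD5 of the entire file, read in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_all(paths):
    # Each worker opens at most one file at a time, so at most
    # MAX_OPEN descriptors are used for hashing concurrently.
    with ThreadPoolExecutor(max_workers=MAX_OPEN) as pool:
        return dict(pool.map(full_hash, paths))
```

Hashing is I/O-bound, so threads help here despite Python's GIL; a chunked read also keeps memory flat for multi-gigabyte files.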

| name  | size    | head100k_hash                    | full_hash                        |
| ----- | ------- | -------------------------------- | -------------------------------- |
| a.dat | 1       |                                  |                                  |
| b.dat | 2       |                                  |                                  |
| c.dat | 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |                                  |
| d.dat | 1024    | fc65c6cb47f6eed0aa6a34448a8bfcec |                                  |
| e.dat | 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |                                  |
| f.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | ca2e51ae14747a1f1f0dcb81e982c287 |
| g.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | 067d1eed705e0f7756ceb37a10462665 |
| h.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | ca2e51ae14747a1f1f0dcb81e982c287 |

Count and output duplicates.

SELECT size, head100k_hash, full_hash, json_group_array(name) FROM files GROUP BY size, head100k_hash, full_hash HAVING count(*) > 1;
| size    | head100k_hash                    | full_hash                        | json_group_array(name) |
| ------- | -------------------------------- | -------------------------------- | ---------------------- |
| 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |                                  | ["c.dat","e.dat"]      |
| 1150976 | 595cd4e40357324cec2737e067d582b1 | ca2e51ae14747a1f1f0dcb81e982c287 | ["f.dat","h.dat"]      |

If you intend to hash every file in full anyway, fdupes is much faster.

Unless you specify the -100k-only or -size-only option, there is no point in using this program.

Conclusion

Even though hash calculation was cut off at 100 KB, the results matched those of fdupes.

However, this holds only for my environment.
There may be files out there whose first 100 KB is a common shared header.

It's best to choose the method that suits the situation.
Most of the time, fdupes is good enough.

You could also rely on a file system with built-in deduplication.

IPFS, for example, can detect identical content across the entire network.
