yamasita taisuke
Find duplicate files

Duplicate files in this case are files with different names but the same contents.
How do we find them?

TL;DR

  • In most cases, fdupes works well.
  • There is a way to cut the process short (at the cost of exactness).

How to find duplicate files

First, consider files like the following:

$ ls -l
total 20
-rw-r--r-- 1 root root    1 May  6 00:53 a.dat
-rw-r--r-- 1 root root    2 May  6 00:53 b.dat
-rw-r--r-- 1 root root 1024 May  6 01:00 c.dat
-rw-r--r-- 1 root root 1024 May  6 01:00 d.dat
-rw-r--r-- 1 root root 1024 May  6 01:00 e.dat

For a small set of files like this, you could simply hash every file as shown below, but let's try to be a bit more clever.
Any hash function (MD5, SHA-1, etc.) will do; here we use the lightweight MD5, the same as fdupes.

$ md5sum * | perl -nalE 'push(@{$u{$F[0]}},$F[1])}{for(keys %u){say"@{$u{$_}}"if@{$u{$_}}>1}'
c.dat e.dat
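The one-liner above is terse, so here is the same idea as a minimal Python sketch (my own illustration, not the post's actual code; `find_duplicates` is a hypothetical name): group files by the MD5 of their full contents and report groups with more than one member.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(paths):
    """Group files by the MD5 of their full contents; return groups with >1 file."""
    groups = defaultdict(list)
    for p in paths:
        digest = hashlib.md5(Path(p).read_bytes()).hexdigest()
        groups[digest].append(str(p))
    return [names for names in groups.values() if len(names) > 1]
```

This reads every file in full, which is exactly the cost the rest of the post tries to avoid.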

For a.dat and b.dat, the file size is 1 byte and 2 bytes, respectively, so we know they are unique without any calculation.
Next, before hashing the full contents of c.dat, d.dat, and e.dat, let's hash just their first 10 bytes.

$ head -c 10 c.dat | md5sum
f6513d74de14a6c81ef2e7c1c1de4ab1  -
$ head -c 10 d.dat | md5sum
660f54f5a7658cbf1462b2a91fbe7487  -
$ head -c 10 e.dat | md5sum
f6513d74de14a6c81ef2e7c1c1de4ab1  -

This tells us that d.dat is unique without reading it in full!
So it can be faster to hash and compare the first X bytes before hashing the whole file.

For example, if c.dat and d.dat are 10 GB files, computing their full hashes requires reading 20 GB.
If they differ within the first 10 bytes, only 20 bytes need to be read, which is much faster.
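The head-first comparison described above can be sketched as follows (a minimal illustration, not the post's tool; `probably_duplicate` and the 10-byte `HEAD` value are my own choices):

```python
import hashlib

HEAD = 10  # bytes to compare first (the post uses 10; fdupes uses 4096)

def head_hash(path, n=HEAD):
    """Hash only the first n bytes of a file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read(n)).hexdigest()

def full_file_hash(path):
    """Hash the entire file."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def probably_duplicate(a, b):
    # Cheap pre-check: if the first HEAD bytes differ, the files cannot match.
    if head_hash(a) != head_hash(b):
        return False
    # Heads match, so fall back to hashing the whole files.
    return full_file_hash(a) == full_file_hash(b)
```

For two large files that differ early, this reads only 2 × HEAD bytes instead of both files in full.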

fdupes takes a similar approach: it compares the first 4096 bytes first.

If files differ within those first 4096 bytes, they are ruled out quickly.


However, if two files are the same size and their first ~100 KB match, there is a high probability that they are identical.
I didn't need that level of accuracy; I wanted fast output, so I wrote my own tool.

fdupes appears to have no option to stop processing after comparing only the first X bytes.

https://github.com/adrianlopezroche/fdupes/issues/79

jdupes has an option (-TT) to declare files duplicates based only on their first 4096 bytes, which fdupes cannot do.
However, in my environment there were several distinct files whose first 4096 bytes matched.

Implementation

https://github.com/yaasita/chofuku

The tool uses SQLite's in-memory mode.
First, it records each file's name and size.
Hard links are not handled this time; to support them, you could record each file's inode number and mark files sharing an inode as identical without any hash calculation.
Symbolic links are skipped.
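The hard-link idea above can be sketched like this (my own illustration; `inode_key` is a hypothetical name): two directory entries that share a (device, inode) pair are the same underlying file, so no hashing is needed to call them identical.

```python
import os

def inode_key(path):
    """Identify the underlying file: hard links share (device, inode)."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

def is_symlink(path):
    """Symbolic links are skipped rather than followed."""
    return os.path.islink(path)
```

Grouping paths by `inode_key` before hashing would collapse hard-linked names into one entry for free.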

SELECT * FROM files;
| name  | size    | head100k_hash | full_hash |
| ----- | ------- | ------------- | --------- |
| a.dat | 1       |               |           |
| b.dat | 2       |               |           |
| c.dat | 1024    |               |           |
| d.dat | 1024    |               |           |
| e.dat | 1024    |               |           |
| f.dat | 1150976 |               |           |
| g.dat | 1150976 |               |           |
| h.dat | 1150976 |               |           |

Count duplicates with the following SQL (the same query is used at every stage to list duplicate groups):

SELECT size, head100k_hash, full_hash, json_group_array(name) FROM files GROUP BY size, head100k_hash, full_hash HAVING count(*) > 1;
| size    | head100k_hash | full_hash | json_group_array(name)    |
| ------- | ------------- | --------- | ------------------------- |
| 1024    |               |           | ["c.dat","d.dat","e.dat"] |
| 1150976 |               |           | ["f.dat","g.dat","h.dat"] |

This gives the groups of files that share the same size.
If the -size-only option is specified, the process stops here.

The next step is to hash the first 100 KB of the files that share a size.
Zero-byte files are not hashed.
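This head-100 KB step might look like the following (a sketch under my own assumptions, not the tool's actual code; `head100k_hash` is a hypothetical name):

```python
import hashlib

HEAD = 100 * 1024  # 100 KB

def head100k_hash(path):
    """MD5 of the first 100 KB of a file; zero-byte files are not hashed."""
    with open(path, "rb") as f:
        data = f.read(HEAD)
    return hashlib.md5(data).hexdigest() if data else None
```

For files of 100 KB or less, this hash already covers the entire contents.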

| name  | size    | head100k_hash                    | full_hash |
| ----- | ------- | -------------------------------- | --------- |
| a.dat | 1       |                                  |           |
| b.dat | 2       |                                  |           |
| c.dat | 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |           |
| d.dat | 1024    | fc65c6cb47f6eed0aa6a34448a8bfcec |           |
| e.dat | 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |           |
| f.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 |           |
| g.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 |           |
| h.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 |           |

Counting duplicates.

SELECT size, head100k_hash, full_hash, json_group_array(name) FROM files GROUP BY size, head100k_hash, full_hash HAVING count(*) > 1;
| size    | head100k_hash                    | full_hash | json_group_array(name)    |
| ------- | -------------------------------- | --------- | ------------------------- |
| 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |           | ["c.dat","e.dat"]         |
| 1150976 | 595cd4e40357324cec2737e067d582b1 |           | ["f.dat","g.dat","h.dat"] |

If the -100k-only option is specified, the process stops here.

For files that still appear to be duplicates, compute the hash of the entire file.

Files smaller than 100 KB are skipped, since their head-100 KB hash already covers the whole file.
The hashing step could be parallelized, but you would need to stay within the open-file limit (the value of ulimit -n).
Here it is done sequentially.
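One way to parallelize while respecting the open-file limit is to bound the worker count, since each worker holds at most one file open at a time (a sketch under my own assumptions; `hash_all` and `MAX_OPEN` are hypothetical names, not options of the tool):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

MAX_OPEN = 8  # keep well below `ulimit -n`

def full_hash(path):
    """MD5 of the entire file, read in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return path, h.hexdigest()

def hash_all(paths):
    # Each worker opens at most one file at a time, so at most
    # MAX_OPEN descriptors are used for hashing concurrently.
    with ThreadPoolExecutor(max_workers=MAX_OPEN) as pool:
        return dict(pool.map(full_hash, paths))
```

Hashing is I/O-bound, so threads help here despite Python's GIL; a chunked read also keeps memory flat for multi-gigabyte files.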

| name  | size    | head100k_hash                    | full_hash                        |
| ----- | ------- | -------------------------------- | -------------------------------- |
| a.dat | 1       |                                  |                                  |
| b.dat | 2       |                                  |                                  |
| c.dat | 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |                                  |
| d.dat | 1024    | fc65c6cb47f6eed0aa6a34448a8bfcec |                                  |
| e.dat | 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |                                  |
| f.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | ca2e51ae14747a1f1f0dcb81e982c287 |
| g.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | 067d1eed705e0f7756ceb37a10462665 |
| h.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | ca2e51ae14747a1f1f0dcb81e982c287 |

Count and output duplicates.

SELECT size, head100k_hash, full_hash, json_group_array(name) FROM files GROUP BY size, head100k_hash, full_hash HAVING count(*) > 1;
| size    | head100k_hash                    | full_hash                        | json_group_array(name) |
| ------- | -------------------------------- | -------------------------------- | ---------------------- |
| 1024    | 2f2bf74e24d26a2a159c4f130eec39ac |                                  | ["c.dat","e.dat"]      |
| 1150976 | 595cd4e40357324cec2737e067d582b1 | ca2e51ae14747a1f1f0dcb81e982c287 | ["f.dat","h.dat"]      |

If you intend to hash every file in full anyway, fdupes is much faster.

Unless you specify the -100k-only or -size-only option, there is no point in using this program.

Conclusion

Even though hash calculation was cut off at 100 KB, the results matched those of fdupes.

However, this holds only for my environment.
There may be files out there whose first 100 KB is a common shared header.

It's best to choose the method that suits the situation.
Most of the time, fdupes is good enough.

You could also rely on a file system with built-in deduplication.

IPFS, for example, can detect identical content across the entire network.
