DTF - Duplicate Thumbnail File

#systematic #review #literature #python

DTF

Duplicate Thumbnail File - A mathod to indetify duplicate articles when doing a Systematic review.

The problem

A systematic review could spawn countless articles, when running your search string on different data bases.
Many times articles were exported as PDF files and submited to diferent jornals or publications. The problem is, depending on how the PDF was exported, it will produce a different file, making it hard to check for duplicates by looking at their hash.

The experiment

Five articles were written and exported to PDF, each one with a small difference.

01-jpg-100.pdf - was exported as a JPG file with 100% quality.
02-lossless.pdf - was exported as a LOSSLESS file.
03-lossless-2newlines.pdf - was exported as a LOSSLESS file with two new lines after the end of the document.
04-jpg-80-2newlines.pdf - was exported as a JPG file with 80% quality and two new lines after the end of the document.
05-jpg-80.pdf - was exported as a JPG file with 100% quality.

Let's look at their hash:

Hash	File
1db6c720061004b740b87320f0d1d2a68bd9d312	01-jpg-100.pdf
c9ecd0d02f04637ab81f8f383a74929d8a9f1a40	02-lossless.pdf
dd8f35b79ded6b61f53eaeb0a9a2a7f1bf34eae6	03-lossless-2newlines.pdf
9b660130bba4c2c7b63be832aab08a48a2b2e6f5	04-jpg-80-2newlines.pdf
ffc4a3fcff380929f7c38379f5804e7bbd87ea44	05-jpg-80.pdf

According to the hash of each file, they're different from each other, when in fact, is the "same" document.

The solution

And what if instead of looking at the hash of each file, we look to their thumbnail? It might work a little better.

So a python script was written to generate the thumbnail of each file and measure how different those thumbnails ware.

Here's the result:

File 01	File 02	Difference
03-lossless-2newlines.png	02-lossless.png	0.0
03-lossless-2newlines.png	05-jpg-80.png	0.0
03-lossless-2newlines.png	01-jpg-100.png	0.0
03-lossless-2newlines.png	04-jpg-80-2newlines.png	0.0
02-lossless.png	05-jpg-80.png	0.0
02-lossless.png	01-jpg-100.png	0.0
02-lossless.png	04-jpg-80-2newlines.png	0.0
05-jpg-80.png	01-jpg-100.png	0.0
05-jpg-80.png	04-jpg-80-2newlines.png	0.0
01-jpg-100.png	04-jpg-80-2newlines.png	0.0

So, as we can see, there is 0.0 difference between them, so they must have the same content.

Conclusion

This experiment offers a diffent approach to a common problem when doing systemic review, and in this particular case, it worked better. This method could be used alongside with other methods to indentify and exclude duplicate files.

The dtf.py script could be used as base to a more robust script that compares any files that could be compared visually.

The code for the experiment is here