
Gyula Lakatos
How I archived 100 million PDF documents... - Part 3: Deduplication & Compression

"How I archived 100 million PDF documents..." is a series about my experiences with data collecting and archival while working on the Library of Alexandria project. My local instance just hit 100 million documents. It's a good time to pop a 🍾 and remember how I got here.

The first part of the series (Reasons & Beginning) is available here.
The second part of the series (Indexing, Search & UI) is available here.

The previous article discussed why I started using SQL for communication between the newly split applications. It also covered adding indexing and a basic search UI, so users can easily browse the downloaded documents.

Saving space

Soon after I collected the first couple of million documents, I realized that I would need some space to store them. A LOT of space, actually. Because of this realization, I started looking for ideas that could save as much space as possible, so I could store as many documents as possible.


While searching using the web UI, I found a couple of duplicates in the dataset. This was not too problematic in the beginning, but the more documents the program downloaded, the more duplicates I saw on the frontend. I had to do something.

My first idea was to use a Bloom filter. Initially, it felt like a good idea. I initialized a filter with an expectation of 50 million items and a 1% false positive probability. Like what fool would collect more than 50 million documents? Guess what, I ended up throwing the whole thing into the garbage bin after hitting 5 million files in a couple of weeks. Who would want to resize the Bloom filter every time a new maximum is hit? Also, a 1% false positive rate felt way too high.
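For context, a Bloom filter like the one described above can be sketched in plain Java in a few lines. This is a minimal, illustrative sketch, not the project's actual code: the class name is made up, the sizing formulas are the standard ones, and the double-hashing scheme is deliberately simplistic.

```java
import java.util.BitSet;

// Minimal Bloom filter sketch (illustrative only).
// Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m/n) * ln 2 hashes.
public class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomFilterSketch(int expectedItems, double falsePositiveRate) {
        this.numBits = (int) Math.ceil(-expectedItems * Math.log(falsePositiveRate)
                / (Math.log(2) * Math.log(2)));
        this.numHashes = Math.max(1,
                (int) Math.round((double) numBits / expectedItems * Math.log(2)));
        this.bits = new BitSet(numBits);
    }

    // Derive the i-th bit index from two base hashes (double hashing).
    // A real implementation would use a stronger hash than String.hashCode().
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16); // cheap second hash for illustration
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(String item) {
        for (int i = 0; i < numHashes; i++) bits.set(index(item, i));
    }

    // May return false positives, never false negatives.
    public boolean mightContain(String item) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(item, i))) return false;
        }
        return true;
    }
}
```

The fixed capacity passed to the constructor is exactly the pain point described above: once the item count outgrows `expectedItems`, the false positive rate climbs, and the only cure is rebuilding the filter from scratch.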

The next try was to calculate file checksums. A checksum is useful to verify file integrity after a long time in storage, but it can also be used to detect duplicates. I started with MD5 as the hash function to generate the checksums. It is well known that although MD5 is super quick, it is broken for password hashing. Still, I thought it could work for files nevertheless. Unfortunately, there is a thing called a hash collision.

After learning that MD5 can have collisions, especially if we take the Birthday problem into consideration, I wanted something better. This is when I realized that by using SHA-256, the chance of a collision was significantly lower. Luckily, my code was quite well abstracted, so it was easy to replace the MD5 generation with SHA-256. In the final duplicate detection algorithm, a document is considered a duplicate if its file type (extension), file size, and checksum are all the same. After implementing the change, I had to re-crawl all the documents, but finally without duplication.
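The checksum part of that rule needs nothing beyond the JDK's `MessageDigest`. Here is a minimal sketch; the class and method names are illustrative, and the dedup key simply concatenates the three fields the article names (extension, size, checksum):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of SHA-256 checksumming plus a simple duplicate-detection key.
public class ChecksumExample {

    // Hex-encoded SHA-256 of the given bytes.
    public static String sha256Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(data);
        StringBuilder sb = new StringBuilder(hash.length * 2);
        for (byte b : hash) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // A document counts as a duplicate when extension, size, and checksum
    // all match; combining them into one key makes the lookup a set check.
    public static String dedupKey(String extension, long sizeBytes, String checksum) {
        return extension + ":" + sizeBytes + ":" + checksum;
    }
}
```

For multi-megabyte PDFs you would stream the file through the digest (for example with `java.security.DigestInputStream`) rather than load it into memory at once.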

Hash collision probabilities.
The lighter fields in this table show the number of hashes needed to achieve the given probability of collision (column) given a hash space of a certain size in bits (row). Using the birthday analogy: the "hash space size" resembles the "available days", the "probability of collision" resembles the "probability of shared birthday", and the "required number of hashed elements" resembles the "required number of people in a group".
© Wikipedia - CC BY-SA 3.0
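The table's numbers follow from the standard birthday-bound approximation p ≈ n(n−1) / 2^(b+1), where n is the number of hashed files and b is the hash size in bits. A quick sketch (class name illustrative) shows why the move from MD5 (128-bit) to SHA-256 made the author comfortable:

```java
// Birthday-bound approximation of hash collision probability:
// p ≈ n * (n - 1) / (2 * 2^bits), valid while p is small.
public class CollisionOdds {
    public static double approxCollisionProbability(double n, int bits) {
        return n * (n - 1) / 2.0 / Math.pow(2, bits);
    }
}
```

For 100 million files, the approximation gives roughly 1.5e-23 for a 128-bit hash and something vanishingly smaller still for 256 bits, so even at this scale an *accidental* SHA-256 collision is a non-issue (MD5's real problem is deliberately crafted collisions, not random ones).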


Removing duplicates saved a lot of space, but I still kept acquiring more documents than I had space for. This made me really desperate to lower the average document size, so I came up with an easy idea to cram more documents into the same space: compressing them.

There are a couple of ways to compress a PDF document, or rather, a couple of lossless compression algorithms to choose from.

The first thing I looked into was the good old Deflate algorithm with GZIP as the file format. It had certain advantages. First of all, it was very mature and supported by almost everything, including the Java standard library (albeit later on, I switched to the Apache Commons Compress library for usability reasons). Secondly, it was very fast and had an "okay" compression ratio.
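A GZIP round trip indeed needs nothing beyond the JDK. A minimal sketch, assuming in-memory byte arrays (the Commons Compress equivalents, such as `GzipCompressorOutputStream`, are near drop-in replacements for the stream types used here):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// GZIP compression round trip using only the Java standard library.
public class GzipExample {

    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw); // closing the stream finishes the GZIP trailer
        }
        return bos.toByteArray();
    }

    public static byte[] gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        }
    }
}
```

For archiving actual PDFs you would wrap file streams instead of byte arrays, so a multi-hundred-megabyte document never has to fit in memory twice.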

GZIP was good enough for most of the time, but when I had spare CPU cycles to use, I wanted to re-compress the documents with something that had a better compression ratio. This was the time when I found out about the LZMA encoding. Unlike GZIP's Deflate (a combination of LZ77 and Huffman coding), it pairs a much larger dictionary with range coding. It is a mostly mature algorithm too, with an excellent compression ratio and good decompression speed, but abysmal compression speed. Ideal for long-term archival, especially when paired with an extensive amount of free CPU resources.

The final candidate for compression was Brotli, a relatively new algorithm that was originally intended to replace Deflate on the world wide web. It is mainly used by web servers and content delivery networks to serve web data. Unfortunately, I found just one library that supported it in Java (Brotli4J), and even that one was not a real Java rewrite but a wrapper around the native library provided by Google. It feels fairly immature, mostly because it was only released in 2015 (unlike Deflate, which dates back to 1993, and LZMA to 1998). But it provides the best compression ratio of the three by far. Unfortunately, the resource usage is very high as well; it is the slowest one on the list. Also, to function, it requires a native port for each and every operating system. A hassle to deal with.

Part four will describe how further challenges around database scalability (replacing MySQL) and application decoupling were solved.
