Gyula Lakatos

How I archived 100 million PDF documents... - Part 1: Reasons & Beginning

"How I archived 100 million PDF documents..." is a series about my experiences with data collection and archival while working on the Library of Alexandria project. My local instance just hit 100 million documents. It's a good time to pop a 🍾 and remember how I got here.

The beginning

On a Friday night, after work, most people watch football, go to the gym, or do something useful with their lives. Not everyone though. I was an exception to this rule. As an introvert, I spent the last part of my day sitting in my room, reading an utterly boring-sounding book called "Moral letters to Lucilius". It was written by some old dude about two thousand years ago. Definitely not the most fun-sounding book for a Friday night. However, after reading it for about an hour, I realized that the title might be boring, but the contents are almost literally gold. Too bad that only a couple of these books withstood the test of time.

Good ole' Seneca
(image thanks to Matas Petrikas)

After a quick Google search, I figured out that less than 1% of ancient texts survived to the modern day. This unfortunate fact was my inspiration to start working on an ambitious web crawling and archival project called the Library of Alexandria.

But how?

At this point, I had a couple (more like a dozen) failed projects under my belt, so I was not too keen to start working on a new one. I had to motivate myself. After I set the target of saving as many documents as possible, I wanted to have a more tangible but still hard-to-achieve goal. I set 100 million documents as my initial goal and a billion documents as my ultimate target. Oh, how naive I was.

The next day, after waking up, I immediately started typing on my old and trusty PC. Because my programming knowledge is very T-shaped and centered around Java, the language of choice for this project was settled right away. Also, because I like to create small prototypes to understand the problem I need to solve, I started with one.

The goal of the prototype was simple. I "just" wanted to download 10,000 documents to understand how hard it is to collect and kind of archive them. The immediate problem was that I didn't know where I could get links for this many files. Sitemaps can be useful in similar scenarios. However, there are a couple of reasons why they are not really a viable solution in this case. Most of the time they don't contain links to the documents, or at least not to all of them. Also, I would need a list of domains to download the sitemaps for, and so on. My first thought was that this is a lot of hassle and there must be an easier way. This is when the Common Crawl project came into view.

Common Crawl

Common Crawl is a project that publishes hundreds of terabytes of HTML source code from the websites it crawls. A new set of crawl data is released roughly every month. This is how one of the release announcements reads:

The crawl archive for July/August 2021 is now available! The data was crawled July 23 – August 6 and contains 3.15 billion web pages or 360 TiB of uncompressed content. It includes page captures of 1 billion new URLs, not visited in any of our prior crawls.

Tiny little datasets...

It sounded exactly like the data that I needed. There was just one thing left to do: grab the files and parse them with an HTML parser. This was when I realized that no matter what I did, it was not going to be an easy ride. When I downloaded the first entry provided by the Common Crawl project, I noticed that it was saved in a strange file format called WARC.

I found one Java library on GitHub (thanks Mixnode) that was able to read these files. Unfortunately, it had not been maintained for a couple of years. I picked it up and forked it to make it a little easier to use. (A couple of years later this repo was moved under the Bottomless Archive project as well.)
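For readers who have never seen the format: a WARC file is a sequence of records, and each record is a version line, a block of "Name: value" headers, an empty line, and then a payload whose size is given by the Content-Length header. The sketch below reads one such record using only the standard library. It is not the API of the forked library mentioned above; the class and method names are hypothetical, and it assumes the stream is already uncompressed and that header names use the exact casing shown.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only, not the forked library's API.
class WarcRecordSketch {
    final Map<String, String> headers = new LinkedHashMap<>();
    byte[] payload = new byte[0];

    // Reads one uncompressed record: version line, header lines,
    // an empty line, then exactly Content-Length bytes of payload.
    static WarcRecordSketch read(InputStream in) throws IOException {
        String version = readLine(in);
        if (version == null || !version.startsWith("WARC/")) {
            throw new IOException("Not a WARC record: " + version);
        }
        WarcRecordSketch record = new WarcRecordSketch();
        String line;
        while ((line = readLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                // Split on the first colon only, so URLs in values stay intact.
                record.headers.put(line.substring(0, colon).trim(),
                        line.substring(colon + 1).trim());
            }
        }
        // Simplification: assumes the payload fits into an int-sized array.
        int length = Integer.parseInt(record.headers.getOrDefault("Content-Length", "0"));
        record.payload = in.readNBytes(length);
        return record;
    }

    // Reads bytes up to the next line feed, dropping carriage returns;
    // returns null when the stream is already at its end.
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') {
                break;
            }
            if (b != '\r') {
                buffer.write(b);
            }
        }
        return sawAny ? buffer.toString(StandardCharsets.UTF_8) : null;
    }
}
```

The Common Crawl files are distributed gzip-compressed, so a real reader has to decompress the stream first and then loop over many records; a maintained library takes care of those details.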

Finally, at this point, I was able to go through a bunch of webpages (parsing them in the process with jsoup), grab all the links that pointed to PDF files based on the file extension, then download them. Unsurprisingly, most of the pages (~60-80%) ended up being unavailable (404 Not Found and friends). After a quick cup of coffee, I had the 10,000 documents on my hard drive. This is when I realized that I had one more problem to solve.
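The extension-based filtering step can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: it presumes the href values were already pulled out of the page by an HTML parser, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only.
class PdfLinkFilter {

    // Resolves a (possibly relative) href against the page URL and keeps it
    // only if it is an http(s) link whose path ends with ".pdf".
    static Optional<URI> toPdfLink(String pageUrl, String href) {
        try {
            URI resolved = new URI(pageUrl).resolve(new URI(href.trim()));
            String scheme = resolved.getScheme();
            String path = resolved.getPath(); // null for opaque URIs such as mailto:
            boolean http = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            boolean pdf = path != null && path.toLowerCase(Locale.ROOT).endsWith(".pdf");
            return http && pdf ? Optional.of(resolved) : Optional.empty();
        } catch (URISyntaxException e) {
            // Malformed links are common in crawled HTML, so they are skipped.
            return Optional.empty();
        }
    }

    // Filters a list of hrefs down to distinct, absolute PDF links.
    static List<URI> pdfLinks(String pageUrl, List<String> hrefs) {
        return hrefs.stream()
                .map(href -> toPdfLink(pageUrl, href))
                .flatMap(Optional::stream)
                .distinct()
                .collect(Collectors.toList());
    }
}
```

Checking the path rather than the whole URL means a link like `/docs/a.PDF?x=1` still counts, while the query string is ignored. Links that serve PDFs without a `.pdf` extension are missed by design.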

Unboxing & validation

So, when I started to view the documents, a lot of them simply failed to open. I had to look around for a library that could verify PDF documents. I had some experience with PDFBox in the past, so it seemed to be a good go-to solution. It had no way to verify documents by default, but it could open and parse them, and that was enough to filter out the broken ones. It felt a little bit strange to read the whole PDF into memory just to verify whether it is correct or not, but hey, I needed a simple fix for now and it worked really well.
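A much cheaper check can run before the full parse: a PDF file is expected to start with the bytes `%PDF-`. The sketch below is a different, complementary technique to the PDFBox parse described above, not a replacement for it. It can only rule out obvious non-PDFs (for example an HTML error page saved with a .pdf extension), and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical pre-filter for illustration only; it does not validate the PDF structure.
class PdfHeaderCheck {
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // True if the given bytes begin with the "%PDF-" marker.
    static boolean startsWithPdfHeader(byte[] firstBytes) {
        return firstBytes.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(firstBytes, MAGIC.length), MAGIC);
    }

    // Reads only the first few bytes of the file instead of loading all of it.
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return startsWithPdfHeader(in.readNBytes(MAGIC.length));
        }
    }
}
```

A file that passes this check can still be truncated or corrupt, so a full parse is still needed to catch those cases.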

Literally, half of the internet.

After doing a re-run, I concluded that 10,000 perfectly valid documents fit in around 1.5 GB of space. That's not too bad, I thought. Let's crawl more, because it sounds like a lot of fun. I left my PC running for about half an hour, just to test the app a bit more.
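For a rough sense of scale, the sample numbers can be extrapolated to the 100 million target. The sketch below assumes that the average document size of the sample holds at scale and that "GB" means decimal gigabytes; both are assumptions made here for illustration, not measurements, and the class name is hypothetical.

```java
// Hypothetical back-of-the-envelope helper; the inputs are assumptions, not measurements.
class StorageEstimate {

    // Average size of one document in bytes, based on a measured sample.
    static double averageBytesPerDocument(double sampleBytes, long sampleDocuments) {
        return sampleBytes / sampleDocuments;
    }

    // Linear extrapolation to a target document count, in decimal terabytes.
    static double estimateTerabytes(double sampleBytes, long sampleDocuments, long targetDocuments) {
        return averageBytesPerDocument(sampleBytes, sampleDocuments) * targetDocuments / 1e12;
    }
}
```

With 1.5 GB (`1.5e9` bytes) for 10,000 documents, this works out to about 150 KB per document, and `StorageEstimate.estimateTerabytes(1.5e9, 10_000, 100_000_000L)` gives roughly 15 TB for 100 million documents.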

Part two will describe how further challenges were solved, such as parallelizing the download requests, splitting up the application, making the documents searchable, and adding a web user interface.

Top comments (3)

Hung Vu (@hunghvu)

As you are crawling information from the web like this, are there any security implications? Malicious files, for example.

Gyula Lakatos (@laxika) • Edited

Yep, as @codingchili says. While developing, I run the application in a Windows environment, and sometimes Windows Defender starts nagging me about malicious PDF files being downloaded. As long as you don't open them it is not a problem, but they should be opened carefully.

In the next parts of the series I'll show pictures of the web UI. I plan on adding a way in there to show the source of the PDFs, so the user can at least verify the domain a document was crawled from.

Also, it is in my long-term plan to modify the contents of the PDF files before moving them to the archive, for example by removing the script parts of the files. That would lower the risk of malware infection significantly, and I feel that this would be the ultimate solution to the malware problem.

Robin Duda (they/them) (@codingchili)

Definitely. You are parsing an ocean of untrusted HTML, and PDF files are notorious for carrying malware. Running in a sandbox is a good first step.
