"How I archived 100 million PDF documents..." is a series about my experiences with data collecting and archival while working on the Library of Alexandria project. My local instance just hit 100 million documents. It's a good time to pop a 🍾 and remember how I got here.
The first part of the series (Reasons & Beginning) is available here.
The previous article discussed why I started the project, how the URLs were collected using Common Crawl, and how the documents were verified to be valid. In the end, we got an application that was able to collect 10,000 valid PDF documents that can be opened with a PDF viewer.
Okay, now what?
While manually smoke-testing the application, I quickly realized that the download speed was suboptimal because the documents were processed one by one on a single thread. It was time to parallelize the document handling logic, but first I needed to synchronize the URL generation with the downloading of documents. It makes no sense to generate 10,000 URLs per second in memory while we can only visit 10 locations per second. We would just fill the memory with a bunch of URLs and get an OutOfMemoryError pretty quickly. It was time to split up the application and introduce a datastore that can act as an intermediate meeting place between the two applications. Let me introduce MySQL to you all.
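To make the memory problem concrete, here is a minimal in-process sketch of a bounded hand-off between a fast URL generator and slow downloaders. It is only an illustration of the idea: the project itself used MySQL as the meeting place, not an in-memory queue, and the class name, method names and capacity are all hypothetical.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustration only: a bounded in-memory buffer, NOT the MySQL hand-off the project used.
class BoundedHandOff {
    // Fixed capacity: the generator cannot run ahead of the downloaders by more than this.
    private final BlockingQueue<String> urls;

    BoundedHandOff(int capacity) {
        this.urls = new ArrayBlockingQueue<>(capacity);
    }

    // Generator side: waits up to timeoutMillis for a free slot, returns false if none appeared.
    boolean offer(String url, long timeoutMillis) {
        try {
            return urls.offer(url, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return false;
        }
    }

    // Downloader side: waits up to timeoutMillis for a URL, returns null if none arrived.
    String poll(long timeoutMillis) {
        try {
            return urls.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        BoundedHandOff handOff = new BoundedHandOff(2);
        System.out.println(handOff.offer("https://example.com/a.pdf", 50));
        System.out.println(handOff.offer("https://example.com/b.pdf", 50));
        // The buffer is full now, so the generator is told to back off.
        System.out.println(handOff.offer("https://example.com/c.pdf", 50));
        // A downloader takes one URL, which frees a slot again.
        System.out.println(handOff.poll(50));
    }
}
```

With an unbounded list, the third `offer` would simply succeed and memory would keep growing. A bounded buffer makes the generator wait instead, which is the same effect a shared datastore gives, with the extra benefit that the pending URLs survive a restart.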
Was splitting up the application a good idea? Absolutely! What about introducing MySQL? You can make a guess right now. What do you think, how well can MySQL handle a couple of hundred million strings in one table? Let me help you. "Super badly" is an understatement compared to how awful the performance ended up in the long term. But I didn't know that at the time, so let's proceed with the integration of said database. After the app was split into two, the newly created "document location generator" application saved the URLs into a table (with a flag that indicates whether the location has been visited or not), and the downloader application was able to visit them. Guess what? When I ran the whole app overnight, I had hundreds of thousands of documents saved by the next morning (my 500 Mbps connection was super awesome back then).
Now I had a bunch of documents. It was an awesome and inspiring feeling! This was the point when I realized that the original archiving idea could be done on a grand scale. It was good to see a couple of hundred gigabytes of documents on my hard disk, but you know what would be better? Indexing them into a search engine, then having a way to search and view them.
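The table-with-a-flag idea can be sketched roughly like this. To keep the example runnable without a database, an in-memory map stands in for the MySQL table. The class and method names are hypothetical, and two details are assumptions of the sketch rather than facts about the project: duplicate URLs are ignored, and a location is flagged as visited at the moment a downloader claims it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for the URL table; the real project kept this data in MySQL.
class LocationTable {
    // URL -> visited flag. LinkedHashMap preserves insertion order, similar to an ascending id.
    private final Map<String, Boolean> rows = new LinkedHashMap<>();

    // Generator side: store a location as unvisited. Returns false for a duplicate (assumption).
    synchronized boolean insert(String url) {
        return rows.putIfAbsent(url, Boolean.FALSE) == null;
    }

    // Downloader side: take up to `limit` unvisited locations and flag them as visited.
    synchronized List<String> claimUnvisited(int limit) {
        List<String> claimed = new ArrayList<>();
        for (Map.Entry<String, Boolean> row : rows.entrySet()) {
            if (claimed.size() >= limit) {
                break;
            }
            if (!row.getValue()) {
                row.setValue(Boolean.TRUE); // flip the flag so no other downloader gets this URL
                claimed.add(row.getKey());
            }
        }
        return claimed;
    }

    public static void main(String[] args) {
        LocationTable table = new LocationTable();
        table.insert("https://example.com/a.pdf");
        table.insert("https://example.com/b.pdf");
        table.insert("https://example.com/c.pdf");
        System.out.println(table.claimUnvisited(2)); // the first two locations
        System.out.println(table.claimUnvisited(2)); // only the remaining one
    }
}
```

The generator only ever calls `insert` and the downloader only ever calls `claimUnvisited`, so the two sides can run at completely different speeds without knowing about each other.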
Initially, I had little experience with indexing big datasets. I had used Solr a while ago (like 7 years ago lol), so my initial idea was to use that for the indexing. However, after looking around a bit longer before starting on the implementation, I found Elasticsearch. It seemed to be superior to Solr in almost every way possible (except that it was managed by a company, but whatever). The major selling point was that it was easier to integrate with. As far as I know, both of them are just wrappers around Lucene, so the performance should be fairly similar. Maybe one day it will be worthwhile to rewrite the application suite to use pure Lucene, once that is no longer premature optimization. However, until then, Elasticsearch is the name of the game.
After figuring out how indexing can be done, I immediately got to work. I extended the downloader application with code that indexed the downloaded and verified documents. Then I deleted the existing dataset to free up space and started the whole download again the next night.
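As a rough idea of what indexing a single document can look like, the sketch below builds (but does not send) a request against Elasticsearch's document index REST endpoint, `PUT /<index>/_doc/<id>`. Everything project-specific here is an assumption: the index name, the `url` and `content` field names, and the use of the plain REST API with the JDK's `java.net.http` classes (Java 11+) instead of an Elasticsearch client library.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: builds an Elasticsearch index request for one downloaded document.
class IndexRequestSketch {
    // Escapes the characters that are not allowed to appear raw inside a JSON string.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Field names "url" and "content" are assumptions made for this example.
    static String toJson(String url, String content) {
        return "{\"url\":\"" + escape(url) + "\",\"content\":\"" + escape(content) + "\"}";
    }

    // PUT <baseUrl>/<index>/_doc/<id> stores the JSON body as a document with the given id.
    static HttpRequest build(String baseUrl, String index, String id, String url, String content) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_doc/" + id))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(toJson(url, content)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://localhost:9200", "documents", "1",
                "https://example.com/a.pdf", "text extracted from the PDF");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending the request would be a matter of passing it to `HttpClient.send`. At a scale of millions of documents, batching many documents into one call through the `_bulk` endpoint is usually the better fit, but a single-document request keeps the sketch short.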
The indexing worked remarkably well, so I started to work on a web frontend that could be used to search and view documents. This was (controversially) called the Backend Application in the beginning, then I quickly renamed it to the more meaningful Web Application. I'll use that name in this article to keep things simple.
Initially, the frontend code was written in AngularJS. Why? Why did I choose an obsolete technology to create the frontend of my next dream project? Because it was something I already understood quite well and had a lot of experience with. At this stage, I just wanted to make progress with my proof of concept. Optimizations and cleanups can be done later. Also, I'm a backend guy, so the frontend code should be minimal, right? Right?
It started out as minimal, that's for sure. Also, because it only used dependencies that can be served by cdnjs, it was easy to build and integrate into a Java application.
Soon the frontend was finished and I had some time to actually search and read the documents I had collected. I remember that I wanted to search for something obscure. I was studying gardening back then, so my first search was for the lichen "Xanthoria parietina".
To my surprise, I got back around a hundred documents from a sample of 2.3 million. Some of them were quite interesting. Like, who wouldn't want to read the "Detection of polysaccharides and ultrastructural modification of the photobiont cell wall produced by two arginase isolectins from Xanthoria parietina"?
Part three will describe how further challenges around the storage of documents, like deduplication and compression, were solved.