
Brian Onang'o

Creating a searchable knowledge-base from several PDF files

Part 1 - Exploring the Power of Bash

In the current project we are going to work with several isolated PDF files, converting them into a searchable knowledge base. We are going to see how to:

  1. Download a file from a website using a script.
  2. Scrape an html table for links.
  3. Batch download several files in parallel without running out of memory.
  4. Combine pdf files using a script.
  5. Batch rename files.
  6. Create a website from pdf files.

Before we begin, let us give a little background for the project. We have quarterly Bible study guides that have faithfully been produced every year since 1886. All of these have been digitized and are available for all years since 1888 (also available here). Like in all areas, the faithful student is he that compares the current studies with what has been done in the past. This comparison will help in understanding the subject under consideration more deeply and in seeing if there are any errors that have been introduced in the new, or if there were errors in the old. And Christ said that "every scribe instructed concerning the kingdom of heaven is like a householder who brings out of his treasure things new and old."

To be able to make use of the old guides, the student currently has to go to the index of titles, check for the guides relevant to his subject of study, download all the pdf files individually, and then read each of them. But it would be more desirable to have a searchable knowledge-base containing all these guides, which the student can search for whatever he needs at once, without having to manage 500+ individual files. It is such a system that we will be building.

In the first step we are going to:

1. Download a pdf file from a website using a script

The first lesson available is found at the following link: http://documents.adventistarchives.org/SSQ/SS18880101-01.pdf

Downloading this single file is pretty straightforward.

wget http://documents.adventistarchives.org/SSQ/SS18880101-01.pdf

But since there are over 500 files and we cannot possibly copy and paste the links one by one from the site and download them, we are going to:

2. Scrape the page for links to pdf files

All the pdf files we need are available in the directory being served at http://documents.adventistarchives.org/SSQ/. But trying to access that directory does not give a listing of the files in it, but rather redirects to a different page. We would use recursive wget to download all those files if accessing the directory gave a listing of the files it contains. But since this is not available to us, we have to use some other method.

We already know that all the files we are looking for have http://documents.adventistarchives.org/SSQ/ in their url, so that is going to form the basis for our pattern. The final command we use to extract both the url of each pdf file and the lesson title for that pdf is:

echo -n '' > files.txt && curl -s https://www.adventistarchives.org/sabbathschoollessons | grep  -o 'http://documents.adventistarchives.org/SSQ/[^.]*.pdf\"[^\>]*[^\<]*' | while read line ; do echo "$line" | sed -e 's/\"[^\>]*./ /' | sed -e 's/  \+/ /g' >> files.txt; done

grep -o will print only the matches found.
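For example, given a made-up snippet of HTML, grep -o prints just the matched url rather than the whole line:

```bash
# grep alone would print the entire matching line;
# -o prints only the part that matches the pattern, one match per line
echo '<td><a href="http://documents.adventistarchives.org/SSQ/SS18880101-01.pdf">Lesson</a></td>' \
  | grep -o 'http://documents.adventistarchives.org/SSQ/[^"]*\.pdf'
# -> http://documents.adventistarchives.org/SSQ/SS18880101-01.pdf
```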

The result is saved in files.txt. Each line has the pdf url and the lesson title for that pdf.

To see the number of files that we will be dealing with we can do wc -l files.txt. There are 516 files in total.

3. Downloading all 510 files

The files are in reality a little fewer than 516, because the list contains duplicates: it has an entry for every quarter, whereas some pdf files cover two quarters. To remove the duplicates, we use sort:

sort -u -o files.txt files.txt

To download the files, we need to get the url for each file from files.txt. We can do this as follows:

cat files.txt |while read line ; do echo "$line" | cut -f 1 -d " " |xargs wget; done

But this downloads one file at a time and is as a result very slow and inefficient.

To download the files in parallel, we use wget with the -b option so that it forks itself into the background. We first extract the urls into their own file and sort it to remove duplicated urls: files.txt contains duplicated urls but not duplicated lines, because the duplicated urls carry different lesson titles.

cut -d\  -f1  files.txt > urls && sort -u -o urls urls && cat urls |while read line ; do echo "$line" | cut -f 1 -d " " |xargs wget -b; done

We are probably very lucky to download all 510 files in 5 seconds without running out of our 1GB of memory. It would be interesting to know how this would perform in Node.
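If we were less lucky with memory, a gentler approach would be to cap the number of simultaneous downloads instead of backgrounding every wget at once. A sketch using xargs -P (the concurrency level of 8 is an arbitrary choice):

```bash
# download at most 8 files at a time instead of forking one wget per url
cut -d' ' -f1 files.txt | sort -u | xargs -n 1 -P 8 wget -q
```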

4. Combine 510 pdfs into 1

The first obvious step in building our knowledge-base is to combine all the pdfs into one. This will result in a single big pdf (over 100MB in size), but one that is much more useful for study than 510 separate pdfs, provided it can actually be opened. For this we are going to use Ghostscript:

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=allLessons.pdf *.pdf

Ghostscript quickly runs us out of memory. A better alternative for this task could be pdfunite.

pdfunite *.pdf allLessons.pdf

pdfunite gets us a bigger file, but it also runs us out of memory. Using NitroPro on Windows on a system with 8GB of RAM gives us a file of 708.4MB. It would be interesting to see if this would also be possible on a single-core, 1GB RAM system like the Linux one we are trying to force to work for us.

Let's try pdfunite on 10 files at a time. We will use GNU parallel and let it do what it can to manage the memory for us. This process takes about 20 minutes and outputs files 1.pdf ... 51.pdf which we need to further combine.

ls -lha ./ |grep SS.*pdf | grep -o '[^ ]*$' |parallel -N10 pdfunite {} {#}.pdf
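In the parallel command, {} expands to the whole group of 10 file names and {#} to the job's sequence number, which is how the batches become 1.pdf, 2.pdf and so on. A harmless way to check what will run, only echoing the commands instead of executing pdfunite:

```bash
# dry run: print the pdfunite command each batch of 10 files would produce
ls SS*.pdf | parallel -N10 echo pdfunite {} {#}.pdf
```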

5. Creating a site where the pdfs can be downloaded

We are going to do this using GitHub Pages. To do this we first create a CNAME record in our DNS zone pointing to GitHub Pages:

| CNAME   | TTL | Points to                 |
| ------- | --- | ------------------------- |
| sslpdfs | 900 | gospelsounders.github.io. |

Then we add a CNAME file with the contents sslpdfs.gospelsounders.org, commit, push the repo to GitHub, and configure the GitHub Pages settings in the GitHub repo.
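On the command line that amounts to something like the following (a sketch; the commit message is our own):

```bash
# tell GitHub Pages which custom domain the site should be served on
echo "sslpdfs.gospelsounders.org" > CNAME
git add CNAME && git commit -m "Point GitHub Pages at sslpdfs.gospelsounders.org"
```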

GIT_SSH_COMMAND='ssh -i /tmp/gitKey -o IdentitiesOnly=yes' git push origin master

Next we should create an index of the files that we have and put it in README.md. The following bash pipeline does this for us:

curl -s https://www.adventistarchives.org/sabbathschoollessons |grep -o ^[\<]td.* |sed -e 's/<td[^\>]*>//g' | sed -e 's/<sup>.*<\/sup>//g' | sed -e 's/<\/td>//g' |sed -e 's/\<a.*href="[^h][^t][^t][^p][^:][^>]*>.*<\/a><//' | sed -e 's/<a[^>]*><\/a>//g' |sed -e 's/<[^a][^>]*>//g' |grep -A 2 '^[0-9]\{4\}' | grep -v -- "^--$" |parallel -N3 echo {} | sed -e 's/\(<a[^<]*\)\(<.*\)/\1/g'| sed -e 's/<a.* href="\([^"]*\)"[^>]*>\(.*\)/[\2](\1)/' | sed -e 's/  */ /g' | sed -e 's/http:\/\/documents.adventistarchives.org\/SSQ\///' | sed -e 's/ \]/\]/g'

It produces a list in the following format:

(screenshot: index of titles)

But we would desire to have all the quarters of a year in a single row in a table having the following header:

| Year | Quarter 1 | Quarter 2 | Quarter 3 | Quarter 4 |
| ---- | --------- | --------- | --------- | --------- |

With the header already in the markdown file (README.md), the following commands create the table for us and add it to the markdown file.

curl -s https://www.adventistarchives.org/sabbathschoollessons |grep -o ^[\<]td.* |sed -e 's/<td[^\>]*>//g' | sed -e 's/<sup>.*<\/sup>//g' | sed -e 's/<\/td>//g' |sed -e 's/\<a.*href="[^h][^t][^t][^p][^:][^>]*>.*<\/a><//' | sed -e 's/<a[^>]*><\/a>//g' |sed -e 's/<[^a][^>]*>//g' |grep -A 2 '^[0-9]\{4\}' | grep -v -- "^--$" |parallel -N3 echo {} | sed -e 's/\(<a[^<]*\)\(<.*\)/\1/g'| sed -e 's/<a.* href="\([^"]*\)"[^>]*>\(.*\)/[\2](\1)/' | sed -e 's/  */ /g' | sed -e 's/http:\/\/documents.adventistarchives.org\/SSQ\///' | sed -e 's/ \]/\]/g' | parallel -N4 echo {} | sed -e 's/ [0-9]\{4\} [1-4][a-z]\{2\} / \|/g' |sed -e 's/ 1st /\| /' >> README.md 

6. Continue combining the 51 files

We first have to add a leading zero to the names of the first 9 files, then run pdfunite again in batches:

# prepend a 0 to the single-digit file names (1.pdf -> 01.pdf) so they sort correctly
rename.ul "" 0 ?.pdf

# combine the 51 intermediate files, 10 at a time, into 1-2.pdf ... 6-2.pdf
ls -lha ./ |grep " [0-9][0-9]\.pdf" | grep -o '[^ ]*$' |parallel -N10 pdfunite {} {#}-2.pdf

# combine the resulting second-stage files into the final allLessons.pdf
ls -lha ./ |grep " [0-9]-2\.pdf" | grep -o '[^ ]*$' |parallel -N6 pdfunite {} allLessons.pdf

The last step still breaks because of limited memory, so we will upload the file that we created using NitroPro to our server instead.

scp All\ Lessons.pdf user@server:/tmp/pdfs/allLessons.pdf

allLessons.pdf is larger than GitHub's 100MB file limit (it is 708.4MB), so we can't push it. We need to find a way of reducing its size, so let's try removing the images from the pdf and see the result.

gs -sDEVICE=pdfwrite -dFILTERVECTOR -dFILTERIMAGE -o allLessonsSmall.pdf "allLessons.pdf"

We are almost done, as this gives us a file that is 68.5MB. But the text is now transparent, appearing as white on a white background.

Using Evince, the default pdf viewer on Ubuntu, the transparent-text pdf can be read by selecting all the text with CTRL+A, which makes it visible. But this method seems not to work in all pdf viewers, including Foxit Reader and Adobe's pdf readers.
We will therefore convert the transparent-text pdf to html and change the color in the generated css file.

For this exercise we have chosen to use a DigitalOcean trial account to launch a 4-core, 8GB RAM, 160GB SSD server. The following command runs unbelievably fast (2m8.701s):

time pdf2htmlEX --split-pages 1 --process-nontext 0 --process-outline 0 --process-annotation 0 --process-form 0 --embed-css 0 --embed-font 0 --embed-image 0 --embed-javascript 0 --embed-outline 0 --dest-dir html allLessonsSmall.pdf

The result is 26672 files with a total size of 223MB. You can find these by running ls html |wc -l and du -h html. All text is still set to be transparent in the generated css file, so we change the color to black.
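Since the screenshots are not reproduced here: the generated stylesheet contains a rule along the lines of .fc0{color:transparent;} for the text fill color, and the fix is simply to rewrite that color. A sketch, assuming the stylesheet is written alongside the pages in the html directory:

```bash
# make the text visible by changing the transparent fill color to black
sed -i 's/color:transparent/color:black/g' html/*.css
```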

7. Converting all the pdfs to html

We will now convert the individual pdfs to html files, which will allow us to search their contents within GitHub. This is also an easy task using our free-trial 4-core server:

ls -lha ./ |grep SS.*pdf | grep -o '[^ ]*$' |parallel -N10 echo {} | sed 's/ /\n/g' | parallel -N1 pdf2htmlEX --split-pages 1 --dest-dir htmls {}

8. Adding to git, committing and pushing to GitHub

For some reason, these commands add all the files at once:

ls ./|head -n $((1*18135/5))|tail -n $((18135/5))|xargs git add

for file in $(ls ./|head -n $((1*18135/5))|tail -n $((18135/5)) ); do git add $file ; done

So we move the htmls folder out of our working folder, recreate it, copy the files back in batches, add them to git and commit. We run the following command 5 times, incrementing the multiplier (the 1 in this command) each time.

for file in $(ls ../htmls|head -n $((1*18135/5))|tail -n $((18135/5)) ); do cp "../htmls/$file" htmls/ ; done && git add . && git commit -m "added batch of files"
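Rather than editing the multiplier by hand five times, the same procedure can be wrapped in an outer loop (a sketch built on the command above, using the same batch size of 18135/5):

```bash
# copy the html files back in 5 batches, committing after each batch
for i in 1 2 3 4 5; do
  for file in $(ls ../htmls | head -n $((i*18135/5)) | tail -n $((18135/5))); do
    cp "../htmls/$file" htmls/
  done
  git add . && git commit -m "added batch $i of files"
done
```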

9. Editing the index to add links to html files

sed -i '/SS.*pdf/s/(\([^)]*\))/(htmls\/\1.html) \\| [⇩](\1)/g' README.md
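To make the substitution concrete, here is roughly what it does to a single index entry (the input line is hypothetical, just to show the format):

```bash
echo '[1888 Q1](SS18880101-01.pdf)' \
  | sed '/SS.*pdf/s/(\([^)]*\))/(htmls\/\1.html) \\| [⇩](\1)/g'
# -> [1888 Q1](htmls/SS18880101-01.pdf.html) \| [⇩](SS18880101-01.pdf)
```

Each pdf entry thus gains a link to its html version plus a ⇩ link for downloading the original pdf.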

10. Converting all the pdfs to text

In the docs folder:

ls -lha ./ |grep SS.*pdf | grep -o '[^ ]*$' |parallel -N10 echo {} | sed 's/ /\n/g' | parallel -N1 pdftotext -layout {}

ls -lha ./ |grep "[0-9].txt" | grep -o '[^ ]*$' |parallel -N10 echo {} | sed 's/ /\n/g' | parallel -N1 mv {} texts/

11. Remove transparency from htmls

The htmls contain transparent text which we wish to convert to black. In docs/htmls run:

sed -i '/fc0{color:transparent;/s/transparent/black/' *.html
sed -i '/fc0{color:transparent;/s/transparent/black/' *.css

12. Add text files to index

sed -i '/SS.*pdf/s/(\([^(]*\).pdf)/(\1.pdf) \\| [txt](texts\/\1.txt)/g' README.md
