Today, we would like to talk about the challenges and principles of getting tabular data out of PDF files. UpsilonIT's developer team conducted a study to find out which software is best for parsing PDF tables and extracting data from them.
PDF is now widely used for communication between companies, systems, and individuals. It is regarded as the standard for finalized versions of documents because it is not easily editable, except for fillable PDF forms. The most popular use cases for PDF documents in a business environment are:
- Purchase Orders
- Shipping Notes
- Price & Product Lists
- HR Forms
The sheer volume of information exchanged in PDF files means that the ability to extract data from them quickly and automatically is essential. Spending time extracting data from PDFs to enter it into third-party systems can be very costly for a company.
The main problem is that PDF was never really designed as a data input format. It was designed as an output format that ensures a document looks the same on any device and prints correctly. A PDF file contains instructions to place characters (and other components) at precise x,y coordinates. Words are simulated by placing some characters closer together than others. Spaces are simulated by placing words relatively far apart. As for tables, you guessed it: they are simulated by placing words where they would appear in a spreadsheet.
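To make the idea of "words simulated by proximity" concrete, here is a minimal Python sketch. It assumes we already have the characters of one text line together with their left and right x-coordinates. The `Glyph` class, the `group_into_words` function, and the `gap_threshold` value are hypothetical names chosen for illustration and are not taken from any of the tools discussed in this article.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x0: float  # left edge of the character
    x1: float  # right edge of the character


def group_into_words(glyphs, gap_threshold=1.0):
    """Split one line of positioned glyphs into words.

    A new word starts whenever the horizontal gap between two
    neighbouring glyphs exceeds gap_threshold (an assumed value).
    """
    words, current, prev = [], "", None
    for glyph in sorted(glyphs, key=lambda g: g.x0):
        if prev is not None and glyph.x0 - prev.x1 > gap_threshold:
            words.append(current)
            current = ""
        current += glyph.char
        prev = glyph
    if current:
        words.append(current)
    return words


# Illustrative coordinates: small gaps inside words, a large gap between them.
sample_line = [
    Glyph("T", 0.0, 5.0), Glyph("o", 5.2, 10.0), Glyph("t", 10.2, 13.0),
    Glyph("a", 13.2, 18.0), Glyph("l", 18.2, 20.0),
    Glyph("4", 40.0, 45.0), Glyph("2", 45.2, 50.0),
]
words = group_into_words(sample_line)
```

Real extractors have to deal with varying font sizes, kerning, and rotated text, so a single fixed threshold like this is only a starting point.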
We see that the PDF format has no internal representation of a table structure, which makes it difficult to extract tables for analysis. Unfortunately, a lot of open data is stored in PDFs, which were not designed for tabular data in the first place. Luckily, different tools for extracting data from PDF tables are available on the market. Although they are somewhat similar to each other, each has its own advantages and disadvantages. In this article, we compare the most popular software that can help get tabular data out of PDFs and present it in an easy-to-read, editable, and searchable format.
Before choosing a tool, the first step is to understand whether the PDF files you will work with are text-based or image-based. This determines whether you need Optical Character Recognition (OCR).
For example, a report generated by a piece of software and exported to PDF is usually a text-based PDF.
If you work with image-based PDFs, such as scanned paper documents or files captured by a digital camera, this is where OCR comes in. OCR enables machines to recognize written or printed text inside images. In other words, the technology helps to convert ‘text-as-an-image’ into an editable and searchable format.
So if your PDF is image-based, the process of data extraction consists of two tasks: recognizing the text and then recognizing the table structure (i.e., how the text is placed in rows and columns). Some tools, like Amazon Textract, can complete both of them. But since text recognition is a separate task, it can be performed independently with pure OCR tools. There are dozens of them, but in this article we will focus on table structure recognition.
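As a rough illustration of the second task, the sketch below takes words that have already been recognized (for example, by an OCR step) and groups them into rows by their vertical position. The input format, the `cluster_rows` name, and the `tolerance` value are assumptions made for this example. Real tools use considerably more robust logic.

```python
def cluster_rows(words, tolerance=3.0):
    """Group recognized words into table rows.

    words: list of (text, x, y) tuples, where y is the vertical centre
    of the word measured from the top of the page (an assumed format).
    Words whose y differs from the row's first word by at most
    `tolerance` are treated as part of the same row.
    """
    rows = []
    for text, x, y in sorted(words, key=lambda w: w[2]):
        if rows and abs(y - rows[-1]["y"]) <= tolerance:
            rows[-1]["items"].append((x, text))
        else:
            rows.append({"y": y, "items": [(x, text)]})
    # Inside each row, order the words from left to right.
    return [[text for _, text in sorted(row["items"])] for row in rows]


# Illustrative input: slightly uneven y values, as OCR output often has.
recognized = [
    ("Price", 120.0, 10.5), ("Item", 20.0, 10.0),
    ("4.50", 120.0, 30.2), ("Bolt", 20.0, 30.0),
]
table_rows = cluster_rows(recognized)
```

Multiline cells break this simple assumption, because their second line looks like a separate row. This is exactly the kind of difficulty our sample document is designed to expose.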
Let's assume that we have a text-based PDF document generated as an output by a piece of software. It contains tabular data, and we want to extract it and present it in a digital format. There are two main ways to detect tables:
- Manually, when you detect column borders by eye and mark table columns by hand
- Automatically, when you rely on program algorithms
Some tools offer either manual or automatic detection, while others combine both. In our experience, most of our clients' cases require automatic recognition because they handle hundreds of documents with a slightly variable table structure. For example, column widths can differ not only between two documents but also between pages of the same document. Marking each column in each table on each page by hand would therefore take a lot of time and effort.
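To give a feel for what automatic detection can look like, here is a simplified whitespace-based sketch. It looks for vertical strips that no word overlaps and treats them as candidate column separators. This is our own illustration and not the algorithm used by any of the tools reviewed here. The `find_column_gaps` name, the input format, and the `min_gap` value are assumptions.

```python
def find_column_gaps(word_spans, min_gap=5.0):
    """Return (start, end) x-ranges that no word overlaps.

    word_spans: list of (x0, x1) horizontal extents of every word in
    one table on one page. Gaps narrower than min_gap are ignored,
    since they are more likely spaces between words than column borders.
    """
    spans = sorted(word_spans)
    if not spans:
        return []
    gaps = []
    covered_until = spans[0][1]
    for x0, x1 in spans[1:]:
        if x0 - covered_until >= min_gap:
            gaps.append((covered_until, x0))
        covered_until = max(covered_until, x1)
    return gaps


# Illustrative spans for a three-column table.
spans = [
    (10, 40), (10, 55), (12, 30),   # first column
    (80, 120), (85, 110),           # second column
    (150, 170), (160, 175),         # third column
]
column_gaps = find_column_gaps(spans)
```

Because the gaps are recomputed from the words of each table, the same code adapts to column widths that change from page to page, which is the situation described above.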
For our study, we created a sample one-page document covering all typical difficulties of data extraction caused by an 'inconsistent' table structure. Here it is:
As you can see, our sample document contains multiple tables, all with different column widths and inconsistent alignment. In such cases, manual detection can be tedious. For this reason, we are not going to use software that offers only manual table detection. We choose the automatic approach, so let's see how good each tool's algorithms are.
In addition to the table structure's basic elements, we also deliberately included some ‘extra formatting’ that can potentially complicate the process of recognition:
- Multiple tables on a page — all of them should be detected
- Non-tabular data: headers and page number
- Multiline text inside cells
- Table column & row spanning — merged cells
- Small table cell margins inside the table and between the table and the header
From this study, you will learn how six software tools perform at parsing PDF tables and how they stack up against each other. In the first part, we compare Tabula, PDFTron, and Amazon Textract.
Let's see how the libraries and tools mentioned above coped with data recognition and extraction on our sample document.
Tabula is a tool for liberating data tables locked inside PDF files. Tool overview:
- Type of software available: web application, requires simple local server setup
- Platforms: Windows, macOS; open source (GitHub, MIT License)
- Supported output formats: CSV, TSV, JSON
- Notes: Tabula only works on text-based PDFs, not scanned documents
After uploading our sample file and parsing data from it via Tabula, we got the following output:
Zones marked in red are the parts of the original file where Tabula detected tables. At this step, recognition is correct: all tables are detected, and extraneous elements do not distort the result.
But if we go further, we will see that the first row of the first table and the last rows of the last two tables are missing:
Tabula offers two data extraction methods: Lattice and Stream. Automatic detection uses Lattice, and its result is shown above. If we try Stream, the results improve: all data is extracted correctly, without missing rows:
But using the Stream method, we face another problem: cells with multiline text are split into multiple table rows. It seems this variant is the best that we can get from Tabula.
Summary: Tabula's automatic detection method is not the best choice, so I don't recommend relying on it. Both extraction methods provided by this tool have their own disadvantages. While data loss is unacceptable in any case, split rows can be restored to their original state with additional processing, either manually or with a script.
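The kind of post-processing script mentioned in the summary could look like the Python sketch below. It relies on one strong assumption that you should verify against your own data: a 'real' row always has a value in its first column, so a row with an empty first cell is the continuation of a multiline cell above it. The `merge_continuation_rows` name and the sample rows are illustrative and are not taken from our sample document.

```python
def merge_continuation_rows(rows, key_column=0):
    """Merge rows that were split because of multiline cell text.

    Assumption: a row whose key_column cell is empty continues the
    previous row. All rows are expected to have the same length.
    """
    merged = []
    for row in rows:
        is_continuation = bool(merged) and not row[key_column].strip()
        if is_continuation:
            previous = merged[-1]
            for i, cell in enumerate(row):
                if cell.strip():
                    # Append the extra line of text to the cell above.
                    previous[i] = (previous[i] + " " + cell.strip()).strip()
        else:
            merged.append(list(row))
    return merged


# Illustrative rows, shaped like CSV output with a split multiline cell.
stream_rows = [
    ["Item", "Description", "Price"],
    ["A1", "Large widget with", "10.00"],
    ["", "extra coating", ""],
    ["B2", "Small widget", "4.50"],
]
restored = merge_continuation_rows(stream_rows)
```

If the first column can legitimately be empty in your tables, a different signal is needed, for example the vertical distance between rows.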
PDFTron is software with multiple basic and advanced features that facilitate the manipulation of PDFs. Tool overview:
- Type of software available: desktop app, web browser app, mobile app
- Platforms: Android, iOS, Windows, Linux, macOS
- Supported output formats: MS Word, SVG, HTML
After uploading our sample file and parsing data from it via PDFTron, we got the following output:
Red rectangles show the borders of the detected tables. The number of tables and the tabular data are recognized correctly, but the headers of the third and fourth tables are captured as table elements.
If we convert the original PDF into HTML format using PDFTron, we'll see that the headers named ‘CATEGORY 2’ and ‘CATEGORY 3’ are included as separate cells inside the table. There are also bugs with the merged cells of the first table: the last two columns are merged into a larger one, and the second line of text is split off into a separate cell.
Summary: In the output document, we can see a piece of non-tabular data incorrectly included in the extraction result. It's not a big problem, since it can be removed in post-processing, either manually or with a script. However, I would not recommend this tool if you have a lot of merged cells, because they might get messed up.
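As a hedged example of such a clean-up script, the sketch below removes rows that contain nothing but a section header. It assumes the table has already been converted to a list of rows, and that the unwanted headers follow the ‘CATEGORY N’ pattern seen in our sample. The `HEADER_PATTERN` regex and the `drop_header_rows` function are hypothetical and would need adjusting for other documents.

```python
import re

# Assumed header format, based on the 'CATEGORY 2' / 'CATEGORY 3' headers.
HEADER_PATTERN = re.compile(r"^CATEGORY \d+$")


def drop_header_rows(rows, pattern=HEADER_PATTERN):
    """Remove rows whose only non-empty cell matches the header pattern."""
    cleaned = []
    for row in rows:
        filled = [cell.strip() for cell in row if cell.strip()]
        if len(filled) == 1 and pattern.match(filled[0]):
            continue  # a stray header captured as a table row
        cleaned.append(row)
    return cleaned


# Illustrative rows with a header that leaked into the table.
extracted = [
    ["CATEGORY 2", "", ""],
    ["Name", "Qty", "Price"],
    ["Bolt", "10", "0.20"],
]
table_only = drop_header_rows(extracted)
```

Requiring both conditions (a single filled cell and a matching pattern) reduces the risk of deleting a genuine data row that happens to have only one value.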
Amazon Textract is a service that automatically extracts text and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Tool overview:
- Type of software available: web application
- Platforms: any modern web browser, all processing happens ‘in the Cloud’
- Supported output formats: raw JSON, JSON for each page in the document, text, key/values exported as CSV, tables exported as CSV
After uploading our sample file and parsing data from it via Amazon Textract, we got the following result:
Zones painted in gray are parts of the original file where Amazon Textract detected tables. As you can see, it didn’t recognize the first table on the page, detected two separate tables — the second and the third — as a single one, and also messed up the order of tables on the page.
Regarding the data inside the cells, the extraction result is quite satisfying with some exceptions:
- The data of the first table is missing from the extraction result
- The header string ‘CATEGORY 3’ is included as part of a table in the extraction result
Summary: Amazon Textract looks to be the least suitable tool for data extraction in this particular case. The output suffers from a large amount of data loss, a scrambled table order, and non-tabular data. The main advantage of Amazon Textract compared to all other tools is OCR, which lets you extract tabular data from images or image-based PDFs. For text-based PDFs, I would recommend choosing another tool.
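If you still need to work with Textract's raw JSON, the table order can be restored in post-processing. The sketch below assumes the response follows the documented block structure (`Blocks` with `TABLE`, `CELL`, and `WORD` entries linked through `CHILD` relationships) and sorts tables by the top edge of their bounding box. The `tables_in_page_order` function and the tiny mock response are ours, written for illustration, and are not real Textract output. The sketch cannot recover a table that Textract failed to detect.

```python
def tables_in_page_order(response):
    """Rebuild tables from a Textract-style response, top of page first."""
    blocks = {block["Id"]: block for block in response["Blocks"]}

    def children(block):
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for child_id in rel["Ids"]:
                    yield blocks[child_id]

    def cell_text(cell):
        return " ".join(
            w["Text"] for w in children(cell) if w["BlockType"] == "WORD"
        )

    tables = [b for b in response["Blocks"] if b["BlockType"] == "TABLE"]
    # Sort by page (if present), then by vertical position on the page.
    tables.sort(
        key=lambda t: (t.get("Page", 1), t["Geometry"]["BoundingBox"]["Top"])
    )

    result = []
    for table in tables:
        cells = [c for c in children(table) if c["BlockType"] == "CELL"]
        if not cells:
            continue
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for c in cells:
            # RowIndex and ColumnIndex are 1-based.
            grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = cell_text(c)
        result.append(grid)
    return result


def _rel(*ids):
    return [{"Type": "CHILD", "Ids": list(ids)}]


# Mock response (not real output): the lower table is listed first.
mock_response = {"Blocks": [
    {"Id": "t2", "BlockType": "TABLE", "Relationships": _rel("c3"),
     "Geometry": {"BoundingBox": {"Top": 0.6}}},
    {"Id": "t1", "BlockType": "TABLE", "Relationships": _rel("c1", "c2"),
     "Geometry": {"BoundingBox": {"Top": 0.1}}},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": _rel("w1")},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": _rel("w2", "w3")},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": _rel("w4")},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Unit"},
    {"Id": "w3", "BlockType": "WORD", "Text": "price"},
    {"Id": "w4", "BlockType": "WORD", "Text": "Total"},
]}
ordered_tables = tables_in_page_order(mock_response)
```

Merged cells are ignored here for brevity. A fuller version would also read the row and column span of each cell.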
This was the first part of the article ‘How to Extract Tabular Data from PDF’. In the second part, which is coming soon, we will analyze three more popular solutions for extracting and converting data from PDFs and prepare a detailed comparison table in which each tool is rated against specific criteria.