Upsilon

Posted on Sep 16, 2020

How to Extract Tabular Data from PDF [part 2]

#productivity #technology #webdev #tutorial

Here is the second part of the article ‘How to Extract Tabular Data from PDF.’ In the first part, we covered key challenges and explained the core principles of getting data out of PDF tables. Today, we finish our analysis of the six software tools most often used for that purpose and provide a comparative table in which each tool is rated on its ability to parse PDF tables and correctly extract data from them.

Excalibur

Excalibur is a web interface to extract tabular data from PDFs. Tool overview:

Type of software available: web application, needs local setup
Platforms: any modern web browser; local setup runs on Linux, macOS, and Windows
Terms of use: free, open-source
Supported output formats: CSV, Excel, JSON, HTML
Notes: works only with text-based PDFs and not scanned documents

After uploading our sample file and parsing data from it via Excalibur, we got the following output:

Highlighted zones are the parts of the original file that Excalibur detected as tables. At this step, data is captured correctly; extraneous elements such as headers are not selected.

After the extraction procedure, the tabular data preview shows only one error: in the first row of the first table, two adjacent cells were mistakenly merged. Compared with the previous tools, this output is the closest to the original file.

Summary: Excalibur demonstrates the best result so far. Tables are detected accurately, all non-tabular data is skipped, and multiline text in cells causes no problems. The only mistake concerns the recognition of merged cells: unfortunately, their content is mixed up.

OCR.space

OCR.space is a service that converts scans or (smartphone) images of text documents into editable files using Optical Character Recognition (OCR) technology. Tool overview:

Type of software available: online web application
Platforms: any modern web browser; all processing happens ‘in the Cloud’
Terms of use: free (up to 250 000 conversions) and paid ($20 per 100 000 conversions)
Supported output formats: TXT, JSON

For extracting data from tables, it is recommended to enable the ‘Table recognition’ option. After uploading our sample file and parsing data from it via OCR.space, we got the following output:

The result of the extraction is OCR'ed text sorted line by line, but there is no typical table structure with rows and columns. Data from the cells is mixed with non-tabular data such as headers and page numbers. The output format looks similar to TSV (tab-separated values). Also, the multiline text inside cells is split into rows.

Summary: The output document does not have a typical table structure: the data is presented as a sequence of text lines. All data from the tables is extracted but not cleaned of 'extraneous content' such as headers and page numbers. Also, OCR.space failed to extract multiline text in cells correctly.
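Because the output resembles TSV, one way to get back to rows and columns is to split each text line on tab characters. The snippet below is a minimal sketch of that idea, not part of OCR.space itself: the function name `parse_tsv_text` and the sample text are made up for illustration, and real output would still contain the headers, page numbers, and split multiline cells described above.

```python
def parse_tsv_text(text):
    """Split tab-separated text into a list of rows (lists of cell strings)."""
    rows = []
    for line in text.splitlines():
        # Split on tabs and trim stray whitespace around each cell
        cells = [cell.strip() for cell in line.split("\t")]
        # Skip lines that contain no text at all
        if any(cells):
            rows.append(cells)
    return rows


# Hypothetical sample for illustration, not real tool output
sample = "Name\tQty\tPrice\nWidget\t2\t9.99\n\nGadget\t1\t24.50\n"
rows = parse_tsv_text(sample)
```

Each entry in `rows` is one text line split into cells, which can then be cleaned up or written to CSV.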

PDFTables

PDFTables is a cloud platform that allows users to accurately convert PDF tables to Excel, CSV, XML, or HTML without downloading any software. Tool overview:

Type of software available: online web application
Platforms: any modern web browser; all processing happens ‘in the Cloud’
Terms of use: free/paid (starting from $40 for 1000 pages) subscription plans
Supported output formats: Excel, CSV, XML, HTML
Notes: allows converting multiple PDFs at once

After parsing the sample PDF file and extracting tabular data from it via PDFTables, we get the following result:

The four separate tables of the original PDF are detected as one big table. We can also see that all headers are captured as table elements and included as additional, ‘non-original’ cells inside the table. Moreover, cells with multiline text are split into multiple table rows. And finally, the cells of the first table of the sample file are mistakenly merged.

Summary: When it comes to getting tabular data out of PDF docs, PDFTables is the least effective tool. The only task it coped with is the correct detection of cells separated by small margins. The other challenges of ‘extra formatting’ cannot be handled with PDFTables.

A comparative pivot table of software tools

In this study, we compared six software tools (Tabula, PDFTron, Amazon Textract, Excalibur, OCR.space, and PDFTables) on their core functions of parsing PDF tables and extracting data from them. We evaluated each tool on its ability to complete the following five tasks:

If there are multiple tables on a page, detect them all separately
If non-tabular data is on a page, skip it and do not include it in the extraction result
If any cell contains multiline text, it should not be split into multiple table rows
If there are cells spanning more than one row/column, they should be recognized correctly, at least separately from other cells
If there are small margins between non-tabular data and a table or between different cells and their content inside the table, they should be recognized separately

If we see from the extraction result that the task is completed successfully, a tool gets 1 point. If we see any mistakes and inconsistencies compared to the original PDF file's tabular data, it receives 0 points.
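The scoring rule is simple enough to express in a few lines. The sketch below is only an illustration of it: the task labels, the function name `score_tool`, and the tools `tool_a` and `tool_b` are hypothetical and do not reproduce the actual results of the study.

```python
# Short labels for the five tasks listed above (the names are our own)
TASKS = (
    "separate_tables",
    "skip_non_tabular",
    "multiline_text",
    "spanned_cells",
    "small_margins",
)


def score_tool(completed_tasks):
    """Return 1 point per completed task; tasks not completed add 0 points."""
    completed = set(completed_tasks)
    unknown = completed - set(TASKS)
    if unknown:
        raise ValueError(f"Unknown tasks: {sorted(unknown)}")
    return len(completed)


# Hypothetical results for two made-up tools, for illustration only
results = {
    "tool_a": {"separate_tables", "skip_non_tabular", "multiline_text", "small_margins"},
    "tool_b": {"small_margins"},
}
scores = {name: score_tool(done) for name, done in results.items()}
# Tool names ordered from highest to lowest score
ranking = sorted(scores, key=scores.get, reverse=True)
```

With five tasks, the maximum possible score for a tool is 5.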

Here is a comparative pivot table with all the results:

Conclusions:

1. Excalibur is the 'winner' of the study. It successfully coped with most of the challenges of 'extra formatting'; the exception was column and row spanning. Thus it can be recommended as the #1 choice for extracting tabular data from PDFs.

2. Tabula and PDFTron demonstrated fairly satisfactory results. While Tabula is better at identifying and excluding non-tabular data (headers and a page number), PDFTron is better at delimiting cells with multiline text inside.

3. Amazon Textract and OCR.space got only 2 points out of 5. In the extraction result provided by Amazon Textract, we see a large amount of data loss and a scrambled table order. The OCR.space result contains mistakenly included non-tabular data and multiline content split into multiple table rows.

4. PDFTables failed to complete most of the tasks. It was the only tool that did not recognize the four tables of the sample PDF file as four separate ones. Moreover, we can find non-tabular content, split multiline text, and mistakenly merged cells in the extraction result.

5. The most challenging task was to detect column and row spanning correctly: none of the tools fully coped with it. If you have a complicated table with many cells spanning more than one row/column, you should look for another tool for PDF data extraction.


Top comments (3)

Hugertown • Feb 27 '21

I'm looking for a way to extract data from a PDF, but so far I'm having difficulties: I've tried transferring it, but the text comes out distorted. What should I do?

AnupJoseph • Oct 15 '20

Can you please suggest some tools that help in dealing with a PDF that has multiple tables on a single page, sometimes even spanning multiple pages?

Upsilon • Jan 13 '21

Thanks for the question!

All tools except PDFTables coped well with the multiple tables on the page. PDFTables detected the original PDF's separate tables as a big, single table.

If you work with a multi-page table, you will need to 'glue' its parts together yourself, either manually or via a custom script (if you come up with an algorithm). As far as we know, no tools do that for you.

On the other criteria, Excalibur is the winner of the study.
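As a rough illustration of the 'gluing' script mentioned above, here is a minimal sketch. It assumes each page has already been extracted into a list of rows, and that a continuation page may repeat the header row of the first page; the function name `glue_pages` and the sample data are hypothetical, and real tables may need a smarter algorithm.

```python
def glue_pages(pages):
    """Join per-page row lists into one table, keeping the header row only once."""
    non_empty = [page for page in pages if page]
    if not non_empty:
        return []
    # Assumption: the first row of the first page is the table header
    header = non_empty[0][0]
    table = [header]
    for page in non_empty:
        for index, row in enumerate(page):
            # Drop the header when it is repeated at the top of a page
            if index == 0 and row == header:
                continue
            table.append(row)
    return table


# Hypothetical three-page table for illustration
pages = [
    [["ID", "Name"], ["1", "Ann"]],
    [["ID", "Name"], ["2", "Bob"]],
    [["3", "Cy"]],
]
table = glue_pages(pages)
```

This only handles an identical header row repeated across pages; rows broken in the middle by a page break would need extra logic.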