Automate PDF Table Extraction using Tabula and Azure Functions

Data Ingestion

Big data projects typically ingest data from a variety of sources in a variety of formats.
An ETL (Extract, Transform, Load) tool automates the data ingestion process: it extracts data from a source, moves data between multiple stores, and finally loads the processed data into a destination data store for further analysis and reporting.
Microsoft provides ADF (Azure Data Factory) for orchestrating the data ingestion process. Azure Data Factory provides an interface to create, schedule, monitor and manage data ingestion pipelines. For complex ETL scenarios that go beyond what pipelines handle out of the box, Azure Data Factory can also execute custom code in the form of Azure Functions, which is what this article covers. Azure Functions integrates with a number of tools and enables execution of code on demand.

PDF Table Extraction

Extracting data from different formats poses different challenges. Much of the extraction work can be achieved out of the box with an ETL tool like ADF; other cases require a bit of coding. One such scenario is extracting data from tables in pdfs. PDFs contain loads of information locked away in tables, and extracting that data into a structured format has become a common business requirement for big data projects. So how do we automate this tedious task? What if we could extract data out of pdf tables and transform it into a consumable format with a few lines of code (just one line!)?

tabula-py is a free Python library that provides an excellent programming interface for automating pdf table extraction.
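For instance, converting every table in a pdf into a csv really is a single call. A minimal sketch (the file paths here are placeholders):

# pip install tabula-py  (a Java runtime must be installed on the machine)
import tabula

# Convert all tables found in the pdf into one csv file
tabula.convert_into("/path/to/sample.pdf", "/tmp/sample.csv",
                    output_format="csv", pages="all")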

Tabula User Interface

Tabula app
Before moving on to the solution, I would like to introduce the GUI version of Tabula. Tabula provides a GUI that can be downloaded and easily installed, and it is handy for validating whether a table is extractable. All you need to do is upload the pdf containing the table and verify that the extraction succeeds. In essence, it is a quick way to find out whether a pdf will work, without requiring additional resources.

Use the selection rectangle to mark the required section for extraction. The selection can be saved as a template for future reference.


Tabula provides templates to save data selections. These templates determine what data will be extracted from the pdf. To leverage template-based table extraction with the tabula-py library, use the method below:

tabula.read_pdf_with_template(pdf_path, "/path/to/tabula-template.json")
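read_pdf_with_template returns a list of DataFrames, one per selection saved in the template. A minimal sketch (the paths are placeholders) that writes each extracted table to its own csv:

import tabula

# One DataFrame is returned per selection saved in the template
dfs = tabula.read_pdf_with_template("/path/to/sample.pdf",
                                    "/path/to/tabula-template.json")
for i, df in enumerate(dfs):
    df.to_csv("/tmp/table_{}.csv".format(i), index=False)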

Tabula offers two extraction modes - Stream and Lattice. If the tables have lines separating the cells, use the lattice parameter. If your tables don't have separation lines, try the stream parameter.

tabula.read_pdf("/path/to/sample.pdf", pages="all", stream=True)
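The equivalent call for tables with ruling lines would use lattice instead (same placeholder path):

import tabula

# lattice mode uses the cell border lines to detect the table structure
dfs = tabula.read_pdf("/path/to/sample.pdf", pages="all", lattice=True)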

Finally, export the result and check whether the data has been extracted properly. Once we preview the output and confirm the data is right, we can go ahead and automate the process. Watch out for badly formatted pdfs. If the Tabula GUI app produces the desired output, the pdf is a good candidate for automated extraction using the library.

High Level Flow

tabula-py enables conversion of tables in a pdf into csv/tsv files. Once the data is available as csv, it can easily be loaded into a database either programmatically or via Azure Data Factory. Most programming languages and ETL tools support loading csv into a database table, and ADF provides an out-of-the-box feature to load csv into a database without additional programming; a programmatic sketch follows below.
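As a sketch of the programmatic route, pandas can push the extracted csv straight into a database table (the sqlite database and table name here are placeholders for illustration):

import pandas as pd
import sqlite3

# Load the csv produced by tabula and write it to a database table
df = pd.read_csv("/tmp/sample.csv")
with sqlite3.connect("/tmp/extracts.db") as conn:
    df.to_sql("pdf_table", conn, if_exists="replace", index=False)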


The Python code deployed on an Azure Function reads the pdf from Azure blob storage. The tabula-py package extracts the table from the pdf and converts it into csv. The resulting csv file is then stored in a different location within Azure blob storage.

Python Code


import logging
import os

import azure.functions as func
import tabula
from azure.storage.blob import BlobClient, ContainerClient


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    try:
        # SAS URL for the source pdf in blob storage (token elided)
        sas_url = "https://pdfextracts.blob.core.windows.net/pdfcontainer/1.pdf"

        # Download the pdf blob
        blob_client = BlobClient.from_blob_url(sas_url)
        blob_data = blob_client.download_blob()
        blob_name = blob_data.name
        logging.info(blob_name + ' downloaded')

        # Save the pdf to the tmp folder
        pdf_path = '/tmp/{}'.format(blob_name)
        with open(pdf_path, "wb") as my_blob:
            blob_data.readinto(my_blob)
        logging.info('Saved PDF to path: ' + pdf_path)

        # Strip the .pdf extension to build the csv file name
        file_name_without_extension = os.path.splitext(blob_name)[0]
        csv_file_name = file_name_without_extension + '.csv'
        csv_path = '/tmp/{}'.format(csv_file_name)
        logging.info('CSV path ' + csv_path)

        # Extract the table from the pdf and write it out as csv
        tabula.convert_into(pdf_path, csv_path, output_format="csv", stream=True)
        logging.info(blob_name + ' processed by tabula')

        # Container client for the csv container in the same storage
        # account (SAS token elided after the '?')
        sas_url = "https://pdfextracts.blob.core.windows.net/csvcontainer?"
        container_client = ContainerClient.from_container_url(sas_url)
        logging.info('created sas client for upload')

        # Read the csv from tmp and upload it to blob storage
        with open(csv_path, "rb") as data:
            container_client.upload_blob(name=csv_file_name, data=data)
        logging.info(csv_file_name + ' uploaded post extraction')

    except ValueError:
        # Swallow extraction failures here; consider logging the error instead
        pass

    logging.info('process completed')
    return func.HttpResponse("PDF function executed successfully.")

Host on Azure Functions Container

The table extraction process requires the tabula-py library and also depends on a Java runtime. The best way to deploy the pdf extraction code is to package the app along with its dependencies.

Docker provides a way to package the code and its dependencies into a portable format that can then be deployed onto any container-based hosting solution. What is docker?

Azure Functions is a managed hosting option that supports containers. Azure Functions in a docker container
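A minimal Dockerfile could look like the sketch below (assuming the Azure Functions Python base image; the image tag and package names may need adjusting for your setup):

FROM mcr.microsoft.com/azure-functions/python:4-python3.9

# tabula-py calls into tabula-java, so the image needs a Java runtime
RUN apt-get update && \
    apt-get install -y --no-install-recommends default-jre && \
    rm -rf /var/lib/apt/lists/*

ENV AzureWebJobsScriptRoot=/home/site/wwwroot

COPY requirements.txt /
RUN pip install -r /requirements.txt

COPY . /home/site/wwwroot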

Azure Functions lets developers focus on application code rather than worry about the hosting infrastructure. To integrate Azure Functions with Azure Data Factory, the Python code has to be modified to add a status code to the final response:

# ADF's Azure Function activity expects a JSON response carrying
# a status code (requires: import json)
return_msg = {
    'statusCode': 200,
    'body': msg
}
return func.HttpResponse(json.dumps(return_msg))

This Azure Function can now be executed from Azure Data Factory. To invoke the code above, add an Azure Function activity to the data pipeline.
Azure Functions - Azure Data Factory Integration

Apart from serving as an API, Azure Functions offers integration with a number of other services in the form of triggers and bindings. For example, an Azure Function can be invoked automatically when a new file is added to blob storage, as sketched below.
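A blob-trigger variant of the function could look like this sketch (it assumes a blob trigger binding on the pdf container has been configured in function.json):

import logging
import azure.functions as func

def main(inputblob: func.InputStream):
    # Fires automatically when a new pdf lands in the monitored container
    logging.info('New blob: %s (%s bytes)', inputblob.name, inputblob.length)
    # ...save inputblob.read() to /tmp and run the tabula extraction...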

Once deployed, the application can either act as an independent API accessed from a web app for on-demand processing, or be integrated with ADF for scheduled processing. Either way, now that the data from the pdf is available as a text/csv file, it is easy to load it into a data store, via code or ADF, for further analysis.
