DEV Community

Cover image for Automate data entry from PDF to Excel with Python
Stokry
Stokry

Posted on

Automate data entry from PDF to Excel with Python

I will show you today how you can automate the process of importing PDF data into Excel. This can save a lot of time for everyone doing this daily.

This is our test data:
enter image description here

Let's jump to the code!

First we need to import dependencies

from tika import parser
import pprint
from collections import defaultdict
import re
import pandas as pd
Enter fullscreen mode Exit fullscreen mode

After that we need to define PrettyPrinter and get the content of the PDF file and convert it into a list:

pp = pprint.PrettyPrinter(indent=3)
parsedPDF = parser.from_file("final-test.pdf")

content = parsedPDF['content']
contentlist = content.split('\n')
Enter fullscreen mode Exit fullscreen mode

Remove empty strings in the list resulting from the split

contentlist = list(filter(lambda a: a != '', contentlist))
Enter fullscreen mode Exit fullscreen mode

Create an iterator and other flags that we will use to for the algorithm, Iterator of the contents of PDF per line:

iterateContent = iter(contentlist)
Enter fullscreen mode Exit fullscreen mode

Dictionary placeholder of the data scraped

data = defaultdict(dict)
Enter fullscreen mode Exit fullscreen mode

Our counter to count how many blocks did we able to get

 cntr = 0
Enter fullscreen mode Exit fullscreen mode

Indicator which line are we in a specific block of data

 line = 1
Enter fullscreen mode Exit fullscreen mode

The algorithm will use the flags cntr and line to determine if we are in a new block or existing block

while True:
    try:
        string = next(iterateContent)
    except StopIteration:
        break

    if re.match('^[A-Z\s]+$', string):
        cntr += 1           

        data[cntr]['Name'] = string
        line = 2
        print('matched')

    elif line == 2:
        data[cntr]['Address'] = string
        line += 1

    elif line == 3:
        data[cntr]['Website'] = string
        line += 1
Enter fullscreen mode Exit fullscreen mode
print("Total data:", len(data.keys()))
Enter fullscreen mode Exit fullscreen mode

Setting up the data into Dataframe

df = pd.DataFrame(data.values())
df.index += 1
print(df)
Enter fullscreen mode Exit fullscreen mode

Write the dataframe into excel

writer = pd.ExcelWriter("dataframe.xlsx", engine='xlsxwriter')
df.to_excel(writer, sheet_name='output', index=False)
writer.save()
Enter fullscreen mode Exit fullscreen mode

Our final results:

enter image description here

Thank you all.

Top comments (1)

Collapse
 
ypanagis profile image
Yannis Panagis

Very useful. I was wondering if it would make sense to add 'with Apache Tika' at a visible point, eg before the code snippets.

Thank you for the post again!