Rishab Dugar

Posted on Sep 22

PDF Extraction: Retrieving Text and Tables together using Python🐍

#datascience #python #computerscience #pdf

Extracting both text and tables can be challenging when working with PDF files due to their complex structure. However, the “pdfplumber” library offers a powerful solution. This article explores an effective method for combining text and table extraction from PDFs using pdfplumber. Special thanks to Karl Genockey a.k.a. cmdlineuser and other contributors for their brilliant approach discussed here.

Understanding the Approach

The method involves extracting table objects and text lines separately and then combining them based on their positional values. This ensures that the extracted data maintains the correct order and structure as it appears in the PDF. Let’s break down the code and logic step-by-step.
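To make the ordering idea concrete, here is a minimal sketch in plain Python, independent of pdfplumber. The names `Block` and `merge_by_position` and all coordinates are made up for illustration; the actual method works on pdfplumber's character objects rather than on whole lines.

```python
from typing import NamedTuple


class Block(NamedTuple):
    """A piece of page content with its vertical position (hypothetical type)."""
    top: float  # distance from the top of the page
    text: str


def merge_by_position(text_lines, tables):
    """Interleave text lines and rendered tables in top-to-bottom order."""
    return [block.text for block in sorted(text_lines + tables, key=lambda b: b.top)]


# Illustrative data: two text lines with a table between them.
text_lines = [Block(50.0, "Hello"), Block(300.0, "Lorem ipsum")]
tables = [Block(120.0, "| Name | Age |")]

merged = merge_by_position(text_lines, tables)
print("\n".join(merged))
```

Sorting on the vertical coordinate is what lets the table end up between the two text lines instead of before or after all of the text.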

As an example, we will use the sample PDF below, which contains tables and text across multiple pages.

sample_pdf_with_text_and_table.pdf - Google Drive

drive.google.com

Prerequisites

Before running the code, make sure the necessary libraries are installed. Besides pdfplumber and pandas, we also need the tabulate library, which pandas uses to convert DataFrame objects to Markdown. This conversion keeps the structure of each extracted table readable in the final output.

Installing Required Libraries

You can install these libraries using pip. Run the following command in your terminal:

pip install pdfplumber pandas tabulate

Step-by-Step Explanation

Import Libraries: First things first, we start by importing all necessary libraries.

pdfplumber is used for extracting text and tables from PDFs.
pandas is used for handling and manipulating data.
extract_text, get_bbox_overlap, and obj_to_bbox are utility functions from pdfplumber.
tabulate helps in converting data into Markdown format.

import pdfplumber
import pandas as pd
from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox
import tabulate

Function Definition and PDF Opening:

The function process_pdf takes pdf_path as an argument, which is the path to the PDF file.
pdfplumber.open(pdf_path) opens the PDF file.
all_text is initialized as an empty list to store the extracted text from all pages.

def process_pdf(pdf_path):
    pdf = pdfplumber.open(pdf_path)
    all_text = []

Iterate Over Pages:

for page in pdf.pages: the loop iterates over each page in the PDF.
filtered_page: initially set to the current page.
chars: holds all characters on filtered_page.

    for page in pdf.pages:
        filtered_page = page
        chars = filtered_page.chars

Table Detection and Filtering:

for table in page.find_tables(): the loop iterates over each table found on the page.
first_table_char: stores the first character inside the table's cropped area; it later marks where the table belongs in the text.
filtered_page: updated by filtering out every object that overlaps the table's bounding box, using get_bbox_overlap and obj_to_bbox.

        for table in page.find_tables():
            first_table_char = page.crop(table.bbox).chars[0]
            filtered_page = filtered_page.filter(lambda obj: 
                get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
            )
            chars = filtered_page.chars
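The filter keeps an object only when its bounding box has no overlap with the table's. The sketch below is a simplified, pure-Python stand-in for that test, not pdfplumber's own implementation; `bbox_overlap` and the sample boxes are hypothetical. Boxes are `(x0, top, x1, bottom)` tuples.

```python
def bbox_overlap(a, b):
    """Return the intersection of two (x0, top, x1, bottom) boxes, or None.

    Simplified stand-in for illustration: boxes that merely touch along an
    edge are treated as non-overlapping here.
    """
    x0, top = max(a[0], b[0]), max(a[1], b[1])
    x1, bottom = min(a[2], b[2]), min(a[3], b[3])
    if x1 <= x0 or bottom <= top:
        return None
    return (x0, top, x1, bottom)


table_bbox = (50, 100, 400, 250)    # made-up table area
char_inside = (60, 110, 66, 122)    # a character inside the table
char_outside = (60, 300, 66, 312)   # a character in the text below it

# Keep only the characters that do not overlap the table.
kept = [c for c in (char_inside, char_outside) if bbox_overlap(c, table_bbox) is None]
print(kept)
```

Only the character below the table survives, which is the effect the filter has on the page's characters.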

Extract and Convert Table to Markdown:

table.extract() extracts the table content.
A DataFrame df is created from the extracted table data.
The first row is set as the header using df.columns = df.iloc[0].
The rest of the DataFrame is converted to Markdown format and stored in markdown.

            df = pd.DataFrame(table.extract())
            df.columns = df.iloc[0]
            markdown = df.drop(0).to_markdown(index=False)
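Extracted tables are often messier than the sample: empty cells typically come back as `None`, and a cell may contain line breaks, which would split a Markdown row. An optional cleanup step before building the DataFrame might look like the sketch below; `clean_rows` is a hypothetical helper, not part of pdfplumber or pandas.

```python
def clean_rows(rows):
    """Replace missing cells with "" and collapse whitespace (including
    line breaks) inside each cell."""
    return [
        ["" if cell is None else " ".join(str(cell).split()) for cell in row]
        for row in rows
    ]


# Made-up rows in the shape of a list of lists of cell strings.
raw = [
    ["First name", "Last\nname"],
    ["Nobita", None],
]
cleaned = clean_rows(raw)
print(cleaned)
```

The cleaned rows can then be passed to `pd.DataFrame(...)` in place of the raw output of `table.extract()`.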

Append Markdown to Characters:

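A copy of first_table_char, with its text replaced by the Markdown string, is appended to chars. Because the copy keeps the original character's position, the table lands where it appeared on the page when the text is extracted.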

            chars.append(first_table_char | {"text": markdown})
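The `|` operator used here is the dict union introduced in Python 3.9: it builds a new dict and leaves the original untouched. A minimal illustration with a made-up character dict (real pdfplumber characters carry more keys):

```python
# A stripped-down stand-in for a pdfplumber character object.
first_table_char = {"text": "F", "x0": 72.0, "top": 120.0}
markdown = "| First name | Last name |"

# Copy the character, keeping its position but replacing its text.
table_char = first_table_char | {"text": markdown}

print(table_char["text"])        # the Markdown table
print(first_table_char["text"])  # still "F": the original is unchanged
```

On Python 3.8 or earlier, `{**first_table_char, "text": markdown}` gives the same result.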

Extract Page Text:

extract_text(chars, layout=True) extracts the text from the filtered characters with layout preservation.
The extracted text page_text is appended to all_text.

        page_text = extract_text(chars, layout=True)
        all_text.append(page_text)

Close PDF and Return Text:

The PDF file is closed using pdf.close().
The extracted text from all pages is joined into a single string with newline characters and returned.

    pdf.close()
    return "\n".join(all_text)

Execute Function and Print Result:

The path to the PDF file is defined in pdf_path.
process_pdf(pdf_path) is called to process the PDF and extract text.
The extracted text is printed.

# Path to your PDF file
pdf_path = r"sample_pdf.pdf"
extracted_text = process_pdf(pdf_path)
print(extracted_text)

Complete Code

Here is the complete script for extracting text and tables as markdown from a PDF:

import pdfplumber
import pandas as pd
from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox


def process_pdf(pdf_path):
    pdf = pdfplumber.open(pdf_path)
    all_text = []
    for page in pdf.pages:
        filtered_page = page
        chars = filtered_page.chars
        for table in page.find_tables():
            # The table's first character marks where the table sits on the page.
            first_table_char = page.crop(table.bbox).chars[0]
            # Drop every object that overlaps the table's bounding box.
            filtered_page = filtered_page.filter(lambda obj:
                get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
            )
            chars = filtered_page.chars
            # Render the table as Markdown, using its first row as the header.
            df = pd.DataFrame(table.extract())
            df.columns = df.iloc[0]
            markdown = df.drop(0).to_markdown(index=False)
            # Re-insert the whole table as one pseudo-character at that position.
            chars.append(first_table_char | {"text": markdown})
        # Extract the remaining text plus the Markdown tables, preserving layout.
        page_text = extract_text(chars, layout=True)
        all_text.append(page_text)
    pdf.close()
    return "\n".join(all_text)


# Path to your PDF file
pdf_path = r"sample_pdf.pdf"
extracted_text = process_pdf(pdf_path)
print(extracted_text)

Output:

Hello
World
| First name   | Last name   |   Age | City        |
|:-------------|:------------|------:|:------------|
| Nobita       | Nobi        |    15 | Tokyo       |
| Eli          | Shane       |    23 | Orlando     |
| Rahul        | Jain        |    22 | Los Angeles |
| Lucy         | Carlyle     |    17 | London      |
| Anthony      | Lockwood    |    19 | Leicester   |
Loreum  ipsum
dolor sit amet,
consectetur
adipiscing
Hello
Python
| First name   | Last name   | Address             |
|:-------------|:------------|:--------------------|
| James        | Watson      | 221 B, Baker Street |
| Mycroft      | Holmes      | Diogenes Club       |
| Irene        | Adler       | 21 New Jersey       |
| Lucy         | Carlyle     | 33 Claremont Square |
| Anthony      | Lockwood    | 35 Portland Row     |
Neque  porro
quisquam  est qui
            dolorem
      ipsum     quia
      dolor sit amet,
consectetur, adipisci
velit..."

Conclusion

This approach provides a systematic way to extract and combine text and tables from PDFs using “pdfplumber”. By leveraging table and text line positional values, we can maintain the integrity of the original document’s layout. Credits to cmdlineuser and jsvine for their insightful discussion and innovative solution to the problem!

That’s all for now! Hope this tutorial was helpful. Feel free to explore and adapt this method to fit your specific needs.

Top comments (3)

Jeff Stone • Nov 21

Hi Rishab,
Nice post that addresses a vexing problem. I tried your code on a complex .PDF that I have but got the following error:
File c:\users\js.spyder-py3\temp.py:12 in process_pdf
    first_table_char = page.crop(table.bbox).chars[0]
File C:\ProgramData\anaconda3\Lib\site-packages\pdfplumber\page.py:535 in crop
    return CroppedPage(self, bbox, relative=relative, strict=strict)
File C:\ProgramData\anaconda3\Lib\site-packages\pdfplumber\page.py:677 in __init__
    test_proposed_bbox(crop_bbox, parent_page.bbox)
File C:\ProgramData\anaconda3\Lib\site-packages\pdfplumber\page.py:656 in test_proposed_bbox
    raise ValueError(
ValueError: Bounding box (19.448275862068964, 154.38000000000005, 1183.5160975609742, 553.6492307692307) is not fully within parent page bounding box (0, 0, 792, 612)

Do you have any idea how to adjust for this situation?

Thanks,

Jeff

Rishab Dugar • Dec 5

Hi Jeff,

Thanks for pointing that out! The error seems related to the table's bounding box extending outside the page's boundaries. You could try adding a check to validate the table's bounding box before processing:

if (
    table.bbox[0] < page.bbox[0] or
    table.bbox[1] < page.bbox[1] or
    table.bbox[2] > page.bbox[2] or
    table.bbox[3] > page.bbox[3]
):
    print(f"Skipping table with invalid bounding box: {table.bbox}")
    continue

Alternatively, you can clamp the bounding box to ensure it fits within the page:

def clamp_bbox(bbox, page_bbox):
    return (
        max(bbox[0], page_bbox[0]),
        max(bbox[1], page_bbox[1]),
        min(bbox[2], page_bbox[2]),
        min(bbox[3], page_bbox[3]),
    )

adjusted_bbox = clamp_bbox(table.bbox, page.bbox)
cropped_page = page.crop(adjusted_bbox)
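As a quick sanity check, the snippet below applies the clamp to the numbers from the traceback. It repeats `clamp_bbox` so that it runs on its own, and adds a hypothetical `has_area` guard, since a box lying entirely off the page would clamp to an empty region.

```python
def clamp_bbox(bbox, page_bbox):
    """Clip an (x0, top, x1, bottom) box to the page's bounding box."""
    return (
        max(bbox[0], page_bbox[0]),
        max(bbox[1], page_bbox[1]),
        min(bbox[2], page_bbox[2]),
        min(bbox[3], page_bbox[3]),
    )


def has_area(bbox):
    """True if the box has positive width and height (hypothetical guard)."""
    return bbox[2] > bbox[0] and bbox[3] > bbox[1]


page_bbox = (0, 0, 792, 612)
table_bbox = (19.448275862068964, 154.38000000000005, 1183.5160975609742, 553.6492307692307)

adjusted = clamp_bbox(table_bbox, page_bbox)
print(adjusted)            # the right edge is pulled back to 792
print(has_area(adjusted))  # True: the clamped box is not empty
```

Clamping only avoids the error; any part of the table that lies outside the page box is still cut off, so the extracted table may be incomplete.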

I haven’t tested this extensively, but it should address the bounding box issue. Let me know how it works for your case!

Richard Kous • Dec 3

If you work with PowerPoint presentations and want to convert them to PDF for easier sharing or printing, I recommend using a tool like PDFGuru. This service, a ppt to pdf converter, lets you easily convert .pptx files to PDF while preserving the formatting and slide structure. This is especially useful if you want to save your presentation in a more universal format that can be opened on any device or used for printing.
