Extracting both text and tables can be challenging when working with PDF files due to their complex structure. However, the “pdfplumber” library offers a powerful solution. This article explores an effective method for combining text and table extraction from PDFs using pdfplumber. Special thanks to Karl Genockey a.k.a. cmdlineuser and other contributors for their brilliant approach discussed here.
Understanding the Approach
The method involves extracting table objects and text lines separately and then combining them based on their positional values. This ensures that the extracted data maintains the correct order and structure as it appears in the PDF. Let’s break down the code and logic step-by-step.
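To make the idea concrete before looking at the pdfplumber code, here is a minimal, library-free sketch of position-based merging. The Block type, its fields, and the sample coordinates are hypothetical and exist only for this illustration; the script further down achieves the same ordering by letting pdfplumber lay out characters.

```python
from typing import NamedTuple


class Block(NamedTuple):
    top: float    # distance from the top of the page (hypothetical sample values)
    kind: str     # "text" or "table"
    content: str


def merge_by_position(text_lines, tables):
    """Interleave text lines and tables in top-to-bottom reading order."""
    return [block.content for block in sorted(text_lines + tables, key=lambda b: b.top)]


text_lines = [Block(50.0, "text", "Hello World"), Block(300.0, "text", "Lorem ipsum")]
tables = [Block(120.0, "table", "| Name | Age |")]

merged = merge_by_position(text_lines, tables)
print("\n".join(merged))
```

Because the table's top value falls between the two text lines, it lands between them in the merged output.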
As an example, we will use the sample_pdf below, containing tables and text in multiple pages.
Prerequisites
Before running the code, make sure the necessary libraries are installed. Besides pdfplumber and pandas, we also need the tabulate library. pandas uses it to convert DataFrame objects to Markdown format, which is crucial for our table extraction process. This conversion helps maintain the structure and readability of the table data extracted from the PDF.
Installing Required Libraries
You can install these libraries using pip. Run the following command in your terminal:
pip install pdfplumber pandas tabulate
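If you want to confirm the installation before running the script, a small check along these lines may help. It is only a sketch: the missing_packages helper is a name made up for this example, and it only tests whether each package can be found, not which version is installed.

```python
from importlib.util import find_spec

REQUIRED = ("pdfplumber", "pandas", "tabulate")


def missing_packages(names=REQUIRED):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```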
Step-by-Step Explanation
- Import Libraries: First things first, we start by importing all necessary libraries.
  - pdfplumber is used for extracting text and tables from PDFs.
  - pandas is used for handling and manipulating data.
  - extract_text, get_bbox_overlap, and obj_to_bbox are utility functions from pdfplumber.
  - tabulate helps in converting data into Markdown format.
import pdfplumber
import pandas as pd
from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox
import tabulate
- Function Definition and PDF Opening:
  - The function process_pdf takes pdf_path as an argument, which is the path to the PDF file.
  - pdfplumber.open(pdf_path) opens the PDF file.
  - all_text is initialized as an empty list to store the extracted text from all pages.
def process_pdf(pdf_path):
    pdf = pdfplumber.open(pdf_path)
    all_text = []
- Iterate Over Pages:
  - for page in pdf.pages iterates over each page in the PDF.
  - filtered_page is initially set to the current page.
  - chars captures all characters on the filtered_page.
    for page in pdf.pages:
        filtered_page = page
        chars = filtered_page.chars
- Table Detection and Filtering:
  - for table in page.find_tables() iterates over each table found on the page.
  - first_table_char stores the first character of the cropped table area.
  - filtered_page is updated by filtering out characters that overlap with the table's bounding box using get_bbox_overlap and obj_to_bbox.
        for table in page.find_tables():
            first_table_char = page.crop(table.bbox).chars[0]
            filtered_page = filtered_page.filter(lambda obj:
                get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
            )
            chars = filtered_page.chars
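The filtering step can be easier to follow with plain numbers. The sketch below uses two simplified stand-in helpers, obj_bbox and bbox_overlap, written for this illustration; they are not pdfplumber's own functions, and the character dictionaries and coordinates are invented. It keeps only the characters whose box does not intersect the table's box.

```python
def obj_bbox(obj):
    """Return (x0, top, x1, bottom) for a character-like dictionary."""
    return (obj["x0"], obj["top"], obj["x1"], obj["bottom"])


def bbox_overlap(a, b):
    """Return the intersection of two boxes, or None if they do not intersect."""
    left, top = max(a[0], b[0]), max(a[1], b[1])
    right, bottom = min(a[2], b[2]), min(a[3], b[3])
    if right > left and bottom > top:
        return (left, top, right, bottom)
    return None


table_bbox = (50, 100, 300, 200)  # hypothetical table area
chars = [
    {"text": "H", "x0": 50, "top": 40, "x1": 58, "bottom": 52},    # above the table
    {"text": "N", "x0": 60, "top": 110, "x1": 68, "bottom": 122},  # inside the table
]

# Same predicate shape as the lambda passed to filter() in the step above
outside = [c for c in chars if bbox_overlap(obj_bbox(c), table_bbox) is None]
print([c["text"] for c in outside])
```

Only the character above the table survives, which is why the table's contents do not appear twice in the final text.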
- Extract and Convert Table to Markdown:
  - table.extract() extracts the table content.
  - A DataFrame df is created from the extracted table data.
  - The first row is set as the header using df.columns = df.iloc[0].
  - The rest of the DataFrame is converted to Markdown format and stored in markdown.
            df = pd.DataFrame(table.extract())
            df.columns = df.iloc[0]
            markdown = df.drop(0).to_markdown(index=False)
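Real-world tables are often messier than the sample. Assuming the extracted rows can contain None for empty cells and line breaks inside a cell (both of which would disturb a Markdown table), a small cleaning pass before building the DataFrame might look like this. The clean_rows helper is hypothetical and not part of pdfplumber or pandas.

```python
def clean_rows(rows):
    """Replace None cells with "" and collapse whitespace, including line breaks."""
    return [
        [" ".join(str(cell).split()) if cell is not None else "" for cell in row]
        for row in rows
    ]


# Hypothetical rows in the list-of-lists shape used in the step above
rows = [["First name", "City"], ["Nobita", None], ["Rahul", "Los\nAngeles"]]
cleaned = clean_rows(rows)
print(cleaned)
```

With this helper, the DataFrame line would become pd.DataFrame(clean_rows(table.extract())); the rest of the step stays the same.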
- Append Markdown to Characters:
  - A copy of first_table_char is made with its text replaced by the markdown content, and the copy is appended to chars.
            chars.append(first_table_char | {"text": markdown})
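The | operator on dictionaries requires Python 3.9 or later. The sketch below, using an invented character dictionary, shows what the expression produces and an equivalent spelling for older interpreters.

```python
# Hypothetical character dictionary; real ones carry more keys
first_table_char = {"text": "F", "x0": 50.0, "top": 100.0, "x1": 58.0, "bottom": 112.0}
markdown = "| First name | Age |\n|:-----------|----:|\n| Nobita     |  15 |"

# New dict: same coordinates, but "text" now holds the whole Markdown table
placeholder = first_table_char | {"text": markdown}          # Python 3.9+
placeholder_compat = {**first_table_char, "text": markdown}  # older Python versions

print(placeholder["top"], placeholder["text"].splitlines()[0])
```

The original dictionary is left unchanged; only the copy carries the Markdown text, positioned where the table's first character used to be.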
- Extract Page Text:
  - extract_text(chars, layout=True) extracts the text from the filtered characters with layout preservation.
  - The extracted text page_text is appended to all_text.
        page_text = extract_text(chars, layout=True)
        all_text.append(page_text)
- Close PDF and Return Text:
  - The PDF file is closed using pdf.close().
  - The extracted text from all pages is joined into a single string with newline characters and returned.
    pdf.close()
    return "\n".join(all_text)
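One caveat with an explicit pdf.close(): if an exception is raised while a page is being processed, that line is never reached. pdfplumber.open can also be used in a with statement, which closes the file either way. The sketch below shows the pattern on a simplified function that only extracts plain page text; the opener parameter is a hypothetical addition so the pattern can be tried without a real PDF.

```python
def extract_page_texts(pdf_path, opener=None):
    """Return the plain text of each page, closing the PDF even if an error occurs."""
    if opener is None:
        import pdfplumber  # imported lazily so the sketch loads without the library
        opener = pdfplumber.open
    with opener(pdf_path) as pdf:
        # extract_text() may return None for an empty page; normalize to ""
        return [page.extract_text() or "" for page in pdf.pages]
```

Calling extract_page_texts("sample_pdf.pdf") uses pdfplumber itself; the same with block could wrap the page loop of process_pdf.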
- Execute Function and Print Result:
  - The path to the PDF file is defined in pdf_path.
  - process_pdf(pdf_path) is called to process the PDF and extract text.
  - The extracted text is printed.
# Path to your PDF file
pdf_path = r"sample_pdf.pdf"
extracted_text = process_pdf(pdf_path)
print(extracted_text)
Complete Code
Here is the complete script for extracting text and tables as markdown from a PDF:
import pdfplumber
import pandas as pd
from pdfplumber.utils import extract_text, get_bbox_overlap, obj_to_bbox


def process_pdf(pdf_path):
    pdf = pdfplumber.open(pdf_path)
    all_text = []
    for page in pdf.pages:
        filtered_page = page
        chars = filtered_page.chars
        for table in page.find_tables():
            # Remember the first character inside the table; it marks the table's position
            first_table_char = page.crop(table.bbox).chars[0]
            # Drop every object that overlaps the table's bounding box
            filtered_page = filtered_page.filter(lambda obj:
                get_bbox_overlap(obj_to_bbox(obj), table.bbox) is None
            )
            chars = filtered_page.chars
            # Convert the table to Markdown, using its first row as the header
            df = pd.DataFrame(table.extract())
            df.columns = df.iloc[0]
            markdown = df.drop(0).to_markdown(index=False)
            # Re-insert the table as a single pseudo-character carrying the Markdown text
            chars.append(first_table_char | {"text": markdown})
        # Rebuild the page text from the remaining characters and the table placeholders
        page_text = extract_text(chars, layout=True)
        all_text.append(page_text)
    pdf.close()
    return "\n".join(all_text)
# Path to your PDF file
pdf_path = r"sample_pdf.pdf"
extracted_text = process_pdf(pdf_path)
print(extracted_text)
Output:
Hello
World
| First name | Last name | Age | City |
|:-------------|:------------|------:|:------------|
| Nobita | Nobi | 15 | Tokyo |
| Eli | Shane | 23 | Orlando |
| Rahul | Jain | 22 | Los Angeles |
| Lucy | Carlyle | 17 | London |
| Anthony | Lockwood | 19 | Leicester |
Loreum ipsum
dolor sit amet,
consectetur
adipiscing
Hello
Python
| First name | Last name | Address |
|:-------------|:------------|:--------------------|
| James | Watson | 221 B, Baker Street |
| Mycroft | Holmes | Diogenes Club |
| Irene | Adler | 21 New Jersey |
| Lucy | Carlyle | 33 Claremont Square |
| Anthony | Lockwood | 35 Portland Row |
Neque porro
quisquam est qui
dolorem
ipsum quia
dolor sit amet,
consectetur, adipisci
velit..."
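One edge case is worth knowing about. As a reader reports in the comments, page.crop(table.bbox) can raise a ValueError when a detected table's bounding box is not fully within the page's bounding box. A possible workaround, offered here as an untested assumption rather than a confirmed fix, is to clip the table box to the page box before cropping. The clamp_bbox helper is hypothetical; the numbers are the ones from the reported error.

```python
def clamp_bbox(bbox, page_bbox):
    """Clip (x0, top, x1, bottom) so that it lies within page_bbox."""
    x0, top, x1, bottom = bbox
    page_x0, page_top, page_x1, page_bottom = page_bbox
    return (
        max(x0, page_x0),
        max(top, page_top),
        min(x1, page_x1),
        min(bottom, page_bottom),
    )


# Values taken from the error message quoted in the comments
table_bbox = (19.448275862068964, 154.38000000000005, 1183.5160975609742, 553.6492307692307)
page_bbox = (0, 0, 792, 612)

clamped = clamp_bbox(table_bbox, page_bbox)
print(clamped)
```

In the script, this would mean calling page.crop(clamp_bbox(table.bbox, page.bbox)). Whether the clipped area still contains the right characters depends on why the table exceeded the page in the first place, so the result should be checked against the PDF. Note also that .chars[0] raises an IndexError if the cropped area contains no characters at all.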
Conclusion
This approach provides a systematic way to extract and combine text and tables from PDFs using “pdfplumber”. By leveraging table and text line positional values, we can maintain the integrity of the original document’s layout. Credits to cmdlineuser and jsvine for their insightful discussion and innovative solution to the problem!
That’s all for now! Hope this tutorial was helpful. Feel free to explore and adapt this method to fit your specific needs.
Top comments (2)
If you work with PowerPoint presentations and want to convert them to PDF so they are easier to share or print, I recommend using a tool such as PDFGuru. This service, a PPT-to-PDF converter, lets you easily convert .pptx files to PDF while preserving the formatting and slide structure. This is especially useful if you want to save your presentation in a more universal format that can be opened on any device or used for printing.
Hi Rishab,
Nice post that addresses a vexing problem. I tried your code on a complex .PDF that I have but got the following error:
File c:\users\js.spyder-py3\temp.py:12 in process_pdf
first_table_char = page.crop(table.bbox).chars[0]
File C:\ProgramData\anaconda3\Lib\site-packages\pdfplumber\page.py:535 in crop
return CroppedPage(self, bbox, relative=relative, strict=strict)
File C:\ProgramData\anaconda3\Lib\site-packages\pdfplumber\page.py:677 in __init__
test_proposed_bbox(crop_bbox, parent_page.bbox)
File C:\ProgramData\anaconda3\Lib\site-packages\pdfplumber\page.py:656 in test_proposed_bbox
raise ValueError(
ValueError: Bounding box (19.448275862068964, 154.38000000000005, 1183.5160975609742, 553.6492307692307) is not fully within parent page bounding box (0, 0, 792, 612)
Do you have any idea how to adjust for this situation?
Thanks,
Jeff