DEV Community

Seraph★776

Extract Text from PDF Using Python

Introduction

This article shows how to extract text from a PDF using Python. To complete this task we'll use the PyPDF2 module. PyPDF2 is a free and open-source Python library capable of many tasks such as splitting, merging, cropping, adding custom data, encrypting, and retrieving text from PDFs.

The PDF Sample File

The sample PDF used in this article is The Raven by Edgar Allan Poe.

Directory Structure

This is the directory structure before executing script.py:

Python Project/
├── app/
│   ├── script.py
│   ├── the_raven.pdf
│

Implementation

  1. Open PDF and Extract Text
  2. Save Text to File

Open PDF and Extract Text

def extract_text_from_pdf(pdf_filename: str) -> str:
    text_output = ''
    with open(pdf_filename, 'rb') as pdf_object:
        pdf_reader = PyPDF2.PdfFileReader(pdf_object)
        for i in range(0, pdf_reader.numPages):
            page_obj = pdf_reader.getPage(i)
            text_output += page_obj.extractText()
    return text_output
  1. The extract_text_from_pdf() function takes one parameter, pdf_filename, which is the filename of the PDF from which the text will be extracted.
  2. pdf_filename is opened in rb mode (binary mode for reading) as pdf_object, which is then passed to PyPDF2.PdfFileReader to create a reader object named pdf_reader.
  3. We then iterate over all pages of pdf_reader using the range() function, with the numPages attribute as the upper bound of the range.
  4. For each page we get a page_obj with getPage(), and extract its text using the extractText() method.
  5. Finally, we concatenate the results to our text_output string, and return the results.
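
The names used above (PdfFileReader, numPages, getPage, extractText) come from older PyPDF2 releases; PyPDF2 3.0 removed them in favor of PdfReader, pages, and extract_text(). The following is a minimal sketch of the same steps with the newer names, assuming PyPDF2 3.x is installed. The helper join_page_text() and the function extract_text_modern() are hypothetical names introduced here for illustration and are not part of the original script.

```python
from typing import Iterable


def join_page_text(pages: Iterable) -> str:
    """Concatenate the text of every page, in order."""
    # Guard against a missing result so the join never fails on a page
    # that yields no text.
    return ''.join(page.extract_text() or '' for page in pages)


def extract_text_modern(pdf_filename: str) -> str:
    # Imported lazily so join_page_text() stays usable without PyPDF2.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 3.x is installed

    with open(pdf_filename, 'rb') as pdf_object:
        reader = PdfReader(pdf_object)
        # reader.pages behaves like a list of page objects.
        return join_page_text(reader.pages)
```

Joining with an empty string mirrors the concatenation in the original function. If you would rather keep a visible break between pages, joining with '\n' instead is one option.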

Save Text to File

def save_converted_text(text_file: str, filename: str) -> None:
    with open(filename, 'w+', encoding='utf8') as file_obj:
        file_obj.write(text_file)
    print(f'{filename} has been successfully saved.')
  1. The save_converted_text() function takes two parameters: text_file, which is the text extracted from the PDF, and filename, which is the name of the file to save it to. The file is opened in w+ mode (write and read; the file is created if it does not exist and truncated if it does) using 'utf8' encoding as file_obj.
  2. The contents of text_file are then written to file_obj. A message is printed if the operation executes successfully.
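
As a sketch of the same step, the variant below writes the text with pathlib and returns the path, so the caller can report where the file was saved. The function name save_text() and its return value are assumptions made for illustration, not part of the original script.

```python
from pathlib import Path


def save_text(text: str, filename: str) -> Path:
    """Write text to filename as UTF-8 and return the resulting path."""
    path = Path(filename)
    # write_text() opens the file, writes the string, and closes the file;
    # an existing file with the same name is overwritten.
    path.write_text(text, encoding='utf-8')
    return path
```

A caller could then write saved = save_text(text_from_pdf, 'the_raven.txt') and print saved to confirm the location of the output file.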

What is Encoding?

Applications often display output in a variety of user-selected languages such as English, French, Japanese, Hebrew, or Russian. Web content can be written in any of these languages and can also include a variety of emoji symbols. Python's string type uses the Unicode Standard for representing characters, which lets Python programs work with all these different possible characters. An encoding defines how those characters are converted to bytes when text is written to a file. If encoding is not specified, open() falls back to a platform-dependent default in many Python versions, and that default is not always UTF-8, so it is safest to pass encoding='utf8' explicitly. Read the official Python documentation to learn more about encoding.
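
To make the effect of a mismatched encoding concrete, here is a small sketch that encodes a string as UTF-8 and then decodes the bytes two different ways. The sample string is an arbitrary choice for illustration.

```python
text = 'café'

# Encoding turns the string into bytes; 'é' takes two bytes in UTF-8.
data = text.encode('utf-8')  # b'caf\xc3\xa9'

# Decoding with the same encoding restores the original string.
round_trip = data.decode('utf-8')  # 'café'

# Decoding with a different encoding raises no error here,
# but the accented character comes back garbled.
garbled = data.decode('latin-1')  # 'cafÃ©'
```

This is the kind of unrecognizable text you may see when a file is written with one encoding and read with another.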

Full Code

import PyPDF2


# STEP 1: open PDF and convert to text
def extract_text_from_pdf(pdf_filename: str) -> str:
    text_output = ''
    with open(pdf_filename, 'rb') as pdf_object:
        pdf_reader = PyPDF2.PdfFileReader(pdf_object)
        for i in range(0, pdf_reader.numPages):
            page_obj = pdf_reader.getPage(i)
            text_output += page_obj.extractText()
    return text_output


# STEP 2: Save Text to File
def save_converted_text(text_file: str, filename: str) -> None:
    with open(filename, 'w+', encoding='utf8') as file_obj:
        file_obj.write(text_file)
    print(f'{filename} has been successfully saved.')


if __name__ == '__main__':
    # extract text from PDF
    text_from_pdf = extract_text_from_pdf('the_raven.pdf')
    # save extracted text
    save_converted_text(text_from_pdf, 'the_raven.txt')

Directory Structure

This is the directory structure after executing script.py:

Python Project/
├── app/
│   ├── script.py
│   ├── the_raven.pdf
│   ├── the_raven.txt
│

Conclusion

After reading this article you should be able to extract text from a PDF using Python's PyPDF2 library. Remember, if the extracted text contains unrecognizable characters, make sure you are using the correct string encoding. If you found this article helpful, please like, follow, and leave a comment!

Top comments (1)

Kayla M

Great tutorial! Is there a particular reason why you use PyPDF2 over PyMuPDF or pdfminer?