DEV Community

ksn-developer

Extracting text from PDF files using PyPDF3

PyPDF3 is a Python library for working with PDF files that builds upon the PyPDF2 library. It provides an easy-to-use interface for reading and writing PDF files, and it includes tools for extracting text from PDF files. In this article, we will explore how to use PyPDF3 to extract text from PDF documents.

Installation

To use PyPDF3, you need to install it using pip. You can do this by running the following command in your command prompt or terminal:

pip install PyPDF3

Once you have installed PyPDF3, you can import it in your Python script using the following line of code:

import PyPDF3
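If the import fails, the package is probably not installed in the interpreter you are running. As a rough sketch, the standard library's importlib can check this without raising an error. The helper name is_installed is made up for this example and is not part of PyPDF3.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the import system can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Report whether PyPDF3 is importable in the current environment
if is_installed("PyPDF3"):
    print("PyPDF3 is available")
else:
    print("PyPDF3 not found; run: pip install PyPDF3")
```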

Extracting Text from PDF Documents

To extract text from a PDF document using PyPDF3, you first need to open the PDF file in binary mode using Python's built-in open() function. You can then create a PdfFileReader object using PyPDF3, which allows you to read the contents of the PDF file. From there, you loop over the pages, call extractText() on each one, and join the results into a single string. Here's an example:

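import PyPDF3

# Open the PDF in binary mode; keep the file open while pages are being read
with open('sample.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF3.PdfFileReader(pdf_file)
    text = ''
    # Walk through every page and append its extracted text
    for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        text += page.extractText()

print(text)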
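The same loop can be wrapped in a small helper that keeps the text of each page separate, which is handy if you want to know which page a passage came from. This is only a sketch: the function names extract_pages and extract_text_from_file are made up for this example, and the only PyPDF3 calls used are the ones shown above (PdfFileReader, numPages, getPage, extractText). The fallback to an empty string assumes a page might return nothing, for example a scanned image with no text layer.

```python
def extract_pages(pdf_reader):
    """Return a list holding the extracted text of each page, in page order.

    pdf_reader is expected to behave like PyPDF3.PdfFileReader, that is,
    to expose numPages and getPage() as used in the article.
    """
    pages = []
    for page_num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(page_num)
        # Assumption: treat a missing or empty result as an empty string
        pages.append(page.extractText() or '')
    return pages


def extract_text_from_file(path):
    """Open a PDF at the given path and return its per-page text as a list."""
    import PyPDF3  # imported here so extract_pages has no hard dependency

    with open(path, 'rb') as pdf_file:
        pdf_reader = PyPDF3.PdfFileReader(pdf_file)
        # Read the pages while the file is still open
        return extract_pages(pdf_reader)
```

Calling extract_text_from_file('sample.pdf') would then give one string per page, and '\n'.join(...) on the result gives roughly the same combined text as the loop above.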