Read/Extract Text from Pdf in Java

E-iceblue Product Family ・2 min read

In this article, we’re going to explain how to read/extract text from a Pdf file in Java.

An overview of content:

  1. Read/Extract All Text from a Pdf
  2. Read/Extract Text from a Specific Rectangle Area in a Pdf Page
  3. Read/Extract Text using SimpleTextExtractionStrategy

The Pdf library we need:

Spire.PDF for Java (Spire.Pdf.jar download link: https://www.e-iceblue.com/Download/pdf-for-java.html)

Spire.PDF for Java info: Spire.PDF for Java is a professional Java component that lets developers create Pdf files from scratch or process existing Pdf files in Java applications without requiring Adobe Acrobat to be installed.

The example Pdf file: [image]

Sample Code

Imported Namespaces

import com.spire.pdf.*;
import com.spire.pdf.exporting.text.SimpleTextExtractionStrategy;

import java.awt.geom.Rectangle2D;
import java.io.*;

Read/Extract All Text from a Pdf

//Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();
//Load the Pdf file
pdf.loadFromFile("Additional.pdf");

StringBuilder sb = new StringBuilder();

//Extract text from every page of the Pdf
for (PdfPageBase page : (Iterable<PdfPageBase>) pdf.getPages()) {
    sb.append(page.extractText(true));
}

//Write the text into a .txt file; try-with-resources closes the writer
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
}

//Close the PdfDocument object
pdf.close();
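
FileWriter with only a file name uses the platform default charset on Java versions before 18, so non-ASCII text can be written in an unexpected encoding on some systems. A minimal sketch of writing the extracted text as UTF-8 with the standard library instead; the class and method names are hypothetical and not part of Spire.PDF:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Spire.PDF): saves extracted text as UTF-8.
final class ExtractedTextWriter {

    private ExtractedTextWriter() {}

    // Writes the text to the target file, creating it or replacing its content.
    static void save(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

With this helper, the write step could become ExtractedTextWriter.save(sb.toString(), Paths.get("ExtractText.txt")), with java.nio.file.Paths imported.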

Output: [image]

Read/Extract Text from a Specific Rectangle Area in a Pdf Page

//Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();
//Load the Pdf file
pdf.loadFromFile("Additional.pdf");

//Get the first page of the Pdf
PdfPageBase page = pdf.getPages().get(0);

//Instantiate a Rectangle2D object 
Rectangle2D rect = new Rectangle2D.Float();
//Set location and size
rect.setFrame( 50, 50, 500, 100);

//Extract text from the given rectangle area in the first page
StringBuilder sb = new StringBuilder();
sb.append(page.extractText(rect));

//Write the text into a .txt file; try-with-resources closes the writer
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
}

//Close the PdfDocument object
pdf.close();
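
The four arguments to setFrame are x, y, width and height, so the rectangle above spans x from 50 to 550 and y from 50 to 150. The article does not state which unit or page origin Spire.PDF applies to these values, so that part is left open here. A small sketch using only java.awt.geom to build and inspect such a region; the class and method names are hypothetical:

```java
import java.awt.geom.Rectangle2D;
import java.util.Locale;

// Hypothetical helper (not part of Spire.PDF) for building and checking a region.
final class ExtractionRegion {

    private ExtractionRegion() {}

    // Builds a rectangle from its top-left corner (x, y) plus width and height.
    static Rectangle2D of(double x, double y, double width, double height) {
        Rectangle2D rect = new Rectangle2D.Float();
        rect.setFrame(x, y, width, height);
        return rect;
    }

    // Reports the covered ranges, which is handy when tuning the region by hand.
    static String describe(Rectangle2D rect) {
        return String.format(Locale.ROOT, "x: %.0f to %.0f, y: %.0f to %.0f",
                rect.getMinX(), rect.getMaxX(), rect.getMinY(), rect.getMaxY());
    }
}
```

For the values used in the article, describe(of(50, 50, 500, 100)) returns "x: 50 to 550, y: 50 to 150".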

Output: [image]

Read/Extract Text using SimpleTextExtractionStrategy

//Instantiate a PdfDocument object
PdfDocument pdf = new PdfDocument();
//Load the Pdf file
pdf.loadFromFile("Additional.pdf");

//Get the first page of the Pdf
PdfPageBase page = pdf.getPages().get(0);

//Extract text from the first page using SimpleTextExtractionStrategy
SimpleTextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
StringBuilder sb = new StringBuilder();
sb.append(page.extractText(strategy));

//Write the text into a .txt file; try-with-resources closes the writer
try (FileWriter writer = new FileWriter("ExtractText.txt")) {
    writer.write(sb.toString());
} catch (IOException e) {
    e.printStackTrace();
}

//Close the PdfDocument object
pdf.close();

Output: [image]

Discussion (4)

Joshua Klein

What about searching the Pdf file for a pattern, preferably a regular expression? Is it mandatory to extract the text first?

E-iceblue Product Family Author • Edited

Hi, the following code snippet shows how to find text by a pattern and highlight the results in yellow.

import com.spire.pdf.*;
import com.spire.pdf.general.find.PdfTextFind;
import java.awt.*;

public class FindByRegularExpression {

    public static void main(String[] args) throws Exception {

        //Load a PDF document
        PdfDocument pdf = new PdfDocument();
        pdf.loadFromFile("C:\\Users\\Administrator\\Desktop\\test.pdf");

        //Define the pattern: a '#' followed by one or more word characters
        String pattern = "\\#\\w+\\b";

        //Loop through the pages
        for (PdfPageBase page : (Iterable<PdfPageBase>) pdf.getPages()) {

            //Find all results that match the pattern
            PdfTextFind[] results = page.findText(pattern).getFinds();

            //Highlight results in yellow
            for (PdfTextFind find : results) {
                find.applyHighLight(Color.yellow);
            }
        }

        //Save to file and close the PdfDocument object
        pdf.saveToFile("output.pdf");
        pdf.close();
    }
}
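
If the text has already been extracted into a String as in the article, and only the matching strings are needed rather than highlights in the Pdf, a plain java.util.regex search over that String is another option. A minimal sketch that does not use Spire.PDF at all; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex search over text that was extracted beforehand.
final class ExtractedTextSearch {

    private ExtractedTextSearch() {}

    // Returns every substring of the text that matches the regular expression,
    // in the order the matches occur.
    static List<String> findAll(String text, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }
}
```

Calling ExtractedTextSearch.findAll(sb.toString(), "\\#\\w+\\b") would apply the same pattern as above to the extracted text.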
Khalil

Does this work with Asian-language content?
