PDF is one of the most widely used document formats, but the text in a PDF is difficult to edit. Extracting text from a PDF document is often required to analyze the text or to retrieve specific information from it. In this article, we demonstrate how to extract text from PDF documents programmatically in Java with the help of Spire.PDF for Java, covering the following four parts.
- Extract Text from PDF using Java
- Extract Text from Specific Page in PDF
- Extract Text from Specific Area in PDF
- Extract Highlighted Text from PDF Document
Install Spire.PDF for Java
First, you're required to add the Spire.Pdf.jar file as a dependency in your Java program. The JAR file can be downloaded from this link. If you use Maven, you can easily import the JAR file in your application by adding the following code to your project's pom.xml file.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>8.7.0</version>
    </dependency>
</dependencies>
Extract Text from PDF using Java
Spire.PDF offers the PdfPageBase.extractText() method to extract the text from a PDF page. Calling it on every page extracts the text of the whole document. Here are the steps to extract text from all pages of a PDF document.
- Create a PdfDocument instance.
- Load a sample PDF file using PdfDocument.loadFromFile() method.
- Create a StringBuilder object.
- Loop through all the pages of PDF and use PdfPageBase.extractText() method to extract text, then append the data to the StringBuilder instance using StringBuilder.append() method.
- Write the extracted data to a txt document using FileWriter.write() method.
import com.spire.pdf.*;
import java.io.*;

public class ExtractTextFromPDF {
    public static void main(String[] args) throws Exception {
        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        //Load the file from disk
        pdf.loadFromFile("PDFSample.pdf");
        //Create a StringBuilder instance
        StringBuilder sb = new StringBuilder();
        //Traverse all the pages in the document
        for (int i = 0; i < pdf.getPages().getCount(); i++) {
            PdfPageBase page = pdf.getPages().get(i);
            //Extract the text from the page and keep white space
            sb.append(page.extractText(true));
        }
        //Create a new txt file to save the extracted text;
        //try-with-resources closes the writer even if writing fails
        try (FileWriter writer = new FileWriter("ExtractText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdf.close();
    }
}
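One detail worth noting about the final step: on Java versions before 18, FileWriter encodes text with the platform default charset, so non-ASCII characters extracted from a PDF may be saved differently from machine to machine. The following is a minimal sketch of writing the text with an explicit UTF-8 encoding using only the standard library. The class name TextFileSaver and its save method are hypothetical helpers introduced for illustration; they are not part of Spire.PDF.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Spire.PDF): saves extracted text as UTF-8.
final class TextFileSaver {

    // Writes the text to the target file, creating or overwriting it.
    // Files.write opens, writes, and closes the file in a single call.
    static Path save(CharSequence text, Path target) throws IOException {
        return Files.write(target, text.toString().getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the StringBuilder filled by the extraction loop
        StringBuilder sb = new StringBuilder("text extracted from a PDF page");
        Path out = save(sb, Paths.get("ExtractText.txt"));
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

In the example above, the call TextFileSaver.save(sb, Paths.get("ExtractText.txt")) could take the place of the FileWriter block.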
Extract Text from Specific Page in PDF
Here are the steps to extract the text from a specific page of a PDF. Page indexes are zero-based, so index 0 is the first page.
- Create a PdfDocument instance and load a sample PDF file using PdfDocument.loadFromFile() method.
- Create a StringBuilder object.
- Get the first page of PDF using PdfDocument.getPages().get(0) method.
- Use page.extractText() method to extract text from the first page, then append the data to the StringBuilder instance using StringBuilder.append() method.
- Write the extracted data to a txt document using FileWriter.write() method.
import com.spire.pdf.*;
import java.io.*;

public class ExtractTextFromParticularPage {
    public static void main(String[] args) throws Exception {
        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        //Load the file from disk
        pdf.loadFromFile("PDFSample.pdf");
        //Create a StringBuilder instance
        StringBuilder sb = new StringBuilder();
        //Get the first page
        PdfPageBase page = pdf.getPages().get(0);
        //Extract the text and keep white space
        sb.append(page.extractText(true));
        //Create a new txt file to save the extracted text;
        //try-with-resources closes the writer even if writing fails
        try (FileWriter writer = new FileWriter("extractTextFromParticularPage.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdf.close();
    }
}
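Because the page collection is indexed from 0 while PDF viewers show page numbers starting at 1, it is easy to request the wrong page. Below is a small sketch of a conversion helper with a range check. PageIndex and fromPageNumber are hypothetical names introduced for illustration; they are not part of Spire.PDF.

```java
// Hypothetical helper (not part of Spire.PDF).
final class PageIndex {

    // Converts a 1-based page number, as shown in a PDF viewer, to the
    // 0-based index used by an indexed page collection.
    static int fromPageNumber(int pageNumber, int pageCount) {
        if (pageNumber < 1 || pageNumber > pageCount) {
            throw new IllegalArgumentException(
                    "Page " + pageNumber + " is outside 1.." + pageCount);
        }
        return pageNumber - 1;
    }

    public static void main(String[] args) {
        // Page 3 of a 10-page document is at index 2
        System.out.println(fromPageNumber(3, 10));
    }
}
```

With the example above, the third page could then be fetched with pdf.getPages().get(PageIndex.fromPageNumber(3, pdf.getPages().getCount())).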
Extract Text from Specific Area in PDF
Here are the steps to extract the text from a specific rectangle area on a PDF page.
- Create a PdfDocument instance and load a sample PDF file using PdfDocument.loadFromFile() method.
- Create a StringBuilder object.
- Get the first page of PDF using PdfDocument.getPages().get(0) method.
- Use page.extractText(new Rectangle2D.Float(x, y, width, height)) method to extract the text from the specific rectangle area, then append the data to the StringBuilder instance using StringBuilder.append() method.
- Write the extracted data to a txt document using FileWriter.write() method.
import com.spire.pdf.*;
import java.io.*;
import java.awt.geom.Rectangle2D;

public class ExtractTextFromSpecificArea {
    public static void main(String[] args) throws Exception {
        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        //Load the file from disk
        pdf.loadFromFile("PDFSample.pdf");
        //Create a StringBuilder instance
        StringBuilder sb = new StringBuilder();
        //Get the first page
        PdfPageBase page = pdf.getPages().get(0);
        //Extract text from a specific rectangular area (x, y, width, height) within the page
        sb.append(page.extractText(new Rectangle2D.Float(60, 120, 500, 220)));
        //Create a new txt file to save the extracted text;
        //try-with-resources closes the writer even if writing fails
        try (FileWriter writer = new FileWriter("extractTextFromParticularArea.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdf.close();
    }
}
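The article does not state the unit of the rectangle values. The sketch below assumes they are PDF points (1/72 inch, the default unit in PDF) and shows how a region measured in millimetres could be converted into a Rectangle2D.Float. RegionInPoints and fromMillimetres are hypothetical names introduced for illustration; they are not part of Spire.PDF.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper (not part of Spire.PDF).
// Assumption: the extraction rectangle is expressed in PDF points (1/72 inch).
final class RegionInPoints {

    // One inch is 25.4 mm and 72 points
    private static final float POINTS_PER_MM = 72f / 25.4f;

    // Builds a rectangle in points from a position and size given in millimetres.
    static Rectangle2D.Float fromMillimetres(float xMm, float yMm,
                                             float widthMm, float heightMm) {
        return new Rectangle2D.Float(
                xMm * POINTS_PER_MM, yMm * POINTS_PER_MM,
                widthMm * POINTS_PER_MM, heightMm * POINTS_PER_MM);
    }

    public static void main(String[] args) {
        // A region 170 mm wide and 80 mm high, offset 20 mm and 40 mm from the origin
        Rectangle2D.Float area = fromMillimetres(20f, 40f, 170f, 80f);
        System.out.println(area);
    }
}
```

If the assumption holds, the resulting rectangle could be passed to page.extractText() in place of the hard-coded values.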
Extract Highlighted Text from PDF Document
Spire.PDF also supports extracting highlighted text from a PDF document. Here are the steps.
- Create a PdfDocument instance and load a sample PDF file using PdfDocument.loadFromFile() method.
- Create a StringBuilder object.
- Get the first page of PDF using PdfDocument.getPages().get(0) method.
- Get the annotation collection of the first page of the document by using page.getAnnotationsWidget().
- Loop through the annotations. For each annotation that is a PdfTextMarkupAnnotationWidget (a text markup annotation such as a highlight), extract the text inside its bounds using page.extractText(annotation.getBounds()), then append the data to the StringBuilder instance using StringBuilder.append() method.
- Write the extracted data to a txt document using FileWriter.write() method.
import com.spire.pdf.*;
import com.spire.pdf.annotations.PdfTextMarkupAnnotationWidget;
import java.io.*;

public class ExtractHighlightedText {
    public static void main(String[] args) throws Exception {
        //Create a PdfDocument instance
        PdfDocument pdf = new PdfDocument();
        //Load the file from disk
        pdf.loadFromFile("PDFSample0.pdf");
        //Create a StringBuilder instance
        StringBuilder sb = new StringBuilder();
        //Get the first page
        PdfPageBase page = pdf.getPages().get(0);
        //Loop through the annotations on the page
        for (int i = 0; i < page.getAnnotationsWidget().getCount(); i++) {
            //Only handle text markup annotations such as highlights
            if (page.getAnnotationsWidget().get(i) instanceof PdfTextMarkupAnnotationWidget) {
                PdfTextMarkupAnnotationWidget textMarkupAnnotation = (PdfTextMarkupAnnotationWidget) page.getAnnotationsWidget().get(i);
                //Extract the text inside the bounds of the annotation
                sb.append(page.extractText(textMarkupAnnotation.getBounds()));
            }
        }
        //Create a new txt file to save the extracted text;
        //try-with-resources closes the writer even if writing fails
        try (FileWriter writer = new FileWriter("extractHighlightedText.txt")) {
            writer.write(sb.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
        pdf.close();
    }
}
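When a page contains several highlights, appending each fragment directly to the StringBuilder may run them together in the output. As a hedged sketch, the fragments could instead be collected in a list and joined with line breaks. FragmentJoiner and joinLines are hypothetical names introduced for illustration; they are not part of Spire.PDF.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper (not part of Spire.PDF).
final class FragmentJoiner {

    // Joins text fragments with the platform line separator,
    // trimming each fragment and skipping null or blank ones.
    static String joinLines(List<String> fragments) {
        StringJoiner joiner = new StringJoiner(System.lineSeparator());
        for (String fragment : fragments) {
            if (fragment == null) {
                continue;
            }
            String trimmed = fragment.trim();
            if (!trimmed.isEmpty()) {
                joiner.add(trimmed);
            }
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for the strings returned for each highlighted area
        List<String> highlights = Arrays.asList("  first highlight ", "", "second highlight");
        System.out.println(joinLines(highlights));
    }
}
```

In the loop above, each extracted string would be added to a List<String> rather than appended to the StringBuilder, and joinLines would produce the text to write to the file.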
Conclusion
In this article, we have demonstrated how to extract text from PDF using Java. With Spire.PDF for Java, we can extract text from a PDF file in different scenarios: extracting all the text from a PDF, extracting the text from a specific page or a specific page area, and extracting only the highlighted text. You can check the PDF forum for more features for working with PDF files.
Top comments (1)
Good to know: Spire.PDF is not open source, and you need a license to use it. An open-source alternative would be Apache PDFBox.