AnnaGraz

How to Extract Text and Images from Word in Java Applications

Extracting text and images from a Word document is useful in several situations: you may want to reuse the content in a new document or project, save an image for use elsewhere, or share content with people who do not have access to the original document or to Word. In this post, we'll show you how to use the Spire.Doc for Java library to extract text and images from Word documents in Java.

Part 1: Understanding the Spire.Doc Library

To extract text and images from Word documents in Java, we will use the Spire.Doc library. Spire.Doc for Java is a professional Java Word API that enables developers to create, convert, manipulate and print Word documents without using Microsoft Office. It offers a variety of tools for working with Word documents, including the ability to extract text and images.

Before we can use Spire.Doc, we need to add it to our Java project. In a Maven project, add the following repository and dependency to pom.xml:

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>

<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc</artifactId>
        <version>11.6.0</version>
    </dependency>
</dependencies>

Extract Text from Word

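The following example loads a Word document, reads its text as a single string with getText(), and writes that string to a .txt file. Once you have extracted the text, you can use it for a variety of purposes, such as summarizing the document, analyzing the content, or creating new documents based on it.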

import com.spire.doc.Document;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractText {

    public static void main(String[] args) throws IOException {

        //Create a Document object and load a Word document
        Document document = new Document();
        document.loadFromFile("sample1.docx");

        //Get text from document as string
        String text=document.getText();

        //Write string to a .txt file
        writeStringToTxt(text, "ExtractedText.txt");
    }
    //Append the string to the given .txt file; try-with-resources closes the writer
    //automatically, and any IOException is passed on to the caller
    public static void writeStringToTxt(String content, String txtFileName) throws IOException {
        try (FileWriter fWriter = new FileWriter(txtFileName, true)) {
            fWriter.write(content);
        }
    }
}
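
As a small illustration of "analyzing the content", the sketch below counts how often each word occurs in a string such as the one returned by getText(). It uses only the Java standard library. The class name WordFrequency and the method countWords are hypothetical helpers made up for this example and are not part of Spire.Doc, and the sample string merely stands in for real extracted text.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not part of Spire.Doc
class WordFrequency {

    // Split the text on anything that is not a letter or digit,
    // then count each lower-cased word (result is sorted alphabetically)
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by document.getText()
        String text = "Spire.Doc extracts text. The text can then be analyzed.";
        System.out.println(countWords(text));
    }
}
```

In a real project you would pass the extracted string to countWords instead of the sample sentence.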

Extract Images from Word

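You can use Spire.Doc for Java to extract a single image or all the images at once from a Word document and save them as files. The following example traverses the document tree with a queue, collects every picture it finds, and writes each one to a PNG file.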

import com.spire.doc.*;
import com.spire.doc.documents.*;
import com.spire.doc.fields.*;
import com.spire.doc.interfaces.*;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.*;
import java.util.*;

public class ExtractImage {
    public static void main(String[] args) throws IOException {

        //Create a Document object and load a Word document
        Document document = new Document();
        document.loadFromFile("sample2.docx");

        //Create a queue and add the root document element to it
        Queue<ICompositeObject> nodes = new LinkedList<>();
        nodes.add(document);

        //Create an ArrayList object to store extracted images
        List<BufferedImage> images = new ArrayList<>();

        //Traverse the document tree
        while (nodes.size() > 0) {
            ICompositeObject node = nodes.poll();
            for (int i = 0; i < node.getChildObjects().getCount(); i++)
            {
                IDocumentObject child = node.getChildObjects().get(i);
                if (child instanceof ICompositeObject)
                {
                    nodes.add((ICompositeObject) child);
                }
                else if (child.getDocumentObjectType() == DocumentObjectType.Picture)
                {
                    DocPicture picture = (DocPicture) child;
                    images.add(picture.getImage());
                }
            }
        }

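        //Save images to the "output" folder, creating it first if it does not exist
        File outputDir = new File("output");
        outputDir.mkdirs();
        for (int i = 0; i < images.size(); i++) {
            File file = new File(outputDir, String.format("extractImage-%d.png", i));
            ImageIO.write(images.get(i), "PNG", file);
        }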
    }
}
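
The queue-based loop above is a breadth-first traversal: containers are queued so their children are visited later, and pictures are collected as they are found. The sketch below shows the same idea on a simplified tree using only the Java standard library. The Node class, its picture flag, and the collectPictures method are hypothetical stand-ins invented for this illustration; they are not Spire.Doc types, and a real document tree is richer than this.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for a document tree; these types are not part of Spire.Doc
class TreeTraversalSketch {

    // A node with a label, a flag marking it as a picture, and optional children
    static class Node {
        final String label;
        final boolean picture;
        final List<Node> children = new ArrayList<>();

        Node(String label, boolean picture) {
            this.label = label;
            this.picture = picture;
        }

        // Add a child and return this node so calls can be chained
        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    // Visit the tree level by level and collect the labels of all picture nodes
    static List<String> collectPictures(Node root) {
        List<String> found = new ArrayList<>();
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Node child : node.children) {
                if (child.picture) {
                    found.add(child.label);
                } else {
                    queue.add(child); // container: its children are visited later
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Node body = new Node("body", false)
                .add(new Node("paragraph1", false).add(new Node("imageA", true)))
                .add(new Node("imageB", true));
        System.out.println(collectPictures(new Node("document", false).add(body)));
    }
}
```

Because the traversal goes level by level, pictures closer to the root are collected before pictures nested deeper in the tree, so the result order may differ from the reading order of the document.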

Conclusion

Extracting text and images from a Word document is useful for many purposes. By following the steps outlined above, you can extract the text and images from a Word document and save them to your computer, whether you need to edit the text separately or use the images in another project. Beyond extraction, Spire.Doc for Java supports a rich set of Word elements, including section, header, footer, footnote, endnote, paragraph, list, table, text, TOC, form field, mail merge, hyperlink, bookmark, watermark, image, style, shape, textbox, OLE, WordArt, background settings, digital signature, document encryption and many more.

Related topics:

  1. Java Find and Replace Text in Word Documents
  2. Java Insert Images to Word Documents
  3. Java Add Text Watermarks or Image Watermarks to Word
  4. Java Add Background Color or Picture to Word Documents
  5. Java Compare Two Word Documents
