CertosinoLab

Find occurrences of a word in a PDF file with C# and PdfPig

Introducing PdfPig

PdfPig is an open source C# library that lets us extract text and other content from PDF files. It's a port of the Java PDFBox library. You can find more here: https://github.com/UglyToad/PdfPig

The Word Counter

The project will be a simple console application.

Creating The Project

With Visual Studio 2022, follow the steps below:

  1. Open Visual Studio 2022

  2. Create New Project

  3. Select Console Application in C#

  4. Set Name and Path

  5. Choose .NET 5.0 Framework

Now you have the basic structure of a console application. Create a folder and call it pdf, then add a PDF file to it. In this tutorial I used a PDF created from this page: https://en.wikipedia.org/wiki/Cr%C3%AApe

To get PdfPig:

  1. In the search bar, type Manage NuGet Packages

  2. Click on Browse

  3. Search for PdfPig

  4. Install it

The Code

Thanks to PdfPig, extracting text from the PDF and counting the occurrences of a word is trivial. Here is the full code:

using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

namespace pdf_pig_word_counter
{
    internal class Program
    {
        static void Main(string[] args)
        {
            string wordToFind = "pancake";
            int numberOfOccurrences = 0;

            // Replace YOUR PATH with the folder that contains the pdf directory.
            using (PdfDocument document = PdfDocument.Open(@"YOUR PATH\pdf\test.pdf"))
            {
                foreach (Page page in document.GetPages())
                {
                    foreach (Word word in page.GetWords())
                    {
                        // Case-insensitive substring match, so "Pancakes" also counts.
                        if (word.Text.Contains(wordToFind, StringComparison.OrdinalIgnoreCase))
                            numberOfOccurrences++;
                    }
                }

                Console.WriteLine("Total Occurrences: " + numberOfOccurrences);
            }
        }
    }
}

This program tells us how many words in the PDF contain the word pancake, ignoring case. Because the check is a substring match, words such as "pancakes" are counted too.
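If you want to count only exact matches, one possible variation is to compare whole words instead of using Contains. The sketch below is not part of the original project: the WordMatcher class, its two methods, and the punctuation list are hypothetical names and choices made for illustration. It works on plain strings so it can be tried without a PDF, and it assumes .NET 5, as in the project setup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of PdfPig or the original project).
public static class WordMatcher
{
    // Assumed set of characters that may stick to a word in extracted text.
    private static readonly char[] Punctuation =
        { '.', ',', ';', ':', '!', '?', '"', '\'', '(', ')', '[', ']' };

    // Same idea as the article's check: case-insensitive substring match.
    public static int CountContaining(IEnumerable<string> words, string target) =>
        words.Count(w => w.Contains(target, StringComparison.OrdinalIgnoreCase));

    // Stricter variant: strip surrounding punctuation, then require equality.
    public static int CountWholeWord(IEnumerable<string> words, string target) =>
        words.Count(w => string.Equals(
            w.Trim(Punctuation), target, StringComparison.OrdinalIgnoreCase));
}

public static class Demo
{
    public static void Main()
    {
        // Made-up sample words, standing in for text extracted from a page.
        string[] sample = { "Pancake", "pancakes,", "(pancake)", "crêpe" };

        Console.WriteLine(WordMatcher.CountContaining(sample, "pancake")); // prints 3
        Console.WriteLine(WordMatcher.CountWholeWord(sample, "pancake"));  // prints 2
    }
}
```

To use it with PdfPig, you could pass page.GetWords().Select(w => w.Text) as the words argument and add the result to numberOfOccurrences for each page.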

You can find the project here: https://github.com/CertosinoLab/mediumarticles/tree/pdf_pig_word_counter

Thank you!
