Usman Aziz

Posted on • Originally published at blog.groupdocs.com

Count Words and Occurrences of Each Word in a Document using C#

Repeated content can diminish the worth of a piece of writing, so as a writer you should follow the DRY (don’t repeat yourself) principle. Statistics such as the word count or the number of occurrences of each word let you analyze the content, but gathering them manually across multiple documents is hard. So in this article, I’ll demonstrate how to programmatically count words and the number of occurrences of each word in PDF, Word, Excel, PowerPoint, Ebook, Markup, and Email document formats using C#. To extract the text from the documents, I’ll be using GroupDocs.Parser for .NET, which is a powerful document parsing API.

Steps to count words and their occurrences in C#

1. Create a new project.

2. Install GroupDocs.Parser for .NET using NuGet Package Manager.

3. Add the following namespaces.

using GroupDocs.Parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq; 

4. Create an instance of the Parser class and load the document.

using (Parser parser = new Parser("sample.pdf"))
{
  // your code goes here.
}

5. Extract the text from the document into a TextReader object using the Parser.GetText() method.

using (TextReader reader = parser.GetText())
{

}
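Before reading from the reader, it may be worth guarding against a missing one. As far as I know, the GroupDocs.Parser documentation describes GetText() as returning null when text extraction isn't supported for a format; treat that as an assumption and check it against the version you use. The sketch below uses a hypothetical helper, ReadAllOrEmpty, and a StringReader as a stand-in for parser.GetText() so that it runs without the library. It is written with top-level statements, which need C# 9 or later.

```csharp
using System;
using System.IO;

// Hypothetical helper: returns the full text, or an empty string when no reader is available.
// Assumption: a null reader means text extraction isn't supported for the document.
static string ReadAllOrEmpty(TextReader source)
{
    return source == null ? string.Empty : source.ReadToEnd();
}

// A StringReader stands in for parser.GetText() here.
using (TextReader reader = new StringReader("Hello world"))
{
    Console.WriteLine(ReadAllOrEmpty(reader)); // prints "Hello world"
}
```

With an empty string as the fallback, the counting step that follows simply finds no words instead of throwing a NullReferenceException.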

6. Split the text into words, store them in a string array, and count the occurrences of each word.

Dictionary<string, int> stats = new Dictionary<string, int>();
string text = reader.ReadToEnd();
char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
// split words
string[] words = text.Split(chars);
int minWordLength = 2; // count only words having more than 2 characters

// iterate over the word collection to count occurrences
foreach (string word in words)
{
    string w = word.Trim().ToLower();
    if (w.Length > minWordLength)
    {
        if (!stats.ContainsKey(w))
        {
            // add new word to collection
            stats.Add(w, 1);
        }
        else
        {
            // update word occurrence count
            stats[w] += 1;
        }
    }
}
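The fixed separator list above leaves characters such as tabs, exclamation marks, quotes, and parentheses attached to words. One possible alternative, shown here only as a sketch, is to pull the words out with a regular expression instead of splitting. The helper name CountWords and the definition of a word as a run of Unicode letters or digits are my assumptions, not part of GroupDocs.Parser, and the snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: counts words longer than minWordLength characters, ignoring case.
// Assumption: a "word" is a run of Unicode letters or digits, so "don't" splits into "don" and "t".
static Dictionary<string, int> CountWords(string text, int minWordLength)
{
    var counts = new Dictionary<string, int>();
    foreach (Match match in Regex.Matches(text, @"[\p{L}\p{N}]+"))
    {
        string w = match.Value.ToLowerInvariant();
        if (w.Length <= minWordLength) continue;

        // TryGetValue leaves count at 0 for a new word, so one path handles both cases
        counts.TryGetValue(w, out int count);
        counts[w] = count + 1;
    }
    return counts;
}

var sample = CountWords("The cat (the \"big\" cat) sat!\tThe end.", 2);
Console.WriteLine(sample["the"]); // 3
Console.WriteLine(sample["cat"]); // 2
```

ToLowerInvariant is used so that the result does not depend on the machine's culture settings.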

7. Order the words by their occurrence count and display the results.

// order the list by word count
var orderedStats = stats.OrderByDescending(x => x.Value);
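// print total and unique word counts (only words longer than minWordLength are included)
Console.WriteLine("Total word count: {0}", stats.Values.Sum());
Console.WriteLine("Unique word count: {0}", stats.Count);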
// print occurrence of each word
foreach (var pair in orderedStats)
{
    Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);
}
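For long documents you may only want the most frequent words. Here is a minimal sketch of that, assuming a hypothetical helper named TopWords. It also shows the difference between the number of unique words (the dictionary's Count) and the total number of counted words (the sum of its values). The sample counts are made up for illustration, and the snippet uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns the n most frequent words.
// Ties are broken alphabetically so the output order is stable between runs.
static List<KeyValuePair<string, int>> TopWords(Dictionary<string, int> counts, int n)
{
    return counts
        .OrderByDescending(p => p.Value)
        .ThenBy(p => p.Key, StringComparer.Ordinal)
        .Take(n)
        .ToList();
}

// Made-up counts, standing in for the stats dictionary built in step 6
var demoStats = new Dictionary<string, int> { ["parser"] = 4, ["text"] = 7, ["word"] = 4 };

Console.WriteLine("Unique words: {0}", demoStats.Count);        // 3
Console.WriteLine("Total words: {0}", demoStats.Values.Sum());  // 15
foreach (var pair in TopWords(demoStats, 2))
{
    Console.WriteLine("{0}: {1}", pair.Key, pair.Value);         // text: 7, then parser: 4
}
```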

Complete Code

using (Parser parser = new Parser("sample.pdf"))
{                
    // extract the text into a reader
    using (TextReader reader = parser.GetText())
    {
        Dictionary<string, int> stats = new Dictionary<string, int>();
        string text = reader.ReadToEnd();
        char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
        // split words
        string[] words = text.Split(chars);
        int minWordLength = 2; // count only words having more than 2 characters

        // iterate over the word collection to count occurrences
        foreach (string word in words)
        {
            string w = word.Trim().ToLower();
            if (w.Length > minWordLength)
            {
                if (!stats.ContainsKey(w))
                {
                    // add new word to collection
                    stats.Add(w, 1);
                }
                else
                {
                    // update word occurrence count
                    stats[w] += 1;
                }
            }
        }

        // order the collection by word count
        var orderedStats = stats.OrderByDescending(x => x.Value);
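        // print total and unique word counts (only words longer than minWordLength are included)
        Console.WriteLine("Total word count: {0}", stats.Values.Sum());
        Console.WriteLine("Unique word count: {0}", stats.Count);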
        // print occurrence of each word
        foreach (var pair in orderedStats)
        {
            Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);
        }
    }
}
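Since the motivation was analyzing multiple documents, the per-document dictionaries can be combined into one. The following is a sketch under stated assumptions: MergeCounts is a hypothetical helper, and the two sample dictionaries are made up to stand in for the stats produced by running the code above on two different files. It uses top-level statements (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: adds every count from source into target, creating entries as needed.
static void MergeCounts(Dictionary<string, int> target, Dictionary<string, int> source)
{
    foreach (var entry in source)
    {
        // current stays 0 when the word has not been seen in target yet
        target.TryGetValue(entry.Key, out int current);
        target[entry.Key] = current + entry.Value;
    }
}

// Made-up per-document results; in practice each would come from the counting loop above
var report = new Dictionary<string, int> { ["invoice"] = 3, ["total"] = 1 };
var letter = new Dictionary<string, int> { ["invoice"] = 2, ["regards"] = 1 };

var combined = new Dictionary<string, int>();
MergeCounts(combined, report);
MergeCounts(combined, letter);

Console.WriteLine(combined["invoice"]); // 5
```

The combined dictionary can then be ordered and printed exactly like the single-document stats.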

Results


Oldest comments (2)

Fernando Crozetta

Interesting article.
But since the package used is paid, I was wondering whether there is an open source library that could replace the one you are using. I think DocX would do the job, no?

Is there any reason why you chose this particular library?

Usman Aziz • Edited

Hi Fernando,

I chose this API because it supports a variety of document formats: PDF, Word, Excel, PowerPoint, Ebook, Markup, and Email. The DocX library, by contrast, seems to support the manipulation of Word formats only.

The good thing about the GroupDocs API is that you can try it for free using a temporary license: purchase.groupdocs.com/temporary-l....