DEV Community

Chris Bongers

Originally published at daily-dev-tips.com

Counting all words across markdown files ~ CLI

Quite a while ago, my good friend and colleague @inhuofficial asked me if I knew how many words I'd written.

And although I'm at 800+ articles, I had no idea how many words this was.

So I decided to find a solution to give him an answer and shock myself (and maybe even you?).

This solution will use the command-line interface (CLI), which is the simplest way to do it.
In a future article, I might dive into some other solutions.

Counting words in markdown files with the CLI

The first step is to count the words in some text. Luckily for us, Unix already has a command for this called wc (word count).

We can use it like this:

wc -w <<< "Some random words"

This should output 3 as there are three words in this string.
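
The <<< here-string is a bash and zsh feature, so it may not work in a plain POSIX sh. As a minimal sketch, the same text can be piped into wc instead (the variable name count is only illustrative):

```shell
# Pipe the text into wc on standard input instead of using a here-string.
# Some wc implementations pad the number with spaces, so tr strips the whitespace.
count=$(printf '%s\n' "Some random words" | wc -w | tr -d '[:space:]')

echo "$count"
```

Stripping the whitespace is optional when you only read the number, but it helps if you want to reuse the value in a script.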

[Image: Word count command in Unix terminal]

Now that we know how to count words, we need a way to extract the actual content from our markdown file.

There are several Unix markdown parsers. If you have a favorite one, you can use that. Otherwise, I suggest using pandoc.

If you don't have it yet, you can install it with Homebrew.

brew install pandoc

We can then use it to read a markdown file like this:

pandoc --strip-comments -t plain {your-markdown}.md

The command uses the --strip-comments option to strip all HTML comments from the markdown.
The -t option defines the format to convert to, in our case, plain text.
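
To see what these options do without touching a real file, here is a small sketch that feeds pandoc some markdown on standard input. The sample text is made up for illustration, and -f markdown is added because standard input has no file extension to detect the format from:

```shell
# Sample markdown containing an HTML comment; the text is invented for this demo.
sample='Some **random** words

<!-- a hidden note -->'

if command -v pandoc >/dev/null 2>&1; then
  # -f markdown names the input format, -t plain the output format.
  plain=$(printf '%s\n' "$sample" | pandoc --strip-comments -f markdown -t plain)
else
  plain=''
  echo "pandoc not found; skipping the demo" >&2
fi

printf '%s\n' "$plain"
```

The printed text should contain the three visible words but not the hidden note.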

When I run this on one of my markdown files, I get the following result.

[Image: Pandoc converting markdown into plain text]

So how do we now count these words quickly?

We can pipe the output of pandoc into the wc command, all on one line.

pandoc --strip-comments -t plain {your-markdown}.md | wc -w

And it will print the number of words in that document!
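
If you run this often, it could be wrapped in a small shell function. This is only a sketch: count_md_words is a hypothetical name, and it assumes pandoc is installed.

```shell
# count_md_words is a hypothetical helper name, not part of pandoc or wc.
count_md_words() {
  file=$1
  if [ ! -f "$file" ]; then
    echo "not a file: $file" >&2
    return 1
  fi
  # Convert to plain text, count the words, and strip the padding some wc versions add.
  words=$(pandoc --strip-comments -t plain "$file" | wc -w | tr -d '[:space:]')
  echo "$words"
}

# Usage (assumes pandoc is installed and post.md exists):
# count_md_words post.md
```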

[Image: Result of counting all words in markdown file]

Pretty awesome! We now know how to count all words in a single markdown file.

Retrieving all words across all markdown files

Now that we know how it's done, the real question is, how many words did you write in total?

And to answer that, we must count all words across all markdown files.

And no, we don't want to run this command for each file and add each output.

So to make this work, we can leverage the find command to find all files that end in the .md extension.

find . -iname "*.md"

This will list all the markdown files in the current folder and every folder below it.
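
Note that -iname matches case-insensitively and also matches directories. The sketch below adds -type f to keep regular files only, and tries it on a throwaway folder. All file and folder names are made up for the demo:

```shell
# Build a throwaway tree; every name here is invented for the demo.
demo=$(mktemp -d)
mkdir -p "$demo/posts" "$demo/notes.md"       # a directory whose name ends in .md
printf 'one\n' > "$demo/posts/first.md"
printf 'two\n' > "$demo/posts/SECOND.MD"      # matched because -iname ignores case
printf 'three\n' > "$demo/posts/readme.txt"   # not matched

# -type f keeps regular files only, so the notes.md directory is skipped.
matches=$(find "$demo" -type f -iname "*.md" | wc -l | tr -d '[:space:]')
echo "$matches"

rm -rf "$demo"
```

Here only the two markdown files are counted; the directory and the text file are left out.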

We can combine the above two commands with this find command to count all words. (Be aware that it might take a while, depending on how many files you have.)

find . -iname "*.md" | xargs pandoc --strip-comments -t plain | wc -w
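
One caveat: xargs splits its input on whitespace, so a file name containing a space reaches pandoc as two separate arguments. A possible variant, assuming your find and xargs support -print0 and -0 (the GNU and BSD versions both do), separates the names with NUL bytes instead. The function name count_all_md_words is hypothetical:

```shell
# count_all_md_words is a hypothetical helper; the folder defaults to the current one.
count_all_md_words() {
  dir=${1:-.}
  # -type f: regular files only.
  # -print0 and -0: NUL-separated names, so spaces in file names are safe.
  total=$(find "$dir" -type f -iname "*.md" -print0 |
    xargs -0 pandoc --strip-comments -t plain |
    wc -w | tr -d '[:space:]')
  echo "$total"
}

# Usage (assumes pandoc is installed):
# count_all_md_words .
```

If xargs has to split a very long file list over several pandoc runs, all of their output still flows into the same wc, so the total stays correct.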

[Image: Result showing 416006 words written]

Wow, I already wrote 416006 words? That is just crazy stuff.

If you are anything like me, the question of how many books that would be popped up.

And a quick Google search shows: "The average word count for adult fiction is between 70,000 to 120,000 words."

Does this mean I wrote around four novels already?
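
As a rough check of that estimate, the shell can do the division itself. This sketch uses only the total above and the two bounds from the quote; the variable names are illustrative:

```shell
total=416006   # word count reported above
low=70000      # lower bound of the quoted range
high=120000    # upper bound of the quoted range

# Shell arithmetic is integer-only, so both results are rounded down.
long_novels=$((total / high))
short_novels=$((total / low))

echo "Between $long_novels and $short_novels novels"
```

So the total works out to a bit over three long novels or almost six short ones, which makes "around four" a fair middle guess.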

My mind is blown 🤯.

Thank you for reading, and let's connect!

Thank you for reading my blog. Feel free to subscribe to my email newsletter and connect on Facebook or Twitter.

Oldest comments (4)

GrahamTheDev

🤯 so you are officially a novelist? 😃❤️

Chris Bongers

I guess so 😂

johnblommers

The example fails when Markdown file names contain spaces. It also fails if there are LaTeX expressions in any of the files. May I present code that worked for me:

find . -iname "*.md" -printf '"%p" ' | \
    xargs pandoc \
          --strip-comments \
          -t markdown |\
    wc -w

Note with care the -printf action (a GNU find feature). Its purpose is to put double quotes around each filename, followed by a space to separate them.

The -t markdown keeps Pandoc from complaining about LaTeX expressions.

As an afterthought I wondered if adding the -type f command-line option to the find command might be cleaner, in case we had a directory name ending in .md.

Chris Bongers

Ah nice one!
I didn't have any LaTeX so didn't come across this, but makes total sense 👏