DEV Community

How to Count Words and Characters in JavaScript

The Dev Drawer on January 18, 2023

Most if not all developers have used some sort of character counter online to validate SEO, or to just see how many characters a string has. You ca...

lionel-rowe • Jan 19 '23

Why does Count need to be a class? I'd sort of understand if it was a Web Component, but it just references global state via document and window. Instantiating it with new is meaningless, as it doesn't encapsulate any of its own state.

You'd be better off just using a module instead. Better still, 2 modules — one for DOM manipulation and another for the counting logic, as they're quite different concerns (though including all the logic in the same file is fair enough when the entire app has <100 lines of JS, IMO).
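
A minimal sketch of that split, shown in one listing for brevity. The file names in the comments and the `attachWordCounter` helper are hypothetical, not taken from the article; the point is only that the counting logic never touches `document`, and the DOM wiring receives its elements as arguments instead of reaching for globals.

```javascript
// counting.js (hypothetical file name): pure logic, no DOM access
const countWords = (str) => {
    const value = str.trim()
    return value ? value.split(/\s+/).length : 0
}

// dom.js (hypothetical file name): DOM wiring only
// `input` needs a `value` and `addEventListener`; `output` needs `textContent`.
const attachWordCounter = (input, output) => {
    const update = () => {
        output.textContent = String(countWords(input.value))
    }
    input.addEventListener('input', update)
    update() // render the initial count
}
```

Because the wiring function takes its elements as parameters, the counting logic can be tested without a browser, and no `new` is needed anywhere.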

You might also want to look into improving the counting algorithms — for example:

export const countWords = (str) => {
    const value = str.trim()
    if (!value) return 0
    return value.split(/\s+/).length
}

export const countChars = (str) => {
    return str.length 
}

countChars('🚀') // 2 (expected 1)
countWords('web-development tutorial') // 2 (expected 3)

I'll leave this here 😉 Counting symbols in a JavaScript string — Mathias Bynens
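
For illustration, here is one possible way to get the expected results above. The names `countCodePoints` and `countWordsLoose` are made up for this sketch. Iterating a string with spread walks it by code point rather than by UTF-16 code unit, and matching runs of letters, numbers, and marks treats hyphens as word breaks.

```javascript
// Count Unicode code points instead of UTF-16 code units.
const countCodePoints = (str) => [...str].length

// Count runs of letters, numbers, and combining marks as words.
// Assumption: anything else (spaces, hyphens, punctuation) separates words.
const countWordsLoose = (str) =>
    (str.match(/[\p{L}\p{N}\p{M}]+/gu) ?? []).length

console.log('🚀'.length) // 2
console.log(countCodePoints('🚀')) // 1
console.log(countWordsLoose('web-development tutorial')) // 3
```

This is still an approximation. Emoji built from several code points count as more than 1, and the word rule also splits contractions such as "don't" at the apostrophe.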

The Dev Drawer • Jan 19 '23

Thanks for the tips. I wrote it as a class simply to show how it can be used in the tutorial or added as a module. I like OOP versions of code, so my preference bled over into the tutorial. I understand it is a small file whose functions could essentially be called directly, but I wanted to show it in an OOP way.

I understand what you are saying though, sometimes using the KISS method is better, but I wanted the script to not only showcase how to get the result but how to build it as part of a large thing, even if it is a simple script.

Also, I did not account for symbols or, as mentioned in the other comment, other languages in this tutorial. I was hoping to get a quick tutorial out for something I recently used. It may be a bit specific, but it was something I had to use recently as part of a larger project, so I wanted to share.

lionel-rowe • Jan 20 '23

I'd argue that instantiating a singleton class that doesn't encapsulate any state and instead accesses global state outside of itself isn't really OOP, despite the class and new keywords... though I guess it depends on your definition of OOP. Practically speaking, it's a module that for some reason needs to be instantiated. It'd make sense for such a thing to be a class in Java, because everything has to be a class in Java, even things that really shouldn't be... but JS has no such limitation.

BTW I hope my feedback doesn't come across as too negative — it's a nice-looking app, and this article has already inspired me in 2 ways in my own code. Firstly by reminding me that class-based encapsulation can be pretty damn useful (I usually opt for more of a mixed functional/imperative style), and secondly by Jon Randy alerting me to the existence of Intl.Segmenter in the other comment thread. Both turned out to be extremely useful for my current project of creating a locale-aware (encapsulating the locale data) term checker for translations.

The Dev Drawer • Jan 21 '23

Any feedback is good feedback for me, so you are good. I used this code as part of a larger project with many other JS classes so it was not really a singleton in how it was being used but I see your point. I just thought it was cool so I put something together quickly for the video using the basic aspects of what I was doing in my other project.

Also, I saw the other comment and the segmenter is something I was unaware of. I am glad other commenters helped show you something new as well. I will definitely be using it in the future.

lionel-rowe • Jan 28 '23

Sorry for the late reply. I meant to reply earlier, then forgot. Hooray for stale Chrome tabs! So, when I say "singleton", I mean specifically the Singleton design pattern. That doesn't mean the class doesn't coexist and interact with other classes; it simply means only 1 instance of that class is supposed to exist at any given time.
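
As a rough sketch of what the pattern means in JS terms (the `WordCounter` class and its members are hypothetical, not the article's `Count` class): the class hands out one shared instance, however many other classes exist alongside it.

```javascript
class WordCounter {
    static #instance = null // the single shared instance, created lazily

    // Hypothetical accessor: always returns the same object.
    static getInstance() {
        if (WordCounter.#instance === null) {
            WordCounter.#instance = new WordCounter()
        }
        return WordCounter.#instance
    }

    count(str) {
        const value = str.trim()
        return value ? value.split(/\s+/).length : 0
    }
}

const a = WordCounter.getInstance()
const b = WordCounter.getInstance()
console.log(a === b) // true
```

A plain module gives the same "only one of these" guarantee for free, which is why the class wrapper adds little when there is no per-instance state.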

Jon Randy 🎖️ • Jan 19 '23 • Edited

Interestingly, the regex metacharacter for whitespace, \s, does not match zero-width spaces, so the word count (using your code) for the Thai 'น้อยก็หนึ่ง' comes out as 1 when there are actually 3 words: 'น้อย', 'ก็', and 'หนึ่ง'. There are other languages that would have similar issues.
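
A quick way to see this, using a made-up sample string with explicit zero-width spaces (U+200B) so the invisible characters are visible in the source:

```javascript
const ZWSP = '\u200B' // zero-width space
const sample = `one${ZWSP}two${ZWSP}three` // hypothetical sample text

console.log(/\s/.test(ZWSP)) // false

// Splitting on \s+ alone sees one unbroken "word".
const countPlain = sample.split(/\s+/).length
console.log(countPlain) // 1

// Adding U+200B to the character class recovers the 3 words.
const countWithZwsp = sample.split(/[\s\u200B]+/).length
console.log(countWithZwsp) // 3
```

This only helps when the text actually contains zero-width spaces.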

A better way to do this is with Intl.Segmenter, which is language-aware:

const segmenterTh = new Intl.Segmenter('th', { granularity: 'word' })
const string1 = 'น้อยก็หนึ่ง'
const wordCount = [...segmenterTh.segment(string1)].length
console.log(wordCount)   // 3

One drawback here is that Intl.Segmenter is not yet supported on Firefox.
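
One way to cope with missing support is feature detection with a fallback. This is only a sketch: `countWordsLocaleAware` is a made-up name, and the fallback is the plain whitespace split from the article's approach, so it keeps that approach's limitations.

```javascript
const countWordsLocaleAware = (str, locale = 'en') => {
    // Use Intl.Segmenter where the runtime provides it.
    if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
        const segmenter = new Intl.Segmenter(locale, { granularity: 'word' })
        let count = 0
        for (const part of segmenter.segment(str)) {
            if (part.isWordLike) count++ // skip spaces and punctuation
        }
        return count
    }
    // Fallback: whitespace-delimited words only.
    const value = str.trim()
    return value ? value.split(/\s+/).length : 0
}

console.log(countWordsLocaleAware('hello, world')) // 2
```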

lionel-rowe • Jan 19 '23

Wait, Thai is delimited with zero-width spaces? Is that a standard thing that gets done automatically with common input methods? If so, that makes it much easier to implement an approximate word-counting algorithm that works cross-linguistically. CJK is still a problem, but counting each Script=Han character as 1 "word" is usually an acceptable alternative (MS Word does that, for example). Not sure about kana, though.
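
Something like the following is what I have in mind. It is a rough sketch, not a claim about how MS Word actually counts, and `countWordsApprox` is a made-up name. Each Han character counts as 1 word; everything else is counted as runs of letters, numbers, and marks.

```javascript
const countWordsApprox = (str) => {
    // Each Han character counts as its own "word".
    const hanChars = str.match(/\p{Script=Han}/gu) ?? []
    // Blank out the Han characters, then count the remaining runs.
    const rest = str.replace(/\p{Script=Han}/gu, ' ')
    const otherWords = rest.match(/[\p{L}\p{N}\p{M}]+/gu) ?? []
    return hanChars.length + otherWords.length
}

console.log(countWordsApprox('hello 世界')) // 3
```

Zero-width spaces fall outside the letter, number, and mark classes, so text delimited with them is split correctly without any special handling.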

Jon Randy 🎖️ • Jan 19 '23

If it is done correctly, then the zero width spaces are there (not sure how input methods/apps handle this) - in reality however, people don't usually bother putting any spaces in (except between sentences, which is normal). There are tools around to automatically add the zero-width spaces though - but I imagine writing those would be no fun.

Jon Randy 🎖️ • Jan 19 '23

Just did some quick googling on Thai input methods - apparently one spacebar hit for zero-width, and two hits for real space is common.

lionel-rowe • Jan 19 '23 • Edited

Yeah if I go to thai.tourismthailand.org/Home, grab the first longish span of text, and split on /[^\p{L}\p{N}\p{M}]+/u, it gives me

Thai "word"	MTed English
ททท	TAT
เปิดตัวโครงการ	project launch
365	365
วัน	day
มหัศจรรย์เมืองไทยเที่ยวได้ทุกวัน	Amazing Thailand, you can travel every day.
ชวนผู้ประกอบการธุรกิจท่องเที่ยวเสนอดีลพิเศษผ่าน	inviting travel business operators to offer special deals through
LAZADA	LAZADA
ร่วมสร้างตำนานการท่องเที่ยวไทยครั้งใหม่ตลอดปี	Join to create a new legend of Thai tourism throughout the year.
2566	2023

Guessing "Join to create a new legend of Thai tourism throughout the year" isn't considered a single word in Thai 😅

Edit: Intl.Segmenter seems to give much better results, though.

[...new Intl.Segmenter('th', { granularity: 'word' }).segment(str)]
    .filter(x => x.isWordLike)
    .map(x => x.segment)
    .slice(0, 10)

Results:

Thai "word"	MTed English
ททท	TAT
เปิด	open
ตัว	one
โครงการ	project
365	365
วัน	day
มหัศจรรย์	amazing
เมือง	city
ไทย	Thai
เที่ยว	travel

Would love to know how it works for Thai, even without ZWSPs as cues.

Jon Randy 🎖️ • Jan 19 '23

If memory serves from the Thai lessons I've had, there are many rules about how words can begin and end (what letters can be used in what order etc.) - these would probably get you a lot of the way there.

lionel-rowe • Jan 19 '23 • Edited

It might just be some sort of massive dictionary lookup, as it even does a decent approximation for Chinese, which has no such rules:

const str = `聚苯乙烯塑料、聚苯乙……、……乙烯塑料`
;[...new Intl.Segmenter('zh', { granularity: 'word' }).segment(str)].filter(x => x.isWordLike).map(x => x.segment)
// ['聚苯乙烯', '塑料', '聚', '苯', '乙', '乙烯', '塑料']

Kamonwan Achjanis • Aug 4 '23

Don't forget to filter by the isWordLike property of each segment:
dev.to/kamonwan/the-right-way-to-b...
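
A small illustration of why the filter matters, assuming a runtime that supports Intl.Segmenter. The sample string is made up. With word granularity, spaces and punctuation come back as segments too, so counting every segment overstates the word count.

```javascript
const segmenter = new Intl.Segmenter('en', { granularity: 'word' })
const segments = [...segmenter.segment('Hello, world!')]

// Every segment, including the comma, the space, and the exclamation mark.
const allCount = segments.length

// Only the segments the segmenter considers words.
const wordCount = segments.filter((s) => s.isWordLike).length

console.log(allCount > wordCount) // true
console.log(wordCount) // 2
```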