We tend to think that a string in JavaScript is an array of characters.
const name = 'Nick'
console.log(name.length) // 4
The variable name has 4 characters: 'N', 'i', 'c', 'k', and its length is also 4.
Everything seems logical.
Let’s go further and add an emoji to my name.
const name = 'Nick 🐃'
console.log(name.length) // 7
Hmm, strange.
The variable name should have 6 characters: 'N', 'i', 'c', 'k', ' ' (whitespace), and '🐃'.
But it has 7.
It seems like the bull has 2 characters.
const emoji = '🐃'
console.log(emoji.length) // 2
Interesting 🤔
Let’s figure out why.
Let’s open the official ECMAScript specification (the language standard that JavaScript implements).
Scroll to “6.1.4 The String Type.”
And find this:
“The String type is the set of all ordered sequences of zero or more 16-bit unsigned integer values (“elements”) up to a maximum length of 2⁵³ - 1 elements. The String type is generally used to represent textual data in a running ECMAScript program, in which case each element in the String is treated as a UTF-16 code unit value.”
So a string in JavaScript is a sequence of UTF-16 code unit values.
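We can peek at these code units with String.prototype.charCodeAt, which returns the code unit at a given index as a number. A quick sketch:
const name = 'Nick'
for (let i = 0; i < name.length; i++) {
  console.log(name.charCodeAt(i)) // 78, 105, 99, 107
}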
❓What is UTF-16?
💬 A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point to a unique byte sequence.
One UTF-16 code unit value is a number from 0x0000 to 0xFFFF.
❓What are 0x0000 and 0xFFFF?
💬 The 0x prefix marks the hexadecimal numeral system, often shortened to "hex": a numeral system made up of 16 symbols (base 16). The standard numeral system is called decimal (base 10) and uses ten symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. Hexadecimal uses the ten decimal digits plus six extra symbols.
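Hex and decimal are just two notations for the same numbers, which is easy to check in the console:
console.log(0x0000) // 0
console.log(0xFFFF) // 65535
console.log((65535).toString(16)) // 'ffff'
console.log(parseInt('ffff', 16)) // 65535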
If we convert my name Nick to UTF-16 (the way JavaScript sees it), we get 0x004e 0x0069 0x0063 0x006b.
0x004e = N
0x0069 = i
0x0063 = c
0x006b = k
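We can verify this mapping ourselves. A small sketch that prints each code unit in hex (padStart just pads the output to four digits):
const name = 'Nick'
for (const char of name) {
  console.log(char, '0x' + char.charCodeAt(0).toString(16).padStart(4, '0'))
}
// N 0x004e
// i 0x0069
// c 0x0063
// k 0x006b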
But how does JavaScript treat emojis?
In UTF-16, Unicode characters from the Basic Multilingual Plane (which contains characters for almost all modern languages) are encoded with one code unit.
Characters outside the Basic Multilingual Plane (emojis, musical notation, playing cards, some hieroglyphs, etc.) require two code units, known as a surrogate pair.
So UTF-16 represents the 🐃 emoji with two code units (0xd83d 0xdc03).
That’s why '🐃'.length gives 2.
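We can confirm this in the console: charCodeAt reads the two code units of the surrogate pair one by one, while codePointAt reads them together as a single code point:
const emoji = '🐃'
console.log(emoji.charCodeAt(0).toString(16)) // 'd83d' (high surrogate)
console.log(emoji.charCodeAt(1).toString(16)) // 'dc03' (low surrogate)
console.log(emoji.codePointAt(0).toString(16)) // '1f403' (the full code point)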
To consolidate everything we have learned, let’s play a little with Unicode and JavaScript.
const name = 'Nick'
const nameInUnicode = '\u004e\u0069\u0063\u006b'
console.log(name === nameInUnicode) // true
console.log(nameInUnicode.length) // 4
const fullName = 'Nick 🐃'
const fullNameInUnicode = '\u004e\u0069\u0063\u006b\u0020\ud83d\udc03'
console.log(fullName === fullNameInUnicode) // true
console.log(fullNameInUnicode.length) // 7
❓ What is \u?
💬 A Unicode escape sequence represents a single UTF-16 code unit formed by the four hexadecimal digits following \u.
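For characters outside the BMP, writing two escapes for one character is clumsy, so ES2015 also added a code point escape, \u{...}, that takes the whole code point at once:
console.log('\u{1f403}' === '🐃') // true
console.log('\u{1f403}' === '\ud83d\udc03') // true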
In the end
Knowing that a string in JavaScript is a sequence of UTF-16 code unit values can save you from unpredictable bugs when you work with characters outside the BMP, like emojis.
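One practical trick: .length counts code units, but the string iterator walks code points, so spreading a string (or using Array.from) gives a count closer to what a human would expect:
const fullName = 'Nick 🐃'
console.log(fullName.length) // 7
console.log([...fullName].length) // 6
console.log(Array.from(fullName).length) // 6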
If you like this article, share it with your friends and follow me on Twitter.
Also, every week I send out a "3–2–1" newsletter with 3 tech news, 2 articles, and 1 piece of advice for you.