DEV Community

Discussion on: How to caculate emoji length?

Clay Murray

As to why this happens: the web is typically encoded in UTF-8 (the vast majority of websites use it), which is a way to encode characters. Basically, encoding means assigning a number to each letter. You could encode "A" as 1, "B" as 2, and so on. Back in the day there was a thing created called ASCII (en.wikipedia.org/wiki/ASCII), but it only used 7 bits to encode characters, which means a maximum of 128 characters. That's fine for English, but what if you need more? So a bunch of different text encodings appeared, lots and lots of them, and some were incompatible with ASCII. Eventually the web settled on UTF-8 as its way of turning letters into numbers. UTF-8 is interesting because it uses a variable number of bits to represent a character: 8, 16, 24, or 32 (32 is the current maximum). Every ASCII character is encoded in 8 bits exactly as ASCII would encode it, which makes UTF-8 compatible with old ASCII, while the longer forms allow for a huge number of different characters and languages. The scheme looks at the first 8 bits, and if they fall in a certain range it looks at the next 8 bits, and so on until it can make a character.
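To make the variable-width idea concrete, here is a minimal sketch that counts how many bytes a few characters take in UTF-8. It relies on the standard `TextEncoder` global (available in browsers and Node.js); the helper name `utf8ByteLength` is made up for this example.

```javascript
// TextEncoder always produces UTF-8 bytes.
const utf8 = new TextEncoder();

// Hypothetical helper: number of bytes a string occupies in UTF-8.
function utf8ByteLength(text) {
  return utf8.encode(text).length;
}

console.log(utf8ByteLength("A"));         // 1 (same single byte as ASCII)
console.log(utf8ByteLength("\u00E9"));    // 2 (e with acute accent)
console.log(utf8ByteLength("\u20AC"));    // 3 (euro sign)
console.log(utf8ByteLength("\u{1F308}")); // 4 (rainbow emoji)
```

Multiply by 8 to get the bit counts mentioned above: 8, 16, 24, and 32.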
Well, to put a wrinkle in it: although web pages and code are typically in UTF-8, JavaScript treats strings as UTF-16. UTF-16 is like UTF-8, but instead of a minimum size of 8 bits it uses 16 or 32 bits to represent a character. So when you ask JavaScript how long a string is, it tells you how many 16-bit chunks (code units) the string contains. BUT some characters (including most emoji) are encoded as two 16-bit chunks, so JavaScript will tell you that their length is 2.
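A quick sketch of what that looks like in practice. The helper names `codeUnits` and `countCodePoints` are made up for this example; `length`, `charCodeAt`, and string iteration are standard JavaScript.

```javascript
// Hypothetical helper: list the 16-bit chunks of a string as hex.
function codeUnits(text) {
  const units = [];
  for (let i = 0; i < text.length; i++) {
    units.push(text.charCodeAt(i).toString(16));
  }
  return units;
}

// Hypothetical helper: count whole characters (code points) instead of chunks.
// Spreading a string iterates by code point, not by 16-bit chunk.
function countCodePoints(text) {
  return [...text].length;
}

const rainbow = "\u{1F308}"; // the rainbow emoji, written as an escape

console.log("A".length);               // 1
console.log(rainbow.length);           // 2
console.log(codeUnits(rainbow));       // [ 'd83c', 'df08' ]
console.log(countCodePoints(rainbow)); // 1
```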

So that's part 1. Part 2 is emoji. Emoji are interesting: what you see on screen is not necessarily the full truth. Emoji can be joined together. For instance, the pride flag 🏳️‍🌈 is ACTUALLY a white flag 🏳 and a rainbow 🌈 mashed together with an invisible character (the zero-width joiner, U+200D) that says "hey, mash these two together". So on systems that don't know about the pride flag you just get 🏳 🌈. What does that tell us about length? Well, 🏳 is 2 and 🌈 is 2, and 🏳️‍🌈 is 6: that's 2 + 2, plus 1 for the invisible "mash these two together" character, plus 1 for a second invisible character (the variation selector, U+FE0F) that tells the white flag to render as an emoji. So what is it about 👨‍👨‍👧‍👦 that returns 11? It's a super mashup emoji: 👨 and 👨 and 👧 and 👦 all put together with the "mash these two together" character. That actually makes a huge variety of family emoji possible, because we are just combining existing ones. So why 11? Each of the four people is length 2, which gives 8, and each of the three "mash together" characters is length 1, so 8 + 3 = 11. (That's 176 bits in UTF-16 for just that one emoji, compared to 16 for the letter A, or 8 for A in UTF-8!)
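Here is a rough sketch for checking that arithmetic yourself and for getting the "what you see on screen" count. The emoji are written as escapes so the invisible characters are visible in the source. The helper names `describe` and `countGraphemes` are made up for this example. `Intl.Segmenter` is a standard API, but it needs a reasonably recent browser or Node.js 16+, and how sequences are grouped depends on the Unicode data shipped with the runtime.

```javascript
// White flag + variation selector + zero-width joiner + rainbow
const prideFlag = "\u{1F3F3}\uFE0F\u200D\u{1F308}";
// Man + ZWJ + man + ZWJ + girl + ZWJ + boy
const family = "\u{1F468}\u200D\u{1F468}\u200D\u{1F467}\u200D\u{1F466}";

// Hypothetical helper: list each code point and how many 16-bit chunks it uses.
function describe(text) {
  return [...text].map((ch) => ({
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase(),
    units: ch.length,
  }));
}

// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

console.log(prideFlag.length); // 6
console.log(family.length);    // 11

// Each person is 2 chunks and each joiner (U+200D) is 1: 2+1+2+1+2+1+2 = 11
console.log(describe(family));

// On a runtime with current Unicode data, each sequence counts as one emoji.
console.log(countGraphemes(prideFlag));
console.log(countGraphemes(family));
```

So if the goal is "how many emoji does the user see", something like `countGraphemes` is probably closer to what you want than `.length`.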