DEV Community

Converting UTF (including emoji) to HTML 🤯

Nikki Massaro Kauffman
Uses #WebComponents, a11y, bubble gum, duct tape, wit, and telekinesis to reach more people.

Sometimes my coworker likes to mention things just to get my mind stuck on them. Take the text from this request:

Because of some limitations both in UTF-8 and mysql (less a concern for us now but still..) it would probably be good to have some kind of simple-emoji type of tag. Similar to how we have simple-icon, a simple-icon could be used to provide minor tweaks / accounting for emojis in a consistent way.

So last night I worked on translating UTF (including emoji) into their HTML entities.

Basic Unicode to HTML Entity Conversion

I started with an adapted version of this conversion logic to convert any character that is not one of the 128 ASCII characters (codes 0 through 127):

function utf2Html(str) {
  let result = '', 

    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,

    //converts a character into an HTML entity 
    char2Html = (char) => {
      //ASCII character or html entity from character code
      return char.charCodeAt() > 127 ? decimal2Html(char.charCodeAt()) : char;
    };

  //check each character
  [...str].forEach(char=>{
    result += char2Html(char);
  });

  return result;
}

If we want to check this function (quite literally, by dropping a UTF-8 checkmark ✓ into it), its character code 10003 is the same as its Unicode value, so it can be used to generate the correct HTML entity: &#10003;
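A quick way to confirm this in a browser console or Node.js is sketched below. The variable names are only for illustration and are not part of the function above.

```javascript
// Hypothetical names, for illustration only.
const check = '✓';

// charCodeAt(0) returns the UTF-16 code unit at index 0.
const codeUnit = check.charCodeAt(0);

// For a character like this one, the code unit and the Unicode
// code point are the same number.
const codePoint = check.codePointAt(0);

// Build the decimal HTML entity from the code unit.
const entity = `&#${codeUnit};`;

console.log(codeUnit, codePoint, entity); // 10003 10003 &#10003;
```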

The Problem with Emoji Conversion

While the function above works on UTF-8 special characters, it won't work on all of the emoji we have available today. I found a really good explanation for this in a post called Unicode in Javascript.

Take the 🤯 emoji, for example.

The character code for this emoji is 55358, so the entity returned by the function above would be &#55358;, which does not work: it renders as �.

The Unicode value for 🤯 is actually 129327 (or 0001 1111 1001 0010 1111 in binary). In order to express this character in 16-bit form, it is split into a surrogate pair of 16-bit units, written in string form as \uD83E\uDD2F (according to this handy Surrogate Pair Calculator), which renders as 🤯.
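To see the mismatch directly, the sketch below compares the two code units with the full code point, using only built-in String methods. The variable names are illustrative.

```javascript
const emoji = '🤯';

// The two UTF-16 code units that make up the surrogate pair.
const high = emoji.charCodeAt(0); // 55358 (0xD83E)
const low = emoji.charCodeAt(1);  // 56623 (0xDD2F)

// The full Unicode code point.
const codePoint = emoji.codePointAt(0); // 129327 (0x1F92F)

// Rebuilding the emoji from its surrogate pair or from its code point
// gives the same string.
const fromPair = String.fromCharCode(high, low);
const fromCodePoint = String.fromCodePoint(codePoint);

console.log(high.toString(16), low.toString(16)); // d83e dd2f
console.log(fromPair === emoji, fromCodePoint === emoji); // true true
```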

So in order to get the correct value, we need to know:

  • if a character is one of these surrogate pair emojis, and
  • how to calculate a surrogate pair's value.

Determining if an Emoji is a Surrogate Pair

The JavaScript string length counts 16-bit code units rather than characters. Letters and most symbols have a length of 1, but an emoji stored as a surrogate pair has a length of 2.

JavaScript | Result
't'.length | 1
'✓'.length | 1
'🤯'.length | 2

But if I use the spread operator (...), the string is split by code point instead, so each character, including my emoji, comes out as a single item.

JavaScript | Result
[...'t'].length | 1
[...'✓'].length | 1
[...'🤯'].length | 1
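These lengths are easy to check in a console. The sketch below relies only on standard JavaScript behavior: `.length` counts UTF-16 code units, while the spread operator iterates by code point. `isSurrogatePair` is a hypothetical helper name used for illustration.

```javascript
const samples = ['t', '✓', '🤯'];

for (const s of samples) {
  // s.length: number of UTF-16 code units
  // [...s].length: number of code points
  console.log(s, s.length, [...s].length);
}
// t 1 1
// ✓ 1 1
// 🤯 2 1

// Hypothetical helper: true when a single character (one code point)
// is stored as two UTF-16 code units.
const isSurrogatePair = (char) => [...char].length === 1 && char.length === 2;

console.log(isSurrogatePair('🤯')); // true
console.log(isSurrogatePair('✓')); // false
```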

That means that once a string has been split with the spread operator, I can tell which characters are surrogate pairs if char.length > 1:

function utf2Html(str) {
  let result = '',

    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,

    //converts a character into an HTML entity
    char2Html = (char) => {

      //each char from [...str] is one code point; a length of 2
      //means it is stored as a surrogate pair
      if(char.length > 1) {
        //TODO calculate a surrogate pair's value
      }

      //ASCII character or html entity from character code
      return char.charCodeAt() > 127 ? decimal2Html(char.charCodeAt()) : char;
    };

  //check each character
  [...str].forEach(char=>{
    result += char2Html(char);
  });

  return result;
}

Notice I left a //TODO comment about calculating the pair. We'll tackle that next...

Calculating a Surrogate Pair's Unicode Value

I couldn't find a good post for converting a surrogate pair to its Unicode value, so instead I followed these steps for converting from Unicode to surrogate pairs in reverse:

# | Step | 🤯 Example
1 | Get the value of each part of the pair. | 55358 / 56623
2 | Convert each value to a binary number. | 1101100000111110 / 1101110100101111
3 | Take the last 10 digits of each number. | 0000111110 / 0100101111
4 | Concatenate the two binary numbers into a single 20-bit binary number. | 00001111100100101111
5 | Convert the 20-bit number to a decimal number. | 63791
6 | Add 0x10000 to the new number. | 129327
The Completed UTF (Including Emoji) to HTML Function

function utf2Html(str) {
  let result = '',
    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,
    //converts a character into an HTML entity
    char2Html = (char) => {

      //each char from [...str] is one code point; a length of 2
      //means it is stored as a surrogate pair
      if(char.length > 1) {

        //handle and convert utf surrogate pairs
        let concat = '';

        //for each part of the pair
        for(let i = 0; i < 2; i++){

          //get the character code value of this half of the pair
          let dec = char.charCodeAt(i),
            //convert to binary
            bin = dec.toString(2),
            //take the last 10 bits
            last10 = bin.slice(-10);

          //concatenate into 20 bit binary
          concat = concat + last10;
        }

        //add 0x10000 to get unicode value
        let unicode = parseInt(concat, 2) + 0x10000;

        //html entity from unicode value
        return decimal2Html(unicode);
      }

      //ASCII character or html entity from character code
      return char.charCodeAt() > 127 ? decimal2Html(char.charCodeAt()) : char;
    };

  //check each character
  [...str].forEach(char=>{
    result += char2Html(char);
  });

  return result;
}

Update

Thanks to a comment by LUKE知る, I have an even simpler way to do this:

export function utf2Html(str) {
  return [...str].map((char) => char.codePointAt() > 127 ? `&#${char.codePointAt()};` : char).join('');
}
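
A short usage sketch for the simplified version, written without the `export` keyword so it can be pasted into a console. The sample strings are arbitrary.

```javascript
function utf2Html(str) {
  return [...str]
    .map((char) => (char.codePointAt(0) > 127 ? `&#${char.codePointAt(0)};` : char))
    .join('');
}

// ASCII text passes through unchanged.
console.log(utf2Html('abc'));    // abc

// A symbol that fits in one code unit.
console.log(utf2Html('Done ✓')); // Done &#10003;

// An emoji stored as a surrogate pair.
console.log(utf2Html('Wow 🤯')); // Wow &#129327;
```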

Mind blown meme: Problems Saving Unicode, Convert Symbols to HTML, Many Emoji are Surrogate Pairs, Convert Symbols & Emoji to HTML

Discussion (6)

LUKESHIRU

Really interesting post. I tried the utf2Html function with the example emoji 🤯 and it returns &#55358;, when based on your table it should return &#129327;. This is first because [...char] always returns an array with a single value, and second because you're using String.prototype.charCodeAt instead of the better String.prototype.codePointAt. I believe this might be a simpler approach:

/** @param {string} string */
const htmlEscape = string =>
    [...string]
        .map(char => {
            const code = char.codePointAt(0);
            return code > 127 ? `&#${code};` : char;
        })
        .join("");

htmlEscape("💃🏻"); // "&#128131;&#127995;"
htmlEscape("🤯"); // "&#129327;"

Cheers!

Nikki Massaro Kauffman Author • Edited

Thanks for the reply! It looks like there are limitations on how the character is passed to the function and used in the spread operator. I corrected my original post. Also I wasn't aware of String.prototype.codePointAt.

I guess the only limitation with either of our approaches is support of IE11, since neither String.prototype.codePointAt nor the spread operator would work.

So for IE11, we'd need to use String.prototype.charCodeAt, but we'd probably have to test with something like !!char[1].

That said, I'm probably going to switch to String.prototype.codePointAt in my original use case, so again, thanks!

LUKESHIRU

Yup, I didn't check IE11, mainly because Microsoft already moved to "Edge", and if I remember correctly the official support for IE11 will end in June of next year. Nowadays I mainly check Chrome, Firefox and, sometimes, Safari (which is the worst of the bunch).

Nikki Massaro Kauffman Author

Yeah. Safari is definitely a thorn in my side.

referenz

Why convert Unicode characters into HTML entities at all?

Nikki Massaro Kauffman Author • Edited

Good question (thanks for asking). I work on front end open source web components. If I don't have control over the backend, I can't be sure that UTF-8 is supported.