Nikki Massaro Kauffman

Posted on Jul 30, 2021 • Edited on Sep 8, 2022

Converting UTF (including emoji) to HTML

#javascript #unicode #emoji #utf8

Sometimes my coworker likes to mention things just to get my mind stuck on them. Take the text from this request:

Because of some limitations both in UTF-8 and mysql (less a concern for us now but still..) it would probably be good to have some kind of simple-emoji type of tag. Similar to how we have simple-icon, a simple-icon could be used to provide minor tweaks / accounting for emojis in a consistent way.

So last night I worked on translating UTF (including emoji) into their HTML entities.

Basic Unicode to HTML Entity Conversion

I started with an adapted version of this conversion logic to convert any character that is not one of the 128 ASCII characters (character codes 0 through 127):

function utf2Html(str){
  let result = '', 

    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,

    //converts a character into an HTML entity 
    char2Html = (char) => {
      //ASCII character or html entity from character code
      return char.charCodeAt() > 127 ? decimal2Html(char.charCodeAt()) : char;
    };

  //check each character
  [...str].forEach(char=>{
    result += char2Html(char);
  });

  return result;
}

If we want to check this function (quite literally, by dropping a Unicode check mark ✓ into it), its character code 10003 is the same as its Unicode value, so it can be used to generate the correct HTML entity &#10003;.
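As a quick sanity check, here is a minimal sketch of that test. The helper name `toEntity` is made up for this example; it only repeats the `char2Html` logic from the function above.

```javascript
// Hypothetical helper: same logic as char2Html in the function above.
const toEntity = (char) =>
  char.charCodeAt(0) > 127 ? `&#${char.charCodeAt(0)};` : char;

console.log('✓'.charCodeAt(0)); // 10003
console.log(toEntity('✓'));     // &#10003;
console.log(toEntity('t'));     // t (ASCII is left alone)
```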

The Problem with Emoji Conversion

While the function above works on special characters whose Unicode value fits in a single 16-bit unit, it won't work on all of the emoji we have available today. I found a really good explanation for this in a post called Unicode in Javascript.

Take the 🤯 emoji, for example.

The character code returned for this emoji is 55358, so the entity returned by the function above would be &#55358;, which does not work.

The Unicode value for 🤯 is actually 129327 (or 0001 1111 1001 0010 1111 in binary). In order to express this character in its 16-bit form, it is split into a surrogate pair of 16-bit units, written in string form as \uD83E\uDD2F (according to this handy Surrogate Pair Calculator), which renders as 🤯.
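The numbers above can be checked in a console. This is only a sketch of the inspection, using `charCodeAt` and `toString`; the variable names are my own.

```javascript
const emoji = '🤯';

// charCodeAt reads one 16-bit code unit at a time.
const high = emoji.charCodeAt(0); // first half of the surrogate pair
const low = emoji.charCodeAt(1);  // second half of the surrogate pair

console.log(high, high.toString(16)); // 55358 d83e
console.log(low, low.toString(16));   // 56623 dd2f

// The escaped surrogate pair is the same string as the emoji itself.
console.log(emoji === '\uD83E\uDD2F'); // true

// The real Unicode value, padded to 20 binary digits for comparison.
const binary = (129327).toString(2).padStart(20, '0');
console.log(binary); // 00011111100100101111
```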

So in order to get the correct value, we need to know:

if a character is one of these surrogate pair emojis, and
how to calculate a surrogate pair's value.

Determining if an Emoji is a Surrogate Pair

The JavaScript string length counts 16-bit code units, not characters.
It is 1 for ordinary characters and symbols, but 2 for an emoji stored as a surrogate pair.

JavaScript	Result
`'t'.length`	1
`'✓'.length`	1
`'🤯'.length`	2

But if I use the spread operator (...), the string is split by Unicode value instead, so my emoji stays in one piece.

JavaScript	Result
`[...'t'].length`	1
`[...'✓'].length`	1
`[...'🤯'].length`	1

That means that once I split a string with the spread operator, I can tell which characters are surrogate pairs if char.length > 1:

function utf2Html(str){
  let result = '', 

    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,

    //converts a character into an HTML entity 
    char2Html = (char) => {
      let item = `${char}`;

      //a character taken from [...str] has 2 code units if it is a surrogate pair 
      if(item.length > 1) {
        //TODO calculate a surrogate pair's value
      }

      //ASCII character or html entity from character code
      return char.charCodeAt() > 127 ? decimal2Html(char.charCodeAt()) : char;
    };

  //check each character
  [...str].forEach(char=>{
    result += char2Html(char);
  });

  return result;
}

Notice I left a //TODO comment about calculating the pair. We'll tackle that next...
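Before filling in the TODO, the detection step can be checked on its own. This sketch assumes, as in the function above, that each character comes from spreading the string; the `isSurrogatePair` name is hypothetical.

```javascript
// Hypothetical helper: true when a string holds exactly one Unicode value
// that is stored as two UTF-16 code units (a surrogate pair).
const isSurrogatePair = (char) => [...char].length === 1 && char.length === 2;

for (const char of [...'t✓🤯']) {
  console.log(char, char.length, isSurrogatePair(char));
}
// t 1 false
// ✓ 1 false
// 🤯 2 true
```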

Calculating a Surrogate Pair's Unicode Value

I couldn't find a good post on converting a surrogate pair to its Unicode value, so instead I followed these steps for converting from Unicode to surrogate pairs in reverse:

#	Step	🤯 Example
1	Get the value of each part of the pair.	55358 / 56623
2	Convert each value to a binary number.	1101100000111110 / 1101110100101111
3	Take the last 10 digits of each number.	0000111110 / 0100101111
4	Concatenate the two binary numbers into a single 20-bit binary number.	00001111100100101111
5	Convert 20-bit number to a decimal number.	63791
6	Add 0x10000 to the new number.	129327
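The same six steps can also be sketched with bitwise arithmetic instead of binary strings. This is a swapped-in technique, not the approach the post uses, and the function name `surrogatePairToCodePoint` is made up for the example.

```javascript
// Hypothetical helper: combines the two halves of a surrogate pair
// into the Unicode value, following the steps in the table above.
function surrogatePairToCodePoint(high, low) {
  // Steps 2-3: keep only the last 10 bits of each half.
  const highBits = high & 0x3ff; // 62 (0000111110) for 55358
  const lowBits = low & 0x3ff;   // 303 (0100101111) for 56623
  // Steps 4-5: shift the high bits left by 10 places and append the low bits.
  const joined = highBits * 0x400 + lowBits; // 63791
  // Step 6: add 0x10000.
  return joined + 0x10000;
}

console.log(surrogatePairToCodePoint(55358, 56623)); // 129327
```

Multiplying by 0x400 (1024) has the same effect as concatenating the two 10-bit strings, because it moves the high bits ten binary places to the left.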

The Completed UTF (Including Emoji) to HTML Function

function utf2Html(str){
  let result = '', 
    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,
    //converts a character into an HTML entity 
    char2Html = (char) => {
      let item = `${char}`;

      //a character taken from [...str] has 2 code units if it is a surrogate pair 
      if(item.length > 1) {

        //handle and convert utf surrogate pairs
        let concat = '';

        //for each part of the pair
        for(let i = 0; i < 2; i++){

          //get the character code value 
          let dec = item.charCodeAt(i),
            //convert to binary 
            bin = dec.toString(2),
            //take the last 10 bits
            last10 = bin.slice(-10);
          //concatenate into 20 bit binary
          concat = concat + last10;
        }

        //add 0x10000 to get unicode value
        let unicode = parseInt(concat,2) + 0x10000;

        //html entity from unicode value
        return decimal2Html(unicode); 
      }

      //ASCII character or html entity from character code
      return char.charCodeAt() > 127 ? decimal2Html(char.charCodeAt()) : char;
    };

  //check each character
  [...str].forEach(char=>{
    result += char2Html(char);
  });

  return result;
}

Update

Thanks to a comment by LUKE知る, I have an even simpler way to do this:

export function utf2Html(str) {
  return [...str].map((char) => char.codePointAt() > 127 ? `&#${char.codePointAt()};` : char).join('');
}
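To try the simplified version as a plain script, here is a sketch with the same logic minus the `export` keyword (an adjustment of mine so it runs outside a module), followed by a sample call.

```javascript
// Same logic as the update above, without `export`.
function utf2Html(str) {
  return [...str]
    .map((char) => (char.codePointAt(0) > 127 ? `&#${char.codePointAt(0)};` : char))
    .join('');
}

console.log(utf2Html('t ✓ 🤯')); // t &#10003; &#129327;
```

Because `codePointAt` already returns the full Unicode value for a surrogate pair, no separate detection or bit arithmetic is needed.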

Top comments (7)

Nikki Massaro Kauffman • Aug 2 '21 • Edited

Thanks for the reply! It looks like there are limitations on how the character is passed to the function and used in the spread operator. I corrected my original post. Also I wasn't aware of String.prototype.codePointAt.

I guess the only limitation with either of our approaches is support of IE11, since neither String.prototype.codePointAt nor the spread operator would work.

So for IE11, we'd need String.prototype.charCodeAt, but we'd probably have to test with something like !!char[1].

That said, I'm probably going to switch to String.prototype.codePointAt in my original use case, so again, thanks!

Rash Edmund Jr • Jun 24 '23 • Edited

hey guys, i'm fetching data from an API and it returns text that has these character entities in them, i want to parse them to their actual equivalence.

like:
& q u o t ; = '"'
& l t ; = '<' and so on. i cant have a the dirty html missed up and displayed for the user

Kolemjdouci • Dec 18 '21

how this htmlentity to raw emoji, back?

Nikki Massaro Kauffman • Aug 2 '21

Yeah. Safari is definitely a thorn in my side.

referenz • Aug 1 '21

Why convert Unicode characters into HTML entities at all?

Nikki Massaro Kauffman • Aug 2 '21 • Edited

Good question (thanks for asking). I work on front end open source web components. If I don't have control over the backend, I can't be sure that UTF-8 is supported.

harry • Oct 29 '22

awsome - does exactly what I want!!! well done!
