Sometimes my coworker likes to mention things just to get my mind stuck on them. Take the text from this request:
Because of some limitations both in UTF-8 and mysql (less a concern for us now but still..) it would probably be good to have some kind of simple-emoji type of tag. Similar to how we have simple-icon, a simple-emoji could be used to provide minor tweaks / accounting for emojis in a consistent way.
So last night I worked on translating UTF (including emoji) into their HTML entities.
Basic Unicode to HTML Entity Conversion
I started with an adapted version of this conversion logic to convert any character that is not one of the 128 ASCII characters (character codes 0 to 127):
function utf2Html(str) {
  let result = '',
    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,
    //converts a character into an HTML entity
    char2Html = (char) => {
      //ASCII character or html entity from character code
      return char.charCodeAt(0) > 127 ? decimal2Html(char.charCodeAt(0)) : char;
    };
  //check each character
  [...str].forEach((char) => {
    result += char2Html(char);
  });
  return result;
}
If we want to check this function (quite literally, by dropping a UTF-8 checkmark ✓ into the function), its character code 10003 is the same as its Unicode value, so it can be used to generate the correct HTML entity &#10003;
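As a quick sanity check, here is a minimal sketch showing that the character code and the Unicode code point agree for the checkmark. The variable names are only for illustration and are not part of the function above.

```javascript
// Illustrative check; these names are assumptions, not part of utf2Html.
const check = '✓';
const charCode = check.charCodeAt(0);   // UTF-16 code unit
const codePoint = check.codePointAt(0); // Unicode code point
const entity = `&#${charCode};`;

console.log(charCode, codePoint, entity); // 10003 10003 &#10003;
```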
The Problem with Emoji Conversion
While the function above works on UTF-8 special characters, it won't work for all of the emoji we have available today. I found a really good explanation for this in a post called Unicode in Javascript.
Take the 🤯 emoji, for example.
The character code for this emoji is 55358, so the entity returned by the function above would be &#55358;, which does not work.
The Unicode value for 🤯 is actually 129327 (or 0001 1111 1001 0010 1111 in binary). In order to express this character in its 16-bit form, it is split into a surrogate pair of 16-bit units, in string form as \uD83E\uDD2F (according to this handy Surrogate Pair Calculator): 🤯
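The split can be reproduced in a few lines. This is a rough sketch of the standard UTF-16 arithmetic, and toSurrogatePair is a hypothetical helper name used only for this example.

```javascript
// Sketch: split a code point above 0xFFFF into a UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical name, not part of the post's function.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;    // leaves a 20-bit value
  const high = 0xd800 + (offset >> 10);  // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(129327);
console.log(high, low);                               // 55358 56623
console.log(high.toString(16), low.toString(16));     // d83e dd2f
console.log(String.fromCharCode(high, low) === '🤯'); // true
```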
So in order to get the correct value, we need to know:
- if a character is one of these surrogate pair emojis, and
- how to calculate a surrogate pair's value.
Determining if an Emoji is a Surrogate Pair
The spread operator (...) splits a string into whole characters (code points), so spreading any single character gives a length of 1.
It is the same for characters, symbols and emoji
JavaScript | Result |
---|---|
[...'t'].length | 1 |
[...'✓'].length | 1 |
[...'🤯'].length | 1 |
But the JavaScript string length counts 16-bit code units, so if I check the length of the string itself, I can see that my emoji is made of a surrogate pair.
JavaScript | Result |
---|---|
't'.length | 1 |
'✓'.length | 1 |
'🤯'.length | 2 |
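The two ways of measuring can be compared side by side. In this small sketch, measure is a made-up helper: .length counts UTF-16 code units, while spreading the string counts code points.

```javascript
// measure is a hypothetical helper for this comparison only.
const measure = (s) => ({
  units: s.length,       // UTF-16 code units
  points: [...s].length, // Unicode code points
});

for (const s of ['t', '✓', '🤯']) {
  console.log(s, measure(s));
}
// 't' and '✓' report 1 unit and 1 point; '🤯' reports 2 units and 1 point.
```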
That means that once the spread operator has split the string into whole characters, I can tell which characters are surrogate pairs if char.length > 1:
function utf2Html(str) {
  let result = '',
    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,
    //converts a character into an HTML entity
    char2Html = (char) => {
      //string length counts 16-bit units, so a surrogate pair has a length of 2
      if (char.length > 1) {
        //TODO calculate a surrogate pair's value
      }
      //ASCII character or html entity from character code
      return char.charCodeAt(0) > 127 ? decimal2Html(char.charCodeAt(0)) : char;
    };
  //check each character (spread keeps surrogate pairs together)
  [...str].forEach((char) => {
    result += char2Html(char);
  });
  return result;
}
Notice I left a //TODO comment about calculating the pair. We'll tackle that next...
Calculating a Surrogate Pair's Unicode Value
I couldn't find a good post for converting a surrogate pair to its Unicode value, so instead I followed these steps for converting from Unicode to surrogate pairs in reverse:
# | Step | 🤯 Example |
---|---|---|
1 | Get the value of each part of the pair. | 55358 / 56623 |
2 | Convert each value to a binary number. | 1101100000111110 / 1101110100101111 |
3 | Take the last 10 digits of each number. | 0000111110 / 0100101111 |
4 | Concatenate the two binary numbers into a single 20-bit binary number. | 00001111100100101111 |
5 | Convert 20-bit number to a decimal number. | 63791 |
6 | Add 0x10000 to the new number. | 129327 |
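The same steps can also be written with bitwise operators instead of binary strings. This is only a sketch of the arithmetic in the table, and fromSurrogatePair is a hypothetical name for illustration.

```javascript
// Sketch of the six steps in the table; fromSurrogatePair is a hypothetical name.
function fromSurrogatePair(pair) {
  const high = pair.charCodeAt(0);          // step 1: 55358 for 🤯
  const low = pair.charCodeAt(1);           //         56623 for 🤯
  const high10 = high & 0x3ff;              // steps 2-3: keep the last 10 bits
  const low10 = low & 0x3ff;
  const twentyBit = (high10 << 10) | low10; // steps 4-5: join into one 20-bit number
  return twentyBit + 0x10000;               // step 6
}

console.log(fromSurrogatePair('🤯'));                         // 129327
console.log(fromSurrogatePair('🤯') === '🤯'.codePointAt(0)); // true
```

Masking with 0x3ff keeps the low 10 bits, which matches taking the last 10 binary digits in step 3.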
The Completed UTF (Including Emoji) to HTML Function
function utf2Html(str) {
  let result = '',
    //converts unicode decimal value into an HTML entity
    decimal2Html = (num) => `&#${num};`,
    //converts a character into an HTML entity
    char2Html = (char) => {
      //string length counts 16-bit units, so a surrogate pair has a length of 2
      if (char.length > 1) {
        //handle and convert utf surrogate pairs
        let concat = '';
        //for each part of the pair
        for (let i = 0; i < 2; i++) {
          //get the character code value
          let dec = char.charCodeAt(i),
            //convert to binary
            bin = dec.toString(2),
            //take the last 10 bits
            last10 = bin.slice(-10);
          //concatenate into 20 bit binary
          concat = concat + last10;
        }
        //add 0x10000 to get unicode value
        let unicode = parseInt(concat, 2) + 0x10000;
        //html entity from unicode value
        return decimal2Html(unicode);
      }
      //ASCII character or html entity from character code
      return char.charCodeAt(0) > 127 ? decimal2Html(char.charCodeAt(0)) : char;
    };
  //check each character (spread keeps surrogate pairs together)
  [...str].forEach((char) => {
    result += char2Html(char);
  });
  return result;
}
Update
Thanks to a comment by LUKE知る, I have an even simpler way to do this:
export function utf2Html(str) {
return [...str].map((char) => char.codePointAt() > 127 ? `&#${char.codePointAt()};` : char).join('');
}
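For a quick check of the simplified version, here is a small usage sketch. It drops the export keyword only so it runs as a plain script and passes an explicit index of 0 to codePointAt. The sample string is made up for illustration.

```javascript
// Same one-liner as in the update, minus the export so it runs as a plain script.
function utf2Html(str) {
  return [...str]
    .map((char) => (char.codePointAt(0) > 127 ? `&#${char.codePointAt(0)};` : char))
    .join('');
}

console.log(utf2Html('Done ✓ 🤯')); // Done &#10003; &#129327;
```

Emoji that are built from several code points (skin tones, joined sequences) come out as several entities in a row, one per code point.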
Top comments (7)
Thanks for the reply! It looks like there are limitations on how the character is passed to the function and used in the spread operator. I corrected my original post. Also I wasn't aware of String.prototype.codePointAt.
I guess the only limitation with either of our approaches is support of IE11, since neither String.prototype.codePointAt nor the spread operator would work.
So for IE11, we'd need to use String.prototype.charCodeAt, but we'd probably have to test with something like !!char[1].
That said, I'm probably going to switch to String.prototype.codePointAt in my original use case, so again, thanks!
Hey guys, I'm fetching data from an API and it returns text that has these character entities in it. I want to parse them to their actual equivalents, like:
& q u o t ; = '"'
& l t ; = '<' and so on. I can't have the dirty HTML messed up and displayed for the user.
How do I convert this HTML entity back to a raw emoji?
Yeah. Safari is definitely a thorn in my side.
Why convert Unicode characters into HTML entities at all?
Good question (thanks for asking). I work on front end open source web components. If I don't have control over the backend, I can't be sure that UTF-8 is supported.
Awesome - does exactly what I want!!! Well done!