Fun with UTF-8: Homoglyphs

#unicode #utf #raku

ꓧ𐐬𝗆𐐬𝗀ⅼУрႹ ⅰѕ 𝗌е𝗍 𝗈ſ ဝո𝖾 𝗈г ꝳо𝗋е ɡ𝗋аρႹ𝖾ⅿе𝗌 𝗍Ⴙа𝗍 Ⴙ𝖺ѕ 𝗂ꝱ𝖾ꝴ𝗍𝗂𐐽а𝗅 о𝗋 ѵ𝖾г𝗒 𝗌Ꭵⅿі𝗅аꝵ ⅼꝏ𝗄 𝗍ᴏ 𝗌იო𝖾 о𝗍ꜧ𝖾𝗋 𐑈е𝗍 ဝſ ɡꝵ𝖺рႹеოеѕ. As in the previous sentence, which does not use a single ASCII letter:

ꓧ - LISU LETTER XA
𐐬 - DESERET SMALL LETTER LONG O
𝗆 - MATHEMATICAL SANS-SERIF SMALL M
𐐬 - DESERET SMALL LETTER LONG O
𝗀 - MATHEMATICAL SANS-SERIF SMALL G
ⅼ - SMALL ROMAN NUMERAL FIFTY
У - CYRILLIC CAPITAL LETTER U
р - CYRILLIC SMALL LETTER ER
Ⴙ - GEORGIAN CAPITAL LETTER CHIN
...
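
The names in the list above do not have to be looked up by hand. As a rough sketch in Python (standard library only, not the Raku used elsewhere in this post), the hypothetical helper `describe` below prints the name and code point of every character in the first word of that sentence. The code points are written as escapes so the example does not depend on how your editor renders them.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each non-space character."""
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        # unicodedata.name raises ValueError for unnamed code points,
        # so a default is passed instead.
        rows.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return rows

# First word of the homoglyph sentence, spelled out as code points.
word = "\uA4E7\U0001042C\U0001D5C6\U0001042C\U0001D5C0\u217C\u0423\u0440\u10B9"

for ch, code_point, name in describe(word):
    print(f"{ch} - {name} ({code_point})")
```

The names come from the Unicode database bundled with your Python version, so very recently added characters may show up as `<unnamed>`.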

Homoglyphs are not Unicode-specific, but it was the ability to write in many scripts using a single UTF encoding that made them popular.

Similarity is conditional

It is font-dependent. Two sets of graphemes that look very similar (or even identical) in one font may not look that similar in another. For example т - CYRILLIC SMALL LETTER TE looks like ASCII T, but in cursive fonts (those that resemble handwriting with connected letters) it looks like m.

Similarity is subjective

To many people unfamiliar with the given alphabets, Ǧ and Ğ may look exactly the same. But someone who uses those letters on a daily basis will notice immediately that the first one has a CARON and the other has a BREVE on top.
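
When your eyes cannot tell such letters apart, the Unicode database can. A minimal Python sketch, assuming the two letters are the precomposed U+01E6 and U+011E: the hypothetical helper `marks` decomposes a character and reports the names of its combining marks.

```python
import unicodedata

def marks(ch):
    """Names of the combining marks left after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", ch)
    # combining() returns a non-zero class for combining marks only.
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]

print(marks("\u01E6"))  # ['COMBINING CARON']
print(marks("\u011E"))  # ['COMBINING BREVE']
```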

They are not limited to a single grapheme

For example ထ - MYANMAR LETTER THA looks like two ASCII o letters. And the other way around: ASCII rn looks like a single ASCII letter m.

Applications?

Fun. 𐐑ǃkǝ pɹoducǃng weird looking bᴝt ɹeadɐble ʇext.
Trolling. A programmer's classic is to replace ; in someone's code with ; - GREEK QUESTION MARK, U+037E - and watch the funny debugging attempts. A more advanced version is to modify a keybinding. For example, on macOS create ~/Library/KeyBindings/DefaultKeyBinding.dict with the following content:

{
    ";" = ("insertText:", "\U037E");
}

And observe how Python suddenly becomes someone's language of choice :P

Just promise you won't troll a stressed-out junior dev before the end of the sprint.
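
If you end up on the receiving end of this prank, a scan for U+037E finds the culprit quickly. Below is a minimal Python sketch; the function name and the sample text are made up for illustration.

```python
GREEK_QUESTION_MARK = "\u037e"

def find_greek_question_marks(source):
    """Return 1-based (line, column) pairs for every U+037E in the text."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch == GREEK_QUESTION_MARK:
                hits.append((line_no, col_no))
    return hits

# Hypothetical two-line source; the second line ends with U+037E.
sample = "int a = 1;\nint b = 2\u037e\n"
print(find_greek_question_marks(sample))  # [(2, 10)]
```

Scan the raw text: U+037E is canonically equivalent to the ASCII semicolon, so NFC normalization replaces it and a scan after normalization finds nothing.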

Phishing. This is the "Fun with UTF-8" sub-series, but unfortunately this application is anything but fun. Homoglyphs are widely used to spoof company names, bypass anti-spam filters and create fake domains. For example, can you spot the difference between Paypal and ꓑayраl?

A common way to detect those is to check the Script Unicode property (more on properties in this post). A single word that uses more than one script should be considered suspicious:

$ raku -e '"Paypal".comb.classify( *.uniprop("Script") ).say'
{Latin => [P a y p a l]} # real

$ raku -e '"ꓑayраl".comb.classify( *.uniprop("Script") ).say'
{Cyrillic => [р а], Latin => [a y l], Lisu => [ꓑ]} # fake

Raku note: the comb method without a parameter extracts a list of characters. Those characters are grouped by the classify method. The classification key is the output of the uniprop method for a given character.
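
The same check can be approximated outside Raku. Python's standard library does not expose the Script property, so the sketch below assumes that the first word of a character's Unicode name (LATIN, CYRILLIC, LISU...) is a good enough stand-in. That is a heuristic, not the real property, and all function names here are hypothetical.

```python
import unicodedata
from collections import defaultdict

def rough_script(ch):
    """First word of the Unicode name; only an approximation of Script."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def classify_by_script(word):
    """Group the letters of a word by their approximate script."""
    groups = defaultdict(list)
    for ch in word:
        if ch.isalpha():  # skip digits and punctuation, which carry no script here
            groups[rough_script(ch)].append(ch)
    return dict(groups)

def looks_spoofed(word):
    """A word mixing more than one script is treated as suspicious."""
    return len(classify_by_script(word)) > 1

real = "Paypal"
fake = "\uA4D1ay\u0440\u0430l"  # LISU PA, a, y, CYRILLIC ER, CYRILLIC A, l

print(classify_by_script(real), looks_spoofed(real))
print(classify_by_script(fake), looks_spoofed(fake))
```

The heuristic breaks down for characters whose names do not start with a script, such as the MATHEMATICAL letters used earlier, so treat the result as a hint rather than a verdict.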

Tools

I maintain the HomoGlypher library/package, which handles common homoglyph operations:

Unwind. From ASCII text, create a list of all possible homoglyphied text variants. This is useful, for example, for checking whether a domain is spoofed.
Collapse. From homoglyphied text, recover all possible ASCII text variants. Useful for normalizing text before passing it to content filters.
Randomize. From ASCII text, create a single homoglyphied text with a given replacement probability.
Tokenize. Create a regular expression token that will match homoglyphied text equivalent to the given ASCII text. I think this may be the only homoglyph-related library in existence with this feature :)

A huge list of mappings is provided, so you won't have to dig through Unicode blocks on your own to find possible similarities between graphemes.
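
To make the four operations more concrete, here is a toy Python sketch of the ideas behind them. It is not HomoGlypher's API: every name below is hypothetical, the mapping covers only three letters, and each homoglyph maps back to exactly one ASCII letter, so collapsing yields a single result instead of a list of variants.

```python
import itertools
import random
import re

# Hypothetical three-letter mapping, for illustration only.
HOMOGLYPHS = {
    "a": ["\u0430"],            # CYRILLIC SMALL LETTER A
    "p": ["\u0440"],            # CYRILLIC SMALL LETTER ER
    "o": ["\u043e", "\u03bf"],  # CYRILLIC SMALL LETTER O, GREEK SMALL LETTER OMICRON
}
# Reverse lookup: homoglyph -> ASCII letter.
TO_ASCII = {g: letter for letter, glyphs in HOMOGLYPHS.items() for g in glyphs}

def unwind(text):
    """All variants of an ASCII text, the unchanged original included."""
    choices = [[ch] + HOMOGLYPHS.get(ch, []) for ch in text]
    return ["".join(combo) for combo in itertools.product(*choices)]

def collapse(text):
    """Map every known homoglyph back to its ASCII letter."""
    return "".join(TO_ASCII.get(ch, ch) for ch in text)

def randomize(text, probability, rng=random):
    """Replace each mappable letter with a homoglyph with the given probability."""
    out = []
    for ch in text:
        glyphs = HOMOGLYPHS.get(ch)
        if glyphs and rng.random() < probability:
            out.append(rng.choice(glyphs))
        else:
            out.append(ch)
    return "".join(out)

def tokenize(text):
    """Compile a regex matching the text or any of its homoglyphied variants."""
    parts = []
    for ch in text:
        alternatives = [ch] + HOMOGLYPHS.get(ch, [])
        parts.append("[" + "".join(re.escape(a) for a in alternatives) + "]")
    return re.compile("".join(parts))

print(len(unwind("pa")))                                  # 4
print(collapse("\u0440\u0430y\u0440\u0430l"))             # paypal
print(bool(tokenize("pop").fullmatch("\u0440\u03bf\u0440")))  # True
```

The number of unwound variants grows multiplicatively with text length, which is one reason a regular expression is often the more practical tool for matching.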

Give it a try. And if you know other homoglyph libraries, please leave a note in the comments for future readers.