๊ง๐ฌ๐๐ฌ๐โ
ผะฃัแน โ
ฐั ๐ะต๐ ๐ลฟ แีธ๐พ ๐ะณ ๊ณะพ๐ะต ษก๐ะฐฯแน๐พโ
ฟะต๐ ๐แนะฐ๐ แน๐บั ๐๊ฑ๐พ๊ด๐๐๐ฝะฐ๐
ะพ๐ ัต๐พะณ๐ ๐แฅโ
ฟั๐
ะฐ๊ต โ
ผ๊๐ ๐แด ๐แแ๐พ ะพ๐๊ง๐พ๐ ๐ะต๐ แลฟ ษก๊ต๐บัแนะตแะตั. Like in previous sentence, that does not use a single ASCII letter:
ꓧ - LISU LETTER XA
𐐬 - DESERET SMALL LETTER LONG O
𝗆 - MATHEMATICAL SANS-SERIF SMALL M
𐐬 - DESERET SMALL LETTER LONG O
𝗀 - MATHEMATICAL SANS-SERIF SMALL G
ⅼ - SMALL ROMAN NUMERAL FIFTY
У - CYRILLIC CAPITAL LETTER U
р - CYRILLIC SMALL LETTER ER
Ⴙ - GEORGIAN CAPITAL LETTER CHIN
...
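A listing like the one above can be produced with a few lines of code. The following is a minimal sketch in Python (standard library only); the helper name `describe` is made up for this illustration, and the sample word is the nine named characters from the list, written as escapes so the snippet survives any editor or font.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' line per character of the given text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


# The nine named characters from the list, as escape sequences.
word = "\ua4e7\U0001042c\U0001d5c6\U0001042c\U0001d5c0\u217c\u0423\u0440\u10b9"

for line in describe(word):
    print(line)

# True when the text contains no ASCII character at all.
print(all(ord(ch) > 127 for ch in word))
```

Printing code points next to names makes it obvious which characters are impostors, even when the font renders them identically to ASCII letters.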
Homoglyphs are not Unicode-specific, but it was the ability to write in many scripts using a single UTF encoding that made them popular.
Similarity is conditional
It is font-dependent. Two sets of graphemes that look very similar (or even identical) in one font may not look that similar in another. For example, т - CYRILLIC SMALL LETTER TE looks like ASCII T, but in cursive fonts (those that resemble connected handwriting) it looks like m.
Similarity is subjective
To many people unfamiliar with the given alphabets, Ǧ and Ğ may look exactly the same. But someone who uses those letters on a daily basis will notice immediately that the first one has a CARON and the other has a BREVE on top.
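For readers who cannot tell the two marks apart by eye, canonical decomposition makes the difference explicit. This is a small Python sketch (standard library only); the function name `combining_marks` is an illustrative choice, and it only helps for characters that actually have a canonical decomposition.

```python
import unicodedata


def combining_marks(ch):
    """Names of the combining marks revealed by NFD decomposition of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]


print(combining_marks("\u01e6"))  # LATIN CAPITAL LETTER G WITH CARON
print(combining_marks("\u011e"))  # LATIN CAPITAL LETTER G WITH BREVE
```

Both letters decompose to a plain `G` followed by a different combining mark, which is exactly the detail an untrained eye misses.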
They are not limited to a single grapheme
For example, ထ - MYANMAR LETTER THA looks like two ASCII o letters. And the other way around: ASCII rn looks like the single ASCII letter m.
Applications?
Fun. ๐วkว pษนoducวng weird looking bแดt ษนeadษble สext.
Trolling. A programmer's classic is to replace ; in someone's code with ; (U+037E GREEK QUESTION MARK) and watch some funny debugging attempts. A more advanced version is to modify a keybinding. For example, on macOS create ~/Library/KeyBindings/DefaultKeyBinding.dict with the following content:
{
    /* Typing ";" inserts U+037E GREEK QUESTION MARK instead. */
    ";" = ("insertText:", "\U037E");
}
And observe how Python suddenly becomes someone's favorite language :P
Just promise you won't troll a stressed-out junior dev before the end of the sprint.
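If you end up on the receiving end of this prank, a scan for non-ASCII characters finds the culprit quickly. Below is a minimal Python sketch (standard library only); `find_non_ascii` is a hypothetical helper written for this post, not part of any tool mentioned here. One caveat worth knowing: U+037E canonically normalizes to an ordinary semicolon, so any tool that applies NFC normalization will silently "fix" the character and hide the evidence.

```python
import unicodedata


def find_non_ascii(source):
    """Return (line, column, code point, name) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col, f"U+{ord(ch):04X}", name))
    return hits


# First statement ends with GREEK QUESTION MARK, second with a real semicolon.
code = "int x = 1\u037e\nreturn x;"
for hit in find_non_ascii(code):
    print(hit)

# NFC normalization turns the Greek question mark into a plain semicolon.
print(unicodedata.normalize("NFC", "\u037e") == ";")
```

Running this on a suspicious file before normalizing it anywhere keeps the offending code point visible.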
- Phishing. This is "Fun with UTF-8" sub series, but unfortunately this application is anything but fun. Homoglyphs are massively used to spoof company names, bypass anti-spam filters and create fake domains. For example can you spot difference between
Paypal
and๊ayัะฐl
?
A common way to detect those is to check the Script Unicode property (more on those in this post). A single word using more than one script should be considered suspicious:
$ raku -e '"Paypal".comb.classify( *.uniprop("Script") ).say'
{Latin => [P a y p a l]} # real
$ raku -e '"๊ayัะฐl".comb.classify( *.uniprop("Script") ).say'
{Cyrillic => [ั ะฐ], Latin => [a y l], Lisu => [๊]} # fake
Raku note: the comb method without a parameter extracts a list of characters. Those characters are grouped by the classify method. The classification key is the output of the uniprop method for a given character.
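The same check can be approximated outside Raku. Python's standard library does not expose the Script property, so the sketch below makes a loud assumption: it uses the first word of each character's Unicode name (`LATIN`, `CYRILLIC`, `LISU`, ...) as a stand-in for the script. That heuristic is good enough for a demonstration, but it is not the real Script property. The function names are made up for this example.

```python
import unicodedata
from collections import defaultdict


def classify_by_script(word):
    """Group the letters of word by an approximate script label.

    Assumption: the first word of the Unicode character name stands in
    for the Script property, which the standard library does not provide.
    """
    groups = defaultdict(list)
    for ch in word:
        if not ch.isalpha():
            continue  # digits and punctuation are shared between scripts
        label = unicodedata.name(ch, "UNKNOWN").split()[0]
        groups[label].append(ch)
    return dict(groups)


def is_mixed_script(word):
    """A single word drawing on more than one script is suspicious."""
    return len(classify_by_script(word)) > 1


real = "Paypal"
fake = "\ua4d1ay\u0440\u0430l"  # Lisu PA, Latin a y, Cyrillic er and a, Latin l

print(sorted(classify_by_script(real)), is_mixed_script(real))
print(sorted(classify_by_script(fake)), is_mixed_script(fake))
```

For production use, a library that exposes the actual Script (or Script_Extensions) property is the safer choice, since name prefixes misclassify some characters, such as the mathematical alphabets.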
Tools
I maintain the HomoGlypher library/package, which handles common homoglyph operations:
Unwind. From ASCII text, create a list of all possible homoglyphied text variants. This is useful, for example, for checking whether some domain is spoofed.
Collapse. From homoglyphied text, recover all possible ASCII text variants. Useful for normalizing text before passing it to content filters.
Randomize. From ASCII text, create a single homoglyphied text with a given replacement probability.
Tokenize. Create a regular expression token that will match homoglyphied text equivalent to the given ASCII text. I think this may be the only homoglyph-related library in existence with this feature :)
A huge list of mappings is provided, so you won't have to dig through Unicode blocks on your own to find possible similarities between graphemes.
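To make the four operations concrete, here is a toy Python sketch of the ideas. It is not HomoGlypher's API or implementation: the function names, the signatures and the three-entry `HOMOGLYPHS` table are all assumptions made for illustration, and it only handles single-character replacements (no `rn` for `m`).

```python
import itertools
import random
import re
from collections import defaultdict

# Tiny illustrative mapping (an assumption); a real table is far larger.
HOMOGLYPHS = {
    "P": ["\ua4d1"],  # LISU LETTER PA
    "p": ["\u0440"],  # CYRILLIC SMALL LETTER ER
    "a": ["\u0430"],  # CYRILLIC SMALL LETTER A
}

# Reverse table: homoglyph -> list of ASCII characters it may stand for.
REVERSE = defaultdict(list)
for ascii_char, glyphs in HOMOGLYPHS.items():
    for glyph in glyphs:
        REVERSE[glyph].append(ascii_char)


def unwind(text):
    """All variants of text; in this sketch the original is included too."""
    choices = [[ch] + HOMOGLYPHS.get(ch, []) for ch in text]
    return ["".join(combo) for combo in itertools.product(*choices)]


def collapse(text):
    """All ASCII texts that the homoglyphied text may stand for."""
    choices = [REVERSE.get(ch, [ch]) for ch in text]
    return ["".join(combo) for combo in itertools.product(*choices)]


def randomize(text, probability, rng=random):
    """Replace each mappable character with the given probability."""
    out = []
    for ch in text:
        alts = HOMOGLYPHS.get(ch)
        out.append(rng.choice(alts) if alts and rng.random() < probability else ch)
    return "".join(out)


def tokenize(text):
    """Regex matching text or any of its homoglyphied variants."""
    parts = []
    for ch in text:
        alternatives = [ch] + HOMOGLYPHS.get(ch, [])
        parts.append("(?:" + "|".join(re.escape(a) for a in alternatives) + ")")
    return re.compile("".join(parts))


print(len(unwind("Paypal")))
print(collapse("\ua4d1ay\u0440\u0430l"))
print(bool(tokenize("Paypal").fullmatch("\ua4d1ay\u0440\u0430l")))
```

Note how quickly the number of variants grows with the size of the mapping table: every extra homoglyph for a character multiplies the result of `unwind`, which is one reason a regular expression (Tokenize) is often the more practical way to match spoofed text.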
Give it a try. And if you know other homoglyph libraries please leave a note in the comments for future readers.