
PythicCoder for Microsoft Azure

Originally published at towardsdatascience.com

Preventing Homoglyph attacks with OCR.

TLDR; This post describes what homoglyph attacks are and how to prevent them with Cognitive Services.

Getting Started

Code for this story can be found on GitHub.

aribornstein/HomoglyphAttackPreventionService

One-click deployment instructions for Azure can be found below.

What is a Homoglyph attack?

In orthography and typography, a homoglyph is one of two or more characters with shapes that appear identical or very similar. In layman's terms, a homoglyph is any character that looks similar to another character, such as the S and $ in the image above.

Language models are often vulnerable to obfuscation attacks using homoglyphs because of the way they encode text. In Unicode and ASCII, for example, characters that look nearly identical are stored as different character codes, and a model that only sees those codes will struggle to learn their similarity.
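To make the encoding point concrete, here is a minimal Python sketch using only the standard library. The word "paypal" is just an illustrative choice and does not come from the original experiment.

```python
import unicodedata

# Latin "a" (U+0061) and Cyrillic "а" (U+0430) look alike in most fonts,
# but they are distinct code points.
latin_a = "\u0061"
cyrillic_a = "\u0430"

for ch in (latin_a, cyrillic_a, "S", "$"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Illustrative word: both "a" letters are swapped for the Cyrillic look-alike.
latin_word = "paypal"
mixed_word = "p\u0430yp\u0430l"

# Same length, same appearance to a human reader, different strings to software.
print(len(latin_word) == len(mixed_word))
print(latin_word == mixed_word)
```

Anything that compares or tokenizes text by code point treats the two words as unrelated, which is what a homoglyph attack relies on.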

Character encodings for beginners

To drive this point home, let’s take a look at the phrase below:

I got a $hitty result from my awesome cloud sentiment analysis model.

The phrase above clearly demonstrates negative sentiment. The word $hitty is a homoglyphic obfuscation of the profane word shitty.
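As a rough illustration of why the obfuscation works, consider a toy word-matching scorer. This is a deliberately simplistic sketch with a hypothetical three-word lexicon; real sentiment services use far more sophisticated models, but the exact-match failure is the same in spirit.

```python
# Toy lexicon of negative words (hypothetical, for illustration only).
NEGATIVE_WORDS = {"shitty", "awful", "terrible"}


def count_negative_words(sentence: str) -> int:
    """Count tokens that exactly match an entry in the toy lexicon."""
    tokens = sentence.lower().replace(".", " ").split()
    return sum(token in NEGATIVE_WORDS for token in tokens)


original = "I got a shitty result from my awesome cloud sentiment analysis model."
obfuscated = "I got a $hitty result from my awesome cloud sentiment analysis model."

print(count_negative_words(original))    # the plain spelling matches the lexicon
print(count_negative_words(obfuscated))  # "$hitty" matches nothing in the lexicon
```

With the negative word hidden, the only sentiment-bearing word left for such a scorer is "awesome".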

Let’s see how the four most popular cloud sentiment analysis services handle this attack.

As we can see, Azure Text Analytics and GCP Natural Language correctly classify the original sentiment, but both fail on the obfuscated text. IBM Watson fails to classify the sentiment of either text correctly. AWS Comprehend does not provide a demo without an AWS account, and it also fails on the example sentence.

While the $ and S example above may seem arbitrary, the look-alike Latin and Cyrillic letters listed on Wikipedia demonstrate how such an attack can be both effective and hard to detect.

This creates problems wherever such attacks can be used to exploit or harm an application, for example when bots use homoglyphs to hide from fake news detectors.

How to Prevent Homoglyph attacks?

Azure Computer Vision correctly reveals the homoglyph when the text is represented as an image.

After talking with my friend Amit Moryossef from the BIU NLP lab, we realized we might be able to prevent homoglyph attacks using OCR systems.

Using the Azure Computer Vision service, I tested this theory with the sentence above, and it correctly used the image-domain context to extract the word Shitty from the homoglyph $hitty.

Using this capability I’ve written the following open source container service that will:

  1. Take a given text as input

  2. Convert the text to an image

  3. Process the image using OCR

  4. Return the correct text with homoglyphs removed.
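The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository. It assumes the third-party Pillow library for rendering. It also leaves the OCR backend as a caller-supplied function (the hypothetical `ocr` parameter), since the exact Computer Vision call depends on your client and credentials.

```python
import io
from typing import Callable


def render_text_to_png(text: str, padding: int = 10,
                       char_width: int = 12, line_height: int = 20) -> bytes:
    """Step 2: draw the text on a white canvas and return PNG bytes.

    Assumes Pillow is installed. The canvas size is a generous estimate,
    and the small default font may not cover every script; a real service
    would probably load a larger TrueType font with ImageFont.truetype.
    """
    from PIL import Image, ImageDraw, ImageFont  # third-party: Pillow

    size = (2 * padding + char_width * max(len(text), 1),
            2 * padding + line_height)
    image = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(image)
    draw.text((padding, padding), text, fill="black",
              font=ImageFont.load_default())

    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    return buffer.getvalue()


def remove_homoglyphs(
    text: str,
    ocr: Callable[[bytes], str],
    render: Callable[[str], bytes] = render_text_to_png,
) -> str:
    """Steps 1-4: take text, render it, run OCR, return the recognized text."""
    image_bytes = render(text)        # step 2: text -> image
    recognized = ocr(image_bytes)     # step 3: image -> text (hypothetical backend)
    return recognized.strip()         # step 4: cleaned-up result
```

Passing the OCR step in as a function keeps the sketch cloud agnostic, in the same spirit as the container: any backend that accepts image bytes and returns a string can be plugged in.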

The Docker service is cloud agnostic. I provide a one-click deployment option to Azure for convenience.

If you have an existing Azure subscription, you can get started by clicking the button below to auto deploy the service.

Click here to get started!

Otherwise, you can get a free Azure account here and then click the deploy button above.

Create your Azure free account today | Microsoft Azure

If you have any questions, comments, or topics you would like me to discuss, feel free to follow me on Twitter. Thanks again to Amit Moryossef and the BIU NLP lab for the amazing inspiration, and to Iddan Sachar for his help debugging ARM for one-click deployment.

Using the Service

To use the service, just send it a URL-encoded query string of up to 200 characters, which makes it a good fit for validating tweets. Below is an example call using curl; be sure to use your own service endpoint.
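When calling the service from a script rather than from curl, the request URL can be prepared along these lines. This is a minimal Python sketch using only the standard library. The endpoint in the usage lines is a placeholder. Two details are assumptions: that the text is appended as a bare query string, and that the 200-character limit applies to the text before encoding. Check the repository README for the actual format.

```python
from urllib.parse import quote

MAX_QUERY_CHARS = 200  # limit stated in the text above


def build_request_url(endpoint: str, text: str) -> str:
    """Return the endpoint followed by the percent-encoded text.

    Assumption: the service reads the text from a bare query string.
    """
    if len(text) > MAX_QUERY_CHARS:
        raise ValueError(f"text is longer than {MAX_QUERY_CHARS} characters")
    # safe="" also encodes "/" so the whole text stays inside the query string.
    return f"{endpoint}?{quote(text, safe='')}"


# Placeholder endpoint; replace it with your own service endpoint.
url = build_request_url("https://example.invalid/api", "$hitty result")
print(url)
```

The resulting URL can then be fetched with any HTTP client, such as `urllib.request.urlopen`.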

Next Steps

While the service works very well at removing homoglyphs, there are still a few cases where it fails.

Example Failure Case

Future work will explore a more customized approach to this problem, but the OCR approach already works very well for minimal effort.

Additional Resources

About the Author

Aaron (Ari) Bornstein is an avid AI enthusiast with a passion for history, engaging with new technologies, and computational medicine. As an Open Source Engineer on Microsoft’s Cloud Developer Advocacy team, he collaborates with the Israeli Hi-Tech Community to solve real-world problems with game-changing technologies that are then documented, open sourced, and shared with the rest of the world.

