Andrew Lock "Sock"

Originally published at andrewlock.net

Adding simple email address obfuscation for your blog like Cloudflare Scrape Shield

In this post I show a simple way to obfuscate email addresses to make it harder for bots to scrape them from your site. It uses a similar approach to Cloudflare Scrape Shield.

It's important to note that the encoding scheme used here is incredibly weak. But that's kind of the point. It's only meant to provide rudimentary protection against automated scraping by bots. It's obfuscation, not encryption!

Background - Cloudflare Scrape Shield

I include my email address on the about page of my blog in case people want to get in touch. I've personally only ever had pleasant emails from people (though I'm well aware that's a rarity for many people in our industry). Somewhat surprisingly perhaps, I don't get a huge amount of spam because of it.

Some time ago I moved my blog from a self-hosted instance of Ghost to Netlify. At the same time, I also removed the Cloudflare caching layer, as Netlify uses its own layer of caching. One of the features of Cloudflare is Scrape Shield. This has multiple parts to it, but the one I was most interested in was email obfuscation.

Cloudflare's email obfuscation works by modifying the HTML output of your app when they serve it. If Cloudflare detects an email address in an <a> tag, for example:

<a href="mailto:example@example.org">Contact me</a>

It will modify this element inline, and inject a script element:

<a href="/cdn-cgi/l/email-protection#a5c0ddc4c8d5c9c0e5c0ddc4c8d5c9c08bcad7c2">Contact me</a>
<script data-cfasync="false" src="/cdn-cgi/scripts/f2bf09f8/cloudflare-static/email-decode.min.js"></script>

When the page is served, the email-decode.min.js script is executed, and the <a> tag is replaced with the original. The advantage of this is that bots need to execute the JavaScript on your page in order to retrieve your email address, which raises the barrier (slightly) for bots trying to scrape the email address from your app.

To avoid causing problems, there are a bunch of places that Cloudflare won't obfuscate email addresses. See the documentation for details.

When I moved my blog from Cloudflare to Netlify, I didn't want to lose that email obfuscation, so I looked at how I could implement it myself. Luckily, it's pretty trivial to achieve, as I found from reading this excellent post. This post is very much based on that one.

So, how does the email address "encryption" work?

Decoding an obfuscated email address

First of all, while this is technically encryption, the scheme is so weak that you really shouldn't think of it as such. It's closer to obfuscation. That's all that's required for our intended goal, but it's important to keep in mind.

I'll start with the decoding strategy - how do you retrieve the email address from the encoded version shown previously?

The email is encoded into the # portion of the modified attribute, i.e. /cdn-cgi/l/email-protection#EMAIL. In the previous example, that was:

a5c0ddc4c8d5c9c0e5c0ddc4c8d5c9c08bcad7c2

The overall strategy for decoding this is as follows:

  • Remove the first 2 characters (a5), and convert them from hex to decimal (165). This is the key for the rest of the calculation.
  • Iterate through the remainder of the characters, incrementing by two. For each pair of characters (the first pair is c0):
    • Convert the pair from hex to decimal (192)
    • Perform a bitwise XOR of the number with the key, so 165 ^ 192 = 101
    • Convert the result (101) to the character with that UTF-16 code (e)
    • Append the result to previous results
  • Repeat until all characters are consumed. The final result is the original email
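As a rough illustration, the arithmetic for the first pair can be reproduced in a browser console. This is only a sketch of the steps above applied to the example string; the variable names are chosen for illustration.

```javascript
// Worked example of the first decoding step, using the string shown earlier
var encoded = "a5c0ddc4c8d5c9c0e5c0ddc4c8d5c9c08bcad7c2";

// The first two hex characters are the key: 0xa5 = 165
var key = parseInt(encoded.slice(0, 2), 16);

// The next pair is the first encoded character: 0xc0 = 192
var firstPair = parseInt(encoded.slice(2, 4), 16);

// XOR with the key to recover the character code: 165 ^ 192 = 101
var firstCode = firstPair ^ key;

// 101 is the UTF-16 code unit for the letter "e"
var firstChar = String.fromCharCode(firstCode);

console.log(key, firstPair, firstCode, firstChar);
```

Repeating the same XOR for every remaining pair yields the rest of the address.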

The XOR scheme used is one of the most basic encryption schemes possible. And on top of that, the key for the encryption is stored right alongside the cipher text! Again, this is not secure encryption; it is simply obfuscation.

This is actually a simplified description of the Cloudflare approach - Cloudflare have an additional step to handle Unicode codepoints (which can be multiple bytes long). See this blog post for a description of that step.
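One possible way to cope with non-ASCII characters is to XOR the UTF-8 bytes of the address instead of its UTF-16 code units, so every value stays in the range 0-255 and always fits in two hex characters. The sketch below is an assumption for illustration and is not necessarily what Cloudflare does; encodeEmailUtf8 and decodeEmailUtf8 are hypothetical names, and it relies on the standard TextEncoder and TextDecoder APIs.

```javascript
// Sketch only: byte-wise XOR over UTF-8, not Cloudflare's actual implementation
function encodeEmailUtf8(email, key) {
    // UTF-8 bytes, each in the range 0-255
    var bytes = new TextEncoder().encode(email);

    // Two hex characters for the key, then two per byte
    var out = key.toString(16).padStart(2, "0");
    for (var i = 0; i < bytes.length; i++) {
        out += (bytes[i] ^ key).toString(16).padStart(2, "0");
    }
    return out;
}

function decodeEmailUtf8(encoded) {
    var key = parseInt(encoded.slice(0, 2), 16);

    // One byte for every pair of hex characters after the key
    var bytes = new Uint8Array((encoded.length - 2) / 2);
    for (var n = 2; n < encoded.length; n += 2) {
        bytes[(n - 2) / 2] = parseInt(encoded.slice(n, n + 2), 16) ^ key;
    }

    // Interpret the recovered bytes as UTF-8 text
    return new TextDecoder().decode(bytes);
}
```

For a plain ASCII address this produces the same output as the simpler per-character scheme, because each ASCII character is a single UTF-8 byte.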

So how can you implement this algorithm for your own apps?

Implementing email obfuscation on your own blog

Cloudflare dynamically replaces email addresses in your HTML, and injects additional scripts into the DOM. That's not really necessary in my case - my blog is statically generated, and even if it wasn't, there are probably only a few email addresses I would want to encode.

Because of those constraints, I opted to encode the email address on my blog ahead of time, rather than trying to do it on-the-fly. I can also then just include the email decoding script in the standard JavaScript bundle for the site.

Encoding the email address

Given you have an email address you want to obfuscate on your site, e.g. `example@example.org`, how can you encode that in the required format?

I wrote a small JavaScript function that takes an email address, and a key in the range 0-255 and outputs an obfuscated email address. It uses the algorithm from the previous section in reverse to generate the output:

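function encodeEmail(email, key) {
    // Hex encode a value in the range 0-255 as exactly two characters,
    // so values below 16 keep their leading zero (e.g. 7 becomes "07").
    // Without the padding, the decoder's two-character pairs go out of step.
    function toHex(value) {
        return ("0" + value.toString(16)).slice(-2);
    }

    // Hex encode the key
    var encodedString = toHex(key);

    // Loop through every character in the email
    for (var n = 0; n < email.length; n++) {

        // Get the code (in decimal) for the nth character
        var charCode = email.charCodeAt(n);

        // XOR the character with the key
        var encoded = charCode ^ key;

        // Hex encode the result, and append to the output string
        encodedString += toHex(encoded);
    }
    return encodedString;
}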

I only have a couple of emails on my blog I want to obfuscate, so I ran them through this function, choosing an arbitrary key. I used Chrome's dev tools to run it - open up any old website, hit F12 to view the console, and copy-paste the function above. Then run the function using your email, picking a random number between 0 and 255:

encodeEmail('example@example.org', 156);

The hex encoded output is what we'll use in our website.

Image of encoding a string in Chrome's dev console

The code to decode the email is very similar.

Decoding the email address

The function to decode an email address from the encoded string is shown below, and follows the algorithm shown previously:

function decodeEmail(encodedString) {
    // Holds the final output
    var email = "";

    // Extract the first 2 characters
    var keyInHex = encodedString.substring(0, 2);

    // Convert the hex-encoded key into decimal
    var key = parseInt(keyInHex, 16);

    // Loop through the remaining encoded characters in steps of 2
    for (var n = 2; n < encodedString.length; n += 2) {

        // Get the next pair of characters
        var charInHex = encodedString.substring(n, n + 2);

        // Convert hex to decimal
        var charCode = parseInt(charInHex, 16);

        // XOR the character with the key to get the original character
        var output = charCode ^ key;

        // Append the decoded character to the output
        email += String.fromCharCode(output);
    }
    return email;
}

When you pass this function an encoded email, you'll get your original back:

Image of decoding a string in Chrome's dev console

Now let's look at how to use these functions in a website.

Replacing existing emails with obfuscated emails

I only use my email in anchor tags, so I want the final (unencoded) tag on my blog to look something like the following:

<a href="mailto:example@example.org">example@example.org</a>

In my source code, instead of the above, I use the following:

<a class="eml-protected" href="#">9cf9e4fdf1ecf0f9dcf9e4fdf1ecf0f9b2f3eefb</a>

If bots scrape the website, they won't see an easily recognisable email, which will hopefully go some way to prevent it being scraped.

There are lots of different points at which you could decode the string, depending on the experience you want. You could keep the string encoded on your website until someone clicks a "reveal" button for example. I had a very simple use case, so I chose to automatically decode the email immediately when the page loads.

// Find all the elements on the page that use class="eml-protected"
var allElements = document.getElementsByClassName("eml-protected");

// Loop through all the elements, and update them
for (var i = 0; i < allElements.length; i++) {
    updateAnchor(allElements[i]);
}

function updateAnchor(el) {
    // Fetch the hex-encoded string (the element's text, ignoring surrounding whitespace)
    var encoded = el.textContent.trim();

    // decode the email, using the decodeEmail() function from before
    var decoded = decodeEmail(encoded);

    // Replace the text (displayed) content
    el.textContent = decoded;

    // Set the link to be a "mailto:" link
    el.href = 'mailto:' + decoded;
}

Hopefully the code is self-explanatory, but I'll walk through it here:

  • Find all elements on the page with the class eml-protected
  • For each element:
    • Fetch the inner text (9cf9e4fdf1ecf0f9dcf9e4fdf1ecf0f9b2f3eefb in the example above)
    • Run the inner text through the decoder, to get the real email address
    • Replace the text of the anchor to be `example@example.org`
    • Set the href of the anchor to be mailto:example@example.org.

The code is functionally complete, but it takes a lot of shortcuts:

  • No error checking or handling
  • Assumes that all eml-protected elements are <a> tags
  • Assumes the document is fully loaded before the script runs
  • Assumes the encoded email isn't corrupted or invalid
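The shortcuts listed above could be tightened up along the following lines. This is a minimal sketch under stated assumptions, not a complete solution: isValidEncodedEmail, updateAnchorSafely and whenReady are hypothetical helper names, and the decode argument is assumed to behave like the decodeEmail() function from earlier.

```javascript
// True only for an even-length hex string: a key pair plus at least one character pair
function isValidEncodedEmail(value) {
    return typeof value === "string" && /^(?:[0-9a-f]{2}){2,}$/i.test(value);
}

// Returns true if the element was updated, false if it was left alone
function updateAnchorSafely(el, decode) {
    // Only touch <a> elements
    if (!el || String(el.tagName).toUpperCase() !== "A") return false;

    // Skip anything that isn't a well-formed encoded string
    var encoded = (el.textContent || "").trim();
    if (!isValidEncodedEmail(encoded)) return false;

    // A very loose sanity check on the result, not real email validation
    var decoded = decode(encoded);
    if (decoded.indexOf("@") === -1) return false;

    el.textContent = decoded;
    el.href = "mailto:" + decoded;
    return true;
}

// Run fn once the DOM has been parsed (browser only)
function whenReady(fn) {
    if (document.readyState === "loading") {
        document.addEventListener("DOMContentLoaded", fn);
    } else {
        fn();
    }
}

// Wire everything up when running in a page that also defines decodeEmail()
if (typeof document !== "undefined" && typeof decodeEmail === "function") {
    whenReady(function () {
        var els = document.getElementsByClassName("eml-protected");
        for (var i = 0; i < els.length; i++) {
            updateAnchorSafely(els[i], decodeEmail);
        }
    });
}
```

Elements that fail any check are simply left untouched, so a corrupted string stays as hex rather than producing a broken mailto: link.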

If you're applying this approach to a larger site, you don't have strict control over the contents, or any of these assumptions don't hold, then you'll probably need to be more careful. For my purposes, this is more than enough 🙂
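The "reveal" idea mentioned earlier could look something like the sketch below. It assumes hypothetical markup where the encoded string lives in a data-eml attribute and the link text is a placeholder, for example <a class="eml-reveal" href="#" data-eml="9cf9e4...">Show email</a>. The revealOnClick helper, the eml-reveal class and the data-eml attribute are all made up for illustration, and decode is assumed to behave like decodeEmail().

```javascript
// Sketch: keep the address encoded until the reader first clicks the link
function revealOnClick(el, decode) {
    function reveal(event) {
        // Stop the placeholder href="#" from navigating
        event.preventDefault();

        // Decode the value stored in the (hypothetical) data-eml attribute
        var decoded = decode(el.getAttribute("data-eml"));
        el.textContent = decoded;
        el.href = "mailto:" + decoded;

        // A second click should follow the mailto: link as normal
        el.removeEventListener("click", reveal);
    }
    el.addEventListener("click", reveal);
}

// Wire up every matching link when running in a page that also defines decodeEmail()
if (typeof document !== "undefined" && typeof decodeEmail === "function") {
    var links = document.getElementsByClassName("eml-reveal");
    for (var i = 0; i < links.length; i++) {
        revealOnClick(links[i], decodeEmail);
    }
}
```

The trade-off is an extra click for real readers in exchange for the address never appearing in the DOM for bots that execute scripts but don't interact with the page.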

Summary

In this post I showed how you can obfuscate email addresses on a website to make it harder for bots to easily scrape them. The encoding scheme is based on the one used in Cloudflare's Scrape Shield product, which uses a simple XOR scheme to hide the data as a hex-string. This is not at all "secure", especially as the key for decoding is included in the string, but it serves its purpose of obfuscating emails from automated systems.
