Protect Your Contact Information From Crawlers

Bastian Heinlein (@bahe007)

In Germany, we are required to publish contact information, including an email address and a phone number, on every commercial website. I never put much effort into protecting this data, but recently I've started to receive an increasing amount of spam on some public email addresses. While it is still manageable with a proper spam filter, I decided to take some countermeasures when I updated agty.de in order to prevent bots from detecting my contact data.

Side Constraints

The solution should be accessible and have minimal CO2 impact. Hence, it shouldn't depend on jQuery or other frameworks; if it uses JavaScript at all, it should be plain JavaScript. Also, it shouldn't simply display the address as an image, as this would increase the page's size and make things harder for people who depend on accessibility features.

And of course, it should look at least ok and be simple to use.

Simplifications

Because my project is very small and doesn't use a real database, I decided to not detect contact information like email addresses or phone numbers automatically. Instead, I searched for those manually.

Assumptions

While I'm no expert on entity recognition or information extraction, I made some educated guesses about these tasks:

  1. It's trivial to detect an email address if it's used like this: <a href="mailto:hi@test.de">hi@test.de</a>.
  2. You could probably create a simple regular expression to find email addresses that aren't embedded in <a> tags. Hence, it's no big help to just write an email address without any tags in a normal paragraph.
  3. Most crawlers will simply load the HTML file and not even execute any JavaScript on startup.
  4. Even more crawlers won't execute JavaScript if it's tied to a user action like clicking a button.
  5. Yet, a lot of crawlers will apply their regular expressions also to JavaScript and CSS files.
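
To make assumptions 1, 2 and 5 a bit more concrete, here is a minimal sketch of what a purely text-based harvester might do. The pattern and the name `findEmails` are my own illustrative assumptions, not taken from any real crawler; real-world patterns are probably more elaborate.

```javascript
// Hypothetical, deliberately simple pattern; real harvesters may use different ones.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Returns the unique address-like strings found in any text (HTML, JS or CSS source).
function findEmails(source) {
    const matches = source.match(EMAIL_PATTERN) || [];
    return [...new Set(matches)];
}

const html = '<a href="mailto:hi@test.de">hi@test.de</a>';
const paragraph = "<p>Write to hi@test.de for details.</p>";
const script = 'let email = "test"; email = email.concat("@");';

console.log(findEmails(html));      // finds hi@test.de
console.log(findEmails(paragraph)); // finds hi@test.de
console.log(findEmails(script));    // finds nothing: no complete address in the source
```

The last call hints at why splitting the address into fragments might help: a pattern like this needs the whole address in one piece.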

A Simple Solution

Based on my assumptions, side constraints and simplifications, I came up with this solution:

Wherever you would write your email address, post this code snippet instead. The button tag is used in the hope of better accessibility although JavaScript is required.

<button class="show-email">Display E-Mail-Address <noscript>(requires JavaScript)</noscript></button>

Add event listeners to all email buttons that might be in this document. While we could use each button's onclick attribute, attaching the listeners in one place keeps the button markup smaller.

document.addEventListener("DOMContentLoaded", function(e) {
    let emailButtons = document.getElementsByClassName("show-email");
    for (let i = 0; i < emailButtons.length; i++) {
        emailButtons[i].addEventListener("click", showEmail);
    }
});

This function does the actual work: whenever an email button is clicked, its text is replaced with the email address. To prevent bots from finding the address in the JavaScript code, I split it into several parts.

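function showEmail(evt) {
    // currentTarget is always the button the listener is attached to
    let target = evt.currentTarget;
    // The address is assembled from fragments so it never appears as one string in the source
    let email = "test";
    email = email.concat("@");
    email = email.concat("test");
    email = email.concat(".de");
    // textContent inserts the value as plain text instead of parsing it as HTML
    target.textContent = email;
}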
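
If you also want the revealed address to be clickable, one possible variation is to swap the button for a mailto link on click. This is only a sketch under my own assumptions, not the code running on agty.de: `assembleEmail` and `revealAsLink` are hypothetical names, and the DOM part is only wired up when a `document` exists.

```javascript
// Joins the pieces at runtime so the full address never appears as one literal in the source.
function assembleEmail(user, domain, tld) {
    return [user, "@", domain, ".", tld].join("");
}

// Replaces the clicked button with a mailto link (browser only).
function revealAsLink(evt) {
    const button = evt.currentTarget;
    const email = assembleEmail("test", "test", "de");
    const link = document.createElement("a");
    link.href = "mailto:" + email;
    link.textContent = email;
    button.replaceWith(link);
    // Move keyboard focus to the new link so it is not lost with the removed button.
    link.focus();
}

// Only register the handlers when running in a browser.
if (typeof document !== "undefined") {
    document.addEventListener("DOMContentLoaded", function () {
        for (const button of document.getElementsByClassName("show-email")) {
            button.addEventListener("click", revealAsLink);
        }
    });
}
```

This would be used instead of the click handler above, not in addition to it.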

Some styling is always a good idea; of course, it should be adapted to fit your website's design.

.show-email {
    border: none;
    /* outline is left at the browser default so keyboard focus stays visible */

    padding: 0;
    margin: 0;

    box-shadow: none;
    background-color: white;

    font-size: 10pt;
    font-weight: bold;

    cursor: pointer;
}

.show-email:hover {
    text-decoration: underline;
}

Of course, you could also use this technique for other types of information like your name, phone number, social media profiles or whatever. I consider it likely that this technique keeps the information available to humans but not to crawlers. At the very least, it should make it harder for crawlers to detect your information automatically.

Call For Action

While the proposed solution hopefully works, I invite you to share your own ideas about this problem. Most likely, you'll have much better thoughts than me!


Discussion


I usually encode the email part (and other sensitive data) as numeric HTML character references, alternating decimal and hex codes, like this:


&#109;&#x61;&#105;&#x6C;&#116;&#x6F;:&#105;&#x6E;&#102;&#x6F;&#64;&#x65;&#120;&#x61;&#109;&#x70;&#108;&#x65;&#46;&#x63;&#111;&#x6D;

which will render as

mailto:info@example.com

My email has been embedded like this, and I haven't received spam so far. The advantage is that users see the right values, because the browser renders the references as normal characters (a normal user doesn't even notice that the characters are written as HTML character codes); clicking on a mail address works too, and no JS is required.

@edit: fun fact: this form also interpreted the hex encoded values :)
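
The encoding described in this comment can be sketched in a few lines. The function names are hypothetical, and unlike the example above, this version encodes every character and simply alternates by position, so its output will not match that string exactly.

```javascript
// Encodes every character as an HTML character reference,
// alternating decimal (&#109;) and hexadecimal (&#x61;) notation.
function toCharacterReferences(text) {
    return Array.from(text)
        .map((char, index) => {
            const codePoint = char.codePointAt(0);
            return index % 2 === 0
                ? "&#" + codePoint + ";"
                : "&#x" + codePoint.toString(16).toUpperCase() + ";";
        })
        .join("");
}

// Reverses the encoding for numeric references, roughly as an HTML parser would.
function fromCharacterReferences(encoded) {
    return encoded.replace(/&#(x?)([0-9A-Fa-f]+);/g, (match, hex, digits) =>
        String.fromCodePoint(parseInt(digits, hex ? 16 : 10))
    );
}

const encoded = toCharacterReferences("info@example.com");
console.log(encoded);
console.log(fromCharacterReferences(encoded)); // info@example.com
```

The decoder also illustrates the weakness raised in the replies: undoing this encoding takes one line, so it only helps against crawlers that don't bother.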

 

yes, this is a nice approach, but the problem is that it works with "text" crawlers only. nowadays, there are many "headless browsers", which actually render the page's dom in memory, even run javascript code, and then crawl the output.

this of course applies to Mr. Heinlein's approach as well.

in both cases, just add recaptcha and you'll be good... for now...

 

I know that this is possible, but luckily there aren't many crawlers out there yet that harvest emails by running JavaScript - I guess. Maybe that "business" is not so interesting any more? That solution is simple and easy to implement, but doesn't keep all bots out.

And if I ever can avoid captchas, I will. Example: dirty-co.de/user-experience/wenn-d... - I wanted to get information about my package which should be delivered by DHL but their captcha didn't load properly so I was stuck there ... bad UX!

lol, i had the exact same captcha-not-shown problem with huawei website some time ago.

but google's recaptcha is pretty neat lately. you don't even have to copy/write anything if recaptcha evaluates you as "human" (which it does in most cases).

Ah well, I didn't read carefully enough ;). So if you're respecting the GDPR (as we are forced to here in Germany), you may run into a new issue when trying to use Google Recaptcha ... :S

This! But it's not only the regulations like GDPR, I personally wouldn't like to give Google more information about the people using my websites and even more important about the website's usage.

oh, i didn't think about that in the GDPR context, why is there a problem with recaptcha? (i'm more into tech than law stuff, so i don't know).

This is my imperfect understanding: While it is still possible to use Google Recaptcha, it causes a lot of privacy headaches, because Google processes personal data and places cookies. The latter means that you'll need at least some kind of cookie banner and must make sure that cookies are only placed after this was explicitly allowed. More important is the former: Google not only processes personal data, but possibly does so in the US or somewhere else. This means - as I understand it - that you'll need some kind of contract with them to protect your and your users' interests. That is usually a standard contract; however, it is a legally binding contract.

And in some related cases of which I'm aware, courts ruled that you could theoretically be partially responsible if your contract partner disobeys privacy regulations.

While this is to the best of my knowledge, there is of course no guarantee that my knowledge, which is probably several months out of date, is still correct.

 

Nice, it seems like the easiest solutions work best, sometimes :-)

 

I usually let CloudFlare handle that for me: they inject a script that mangles mailto: links and only unmangles them in JavaScript.

For websites where CloudFlare is not desirable, I run a similar mangling/demangling algorithm that uses base64 encoding of the address, with some characters replaced and the final string reversed to avoid easy detection of bWFpbHRvOg (base64 for mailto:) at the start.

I found the technique here:
code.luasoftware.com/tutorials/jav...

Example:
github.com/franky47/penelopebuckle...
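
A rough sketch of the base64 idea from this comment could look as follows. It is not CloudFlare's algorithm and not the code from the linked pages; the character swaps in `SWAPS` and the names `mangle` and `demangle` are assumptions made up for illustration. It relies on `btoa` and `atob`, which exist in browsers and in recent Node.js versions, and it assumes a plain ASCII address.

```javascript
// Hypothetical character swaps applied on top of base64; any reversible mapping would do.
// The replacement characters must not occur in the base64 alphabet.
const SWAPS = [["=", "_"], ["+", "-"], ["/", "."]];

// Base64-encodes "mailto:" plus the address, swaps some characters, then reverses the string.
function mangle(address) {
    let encoded = btoa("mailto:" + address);
    for (const [from, to] of SWAPS) {
        encoded = encoded.split(from).join(to);
    }
    return Array.from(encoded).reverse().join("");
}

// Undoes the steps of mangle in the opposite order.
function demangle(mangled) {
    let encoded = Array.from(mangled).reverse().join("");
    for (const [from, to] of SWAPS) {
        encoded = encoded.split(to).join(from);
    }
    return atob(encoded);
}

const mangled = mangle("info@example.com");
console.log(mangled);
console.log(demangle(mangled)); // mailto:info@example.com
```

In a page, the mangled string could be stored in the markup and passed through `demangle` only when the link is needed. Note that once an address follows `mailto:`, only the first nine characters of the base64 prefix (`bWFpbHRvO`) stay constant, because the tenth depends on what comes next; reversing the string hides the prefix either way.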

 

Thanks for sharing this solution, I really like it!

 

Most of the time, it's not your website that's prone to crawling but all the other places: any registry or 3rd party service that may have a data leak.
If I were you, I would present my regular gmail address. Perhaps it doesn't look "professional" but has the best spam filter that's available on the market.
For real clients for whom you run presentations, you can use expirable one time links dedicated for each client. You put your professional e-mail there. And on the business cards.

 

My personal, statistically insignificant experience from some (accidental) A/B testing is that crawlers are a bigger problem than database breaches, especially considering that most of the spam messages were explicitly targeted at businesses.

 

Or use G Suite for business and get the best of both worlds?

 
 

Yes, but by doing that you also shut out the Google "crawlers" that look up your contact details in order to reference them. That could be bad for some websites.

 

Well, of course you also make life harder for "good" crawlers, but as far as I am aware, you can manually enter and edit business data on Google's website.