Scrapfly

Posted on Apr 10, 2023 • Originally published at scrapfly.io on Mar 13, 2023

How to Bypass Datadome Anti Scraping in 2023

#scraperblocking

Datadome is an anti-bot and anti-scraping service used by websites like Leboncoin, Vinted, Deezer etc. to block non-human visitors.

In this article, we'll take a look at how to bypass Datadome anti-scraping protection. We'll start with a quick look at what Datadome is, how to identify it and how it identifies web scrapers. Then, we'll go over existing techniques and tools for bypassing Datadome bot protection. Let's dive in!

What is Datadome?

Datadome is a paid WAF service that protects websites from bots. It has legitimate uses, such as blocking malicious bots and scripts, but it's also used by websites to block web scrapers from accessing public data.

It is particularly popular with European websites like Leboncoin, Vinted, Deezer, Malt and many others.

Datadome Block Page Example

Most Datadome blocks result in HTTP status codes in the 400-500 range (usually 403). The error message can take many forms, but it usually asks for javascript to be enabled or for a captcha to be solved.

Datadome block page on Leboncoin website

These errors are mostly encountered on the first request to the website. However, since Datadome uses AI behavior analysis, it can also start blocking after a few successful requests.
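
To make block detection concrete, here's a minimal sketch of a check a scraper could run on every response. The markers it looks for (a `datadome` cookie, Datadome-named headers and the `captcha-delivery.com` captcha host) are assumptions based on commonly observed block pages rather than a documented contract, so treat the function as a heuristic:

```python
def looks_like_datadome_block(status_code: int, headers: dict, body: str) -> bool:
    """Heuristically guess whether a response is a Datadome block page."""
    # Blocks usually arrive as 4xx/5xx responses, most often 403.
    if status_code < 400:
        return False

    header_names = {name.lower() for name in headers}
    set_cookie = str(headers.get("Set-Cookie", headers.get("set-cookie", ""))).lower()
    body_lower = body.lower()

    # Assumed markers: none of these is guaranteed to be present.
    markers = (
        any("datadome" in name for name in header_names),
        "datadome=" in set_cookie,
        "datadome" in body_lower,
        "captcha-delivery.com" in body_lower,
    )
    return any(markers)
```

A scraper can call this after each request and switch proxy or fingerprint configuration as soon as it returns `True`, instead of retrying blindly.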

How does Datadome Detect Web Scrapers?

To detect web scrapers, Datadome uses several different techniques to estimate the likelihood that the connecting user is a human rather than a bot.

Datadome looks at connection metrics like encryption type (TLS), HTTP protocol use and the javascript engine to determine a trust score.

Based on the final trust score Datadome either lets the user in, blocks them or requests a captcha challenge to be solved.
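
Datadome's real scoring is not public, but the idea can be sketched with a toy model. Everything below (the `ClientSignals` fields, the weights and the thresholds) is invented for illustration and is not Datadome's actual algorithm:

```python
from dataclasses import dataclass


@dataclass
class ClientSignals:
    tls_looks_like_browser: bool
    ip_type: str  # "residential", "mobile" or "datacenter"
    uses_http2: bool
    headers_match_browser: bool
    passed_js_fingerprint: bool


# Hypothetical weights: positive values raise trust, negative values lower it.
IP_WEIGHTS = {"residential": 20, "mobile": 20, "datacenter": -40}


def trust_score(signals: ClientSignals) -> int:
    """Combine individual signals into a score clamped to 0..100."""
    score = 50
    score += 15 if signals.tls_looks_like_browser else -25
    score += IP_WEIGHTS.get(signals.ip_type, 0)
    score += 5 if signals.uses_http2 else -10
    score += 10 if signals.headers_match_browser else -20
    score += 15 if signals.passed_js_fingerprint else -30
    return max(0, min(100, score))


def decide(score: int) -> str:
    """Map a score to one of the three outcomes: allow, captcha or block."""
    if score >= 70:
        return "allow"
    if score >= 40:
        return "captcha"
    return "block"
```

In this toy model a client that gets everything right except for using a datacenter IP lands in the captcha range, which illustrates why a single weak signal can undo an otherwise good setup.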

This complex process is done in real-time making web scraping difficult as many factors can influence the trust score. However, by understanding each step of this process we have a good chance of bypassing Datadome bot protection. Let's take a look at each step in detail.

TLS Fingerprinting

TLS (or SSL) is the first step in the HTTP connection. When using encrypted connections (like https instead of http) the server and client have to negotiate an encryption method. Since there are many different ciphers and encryption methods available, the negotiation itself can give away a lot of information about the client.

This is generally referred to as JA3 fingerprinting. Different operating systems, web browsers and programming libraries support different TLS versions, ciphers and extensions, which results in different JA3 fingerprints.

If a scraper uses a library whose TLS capabilities differ from those of a usual web browser, it can be identified using this method.

So, use web scraping libraries and tools that are resistant to JA3 fingerprinting.

There are many online tools like ScrapFly's JA3 fingerprint web tool that can be used to validate your tools for JA3 fingerprinting.

For more, see our full introduction to TLS fingerprinting, which covers the topic in greater detail.
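
To show why different clients end up with different fingerprints, here's a minimal sketch of how a JA3 fingerprint is built: five fields from the TLS Client Hello (version, ciphers, extensions, elliptic curves and point formats) are joined into a string and hashed with MD5. The example values below are illustrative and not captured from a specific browser:

```python
import hashlib


def ja3_string(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string: fields joined by ',', values inside a field by '-'."""
    groups = (ciphers, extensions, curves, point_formats)
    fields = [str(tls_version)]
    fields += ["-".join(str(value) for value in group) for group in groups]
    return ",".join(fields)


def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Return the MD5 hex digest of the JA3 string."""
    raw = ja3_string(tls_version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()


# Illustrative values only (771 is the decimal code for TLS 1.2).
example = dict(
    tls_version=771,
    ciphers=[4865, 4866, 4867],
    extensions=[0, 23, 65281, 10, 11],
    curves=[29, 23, 24],
    point_formats=[0],
)
print(ja3_string(**example))
print(ja3_hash(**example))
```

Since the hash covers the exact values and their order, a library that offers a different cipher list than a browser produces a different fingerprint even when everything else about the request looks the same.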

IP Address Fingerprinting

Next is IP address analysis. Datadome has access to many different IP databases and can look up the connecting client's IP address. This can be used to identify the client's location, ISP, reputation and other information.

The most important metric used here is the IP address type, as there are 3 different types of IP addresses:

Residential addresses are home addresses assigned by internet providers to average people. So, residential IP addresses provide a positive trust score as these are mostly used by humans and are expensive to acquire.
Mobile addresses are assigned by mobile phone towers to mobile users. So, mobile IPs also provide a positive trust score as these are mostly used by humans. In addition, since mobile towers might share and recycle IP addresses, it is much more difficult to rely on IP addresses for identification.
Datacenter addresses are assigned to various data centers and server platforms like Amazon's AWS, Google Cloud etc. So, datacenter IPs provide a significant negative trust score as they are likely to be used by bots.

Using IP analysis, Datadome can roughly estimate how likely the connecting client is to be a human rather than a bot. For example, very few people browse the web from IPs owned by data centers.

So, use high-quality residential or mobile IP addresses.

For a more in-depth look, see our full introduction to IP blocking.
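
As a rough sketch of what such a lookup involves, the snippet below classifies an address against a table of network ranges. Real services rely on large commercial IP databases; the ranges here are reserved documentation blocks used purely as hypothetical stand-ins:

```python
import ipaddress

# Hypothetical ranges for illustration only (reserved documentation networks).
IP_TYPE_RANGES = {
    "datacenter": [ipaddress.ip_network("192.0.2.0/24")],
    "residential": [ipaddress.ip_network("198.51.100.0/24")],
    "mobile": [ipaddress.ip_network("203.0.113.0/24")],
}


def classify_ip(address: str) -> str:
    """Return the IP type whose range contains the address, or 'unknown'."""
    ip = ipaddress.ip_address(address)
    for ip_type, networks in IP_TYPE_RANGES.items():
        if any(ip in network for network in networks):
            return ip_type
    return "unknown"


print(classify_ip("192.0.2.10"))   # datacenter
print(classify_ip("203.0.113.5"))  # mobile
```

The same kind of check is useful on the scraper side for validating that the proxies in a pool really are of the type the provider claims.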

HTTP Details

The next step is to analyze the HTTP connection details. The HTTP protocol is becoming increasingly complex, which makes it easier to identify connections coming from web scrapers.

To start, much of the web runs on HTTP2 or HTTP3 while many web scraping libraries still default to HTTP1.1. Modern clients like Python's httpx and cURL do support HTTP2, but it's usually not enabled by default.

HTTP2 is also susceptible to HTTP2 fingerprinting which can be used to identify web scrapers. See our http2 fingerprint test page for more info.

Then, request headers and header order play an important role in identifying web scrapers. Since most web browsers have strict header value and order rules, any mismatch like a missing Origin or User-Agent header can be a strong giveaway.

So, make sure to use HTTP2 and match the header values and order of a real web browser.

For more, see our full introduction to the role of request headers in blocking.
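
To illustrate the kind of comparison an anti-bot service can make, here's a minimal sketch that checks a request's header names against an expected browser order. The `EXPECTED_ORDER` list is a simplified, hypothetical profile; real browsers send more headers and the order differs between browsers and versions:

```python
# Simplified, hypothetical browser header profile (lowercase names).
EXPECTED_ORDER = ["host", "user-agent", "accept", "accept-language", "accept-encoding"]


def header_issues(sent_names, expected_order=EXPECTED_ORDER):
    """List missing headers and flag out-of-order ones."""
    sent = [name.lower() for name in sent_names]
    issues = [f"missing: {name}" for name in expected_order if name not in sent]

    # Compare the relative order of the expected headers that were actually sent.
    present = [name for name in sent if name in expected_order]
    wanted = [name for name in expected_order if name in sent]
    if present != wanted:
        issues.append("unexpected header order")
    return issues


print(header_issues(["Host", "Accept", "User-Agent", "Accept-Language", "Accept-Encoding"]))
print(header_issues(["Host", "Accept"]))
```

Running the same check against your own scraper's outgoing headers is a cheap way to spot mismatches before the target website does. As for the protocol version, httpx can use HTTP2 via `httpx.Client(http2=True)` when installed with its `http2` extra.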

Javascript Fingerprinting

Finally, the most complex and hardest step to address is javascript fingerprinting. Datadome is using the client's javascript engine to fingerprint the client machine for details like:

Javascript runtime information
Hardware and operating system details
Web browser information and capabilities

That's a lot of data that can be used in the trust score calculations. Fortunately, javascript fingerprinting takes time to execute and is prone to false positives. In other words, it's not as reliable as other methods and can be bypassed.

There are two ways to bypass javascript fingerprinting.

The obvious one is to inspect and reverse engineer all of the javascript code Datadome is using to fingerprint the client. This is a very time-consuming process and requires a lot of javascript knowledge. It also requires a lot of maintenance as Datadome is constantly updating its fingerprinting code.

A more practical approach is to use a real web browser for web scraping. This can be done using browser automation libraries like Selenium, Puppeteer or Playwright that can start a real headless browser and navigate it for web scraping.

So, introducing browser automation using tools like Selenium, Puppeteer or Playwright is the best way to handle javascript fingerprinting.

Many advanced scraping tools can even combine browser and HTTP scraping capabilities for optimal performance: resource-heavy browsers establish a trust score, and scraping then continues using fast HTTP clients like httpx in Python (this feature is also available using Scrapfly sessions).
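
Here's a minimal sketch of that hybrid approach using Playwright's sync API: a headless browser loads the page once, and the cookies it collects are turned into a `Cookie` header for a plain HTTP client. The helper names are hypothetical, and whether a website accepts browser cookies replayed by a different client is not guaranteed:

```python
def cookie_header(cookies, host):
    """Build a Cookie header value from Playwright cookie dicts matching the host."""
    pairs = []
    for cookie in cookies:
        domain = cookie.get("domain", "").lstrip(".")
        if domain and (host == domain or host.endswith("." + domain)):
            pairs.append(f"{cookie['name']}={cookie['value']}")
    return "; ".join(pairs)


def browser_session(url):
    """Load the page in headless Chromium and return its HTML and cookies."""
    # Imported here so the cookie helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        page.goto(url)
        html = page.content()
        cookies = context.cookies()
        browser.close()
    return html, cookies
```

The returned cookies can then be passed as `cookie_header(cookies, "www.example.com")` in the headers of follow-up requests, ideally sent through the same proxy and with the same User-Agent as the browser.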

Behavior Analysis

Even when all of the above steps are passed Datadome can still block the client if it detects suspicious behavior. Datadome is using AI to analyze connection patterns and user profiles.

This means the trust score is not a static number but is constantly being adjusted based on the client's behavior.

So, it's important to distribute web scraper traffic through multiple different agents using proxies and different fingerprinting configurations.

For example, when scraping using browser automation tools, it's important to vary browser profile details like screen size, operating system and rendering capabilities.
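
A minimal sketch of this kind of rotation could look like the following. The profile fields, value pools and proxy addresses are all hypothetical placeholders:

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BrowserProfile:
    operating_system: str
    viewport: tuple  # (width, height) in pixels
    proxy: str


# Hypothetical pools to draw profile details from.
OPERATING_SYSTEMS = ["windows", "macos", "linux"]
VIEWPORTS = [(1920, 1080), (1366, 768), (1536, 864)]
PROXIES = [
    "http://proxy-1.example:8000",
    "http://proxy-2.example:8000",
    "http://proxy-3.example:8000",
]


def random_profile(rng: random.Random) -> BrowserProfile:
    """Pick an independent random value for each profile detail."""
    return BrowserProfile(
        operating_system=rng.choice(OPERATING_SYSTEMS),
        viewport=rng.choice(VIEWPORTS),
        proxy=rng.choice(PROXIES),
    )


# Each scraping session gets its own profile.
rng = random.Random()
for session_number in range(3):
    print(session_number, random_profile(rng))
```

When using Playwright, the viewport can be applied with `browser.new_context(viewport={"width": w, "height": h})`. Keep the details of a single profile consistent with each other, as a mismatch (for example a mobile User-Agent with a desktop screen size) can itself be a giveaway.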

How to Bypass Datadome Bot Protection?

As we can see, Datadome uses a complex process to identify web scrapers. Fortunately, this can work to our advantage: by avoiding common pitfalls and the telltale details of web scraper clients, it is possible to bypass Datadome bot protection. Here's a quick summary:

Use high-quality residential or mobile IP addresses
Use HTTP2 and match the header values and order of a real web browser
Introduce browser automation using tools like Selenium, Puppeteer or Playwright
Distribute web scraper traffic through multiple different agents

Note that as Datadome develops it introduces more techniques to identify web scrapers. So, it's important to keep up with the latest developments and use the latest web scraping tools.

For example, Datadome was recently updated with the capability to detect headless browser use. So, for Datadome bypass in 2023, plugins like Puppeteer stealth need to be used when web scraping.

Bypass with Scrapfly

While bypassing Datadome is possible, maintaining bypass strategies can be very time-consuming. This is where services like ScrapFly come in!

Using ScrapFly web scraping API we can hand over all of the web scraping complexity and bypass logic to an API!

Scrapfly is not only a Datadome bypasser but also offers many other web scraping features:

Millions of residential proxies from 50+ countries
Datadome and any other anti-scraping protection bypass
Headless cloud browsers that render javascript pages and automate browser actions
Python SDK
Easy monitoring and debugging tools

For example, to scrape pages protected by Datadome or any other anti-scraping service, when using ScrapFly SDK all we need to do is enable the Anti Scraping Protection bypass feature:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.leboncoin.fr/",
    asp=True,
    # we can also enable headless browsers to render web apps and javascript powered pages
    render_js=True,
    # and set proxies by country like France
    country="FR",
    # and proxy type like residential:
    proxy_pool="public_residential_pool",
))
print(result.scrape_result)

FAQ

To wrap up this article, let's take a look at some frequently asked questions regarding web scraping Datadome protected pages:

Is it legal to scrape Datadome protected pages?

Yes. Web scraping publicly available data is generally legal in most places as long as the scrapers do not cause damage to the website.

Is it possible to bypass Datadome using cache services?

Yes, public page caching services like Google Cache or Archive.org can be used to bypass Datadome protected pages as Google and Archive tend to be whitelisted. However, not all pages are cached, and those that are tend to be outdated, making them unsuitable for web scraping. Cached pages can also be missing parts of content that are loaded dynamically.
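
As a minimal sketch, Archive.org offers a Wayback Machine availability endpoint that reports the closest archived snapshot of a URL. The helper names below are hypothetical, and as noted, a snapshot may not exist or may be outdated:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def availability_url(page_url: str) -> str:
    """Build the Wayback Machine availability API URL for a page."""
    return "https://archive.org/wayback/available?" + urlencode({"url": page_url})


def closest_snapshot(payload: dict):
    """Return the closest snapshot URL from an API response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def find_snapshot(page_url: str):
    """Query the API over the network and return a snapshot URL, or None."""
    with urlopen(availability_url(page_url), timeout=10) as response:
        return closest_snapshot(json.load(response))
```

Calling `find_snapshot("example.com")` returns a `web.archive.org` URL that can be scraped in place of the protected page when one is available.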

Is it possible to bypass Datadome entirely and scrape the website directly?

This is more of an internet security problem as that would be possible only by taking advantage of a vulnerability. This can be illegal in some countries and is often very difficult to do either way.

What are some other anti-bot services?

There are many other anti-bot WAF services like Cloudflare, Akamai, Imperva (aka Incapsula) and PerimeterX. They function very similarly to Datadome, so everything in this tutorial can be applied to them as well.

Summary

In this article, we took a deep dive into Datadome anti-bot protection when web scraping.

To start, we've taken a look at how Datadome identifies web scrapers through TLS, IP and javascript client fingerprinting. We saw that using residential proxies and fingerprint-resistant libraries is a good start. Using real web browsers and remixing their fingerprint data can make web scrapers much more difficult to detect.

Finally, we've taken a look at some frequently asked questions like alternative bypass methods and the legality of it all.

For an easier way to handle web scraper blocking and power up your web scrapers check out ScrapFly for free!
