How to bypass Cloudflare when web scraping in 2023

Cloudflare is mostly known for its CDN service, though when it comes to web scraping it's the Cloudflare Bot Management service that is notorious.

Cloudflare can restrict who can access a website's content, and this is where the need to bypass Cloudflare when web scraping arises.

To bypass Cloudflare bot management we should first take a quick look at how it works. Then, we can identify the challenges and design a strategy.

In this article, we'll first take a look at how Cloudflare is using various web technologies to calculate a trust score and then we'll take a look at existing solutions that increase this trust score when web scraping.

We will also cover common Cloudflare errors and signs that indicate that requests have failed to bypass Cloudflare and what they mean exactly. Let's dive in!

What Is Cloudflare Bot Management?

Cloudflare Bot Management is a web service that tries to detect and block web scrapers and other bots from accessing the website.

It's a complex multi-tier service that is usually used in legitimate bot and spam prevention but it's becoming an increasingly popular way to block web scrapers from accessing public data.

To start, let's take a look at some common Cloudflare errors that scrapers encounter and what they mean.

Popular Cloudflare Errors

Most Cloudflare bot blocks result in HTTP status codes 403 (most commonly), 401, 429 and 502. More importantly, though, the response body contains the actual error codes and definitions. These codes can help us understand what's going on and help us bypass Cloudflare 403 errors.

There are several different error messages that indicate we have failed to bypass Cloudflare:

Cloudflare Error 1020: Access Denied is one of the most commonly encountered errors when scraping Cloudflare-protected pages, and it doesn't indicate the exact cause. So, to bypass Cloudflare 1020, full scraper fortification is needed.

Cloudflare Error 1009 comes with a message of "... has banned the country or region of your IP address". This is caused by the website being geographically locked to specific countries. So, to bypass Cloudflare 1009, proxies from allowed countries can be used. In other words, if the website is only available in the US, the scraper needs a US proxy to bypass this error.

Cloudflare Error 1015: You are being rate limited means the scraper is scraping too fast. While it's best to respect rate limits when web scraping, this limit can be set really low. To bypass Cloudflare 1015, the scraper traffic should be distributed through multiple agents (proxies, browsers etc.).

Cloudflare Error 1010: Access Denied is caused by a blocked browser fingerprint. This is often encountered when scraping using headless browsers without fingerprinting obfuscations. To bypass Cloudflare 1010, headless browsers need to be fortified against javascript fingerprinting.

There's also the Cloudflare challenge page (aka browser check), which doesn't indicate a block but rather a lack of trust that the client isn't a bot. So, to bypass the Cloudflare browser check we can either raise our general trust rating or solve the challenge (we'll cover this in the bypass section below).

Some of these Cloudflare security check pages can ask the client to solve captcha challenges, though the best way to implement a Cloudflare captcha bypass is to not encounter it at all!

Let's take a look at how exactly Cloudflare detects web scrapers next.

Finally, here's a list of all Cloudflare block error artifacts (a short detection sketch follows the list):

  • Response headers might contain a cf-ray field.
  • The Server header has the value cloudflare.
  • Set-Cookie response headers include a __cfduid= cookie field.
  • "Attention Required!" or "Cloudflare Ray ID:" in HTML.
  • "DDoS protection by Cloudflare" in HTML.
  • Cloudflare-branded 5xx error pages when requesting invalid URLs.
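
As a minimal sketch, here's how these artifacts could be checked for in Python using httpx (the URL is a placeholder and the checks are heuristic, not exhaustive):

import httpx

def is_cloudflare_block(response: httpx.Response) -> bool:
    """Heuristic check for the Cloudflare block artifacts listed above."""
    served_by_cloudflare = (
        "cf-ray" in response.headers
        or response.headers.get("server", "").lower() == "cloudflare"
    )
    body_markers = ("Attention Required!", "Cloudflare Ray ID:", "DDoS protection by Cloudflare")
    blocked_body = any(marker in response.text for marker in body_markers)
    return response.status_code in (401, 403, 429, 502) and (served_by_cloudflare or blocked_body)

response = httpx.get("https://example.com/")  # placeholder URL
if is_cloudflare_block(response):
    print("request appears to be blocked by Cloudflare")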

How Does Cloudflare Detect Web Scrapers?

To detect web scrapers, Cloudflare uses many different technologies to determine whether traffic is coming from a human user or a machine.

Cloudflare combines the results of many different analysis and fingerprinting methods into an overall trust score. This score determines whether the user is allowed to visit the website.

Based on the final trust score, the user can be let through, asked to solve a challenge (like a captcha or a computational javascript proof of work), or blocked entirely.

In addition, Cloudflare tracks the continuous behavior of the user and constantly adjusts the trust score.

This complex operation makes web scraping difficult but if we take a look at each individual tracking component we can see that bypassing Cloudflare when web scraping is very much possible!

TLS Fingerprinting

TLS (or SSL) is the first thing that happens when we establish a secure connection (i.e. using https instead of http). The client and the server negotiate how the data will be encrypted.

This negotiation can be fingerprinted as modern web browsers have very similar TLS capabilities that some web scrapers might be missing. This is generally referred to as JA3 fingerprinting.

For example, some web scraping libraries and tools have unique TLS negotiation patterns that can be instantly recognized, while others use the same TLS techniques as a web browser and can be very difficult to differentiate.

So, use web scraping libraries that are resistant to JA3 fingerprinting.
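
For example, one option is the curl_cffi library, a Python binding to curl-impersonate that mimics a real browser's TLS handshake. Here's a minimal sketch (the URL is a placeholder and the exact impersonation target names depend on the installed version):

# pip install curl_cffi
from curl_cffi import requests

# impersonate Chrome's TLS (JA3) and HTTP2 fingerprints
response = requests.get(
    "https://example.com/",   # placeholder URL
    impersonate="chrome110",  # assumed target name; newer versions may offer other targets
)
print(response.status_code)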

For more see our full introduction to TLS fingerprinting which demonstrates how to test your tools for JA3 and our JA3 fingerprint web tool for a more in-depth look at this technique.

IP Address Fingerprinting

Many factors play a role in IP address analysis. To start, there are different types of IP addresses:

  • Residential IPs are home addresses assigned by ISPs to average internet consumers. So, residential IP addresses provide a positive trust score as these are mostly used by humans and are expensive to acquire.
  • Mobile IPs are assigned by mobile phone towers. So, mobile IPs also provide a positive trust score as these are mostly used by humans. In addition, since many mobile users share and recycle IP addresses between each other (there's one tower handling everyone), Cloudflare can't reliably fingerprint these IPs.
  • Datacenter IPs are assigned to various datacenters and server platforms like Google Cloud, AWS, etc. So, datacenter IPs carry a significant negative trust score as they are most likely to be used by non-humans.

With IP analysis Cloudflare can have a rough guess at how trustworthy the connecting client is. For example, people rarely browse from datacenter IPs thus web scrapers using datacenter proxies are very likely to be blocked.

So, use high-quality residential or mobile proxies.
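
As a rough sketch, routing a Python httpx client through a residential proxy could look like this (the proxy URL and credentials are placeholders; newer httpx versions use a proxy argument instead of proxies):

import httpx

# placeholder residential proxy credentials
proxy_url = "http://username:password@residential-proxy.example.com:8000"

with httpx.Client(proxies=proxy_url) as client:
    response = client.get("https://example.com/")  # placeholder URL
    print(response.status_code)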

For more see our full introduction to IP blocking and how IP trust is being calculated.

HTTP Details

Since most human users browse with one of only a few web browsers, the HTTP connection details are an easy way to identify scrapers and bots.

To start, most of the web uses HTTP2 while many web scraping tools still use HTTP1.1, which is a dead giveaway.

In addition, the HTTP2 connection itself can be fingerprinted, so scrapers using plain HTTP clients need to account for this. See our http2 fingerprint test page for more info.

Other HTTP connection details like request headers can influence the trust score as well. For example, most web browsers send their request headers in a specific order, which can differ from the HTTP libraries used in web scraping.

So, make sure the headers in web scraper requests match those of a real web browser, including the ordering.
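
For instance, here's a minimal sketch using Python's httpx with HTTP2 enabled and Chrome-like headers in a browser-like order (the header values are only illustrative approximations, and the client may still add a few defaults of its own):

# pip install "httpx[http2]"
import httpx

# approximate Chrome-like headers in a browser-like order (illustrative values)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
}

with httpx.Client(http2=True, headers=headers) as client:
    response = client.get("https://example.com/")  # placeholder URL
    print(response.http_version)  # "HTTP/2" when the server supports it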

For more see our full introduction to the role request headers play in blocking.

Javascript Fingerprinting

Finally, javascript provides a lot of information about the connecting client which is used in the trust score calculations. Since javascript allows arbitrary code execution on the client's machine it can be used to extract a lot of information about the connecting user.

Using Javascript code the server can fingerprint:

  • Javascript runtime details
  • Hardware details and capabilities
  • Operating system details
  • Web browser details

That's a lot of information that can be used in calculating the trust score.

Fortunately, javascript is intrusive and takes time to execute, so it's disliked by bots and humans alike, which limits practical javascript fingerprinting techniques. In other words, nobody wants to wait 5 seconds for the page to load.

For a more in-depth look see our article on javascript use in web scraper detection.

Bypassing javascript fingerprinting is by far the most difficult task here. In theory, it's possible to reverse engineer and simulate these javascript tasks but a much more accessible and common approach is to use a real web browser for web scraping.

This can be done using Selenium, Puppeteer or Playwright browser automation libraries that can start a real headless browser and navigate it for web scraping.

So, introducing browser automation to your scraping pipeline can drastically increase the trust score.
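
For example, a minimal Playwright sketch that loads a page in a real (headless) Chromium browser and lets it execute the page's javascript (the URL is a placeholder):

# pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/")        # placeholder URL
    page.wait_for_load_state("networkidle")  # let scripts finish executing
    html = page.content()                    # fully rendered page HTML
    browser.close()

print(len(html))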

More advanced scraping tools can even combine browser and HTTP scraping capabilities for optimal performance: use resource-intensive browsers to establish a trust score, then continue scraping with fast HTTP clients like httpx in Python (this feature is also available using Scrapfly sessions).
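
Here's a rough sketch of that hybrid approach, assuming the established trust is carried in session cookies (an assumption - the exact session mechanics vary per site): a browser passes the initial checks, then a fast HTTP client reuses its cookies and user agent.

import httpx
from playwright.sync_api import sync_playwright

URL = "https://example.com/"  # placeholder URL

# 1. use a real browser to pass the initial checks and collect session cookies
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.goto(URL)
    cookies = {cookie["name"]: cookie["value"] for cookie in context.cookies()}
    user_agent = page.evaluate("navigator.userAgent")
    browser.close()

# 2. continue scraping with a fast HTTP client reusing the same identity
with httpx.Client(http2=True, cookies=cookies, headers={"User-Agent": user_agent}) as client:
    for path in ["/page/1", "/page/2"]:  # placeholder paths
        response = client.get(URL.rstrip("/") + path)
        print(path, response.status_code)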

Behavior Analysis

With all that said, the trust score is not a constant number and is continuously adjusted throughout the ongoing connection.

For example, if we start with a score of 80 and proceed to connect to 100 pages in a few seconds we stand out as a non-human user and that will reduce the trust score.

On the other hand, if the bot behaves in a human-like way, the trust score can remain steady or even increase.

So, it's important to distribute web scraper traffic through multiple agents using proxies and different fingerprint configurations to prevent the trust score from dropping.
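
As an illustrative sketch, requests can be spread over a pool of proxies with human-like pauses in between (the proxy URLs and target pages are placeholders):

import random
import time
import httpx

# placeholder proxy pool - each request is routed through a different agent
proxies = [
    "http://user:pass@proxy-1.example.com:8000",
    "http://user:pass@proxy-2.example.com:8000",
    "http://user:pass@proxy-3.example.com:8000",
]

urls = [f"https://example.com/page/{i}" for i in range(1, 6)]  # placeholder URLs

for url in urls:
    proxy = random.choice(proxies)
    with httpx.Client(proxies=proxy) as client:
        response = client.get(url)
        print(url, response.status_code)
    time.sleep(random.uniform(2, 6))  # human-like pause between requests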

How to Bypass Cloudflare Bot Protection?

Now that we've covered all of the parts that are used by Cloudflare to identify web scrapers - how do we blend in overall?

In practice, we have two options.

We could reverse engineer and fortify against all of these detection techniques by using browser-like HTTP2 connections with the same TLS capabilities, common javascript...

Alternatively, we can use real web browsers for web scraping. By controlling a real web browser we no longer need to pretend to be one, which makes bypassing Cloudflare much more approachable.

However, many automation tools like Selenium, Playwright and Puppeteer leave traces of their existence which ideally need to be patched to achieve higher trust scores. For that, see projects like the Puppeteer stealth plugin and other similar stealth extensions.

For sustained web scraping with Cloudflare bypass in 2023, these browsers should be remixed with different fingerprint profiles: screen resolution, operating system and browser type all play a role in Cloudflare's bot score.
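
For example, here's a sketch of remixing a few profile parameters per Playwright browser context (the profile values are illustrative and nowhere near a complete fingerprint rotation):

import random
from playwright.sync_api import sync_playwright

# illustrative profile pool - real rotation should cover many more parameters
profiles = [
    {"viewport": {"width": 1920, "height": 1080}, "locale": "en-US"},
    {"viewport": {"width": 1366, "height": 768}, "locale": "en-GB"},
    {"viewport": {"width": 1536, "height": 864}, "locale": "en-US"},
]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    profile = random.choice(profiles)
    context = browser.new_context(**profile)  # each context gets its own profile
    page = context.new_page()
    page.goto("https://example.com/")  # placeholder URL
    print(page.evaluate("({w: window.innerWidth, h: window.innerHeight})"))
    browser.close()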

Finally, some existing open source tools can help with Cloudflare bypass like cloudscraper which can solve Cloudflare's javascript challenges using Python or Nodejs solvers.
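
For instance, a minimal cloudscraper usage sketch (the URL is a placeholder; success depends on which challenge version the page serves):

# pip install cloudscraper
import cloudscraper

scraper = cloudscraper.create_scraper()  # requests-like session that solves JS challenges
response = scraper.get("https://example.com/")  # placeholder URL
print(response.status_code)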

Bypass with ScrapFly

While bypassing Cloudflare is possible, maintaining bypass strategies can be very time-consuming.

Using ScrapFly web scraping API we can defer all of this complex bypass logic to an API!

Scrapfly is not only a Cloudflare bypasser but also offers many other web scraping features.

For example, to scrape pages protected by Cloudflare using ScrapFly SDK all we need to do is enable the Anti Scraping Protection bypass feature:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="some cloudflare protected page",
    asp=True,
    # we can also enable headless browsers to render web apps and javascript powered pages
    render_js=True,
    # and set proxies by country or type
    country="US",
    proxy_pool="residential_proxy_pool",
))
print(result.scrape_result)

FAQ

To wrap this article let's take a look at some frequently asked questions regarding web scraping Cloudflare pages:

Is it legal to scrape Cloudflare protected pages?

Yes. Web scraping publicly available data is perfectly legal around the world as long as the scrapers do not cause damage to the website.

Is it possible to bypass Cloudflare entirely and scrape the website directly?

Sort of. Since Cloudflare is a CDN it's possible to avoid it entirely by connecting to the web server directly. This is done by discovering the real IP address of the origin server using DNS records or reverse engineering. However, this method is easily detectable so it's rarely used when web scraping.

Is it possible to bypass Cloudflare using cache services?

Yes, public page caching services like Google Cache or Archive.org can be used to bypass Cloudflare. However, since caching takes time the cached page data is often outdated and not suitable for web scraping. Cached pages can also be missing parts of content that are loaded dynamically.

What are some other anti-bot services?

There are many other anti-bot WAF services like PerimeterX, Akamai, Datadome and Imperva (aka Incapsula), though they function very similarly to Cloudflare, so everything in this tutorial can be applied to them as well.

Summary

In this article, we've taken a look at how to get around Cloudflare anti-bot systems when web scraping.

To start, we've taken a look at how Cloudflare bot management identifies web scrapers through TLS, IP and javascript-based client analysis. Using residential proxies and fingerprint-resistant libraries can be a good start. Further, using real web browsers and remixing their fingerprint data can make web scrapers difficult to detect.

Finally, we've taken a look at some frequently asked questions like alternative bypass methods and the legality of it all.

For an easier way to handle web scraper blocking and power up your web scrapers check out ScrapFly for free!
