Scrapfly for Scrapfly

Posted on Apr 10, 2023 • Originally published at scrapfly.io on Mar 10, 2023

How to Bypass Akamai when Web Scraping in 2023


Akamai Bot Manager is a popular web service that protects websites from bots and scrapers. It's used by many popular websites including Amazon, eBay, Airbnb and many others.

Akamai is primarily known for using AI in their bot detection software but it's powered by traditional bot detection methods like fingerprinting and connection analysis. This means with careful engineering Akamai can be bypassed when web scraping.

In this article, we'll be taking a look at how to bypass Akamai Bot Manager and how to detect when a request has been blocked by Akamai. We'll also cover common Akamai errors and signs that indicate that requests have been blocked. Let's dive in!

What is Akamai Bot Manager?

Akamai offers a suite of web services, and the Bot Manager service is used to determine whether a connecting user is a human or an automated process. While it has a legitimate use in protecting websites from malicious bots, it also blocks web scrapers from accessing public data.

Akamai Bot Manager is primarily used by big websites like eBay.com, Airbnb.com and Amazon.com, which makes web scraping of these targets difficult but possible. Next, let's take a look at some common Akamai errors and how the whole thing works.

How to Identify an Akamai Block?

Most Akamai bot blocks result in HTTP status codes in the 4xx-5xx range. Most commonly, status code 403 is returned with the message "Pardon Our Interruption" or "Access Denied". To throw off bots, though, Akamai can also return status code 200 with the same messages.

Screenshot of Akamai block page when scraping similarweb.com

This error is mostly encountered on the first request as Akamai is particularly good at detecting bots at the first stages of the connection. However, Akamai's AI behavior analysis can block connections at any point.
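The signs above can be turned into a simple response check. This is a minimal sketch rather than Akamai's own logic: the helper name `looks_like_akamai_block` and the marker list are assumptions based only on the status codes and messages mentioned in this section.

```python
# Messages taken from the block pages described above.
BLOCK_MARKERS = ("pardon our interruption", "access denied")


def looks_like_akamai_block(status_code: int, body: str) -> bool:
    """Heuristic check for a blocked response (hypothetical helper)."""
    text = body.lower()
    # A block message counts as a block even when the status code is 200.
    if any(marker in text for marker in BLOCK_MARKERS):
        return True
    # Without a message, only the most common block status is treated as a block.
    return status_code == 403
```

A scraper could call this on every response and retry with a different proxy or profile when it returns `True`.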

Next, let's take a look at how exactly Akamai detects web scrapers and bots.

How Does Akamai Detect Web Scrapers?

Akamai Bot Manager uses many different web technologies to determine whether a user is a human or a bot. Not only that, but Akamai continuously tracks users' behavior to adjust the detection result, also known as the trust score.

The trust score is calculated in many different stages. The final score is then a weighted average of all the stages and determines whether the user is allowed to bypass Akamai.
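To make the weighted average idea concrete, here is a small sketch. The stage names, weights and scores are all hypothetical, as Akamai's real stages and weights are not public.

```python
# Hypothetical weights for each detection stage (not Akamai's real values).
STAGE_WEIGHTS = {"tls": 2, "ip": 3, "http": 2, "javascript": 3}


def trust_score(stage_scores: dict) -> float:
    """Weighted average of per-stage scores, where 0 means bot and 1 means human."""
    total_weight = sum(STAGE_WEIGHTS.values())
    weighted_sum = sum(
        weight * stage_scores[stage] for stage, weight in STAGE_WEIGHTS.items()
    )
    return weighted_sum / total_weight


# A client that looks human everywhere except for its datacenter IP address.
example = trust_score({"tls": 0.9, "ip": 0.1, "http": 0.9, "javascript": 0.9})
```

In this model a single weak stage pulls the whole score down, which is why every stage covered below matters.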

This complex process makes web scraping difficult, as developers have to manage many different factors to bypass Akamai. However, if we take a look at the individual stages, we can see that bypassing Akamai is very much possible!

TLS Fingerprinting

TLS (or SSL) is the first step in the HTTPS connection process. It's used for end-to-end encryption of HTTPS connections.

To start, both client and the server have to negotiate the encryption method. As there are many different ciphers and encryption options both sides have to agree on the same one. This is where TLS fingerprinting comes into play.

Different computers, programs and even programming libraries have different TLS capabilities. So, if a web scraper uses a library whose TLS capabilities differ from those of a regular web browser, it can be identified through this method. This is generally referred to as JA3 fingerprinting.

To avoid being JA3 fingerprinted, ensure that the libraries and tools used for the HTTP connection are JA3 resistant.

For that, see ScrapFly's JA3 fingerprint web tool that shows your fingerprint.

For more see our full introduction to TLS fingerprinting which covers TLS fingerprinting in greater detail.
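As an illustration of why different TLS capabilities are so easy to tell apart, the sketch below builds a JA3 fingerprint from the values a client announces during the handshake. It follows the public JA3 format (five comma-separated fields with dash-separated decimal values, hashed with MD5). The sample values are illustrative and are not captured from a real browser.

```python
import hashlib


def ja3_string(version, ciphers, extensions, curves, point_formats) -> str:
    """Join the handshake fields in JA3 order: values by '-', fields by ','."""
    fields = [[version], ciphers, extensions, curves, point_formats]
    return ",".join("-".join(str(value) for value in field) for field in fields)


def ja3_hash(version, ciphers, extensions, curves, point_formats) -> str:
    """MD5 hex digest of the JA3 string."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode()).hexdigest()


# Illustrative client values: 771 is the decimal code for TLS 1.2.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
# The same ciphers offered in a different order produce a different fingerprint.
client_b = ja3_hash(771, [4867, 4866, 4865], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
```

Because the hash covers the order of the values as well as the values themselves, even a small difference between an HTTP library and a browser produces a distinct fingerprint.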

IP Address Fingerprinting

The next step in Akamai's detection is IP address analysis and fingerprinting.

To start, there are a few different types of IP addresses:

Residential are home addresses assigned by internet providers to average people. So, residential IP addresses provide a positive trust score as these are mostly used by humans and are expensive to acquire.
Mobile addresses are assigned by mobile phone towers to mobile users. So, mobile IPs also provide a positive trust score as these are mostly used by humans. In addition, since mobile towers might share and recycle IP addresses, it is much more difficult to rely on IP addresses for identification.
Datacenter addresses are assigned to various data centers and server platforms like Amazon's AWS, Google Cloud etc. So, datacenter IPs provide a significant negative trust score as they are likely to be used by bots.

Using IP analysis Akamai can determine whether the IP address is residential, mobile or datacenter. This is done by comparing the IP address to a database of known IP addresses and inspecting public IP provider details.

For example, since real users rarely browse from datacenter IPs, a web scraper that uses one is a dead giveaway that it's a bot.

So, use high-quality residential or mobile proxies to avoid being blocked by Akamai at this stage.
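The database lookup described above can be sketched with Python's standard `ipaddress` module. The ranges below are reserved documentation networks used as placeholders, and the fallback to "residential" is an assumption made for this example. A real database would list actual cloud provider and mobile carrier networks.

```python
import ipaddress

# Placeholder ranges (reserved documentation networks, not real provider data).
KNOWN_RANGES = {
    "datacenter": [ipaddress.ip_network("203.0.113.0/24")],
    "mobile": [ipaddress.ip_network("198.51.100.0/24")],
}


def classify_ip(address: str) -> str:
    """Return 'datacenter', 'mobile' or 'residential' for an IP address."""
    ip = ipaddress.ip_address(address)
    for kind, networks in KNOWN_RANGES.items():
        if any(ip in network for network in networks):
            return kind
    # Assumption for this sketch: anything unlisted is treated as residential.
    return "residential"
```

The same kind of lookup can be useful on the scraper side to confirm that a proxy provider really hands out residential or mobile addresses.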

For a more in-depth look, see our full introduction to IP blocking.

HTTP Details

The next step is the HTTP connection itself. HTTP protocol is becoming more complex and Akamai is using this complexity to detect bots.

To start, most of the web runs on HTTP2 and HTTP3 while many web scraping libraries are using HTTP1.1. So, if a web scraper is using HTTP1.1 it's a clear giveaway that it is a bot.

While many newer HTTP libraries like cURL and httpx support HTTP2, scrapers using them can still be detected by Akamai through HTTP2 fingerprinting. See ScrapFly's http2 fingerprint test page for more info.

HTTP request headers also play an important role. Akamai is looking for specific headers that are used by web browsers but not by web scrapers. So, it's important to ensure that request headers and their order match that of a real web browser and context of the website.

For example, headers like Origin, Referer can be used in some pages of the website but not in others. Other identity headers like User-Agent and encoding headers like Accept-Encoding can also be used to identify bots.

For more, see our full introduction to the role of request headers in blocking.
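The header advice above can be sketched as a small helper that builds browser-like headers in a fixed order and only adds Referer when the page context calls for it. The header values and their order are illustrative assumptions. For a real scraper they should be copied from an actual browser session on the target website.

```python
from typing import Optional


def browser_like_headers(user_agent: str, referer: Optional[str] = None) -> dict:
    """Build request headers in a stable, browser-like order (illustrative values)."""
    # Python dictionaries keep insertion order, so the order here is the send order
    # for HTTP clients that respect it.
    headers = {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
    }
    # Referer only makes sense when the request follows a link from another page.
    if referer is not None:
        headers["Referer"] = referer
    return headers
```

A dictionary like this could then be passed to an HTTP2-capable client, for example `httpx.Client(http2=True, headers=...)`, which needs httpx installed with its optional HTTP2 support.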

Javascript Fingerprinting

Finally, the most complex and difficult stage to bypass is Javascript fingerprinting.

As the web server can execute arbitrary javascript code on the client's machine it can be used to gather vast amounts of information about the connecting client:

Javascript engine details
Hardware details and capabilities
Operating system information
Web browser context information

All of this data is used to create a unique fingerprint for tracking users and identifying bots.

Fortunately, javascript is complex and takes time to execute. This limits practical Javascript fingerprinting techniques. In other words, not many users can wait 3 seconds for the page to load or tolerate false positive blocks.

For an in-depth look, see our article on javascript use in web scraper detection.

To bypass Akamai's javascript fingerprinting we generally have two very different options.

We can intercept and reverse engineer javascript behavior and feed Akamai with fake data. This is a very complex and time-consuming process as Akamai Bot team is constantly adjusting and changing things up.

Alternatively, we can run a real web browser using browser automation libraries like Selenium, Puppeteer or Playwright that can start a real headless browser and navigate it for web scraping.

So, use browser automation libraries to bypass Akamai's javascript fingerprinting.

This approach can even be mixed with traditional HTTP libraries: we can establish a trust score using a real web browser and then switch the session to an HTTP library for faster scraping (this feature is also available using Scrapfly sessions).
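The browser-then-HTTP handoff can be sketched with Playwright's sync API. This is a minimal sketch under a few assumptions: the helper names are hypothetical, Playwright and its Chromium build have to be installed separately, and a real handoff would likely also need the same proxy and User-Agent on the HTTP client.

```python
def render_and_collect(url: str):
    """Open the page in headless Chromium and return (html, cookies)."""
    # Imported lazily so the cookie helper below works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        page.goto(url)
        html = page.content()
        # Cookies are returned as a list of dictionaries with name and value keys.
        cookies = context.cookies()
        browser.close()
    return html, cookies


def cookie_header(cookies) -> str:
    """Join browser cookies into a Cookie header value for an HTTP client."""
    return "; ".join(f"{cookie['name']}={cookie['value']}" for cookie in cookies)
```

The string returned by `cookie_header` can be sent as the `Cookie` header of follow-up requests made with a regular HTTP library.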

Behavior Analysis

Even with all of the above methods bypassed, Akamai can still detect bots using behavior analysis. As Akamai tracks everything that happens on the website, it can detect scrapers and bots by spotting abnormal behavior.

So, it's important to distribute web scraper traffic through multiple agents.

This is done by creating multiple profiles with proxies, header details and other settings. If browser automation is used then each profile should use a different browser version and configuration (like screen size etc.).
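A simple way to picture this is a pool of profiles that scrape jobs are spread across in round-robin order. The proxy URLs and user agent strings below are placeholders, and the `Profile` and `assign_profiles` names are hypothetical.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    proxy: str        # placeholder proxy URL
    user_agent: str   # placeholder, use a full real browser string
    viewport: tuple   # (width, height) in pixels


PROFILES = [
    Profile("http://proxy-1.example:8000", "placeholder-user-agent-1", (1920, 1080)),
    Profile("http://proxy-2.example:8000", "placeholder-user-agent-2", (1366, 768)),
    Profile("http://proxy-3.example:8000", "placeholder-user-agent-3", (1536, 864)),
]


def assign_profiles(urls, profiles=PROFILES):
    """Pair each URL with a profile in round-robin order."""
    return list(zip(urls, itertools.cycle(profiles)))
```

Each `(url, profile)` pair can then be handed to whichever HTTP client or browser automation tool does the actual scraping.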

How to Bypass Akamai Bot Management?

Now that we're familiar with all of the methods being used to detect bots we have a general understanding of how to bypass Akamai bot protection by avoiding all of these detection methods.

There are many ways to approach this challenge but to bypass Akamai in 2023 we can summarize the general approach as follows:

Use high-quality residential or mobile proxies
Use browser automation libraries to bypass Akamai's javascript fingerprinting
Patch browser automation libraries with fingerprint resistance patches (like puppeteer-stealth)
Distribute web scraper traffic through multiple agents

Bypass with ScrapFly

While bypassing Akamai is possible, maintaining the bypass strategies can be very time-consuming. This is where services like ScrapFly web scraping API come in!

Using ScrapFly we can hand over all of the web scraping complexity and bypass logic to ScrapFly!

Scrapfly is not only an Akamai bypasser but also offers many other web scraping features:

Millions of residential proxies from 50+ countries
Akamai and any other anti-scraping protection bypass
Headless cloud browsers that can render javascript pages and automate browsers
Python SDK
Easy monitoring and debugging tools

For example, to scrape pages protected by Akamai or any other anti-scraping service, when using ScrapFly SDK all we need to do is enable the Anti Scraping Protection bypass feature:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://amazon.com/",
    asp=True,
    # we can also enable headless browsers to render web apps and javascript powered pages
    render_js=True,
    # and set proxies by country like Japan
    country="JP",
    # and proxy type like residential:
    proxy_pool="public_residential_pool",
))
print(result.scrape_result)

FAQ

To wrap up this article, let's take a look at some frequently asked questions regarding web scraping Akamai pages:

Is it legal to scrape Akamai-protected pages?

Yes. Web scraping publicly available data is generally legal in most of the world as long as the scrapers do not cause damage to the website.

Is it possible to bypass Akamai using cache services?

Yes, public page caching services like Google Cache or Archive.org can be used to bypass Akamai-protected pages, as Google and Archive tend to be whitelisted. However, since caching takes time, the cached page data is often outdated and not suitable for web scraping. Cached pages can also be missing parts of content that are loaded dynamically.
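For Archive.org, a cached copy can be located through the Wayback Machine availability API. The sketch below only builds the lookup URL and reads the closest snapshot out of an already decoded JSON response. The helper names are hypothetical and the actual HTTP request is left to whichever client the scraper uses.

```python
from typing import Optional
from urllib.parse import urlencode

WAYBACK_API = "https://archive.org/wayback/available"


def wayback_query_url(page_url: str) -> str:
    """Build the availability lookup URL for a page."""
    return f"{WAYBACK_API}?{urlencode({'url': page_url})}"


def closest_snapshot(api_response: dict) -> Optional[str]:
    """Return the URL of the closest archived snapshot, or None if there is none."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

The snapshot URL points to Archive.org rather than the protected website, so it can usually be fetched with a plain HTTP client.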

Is it possible to skip Akamai entirely and scrape the real website directly?

This treads closer to security research and is not advised when web scraping. While scraping and bypassing Akamai pages is perfectly legal, abusing security flaws can be illegal in many countries.

What are some other anti-bot services?

There are many other anti-bot WAF services like Cloudflare, PerimeterX (aka Human), Datadome and Imperva (aka Incapsula) though they function very similarly to Akamai so everything in this tutorial can be applied to them as well.

Summary

In this article, we've taken a look at how to bypass Akamai Bot Management when web scraping.

We started by identifying the methods Akamai uses to develop a trust score for each new connection and the role of this score in web scraping. We then took a look at each method and what we can do to bypass it.

Finally, we've looked at how to bypass Akamai using ScrapFly web scraping API and how to use ScrapFly to scrape Akamai-protected pages, so give it a shot for free!

