
How to Bypass PerimeterX when Web Scraping in 2023


PerimeterX is one of the most popular anti-bot services on the market, offering a wide range of protection against bots and scrapers. Its products Bot Defender, Page Defender and API Defender are all used to block web scrapers.

In this article, we'll take a look at how to bypass PerimeterX bot protection. We'll do this by taking a quick look at how it detects scrapers and how to modify our scraper code to prevent being detected by PerimeterX.

We'll also cover common PerimeterX errors and signs that indicate that requests have failed to bypass PerimeterX and their meaning. Let's dive in!

What is PerimeterX?

PerimeterX (aka Human) is a web service that protects websites, apps and APIs from automation such as scrapers. It uses a combination of web technologies and behavior analysis to determine whether the user is a human or a bot.

It is used by popular websites like Zillow.com, fiverr.com and many others, so by understanding how to bypass PerimeterX we can open up web scraping of many popular websites.

Next, let's take a look at some popular PerimeterX errors.

Popular PerimeterX Errors

Most PerimeterX bot blocks result in HTTP status codes 400-500, most commonly 403. The body of the response contains a request to "enable javascript" or a "Press and hold" button.

PerimeterX block page on fiverr.com

This error is mostly encountered on the first request to the website, though since PerimeterX uses behavior analysis it can also be encountered at any point during web scraping.

Let's take a look at how exactly PerimeterX is detecting web scrapers and bots and how the "Press and hold" button works.

How Does PerimeterX Detect Web Scrapers?

To detect web scraping, PerimeterX uses many different technologies to estimate whether the traffic is coming from a human user or a bot.


PerimeterX uses a combination of fingerprinting and connection analysis to calculate a trust score for each client. This score determines whether the user can access the website or not.

Based on the final trust score, the user is either allowed to access the website or blocked with a PerimeterX block page which can further be bypassed by solving javascript challenges (i.e. the "press and hold" button).


This complex process makes web scraping difficult as there are many factors at play here. However, if we take a look at each individual factor we can see that bypassing PerimeterX is very much possible!

TLS Fingerprinting

TLS (or SSL) is the first step in HTTP connection establishment. It is used to encrypt the data that is being sent between the client and the server. Note that TLS is only applicable to https endpoints (not http).

First, the client and the server negotiate how encryption is done and this is where TLS fingerprinting comes into play. Different computers, programs and even programming libraries have different TLS capabilities.

So, if a scraper uses a library with different TLS capabilities than a regular web browser, it can be identified quite easily. This is generally referred to as the JA3 fingerprint.

For example, some libraries and tools used in web scraping have unique TLS negotiation patterns that can be instantly recognized, while others use the same TLS techniques as a web browser and can be very difficult to differentiate.

To validate your tools see ScrapFly's JA3 fingerprint web tool that can tell you your exact JA3 fingerprint.

So, use web scraping libraries and tools that are resistant to JA3 fingerprinting.
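
For illustration, here's a minimal sketch using curl_cffi, one such tool that can impersonate a browser's TLS handshake so the resulting JA3 fingerprint matches a real Chrome build. The target URL is a placeholder; point it at a JA3 test endpoint to inspect the fingerprint you present:

# minimal sketch: browser-like TLS fingerprint via curl_cffi (pip install curl_cffi)
from curl_cffi import requests

# impersonate makes curl_cffi negotiate TLS the way the chosen browser build would,
# producing a matching JA3 fingerprint instead of a default library fingerprint
response = requests.get(
    "https://example.com/",  # placeholder: replace with a JA3 fingerprint test page
    impersonate="chrome110",
)
print(response.status_code)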

For more see our full introduction to TLS fingerprinting which covers TLS fingerprinting in greater detail.

IP Address Fingerprinting

The next step is IP address analysis. Since IP addresses come in many different shapes and sizes there's a lot of information that can be used to determine whether the client is a human or a bot.

To start, there are different types of IP addresses:

  • Residential addresses are home addresses assigned by internet providers to average people. So, residential IP addresses provide a positive trust score as these are mostly used by humans and are expensive to acquire.
  • Mobile addresses are assigned by mobile phone towers to mobile users. So, mobile IPs also provide a positive trust score as these are mostly used by humans. In addition, since mobile towers might share and recycle IP addresses, relying on IP addresses for bot identification becomes much more difficult.
  • Datacenter addresses are assigned to various data centers and server platforms like Amazon's AWS, Google Cloud etc. So, datacenter IPs provide a significant negative trust score as they are likely to be used by bots.

Using IP analysis, PerimeterX can estimate how likely the connecting client is to be a human. Most people browse from residential IPs, while mobile IPs are mostly seen in mobile traffic.

So, use high-quality residential or mobile proxies.
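
As a minimal sketch, a residential proxy can be plugged straight into an HTTP client like httpx; the proxy address and credentials below are placeholders for whatever your proxy provider issues:

# minimal sketch: routing requests through a (placeholder) residential proxy
import httpx

proxy_url = "http://username:password@residential-proxy.example.com:8000"  # hypothetical proxy

# note: newer httpx versions use the proxy= argument instead of proxies=
with httpx.Client(proxies=proxy_url) as client:
    response = client.get("https://example.com/")
    print(response.status_code)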

For a more in-depth look, see our full introduction to IP blocking.

HTTP Details

The next step is the HTTP connection itself. This includes HTTP connection details like:

  • Protocol Version Most of the web is using HTTP2 while many web scraping tools still use HTTP1.1, which is a dead giveaway. Many newer HTTP client libraries like httpx or cURL support HTTP2, though not by default. HTTP2 can also be susceptible to fingerprinting, so check ScrapFly's http2 fingerprint test page for more info.
  • Headers Pay attention to X- prefixed headers and the usual suspects like User-Agent, Origin and Referer, which can all be used to identify web scrapers.
  • Header Order Web browsers have a specific way of ordering request headers. So, if the headers are not ordered in the same way as a web browser it can be a critical giveaway. To add, some HTTP libraries (like requests in Python) do not respect the header order and can be easily identified.

So, make sure the headers in web scraper requests match a real web browser, including the ordering.
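
To illustrate, here's a minimal httpx sketch that enables HTTP/2 and sends a browser-like header set in insertion order. The exact header values and ordering are only an approximation and should be copied from the real browser you are mimicking:

# minimal sketch: HTTP/2 plus browser-like headers (pip install "httpx[http2]")
import httpx

# httpx preserves the insertion order of this header dict
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
}

with httpx.Client(http2=True, headers=headers) as client:
    response = client.get("https://example.com/")
    print(response.http_version)  # "HTTP/2" when the server supports it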

For more see our full introduction to the request headers' role in blocking.

Javascript Fingerprinting

Finally, the most powerful tool in PerimeterX's arsenal is javascript fingerprinting.

Since the website can execute arbitrary javascript code in the client's browser, it can extract a lot of information about the connecting user, like:

  • Javascript runtime details
  • Hardware details and capabilities
  • Operating system details
  • Web browser details

That's loads of data that can be used in calculating the trust score.

Fortunately, javascript takes time to execute and is prone to false positives, which limits practical applications of javascript fingerprinting. In other words, not many users can wait 3 seconds for the page to load or tolerate false positives.

For a really in-depth look see our article on javascript use in web scraper detection.

Bypassing javascript fingerprinting is the most difficult task here. In theory, it's possible to reverse engineer and simulate all of the javascript tasks PerimeterX is performing and feed it fake results, though it's not practical.

A more practical approach is to use a real web browser for web scraping.

This can be done using browser automation libraries like Selenium, Puppeteer or Playwright that can start a real headless browser and navigate it for web scraping.

So, introducing browser automation to your scraping pipeline can drastically raise the trust score.
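
For example, here's a minimal Playwright sketch that loads a page in a real headless Chromium browser; the URL is just a placeholder:

# minimal sketch: scraping with a real browser (pip install playwright; playwright install chromium)
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # navigating with a real browser executes javascript challenges
    # the same way a normal visitor's browser would
    page.goto("https://example.com/")
    page.wait_for_load_state("networkidle")
    html = page.content()
    browser.close()

print(len(html))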

Tip: many advanced scraping tools can even combine browser and HTTP scraping capabilities for optimal performance: use resource-heavy browsers to establish a trust score, then continue scraping with fast HTTP clients like httpx in Python (this feature is also available using Scrapfly sessions).
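
As a rough sketch of this hybrid approach, the cookies set during a Playwright session can be exported and reused by a plain httpx client; the URLs below are placeholders:

# rough sketch: establish a session with a browser, continue over plain HTTP
import httpx
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/")  # placeholder target
    # export the session cookies set after the javascript challenges ran
    cookies = {c["name"]: c["value"] for c in context.cookies()}
    user_agent = page.evaluate("navigator.userAgent")
    browser.close()

# reuse the established session with a fast HTTP client
with httpx.Client(cookies=cookies, headers={"User-Agent": user_agent}) as client:
    response = client.get("https://example.com/some-page")
    print(response.status_code)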

Behavior Analysis

Even when scrapers' initial connection is indistinguishable from a real web browser, PerimeterX can still detect them through behavior analysis.

This is done by monitoring the connection and analyzing the behavior of the client. This includes:

  • Pages that are being visited. People browse in more chaotic patterns.
  • Connection speed and rate. People are slower and more random than bots.
  • Loading of resources like images, scripts, stylesheets etc.

The trust score is not a constant number and will be constantly adjusted.

So, it's important to distribute web scraper traffic through multiple agents using proxies and different fingerprint configurations to prevent behavior analysis.

For example, if browser automation tools are used, a different browser configuration should be used for each agent: screen size, operating system, web browser version, IP address etc.
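
As a rough sketch, the idea is to keep a pool of agents, each tied to its own proxy and identity, and rotate between them while pacing requests; the proxy addresses below are purely illustrative:

# rough sketch: spreading requests across multiple "agents" to avoid one behavioral footprint
import random
import time

import httpx

# hypothetical agent pool; in practice each entry maps to a distinct
# residential proxy and a consistent fingerprint profile
AGENTS = [
    {"proxy": "http://user:pass@proxy-1.example.com:8000",
     "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"},
    {"proxy": "http://user:pass@proxy-2.example.com:8000",
     "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15"},
]

def scrape(url: str) -> httpx.Response:
    agent = random.choice(AGENTS)
    # random delays make the request rate look less machine-like
    time.sleep(random.uniform(1, 4))
    with httpx.Client(proxies=agent["proxy"], headers={"User-Agent": agent["user_agent"]}) as client:
        return client.get(url)

print(scrape("https://example.com/").status_code)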

How to Bypass PerimeterX (aka Human) Bot Protection?

Now that we're familiar with all of the ways PerimeterX can detect web scrapers, let's see how to bypass it.

In reality, we have two very different options:

We could reverse engineer and fortify against all of these techniques, but PerimeterX is constantly updating its detection methods and it's a never-ending game of cat and mouse.

Alternatively, we can use real web browsers for scraping. This is the most practical and effective approach as it's much easier to ensure that the headless browser looks like a real one than to re-invent it.

However, many browser automation tools like Selenium, Playwright and Puppeteer leave traces of their existence which need to be patched to achieve high trust scores. For that, see projects like the Puppeteer stealth plugin and other similar stealth extensions that patch known leaks.

For sustained web scraping with PerimeterX bypass in 2023, these browsers should always be remixed with different fingerprint profiles: screen resolution, operating system and browser type all play an important role in PerimeterX's bot score.
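
As an illustration, here's a hedged sketch of remixing fingerprint profiles with Playwright browser contexts; the profile values are just examples:

# rough sketch: rotating browser fingerprint profiles with Playwright contexts
import random
from playwright.sync_api import sync_playwright

# illustrative profiles; real pools should be larger and internally consistent
PROFILES = [
    {"viewport": {"width": 1920, "height": 1080},
     "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36",
     "locale": "en-US", "timezone_id": "America/New_York"},
    {"viewport": {"width": 1440, "height": 900},
     "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
     "locale": "en-GB", "timezone_id": "Europe/London"},
]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    profile = random.choice(PROFILES)
    # each context gets its own viewport, user agent, locale and timezone
    context = browser.new_context(**profile)
    page = context.new_page()
    page.goto("https://example.com/")  # placeholder target
    print(page.title())
    browser.close()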

Bypass with ScrapFly

While bypassing PerimeterX is possible, maintaining bypass strategies can be very time-consuming. This is where services like ScrapFly come in!


Using ScrapFly web scraping API we can hand over all of the web scraping complexity and bypass logic to an API!

Scrapfly is not only a PerimeterX bypasser but also offers many other web scraping features.

For example, to scrape pages protected by PerimeterX or any other anti scraping service, when using ScrapFly SDK all we need to do is enable the Anti Scraping Protection bypass feature:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://fiverr.com/",
    asp=True,
    # we can also enable headless browsers to render web apps and javascript powered pages
    render_js=True,
    # and set proxies by country like Japan
    country="JP",
    # and proxy type like residential:
    proxy_pool="residential_proxy_pool",
))
print(result.scrape_result)


FAQ

To wrap this article let's take a look at some frequently asked questions regarding web scraping PerimeterX pages:

Is it legal to scrape PerimeterX protected pages?

Yes. Web scraping publicly available data is perfectly legal around the world as long as the scrapers do not cause damage to the website.

Is it possible to bypass PerimeterX using cache services?

Yes, public page caching services like Google Cache or Archive.org can be used to bypass PerimeterX protected pages as Google and Archive tend to be whitelisted. However, since caching takes time, the cached page data is often outdated and not suitable for web scraping. Cached pages can also be missing parts of content that are loaded dynamically.
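
For illustration, a cached copy can be requested by prefixing the target URL with Google's public cache endpoint; this is a rough sketch and, as noted, cached results may be stale or incomplete:

# rough sketch: fetching a page through Google's public cache instead of the live site
from urllib.parse import quote

import httpx

target = "https://www.fiverr.com/"
cache_url = "https://webcache.googleusercontent.com/search?q=cache:" + quote(target, safe="")

response = httpx.get(cache_url, follow_redirects=True)
print(response.status_code)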

Is it possible to bypass PerimeterX entirely and scrape the website directly?

No. PerimeterX integrates directly with the web server software, making it very difficult to reach the server without going through it. It is possible that some servers have PerimeterX misconfigured, but it's very unlikely.

What are some other anti-bot services?

There are many other anti-bot WAF services like Cloudflare, Akamai, Datadome and Imperva (aka Incapsula) though they function very similarly to PerimeterX so everything in this tutorial can be applied to them as well.

Summary

In this article, we took a deep dive into PerimeterX anti-bot systems when web scraping.

To start, we've taken a look at how PerimeterX identifies web scrapers through TLS, IP and javascript client fingerprint analysis. We saw that using residential proxies and fingerprint-resistant libraries is a good start. Further, using real web browsers and remixing their fingerprint data can make web scrapers much more difficult to detect.

Finally, we've taken a look at some frequently asked questions like alternative bypass methods and the legality of it all.

For an easier way to handle web scraper blocking and power up your web scrapers check out ScrapFly for free!

