DEV Community

Scrapfly for Scrapfly

Posted on • Originally published at scrapfly.io on

How to Bypass Imperva Incapsula when Web Scraping in 2023

Imperva (aka Incapsula) is a popular WAF service used by websites like Glassdoor, Udemy, wix.com and many others.

This service is used to block bots and web scrapers from accessing the website. So, to scrape public data from these websites, the scrapers need to bypass Imperva Incapsula bot protection.

In this article, we'll take a look at how to bypass Imperva's anti-scraping protection. We'll start with a quick look at what Imperva is, how to identify it and how it identifies web scrapers. Then, we'll cover existing techniques and tools for bypassing Imperva bot protection. Let's dive in!

What is Imperva (aka Incapsula)?

Imperva (previously known as Incapsula) is a WAF (Web Application Firewall) service suite used to protect websites from unwanted connections. It has legitimate uses, though in the context of web scraping it's used to block web scrapers from accessing public data.

Imperva/Incapsula is one of the first WAF services websites used to block web scraping, and it is generally well understood by the web scraping community. So, let's take a look at how to identify it and how it identifies web scrapers.

Imperva Block Page Example

Most Imperva bot blocks result in HTTP status codes in the 400-500 range, with 403 being the most common. Block pages can also be served with status code 200 to throw off web scrapers.

The HTML content often indicates the block is powered by Imperva:

Imperva block page on giffgaff.com website

These errors are mostly encountered on the first request to the website. However, as Incapsula tracks clients continuously, it can start serving the same block pages at any point during the scraping process.

Here's the full list of Incapsula block indicators:

  • Powered By Incapsula text snippet in HTML.
  • Incapsula incident ID keyword in HTML.
  • _Incapsula_Resource keyword in HTML.
  • subject=WAF Block Page keyword in HTML.
  • visid_incap value in request headers.
  • X-Iinfo response header.
  • Set-Cookie header has cookie field incap_ses and visid_incap.
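The markers above can be turned into a simple check. Here's a minimal sketch that works on any HTTP client's status code, response headers and body. The function names are our own, and the rule for what counts as a block is a heuristic assumption, since cookies like visid_incap also appear on responses that were let through:

```python
from typing import List, Mapping

# Text markers from the list above that normally only show up on block pages.
BLOCK_PAGE_MARKERS = (
    "Powered By Incapsula",
    "Incapsula incident ID",
    "subject=WAF Block Page",
)
# Markers that indicate Imperva is present, blocked or not.
PRESENCE_MARKERS = ("_Incapsula_Resource",)
COOKIE_MARKERS = ("incap_ses", "visid_incap")


def imperva_signals(headers: Mapping[str, str], body: str) -> List[str]:
    """Return every Imperva/Incapsula marker found in a response."""
    headers_lower = {name.lower(): value for name, value in headers.items()}
    body_lower = body.lower()
    signals = [
        marker
        for marker in BLOCK_PAGE_MARKERS + PRESENCE_MARKERS
        if marker.lower() in body_lower
    ]
    if "x-iinfo" in headers_lower:
        signals.append("X-Iinfo")
    cookies = headers_lower.get("set-cookie", "")
    signals += [cookie for cookie in COOKIE_MARKERS if cookie in cookies]
    return signals


def looks_blocked(status: int, headers: Mapping[str, str], body: str) -> bool:
    """Heuristic (an assumption): block page text, or an error status plus any marker."""
    signals = imperva_signals(headers, body)
    has_block_page = any(marker in signals for marker in BLOCK_PAGE_MARKERS)
    return has_block_page or (status >= 400 and bool(signals))
```

Checking the body as well as the status code matters because, as noted earlier, block pages can be served with status code 200.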

How does Imperva identify web scrapers?

To detect web scraping, Imperva takes advantage of many different analysis and fingerprinting techniques.

Imperva uses a combination of these techniques to establish a unique fingerprint and trust score for each connecting client.

Based on the final trust score Imperva decides whether to block the client, let it through or request additional verification (like captcha).
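To make the score-based idea concrete, here's a purely illustrative sketch. Imperva's real scoring is not public, so the signal names, weights and thresholds below are all made-up assumptions. The only point is to show how several imperfect signals can combine into one decision:

```python
from typing import Mapping


def combined_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-signal scores, each between 0 (bot-like) and 1 (human-like)."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def decide(trust_score: float, allow_at: float = 0.7, block_below: float = 0.3) -> str:
    """Map a trust score to an action; both thresholds are hypothetical."""
    if trust_score >= allow_at:
        return "allow"
    if trust_score < block_below:
        return "block"
    return "challenge"  # e.g. a captcha


# A client with a browser-like TLS fingerprint but a datacenter IP (made-up numbers):
score = combined_score({"tls": 1.0, "ip": 0.5}, {"tls": 1.0, "ip": 1.0})
print(score, decide(score))
```

In a model like this, one weak signal doesn't automatically mean a block as long as the others pull the score up.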

The complexity of this process can be daunting, but if we look at each individual component we can see that bypassing Imperva is possible. Let's take a look at each of these components.

TLS Fingerprinting

TLS (or SSL) fingerprinting is a modern technique for identifying a client based on the way the client and server negotiate an encrypted connection. The most widely used fingerprint of this kind is called JA3.

For a secure connection (i.e. https), the encryption method needs to be negotiated as there are many different cipher and encryption options. So, if the connecting client has unusual capabilities, it can be easily identified.

Libraries used in web scraping can have different encryption capabilities compared to a web browser. So, web scrapers can be easily identified by their TLS fingerprint even before the actual HTTP request is made.

To avoid this, use libraries and tools that are JA3 resistant.

To validate, see ScrapFly's JA3 fingerprint web tool.

For more, see our full introduction to TLS fingerprinting, which covers the topic in greater detail.
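To see why a TLS fingerprint is so sensitive to the client library, it helps to look at how a JA3 value is built. It's an MD5 hash of five comma-separated fields from the TLS ClientHello: version, ciphers, extensions, elliptic curves and point formats. Here's a minimal sketch; the function names are ours and the sample values are illustrative, not captured from a real client (real implementations also skip GREASE values, which is left out here):

```python
import hashlib
from typing import Sequence


def ja3_string(
    version: int,
    ciphers: Sequence[int],
    extensions: Sequence[int],
    curves: Sequence[int],
    point_formats: Sequence[int],
) -> str:
    """Build the JA3 string: fields joined by ',', values inside a field joined by '-'."""
    fields = (ciphers, extensions, curves, point_formats)
    joined = ["-".join(str(value) for value in field) for field in fields]
    return ",".join([str(version)] + joined)


def ja3_hash(*client_hello_fields) -> str:
    """MD5 of the JA3 string, which is the value usually called the JA3 fingerprint."""
    return hashlib.md5(ja3_string(*client_hello_fields).encode("ascii")).hexdigest()


# The same ciphers offered in a different order give a different fingerprint:
print(ja3_hash(769, [47, 53], [0, 10, 11], [23, 24, 25], [0]))
print(ja3_hash(769, [53, 47], [0, 10, 11], [23, 24, 25], [0]))
```

Since both the values and their order go into the hash, an HTTP library with its own TLS defaults won't match a browser unless it copies the browser's ClientHello exactly.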

IP Address Fingerprinting

The next step is IP address analysis. Imperva has access to IP meta information databases that can be used to identify the client's intentions and capabilities.

For example, if the IP address belongs to a known proxy or datacenter service, it can be easily identified as a bot. If the IP address is from a residential ISP, it is much more likely to be a human. Same for mobile networks.

So, use high-quality residential or mobile proxies to avoid being detected.

For a more in-depth look, see our full introduction to IP blocking and what IP metadata fields are used in bot detection.
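The same kind of lookup can be sketched on the scraper side to vet a proxy list before using it. This is a minimal sketch with made-up data: real detection relies on commercial IP databases, and the ranges below are reserved documentation networks standing in for real providers:

```python
import ipaddress

# Hypothetical sample data, NOT real provider ranges.
# These are reserved documentation networks (RFC 5737).
KNOWN_NETWORKS = {
    "datacenter": ["192.0.2.0/24"],
    "residential": ["198.51.100.0/24"],
    "mobile": ["203.0.113.0/24"],
}


def classify_ip(ip: str) -> str:
    """Return the network type the address falls into, or 'unknown'."""
    address = ipaddress.ip_address(ip)
    for kind, networks in KNOWN_NETWORKS.items():
        if any(address in ipaddress.ip_network(network) for network in networks):
            return kind
    return "unknown"


def preferred_proxies(proxy_ips):
    """Keep only the addresses that look residential or mobile."""
    return [ip for ip in proxy_ips if classify_ip(ip) in ("residential", "mobile")]


print(preferred_proxies(["192.0.2.10", "198.51.100.7", "203.0.113.5"]))
```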

HTTP Details

With the connection established, the next step is HTTP connection analysis.

To start, most of the natural web runs on HTTP2 and HTTP3 (that's what web browsers prefer), so any HTTP1 connection is suspicious. Most HTTP libraries still use or default to HTTP1.1, which is a dead giveaway. More modern and feature-rich clients like Python's httpx or cURL support HTTP2, though not by default.

Then, request header values and ordering can be used to identify the client. Web browser header generation is well understood and consistent, so it's on web scrapers to match it. For example, web browsers send headers like User-Agent, Origin and Referer, and in a specific order to boot.

So, make sure to use HTTP2 and match header values and ordering of a real web browser.

For more, see our full introduction to the role of request headers in blocking.
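As a rough sketch of both points, here's how HTTP2 and browser-like headers could be set up with httpx. The header values and their order only approximate a Chrome request and are our assumptions, so copy the real ones from your own browser's developer tools. Note that HTTP2 support in httpx requires installing it as `httpx[http2]`, and that the library may still add or reorder some headers itself:

```python
from typing import Dict, Optional


def browser_like_headers(referer: Optional[str] = None) -> Dict[str, str]:
    """Headers in an order approximating Chrome (values are illustrative assumptions)."""
    headers = {
        "user-agent": (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }
    if referer:
        headers["referer"] = referer
    # Decoding "br" responses in httpx needs the optional brotli extra.
    headers["accept-encoding"] = "gzip, deflate, br"
    headers["accept-language"] = "en-US,en;q=0.9"
    return headers


def fetch(url: str, referer: Optional[str] = None):
    """Fetch a page over HTTP2 and report which protocol version was negotiated."""
    import httpx  # third-party: pip install "httpx[http2]"

    with httpx.Client(http2=True, headers=browser_like_headers(referer)) as client:
        response = client.get(url)
        return response.http_version, response.status_code
```

Calling `fetch("https://example.com/")` should report `HTTP/2` as the version if the server supports it; if it reports `HTTP/1.1`, the connection is still the easy-to-spot kind.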

Javascript Fingerprinting

The final step is Javascript fingerprinting. This is a very powerful technique that can be used to identify a client based on the way it executes Javascript code.

Since the server can execute almost any arbitrary Javascript code on the client's machine, it can extract a lot of information about the client, like:

  • Javascript engine details
  • Hardware and operating system information
  • Web browser data and rendering capabilities

That's a lot of data that can be used to identify web scrapers.

To handle this, web scrapers have two approaches:

First, we could intercept the Javascript fingerprinting and feed Imperva fake data. However, this requires a lot of work and is not very reliable, as any update to the fingerprinting code will break scraping.

Alternatively, we can use a headless browser to execute the Javascript code. This is a much more reliable approach: a real browser produces genuine fingerprint values, so updates to the fingerprinting code are much less likely to break the scraper.

Headless browsers can be controlled by web scraping libraries like Puppeteer, Selenium or Playwright. These tools can be used to control a real web browser to establish a trustworthy connection with Imperva.

So, using headless browser automation with Selenium, Puppeteer or Playwright is an easy way to handle Javascript fingerprinting.

Many advanced web scraping tools can juggle between headless browser and raw HTTP connections. So, the trust score can be established using slow browser-based scraping before switching to fast HTTP requests (this feature is also available in ScrapFly).
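Here's a minimal sketch of that browser-then-HTTP handoff using Playwright's Python API. The helper names are ours, and whether the collected cookies are enough to keep a session trusted depends on the target website, so treat this as a starting point rather than a guaranteed bypass:

```python
from typing import Dict, Iterable, Mapping


def cookies_to_header(cookies: Iterable[Mapping[str, str]]) -> str:
    """Turn browser cookies (dicts with 'name' and 'value') into a Cookie header value."""
    return "; ".join(f"{cookie['name']}={cookie['value']}" for cookie in cookies)


def warm_up_session(url: str) -> Dict[str, str]:
    """Load the page in a headless browser and return headers for follow-up HTTP requests."""
    from playwright.sync_api import sync_playwright  # third-party: pip install playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        # Let the page's Javascript (including any fingerprinting) run to completion.
        page.goto(url, wait_until="networkidle")
        headers = {
            # Reuse the browser's own user agent so later requests stay consistent.
            "user-agent": page.evaluate("navigator.userAgent"),
            "cookie": cookies_to_header(context.cookies()),
        }
        browser.close()
    return headers
```

The returned headers can then be passed to a regular HTTP client. For this to hold up, the follow-up requests should go through the same IP address the browser used.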

Behavioral Analysis

Even if we address all of these detection methods, Imperva can still identify scrapers through continuous behavior analysis.

As Imperva tracks all connection details and patterns, it can use this information to constantly adjust the trust score, which can lead to blocking or captcha challenges.

So, it's important to distribute scraping through multiple agents using proxies and different fingerprint configurations.

For example, when scraping using browser automation tools, it's important to use a collection of different profiles (screen size, operating system, rendering capabilities) together with IP proxies.
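Such a collection of profiles can be sketched as follows. The viewport sizes, platform names and proxy URLs are placeholder assumptions and should be replaced with real values:

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

VIEWPORTS = [(1920, 1080), (1366, 768), (1536, 864)]
PLATFORMS = ["Windows", "macOS", "Linux"]
# Hypothetical proxy endpoints for illustration only.
PROXIES = ["http://proxy-1.example:8000", "http://proxy-2.example:8000"]


@dataclass(frozen=True)
class Profile:
    viewport: Tuple[int, int]
    platform: str
    proxy: str


def make_profiles(count: int, seed: Optional[int] = None) -> List[Profile]:
    """Create scraping agents with varied fingerprints, spread evenly across proxies."""
    rng = random.Random(seed)
    return [
        Profile(
            viewport=rng.choice(VIEWPORTS),
            platform=rng.choice(PLATFORMS),
            proxy=PROXIES[index % len(PROXIES)],
        )
        for index in range(count)
    ]


for profile in make_profiles(4, seed=1):
    print(profile)
```

Each profile should stay fixed for the lifetime of its agent. A client whose screen size or operating system changes between requests from the same IP is itself a suspicious pattern.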

How to Bypass Imperva?

We can see that there's a lot going on when it comes to Imperva's anti-bot technology. Since it uses a score-based approach, we don't necessarily need to bypass all of the detection methods perfectly. To quickly summarize, here's where scrapers can be improved to avoid detection:

  • Use high quality residential or mobile proxies
  • Use HTTP2 (or later) version for all requests
  • Match request header values and ordering of a real web browser
  • Use headless browser automation to generate Javascript fingerprints
  • Distribute web scraper traffic through multiple agents

Note that as Imperva keeps developing and improving its methods, it's important to keep up with web scraping tool and library updates. For example, see the Puppeteer stealth plugin, which keeps track of new fingerprinting techniques.

Bypass with Scrapfly

While bypassing Imperva is possible, maintaining bypass strategies is a lot of work and this is where ScrapFly can help.

Using Scrapfly web scraping API we can defer all of this complexity and bypass logic and focus on web scraping itself!

Scrapfly is not only an Imperva bypasser but also offers many other web scraping quality-of-life features.

For example, to scrape pages protected by Imperva or any other anti-scraping service using the ScrapFly SDK, all we need to do is enable the Anti Scraping Protection bypass feature:

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(key="YOUR API KEY")
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.glassdoor.com/",
    asp=True,
    # we can also enable headless browsers to render web apps and javascript powered pages
    render_js=True,
    # and set proxies by country like France
    country="FR",
    # and proxy type like residential:
    proxy_pool="residential_proxy_pool",
))
print(result.scrape_result)

FAQ

To wrap up this article, let's take a look at some frequently asked questions regarding web scraping Imperva protected pages:

Is it legal to scrape Imperva protected pages?

Yes. Scraping publicly available data is generally considered legal in most jurisdictions, as long as the scrapers do not cause damage to the website.

Is it possible to bypass Imperva using cache services?

Yes, public page caching services like Google Cache or Archive.org can sometimes be used to bypass Imperva protection as Google and Archive tend to be whitelisted. However, not all pages are cached, and the ones that are cached are often outdated, making them unsuitable for web scraping. Cached pages can also be missing parts of content that are loaded dynamically.
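For Archive.org, a cached copy can be looked up through the Wayback Machine availability API. Here's a minimal sketch using only the standard library; the helper names are ours and the response handling assumes the JSON shape the API documents (an `archived_snapshots` object with a `closest` entry):

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen


def availability_url(page_url: str) -> str:
    """Build the Wayback Machine availability API URL for a page."""
    return "https://archive.org/wayback/available?" + urlencode({"url": page_url})


def closest_snapshot(api_response: dict) -> Optional[str]:
    """Extract the closest snapshot URL from the API response, if one is available."""
    closest = api_response.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None


def find_snapshot(page_url: str) -> Optional[str]:
    """Query the API and return a snapshot URL, or None if the page is not archived."""
    with urlopen(availability_url(page_url), timeout=10) as response:
        return closest_snapshot(json.load(response))
```

If `find_snapshot()` returns a URL, the scraper can fetch that archived page instead of the live one, keeping in mind the freshness and dynamic content limitations mentioned above.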

Is it possible to bypass Imperva entirely and scrape the website directly?

Web security is a complex topic, so yes, it's possible. However, it's not advised as it can be illegal in some countries and is generally not sustainable.

What are some other anti-bot services?

There are many other anti-bot WAF services like Cloudflare, Akamai, Datadome and PerimeterX. They function very similarly to Imperva's Incapsula, so most of what's covered in this tutorial can be applied to them as well.

Summary

In this guide, we've taken a look at how to bypass Incapsula (now known as Imperva) when web scraping.

To start, we've taken a look at the detection methods Imperva is using and how we can address each one of them in our scraper code. We saw that using residential proxies and patching common fingerprinting techniques can vastly improve trust scores when it comes to Imperva's bot blocking.

Finally, we've taken a look at some frequently asked questions like alternative bypass methods and the legality of it all.

For an easier way to handle web scraper blocking and power up your web scrapers check out ScrapFly for free!
