Learn about modern web scraping protection techniques and how to bypass them. Scrape up to three times more pages by combining IP address rotation with shared IP address emulation.
Web scraping is used everywhere. From e-commerce to automotive, industries are collecting valuable data from the web to get ahead of the competition. But as web scraping grows in popularity and accessibility, websites employ ever more sophisticated techniques to block the robots.
We compare the effectiveness of plain IP address rotation and shared IP address emulation (aka session multiplexing) at bypassing the protections of Alibaba, Google and Amazon, three sites notoriously protective of their data.
Our results show that shared IP address emulation can help you bypass blocking and get significantly more pages out of each proxy before it is blocked.
What is shared IP address emulation?
Emulating shared IP address sessions relies on websites knowing that many different users can be behind a single IP address. Requests from mobile phones, for example, are usually routed through only a few IP addresses. Meanwhile, users protected by a single corporate firewall may all be using the same IP address.
You can trick websites into limiting their blocking by emulating these user sessions. Shared IP address emulation relies on managing the requests you send to websites by using cookies, authentication tokens and browser HTTP signatures that make the requests look like they’re coming from multiple users routed through the same IP address.
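To make the idea concrete, here is a minimal sketch of several emulated users sharing one proxy IP address. It uses only Node.js built-ins. Every name in it (`createSessionsForProxy`, `storeCookies`, `buildHeaders`, the user agent strings and the proxy URL) is hypothetical and chosen for illustration; it is not how the Apify SDK implements sessions.

```javascript
const crypto = require('crypto');

// Hypothetical user agent strings; a real scraper would use a larger, current list.
const USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/2.0',
    'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/3.0',
];

// One session = one emulated user: its own id, user agent and cookie jar,
// but the same proxy (and therefore the same IP address) as its siblings.
function createSessionsForProxy(proxyUrl, count) {
    return Array.from({ length: count }, (_, i) => ({
        id: crypto.randomBytes(4).toString('hex'),
        proxyUrl,
        userAgent: USER_AGENTS[i % USER_AGENTS.length],
        cookies: new Map(),
    }));
}

// Remember cookies from Set-Cookie header values ("name=value; Path=/; ...").
// Only the name=value pair is kept; attributes are ignored in this sketch.
function storeCookies(session, setCookieHeaders) {
    for (const header of setCookieHeaders) {
        const pair = header.split(';')[0];
        const eq = pair.indexOf('=');
        if (eq > 0) {
            session.cookies.set(pair.slice(0, eq).trim(), pair.slice(eq + 1).trim());
        }
    }
}

// Build the request headers that make this session look like a distinct user.
function buildHeaders(session) {
    const headers = { 'User-Agent': session.userAgent };
    if (session.cookies.size > 0) {
        headers.Cookie = [...session.cookies]
            .map(([name, value]) => `${name}=${value}`)
            .join('; ');
    }
    return headers;
}
```

Each request would then be sent through `session.proxyUrl` with `buildHeaders(session)`, and the cookies from the response would be stored back into the same session, so the website sees several consistent "users" behind a single IP address.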
Evaluation of shared IP address emulation
In this test, we ran a simple scraper that extracts a web page’s title and search result titles on randomly generated Alibaba, Google and Amazon search pages. Each run was performed using a new, free Apify account, which is allocated 30 random datacenter proxies from a shared pool.
We scraped each site first using only IP rotation and then with a fresh account using shared IP address emulation. Scraping with shared IP address emulation allowed us to scrape between two and three times more pages before being blocked.
Shared IP address emulation made simple with Apify SDK’s SessionPool
The open-source Apify SDK library for Node.js provides a toolbox for web scraping, crawling and web automation tasks. Its built-in SessionPool class enables shared IP address emulation with a few simple configuration parameters and method calls. It is easily pluggable into parts of the Apify ecosystem such as the Apify Proxy and actors but can also be used separately.
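The sketch below models the kind of bookkeeping a session pool does: handing out sessions, counting their usage and dropping the ones that look blocked. The option names `maxPoolSize`, `maxUsageCount` and `maxErrorScore` are borrowed from the SDK's session settings, but the `TinySessionPool` class itself is a hypothetical, simplified model written for this article, not the SDK's implementation.

```javascript
// Simplified, hypothetical model of a session pool (not the Apify SDK class).
class TinySessionPool {
    constructor({ maxPoolSize = 3, maxUsageCount = 2, maxErrorScore = 2 } = {}) {
        this.maxPoolSize = maxPoolSize;
        this.maxUsageCount = maxUsageCount;
        this.maxErrorScore = maxErrorScore;
        this.sessions = [];
        this.nextId = 1;
    }

    // A session is usable until it is worn out or has collected too many errors.
    isUsable(session) {
        return session.usageCount < this.maxUsageCount
            && session.errorScore < this.maxErrorScore;
    }

    // Drop unusable sessions, then create a new one while there is room,
    // otherwise reuse a random existing session.
    getSession() {
        this.sessions = this.sessions.filter((s) => this.isUsable(s));
        if (this.sessions.length < this.maxPoolSize) {
            const session = { id: `session_${this.nextId++}`, usageCount: 0, errorScore: 0 };
            this.sessions.push(session);
            return session;
        }
        return this.sessions[Math.floor(Math.random() * this.sessions.length)];
    }

    // Successful request: the session gets a little more "worn".
    markGood(session) {
        session.usageCount += 1;
    }

    // Suspicious response (for example a timeout or an unexpected status code).
    markBad(session) {
        session.usageCount += 1;
        session.errorScore += 1;
    }

    // Clearly blocked: make sure the session is never handed out again.
    retire(session) {
        session.errorScore = this.maxErrorScore;
    }
}
```

A scraper built on this model would call `getSession()` before each request, then `markGood()`, `markBad()` or `retire()` depending on the response, so that blocked identities are replaced instead of burning the whole IP address.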
The code example below shows how you can create a simple crawler that uses the Apify Proxy and shared IP address emulation with the Apify SDK. The crawler recursively crawls the Apify domain, saving the title of each page it visits.
The example uses CheerioCrawler, Apify’s framework for the parallel crawling of web pages using plain HTTP requests and the cheerio HTML parser. Cheerio is a fast, flexible and lean implementation of core jQuery designed specifically for the server. It parses markup and provides an API for traversing and manipulating the resulting data structure.
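A sketch of such a crawler might look like the following. It assumes version 1 or 2 of the `apify` npm package (the option and method names differ in other versions) and an environment with access to Apify Proxy. The pool limits and the `'Blocked'` title check are illustrative assumptions, not values or logic taken from our test.

```javascript
// Sketch only: assumes the `apify` npm package, v1 or v2.
// Session pool limits; the numbers are illustrative, not recommendations.
const sessionPoolOptions = {
    maxPoolSize: 100, // upper bound on the number of sessions kept in the pool
    sessionOptions: {
        maxUsageCount: 50, // replace a session after this many requests
    },
};

function runCrawler() {
    // Loaded lazily so the options above can be inspected without the SDK installed.
    const Apify = require('apify');

    Apify.main(async () => {
        const requestQueue = await Apify.openRequestQueue();
        await requestQueue.addRequest({ url: 'https://apify.com' });

        // Needs Apify Proxy access (for example the APIFY_PROXY_PASSWORD variable).
        const proxyConfiguration = await Apify.createProxyConfiguration();

        const crawler = new Apify.CheerioCrawler({
            requestQueue,
            proxyConfiguration,
            // Turn on the session pool and keep cookies separate per session.
            useSessionPool: true,
            persistCookiesPerSession: true,
            sessionPoolOptions,
            handlePageFunction: async ({ request, $, session }) => {
                const title = $('title').text();

                // Hypothetical block check; real sites signal blocking differently.
                if (title === 'Blocked') {
                    session.retire();
                    throw new Error(`Blocked on ${request.url}`);
                }

                // Save the page title to the default dataset.
                await Apify.pushData({ url: request.url, title });

                // Follow links that stay on the Apify domain.
                await Apify.utils.enqueueLinks({
                    $,
                    requestQueue,
                    pseudoUrls: ['https://apify.com/[.*]'],
                    baseUrl: request.loadedUrl,
                });
            },
        });

        await crawler.run();
    });
}
```

Calling `runCrawler()` after `npm install apify` starts the crawl. Retiring the session and throwing an error lets the crawler retry the same request later with a different session.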
The resulting crawler is extremely efficient.
Conclusion
Implementing shared IP address emulation with Apify SDK’s SessionPool is an easy task that can significantly reduce blocking when web scraping. It can reduce your proxy costs or simply allow you to scrape more pages.
Would you like to learn more about the Apify SDK? Check out this guide on getting started with Apify.
Feel free to let us know in the comments how this approach works for you!