Oleg Kulyk · Originally published at scrapingant.com

How to use rotating proxies with Puppeteer

Puppeteer is a high-level API for controlling headless Chrome. Most things that you can do manually in the browser can be done with Puppeteer, so it quickly became one of the most popular web scraping tools for Node.js (with ports such as Pyppeteer for Python). Many developers use it for single-page application (SPA) data extraction, as it allows executing client-side JavaScript. In this article, we are going to show how to set up a proxy in Puppeteer and how to spin up your own rotating proxy server.

Configuring a proxy in Puppeteer

To send requests to the target site via a proxy server, we just need to specify the --proxy-server launch argument with a proper proxy address, for example http://10.10.10.10:8080:

const puppeteer = require('puppeteer');

(async () => {
  // Route all browser traffic through the proxy server.
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://10.10.10.10:8080']
  });

  const page = await browser.newPage();
  await page.goto('https://httpbin.org/ip');
  // Print httpbin's response to verify which IP the site sees.
  console.log(await page.evaluate(() => document.body.innerText));
  await browser.close();
})();

As a result, httpbin should respond with JSON containing the exact proxy server address, so the code above can be used for further proxy IP address testing:

{
  "origin": "10.10.10.10"
}
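
One related note: Chrome ignores credentials embedded in the proxy URL, so if your proxy requires authentication, the usual approach is Puppeteer's page.authenticate. Here is a minimal sketch with placeholder credentials:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://10.10.10.10:8080']
  });

  const page = await browser.newPage();
  // Answers the proxy's 407 auth challenge; placeholder credentials.
  await page.authenticate({ username: 'user', password: 'pass' });
  await page.goto('https://httpbin.org/ip');
  await browser.close();
})();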

Pretty simple, isn't it? The only downside of this approach is that the defined proxy server is used for every request from browser launch, and to change the proxy server the browser has to be relaunched via puppeteer.launch with a new proxy IP address.
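
To make that concrete, here is a minimal sketch with a hypothetical proxy list. Without a dedicated rotation server, every IP change costs a full browser relaunch:

const puppeteer = require('puppeteer');

// Hypothetical proxy list, just for illustration.
const proxies = [
  'http://10.10.10.10:8080',
  'http://10.10.10.11:8080',
];

(async () => {
  for (const proxy of proxies) {
    // The whole browser has to be relaunched to switch proxies.
    const browser = await puppeteer.launch({
      args: [`--proxy-server=${proxy}`]
    });
    const page = await browser.newPage();
    await page.goto('https://httpbin.org/ip');
    console.log(await page.evaluate(() => document.body.innerText));
    await browser.close();
  }
})();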

Rotate proxy servers by your own

To avoid bans while web scraping, you need to use different proxies and rotate them. If you implement a custom IP pool, you'll have to re-launch your headless Chrome each time with new proxy server settings. So how do you rotate the proxy on each browser request?

The answer is pretty simple: you can pass each request through your own proxy rotation server! Such a tool handles proxy rotation for the browser, saving you precious time while web scraping.

To spin up a rotating proxy server, you can use the handy proxy-chain library and the ScrapingAnt free proxy list:

const ProxyChain = require('proxy-chain');

const proxies = {
  'session_1': 'http://185.126.200.167:3128',
  'session_2': 'http://116.228.227.211:443',
  'session_3': 'http://185.126.200.152:3128',
};

const server = new ProxyChain.Server({
  port: 8080,
  prepareRequestFunction: ({ request }) => {
    // Decide which proxy from the list to use for this request.
    // You can pin browser requests to one proxy via a 'session-id'
    // header, or just pick a random proxy from the list.
    const sessionId = request.headers['session-id'];
    const pool = Object.values(proxies);
    const proxy = proxies[sessionId]
        || pool[Math.floor(Math.random() * pool.length)];
    return { upstreamProxyUrl: proxy };
  }
});

server.listen(() => console.log('Rotating proxy server started.'));
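
Once the rotation server is listening, Puppeteer only needs to be launched once and pointed at it. A minimal sketch, assuming the server from above runs on the same machine:

const puppeteer = require('puppeteer');

(async () => {
  // All browser traffic goes through the local rotating proxy,
  // which picks an upstream proxy per request.
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://127.0.0.1:8080']
  });

  const page = await browser.newPage();
  await page.goto('https://httpbin.org/ip');
  console.log(await page.evaluate(() => document.body.innerText));
  await browser.close();
})();

One caveat: for HTTPS targets the browser only sends a CONNECT request to the proxy, so a custom session-id header set via page.setExtraHTTPHeaders won't reach the rotation server; random rotation is the safer default there.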

The only disadvantage of this method is that you have to maintain a bigger codebase and dive deep into networking, proxy management, and maintenance.

One API call solution

To simplify the web scraper and gain headroom while scraping at scale, you might want to get rid of the infrastructure pain and just focus on what you really want to achieve (extracting the data).

The ScrapingAnt API provides the ability to scrape the target page with only one API call. All the proxy rotation and headless Chrome rendering are already handled on the API side. You can check out how simple it is with the ScrapingAnt JavaScript client:

const ScrapingAntClient = require('@scrapingant/scrapingant-client');

const client = new ScrapingAntClient({ apiKey: '<YOUR-SCRAPINGANT-API-KEY>' });

// Check the proxy rotation
client.scrape('https://httpbin.org/ip')
    .then(res => console.log(res))
    .catch(err => console.error(err.message));

Or with a plain JavaScript request to the API (a bit more boilerplate code):

const https = require("https");

const options = {
    method: "POST",
    hostname: "api.scrapingant.com",
    path: "/v1/general",
    headers: {
        "x-api-key": "<YOUR-SCRAPINGANT-API-KEY>",
        "content-type": "application/json",
        "accept": "application/json"
    }
};

const req = https.request(options, (res) => {
    const chunks = [];

    res.on("data", (chunk) => chunks.push(chunk));

    res.on("end", () => {
        // The response body arrives in chunks; join and print it.
        const body = Buffer.concat(chunks);
        console.log(body.toString());
    });
});

req.write(JSON.stringify({
    url: 'https://httpbin.org/ip',
}));
req.end();

With the ScrapingAnt API, you can forget about any complications with IP rotation, and the built-in anti-scraping avoidance mechanisms will help you avoid detection by Cloudflare. You can use it for free: sign up at scrapingant.com to get your API token.
