
Scraping in Node.js + Cheerio made easy with ProxyCrawl

If you are new to web scraping like me, chances are you have already experienced being blocked by a certain website or being unable to bypass CAPTCHAs.

While searching for an easy way to scrape web pages without worrying too much about being blocked, I came across ProxyCrawl, which offers an easy-to-use Crawler API. The product allowed me to scrape Amazon pages smoothly and with incredible reliability.

In this article, I want to share the steps I took to build a scraper and integrate the Crawling API into my project. This simple code will scrape product reviews from a list of Amazon URLs and write the scraped data straight to a CSV file.

Preparation

For this Node project, I have used ProxyCrawl's library and Cheerio, a jQuery-like tool for the server that is commonly used in web scraping. So before starting with the actual coding, here is everything we need for this to work:

  1. A list of URLs to scrape; I have provided several examples here.
  2. A ProxyCrawl account. They have a free trial that lets you call the API free of charge for your first 1000 requests, which is perfect for our project.
  3. The ProxyCrawl Node.js library
  4. The Cheerio library from GitHub

Really, that’s it. So, without further ado, let’s start writing the code.

Coding with Node

At this point, you may already have your favorite code editor installed, but if not, I recommend installing Visual Studio Code.

To set up our project structure, please do the following:

  • Create a project folder and name it Amazon
  • Inside the folder, create a file and name it Scraper.js

Once done, go to your terminal and install the following requirements:

  • npm i proxycrawl
  • npm i cheerio

After the package installation, go to your Amazon folder and add the text file that contains the list of Amazon URLs to be scraped by our code later.
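For reference, the text file might look something like the following; these URLs are just placeholders, so swap in the actual Amazon product pages you want to scrape, one URL per line:

https://www.amazon.com/dp/B000000001
https://www.amazon.com/dp/B000000002
https://www.amazon.com/dp/B000000003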

Our project structure should now look like this:

Amazon
  Scraper.js
  Amazon-products.txt

Now that everything is set, let us start writing our code in the Scraper.js file. The following lines will load the Amazon-products.txt file into an array:

const fs = require('fs');
const file = fs.readFileSync('Amazon-products.txt');
const urls = file.toString().split('\n');
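If the text file was saved with Windows-style line endings or ends with a blank line, the split may leave empty entries or trailing carriage returns. A minimal, optional variation that guards against this (purely a defensive tweak, not required for the tutorial):

// Optional: trim each line and drop blanks, in case the file has
// trailing newlines or Windows-style \r\n line endings
const urls = file.toString()
  .split('\n')
  .map(url => url.trim())
  .filter(url => url.length > 0);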

Next, we’ll utilize the ProxyCrawl node library so we can easily integrate the crawling API into our project.

const { ProxyCrawlAPI } = require('proxycrawl');

The code below will create an instance of the API where we can place our token. Just make sure to replace the value with the normal token from your ProxyCrawl account:

const api = new ProxyCrawlAPI({ token: '_YOUR_TOKEN_' });
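If you prefer not to hardcode the token in the source file, one common alternative is to read it from an environment variable; the variable name PROXYCRAWL_TOKEN below is just an illustration, not something the library requires:

// Optional: read the token from an environment variable instead of
// hardcoding it in the script (PROXYCRAWL_TOKEN is an arbitrary name).
const api = new ProxyCrawlAPI({ token: process.env.PROXYCRAWL_TOKEN });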

After that, we can write the code that makes 10 requests to the API each second. We will use the setInterval function to crawl each of the URLs in the text file.

const requestsPerSecond = 10;
let currentIndex = 0;
const interval = setInterval(() => {
  for (let i = 0; i < requestsPerSecond && currentIndex < urls.length; i++) {
    api.get(urls[currentIndex]);
    currentIndex++;
  }
  // Stop the interval once every URL has been requested
  if (currentIndex >= urls.length) {
    clearInterval(interval);
  }
}, 1000);

At this point, we’re just loading the URLs. To do the actual scraping, we will use the Node Cheerio library and extract the reviews from the full HTML code of the webpage.

const cheerio = require('cheerio');

The next part of our code is a function which will parse the returned HTML.

function parseHtml(html) {
  // Load the HTML in Cheerio
  const $ = cheerio.load(html);
  // Select the reviews
  const reviews = $('.review');
  reviews.each((i, review) => {
    // Find the review text and strip extra whitespace
    const textReview = $(review).find('.review-text').text().replace(/\s\s+/g, '');
    console.log(textReview);
  })
}
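To get a feel for what the function does, here is a quick standalone check with a tiny, made-up snippet of markup; the class names simply mirror the selectors above, while real Amazon pages are far more complex:

// Hypothetical markup used only to exercise parseHtml locally;
// real review pages contain much more structure than this.
const sampleHtml = `
  <div class="review">
    <span class="review-text">
      Great product, works as described.
    </span>
  </div>`;
parseHtml(sampleHtml); // logs: Great product, works as described.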

This code is ready to use but will only log the results to the console. Let’s go ahead and insert a few lines to write them into a CSV file instead.

To do this, we will use the fs module that comes with Node (already required at the top of our file) and create a variable called writeStream:

const writeStream = fs.createWriteStream('Reviews.csv');
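One thing to keep in mind: by default, createWriteStream opens the file in write mode and truncates it, so every run starts Reviews.csv from scratch. If you would rather keep appending to previous results, you can pass the append flag instead (entirely optional):

// Optional: open the CSV in append mode so repeated runs keep
// previous rows instead of overwriting the file.
const writeStream = fs.createWriteStream('Reviews.csv', { flags: 'a' });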

Remember that Reviews.csv is your CSV file; you can name it whatever you want.

We’ll add a header as well:

writeStream.write(`ProductReview \n \n`);

Lastly, we’ll have to instruct our code to write the actual value to our CSV file. This line goes inside the parseHtml function, right after we log each review:

writeStream.write(`${textReview} \n \n`);
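Note that review text can contain commas, quotes, or line breaks of its own. For a quick single-column file that is usually tolerable, but if you need strictly valid CSV you could wrap each value in quotes first; the helper below is just an illustration, not part of the original script:

// Hypothetical helper: quote a value and escape embedded quotes so
// the review text stays within a single CSV field.
function csvEscape(value) {
  return `"${String(value).replace(/"/g, '""')}"`;
}

// Inside parseHtml you would then write:
// writeStream.write(`${csvEscape(textReview)}\n`);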

Now that our scraper is complete, the full code should look like this:

const fs = require('fs');
const { ProxyCrawlAPI } = require('proxycrawl');
const cheerio = require('cheerio');
const writeStream = fs.createWriteStream('Reviews.csv');

//headers
writeStream.write(`ProductReview \n \n`);

const file = fs.readFileSync('Amazon-products.txt');
const urls = file.toString().split('\n');
const api = new ProxyCrawlAPI({ token: '_YOUR_TOKEN_' });

function parseHtml(html) {
  // Load the html in cheerio
  const $ = cheerio.load(html);
  // Load the reviews
  const reviews = $('.review');
  reviews.each((i, review) => {
    // Find the text children
    const textReview = $(review).find('.review-text').text().replace(/\s\s+/g, '');
    console.log(textReview);
    // write the reviews in the csv file
    writeStream.write(`${textReview} \n \n`);
  })
}

const requestsPerSecond = 10;
let currentIndex = 0;
const interval = setInterval(() => {
  for (let i = 0; i < requestsPerSecond && currentIndex < urls.length; i++) {
    api.get(urls[currentIndex]).then(response => {
      // Make sure the response is a success
      if (response.statusCode === 200 && response.originalStatus === 200) {
        parseHtml(response.body);
      } else {
        console.log('Failed: ', response.statusCode, response.originalStatus);
      }
    });
    currentIndex++;
  }
  // Stop the interval once every URL has been requested
  if (currentIndex >= urls.length) {
    clearInterval(interval);
  }
}, 1000);
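The script above never explicitly closes the write stream; Node will flush it when the process exits, but if you want to end the file deliberately once every request has come back, one way is to count completed requests. This is a hedged sketch on top of the code above, not something the ProxyCrawl library provides:

// Optional: track finished requests so the CSV stream can be closed
// once every URL has been handled (onRequestDone is an arbitrary name).
let completed = 0;

function onRequestDone() {
  completed++;
  if (completed === urls.length) {
    writeStream.end();
    console.log('All URLs processed, Reviews.csv is ready.');
  }
}

// Call onRequestDone() at the end of the .then() callback above,
// and ideally also in a .catch() handler for failed requests.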

Result

To run your scraper, simply press F5 (if you are using Visual Studio Code) or go to your terminal and type node Scraper.js

Once the script runs, each scraped review is logged to the console and written to Reviews.csv.

I hope you’ve learned something from this guide. Just remember to sign up at ProxyCrawl to get your token and use the API to avoid getting blocked.

Feel free to utilize this code however you like 😊
