Web-scraping with NodeJS

Nitin Reddy (nitinreddy3)

Today we are going to learn how to do web scraping with NodeJS and a few other tools.
We will fetch data from a web URL with a GET request and store it in a CSV file.

The codebase is available at Node-WEbScrap

Tools and things required:

  • NodeJS
  • NPM packages
    1. request-promise - makes HTTP requests to the source URI and returns the response data (it needs the request package installed alongside it as a peer dependency)
    2. cheerio - loads and parses the HTML markup
    3. json2csv - converts the JSON data to CSV format
  • Basic knowledge of JavaScript

Let's get started with the project

  • Create a NodeJS project
   $ mkdir node-webscrap
   $ cd node-webscrap
   $ npm init
   $ npm install request-promise request cheerio json2csv
  • Create an index.js file in the root directory of your project
   $ touch index.js
  • Import the required modules in index.js
    const request = require("request-promise")
    const cheerio = require("cheerio")
    const fs = require("fs")
    const json2csv = require("json2csv").Parser;
  • Next, create an array of movie page URLs. I have used Rotten Tomatoes as the source of the movie review pages
   const movies = [
     "https://www.rottentomatoes.com/m/the_last_full_measure",
     "https://www.rottentomatoes.com/m/stray_dolls"
   ];
  • Now create a function with the code below
   const dataRepresent = async () => {
     let rottenTomatoData = []

     // Fetch and parse one movie page at a time
     for (let movie of movies) {
       const response = await request({
         uri: movie,
         headers: {
           "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
           "accept-encoding": "gzip, deflate, br",
           "accept-language": "en-US,en;q=0.9,es;q=0.8"
         },
         gzip: true,
       })

       // Load the raw HTML so it can be queried with jQuery-style selectors
       let $ = cheerio.load(response);
       let title = $("h1[class='mop-ratings-wrap__title mop-ratings-wrap__title--top']").text().trim()
       let tomatoMeterObj = $('#tomato_meter_link > .mop-ratings-wrap__percentage');
       let tomatoMeter = tomatoMeterObj && tomatoMeterObj.text().trim();
       let audMeterObj = $('.audience-score > .mop-ratings-wrap__score > .articleLink > .mop-ratings-wrap__percentage');
       let audMeter = audMeterObj && audMeterObj.text().trim();
       let summary = $('.mop-ratings-wrap__text').text().trim()

       // Collect one record per movie
       rottenTomatoData.push({
         title,
         tomatoMeter,
         audMeter,
         summary,
       });
     }

     // Convert the collected records to CSV and write them to disk
     const j2cp = new json2csv()
     const csv = j2cp.parse(rottenTomatoData);
     fs.writeFileSync('./rottenTomatoes.csv', csv, "utf-8")
   }
  • Call the function at the end of the index.js file
    dataRepresent();
  • Run index.js from the command line. The file "rottenTomatoes.csv" should be generated in the project's root directory
   $ node index.js
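
The call to dataRepresent() above returns a promise, and a single failed request would reject it before any CSV is written. Here is a minimal sketch of one way to keep going past a bad page. The names scrapeAll, scrapeOne, and stubScrapeOne are hypothetical and not part of the original code; scrapeOne stands in for the request and cheerio steps for a single movie, and the example.com URLs are placeholders.

```javascript
// Hypothetical helper: scrape each URL in turn, recording failures
// instead of aborting the whole run.
// `scrapeOne` is an assumed async function (url) => record.
async function scrapeAll(urls, scrapeOne) {
  const results = [];
  const failures = [];

  for (const url of urls) {
    try {
      results.push(await scrapeOne(url));
    } catch (err) {
      // Keep going: one bad page should not lose the rows already collected.
      failures.push({ url, message: err.message });
    }
  }

  return { results, failures };
}

// Stub used in place of a real network call so the sketch runs offline.
const stubScrapeOne = async (url) => {
  if (url.endsWith("/missing")) {
    throw new Error("404 Not Found");
  }
  return { title: url.split("/").pop() };
};

scrapeAll(
  ["https://example.com/m/one", "https://example.com/m/missing"],
  stubScrapeOne
)
  .then(({ results, failures }) => {
    console.log("scraped:", results);
    console.log("failed:", failures);
  })
  .catch((err) => {
    // Last-resort handler so an unexpected rejection is reported.
    console.error("Scrape run failed:", err);
    process.exitCode = 1;
  });
```

With this shape, the rows that were scraped successfully can still be written to the CSV file, and the failures can be logged or retried separately.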

Here we loop over the movies array and await each request in turn. Using the request-promise module, we pass the uri, the headers, and the gzip option to fetch the raw HTML. cheerio then parses that HTML so we can pull out each field with jQuery-style selectors.
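
Because the loop uses await inside for...of, the pages are fetched one after another rather than all at once. The sketch below contrasts that with a concurrent variant built on Promise.all. It is an illustration under assumptions: makeFakeFetch, fetchSequentially, and fetchConcurrently are hypothetical names, and the fake fetch only simulates a request with a timer.

```javascript
// Hypothetical stand-in for the request(...) call: resolves after `delayMs`
// with fake HTML and records start/end events in `log`.
function makeFakeFetch(log, delayMs = 20) {
  return (url) =>
    new Promise((resolve) => {
      log.push(`start ${url}`);
      setTimeout(() => {
        log.push(`end ${url}`);
        resolve(`<html>${url}</html>`);
      }, delayMs);
    });
}

// Sequential: each request waits for the previous one to finish,
// which is how the for...of loop with await behaves.
async function fetchSequentially(urls, fetchPage) {
  const pages = [];
  for (const url of urls) {
    pages.push(await fetchPage(url));
  }
  return pages;
}

// Concurrent: start every request first, then wait for all of them.
// Promise.all keeps the results in the same order as the input array.
async function fetchConcurrently(urls, fetchPage) {
  return Promise.all(urls.map((url) => fetchPage(url)));
}

async function demo() {
  const seqLog = [];
  await fetchSequentially(["a", "b"], makeFakeFetch(seqLog));
  console.log("sequential:", seqLog.join(", "));

  const conLog = [];
  await fetchConcurrently(["a", "b"], makeFakeFetch(conLog));
  console.log("concurrent:", conLog.join(", "));
}

demo();
```

In the sequential log, "end a" appears before "start b"; in the concurrent log, both starts appear before either end. Firing many requests at once can overload the target site or get the scraper blocked, so the sequential loop is a reasonable default for a short list of URLs.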

We then push each record into the "rottenTomatoData" array, convert the array to CSV with json2csv, and write it to a file named "rottenTomatoes.csv" using the fs module that NodeJS provides out of the box.
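
To make the conversion step less of a black box, here is a hand-rolled sketch of turning an array of records into CSV text and writing it with fs. It is only an illustration and not how json2csv is implemented; json2csv's actual output may differ in details such as quoting. The functions escapeCsvField and toCsv, the sample row, and the temp-file name are all made up for this example.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Quote a field following the common CSV convention (RFC 4180): wrap it in
// double quotes if it contains a comma, a quote, or a line break, and
// double any embedded quotes.
function escapeCsvField(value) {
  const text = value === null || value === undefined ? "" : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Illustration only: the header row comes from the keys of the first record.
function toCsv(rows) {
  if (rows.length === 0) return "";
  const columns = Object.keys(rows[0]);
  const lines = [columns.map(escapeCsvField).join(",")];
  for (const row of rows) {
    lines.push(columns.map((col) => escapeCsvField(row[col])).join(","));
  }
  return lines.join("\n");
}

// Sample record shaped like the entries in rottenTomatoData; values are made up.
const sampleRows = [
  {
    title: "Example Movie",
    tomatoMeter: "60%",
    audMeter: "96%",
    summary: 'A "quoted", comma-laden summary',
  },
];

const csv = toCsv(sampleRows);

// Write to the OS temp directory so the sketch does not touch the project folder.
const outFile = path.join(os.tmpdir(), "rottenTomatoes-sample.csv");
fs.writeFileSync(outFile, csv, "utf-8");

console.log(csv);
// title,tomatoMeter,audMeter,summary
// Example Movie,60%,96%,"A ""quoted"", comma-laden summary"
```

The summary field is the one most likely to contain commas and quotes, which is why a CSV library (or at least an escaping step like the one above) is safer than joining values with commas by hand.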

That's it for today. I will come back with more learnings to share with you.

Thanks for reading. Please share this with other folks, and keep learning!

Discussion

allnulled

With this tool you can reuse your node and your browser js knowledge:

dev.to/allnulled/live-web-scrappin...

Nitin Reddy Author

Let me try this as well.

allnulled

Sure. With it you can watch the scrape live, render Angular/React/Vue/xxx applications, and do asynchronous operations in both client and local environments, and of course pass data between them.

I wish I knew more about Electron... web2os has a cool interface but is rough under the hood... still, it has become the de facto solution for my small scrapes.