Building a Rotten Tomatoes Web Scraper using Node.js 🕸🍅

Introduction 💁🏻

Are you tired of scrolling through endless movie reviews on Rotten Tomatoes? Do you wish there was a faster way to find out which movies are worth watching and which ones are better skipped? Well, have no fear, because we're about to take a deep dive into the world of web scraping with Node.js and build our very own Rotten Tomatoes web scraper!

These days, I'm learning about web scrapers, so I decided to create a simple CLI tool that crawls a small section of the Rotten Tomatoes website, selects all of the movie cards on the page, and iterates over them to extract the title and score of each movie.

If you're unfamiliar with Rotten Tomatoes, it's a website that offers ratings and information about movies, TV series, and celebrities. It is a very popular website with a massive database of information, which makes it a great place to begin learning about web scraping.

As you may be aware, creating a web scraper consists of four major steps:

Crawling: Discovering the links on a website and adding them to a queue of pages to visit.
Scraping: Downloading the raw data (usually HTML) from each page.
Parsing: Transforming that raw data into a structured representation.
Storing: Saving the structured data to a file or database.
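
To make the crawling step concrete, here is a minimal sketch of a crawl queue. It assumes a hypothetical in-memory `site` object that maps each URL to the links found on that page, standing in for real HTTP requests and HTML parsing. The `crawl` function and the example paths are illustrative and are not part of the scraper built in this article.

```javascript
// Hypothetical link graph: each key is a page, each value lists the links on it.
const site = {
  '/': ['/movies', '/tv'],
  '/movies': ['/movies/1', '/'],
  '/tv': [],
  '/movies/1': ['/movies'],
};

// Breadth-first crawl: take a URL off the queue, record it, and queue
// any links on it that have not been seen yet.
function crawl(startUrl, getLinks) {
  const queue = [startUrl];
  const seen = new Set([startUrl]);
  const visited = [];

  while (queue.length > 0) {
    const url = queue.shift();
    visited.push(url);

    for (const link of getLinks(url)) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return visited;
}

const pages = crawl('/', (url) => site[url] || []);
console.log(pages);
```

The `seen` set is what keeps the crawler from looping forever when two pages link to each other.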

We'll go through all of them step by step, so buckle up and let's get started.

Prerequisites 👩🏻‍🏫

You should have a basic knowledge of crawlers and scrapers in general.

Let us begin then

First things first, you're going to need to install Node.js if you haven't already. I recommend doing this while wearing a clown nose and juggling three oranges. It's important to maintain a sense of humor when doing anything with Node.js, otherwise you might start to take yourself too seriously.

Once you've got Node.js installed, you're going to want to install a few packages using npm. Now, you might be wondering what npm stands for. Does it stand for Node Package Manager? Or maybe Notoriously Painful Middleware? Honestly, nobody knows. It's just one of those things that's best not to question.

So, let's get down to business. Open up your terminal and navigate to the directory where you want to create your project. Then type:

npm init

This will create a new Node.js project and generate a package.json file. This file contains information about your project, such as its name, version, and dependencies. You'll want to make sure to fill out all of the information in this file while wearing a silly hat. It's important to keep things light-hearted.

Now, let's install a few packages. Type the following commands in your terminal:

npm install request
npm install cheerio

The request package is used to make HTTP requests, and the cheerio package is used to parse HTML. Think of cheerio as a really fancy, high-tech blender that turns HTML into data smoothies. The script also uses fs to write files, but fs is built into Node.js, so there is nothing to install for it.
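
The request package has been deprecated since 2020, so on newer Node.js versions you may prefer the built-in `fetch` (available globally from Node.js 18). The following is a rough sketch rather than a drop-in replacement for the script in this article. `fetchHtml` and its `fetchImpl` parameter are hypothetical names, and the parameter exists only so the function can be exercised without a network connection.

```javascript
// Download a page and return its HTML as a string.
// fetchImpl defaults to the global fetch that ships with Node.js 18+.
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.text();
}

// Usage (requires network access):
// fetchHtml('https://www.rottentomatoes.com/').then((html) => console.log(html.length));
```

The returned string can be passed to cheerio.load() in the same way as the html argument that request provides.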

With those packages installed, let's get to the fun part. We're going to write some code!

Open up your favorite code editor and create a new file called scraper.js. Now, let's write some code to scrape Rotten Tomatoes. Copy and paste the following code into your scraper.js file:

const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs'); // built into Node.js, no install needed

const URL = 'https://www.rottentomatoes.com/';

request(URL, function (error, response, html) {
  // Stop early if the request failed or the server did not answer 200 OK
  if (error || response.statusCode !== 200) {
    console.error('Request failed:', error || `status ${response.statusCode}`);
    return;
  }

  // Load the HTML so it can be queried with CSS selectors
  const $ = cheerio.load(html);

  // Example: Get the title and rating of the first movie on the homepage
  const firstMovieTitle = $('.mb-movie:nth-of-type(1) h3 a').text();
  const firstMovieRating = $('.mb-movie:nth-of-type(1) .tMeterScore').text();

  console.log(`Title: ${firstMovieTitle}`);
  console.log(`Rating: ${firstMovieRating}\n`);

  // Example: Get the titles and ratings of all movies on the homepage
  const movieTitles = [];
  const movieRatings = [];

  $('.mb-movie h3 a').each(function (i, elem) {
    movieTitles[i] = $(this).text();
  });

  $('.mb-movie .tMeterScore').each(function (i, elem) {
    movieRatings[i] = $(this).text();
  });

  for (let i = 0; i < movieTitles.length; i++) {
    console.log(`Title: ${movieTitles[i]}`);
    console.log(`Rating: ${movieRatings[i]}\n`);
  }

  // Example: Save the titles and ratings of all movies on the homepage to a file
  const data = [];

  // Read title and rating from the same card so they stay paired
  $('.mb-movie').each(function (i, elem) {
    const title = $(this).find('h3 a').text();
    const rating = $(this).find('.tMeterScore').text();

    data.push({ title, rating });
  });

  fs.writeFile('movies.json', JSON.stringify(data), function (err) {
    if (err) {
      throw err;
    }

    console.log('Data saved to file.');
  });
});

This script uses the request module to make an HTTP GET request to the Rotten Tomatoes homepage and then loads the HTML response into a Cheerio instance using cheerio.load(). From there, it uses CSS selectors to extract data from the page: first the title and rating of the first movie, then the titles and ratings of all movies on the homepage. Finally, it saves those titles and ratings to a JSON file.

You can modify the script to scrape other pages on Rotten Tomatoes or extract different pieces of data by changing the CSS selectors and modifying the data handling logic.
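
As one example of changing the data handling logic, the sketch below converts scraped rating strings into numbers and sorts the movies from best to worst. It assumes ratings arrive as text such as "93%" and that a missing rating is an empty string. The helper names `parseRating` and `sortByRating` and the sample data are made up for illustration.

```javascript
// Turn a rating string such as "93%" into the number 93.
// Returns null when the text contains no digits (for example, no score yet).
function parseRating(text) {
  const match = /\d+/.exec(text);
  return match ? Number(match[0]) : null;
}

// Return a new array sorted by rating, highest first; unrated movies go last.
function sortByRating(movies) {
  return movies
    .map((movie) => ({ title: movie.title.trim(), rating: parseRating(movie.rating) }))
    .sort((a, b) => (b.rating ?? -1) - (a.rating ?? -1));
}

// Made-up sample data in the same { title, rating } shape the scraper saves.
const sample = [
  { title: ' Movie A ', rating: '72%' },
  { title: 'Movie B', rating: '' },
  { title: 'Movie C', rating: '95%' },
];

console.log(sortByRating(sample));
```

Storing the rating as a number rather than a string is what makes sorting and filtering straightforward later on.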

Now, let's run our code. In your terminal, type:

node scraper.js

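If all goes well, you should see the title and rating of each movie on the Rotten Tomatoes homepage printed to the console, followed by the message "Data saved to file.", and a new movies.json file should appear in your project directory. Congratulations, you just scraped a website using Node.js!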
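
To double-check what was saved, a small script along these lines can read the file back and summarize it. This is a sketch that assumes movies.json holds an array of `{ title, rating }` objects as written by the script above. `loadMovies` is a hypothetical helper name.

```javascript
const fs = require('fs');

// Read a JSON file written by the scraper and return the parsed array.
function loadMovies(filePath) {
  const raw = fs.readFileSync(filePath, 'utf8');
  const movies = JSON.parse(raw);
  if (!Array.isArray(movies)) {
    throw new Error(`${filePath} does not contain a JSON array`);
  }
  return movies;
}

// Only run the summary when the file is present.
if (fs.existsSync('movies.json')) {
  const movies = loadMovies('movies.json');
  console.log(`Loaded ${movies.length} movies`);
}
```

If the count printed here is zero, the request probably succeeded but the CSS selectors matched nothing on the page.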

Of course, this is just the beginning. There's so much more you can do with web scraping and Node.js. You could scrape reviews for a specific movie, or even build a web app that displays the latest Rotten Tomatoes scores for all the movies currently in theaters. The possibilities are endless.

Just remember to keep things fun and lighthearted. Node.js can be a serious business, but it's important not to take yourself too seriously. After all, we're just a bunch of silly humans trying to make sense of a chaotic digital world. So put on your clown nose, juggle those oranges, and let's build something amazing!

Conclusion 💭

So there you have it, folks: building a Rotten Tomatoes web scraper using Node.js. It's easy, it's fun, and it's a great way to learn how to use Node.js to automate web tasks. And if you're lucky, you might even have a feline coding partner to help you out.

Warning

But before you go off and start building your own web scraper, there are a few things you should keep in mind. First of all, make sure you're not violating any terms of service or copyright laws. Scraping websites without permission can get you into legal trouble, so be sure to check the website's policies before you start scraping.

Additionally, web scraping can put a lot of strain on a website's servers. If you're making too many requests too quickly, you could cause the website to crash or slow down. Make sure you're being respectful of the website's resources and following best practices for web scraping.
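
One common way to stay gentle on a server is to fetch pages one at a time with a pause between requests. The sketch below shows the idea. `fetchPage` is a placeholder for whatever function actually downloads a page, and the one-second default delay is an arbitrary assumption, not a figure published by Rotten Tomatoes.

```javascript
// Resolve after the given number of milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Fetch URLs sequentially, waiting delayMs between consecutive requests.
async function fetchPolitely(urls, fetchPage, delayMs = 1000) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) {
      await sleep(delayMs);
    }
    results.push(await fetchPage(urls[i]));
  }
  return results;
}

// Demo with a stand-in fetchPage that makes no network requests.
fetchPolitely(['/a', '/b'], async (url) => `page for ${url}`, 50)
  .then((pages) => console.log(pages));
```

Because each request waits for the previous one to finish, the server never sees more than one request from the scraper at a time.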

Finally, remember that web scraping is not foolproof. Websites can change their HTML structure or add anti-scraping measures at any time, which could break your scraper. Make sure you're keeping an eye on your scraper and updating it as needed to keep up with any changes.
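
A cheap safeguard is to check the scraped data before saving it, so that a changed page layout shows up as a clear error instead of an empty file. The following sketch assumes the `{ title, rating }` shape used in this article. `validateMovies` is a hypothetical helper.

```javascript
// Throw a descriptive error if the scraped data looks wrong.
function validateMovies(movies) {
  if (!Array.isArray(movies) || movies.length === 0) {
    throw new Error('No movies found: the CSS selectors may be out of date');
  }
  const untitled = movies.filter((movie) => !movie.title || movie.title.trim() === '');
  if (untitled.length > 0) {
    throw new Error(`${untitled.length} movie(s) have no title: check the title selector`);
  }
  return movies;
}

// Example: an empty result is reported instead of silently saved.
try {
  validateMovies([]);
} catch (err) {
  console.error(err.message);
}
```

Calling a check like this just before fs.writeFile() would stop the scraper from overwriting a good movies.json with an empty one.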

In conclusion, building a Rotten Tomatoes web scraper using Node.js can be a fun and educational experience. Just be sure to follow best practices, respect the website's policies, and don't let your cat take over your coding too often. Happy scraping!

GitHub repo for the article 💻

https://github.com/devangtomar/nodejs-scrapper

Let's connect and chat! Open to anything under the sun 🏖🍹

🐦 Twitter : devangtomar7

🔗 LinkedIn : devangtomar

📚 Stack Overflow : devangtomar

🖼 Instagram : be_ayushmann

Medium : Devang Tomar

Hashnode : devangtomar

🧑💻 Dev.to : devangtomar