loading...

Introduction to web scraping with Node.js

aurelkurtula profile image aurel kurtula Updated on ・4 min read

For a long time when ever I wanted to try and create websites for practice I would visit a website, open the console and try to get the content I needed - all this to avoid using lorem ipsum, which I absolutely hate.

Few months a go I heard of web scraping, hey better late the never right? And it seems to do a similar thing to what I tried to do manually.

Today I'm going to explain how to web scrape with Node.

Setting up

We'll be using three packages to accomplish this.

  • Axios is a "promise based HTTP client for the browser and node.js" and we'll use it to get html from any chosen website.
  • Cheerio is like jQuery but for the server. We'll use it as a way to pick content from the Axios results.
  • fs is a node module which we'll use to write the fetched content into a JSON file.

Let's start setting up the project. First create a folder, then cd to it in the terminal.

To initialise the project just run npm init and follow the steps (you can just hit enter to everything). When the initial setup is complete you'll have created a package.json file.

Now we need to install the two packages we listed above

npm install --save axios cheerio

(Remember fs is already part of node, we do not need to install anything for it)

You'll see that the above packages are installed under node_modules directory, they are also listed inside the package.json file.

Get the content from a dev.to

Your dev.to profile is at https://dev.to/<username>. Our mission is to get the posts we've written and store them in a JSON file, as you see below:

Create a JavaScript file in your project folder, call it devtoList.js if you like.

First require the packages we installed

let axios = require('axios');
let cheerio = require('cheerio');
let fs = require('fs'); 

Now lets get the contents from dev.to

axios.get('https://dev.to/aurelkurtula')
    .then((response) => {
        if(response.status === 200) {
        const html = response.data;
            const $ = cheerio.load(html); 
    }
    }, (error) => console.log(err) );

In the first line we get the contents from the specified URL. As already stated, axios is promise based, then we check if the response was correct, and get the data.

If you console log response.data you'll see the html markup from the url. Then we load that HTML into cheerio (jQuery would do this for us behind the scenes). To drive the point home let's replace response.data with hard-coded html

const html = '<h3 class="title">I have a bunch of questions on how to behave when contributing to open source</h3>'
const h3 = cheerio.load(html)
console.log(h3.text())

That returns the string without the h3 tag.

Select the content

At this point you would open the console on the website you want to scrape and find the content you need. Here it is:

From the above we know that every article has the class of single-article, The title is an h3 tag and the tags are inside a tags class.

axios.get('https://dev.to/aurelkurtula')
    .then((response) => {
        if(response.status === 200) {
            const html = response.data;
            const $ = cheerio.load(html); 
            let devtoList = [];
            $('.single-article').each(function(i, elem) {
                devtoList[i] = {
                    title: $(this).find('h3').text().trim(),
                    url: $(this).children('.index-article-link').attr('href'),
                    tags: $(this).find('.tags').text().split('#')
                          .map(tag =>tag.trim())
                          .filter(function(n){ return n != "" })
                }      
            });
    }
}, (error) => console.log(err) );

The above code is very easy to read, especially if we refer to the screenshot above. We loop through each node with the class of .single-article. Then we find the only h3, we get the text from it and just trim() the redundant white space. Then the url is just as simple, we get the href from the relevant anchor tag.

Getting the tags is just simple really. We first get them all as a string (#tag1 #tag2) then we split that string (whenever # appears) into an array. Finally we map through each value in the array just to trim() the white space, finally we filter out the any empty values (mostly caused by the trimming).

The declaration of an empty array (let devtoList = []) outside the loop allows us to populate it from within.

That would be it. The devtoList array object has the data we scraped from the website. Now we just want to store this data into a JSON file so that we can use it elsewhere.

axios.get('https://dev.to/aurelkurtula')
    .then((response) => {
        if(response.status === 200) {
            const html = response.data;
            const $ = cheerio.load(html); 
            let devtoList = [];
            $('.single-article').each(function(i, elem) {
                devtoList[i] = {
                    title: $(this).find('h3').text().trim(),
                    url: $(this).children('.index-article-link').attr('href'),
                    tags: $(this).find('.tags').text().split('#')
                          .map(tag =>tag.trim())
                          .filter(function(n){ return n != "" })
                }      
            });
            const devtoListTrimmed = devtoList.filter(n => n != undefined )
            fs.writeFile('devtoList.json', 
                          JSON.stringify(devtoListTrimmed, null, 4), 
                          (err)=> console.log('File successfully written!'))
    }
}, (error) => console.log(err) );

The original devtoList array object might have empty values, so we just trim them away, then we use the fs module to write to a file (above I named it devtoList.json, the content of which the array object converted into JSON.

And that's all it takes!

The code above can be found in github.

Along with scraping dev.to using the above code, I've also scraped books from goodreads and movies from IMDB, the code for which is in the repository.

Posted on Jan 28 '18 by:

aurelkurtula profile

aurel kurtula

@aurelkurtula

I love JavaScript, reading books, drinking coffee and taking notes.

Discussion

markdown guide
 

Great tutorial! Really happy seeing this in Node.js on top of all the Python tuts out there on scraping.

I'd love to see a series of this too - maybe covering topics like how to do pagination, scraping web pages that are using AJAX, etc. Thank for sharing!

 

Thanks Alex

maybe covering topics like how to do pagination, scraping web pages that are using AJAX

Great idea. I can imagine the pagination being kind of easy (though manually changing the page urls). It would involve chaining axios promises/calls and refactoring the same code to keep it DRY.

Scraping Ajax pages, I want to say it can't be done but I have no idea, I'll have to research it. It be cool though

 

AJAX pagination is actually pretty simple. You don't need Cheerio then, since the API already responding in JSON 😂

 

have you ever tried to scrap data then visualize it with only javascript? using svg or d3.js ?
I think I'm gonna try it

 

I'm in the middle of a project like this right now killed-by-police-data.herokuapp.com/ I wish I saw this article before I started though. Did a bunch of crap go try to manually scrape the data. I might rebuild using this though.

 

that's terrifying man , but cool, I intend to create things like that

 

No I haven't, but it's in my to-try list now

 

cool, share it with us when you do :D

 

Funny to bump into your post and to realise that I went to the exact same steps a few months ago!

I didn't write a blog post (though I should have, because it really helps) but after some time playing with Axios and Cheerio and having to face more complex use cases, I eventually decided to create my own library: github.com/mawrkus/jason-the-miner
It's modular, has simple yet powerful schema definitions (including following/paginating at any level) and is extensible.

My experience developing Jason was (and still is) fun, challenging and full of surprises... Starting to scrape is really easy but can get complicated really fast ("What? This is not the content I see with my browser!" "Ach! they blocked my IP, I need a Ninja HTTP client!"), which makes this kind of project, a perfect way to learn Node.js.

 

Do you know of any alternatives for scraping sites that are dynamic/SPA's? I've heard that pupeteer github.com/GoogleChrome/puppeteer may be good for that?

 

When @alexadusei asked I guessed it might not be doable to scrape dynamic content :). But now that I see that API (it says that you can "Crawl a SPA and generate pre-rendered content") I'll definitely try to figure this out

 

Yeah, very handy stuff. One technique people use (scraping AJAX is actually easier than regular scraping!) is using Google Developer Tools and going to the Network tab to see what external API calls the page is using. Then you can grab the information from there, plus more!

 

Cool post!

But if you have it like this:
(error) => console.log(err)

err wont be defined! ;-)

 

Recently I knew the osmosis. I like cheerio, but after I've been working with osmosis, I really think that is more better than cheerio. In my opinion. :)

 

Love it, super useful!!! Thanks a lot!!!!

 
 

Awesome tutorial. Thanks!
Is there an easy way to automate this so when you publish new posts the changes will be reflected in your app?

 

Yes, you could set a timer when you want the scrapping to happen. For example if I wanted to scrape my articled I have it run Thursday and Sunday. If I was scraping twitter I'd have it run every 15 minutes (running it every second would be costly).

If I remember correctly this is how ifttt tasks work. In fact I know because I tried it years back, they don't update when the new content is published but every how ever many minutes/hours.