DEV Community

Introduction to web scraping with Node.js

aurel kurtula on January 28, 2018

For a long time, whenever I wanted to try and create websites for practice, I would visit a website, open the console and try to get the content I n...

Alex Adusei • Jan 29 '18

Great tutorial! Really happy seeing this in Node.js on top of all the Python tuts out there on scraping.

I'd love to see a series on this too, maybe covering topics like how to do pagination, scraping web pages that are using AJAX, etc. Thanks for sharing!

aurel kurtula • Jan 29 '18

Thanks Alex

maybe covering topics like how to do pagination, scraping web pages that are using AJAX

Great idea. I can imagine the pagination being kind of easy (though it means manually changing the page URLs). It would involve chaining axios promises/calls and refactoring the same code to keep it DRY.
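
A minimal sketch of that idea, under stated assumptions: the `?page=N` URL pattern and the names `pageUrl`, `scrapePages`, `fetchHtml` and `parseItems` are hypothetical and not from the original post. The HTTP call and the parser are passed in, so the loop stays the same whichever client is used.

```javascript
// Hypothetical helper: builds each page URL by changing the page number.
const pageUrl = (baseUrl, page) => `${baseUrl}?page=${page}`;

// fetchHtml is injected so the loop works with any HTTP client.
// With axios it could be: (url) => axios.get(url).then((res) => res.data)
// parseItems turns one page of HTML into an array (for example with cheerio).
async function scrapePages(baseUrl, lastPage, fetchHtml, parseItems) {
  const items = [];
  for (let page = 1; page <= lastPage; page += 1) {
    // Awaiting inside the loop chains the requests one after another.
    const html = await fetchHtml(pageUrl(baseUrl, page));
    items.push(...parseItems(html));
  }
  return items;
}
```

Keeping the fetch and parse steps as parameters is one way to reuse the same code for every page instead of copying it.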

As for scraping Ajax pages, I want to say it can't be done, but I have no idea; I'll have to research it. It'd be cool though.

Jithesh. KT • Feb 18 '18

AJAX pagination is actually pretty simple. You don't need Cheerio then, since the API is already responding in JSON 😂
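
A rough sketch of what that could look like. The endpoint shape is assumed for illustration (a `page` query parameter, and an array response that comes back empty past the last page), and `fetchAllPages` and `fetchJson` are hypothetical names. Real APIs often paginate differently, for example with cursors or offsets.

```javascript
// fetchJson is injected; in Node 18+ it could be:
//   (url) => fetch(url).then((res) => res.json())
// maxPages is a safety cap so a misbehaving API cannot loop forever.
async function fetchAllPages(endpoint, fetchJson, maxPages = 50) {
  const all = [];
  for (let page = 1; page <= maxPages; page += 1) {
    const batch = await fetchJson(`${endpoint}?page=${page}`);
    // Assumed convention: an empty (or non-array) response means no more data.
    if (!Array.isArray(batch) || batch.length === 0) break;
    all.push(...batch);
  }
  return all;
}
```

No HTML parsing is involved here, since each response is already structured data.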

Belhassen Chelbi • Jan 28 '18

Have you ever tried to scrape data and then visualize it with only JavaScript, using SVG or d3.js?
I think I'm gonna try it

Peter Nguyen • Jan 29 '18

I'm in the middle of a project like this right now: killed-by-police-data.herokuapp.com/. I wish I had seen this article before I started, though. I did a bunch of crap to try to manually scrape the data. I might rebuild using this.

Belhassen Chelbi • Jan 31 '18

That's terrifying, man, but cool. I intend to create things like that.

aurel kurtula • Jan 29 '18 • Edited

No, I haven't, but it's on my to-try list now.

Belhassen Chelbi • Jan 31 '18

cool, share it with us when you do :D

mawrkus • Apr 19 '18

Funny to bump into your post and to realise that I went through the exact same steps a few months ago!

I didn't write a blog post (though I should have, because it really helps) but after some time playing with Axios and Cheerio and having to face more complex use cases, I eventually decided to create my own library: github.com/mawrkus/jason-the-miner
It's modular, has simple yet powerful schema definitions (including following/paginating at any level) and is extensible.

My experience developing Jason was (and still is) fun, challenging and full of surprises... Starting to scrape is really easy but can get complicated really fast ("What? This is not the content I see with my browser!" "Ach! They blocked my IP, I need a Ninja HTTP client!"), which makes this kind of project a perfect way to learn Node.js.

Martin Nordström • May 8 '18

Cool post!

But if you have it like this:
(error) => console.log(err)

err won't be defined! ;-)
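
To illustrate the point: the handler's parameter is named `error`, so the body has to use that same name. A small sketch, where `failingRequest` and `logError` are hypothetical stand-ins (a rejected axios call would reach the handler in a similar way):

```javascript
// Stand-in for a request that fails (hypothetical, no network involved).
const failingRequest = () => Promise.reject(new Error("request failed"));

// Buggy version: the parameter is `error`, so `err` is not defined and a
// ReferenceError is thrown inside the handler.
// failingRequest().catch((error) => console.log(err));

// Fixed version: log the same name the handler receives.
const logError = (error) => {
  console.log(error.message);
  return error.message;
};

failingRequest().catch(logError);
```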

crazy4groovy • Jan 29 '18

Do you know of any alternatives for scraping sites that are dynamic/SPAs? I've heard that puppeteer (github.com/GoogleChrome/puppeteer) may be good for that?

aurel kurtula • Jan 29 '18

When @alexadusei asked I guessed it might not be doable to scrape dynamic content :). But now that I see that API (it says that you can "Crawl a SPA and generate pre-rendered content") I'll definitely try to figure this out

Alex Adusei • Jan 29 '18

Yeah, very handy stuff. One technique people use (scraping AJAX is actually easier than regular scraping!) is using Google Developer Tools and going to the Network tab to see what external API calls the page is using. Then you can grab the information from there, plus more!

aurel kurtula • Jan 29 '18

Aha, that's clever.

Tiago Celestino • Jan 30 '18

I recently discovered osmosis. I like cheerio, but after working with osmosis, I really think it is better than cheerio. In my opinion. :)

Cleyton Chagas • Apr 24 '20

Thanks, nice post!!!

Aaron • Jan 30 '18

Awesome tutorial. Thanks!
Is there an easy way to automate this so when you publish new posts the changes will be reflected in your app?

aurel kurtula • Jan 30 '18

Yes, you could set a timer for when you want the scraping to happen. For example, if I wanted to scrape my articles I'd have it run on Thursday and Sunday. If I was scraping Twitter I'd have it run every 15 minutes (running it every second would be costly).

If I remember correctly, this is how IFTTT tasks work. In fact I know because I tried it years back: they don't update when the new content is published, but every however many minutes/hours.
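
A minimal sketch of the timer approach using Node's built-in `setInterval`. The name `startPolling` is hypothetical, and the 15-minute default simply mirrors the example above; `task` is assumed to be whatever function performs the scrape.

```javascript
const FIFTEEN_MINUTES = 15 * 60 * 1000;

// Runs `task` every `intervalMs` milliseconds and returns a stop function.
function startPolling(task, intervalMs = FIFTEEN_MINUTES) {
  const timer = setInterval(() => {
    // Catch failures so one bad run does not prevent later ones.
    Promise.resolve()
      .then(task)
      .catch((error) => console.error("scrape failed:", error.message));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Calling `const stop = startPolling(scrape)` would start the loop and `stop()` would end it. For fixed days such as Thursday and Sunday, a system scheduler like cron may be a better fit than a long-running timer.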

Juan G De Jesus Torres • Nov 19 '19

Love it, super useful!!! Thanks a lot!!!!

Sina Maleki • Aug 2 '20

nice