Prerequisites: Know a little bit about JavaScript.
Today's topic: extracting data from a static website, then structuring that data into a database or a file on your computer, or even using it for something else entirely.
Introduction to Fetch Crawler (Node JS)
Fetch Crawler is designed to provide a basic, flexible and robust API for crawling websites.
The crawler provides simple APIs to crawl static websites with the following features:
- Distributed crawling
- Configurable parallelism, retries, max requests, and time between requests (to avoid being blocked by the website) ...
- Support for both depth-first and breadth-first search algorithms
- Stop after a maximum number of requests has been executed
- Cheerio injected automatically for scraping
- Promise support
Complete documentation is available on GitHub: https://github.com/viclafouch/Fetch-Crawler
The specificity of Fetch Crawler is that it manages requests in parallel (for example, 10 requests at the same time rather than one by one), which saves a significant amount of time.
In other words, the library does everything for you; you just have to configure the various options.
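For example, a minimal configuration sketch using only the options that appear later in this article (parallel, maxRequest, maxDepth) could look like this; the values are purely illustrative:
const FetchCrawler = require('@viclafouch/fetch-crawler')

FetchCrawler.launch({
  url: 'https://github.com',
  // 10 requests at the same time instead of one by one (illustrative value)
  parallel: 10,
  // Stop after 100 requests in total (illustrative value)
  maxRequest: 100,
  // Assumed here to limit how far the crawler follows links from the start page
  maxDepth: 3
})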
Step by step:
First, install the required dependency:
# npm i @viclafouch/fetch-crawler
Then, import the module in your js file and use the launch method of FetchCrawler. The only required parameter is the URL of your website (or page), here https://github.com.
const FetchCrawler = require('@viclafouch/fetch-crawler')
FetchCrawler.launch({
url: 'https://github.com'
})
And then run:
# node example-crawl.js
If you run this file with Node JS, it will work, but nothing visible will happen until the crawler has finished, because we have not yet told it what to do with the pages it crawls.
Let's now move on to the basic options and methods to be used to extract data from the website (documentation):
const FetchCrawler = require('@viclafouch/fetch-crawler')
// $ = Cheerio, used to get the content of the page
// See https://cheerio.js.org
const collectContent = $ =>
$('body')
.find('h1')
.text()
.trim()
// After getting content of the page, do what you want :)
// Accept async function
const doSomethingWith = (content, url) => console.log(`Here is the title '${content}' from ${url}`)
// Here I start my crawler
// You can await it if you want
FetchCrawler.launch({
url: 'https://github.com',
evaluatePage: $ => collectContent($),
onSuccess: ({ result, url }) => doSomethingWith(result, url),
onError: ({ error, url }) => console.log('Whouaa something wrong happened :('),
maxRequest: 20
})
Okay, let's review the new methods and options included above.
evaluatePage: Function for traversing/manipulating the content of the page. Cheerio is provided to parse the markup, and it offers a robust API for that. With it, you can build a specialized function to extract exactly the pieces of data you want from the webpage.
onSuccess: Called when evaluatePage succeeds. What do you want to do with the result? Whatever you want (add it to a database? Write the data to a file? etc.).
onError: A callback called if evaluatePage fails.
maxRequest: The maximum number of requests you allow your crawler to execute. Pass -1 to disable the limit. For the example above, we want the crawler to stop after 20 requests (even if some of them fail).
For the rest of the configuration, you can find the documentation here.
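To make these options concrete, here is a minimal sketch that appends each extracted title to a local file and logs failing URLs (the 'results.txt' filename and the extracted h1 are just assumptions for this illustration):
const fs = require('fs')
const FetchCrawler = require('@viclafouch/fetch-crawler')

FetchCrawler.launch({
  url: 'https://github.com',
  // Extract the first h1 of each crawled page with Cheerio
  evaluatePage: $ => $('h1').first().text().trim(),
  // Append each result to a local file ('results.txt' is an arbitrary name)
  onSuccess: ({ result, url }) => fs.appendFileSync('results.txt', `${url} -> ${result}\n`),
  // Log the failing URL and keep crawling
  onError: ({ error, url }) => console.error(`Something went wrong on ${url}`, error),
  maxRequest: 20
})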
Hands-on example:
Let's take the example of a video game website: Instant Gaming
Our objective: retrieve the data for the video games (on Xbox) on sale on the website and compile it into a JSON file. The file can then be reused in other projects (for example, a Chrome extension that displays this list in real time).
This is what our file example-crawl.js contains.
const fs = require('fs')
const FetchCrawler = require('@viclafouch/fetch-crawler')
// Get all games on xbox platform
const urlToCrawl = 'https://www.instant-gaming.com/en/search/?type%5B0%5D=xbox'
let games = []
// I'm getting an array of each game on the page (name, price, cover, discount)
const collectContent = $ => {
const content = []
$('.item.mainshadow').each(function(i, elem) {
content.push({
name: $(this)
.find($('.name'))
.text()
.trim(),
price: $(this)
.find($('.price'))
.text()
.trim(),
discount: $(this)
.find($('.discount'))
.text()
.trim(),
cover: $(this)
.find($('.picture'))
.attr('src')
})
})
return content
}
// Only follow paginated Xbox search URLs (checked via their query parameters)
const checkUrl = url => {
try {
const link = new URL(url)
if (link.searchParams.get('type[0]') === 'xbox' && link.searchParams.get('page')) {
return url
}
return false
} catch (error) {
return false
}
}
// Concat my new games to my array
const doSomethingWith = content => (games = games.concat(content))
// Await for the crawler, and then save result in a JSON file
;(async () => {
try {
await FetchCrawler.launch({
url: urlToCrawl,
evaluatePage: $ => collectContent($),
onSuccess: ({ result, url }) => doSomethingWith(result, url),
preRequest: url => checkUrl(url),
maxDepth: 4,
parallel: 6
})
const jsonResult = JSON.stringify({ ...games }, null, 2)
await fs.promises.writeFile('examples/example_4.json', jsonResult)
} catch (error) {
console.error(error)
}
})()
All we have to do now is start our crawler and wait a few seconds.
# node example-crawl.js
Here is the JSON file we get: https://github.com/viclafouch/Fetch-Crawler/blob/master/examples/example_4.json
As you can see, we get super clean data in our JSON file. Obviously, the data on the website changes regularly, so we could simply re-run our crawler every 24 hours.
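If you want to keep the file up to date automatically, one simple approach (just a sketch, assuming the crawl above is wrapped in a hypothetical crawlXboxGames function) is to re-run it on a timer:
// Hypothetical wrapper around the crawl code from example-crawl.js
const crawlXboxGames = async () => {
  // ... launch FetchCrawler and write the JSON file, as shown above
}

// Run once now, then again every 24 hours
const DAY_IN_MS = 24 * 60 * 60 * 1000
crawlXboxGames()
setInterval(crawlXboxGames, DAY_IN_MS)
A cron job would work just as well if you prefer to keep the script itself simple.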
To learn more about the Fetch Crawler package, feel free to check out the documentation.
...
Thanks for reading.
Feel free to contribute with me on this package :)
I built this package because I needed it for a project for Google, where extracting the data was pretty difficult.