DEV Community

Siddhesh Mangela

Posted on Nov 30, 2020

Deno Web Scraper

#deno #typescript #webscrapper #fetch

You might have built a web scraper with a Node.js + request + cheerio setup, or a Python one using Beautiful Soup. This tutorial brings the same idea to the world of Deno.

In this example, we are scraping the list of books from

http://books.toscrape.com/

Let's get started, without further ado.

Step 01: app.ts

To start, we will create an app.ts file and wrap the whole code in a try-catch block. Deno supports top-level await, so await can be used directly inside the block without an enclosing async function.

const url = 'http://books.toscrape.com/';

try {
  console.log(url);
} catch(error) {
  console.log(error);
}

Check that the code logs the URL by running the following command in the terminal:

deno run app.ts

Step 02: Fetch URL

Deno supports many web-standard JavaScript APIs natively. The Fetch API is one of them, which makes request handling easy and dependency-free. The response body is read as text and saved in a variable named html.

const url = 'http://books.toscrape.com/';

try {
  const res = await fetch(url);
  const html = await res.text();

  console.log(html);
} catch(error) {
  console.log(error);
}

Deno is secure by default, which means that to let the script access the network we need to run it with the --allow-net flag.

Check that the code logs the HTML by running the following command in the terminal:

deno run --allow-net app.ts
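
The fetch call above resolves even when the server answers with an error status such as 404, so the script could end up parsing an error page. Below is a minimal sketch of one way to guard against that. The helper name fetchHtml and its fetchFn parameter are hypothetical and not part of the original script; fetchFn exists only so the helper can be tried without network access.

```typescript
// Hypothetical helper: fetch a page and fail loudly on non-2xx responses.
// `fetchFn` defaults to the global fetch; it is injectable only so the
// helper can be exercised with a fake in place of the real network.
async function fetchHtml(
  url: string,
  fetchFn: (input: string) => Promise<Response> = fetch,
): Promise<string> {
  const res = await fetchFn(url);
  if (!res.ok) {
    // res.ok is true only for status codes in the 200-299 range
    throw new Error(`Request to ${url} failed with status ${res.status}`);
  }
  return await res.text();
}
```

Inside the try block, the two fetch lines could then become const html = await fetchHtml(url); and a failed request would land in the existing catch block.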

Step 03: Deno DOM

Deno DOM makes it easy to traverse HTML using the familiar JavaScript DOM methods.

The HTML text that we get from fetch is parsed with DOMParser, and the resulting document is stored in the variable doc. The doc variable is then queried to extract the page heading from the target site.

import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';

const url = 'http://books.toscrape.com/';

try {
  const res = await fetch(url);
  const html = await res.text();
  const doc: any = new DOMParser().parseFromString(html, 'text/html');

  const pageHeader = doc.querySelector('.header').querySelector('.h1').textContent;

  console.log(pageHeader);
} catch(error) {
  console.log(error);
}

Check that the code logs “Books to Scrape We love being scraped!” by running the following command in the terminal:

deno run --allow-net app.ts
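
Note that querySelector returns null when nothing matches, so the chained calls above throw a TypeError if the page markup changes. The following is a small sketch of a null-safe alternative. The NodeLike interface and the textAt helper are hypothetical names introduced for illustration, and the sketch assumes Deno DOM's document and elements fit this minimal shape.

```typescript
// Minimal structural type: anything with a DOM-style querySelector.
// Assumption: Deno DOM's document and elements satisfy this shape.
interface NodeLike {
  textContent: string | null;
  querySelector(selector: string): NodeLike | null;
}

// Follow a chain of selectors and return the trimmed text of the final
// match, or null as soon as any selector in the chain matches nothing.
function textAt(root: NodeLike | null, ...selectors: string[]): string | null {
  let node: NodeLike | null = root;
  for (const selector of selectors) {
    if (node === null) return null;
    node = node.querySelector(selector);
  }
  return node?.textContent?.trim() ?? null;
}
```

With this in place, the heading lookup could be written as textAt(doc, '.header', '.h1'), which yields null instead of throwing when an element is missing.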

Bringing it all together

The script collects the book info by looping over each .product_pod container on the first page and pushing the title, price, and availability of each book into the books array.

import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';

const url = 'http://books.toscrape.com/';

try {
  const res = await fetch(url);
  const html = await res.text();
  const doc: any = new DOMParser().parseFromString(html, 'text/html');
  const books: any = [];

  const productsPods = doc.querySelectorAll('.product_pod');

  productsPods.forEach((product: any) => {
    const title = product.querySelector('h3').querySelector('a').getAttribute('title');
    const price = product.querySelector('.price_color').textContent;
    const availability = product.querySelector('.availability').textContent.trim();

    books.push({
      title,
      price,
      availability,
    });
  });

  console.log(books);
} catch(error) {
  console.log(error);
}
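
The books array above is typed as any and keeps the price as a raw string. As a possible next step, here is a sketch of a typed record with a numeric price. The Book interface and the parsePrice and toBook helpers are hypothetical additions, and the sketch assumes price labels consist of a currency symbol followed by a decimal number (for example "£51.77").

```typescript
// Hypothetical record shape for one scraped book.
interface Book {
  title: string;
  price: number; // numeric amount with the currency symbol stripped
  availability: string;
}

// Pull the first decimal number out of a price label such as "£51.77".
// Throws if the label contains no digits at all.
function parsePrice(label: string): number {
  const match = label.match(/\d+(?:\.\d+)?/);
  if (match === null) {
    throw new Error(`No numeric price found in "${label}"`);
  }
  return Number(match[0]);
}

// Build a Book from the three strings the scraper extracts per product.
function toBook(title: string, priceLabel: string, availability: string): Book {
  return {
    title,
    price: parsePrice(priceLabel),
    availability: availability.trim(),
  };
}
```

In the loop, books could then be declared as const books: Book[] = []; and filled with books.push(toBook(title, price, availability));, which makes it possible to sort or filter by price later.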