
Building a crawler

So I haven't written in a while. It's a slow day, so I thought: how about we write a web crawler tutorial?

This code can easily be ported to plain JS, and cheerio is a Node.js replacement for deno_dom.
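
To give a feel for that swap: the deno_dom parsing we do later in the cataloguer maps roughly onto cheerio like this. This is just a sketch for Node users and isn't part of the repo (it assumes cheerio is installed from npm).

// Node.js sketch only, not part of smooth_crawl.
// Roughly what the deno_dom link extraction later in this post looks like with cheerio.
import * as cheerio from "cheerio";

export function extract_hrefs(html: string): string[] {

    const $ = cheerio.load(html);                  // parse the markup
    const hrefs: string[] = [];

    // same tag set the cataloguer queries later
    $("a, link, base, area").each((_i, el) => {

        const href = $(el).attr("href");

        if (href && href.length > 0) {
            hrefs.push(href);
        }

    });

    return [...new Set(hrefs)];                    // dedupe, like the cataloguer does
}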

The main thing a crawler needs from a page is valid markup it can parse.

tl;dr
https://github.com/salugi/smooth_crawl

I am currently building out a Deno/Rust search engine with MeiliSearch as part of a larger project.

Since this is a Deno backend, I'm taking advantage of the ability to use TypeScript out of the box. It's handy, especially with the ability to create interfaces that extend others.

What should a web crawler do?

  • Crawl to a depth blindly
  • Be rate limited in its calls
  • Crawl only unique links
  • Parse URLs
  • Handle edge cases
  • probably some other stuff

We are going to make one that does all of those.
First, we will look at building a self-contained crawler that is painfully slow (but doesn't DOS a site).

Since this tutorial uses Deno (https://deno.land), install it first:
https://deno.land/#installation

Super easy.

1 . Let's start with creating the folder structure.
Mine:

smooth_crawl
|_src
  |_models

2 . Open up the smooth_crawl directory in your IDE of choice

3 . Create entry.ts file

./smooth_crawl/entry.ts

4 . Create base objects.

In our case, the base objects will be two TS interfaces named RecordKey and HttpRecord.

Create RecordKey.ts file

./smooth_crawl/src/models/RecordKey.ts
  • copy paste the following code

export interface RecordKey {

    id:string,
    creation_date:any,
    archive_object:any

 }

  • Create HttpRecord.ts file
./smooth_crawl/src/models/HttpRecord.ts
  • Copy paste the following into it
import { RecordKey } from "./RecordKey.ts";

export interface HttpRecord extends RecordKey {

    url:URL,
    response:any,
    response_text:string

}

These serve as data models: they hold data and do nothing outside of that.

5 . Build http client

Deno uses the fetch API. It is worth noting that since some sites rely heavily on JS to render their content, a puppeteer implementation would be needed for those sites. We won't focus on puppeteer in this tutorial and will mainly rely on the fetch API to handle the HTTP requests. Below I have annotated some of the code to explain what is going on.
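
For reference only, here is a rough sketch of what a puppeteer-based fetch could look like, using the deno.land/x/puppeteer port. This is an assumption for illustration, not something we build in this tutorial, and it needs a browser install that puppeteer can launch.

// Hypothetical alternative to get_html_text for JS-heavy pages (not used in this tutorial).
// Assumes the deno.land/x/puppeteer port and a browser it can launch.
import puppeteer from "https://deno.land/x/puppeteer@16.2.0/mod.ts";

export async function get_rendered_html(unparsed_url: string): Promise<string> {

    const browser = await puppeteer.launch();

    try {

        const page = await browser.newPage();

        // wait until network activity settles so client-side JS has rendered
        await page.goto(new URL(unparsed_url).href, { waitUntil: "networkidle2" });

        return await page.content();   // the rendered markup

    } finally {

        await browser.close();

    }
}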

  • Create http_client.ts file
./smooth_crawl/src/http_client.ts
  • Copy and paste the following code:
// @ts-ignore
import {v4} from "https://deno.land/std/uuid/mod.ts";
import {HttpRecord} from "./models/HttpRecord.ts";

/*

returns http text (normally html)

*/

export async function get_html_text(unparsed_url:string) : Promise<string> {

    return new Promise(async function (resolve, reject) {

        // abort the request if it hangs for more than 5 seconds
        const controller = new AbortController()
        const timeoutId = setTimeout(() => controller.abort(), 5000)

        // parse the url so special characters are handled
        let parsed_url = new URL(unparsed_url)

        //send get
                await fetch(parsed_url.href,{signal:controller.signal}).then(function (result) {
                    clearTimeout(timeoutId)
                    if (result !== undefined) {

                        //turn result to text.
                        result.text().then(function (text) {

                                resolve(text)

                        }).catch(error => {

                            console.error("get_html_text result.text errored out")

                            reject(error)

                        })

                    }
                }).catch(error => {

                    console.error("get_html_text fetch errored out")

                    reject(error)

                })
    })
}

/*

returns http record

*/

export async function get_http_record(unparsed_url:string) : Promise<HttpRecord> {

    return new Promise(async function (resolve, reject) {

        // abort the request if it hangs for more than 5 seconds
        const controller = new AbortController()
        const timeoutId = setTimeout(() => controller.abort(), 5000)

        let parsed_url = new URL(unparsed_url)

        let record : HttpRecord ={
            id:v4.generate(),
            creation_date : Date.now(),
            url:parsed_url,
            response:{},
            response_text:"",
            archive_object:{}
        }

        await fetch(record.url.href,{signal:controller.signal}).then(function (result) {

            clearTimeout(timeoutId)

            if (result !== undefined && result !== null) {

                record.response = result

                // turn result to text.
                result.text().then(function (text) {

                    if (text.length > 1){

                        record.response_text = text

                    }

                    resolve(record)

                }).catch(error => {

                    console.error("get_http_record result.text errored out")

                    reject(error)

                })

            }
        }).catch(error => {

            console.error("get_http_record fetch errored out")

            reject(error)

        })
    })
}

These two functions do very similar things: one just returns the text of an HTTP response, while the other returns a full HTTP record. The text version, get_html_text(), is lighter weight and doesn't create objects unnecessarily. The reason is that it acts as a depth check.

Some pages can have, say, 50,000 links on a single page. That sounds ludicrously high, but such pages are out there, if only to make crawlers work harder. This first function is a way to be as non-committal as possible while depth-checking a site, so we don't blow out.
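
In other words, the usage difference looks like this (just an illustration of the two calls above, run from the project root):

import { get_html_text, get_http_record } from "./src/http_client.ts";

// cheap call: only the markup, used for depth checks
let text = await get_html_text("https://example.com");
console.log(text.length, "characters of html");

// heavier call: a full record with id, url, response and response_text
let record = await get_http_record("https://example.com");
console.log(record.id, record.url.href, record.response.status);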

Now that we have the file created, we need to test it. Go back to the entry.ts file in the root of the smooth_crawl directory and copy paste this code:

import {get_html_text} from "./src/http_client.ts";

// @ts-ignore
let smoke = Deno.args

// @ts-ignore
let html_text = await get_html_text(smoke[0])

console.log(html_text)


Then, from the smooth_crawl directory, run the command:

deno run --allow-net ./entry.ts https://example.com

If it doesn't error out, we are ready to move on to the conductor concept and parsing the HTML.

6 . Create the conductor.ts file

  • Create conductor.ts file
./smooth_crawl/src/conductor.ts
  • Copy paste the following in

import {get_html_text, get_http_record} from "./http_client.ts"
import {catalogue_basic_data, catalogue_links} from "./cataloguer.ts";

import {HttpRecord} from "./models/HttpRecord.ts";
const non_crawl_file = ["jpg", "pdf", "gif", "webm", "jpeg","css","js","png"]

/*

returns http record

archival objects:
link data (all links on a page, parsed)
metadata (all meta tags)

*/

export function conduct_basic_archive(unparsed_url:string) : Promise<HttpRecord> {

    return new Promise<HttpRecord>(async(resolve,reject)=> {

        try {

            let parsed_url = new URL(unparsed_url)
            let record = await get_http_record(parsed_url.href)
            let archival_data : any = await catalogue_basic_data(parsed_url.origin, record.response_text)

            record.archive_object.links = archival_data.link_data
            record.archive_object.meta = archival_data.meta_data

            resolve(record)

        } catch (error) {

            reject(error)

        }

    })

}


/*

harvests links; link_limit = max number of links to gather, page_limit = max number of pages to crawl

*/

export async function conduct_link_harvest(link:string, link_limit:number, page_limit:number) : Promise<Array<string>> {

    return new Promise<Array<string>>(async (resolve, reject)=>{

        try {

            let links = Array();

            links.push(link)

            for (let i = 0; i < links.length; i++) {

                let url : URL = new URL(links[i])
                // @ts-ignore
                let text : string = await get_html_text(url.href)
                let unharvested_links : Array<URL> = await catalogue_links(url.origin, text)
                let harvested_links : Array<string> = await harvest_links(links, unharvested_links)
                let stop : number = 0;

                if (links.length + harvested_links.length > link_limit){

                    stop = link_limit - links.length

                }else{

                    stop = harvested_links.length

                }

                for (let j = 0; j < stop; j++) {

                    links.push(harvested_links[j])

                }

                if(i >= page_limit){

                    break;

                }

            }

            resolve(links)

        } catch (error) {

            reject(error)

        }

    })

}

function harvest_links(gathered_links: Array<string>, links:Array<any>) : Promise<Array<string>> {

    return new Promise( (resolve, reject) => {

        try {

            let return_array = Array()

            for (const link of links) {

                let should_add = !gathered_links.includes(link.href)
                let file_extension = get_url_extension(link.href)
                let not_in_list = !non_crawl_file.includes(file_extension)

                if (
                    should_add
                    &&
                    not_in_list
                ) {

                    return_array.push(link.href)

                }

            }



            resolve(return_array)

        } catch (error) {

            console.error(error)

            reject(error)

        }

    })

}


function get_url_extension( url: string ) {
    //@ts-ignore
    return url.split(/[#?]/)[0].split('.').pop().trim();
}



export async function conduct_worker_harvest(link:string, link_limit:number, page_limit:number) : Promise<Array<string>> {

    return new Promise<Array<string>>(async (resolve, reject)=>{

        try {

            let links = Array();
            let should_break = false

            links.push(link)

            for (let page_index = 0; page_index < links.length; page_index++) {

                let url : URL = new URL(links[page_index])
                // @ts-ignore
                let text : string = await get_html_text(url.href)
                let unharvested_links : Array<URL> = await catalogue_links(url.origin, text)
                let harvested_links : Array<string> = await harvest_links(links, unharvested_links)
                let stop : number = 0;

                if (links.length + harvested_links.length > link_limit){

                    stop = link_limit - links.length
                    should_break = true

                }else{

                    stop = harvested_links.length

                }

                if(page_index >= page_limit ){

                    should_break = true

                }

                for (let j = 0; j < stop; j++) {

                    links.push(harvested_links[j])

                    //publisher.publish_message( { url : harvested_links[j] } )

                }

                if(should_break){

                    break;

                }

            }

            resolve(links)

        } catch (error) {

            reject(error)

        }

    })

}


This will error out for now; cataloguer.ts doesn't exist yet, and we add it next. The concept is that the conductor conducts actions: a mishmash of smaller actions that build up to bigger things.

Example
http call -> crawl -> return object
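
Once the cataloguer exists (we create it in the next step), that flow looks roughly like this from the caller's side. This is just a usage sketch, not a new file:

import {conduct_basic_archive} from "./src/conductor.ts";

// http call -> crawl -> return object
let record = await conduct_basic_archive("https://example.com");

console.log(record.url.href, record.response.status);      // the http call
console.log(record.archive_object.links.length, "links");  // what the crawl catalogued
console.log(record.archive_object.meta);                   // parsed meta tags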

On to the next step.

7 . create the cataloguer.ts file

  • Create file cataloguer.ts
./smooth_crawl/src/cataloguer.ts
  • Copy paste the following code into it
import {DOMParser} from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';
import {v4} from "https://deno.land/std/uuid/mod.ts";

export async function catalogue_links(origin:string, text:string):Promise<any>{

    return new Promise(function(resolve) {

        try {

            let link_set = Array();

            if(text.length > 1) {

                const document: any = new DOMParser().parseFromString(text, 'text/html');

                if (document === undefined) {

                    let funnel_point = "cataloguer.ts"
                    let funk = "crawl"
                    let error = "unable to interchange gen_object"
                    let id = v4.generate()

                    resolve(link_set)

                } else {

                    let link_jagged_array = Array<Array<any>>(
                        document.querySelectorAll('a'),
                        document.querySelectorAll('link'),
                        document.querySelectorAll('base'),
                        document.querySelectorAll('area')
                    )

                    for (let i = 0; i < link_jagged_array.length; i++) {

                        for (let j = 0; j < link_jagged_array[i].length; j++) {

                            if (link_jagged_array[i][j].attributes.href !== undefined
                                &&
                                link_jagged_array[i][j].attributes.href.length > 0) {

                                link_set.push(link_jagged_array[i][j].attributes.href)

                            }

                        }

                    }

                    link_set = [...new Set(link_set)]

                    // @ts-ignore
                    let fully_parsed_links = link_parse(origin, link_set)

                    resolve(fully_parsed_links)

                }

            }else{
                resolve(link_set)
            }

        }catch(error){

            console.error(error)

        }

    })

}
export async function catalogue_basic_data(origin:string, text:string):Promise<any>{

    return new Promise(function(resolve) {

        try {

            let link_set = Array();

            if(text.length > 1) {

                const document: any = new DOMParser().parseFromString(text, 'text/html');

                if (document === undefined) {

                    console.error("document not defined")

                } else {

                    let link_jagged_array = Array<Array<any>>(
                        document.querySelectorAll('a'),
                        document.querySelectorAll('link'),
                        document.querySelectorAll('base'),
                        document.querySelectorAll('area')
                    )

                    let meta_information = document.querySelectorAll('meta')

                    for (let i = 0; i < link_jagged_array.length; i++) {

                        for (let j = 0; j < link_jagged_array[i].length; j++) {

                            if (link_jagged_array[i][j].attributes.href !== undefined
                                &&
                                link_jagged_array[i][j].attributes.href.length > 0) {

                                link_set.push(link_jagged_array[i][j].attributes.href)

                            }

                        }

                    }

                    link_set = [...new Set(link_set)]

                    // @ts-ignore
                    let fully_parsed_links = link_parse(origin, link_set)
                    let parsed_meta_information = meta_parse(meta_information)

                    let archives = {
                        link_data:fully_parsed_links,
                        meta_data:parsed_meta_information
                    }

                    resolve(archives)

                }

            }else{
                resolve(link_set)
            }

        }catch(error){

            let funnel_point = "cataloguer.ts"
            let funk = "crawl"
            let id = v4.generate()

            console.error(error)
        }

    })

}

function meta_parse(a:Array<any>):Array<string>{
    try {

        let out = Array<any>();

        for (let i = 0; i < a.length; i++) {

            if (a[i].attributes.content !== undefined
                &&
                a[i].attributes.content !== null) {

                let meta_tag = {
                    name: "",
                    content: Array()
                }

                if (a[i].attributes.charset !== undefined) {

                    meta_tag.name = "charset"
                    meta_tag.content.push(a[i].attributes.charset)

                    out.push(meta_tag)

                    continue

                } else if (a[i].attributes.property !== undefined) {

                    meta_tag.name = a[i].attributes.property
                    meta_tag.content = a[i].attributes.content.split(",")

                    out.push(meta_tag)

                    continue

                } else if (a[i].attributes["http-equiv"] !== undefined) {

                    meta_tag.name = a[i].attributes["http-equiv"]
                    meta_tag.content = a[i].attributes.content.split(",")
                    out.push(meta_tag)

                    continue

                } else if (a[i].attributes.name !== undefined) {

                    meta_tag.name = a[i].attributes.name
                    meta_tag.content = a[i].attributes.content.split(",")

                    out.push(meta_tag)

                    continue

                }else {

                    out.push({

                        "meta-related":a[i].attributes.content

                    })

                }

            }

        }

        return out

    }catch(error){

        console.log("crawler-tools.ts")
        Deno.exit(2)

    }

}


function meta_check(a:string):Boolean{

    if (
        /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.\w{2,3})+$/.test(a) ||
        /(tel:.*)/.test(a)                                      ||
        /(javascript:.*)/.test(a)                               ||
        /(mailto:.*)/.test(a)
    ) {
        return true

    }else{

        return false

    }
}

function link_parse(domain:string, lineage_links: Array<string>):any{

    try {

        let c: Array<any> = new Array()

        if (lineage_links.length > 1) {

            for (let i = 0; i < lineage_links.length; i++) {

                if (
                    !/\s/g.test(lineage_links[i])
                    &&
                    lineage_links[i].length > 0
                ) {

                    let test = lineage_links[i].substring(0, 4)

                    if (meta_check(lineage_links[i])) {

                        continue

                    } else if (/[\/]/.test(test.substring(0, 1))) {

                        if (/[\/]/.test(test.substring(1, 2))) {

                            let reparse_backslash = lineage_links[i].slice(1, lineage_links[i].length)
                            lineage_links[i] = reparse_backslash

                        }


                        c.push(new URL(domain + lineage_links[i]))

                        continue

                    } else if (
                        (/\.|#|\?|[A-Za-z0-9]/.test(test.substring(0, 1))
                            &&
                            !/(http)/.test(test))
                    ) {

                        try {

                            //weed out potential non http protos
                            let url = new URL(lineage_links[i])


                        } catch {

                            let url = new URL("/" + lineage_links[i], domain)

                            c.push(url)

                        }

                        continue

                    } else if (/\\\"/.test(test)) {

                        let edge_case_split_tester = lineage_links[i].split(/\\\"/)
                        lineage_links[i] = edge_case_split_tester[0]

                        if (!/http/.test(lineage_links[i].substring(0, 4))) {

                            let url = new URL("/" + lineage_links[i], domain)

                            c.push(url)

                            continue

                        }
                    } else {

                        try {

                            let link_to_test = new URL(lineage_links[i])
                            let temp_url = new URL(domain)
                            let host_domain = temp_url.host.split(".")
                            let host_tester = host_domain[host_domain.length - 2] + host_domain[host_domain.length - 1]
                            let compare_domain = link_to_test.host.split(".")
                            let compare_tester = compare_domain[compare_domain.length - 2] + compare_domain[compare_domain.length - 1]

                            if (host_tester !== compare_tester) {

                                continue

                            }


                            c.push(link_to_test)

                        } catch (error) {

                            console.error(error)

                        }

                        continue

                    }

                }

            }

        }

        return c

    }catch(err){

        console.error(err)

    }

}

This file uses deno_dom (https://github.com/b-fuze/deno-dom) to catalogue and find links.

It handles parsing the different kinds of links so they end up in a usable, absolute form.
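
A quick illustration of the kind of normalization link_parse does: relative hrefs get resolved against the page origin with the URL constructor, so everything comes out absolute.

// illustration only: how relative hrefs become absolute URLs
const origin = "https://example.com";

console.log(new URL(origin + "/about").href);          // https://example.com/about
console.log(new URL("/contact", origin).href);         // https://example.com/contact
console.log(new URL("page.html", origin + "/").href);  // https://example.com/page.html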

Next, make your entry.ts file look like this:


import {get_html_text} from "./src/http_client.ts";
import {conduct_link_harvest} from "./src/conductor.ts";

// @ts-ignore
let smoke = Deno.args

//let html_text = await get_html_text(smoke[0])
//console.log(html_text)

//smoke[0] is the link
//the other two are limit checks:
//grab 100 links or crawl 20 pages, whichever comes first
let links = await conduct_link_harvest(smoke[0], 100, 20)
console.log(links)


Again, run the following command (it may take a little time).
I am using funnyjunk as a test; feel free to use any site you want, as long as it doesn't need JS to load its content.

deno run --allow-net ./entry.ts https://funnyjunk.com

If this runs, we are ready to move on to the last part of this build.

8 . create operator.ts file

  • create operator.ts file
./smooth_crawl/src/operator.ts
  • copy paste the following code
import {
    conduct_basic_archive,
    conduct_link_harvest,
} from "./conductor.ts"

import {HttpRecord} from "./models/HttpRecord.ts";

export async function operate_crawl(url:string,link_limit:number){

    try{

        let crawl_links = await conduct_link_harvest(url,link_limit,50)
        let http_records = new Array<HttpRecord>()

        for(let i = 0; i < crawl_links.length;i++){

            let record = await conduct_basic_archive(crawl_links[i])

            http_records.push(record)

        }

        return http_records

    }catch(error){

        console.error(error)

    }

}

This code uses the conductor to do several things in a more consolidated way.

Now go back to the entry.ts file, delete its contents, and paste in the following:

import {operate_crawl} from "./src/operator.ts";


let smoke = Deno.args

try{


    let limit_int = parseInt(smoke[1])
    let url = new URL(smoke[0])

    // @ts-ignore
    let crawled_pages : Array<HttpRecord> = await operate_crawl(url.href, limit_int)


    crawled_pages.forEach(element =>{

        console.log(element.url.href, element.response.status)

    })

} catch (err) {

    console.error(err)

}


Now run the command:
(again I use funnyjunk, use what you please)
NOTE: 5 is the link limit; the crawler harvests up to 5 links and archives each one

deno run --allow-net ./entry.ts https://funnyjunk.com 5 

That is the crawler. operate_crawl blindly crawls any site to a custom depth and returns the crawled pages as HttpRecords (the interface we created first).

This is a slow way to do it: it blindly crawls unique links, returns a list of objects, and is naturally rate limited.
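
If you want to poke at what came back, here is a quick illustration (not part of the repo) that totals up the catalogued links across the returned records:

import {operate_crawl} from "./src/operator.ts";

// illustration only: summarize what a small crawl archived
let records = await operate_crawl("https://example.com", 5) ?? [];

let total_links = records.reduce(
    (sum, r) => sum + (r.archive_object.links?.length ?? 0), 0);

console.log(records.length, "pages archived,", total_links, "links catalogued");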

But we can go faster within the same process and get a better archival tool.

Below we implement the bookie. The bookie serves as a way to keep track of archival records, like our HttpRecord. It has a function, book_http_record, which conducts a basic archive and could then, for example, save the record to a database.

9 . Let's create our new bookie

  • create file
./smooth_crawl/src/bookie.ts
  • copy paste the following in
import EventEmitter from "https://deno.land/x/events/mod.ts"
import {conduct_basic_archive} from "./conductor.ts";


export let bookie_emitter = new EventEmitter()

export function book_http_record(unparsed_url : string){
    (async () =>{
        try {

            let parsed_url = new URL(unparsed_url)
            let record =  await conduct_basic_archive(parsed_url.href)

            console.log(record.url.href,"recorded with status",record.response.status)



        }catch (error) {
            let funk = "book_http_record"

            console.error(funk)
            console.error(error)

        }
    })()
}

bookie_emitter.on("book_http_archive", book_http_record)

This wires up an event emitter (from deno.land/x/events): whenever a "book_http_archive" event is emitted with a URL, book_http_record fetches and archives that page.
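
If you wanted book_http_record to actually persist what it archives, here is a minimal sketch of that "save it" step. The ./archive folder and JSON-on-disk format are my assumptions, not part of the repo, and it needs --allow-write:

// Hypothetical persistence helper you could call from book_http_record's try block.
// Writes each archived record to ./archive/<id>.json (path is an assumption).
import {HttpRecord} from "./models/HttpRecord.ts";

export async function save_record_to_disk(record: HttpRecord) {

    await Deno.mkdir("./archive", { recursive: true });   // make sure the folder exists

    // only keep the serializable parts (the raw Response object isn't JSON-friendly)
    await Deno.writeTextFile(
        `./archive/${record.id}.json`,
        JSON.stringify({
            id: record.id,
            url: record.url.href,
            creation_date: record.creation_date,
            archive_object: record.archive_object,
        }, null, 2),
    );
}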

So now we go back to the operator.ts file and add the following method. It uses bookie_emitter, so import that at the top of the file as shown:

// add this import at the top of operator.ts
import {bookie_emitter} from "./bookie.ts";

export async function operate_harvest(url:string,link_limit:number,page_limit:number){

    try{

        // let publisher = new PublisherFactory("havest_basic_archive")
        let crawl_links = await conduct_link_harvest(url,link_limit,page_limit)


        for(let i = 0; i < crawl_links.length;i++){

            await new Promise(resolve => setTimeout(resolve, 80))

            bookie_emitter.emit("book_http_archive", crawl_links[i])

        }

        // publisher.close_publisher()

        return crawl_links

    }catch(error){

        console.error(error)

    }

}

That awaited promise acts as a rate limiter in JS (a blocking delay in non-blocking IO, I know). But you'd get some people mad at you if you went zooming over their site.
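
If you prefer, that delay can be pulled out into a named helper; same behavior, just easier to read and tune (an illustration, not in the repo):

// simple delay helper: resolves after ms milliseconds
function sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
}

// inside operate_harvest's loop: at most ~12 emits per second
await sleep(80)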

Now head back to the entry file and change its contents to this:


import {operate_crawl, operate_harvest} from "./src/operator.ts";


let smoke = Deno.args

try{


    let limit_int = parseInt(smoke[2])
    let url = new URL(smoke[1])
    let option = smoke[0].split(/-/)[1]

    switch(option){

        case "sc":

            // @ts-ignore
            let crawled_pages : Array<HttpRecord> = await operate_crawl(url.href, limit_int)


            crawled_pages.forEach(element =>{

                console.log("crawled", element.url.href, "with status", element.response.status)

            })

            break;

        case "eh":

            let harvested_links = await operate_harvest(url.href,limit_int,10)

            // @ts-ignore
            console.log("harvested", harvested_links.length, "links")
            console.log(harvested_links)

            break;

        default:

            console.error("not a valid option")

            break;

    }


} catch (err) {

    console.error(err)

}



Notice we kept the other method. "sc" stands for slow crawl and "eh" stands for event harvest.

Now run the command:
(again, I use funnyjunk as an example, pick your own)

$ deno run -A --unstable ./entry.ts -sc https://funnyjunk.com 5

If that doesn't error out, we are finished.
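
To exercise the event-harvest path, the shape is the same; based on the argument parsing in entry.ts, the number is the link limit (here 100), and the page limit is hard-coded to 10:

$ deno run -A --unstable ./entry.ts -eh https://funnyjunk.com 100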

Now, I wouldn't recommend anyone do thousands of links on a single site with either approach: one, because you may blow out your buffer, and the other, because you may make some server admin super mad.
