
Jordan Hansen

Originally published at javascriptwebscrapingguy.com

Jordan Takes Advantage of Multithreaded I/O in Nodejs

Sample code here

Dead link checking never dies

I have been playing with this repository for three to four weeks now. It kind of feels like I’m checking for dead links with a repository that itself will never die. I’m actually not complaining. It’s nice to use the same code base and try several different methods to accomplish the same goal. I can evaluate the performance of each method and decide which is best.

The first post did dead link checking without taking any advantage of the asynchronous nature of javascript, so it would check a link, wait for that check to complete, and then continue on to the next one. This was incredibly slow. I didn’t realize how slow it was when I first wrote it. As a baseline, checking for dead links on this website (javascriptwebscrapingguy.com) takes ~120 seconds.
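To make the contrast concrete, the fully sequential version boiled down to something like this (a sketch using the names from the snippets later in this post, not the original code):

// Sequential sketch: each await blocks the loop, so links are checked one at a time
for (let i = 0; i < links.length; i++) {
    if (!links[i].status) {
        // Nothing else happens until this single check finishes
        await checkLink(links[i], links, domain);
    }
}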

The second post was one where I wanted to speed things up and started to use some worker threads. Looking back now I can see how it was pretty rudimentary. I was managing the thread count myself and it was not very efficient. Using 20 threads the way I did it then, checking this website (javascriptwebscrapingguy.com) takes ~28 seconds.

By the third post things were starting to get a lot better. I focused on using a pool queue for the worker threads. The pool would automatically queue the tasks and, when one completed, bring in another. The code was a lot cleaner, and using it on this website (javascriptwebscrapingguy.com, still) takes ~20 seconds.

Final form?

I think I’ve reached my final form. It doesn’t require any additional packages besides request. It’s fast. The code is pretty simple. There are a couple of main differences in the code.

const options: requestPromise.RequestPromiseOptions = {
    method: 'GET',
    resolveWithFullResponse: true,
    timeout: 10000,
    // Limit how many sockets the http agent will keep open at once
    agentOptions: {
        maxSockets: 4
    }
};

The first one is where I actually make the request calls: I just add an agentOptions with a maxSockets option. This limits the number of open connections to 4 in the case below.
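For context, here is a minimal sketch of how those options might be applied to a single status check. The getStatus name and the error handling are my own assumptions for illustration, not the repo’s code:

import * as requestPromise from 'request-promise';

// Minimal sketch: check a single URL's status using the options above.
// getStatus is a hypothetical name; the repo's actual checkLink does more.
async function getStatus(url: string): Promise<number> {
    const options: requestPromise.RequestPromiseOptions = {
        method: 'GET',
        resolveWithFullResponse: true,
        timeout: 10000,
        agentOptions: {
            // Cap concurrent sockets so extra requests queue instead of all firing at once
            maxSockets: 4
        }
    };

    try {
        const response = await requestPromise.get(url, options);
        return response.statusCode;
    }
    catch (e) {
        // Timeouts and connection errors land here; treat them as dead
        return e.statusCode ? e.statusCode : 0;
    }
}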

let links: ILinkObject[] = await getLinks(html, domain, domain);
const promises: any[] = [];

for (let i = 0; i < links.length; i++) {
    if (!links[i].status) {
        // Kick off the check but don't await it here; collect the promise instead
        promises.push(checkLink(links[i], links, domain));
    }
}

await Promise.all(promises);

In the initial function, findDeadLinks, I’ll have however many links I got from scraping the domain’s home page. As I loop through them I call checkLink like I was doing before, but I don’t block with await. Instead I push it into an array of promises and block below with await Promise.all(promises);.

The next change I’ve made is within the checkLink function. In this case I just changed the function to be recursive and, again, instead of blocking with await I push the returned promise into an array and then wait for them all to resolve with await Promise.all(promises);.

// Replace the link we were checking with the completed object
let linkToReplaceIndex = links.findIndex(linkInList => linkInList.link === linkObject.link);
links[linkToReplaceIndex] = linkObject;

const promises: any[] = [];

for (let linkToCheck of newLinks) {
    // Only queue links we haven't already seen
    if (links.filter(linkInList => linkInList.link === linkToCheck.link).length < 1) {
        // console.log('pushed in ', linkToCheck.link);
        links.push(linkToCheck);

        // Recurse without awaiting; collect the promise
        promises.push(checkLink(linkToCheck, links, domain));
    }
}

await Promise.all(promises);

return Promise.resolve({ link: linkObject, links: links });
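For orientation, here is a rough sketch of what the top of that recursive checkLink might look like before the section above runs. It assumes the options object, getLinks helper, and ILinkObject type from the earlier snippets; the requestPromise.get call and the status handling are my assumptions, not necessarily the repo’s exact code.

// Sketch only: the head of a recursive checkLink, reusing the options object
// shown earlier and the getLinks(html, domain, currentUrl) helper.
async function checkLink(linkObject: ILinkObject, links: ILinkObject[], domain: string) {
    let newLinks: ILinkObject[] = [];

    try {
        const response = await requestPromise.get(linkObject.link, options);
        linkObject.status = response.statusCode;

        // Only scrape for more links on pages within the domain being checked
        if (linkObject.link.includes(domain)) {
            newLinks = await getLinks(response.body, domain, linkObject.link);
        }
    }
    catch (e) {
        // Treat timeouts and connection errors as dead links
        linkObject.status = e.statusCode ? e.statusCode : 0;
    }

    // ...the replace/recurse/Promise.all section shown above follows here
}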

Walking through a real life scenario

I want to walk through how this works in practice. Let’s say I land on javascriptwebscrapingguy.com, scrape the home page, and find 37 links. I’ll then loop through them and call checkLink 37 times. I wait for them all to resolve their promises, and since checkLink calls itself recursively, that won’t happen until we are done.

Each time it calls checkLink it will get a new set of links. If the first link it’s checking is https://javascriptwebscrapingguy.com/jordan-is-speed-speeding-up-scraping-with-multiple-threads/, it will get all the links from that page and then call checkLink on those links. Only when all of those resolve will that page’s promise resolve, and 1/37 will be resolved from the initial loop.

Results

| maxSockets | Speed (seconds) |
| --- | --- |
| 1 | 38.373 |
| 1 | 31.621 |
| 1 | 31.4 |
| 2 | 22.687 |
| 2 | 22.644 |
| 2 | 23.101 |
| 3 | 19.552 |
| 3 | 19.578 |
| 3 | 19.121 |
| 4 | 19.702 |
| 4 | 17.768 |
| 4 | 17.884 |
| 4 | 18.103 |
| 4 | 17.353 |
| 5 | 17.686 |
| 5 | 18.743 |
| 5 | 17.599 |
| 5 | 19.006 |
| 5 | 19.607 |
| 8 | 17.97 |
| 8 | 18.278 |
| 8 | 18.764 |
| 10 | 19.821 |
| 10 | 20.177 |
| 10 | 18.481 |

You can see that using more than 4 maxSockets doesn’t seem to improve the performance at all. This is surprising to me and at some point I’d like to investigate why. Why did scaling up to 20 workers help when I was using a thread pool, but scaling up the number of I/O connections doesn’t help here?
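For what it’s worth, the table above could be reproduced with a small timing wrapper along these lines. A findDeadLinks that accepts a maxSockets parameter is a hypothetical addition for illustration, not how the repo is actually parameterized:

// Hypothetical harness for producing the table above; assumes findDeadLinks
// takes a maxSockets parameter and feeds it into agentOptions.
async function timeRun(maxSockets: number): Promise<void> {
    const start = Date.now();
    await findDeadLinks('https://javascriptwebscrapingguy.com', maxSockets);
    const seconds = (Date.now() - start) / 1000;
    console.log(`maxSockets ${maxSockets}: ${seconds} seconds`);
}

(async () => {
    for (const maxSockets of [1, 2, 3, 4, 5, 8, 10]) {
        await timeRun(maxSockets);
    }
})();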

There it is. FAST! Doing it this way I am consistently getting the job done in 17 to 18 seconds. That’s quite a bit faster than using worker threads. Pretty cool.

Sample code here

