
Speed up web scraping using C#

Leonardo Gasparini Romão (lleonardogr) ・2 min read

This article is part of a series on web scraping with C#:

How to do web scraping using C#
Speed up web scraping using C#

Now we will use parallelism to speed up our web scraping code. When getting data from the web, it is common to need multiple pages. In the last article I used a single page to test web scraping, but if we need to collect a large set of information, we need a better solution.

Looping over all the pages one at a time takes a long time, because each download has to wait for the previous one to finish. Another option is to process the pages in parallel. Here is an example that uses multiple threads to get data from the web:


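// Requires the HtmlAgilityPack NuGet package.
using System;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using HtmlAgilityPack;

var links = new string[] { "https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_fellowship_of_the_ring",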
            "https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_two_towers",
            "https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_return_of_the_king",
            "https://www.rottentomatoes.com/m/the_hobbit_an_unexpected_journey",
            "https://www.rottentomatoes.com/m/the_hobbit_the_desolation_of_smaug",
            "https://www.rottentomatoes.com/m/the_hobbit_the_battle_of_the_five_armies"
            };

Console.WriteLine("Getting page from movie...");
Parallel.ForEach(links, new ParallelOptions { MaxDegreeOfParallelism = 4 }, link =>
    {
        using var Client = new WebClient();

        //Download Html from a Url:
        var HtmlRequestResult = Client.DownloadString(link);

        //Load HtmlString to AgilityPack Document
        var Document = new HtmlDocument();
        Document.LoadHtml(HtmlRequestResult);

        //Get movie title, critic score and user score
        var MovieTitle = Document.DocumentNode.Descendants("h1").FirstOrDefault()?.InnerText.Trim();
        var CriticScore = Document.GetElementbyId("tomato_meter_link")?.InnerText.Trim();
        var UserScore = Document.DocumentNode.Descendants("a")
            .FirstOrDefault(x => x.GetAttributeValue("href", "") == "#audience_reviews")?.InnerText.Trim();

        Console.WriteLine(string.Format(" Title:{0} \r\n Critic Score:{1} \r\n User Score:{2}", MovieTitle, CriticScore, UserScore));
    });

Console.WriteLine("Press any key to close the program...");
Console.ReadKey();

The console now prints all the movies, but not in the order of the list. Because the links are processed on several threads at once, each page finishes downloading at a different time, so the output order can change from run to run.
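If the output order matters, one option is to store each result at the index of its link and print only after the loop finishes. The sketch below is a minimal illustration of that idea and is not part of the original program: `OrderedScrape` and `FetchTitle` are hypothetical names, and `FetchTitle` fakes the download with a short sleep so the example runs without network access.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class OrderedScrape
{
    // Hypothetical stand-in for the download-and-parse step; it only simulates latency.
    public static string FetchTitle(string link)
    {
        Thread.Sleep(50);
        return "Title for " + link;
    }

    // Processes the links in parallel but returns the results in input order.
    public static string[] Run(string[] links, int maxParallelism)
    {
        var results = new string[links.Length];
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelism };

        // Parallel.For exposes the index, so each iteration writes only to its own slot
        // and no locking is needed.
        Parallel.For(0, links.Length, options, i =>
        {
            results[i] = FetchTitle(links[i]);
        });

        return results;
    }

    public static void Main()
    {
        var links = new[] { "link-a", "link-b", "link-c" }; // placeholder values
        foreach (var line in Run(links, 4))
            Console.WriteLine(line);
    }
}
```

The pages are still fetched in parallel; only the printing is delayed until every slot of the array has been filled.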


As with any other parallel application, we now need to manage our level of parallelism. Too many simultaneous downloads can use so much memory or CPU that they slow down the rest of our infrastructure, so be careful. We also need to pay attention to the website we are visiting: many simultaneous requests can be mistaken for a DDoS attack and get our scraper blocked, so don't hit the site too hard.
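One way to keep the load under control is to cap the number of requests in flight and pause briefly before each one. The following is a minimal sketch under assumptions that are not in the original article: it swaps `Parallel.ForEach` for `async`/`await` with a `SemaphoreSlim`, the names `PoliteFetcher` and `FetchAllAsync` are hypothetical, and the download step is passed in as a delegate so that a fake can be used here and a real call such as `HttpClient.GetStringAsync` can be plugged in later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class PoliteFetcher
{
    // Downloads every link with at most maxConcurrent requests in flight,
    // waiting delayPerRequest before each request starts.
    public static async Task<string[]> FetchAllAsync(
        IReadOnlyList<string> links,
        Func<string, Task<string>> download,
        int maxConcurrent,
        TimeSpan delayPerRequest)
    {
        using var gate = new SemaphoreSlim(maxConcurrent);

        var tasks = links.Select(async link =>
        {
            await gate.WaitAsync();           // wait for a free slot
            try
            {
                await Task.Delay(delayPerRequest); // small pause to go easy on the site
                return await download(link);
            }
            finally
            {
                gate.Release();               // free the slot even if the download fails
            }
        });

        // Task.WhenAll returns the results in the same order as the input links.
        return await Task.WhenAll(tasks);
    }

    public static async Task Main()
    {
        // Fake download so the sketch runs offline (assumption, for illustration only).
        Func<string, Task<string>> fakeDownload =
            link => Task.FromResult("<html>" + link + "</html>");

        var links = new[] { "link-a", "link-b", "link-c" }; // placeholder values
        var pages = await FetchAllAsync(links, fakeDownload, 2, TimeSpan.FromMilliseconds(200));
        Console.WriteLine(pages.Length + " pages downloaded");
    }
}
```

Sensible values for the concurrency limit and the delay depend on the target site, so treat the numbers above as placeholders rather than recommendations.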


Posted by:

Leonardo Gasparini Romão (@lleonardogr)

Working in programming since 2012, ASP.NET developer
