Create a Simple Web Scraper in C#

Rachel Soderberg ・ 6 min read

Web scraping is a skill that comes in handy whenever you need to pull a particular set of data from a website. It is used most often in engineering and the sciences to retrieve data such as statistics or articles matching specific keywords. In this tutorial I will show you how to scrape a website for the latter: articles with specific keywords.

Before we begin, I want to introduce web scraping and some of its limitations. Web scraping, also known as web harvesting or web data extraction, is a method of automatically extracting data from websites. The technique I will be teaching you today is HTML parsing, which means our web scraper will look at the HTML content of a page and extract the information that matches the class we want to retrieve information from (if this doesn't make sense yet, don't worry - I'll go into more detail later). This method is limited by the fact that not all websites store all of their information in HTML: much of what we see today is dynamic, built after the page has loaded. Seeing that information requires a more sophisticated web crawler, typically with its own page loader, which is beyond the scope of this tutorial.

I chose to build a web scraper in C# because the majority of tutorials built their web scrapers in Python. Although that is likely the ideal language for the job, I wanted to prove to myself that it can be done in C#. I also hope to help others learn to build their own web scrapers by providing one of only a few C# web scraping tutorials (as of the time of writing).

Building a Web Scraper

The website we will be scraping is Ocean Networks Canada, a site dedicated to providing information about the ocean and our planet. If you plan to scrape the internet for articles and data, you will find that this website follows a model similar to many others you will encounter.

  1. Launch Visual Studio and create a new C# .NET Windows Forms Application.

    Visual Studio New Windows Forms App

  2. Design a basic Form with a Button to start the scraper and a Rich Textbox for printing the results.

    Basic Form Design

  3. Open your NuGet Package Manager by right-clicking your project name in the Solution Explorer and selecting "Manage NuGet Packages". Search for "AngleSharp" and click Install.

    AngleSharp

  4. Add an array of query terms (these should be the words you want your articles to have in the title) and create a method where we will set up our document to scrape. Your code should look like the following:

        private string Title { get; set; }
        private string Url { get; set; }
        private string siteUrl = "https://www.oceannetworks.ca/news/stories";
        public string[] QueryTerms { get; } = { "Ocean", "Nature", "Pollution" };

        // Requires: using System.IO; using System.Net.Http; using System.Threading;
        // using AngleSharp.Html.Dom; using AngleSharp.Html.Parser;
        internal async void ScrapeWebsite()
        {
            CancellationTokenSource cancellationToken = new CancellationTokenSource();
            HttpClient httpClient = new HttpClient();
            HttpResponseMessage request = await httpClient.GetAsync(siteUrl);
            cancellationToken.Token.ThrowIfCancellationRequested();

            Stream response = await request.Content.ReadAsStreamAsync();
            cancellationToken.Token.ThrowIfCancellationRequested();

            HtmlParser parser = new HtmlParser();
            IHtmlDocument document = parser.ParseDocument(response);

            // Hand the parsed document off to the next step
            GetScrapeResults(document);
        }
    

    CancellationTokenSource provides a token that can be used to cancel the operation if a task or thread requests it.
    HttpClient provides a base class for sending HTTP requests and receiving HTTP responses from a resource identified by a URI.
    HttpResponseMessage represents an HTTP response message, including the status code and data.
    HtmlParser and IHtmlDocument are AngleSharp classes that let you build and parse documents from a website's HTML content.
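    To kick everything off from the Form, the scraper can be started from the button's Click handler. This is a minimal sketch; the control names scrapeButton and resultsTextbox are assumptions based on the Form designed in step 2, so substitute whatever names you gave your controls:

```csharp
// Hypothetical control names based on the Form from step 2.
private void scrapeButton_Click(object sender, EventArgs e)
{
    resultsTextbox.Clear(); // start each run with an empty results box
    ScrapeWebsite();        // async void: fires off the scrape from the UI thread
}
```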

  5. Create another new method to get and display the results from your AngleSharp document. Here we will parse the document and retrieve any articles that match our QueryTerms. This can be tricky, as no two websites use the same HTML naming conventions - it can take some trial and error to get the "articleLink" LINQ query correct:

        private void GetScrapeResults(IHtmlDocument document)
        {
            IEnumerable<IElement> articleLink;

            foreach (var term in QueryTerms)
            {
                articleLink = document.All.Where(x =>
                    x.ClassName == "views-field views-field-nothing" &&
                    (x.ParentElement.InnerHtml.Contains(term) ||
                     x.ParentElement.InnerHtml.Contains(term.ToLower())));

                // articleLink is overwritten on each pass, so handle the
                // current term's results before moving on to the next one
                if (articleLink.Any())
                {
                    // Print Results: See Next Step
                }
            }
        }
    

    If you aren't sure what happened here, I'll explain in more detail: We are looping through each of our QueryTerms (Ocean, Nature, and Pollution) and parsing through our document to find all instances where the ClassName is "views-field views-field-nothing" and where the ParentElement.InnerHtml contains the term we're currently querying.

    If you're unfamiliar with how to see the HTML of a webpage, you can find it by navigating to your desired URL, right clicking anywhere on the page, and choosing "View Page Source". Some pages have a small amount of HTML, others have tens of thousands of lines. You will need to sift through all of this to find where the article headers are stored, then determine the class that holds them. A trick I use is searching for part of one of the article headers, then moving up a few lines.

    Document Example

  6. Now, if our query terms were fruitful, we should have a list of several chunks of HTML containing our article titles and URLs. Create a new method to print your results to the Rich Textbox.

        public void PrintResults(string term, IEnumerable<IElement> articleLink)
        {
            foreach (var result in articleLink)
            {
                CleanUpResults(result); // Clean Up Results: See Next Step

                // Append (+=) rather than assign, so earlier results aren't overwritten
                resultsTextbox.Text += $"{Title} - {Url}{Environment.NewLine}";
            }
        }
    
  7. If we were to print our results as-is, they would come out looking like raw HTML markup, with all the tags, angle brackets, and other non-human-friendly items. We need a method that cleans up our results before we print them to the form and, as in step 5, the markup will vary widely by website.

        private void CleanUpResults(IElement result)
        {
            string htmlResult = result.InnerHtml.ReplaceFirst("        <span class=\"field-content\"><div><a href=\"", "https://www.oceannetworks.ca");
            htmlResult = htmlResult.ReplaceFirst("\">", "*");
            htmlResult = htmlResult.ReplaceFirst("</a></div>\n<div class=\"article-title-top\">", "-");
            htmlResult = htmlResult.ReplaceFirst("</div>\n<hr></span>  ", "");

            SplitResults(htmlResult); // Split Results: See Next Step
        }
    

    So what happened here? I examined the InnerHtml of the incoming result object to see what extra content needed to be removed from what I actually wanted to display - a Title and a URL. Working from left to right, I replaced each chunk of HTML with an empty string, except the chunk between the URL and the title, which I replaced with a "*" as a placeholder to split the string on later. Each of these ReplaceFirst() calls will be different for each website, and they may not even work flawlessly on every article of a particular site. You can continue to add new replacements, or simply ignore the stragglers if they are uncommon enough.
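    One caveat: ReplaceFirst() is not a built-in .NET string method (string.Replace() replaces every occurrence), so the code above assumes an extension method along these lines:

```csharp
using System;

public static class StringExtensions
{
    // Replaces only the first occurrence of 'search' in 'text'.
    // Unlike string.Replace(), later occurrences are left untouched.
    public static string ReplaceFirst(this string text, string search, string replace)
    {
        int pos = text.IndexOf(search, StringComparison.Ordinal);
        if (pos < 0)
            return text; // nothing found, return the string unchanged

        return text.Substring(0, pos) + replace + text.Substring(pos + search.Length);
    }
}
```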

  8. I'm sure you noticed from the previous step that there's one last method to add before we can print a clean result to our textbox. Now that we've cleaned up our result string, we can use our "*" placeholder to split it into two strings - a Title and a URL.

        private void SplitResults(string htmlResult)
        {
              string[] splitResults = htmlResult.Split('*');
              Url = splitResults[0];
              Title = splitResults[1];
        }
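    As a quick sanity check of the split, here is what happens with a hypothetical cleaned-up string in the form the previous step produces (the URL and title below are made up for illustration):

```csharp
using System;

class SplitDemo
{
    static void Main()
    {
        // Hypothetical cleaned-up result: URL and title joined by the '*' placeholder
        string htmlResult = "https://www.oceannetworks.ca/example-article*An Example Title";

        string[] splitResults = htmlResult.Split('*');
        string url = splitResults[0];
        string title = splitResults[1];

        Console.WriteLine($"{title} - {url}");
        // → An Example Title - https://www.oceannetworks.ca/example-article
    }
}
```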
    
  9. Finally, we have a clean, human-friendly result! If all went well and the articles haven't changed drastically since the time of writing, running your code should produce the following set of results (and more... there were a lot!), scraped from Ocean Networks by your application:

    Web Scrape Results


I hope this tutorial has given you some insight into the world of web scraping. If there's enough interest, I can continue this series and teach you how to set up your application to do a fresh scrape at specific time intervals and send you a newsletter-style email with a day's or week's worth of results.

If you'd like to catch up with me on social media, come find me over on Twitter or LinkedIn and say hello!


Rachel Soderberg

@rachelsoderberg

I'm a Software Developer who loves working with C#.NET and Salesforce. In my free time I lift weights, do martial arts, and play video games.

Discussion


I found this useful, but I admit to getting a bit stuck connecting each of your steps together. To help others in the future, here's a Gist that links everything together.

I admit its output isn't as neat as yours, so I have a mistake somewhere... but it's a start. One quick note: it's WPF rather than WinForms, so take that into consideration for all UI interactions.

 

Follow the link Mathew F. linked to, but edit these lines and everything will work!

Regarding:
gist.github.com/CodeCommissions/43...

Edit:

        foreach (var term in QueryTerms)
        {
            articleLink = document.All.Where(x =>
                x.ClassName == "views-field views-field-nothing" &&
                (x.ParentElement.InnerHtml.Contains(term) || x.ParentElement.InnerHtml.Contains(term.ToLower())));

            //Overwriting articleLink above means we have to print its results for all QueryTerms
            //Appending to a pre-declared IEnumerable (like a List) could mean taking this out of the main loop.
            if (articleLink.Any())
            {
                PrintResults(articleLink);
            }
        }

TO THIS:

        foreach (var term in QueryTerms)
        {
            articleLink = document.All.Where(x =>
                x.ClassName == "views-field views-field-nothing" &&
                (x.ParentElement.InnerHtml.Contains(term) || x.ParentElement.InnerHtml.Contains(term.ToLower()))).Skip(1);

            //Overwriting articleLink above means we have to print it's result for all QueryTerms
            //Appending to a pre-declared IEnumerable (like a List), could mean taking this out of the main loop.
            if (articleLink.Any())
            {
                PrintResults(articleLink);
            }
        }

Take note of the:

.Skip(1)

The reason it was ugly is that the first element in the IEnumerable was not filtered properly, so instead of spending lots of time filtering through that mess we simply skip the first element :)

 

Thanks for the fix ^_^
I've updated the Gist to include your suggestion.

 

 

This may be just me, but what I look for in a nicely written blog post such as this one, with the title "create-a-simple-web-scraper", is completeness, because it should be a foolproof starter for beginners.

The code here doesn't work without adding the missing parts and fixing implied wrong usage suggestions.

 

I'm sorry you cannot get it working, but I built the application from the ground up while writing the post. It absolutely does work and is in its fullest form; there are no missing parts, and I'm unsure what you mean by "implied wrong usage suggestions". Could you be more specific?

I would be glad to help you get the application working, can you provide the error you're getting and perhaps a link to your code?

 
 

Does this follow a similar method as I wrote above? I see it's using the HTML Agility Pack library, and I'm not familiar with that.

 

Yes Rachel, HtmlAgilityPack is a more advanced library that supports XPath extraction as well as LINQ. I have written at length about scraping websites, and have scraped a number of them myself using HtmlAgilityPack. But you explained getting started with web scraping beautifully.

Very cool! I'll have to check it out next time I have some free time for a personal project. Thanks for the recommendation, your articles look very good as well.

Thanks Rachel for taking the time. If you need any help, please message me.

 

Thanks for the article - I can't really get it to work (a bit new to all this). Is there a full example of the code posted anywhere, or any example you would recommend? I've tried multiple online but they don't seem to work. Cheers

 

Hello! Great to hear you're giving the project a shot!

Unfortunately I didn't create a full example of the code, at the time of writing I was blogging alongside what I built and now the project's much larger than this simple example was.

Where are you running into issues (is there a specific step you can't get working)? Are you getting any errors?

 

I can't run this app can you post code or help me! Please!

 

There is plenty of code in the post, what specifically are you having trouble with?

 

I wrote it according to what you put up, but when running the program nothing happens!

I would love to be able to help, but I need you to be more specific. Can you include your own code? What errors are you getting?