DEV Community

Oxylabs


Building a Web Scraper in Golang: Complete Tutorial

Ever wondered how to build a web scraper in Golang? Check out this practical tutorial.

Golang, or Go, is designed to combine the static typing and run-time efficiency of C with the usability of Python and JavaScript, adding high-performance networking and multiprocessing. It’s also compiled and excels at concurrency, which makes it fast.

This article will guide you through the step-by-step process of writing a fast and efficient Golang web scraper that can extract public data from a target website.

Installing Go

To start, head over to the Go downloads page. Here you can download all of the common installers, such as Windows MSI installer, macOS Package, and Linux tarball. Go is open-source, meaning that if you wish to compile Go on your own, you can download the source code as well.

A package manager facilitates working with first-party and third-party libraries by helping you to define and download project dependencies. The manager pins down version changes, allowing you to upgrade your dependencies without fear of breaking the established infrastructure.

Installing Go on macOS

If you prefer package managers, you can use Homebrew on macOS. Open the terminal and enter the following:

brew install go

Installing Go on Windows

On Windows, you can use the Chocolatey package manager. Open the command prompt and enter the following:

choco install golang

Installing Go on Linux

Installing Go on Linux requires five simple steps:

1. Remove previous Go installations (if any) using the following command:

rm -rf /usr/local/go

2. Download the Go for Linux package; head over to the Go downloads page, or use:

wget https://go.dev/dl/go1.19.2.linux-amd64.tar.gz

3. Once the .tar.gz file is downloaded, extract the archive into the /usr/local directory:

tar -C /usr/local -xzf go1.19.2.linux-amd64.tar.gz

4. Add the Go binary path to the PATH environment variable by adding the following line to the $HOME/.profile file (or, for a system-wide installation, to /etc/profile):

export PATH=$PATH:/usr/local/go/bin

5. Use the source $HOME/.profile command to apply the changes made to the .profile file.

Now, you can use the go version command to verify that Go is installed correctly.

Once Go is installed, you can use any code editor or an integrated development environment (IDE) that supports Go.

How to install Golang in Visual Studio Code?

While you can use virtually any code editor to write a Go program, one of the most commonly used ones is Visual Studio Code.

For Golang to be supported, you’ll need to install the Go extension. To do that, select the Extensions icon on the left side, type in Go in the search bar, and simply click Install:



Go extension for Visual Studio Code

Once you’ve finished installing the Go extension, you’ll need to update Go tools.

Press Ctrl+Shift+P to open the Show All Commands window and search for Go: Install/Update tools.

Take a look at the image below to see how it looks:

Go tools for Visual Studio Code

After selecting all the available Go tools, click on the OK button to install.

We can also use a separate IDE (e.g., GoLand) to write, debug, compile, and run Go projects. Both Visual Studio Code and GoLand are available for Windows, macOS, and Linux.

Web scraping frameworks

Go offers a wide selection of frameworks. Some are simple packages with core functionality, while others, such as Ferret, Gocrawl, Soup, and Hakrawler, provide a complete web scraping infrastructure to simplify data extraction. Let’s have a brief overview of these frameworks.

Ferret

Ferret is a fast, portable, and extensible framework for designing Go web scrapers. It’s pretty easy to use as the user simply needs to write a declarative query expressing which data to extract. Ferret handles the HTML retrieving and parsing part by itself.

Gocrawl

Gocrawl is a web scraping framework written in Go language. It gives complete control to visit, inspect, and query different URLs using goquery. This framework allows concurrent execution as it applies goroutines.

Soup

Soup is a small web scraping framework that can be used to implement a Go web scraper. It provides an API for retrieving and parsing the content.

Hakrawler

Hakrawler is a simple and fast web crawler written in Go. It’s a simplified version of GoColly, the most popular Golang web scraping framework, and is mainly used to extract URLs and JavaScript file locations.

GoQuery

GoQuery is a framework that provides functionality similar to jQuery in Golang. It uses two basic Go packages: net/html (a Golang HTML parser) and cascadia (a CSS selector library).

Colly

The most popular framework for writing web scrapers in Go is Colly.

Colly is a fast scraping framework that can be used to write any kind of crawler, scraper, or spider. If you want to know more about the difference between a scraper and a crawler, check out our separate article on the topic.

Colly has a clean API, handles cookies and sessions automatically, supports caching and robots.txt, and, most importantly, it’s fast. Colly offers distributed scraping, HTTP request delays, and concurrency per domain.

In this Golang Colly tutorial, we’ll be using Colly to scrape books.toscrape.com. The website is a dummy book store for practicing web scraping.

How to import a package in Golang?

As the name suggests, the import declaration imports different packages into a Golang program. For example, the fmt package contains formatted I/O functions and can be imported as shown in the following snippet:

package main

import "fmt"

func main() {
    fmt.Println("Hello World")
}

The code above first imports the fmt package and then uses its Println function to display the Hello World text in the console.

We can also import multiple packages using a single import directive, as you can see from the example below:

package main

import (
    "fmt"
    "math/rand"
)

func main() {
    fmt.Println("Hello World")
    fmt.Println(rand.Intn(25))
}

Parsing HTML with Colly

To easily extract structured data from the URLs and HTML, the first step is to create a project and install Colly.

Create a new directory and navigate there using the terminal. From this directory, run the following command:

go mod init oxylabs.io/web-scraping-with-go

This will create a go.mod file that contains the following lines with the name of the module and the version of Go. In this case, the version of Go is 1.17:

module oxylabs.io/web-scraping-with-go

go 1.17

Next, run the following command to install Colly and its dependencies:

go get github.com/gocolly/colly

This command will also update the go.mod file with all the required dependencies as well as create a go.sum file.

We are now ready to write the web scraper code file. Create a new file, save it as books.go and enter the following code:

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"

    "github.com/gocolly/colly"
)

func main() {
    // Scraping code here
    fmt.Println("Done")
}

The first line is the name of the package. Next, there are some built-in packages being imported as well as Colly itself.

The main() function is going to be the entry point of the program. This is where we’ll write the code for the web scraper.

Sending HTTP requests with Colly

The fundamental component of a Colly web scraper is the Collector. The Collector makes HTTP requests and traverses HTML pages.

The Collector exposes multiple events. We can hook custom functions that execute when these events are raised. These functions are anonymous and passed as parameters.

First, to create a new Collector using default settings, enter this line in your code:

c := colly.NewCollector()

There are many other parameters that can be used to control the behavior of the Collector. In this example, we are going to limit the allowed domains. Change the line as follows:

c := colly.NewCollector(
    colly.AllowedDomains("books.toscrape.com"),
)

Once the instance is available, the Visit() function can be called to start the scraper. However, before doing so, it’s important to hook up to a few events.

The OnRequest event is raised when an HTTP request is sent to a URL. This event is used to track which URL is being visited. A simple anonymous function that prints the URL being requested looks as follows:

c.OnRequest(func(r *colly.Request) {
    fmt.Println("Visiting", r.URL)
})

Note that the anonymous function being sent as a parameter here is a callback function. It means that this function will be called when the event is raised.

Similarly, OnResponse can be used to examine the response. The following is one such example:

c.OnResponse(func(r *colly.Response) {
    fmt.Println(r.StatusCode)
})

The OnHTML event can be used to take action when a specific HTML element is found.

Locating HTML elements via CSS selector

The OnHTML event can be hooked using the CSS selector and a function that executes when the HTML elements matching the selector are found.

For example, the following function executes when a title tag is encountered:

c.OnHTML("title", func(e *colly.HTMLElement) {
    fmt.Println(e.Text)
})

This function extracts the text inside the title tag and prints it. Putting together all we have gone through so far, the main() function is as follows:

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("books.toscrape.com"),
    )

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println(r.StatusCode)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("https://books.toscrape.com/")
}

This file can be run from the terminal as follows:

go run books.go

The output will be as follows:

Visiting https://books.toscrape.com/
200
All products | Books to Scrape - Sandbox

Extracting the HTML elements

Now that we know how Colly works, let’s modify OnHTML to extract the book titles and prices.

The first step is to understand the HTML structure of the page.

The books are in the <article> tags

Each book is contained in an article tag that has a product_pod class. The CSS selector would be .product_pod.

Next, the complete book title is found in the thumbnail image as the alt attribute value. The CSS selector for the book title would be .image_container img.

Finally, the CSS selector for the book price would be .price_color.

The OnHTML can be modified as follows:

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
    title := e.ChildAttr(".image_container img", "alt")
    price := e.ChildText(".price_color")
    fmt.Println(title, price) // Go won't compile with unused variables
})

This function will execute every time a book is found on the page.

Note the use of the ChildAttr function, which takes two parameters: the CSS selector and the name of the attribute. Keeping the values in loose variables doesn’t scale well, though. A better idea is to create a data structure to hold this information. In this case, we can use a struct as follows:

type Book struct {
    Title string
    Price string
}

The OnHTML will be modified as follows:

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
    book := Book{}
    book.Title = e.ChildAttr(".image_container img", "alt")
    book.Price = e.ChildText(".price_color")
    fmt.Println(book.Title, book.Price)
})

For now, this web scraper is simply printing the information to the console, which isn’t particularly useful. We’ll revisit this function when it’s time to save the data to a CSV file.

Handling pagination

First, we need to locate the “next” button and create a CSS selector. For this particular site, the CSS selector is .next > a. Using the selector, a new function can be added to the OnHTML event. In this function, we’ll convert a relative URL to an absolute URL. Then, we’ll call the Visit() function to crawl the converted URL:

c.OnHTML(".next > a", func(e *colly.HTMLElement) {
    nextPage := e.Request.AbsoluteURL(e.Attr("href"))
    c.Visit(nextPage)
})

The existing function that scrapes the book information will be called on all of the resulting pages as well. No additional code is needed.

Now that we have the data from all of the pages, it’s time to save it to a CSV file.

Writing data to a CSV file

The built-in encoding/csv package can be used to save the structs to a CSV file. If you want to save the data in JSON format, you can use the encoding/json package instead.

To create a new CSV file, enter the following code before creating the Colly collector:

file, err := os.Create("export.csv")
if err != nil {
    log.Fatal(err)
}
defer file.Close()

This will create export.csv and defer closing the file until main() returns.

Next, add these two lines to create a CSV writer:

writer := csv.NewWriter(file)
defer writer.Flush()

Now, it’s time to write the headers:

headers := []string{"Title", "Price"}
writer.Write(headers)

Finally, modify the OnHTML function to write each book as a single row:

c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
    book := Book{}
    book.Title = e.ChildAttr(".image_container img", "alt")
    book.Price = e.ChildText(".price_color")
    row := []string{book.Title, book.Price}
    writer.Write(row)
})

That’s all! The code for the Golang web scraper is now complete.

Run the file by entering the following in the terminal:

go run books.go

This will create an export.csv file with 1,000 rows of data.

Scheduling tasks with GoCron

For some tasks, you might want to schedule a web scraper to extract data periodically or at a specific time. You can do that by using your OS's schedulers or a high-level scheduling package usually available with the language you're using.

To schedule a Go scraper, you can use OS tools like Cron or Windows Task Scheduler. Alternatively, you can use the high-level GoCron task scheduling package available for Golang. It's essential to keep in mind that scheduling a scraper through OS-provided schedulers limits the portability of the code. The GoCron package solves this problem and works well with almost all operating systems.
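For reference, scheduling the scraper with Cron on Linux would mean adding a line like the following to the crontab; the project path is a placeholder, and this is exactly the kind of OS-specific detail that limits portability:

```shell
# Hypothetical crontab entry: run the scraper at minute 0 of every hour.
# /path/to/project is a placeholder, and the entry assumes go is on cron's PATH.
0 * * * * cd /path/to/project && go run books.go
```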

GoCron is a task scheduling package for Golang that runs specific code at a particular time. It offers functionality similar to Python's job scheduling module named schedule.

Scheduling a task with GoCron requires installing the package first, which you can do by using the following command:

go get github.com/go-co-op/gocron

The next step is to write a GoCron script to schedule our code. Let's look at the following code example to understand how the GoCron scheduler works:

package main

import (
    "fmt"
    "time"

    "github.com/go-co-op/gocron"
)

func My_Task_1() {
    fmt.Println("Hello Task 1")
}

func main() {
    my_scheduler := gocron.NewScheduler(time.UTC)
    my_scheduler.Every(5).Seconds().Do(My_Task_1)
    my_scheduler.StartAsync()
    my_scheduler.StartBlocking()
}

The code above schedules the My_Task_1 function to run every 5 seconds. Moreover, we can start the GoCron scheduler in two modes: asynchronous mode and blocking mode.

StartAsync() will start the scheduler asynchronously, while the StartBlocking() method will start the scheduler in blocking mode by blocking the current execution path.

Side note: The above code example starts the GoCron scheduler in both the asynchronous and the blocking modes. However, we can choose either of these as per our requirements.

Let’s schedule our Golang web scraper code example using the GoCron scheduling module.

package main

import (
    "encoding/csv"
    "fmt"
    "log"
    "os"
    "time"

    "github.com/go-co-op/gocron"
    "github.com/gocolly/colly"
)

type Book struct {
    Title string
    Price string
}

func BooksScraper() {
    fmt.Println("Start scraping")
    file, err := os.Create("export.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    writer := csv.NewWriter(file)
    defer writer.Flush()

    headers := []string{"Title", "Price"}
    writer.Write(headers)

    c := colly.NewCollector(
        colly.AllowedDomains("books.toscrape.com"),
    )

    c.OnHTML(".product_pod", func(e *colly.HTMLElement) {
        book := Book{}
        book.Title = e.ChildAttr(".image_container img", "alt")
        book.Price = e.ChildText(".price_color")
        row := []string{book.Title, book.Price}
        writer.Write(row)
    })

    c.OnResponse(func(r *colly.Response) {
        fmt.Println(r.StatusCode)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL)
    })

    c.Visit("https://books.toscrape.com/")
}

func main() {
    my_scheduler := gocron.NewScheduler(time.UTC)
    my_scheduler.Every(2).Minute().Do(BooksScraper)
    my_scheduler.StartBlocking()
}

Summary

The code used in this article ran in less than 12 seconds. Executing the same task in Scrapy, which is one of the most optimized modern frameworks for Python, took 22 seconds. If speed is what you prioritize for your web scraping tasks, it’s a good idea to consider Golang in tandem with a modern framework such as Colly.

Top comments (3)

Michael Morasch

great write up with good clarifications for terms in between!
One question I have is if any of the web scrapers you mentioned allows for functionality similar to playwright in the sense that it waits for all javascript and scripts to execute?

Oxylabs

Thank you so much, Michael!

Actually we have a whole tutorial on Web Scraping With Playwright you might want to check that one out :) If it won't clarify what you are asking, let me know!

Oxylabs

If you have any questions, please leave a comment and we will make sure to answer as quickly as possible! :)