Originally posted on divrhino.com
Sometimes the data you need just isn't available through an API. In those cases, you can write a little web scraper to get it yourself. In this tutorial, we're going to learn how to build a web scraper and how to save our scraped data into a JSON file. We'll be working with Go and the Colly package, which allows us to crawl, scrape and traverse the DOM.
Prerequisites
To follow along, you will need to have Go installed.
Setting up project directory
Let's get started. First, change into the directory where our projects are stored. In my case, this is the "Sites" folder; it may be different for you. Here we will create our project folder called rhino-scraper.
cd Sites
mkdir rhino-scraper
cd rhino-scraper
In our rhino-scraper project folder, we'll create our main.go file. This will be the entry point of our app.
touch main.go
Initialising go modules
We will be using go modules to handle dependencies in our project. Running the following command will create a go.mod file.
go mod init example.com/rhino-scraper
We're going to be using the Colly package to build our web scraper, so let's install it now by running:
go get github.com/gocolly/colly
You will notice that running the above command created a go.sum file. This file holds a list of the checksums and versions of our direct and indirect dependencies. It is used to validate the checksum of each dependency to confirm that none of them have been modified.
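For reference, each entry in go.sum pairs a module version with its checksum. A couple of illustrative lines, where the version number and checksums are placeholders and your file will differ:

github.com/gocolly/colly v1.2.0 h1:placeholder-checksum=
github.com/gocolly/colly v1.2.0/go.mod h1:placeholder-checksum=

Entries for indirect dependencies pulled in by colly will appear here too.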
In the main.go file we created earlier, let's set up a basic package main and func main().
package main
func main() {}
Analysing the target page structure
For this tutorial we will be scraping some rhino facts from FactRetriever.com.
Looking at the target page, we can see that each fact has a simple structure consisting of an id and a description.
Creating the fact struct
In our main.go file, we can write a Fact struct type to represent the structure of a rhino fact. A fact will have:
- an ID that will be of type int, and
- a description that will be of type string.
The Fact struct type, the ID field and the Description field are all capitalised because we want them to be available outside of package main. The struct tags, json:"id" and json:"description", tell the encoding/json package which key to use for each field when we later write our data out as JSON.
package main

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {}
Inside func main(), we will create an empty slice to hold our facts. We will initialise it with a length of zero and append to it as we go. This slice will only be able to hold Fact values.
package main

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)
}
Using the Colly package
We will be importing a package called colly to provide us with the methods and functionality we'll need to build our web scraper.
package main

import "github.com/gocolly/colly"

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)
}
Using the colly package, let's create a new collector and set its allowed domains to factretriever.com:
package main

import "github.com/gocolly/colly"

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)

    collector := colly.NewCollector(
        colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
    )
}
HTML structure of a list of facts
If we inspect the HTML structure, we will see that the facts are list items inside an unordered list that has the class of factsList. Each fact list item has been assigned an id. We will use this id later.
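As a rough sketch, the markup we care about looks something like this (simplified, with placeholder content, not the page's exact HTML):

<ul class="factsList">
  <li id="1">First rhino fact goes here.</li>
  <li id="2">Second rhino fact goes here.</li>
</ul>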
Now that we know what the HTML structure is like, we can write some code to traverse the DOM. The colly package makes use of a library called goquery to interact with the DOM. goquery is like jQuery, but for Go.
Below is the code so far. We will go over the new lines step by step.
package main

import (
    "fmt"
    "log"
    "strconv"

    "github.com/gocolly/colly"
)

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)

    collector := colly.NewCollector(
        colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
    )

    collector.OnHTML(".factsList li", func(element *colly.HTMLElement) {
        factId, err := strconv.Atoi(element.Attr("id"))
        if err != nil {
            log.Println("Could not get id")
        }
        factDesc := element.Text

        fact := Fact{
            ID:          factId,
            Description: factDesc,
        }

        allFacts = append(allFacts, fact)
    })
}
So, here's what's happening:
- We import the fmt, log and strconv packages
- We are using the OnHTML method. It takes two arguments: the first argument is a target selector and the second argument is a callback function that is called every time the target selector is encountered
- In the body of the OnHTML callback, we create a variable to store the ID of each element that is iterated over
- The ID is currently of type string, so we use strconv.Atoi to convert it to type int (see the short sketch after this list)
- The strconv.Atoi function returns an error as its second return value, so we do some basic error handling
- We create a variable called factDesc to store the description text of each fact. Based on the Fact struct type we established earlier, we are expecting the fact description to be of type string
- Here, we create a new Fact struct for every list item we iterate over
- Then we append the Fact struct to the allFacts slice
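Here is a quick standalone sketch of strconv.Atoi, separate from our scraper code, just to show the two return values:

package main

import (
    "fmt"
    "strconv"
)

func main() {
    // A numeric string converts cleanly to an int.
    id, err := strconv.Atoi("42")
    fmt.Println(id, err) // 42 <nil>

    // A non-numeric string returns an error as the second value.
    _, err = strconv.Atoi("not-a-number")
    fmt.Println(err) // strconv.Atoi: parsing "not-a-number": invalid syntax
}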
Begin crawling and scraping
We want to have some visual feedback to let us know that our scraper is actually visiting the page. Let's do that now.
package main

import (
    "fmt"
    "log"
    "strconv"

    "github.com/gocolly/colly"
)

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)

    collector := colly.NewCollector(
        colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
    )

    collector.OnHTML(".factsList li", func(element *colly.HTMLElement) {
        factId, err := strconv.Atoi(element.Attr("id"))
        if err != nil {
            log.Println("Could not get id")
        }
        factDesc := element.Text

        fact := Fact{
            ID:          factId,
            Description: factDesc,
        }

        allFacts = append(allFacts, fact)
    })

    collector.OnRequest(func(request *colly.Request) {
        fmt.Println("Visiting", request.URL.String())
    })

    collector.Visit("https://www.factretriever.com/rhino-facts")
}
Here's what's happening:
- We use fmt.Println to output a "Visiting" message whenever we request a URL
- We use the Visit() method to give our program a starting point
If we run our program in the terminal now, using the command
go run main.go
it will tell us that our collector visited the rhino facts page on FactRetriever.com.
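Something like the following should be printed, produced by the OnRequest callback we added:

Visiting https://www.factretriever.com/rhino-facts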
Saving our data to JSON
We may want to use our scraped data in another place. So let's save it to a JSON file.
package main

import (
    "encoding/json"
    "fmt"
    "io/ioutil"
    "log"
    "strconv"

    "github.com/gocolly/colly"
)

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)

    collector := colly.NewCollector(
        colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
    )

    collector.OnHTML(".factsList li", func(element *colly.HTMLElement) {
        factId, err := strconv.Atoi(element.Attr("id"))
        if err != nil {
            log.Println("Could not get id")
        }
        factDesc := element.Text

        fact := Fact{
            ID:          factId,
            Description: factDesc,
        }

        allFacts = append(allFacts, fact)
    })

    collector.OnRequest(func(request *colly.Request) {
        fmt.Println("Visiting", request.URL.String())
    })

    collector.Visit("https://www.factretriever.com/rhino-facts")

    writeJSON(allFacts)
}

func writeJSON(data []Fact) {
    file, err := json.MarshalIndent(data, "", " ")
    if err != nil {
        log.Println("Unable to create json file")
        return
    }

    _ = ioutil.WriteFile("rhinofacts.json", file, 0644)
}
Here's what's happening in the code above:
- We import the ioutil package so we can write to a file
- We import the encoding/json package, which gives us the JSON encoding functionality we need
- We create a function called writeJSON that takes in one parameter of type []Fact, a slice of Fact
- Inside the function body, we use json.MarshalIndent to marshal the data we pass in
- The MarshalIndent function returns the JSON encoding of the data and also returns an error
- Some basic error handling: if we get an error here, we just print a log message saying we were unable to create a JSON file
- We then use the WriteFile method that ioutil provides to write our JSON-encoded data to a file called "rhinofacts.json"
- This file does not exist yet, so the WriteFile method will create it with the permission code 0644
Our writeJSON function is ready to use. We can call it and pass allFacts to it.
Now if we go back to the terminal and run the command go run main.go, all our scraped rhino facts will be saved in a JSON file called "rhinofacts.json".
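The file will contain an array of fact objects in the shape defined by our struct tags. A sketch of the structure only, where the ids and descriptions are placeholders rather than real scraped facts:

[
 {
  "id": 1,
  "description": "First scraped rhino fact goes here."
 },
 {
  "id": 2,
  "description": "Second scraped rhino fact goes here."
 }
]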
Conclusion
In this tutorial, you learnt how to build a web scraper with Go and the Colly package. If you enjoyed this article and you'd like more, consider following Div Rhino on YouTube.
Congratulations, you did great. Keep learning and keep coding!
divrhino/rhino-scraper: Learn how to build a web scraper with Go and colly. Video tutorial available on the Div Rhino YouTube channel.
- Text tutorial: https://divrhino.com/articles/build-webscraper-with-go-and-colly/
- Video tutorial: https://www.youtube.com/watch?v=4VSno5bK9Uk