Scraping is cool.
You can automate lots of things – scrape the jobs you want, scrape the articles, create content aggregators, send the things you need to your email, save it to a file...plenty of options.
Just be respectful to the websites you are scraping and check their policies regarding scraping.
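One quick way to check a site's crawling rules is its robots.txt file. Here is a minimal sketch using Python's built-in urllib.robotparser (just an aside, not part of the tutorial code):
from urllib.robotparser import RobotFileParser

# Parse the site's robots.txt and ask whether we are allowed to fetch the page.
parser = RobotFileParser("http://books.toscrape.com/robots.txt")
parser.read()
print(parser.can_fetch("*", "http://books.toscrape.com/"))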
Requirements:
Python, BeautifulSoup4, requests.
The URL we will be using is meant for educational purposes. The creator says it can be used for practicing your scraping skills:
http://books.toscrape.com/
Installation
Create a Python file where you will write the scraping code – I will name mine scraper.py.
As usual, we are using pip to install the requests and beautifulsoup4 modules.
Go to a terminal where your venv is activated and type:
pip install requests beautifulsoup4
The requests module lets us send requests to a webpage and inspect the response.
BeautifulSoup will help us scrape and filter out the data we want.
Let's Start
First, we will use the requests module to access the URL.
scraper.py
import requests
page = requests.get("http://books.toscrape.com/")
print(page)
When we run the script, we will get:
<Response [200]>
which is telling us we have accessed the URL (status code 200 == OK).
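If the request fails (for example, a 404 or a server error), it is good to notice it right away. A small optional check, using requests' built-in raise_for_status():
import requests

page = requests.get("http://books.toscrape.com/")
# raise_for_status() raises an HTTPError for 4xx/5xx responses,
# so the script stops instead of scraping an error page.
page.raise_for_status()
print(page.status_code)  # 200 means OK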
Now, we will start using BeautifulSoup.
Edit the scraper.py file:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, "html.parser")
print(soup)
print(soup.prettify())
We have created the soup instance.
Next, we want to see what we are looking at and from where we are gathering the data.
For that, we have two options (and it’s up to you to choose what works better for you):
1. Printing the soup instance to a terminal
When we print the soup variable, we will get the HTML code of the page we are accessing.
I have written two options for printing the results: print(soup) and print(soup.prettify()).
The difference – if we print soup itself, we will get ugly, cluttered code. If we use prettify(), we will get clean, nicely indented HTML code.
2. Using DevTools
I prefer the second option.
Open the URL you want to scrape in the browser, right-click anywhere on the page and select Inspect.
DevTools will appear. Select the Elements tab and reload the page. The HTML code of the page will be shown in the Elements section. This is the kind of output you are looking for:
<title>All products | Books to Scrape - Sandbox</title>
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
<meta name="created" content="24th Jun 2016 09:29" />
<meta name="description" content="" />
<meta name="viewport" content="width=device-width" />
<meta name="robots" content="NOARCHIVE,NOCACHE" />...
Now, if you haven’t used HTML before, do some short research – it's not complicated.
For scraping purposes, we should know basics like tags, ids, classes, etc. so we can grab what we need.
In the beginning, it will take more time to get familiar with the HTML itself. But...like everything else, it will get better with practice.
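As a small taste of how tags, ids, and classes map to BeautifulSoup lookups, here are a few common calls using the soup instance we created earlier (the id below is made up, just to show the syntax):
soup.find("h1")                            # first <h1> tag on the page
soup.find(id="main-content")               # element with a given id (hypothetical id)
soup.find_all("p", class_="price_color")   # every <p> tag with that class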
Let's Scrape
What we want to do here is to get the book title, the price of the book, and to check if the book is available or not.
First, we want to see which HTML tag holds the content we want to scrape.
After inspecting the HTML, we can spot the <article> tag. To be more precise, the <article> tag with class="product_pod".
That specific tag has everything we need – the title, the price, and the availability of the book. Each book's information is wrapped in its own <article> tag with class="product_pod".
We have our target – we just need to have all of them in one variable for easier scraping.
Let’s add the following line to our scraper.py file:
results = soup.find_all("article", class_="product_pod")
In our soup instance we are looking for all of the <article> tags that have class="product_pod", and we are storing them in the results variable.
When we print results, we will get a list-like variable.
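A quick sanity check (my own addition) is to count how many product cards we found – the first page of this site should list 20 books:
print(len(results))  # should print 20 for the first page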
Now, let's see what one element of results looks like, so we know what to do next.
Add the following to your file and run the script:
result = results[0]
print(result)
Terminal:
<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
In stock
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
Now that we have printed only one result, we can see it contains an <h3> tag with the title we want.
We also have a <p> tag with class="price_color" which contains the price of the book, and a <p> tag with class="instock availability" which tells us whether the book is in stock or not.
Now we will access the information we need.
Let's start extracting data from the result variable, which contains the data for just one book (a bit later, we will do it for all the books on the page we are scraping).
The book title
To your file add:
title_element = result.find("h3")
When we print title_element, we will get the following:
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
As we can see, the full title is in the <a> tag – specifically, in its title attribute. We access it by finding the <a> tag inside title_element and reading that attribute. Add to your file:
title = title_element.find("a")["title"]
print(title) will now give us "A Light in the Attic".
We solved the book title.
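One note on why we read the title attribute instead of the link text: the text inside the <a> tag is the truncated title shown on the page, while the title attribute holds the full title.
link = title_element.find("a")
print(link.text)       # "A Light in the ..." – truncated display text
print(link["title"])   # "A Light in the Attic" – full title from the attribute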
The price
We saw the price is wrapped in a <p> tag with class="price_color". Add the following to your file:
price_element = result.find("p", class_="price_color")
When we print price_element, we get:
<p class="price_color">£51.77</p>
We now need to get the text from the tag and strip any surrounding whitespace. Add the following line to your file:
price = price_element.text.strip()
When we print price, we will have a nicely formatted string – £51.77.
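If you want to do math with the price later (sorting, averaging, etc.), you could strip the currency symbol and convert the rest to a number – a small optional sketch:
# Drop the leading "£" and convert the remainder to a float.
price_value = float(price.replace("£", ""))
print(price_value)  # 51.77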
The availability of the book
In the same way as with the book’s price, we will get the availability of the book. Add:
available_element = result.find("p", class_="instock availability")
Which will give us the following result:
<p class="instock availability">
<i class="icon-ok"></i>
In stock
</p>
Now we are cleaning the result by adding:
available = available_element.text.strip()
This will give us the string "In stock".
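If a True/False flag is more convenient than the raw text, a simple comparison (my own addition) works:
in_stock = available == "In stock"
print(in_stock)  # True for this book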
Ok, we know how to get the title, the price and the availability of the book.
The final code
Since the results variable from earlier is list-like, we can loop over it and repeat these steps for each book (for each result in results).
Let’s change our code a little by adding a for loop and printing the information for each book in the list. Change your file to look like this:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find_all("article", class_="product_pod")
for result in results:
    # loop through the results and store the values in separate variables.
    # the book title.
    title_element = result.find("h3")
    title = title_element.find("a")["title"]
    # the price of the book.
    price_element = result.find("p", class_="price_color")
    price = price_element.text.strip()
    # availability of the book.
    available_element = result.find("p", class_="instock availability")
    available = available_element.text.strip()
    print(f"The book title is: {title}.")
    print(f"The book price: {price}.")
    print(f"The book is: {available}.\n")
When we run the script, we will get the following results in the terminal:
The book title is: A Light in the Attic.
The book price: £51.77.
The book is: In stock.
The book title is: Tipping the Velvet.
The book price: £53.74.
The book is: In stock.
The book title is: Soumission.
The book price: £50.10.
The book is: In stock.
...
...
...
The book title is: Libertarianism for Beginners.
The book price: £51.33.
The book is: In stock.
The book title is: It's Only the Himalayas.
The book price: £45.17.
The book is: In stock.
There it is.
Scraping itself is not super difficult, just requires some practice (like everything else).
Also, it can be done in different ways depending on what you need and what you prefer.
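For example, instead of just printing the values, you could save them to a CSV file with Python's built-in csv module – a rough sketch (the file name and columns are my own choice):
import csv

import requests
from bs4 import BeautifulSoup

page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find_all("article", class_="product_pod")

# Write one row per book: title, price, and availability.
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price", "availability"])
    for result in results:
        title = result.find("h3").find("a")["title"]
        price = result.find("p", class_="price_color").text.strip()
        available = result.find("p", class_="instock availability").text.strip()
        writer.writerow([title, price, available])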
Hope someone will find this post helpful.