Jelena

Playing with BeautifulSoup (spiders might not be so scary after all)

Scraping is cool.
You can automate lots of things – scrape the job listings you want, scrape articles, create content aggregators, send the things you need to your email, save them to a file... plenty of options.
Just be respectful to the websites you are scraping and check their policies regarding scraping.
Requirements:
Python, BeautifulSoup4, requests.

The URL we will be using is meant for educational purposes. Its creator says it can be used for practicing your scraping skills:
http://books.toscrape.com/
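Since the site explicitly allows practice scraping, this step is optional, but if you ever want to check a site's robots.txt first, a minimal sketch with Python's standard-library urllib.robotparser (my addition, not required for this tutorial) could look like this:

from urllib.robotparser import RobotFileParser

# Optional check: does robots.txt allow a generic bot ("*") to fetch the front page?
rp = RobotFileParser("http://books.toscrape.com/robots.txt")
rp.read()
print(rp.can_fetch("*", "http://books.toscrape.com/"))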

Installation

Create a Python file where you will write the scraping code – I will name mine scraper.py.
As usual, we are using pip to install the requests and beautifulsoup4 modules.
Go to your terminal where your venv is activated and type:
pip install requests beautifulsoup4

The requests module helps us send requests to a webpage and read the response.
BeautifulSoup will help us parse the HTML and filter out the data we want.

Let's Start

Firstly, we will use the requests module to access the URL.
scraper.py

import requests


page = requests.get("http://books.toscrape.com/")
print(page)

When we run the script, we will get:
<Response [200]>
which tells us we have successfully accessed the URL (status code 200 means OK).
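If you want to be a bit more defensive (my addition, not something the tutorial requires), you can ask requests to raise an error for any non-OK response instead of checking the code by hand:

page = requests.get("http://books.toscrape.com/", timeout=10)
page.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
print(page.status_code)  # 200 when everything went fine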
Now, we will start using BeautifulSoup.

Edit scraper.py file:

import requests
from bs4 import BeautifulSoup

page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, "html.parser")
print(soup)
print(soup.prettify())

We have created the soup instance.
Next, we want to see what we are looking at and from where we are gathering the data.

For that, we have two options (and it’s up to you to choose what works better for you):

1. Printing the soup instance to a terminal

When we print the soup variable, we will get the HTML code of the page we are accessing.
I have written two options for printing the results (print(soup) and print(soup.prettify())).
The difference – if we print the soup itself, we will get ugly, cluttered code.
If we use prettify(), we will get nicely indented HTML.
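If scrolling through the terminal gets tedious, one optional trick (my addition) is to dump the prettified HTML into a file and open it in your editor:

# Save the prettified HTML so it can be inspected in an editor instead of the terminal.
with open("page.html", "w", encoding="utf-8") as f:
    f.write(soup.prettify())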

2. Using DevTools

I prefer the second option.
When we open the URL we want to scrape in the browser, right-click anywhere on the page and select Inspect.
DevTools will appear. Select the Elements tab and reload the page. The HTML code of the page will be shown in the Elements section. This is the kind of markup you are looking for:
<head>
    <title>All products | Books to Scrape - Sandbox</title>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
    <meta name="created" content="24th Jun 2016 09:29" />
    <meta name="description" content="" />
    <meta name="viewport" content="width=device-width" />
    <meta name="robots" content="NOARCHIVE,NOCACHE" />...

Now, if you haven't used HTML before, do some quick research – it's not complicated.
For scraping purposes, we should know the basics like tags, ids, classes, etc., so we can grab what we need.
In the beginning, it will take more time to get familiar with the HTML itself. But... like everything else, it will get better with practice.
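To make those terms concrete, here is a tiny made-up snippet (my example, not from the page we are scraping) parsed with BeautifulSoup, showing what a tag, an id and a class look like from Python's side:

from bs4 import BeautifulSoup

# A made-up one-line document, just to illustrate the terminology.
snippet = BeautifulSoup('<p id="intro" class="highlight">Hello</p>', "html.parser")
tag = snippet.find("p")
print(tag.name)      # the tag name: p
print(tag["id"])     # the id attribute: intro
print(tag["class"])  # the class attribute (as a list): ['highlight']
print(tag.text)      # the text inside the tag: Hello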

Let's Scrape

What we want to do here is to get the book title, the price of the book, and to check if the book is available or not.

Firstly, we want to see which HTML tag holds the content we want to scrape.
After inspecting the HTML, we can notice the <article> tag – to be more precise, the <article> tag with class="product_pod".
That specific tag has everything we need – the title, the price, and the availability of the book.
The information for each book is wrapped in an <article> tag with class="product_pod".
We have our target – we just need to gather all of them in one variable for easier scraping.
Let’s add the following line to our scraper.py file:

results = soup.find_all("article", class_="product_pod")

In our soup instance we are looking for all of the <article> tags that have class="product_pod", and we are storing them in the results variable.
When we print results, we will get a list-like variable.
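If you are curious, you can also peek at what find_all returned (my addition) – it is a bs4 ResultSet, which behaves like a list of tags:

print(type(results))  # <class 'bs4.element.ResultSet'>
print(len(results))   # how many product_pod articles are on this page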
Now, let's see what one element of results looks like, so we know what to do.
Add the following to your file and run the script:

result = results[0]
print(result)

Terminal:

<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>

        In stock

</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>

Now that we have printed only one result, we can see it contains an <h3> tag with the title we want.
We can also see a <p> tag with class="price_color" which contains the price of the book, and a <p> tag with class="instock availability" which tells us whether the book is in stock or not.
Now we will access the information we need.

Let's start extracting data from the result variable which contains data related to just one book (a bit later, we will do it for all the books from the page we are scraping).

The book title

To your file add:

title_element = result.find("h3")

When we print title_element, we will get the following:

<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>

As we can see, the full title is stored in the title attribute of the <a> tag. We access it by finding the <a> tag inside title_element and reading its title attribute. Add to your file:

title = title_element.find("a")["title"]

print(title) will now give us “A Light in the Attic”.
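Why the title attribute and not the link's text? On this page the visible link text is truncated, while the title attribute holds the full name. A quick comparison (my addition):

link = title_element.find("a")
print(link.text)      # the visible, truncated text: A Light in the ...
print(link["title"])  # the full title: A Light in the Attic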

We solved the book title.

The price

We saw the price is wrapped in <p> tag with the class="price_color". Add to your file the following:

price_element = result.find("p", class_="price_color")

When we print the price_element, we get:

<p class="price_color">£51.77</p>

We now need to get the text from the tag and strip any surrounding whitespace. Add the following line to your file:

price = price_element.text.strip()

When we print the price, we will have a nicely formatted string – £51.77.
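If you plan to do any math or sorting with the prices, one option (my addition) is to strip the currency symbol and convert the string to a number:

# Turn '£51.77' into the float 51.77 by dropping the currency symbol.
price_value = float(price.replace("£", ""))
print(price_value)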

The availability of the book

Similarly to the book’s price, we will get the availability of the book. Add:

available_element = result.find("p", class_="instock availability")

Which will give us the following result:

<p class="instock availability">
<i class="icon-ok"></i>
        In stock
</p>

Now we are cleaning the result by adding:

available = available_element.text.strip()

Which will give us the string “In stock”.
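Depending on what you want to do with the data, it might be handier (my addition) to turn the availability text into a simple boolean flag:

# True if the cleaned availability text says the book is in stock.
in_stock = available.startswith("In stock")
print(in_stock)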
Ok, we know how to get the title, the price and the availability of the book.

The final code

Since the results variable from the beginning is list-like, we can loop over it and repeat these steps for each book (for each result in results).
Let's change our code a little bit by adding a for loop and printing the information for each book in the list. Change your file to look like this:

import requests
from bs4 import BeautifulSoup


page = requests.get("http://books.toscrape.com/")

soup = BeautifulSoup(page.content, "html.parser")

results = soup.find_all("article", class_="product_pod")

for result in results:
    # looping through results and storing the values in separate variables.
    # the book title.
    title_element = result.find("h3")
    title = title_element.find("a")["title"]
    # the price of the book.
    price_element = result.find("p", class_="price_color")
    price = price_element.text.strip()
    # availability of the book
    available_element = result.find("p", class_="instock availability")
    available = available_element.text.strip()

    print(f"The book title is: {title}.")
    print(f"The book price: {price}.")
    print(f"The book is: {available}.\n")

When we run the script, we will get the following results in the terminal:

The book title is: A Light in the Attic.
The book price: £51.77.
The book is: In stock.

The book title is: Tipping the Velvet.
The book price: £53.74.
The book is: In stock.

The book title is: Soumission.
The book price: £50.10.
The book is: In stock.
...
...
...
The book title is: Libertarianism for Beginners.
The book price: £51.33.
The book is: In stock.

The book title is: It's Only the Himalayas.
The book price: £45.17.
The book is: In stock.
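Before wrapping up: the intro mentioned saving scraped data to a file, so here is an optional, self-contained sketch (my addition, using Python's standard csv module) that writes the same three fields to a CSV instead of printing them:

import csv

import requests
from bs4 import BeautifulSoup

page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, "html.parser")

# Write one row per book: title, price, availability.
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "price", "availability"])  # header row
    for result in soup.find_all("article", class_="product_pod"):
        title = result.find("h3").find("a")["title"]
        price = result.find("p", class_="price_color").text.strip()
        available = result.find("p", class_="instock availability").text.strip()
        writer.writerow([title, price, available])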

There it is.
Scraping itself is not super difficult; it just requires some practice (like everything else).
Also, it can be done in different ways depending on what you need and what you prefer.
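For example, the same extraction can be written with CSS selectors instead of find/find_all – purely a matter of taste (my sketch, using the soup object from the final script and giving equivalent results):

for result in soup.select("article.product_pod"):
    title = result.select_one("h3 a")["title"]
    price = result.select_one("p.price_color").text.strip()
    available = result.select_one("p.instock.availability").text.strip()
    print(title, price, available)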
I hope someone finds this post helpful.
