Extract all the data! – 02 – Python scrapy tutorial for beginners

David MM🐍 (davidmm1707) ・ Originally published at letslearnabout.net ・ 7 min read

Original post: Python scrapy tutorial for beginners – 02 – Extract all the data!

In our last lesson, we created our first Scrapy spider and scraped a few fields from each book. But we also learnt that every item has a URL with more detailed data. Let's see how to extract all the data in different ways.

In this post you will learn how to:

  • Scrape items on their own page
  • Extract routes with relative URLs
  • Select elements by tag, class, partial class and sibling elements
  • Extract information from tables
  • Use callbacks to other Scrapy class methods


Our current spider

In our last lesson, our spider was able to extract the title, price, image URL and book URL. Here is the code as a reminder:

import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):

        all_books = response.xpath('//article[@class="product_pod"]')

        for book in all_books:
            title = book.xpath('.//h3/a/@title').extract_first()
            price = book.xpath('.//div/p[@class="price_color"]/text()').extract_first()
            image_url = self.start_urls[0] + book.xpath('.//img[@class="thumbnail"]/@src').extract_first()
            book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()

            yield {
                'title': title,
                'price': price,
                'Image URL': image_url,
                'Book URL': book_url,
            }

If you don't know how to create a Scrapy project and spider, please, go to the first lesson: Creating your first spider

This spider is going to be our starting point, but instead of extracting the title, price, image and book URL, we are going to extract only the book URL, and then parse the page at that URL rather than the one in start_urls.


Using Scrapy to get to the detailed book URL

Take the whole spider, and remove everything related to title, image and price. Remove the yield. This should be your spider now:

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):

        all_books = response.xpath('//article[@class="product_pod"]')

        for book in all_books:
            book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()

Right now we are getting all the books and extracting their URLs. Now, for each book, we are going to use a new method. The parse method is called automatically when the spider starts, but we can create our own methods.

As we have the book URL, we can create another request, that is, a new petition to the server. But instead of the base URL books.toscrape.com, we are going to use the book's URL. Add this to your script:

# Old code
        for book in all_books:
            # Keep the expression on one line; a line ending in '+' is a syntax error
            book_url = self.start_urls[0] + book.xpath('.//h3/a/@href').extract_first()
# New code (still inside the for loop)
            yield scrapy.Request(book_url, callback=self.parse_book)

    def parse_book(self, response):
        print(response.status)

We use Scrapy's Request class to ask the server for a new HTML page: the one stored at book_url. The callback, the method that runs once we get the response, is a new method: parse_book.
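If the callback idea feels abstract, here is a toy model of it in plain Python. It is only a sketch: FakeRequest, FakeResponse, ToySpider and run are made-up names, the book paths are hypothetical, and Scrapy's real engine is far more elaborate (it downloads pages, works asynchronously, filters duplicates). The point is just the flow: parse yields requests, and the engine later calls each request's callback with the response.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeRequest:
    """Stand-in for scrapy.Request: a URL plus the method to call later."""
    url: str
    callback: Callable


@dataclass
class FakeResponse:
    """Stand-in for a Scrapy response; only the fields used below."""
    url: str
    status: int = 200


class ToySpider:
    start_url = 'http://books.toscrape.com/'

    def parse(self, response):
        # Hypothetical book paths; the real spider reads them from the page.
        for path in ('catalogue/book-a/index.html', 'catalogue/book-b/index.html'):
            yield FakeRequest(self.start_url + path, callback=self.parse_book)

    def parse_book(self, response):
        yield {'url': response.url, 'status': response.status}


def run(spider):
    """Toy engine: call parse, then hand every yielded request to its callback."""
    items = []
    queue = list(spider.parse(FakeResponse(spider.start_url)))
    while queue:
        request = queue.pop(0)
        # A real engine would download request.url here.
        for result in request.callback(FakeResponse(request.url)):
            if isinstance(result, FakeRequest):
                queue.append(result)
            else:
                items.append(result)
    return items


items = run(ToySpider())
print(items)
```

Notice that parse never calls parse_book itself. It only hands the method over, which is why we write callback=self.parse_book without parentheses.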

Run the code and you will get a bunch of 200s, the status code for success:


As we did in the parse method, we are going to extract the data from each book's own URL. Open one random book, for example, Sharp Objects

We are going to use this one as a model and every book will be scraped the same way.

We have a lot to choose from! Why don't we start from the title?

Extracting data – The easy ones

Right-click on the title, select inspect and look where it is located. It's the only h1 tag inside a div. Pretty easy. Let's find an h1 inside a div and extract its text. Then, we store it in a variable:

    def parse_book(self, response):
        title = response.xpath('//div/h1/text()').extract_first()
        print(title)

Let's run the code and print the title:

Easy, right?

Before, we just had the main URL and looped over the articles to extract the data.

Now we have the main URL and loop over the articles to extract each book's URL, then we request that URL and extract the data from the response. One additional step in another method. This is all it takes.

Let's keep going. Locate the image and right-click it and then inspect it. Seems like we have a partial URL again!

extracting data - image url

Luckily you have learnt a lot in our first lesson and you know how to create the final URL by getting the partial URL and adding the base URL. Why don't you give it a try?

It doesn't matter if you don't succeed on the first try. Get the URL, add the base URL and print the result until you get it right.

This is how I did it:

class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
# New 'base_url' variable
    base_url = 'http://books.toscrape.com'

    .....
    def parse_book(self, response):
        title = response.xpath('//div/h1/text()').extract_first()

        relative_image = response.xpath('//div[@class="item active"]/img/@src').extract_first()
        final_image = self.base_url + relative_image.replace('../..', '')

As always, print final_image to see that you have a proper URL. You know the drill.
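Joining strings and replacing '../..' by hand works here, but Python's standard library can resolve relative URLs for you. The sketch below uses urllib.parse.urljoin; the book path and image path are hypothetical examples, not real pages. Inside a spider, Scrapy responses offer response.urljoin(relative_url), which resolves against the URL of the page being parsed.

```python
from urllib.parse import urljoin

# Both relative paths below are hypothetical examples.
home_url = 'http://books.toscrape.com/'
book_url = urljoin(home_url, 'catalogue/some-book_1/index.html')

# On the book page, the image src climbs two folders with '../..'.
image_url = urljoin(book_url, '../../media/cache/some-cover.jpg')

print(book_url)   # http://books.toscrape.com/catalogue/some-book_1/index.html
print(image_url)  # http://books.toscrape.com/media/cache/some-cover.jpg
```

The advantage is that you don't need to know how many '../' segments a link has: the join is computed from the page the link was found on.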

Let's get the price.

The 'contains' selector

Right-click the price, inspect it and you can see that it is inside a p tag with a price_color class.

The problem is that every item in the bottom section, 'Products you recently viewed', has that too!

So it's not enough to search for a p tag with the price_color class inside a div: that div also needs to have the product_main class!

But that is just one part of the class:

extracting data - containing class

We can use a selector to search for an element whose class contains a given string. Instead of using the whole class, "col-sm-6 product_main", we only search for product_main.

Here's the code:

        price = response.xpath(
            '//div[contains(@class, "product_main")]/p[@class="price_color"]/text()').extract_first()

We look for a div whose class contains product_main, then we get the text inside the p with the price_color class.

Print the price and run the code again to check it is working.
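One thing to keep in mind: contains() is a plain substring test on the attribute value, not a test for a whole class name. The plain-Python sketch below mimics both behaviours so you can see the difference. The helper names are mine, and the class "product_main_wide" is a made-up example.

```python
def xpath_contains(class_attr, needle):
    """Mimics XPath contains(@class, needle): a plain substring test."""
    return needle in class_attr


def has_class(class_attr, name):
    """Stricter check: name must be one whole, whitespace-separated class."""
    return name in class_attr.split()


real = 'col-sm-6 product_main'
lookalike = 'col-sm-6 product_main_wide'  # made-up class for illustration

print(xpath_contains(real, 'product_main'))       # True
print(xpath_contains(lookalike, 'product_main'))  # True, maybe not what you want
print(has_class(real, 'product_main'))            # True
print(has_class(lookalike, 'product_main'))       # False
```

For this site the substring test is good enough. If you ever need the strict version in XPath, a common idiom is contains(concat(" ", normalize-space(@class), " "), " product_main ").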

Now, your turn: scrape the stock (the text that says ' In stock (X available) '). Use the technique you have just seen and do it yourself.

Here's my solution:

        stock = response.xpath(
            '//div[contains(@class, "product_main")]/p[contains(@class, "instock")]/text()').extract()[1].strip()

This time the selector returns 2 text elements, so I pick the one I want and remove the surrounding whitespace with Python's .strip().
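If you would rather store the number of copies than the whole sentence, a small regular expression can pull it out. This is an optional sketch: parse_stock is a hypothetical helper, and it assumes the text keeps the 'In stock (X available)' shape shown above.

```python
import re


def parse_stock(text):
    """Return the number inside '(X available)', or None if it is missing."""
    match = re.search(r'\((\d+) available\)', text)
    return int(match.group(1)) if match else None


print(parse_stock('In stock (19 available)'))  # 19
print(parse_stock('Out of stock'))             # None
```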

Let's extract the ratings. Right-click on the stars and we have this:

extracting data - @class

Every star has an icon-star class, but if you look at the parent element, you can see that all the stars are wrapped in a p with the class star-rating Four. Four is the rating.

Try to extract it. Just get the p whose class contains star-rating and read its class attribute. Then remove the extra text we don't need.

Here's my code:

        stars = response.xpath(
            '//div/p[contains(@class, "star-rating")]/@class').extract_first().replace('star-rating ', '')
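The rating comes out as a word ('Four'), which is awkward if you want to sort or average it later. Here is a small optional sketch that turns the class attribute into a number. parse_stars and RATING_WORDS are hypothetical names, and I am assuming the other ratings are spelled One, Two, Three and Five in the same way.

```python
# Assumption: ratings are spelled as capitalised English words, One to Five.
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def parse_stars(class_attr):
    """Return the rating found in a class like 'star-rating Four', or None."""
    for token in class_attr.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None


print(parse_stars('star-rating Four'))  # 4
print(parse_stars('star-rating'))       # None
```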

Family matters – Siblings

The description is a tricky one:

The p tag has no class! How can we select it?

Well, we can't… But we can select the previous element, div id="product_description", and then move to the next HTML node, its sibling. Like this:

        description = response.xpath('//div[@id="product_description"]/following-sibling::p/text()').extract_first()

We select the div with the id product_description, then we go to the next p sibling and we select and extract the text. Phew!
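To make the sibling idea concrete, here is a plain-Python model of what following-sibling::p does, using the standard library's ElementTree on a tiny made-up snippet (the markup and texts are sample data, not the real page). ElementTree has no following-sibling axis, so the helper walks the parent's children by hand.

```python
import xml.etree.ElementTree as ET

# Made-up markup that imitates the layout described above.
SAMPLE = """
<article>
  <div id="product_description"><h2>Product Description</h2></div>
  <p>Sample description text.</p>
  <div class="sub-header"><h2>Product Information</h2></div>
  <p>Another paragraph.</p>
</article>
"""


def following_sibling_paragraphs(parent, anchor_id):
    """Texts of every p that comes after the child with the given id."""
    children = list(parent)
    for index, child in enumerate(children):
        if child.get('id') == anchor_id:
            return [c.text for c in children[index + 1:] if c.tag == 'p']
    return []


article = ET.fromstring(SAMPLE.strip())
paragraphs = following_sibling_paragraphs(article, 'product_description')
print(paragraphs)     # ['Sample description text.', 'Another paragraph.']
print(paragraphs[0])  # Sample description text.
```

As the output shows, the axis matches every later p sibling, not only the next one. In the spider, extract_first() keeps the first match; writing following-sibling::p[1] in the XPath makes that choice explicit.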

Tables

As if you hadn't had enough with contains and siblings, now we have tables!

Don't you worry, I have you covered.

We need to select the table, then the row (tr) by its position, and then the value, in this case the td. After the selection, we get the text as usual. Let me do the first one, UPC:

        upc = response.xpath(
            '//table[@class="table table-striped"]/tr[1]/td/text()').extract_first()

Print it and run the spider. This is how we extract data from tables. Now it's your turn:

Extract the price excl tax, the price inc tax and the tax. Then yield the result, as we did in the first spider.

Do it yourself and don't look here unless needed.

        price_excl_tax = response.xpath(
            '//table[@class="table table-striped"]/tr[3]/td/text()').extract_first()
        price_inc_tax = response.xpath(
            '//table[@class="table table-striped"]/tr[4]/td/text()').extract_first()
        tax = response.xpath(
            '//table[@class="table table-striped"]/tr[5]/td/text()').extract_first()

        yield {
            'Title': title,
            'Image': final_image,
            'Price': price,
            'Stock': stock,
            'Stars': stars,
            'Description': description,
            'Upc': upc,
            'Price excl tax': price_excl_tax,
            'Price incl tax': price_inc_tax,
            'Tax': tax,
        }
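Selecting rows by position (tr[3], tr[4]...) breaks if the site ever adds or reorders a row. An alternative is to read each row's header cell and build a dictionary from it. The sketch below shows the idea with the standard library on a made-up table: the values are sample data and the header labels are my assumption of what the rows are called. In the spider you could do the same by looping over the rows with response.xpath and reading th and td from each one.

```python
import xml.etree.ElementTree as ET

# Made-up table; header labels and values are sample data.
SAMPLE_TABLE = """
<table class="table table-striped">
  <tr><th>UPC</th><td>0000000000000000</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£10.00</td></tr>
  <tr><th>Price (incl. tax)</th><td>£10.00</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
</table>
"""


def table_to_dict(table_xml):
    """Map each row's header text (th) to its value text (td)."""
    table = ET.fromstring(table_xml.strip())
    return {
        row.findtext('th').strip(): row.findtext('td').strip()
        for row in table.iter('tr')
    }


info = table_to_dict(SAMPLE_TABLE)
print(info['UPC'])  # 0000000000000000
print(info['Tax'])  # £0.00
```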

And that's it! Run the spider, but this time store the output in a file.

scrapy crawl spider -o books_detailed.json

Open the new file and make sure everything is in order.
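If you don't feel like checking every entry by eye, a few lines of Python can do it. This is a rough sketch: load_items and incomplete_items are hypothetical helpers, and the sample items are made up. Note that -o appends to an existing file, so delete books_detailed.json before a second run or the JSON will no longer parse.

```python
import json


def load_items(path):
    """Read the JSON array written by 'scrapy crawl spider -o <file>.json'."""
    with open(path, encoding='utf-8') as handle:
        return json.load(handle)


def incomplete_items(items, required):
    """Return the items where any required field is missing or empty."""
    return [item for item in items if any(not item.get(key) for key in required)]


# Made-up sample; with a real crawl use: items = load_items('books_detailed.json')
sample_items = [
    {'Title': 'Book A', 'Price': '£10.00'},
    {'Title': 'Book B', 'Price': None},
]
broken = incomplete_items(sample_items, ['Title', 'Price'])
print(broken)  # [{'Title': 'Book B', 'Price': None}]
```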


Conclusion

Congratulations! You managed to improve your spider!

Now you know how to select elements the normal way, by attributes such as class or id, by partial attributes, through sibling elements, in tables, etc., and you can extract all the details from all the books!

Well, at least from all the books on the main page. Wouldn't it be nice to extract all the books, going page by page, until every single book is scraped?

Don't worry, you can learn how to do it in the third lesson of this tutorial: How to get to the next page


Final code on Github

Reach to me on Twitter

My Youtube tutorial videos

Previous video: 01 – Creating your first spider

Discussion (1)

Khoi Thinh

Hi David,

Old code

for book in all_books:
book_url = self.start_urls[0] +
book.xpath('.//h3/a/@href').extract_first()

New code

yield scrapy.Request(book_url, callback=self.parse_book)

def parse_book(self, response):
print(response.status)

I got an error saying book_url is not defined, can you post full code of this section?