David MM👨🏻‍💻

Posted on Sep 12, 2019 • Edited on Sep 14, 2019 • Originally published at letslearnabout.net

How to go to the next page - 03 - Python scrapy tutorial for beginners

#python #scrapy #tutorial

Original post Python Scrapy tutorial for beginners – 03 – How to go to the next page

In our last lesson, extracting all the data with Scrapy, we managed to get all the book URLs and then extracted the data from each one. We were limited to the books on the main page, as we didn't know how to go to the next page using Scrapy.

Until now.

In this post you will learn how to:

Navigate to the 'next page'
Solve routing problems
Extract all the data of every book available

Our game-plan

Initially we just listed all the book URLs and then, one by one, we extracted the data.

As we had 20 books, we just listed 20 book URLs, and then parsed those 20 URLs, yielding the result.

We just need to add another step.

Now, we'll list 20 book URLs, parse them, and then, if there is a 'Next' page, we'll navigate to it to repeat the process, listing and yielding the new 20 book URLs, until there are no more pages.

In our Beautiful Soup tutorial we used the same strategy:

And that's what we are going to start using right now.
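
The game plan above can be sketched in plain Python before touching Scrapy. This is only an illustration: FAKE_SITE, its page keys and its book names are made-up stand-ins, not data from the real website.

```python
# Hypothetical stand-in for the website: each page lists its books and,
# optionally, the key of the next page. None of this is real site data.
FAKE_SITE = {
    "page-1": {"books": ["book-a", "book-b"], "next": "page-2"},
    "page-2": {"books": ["book-c", "book-d"], "next": "page-3"},
    "page-3": {"books": ["book-e"], "next": None},
}


def crawl(site, first_page):
    """Yield every book, following 'next' links until there are none."""
    page_key = first_page
    while page_key is not None:
        page = site[page_key]
        # Step 1: list and parse the books on the current page.
        for book in page["books"]:
            yield book
        # Step 2: move on only if the page has a 'Next' link.
        page_key = page["next"]


all_books = list(crawl(FAKE_SITE, "page-1"))
print(all_books)
```

In the spider there is no explicit while loop: the parse method yields a new request whose callback is parse itself, and that plays the same role.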

Checking if there is a 'next page' available

Let's start from the code we used in our second lesson, extract all the data:

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
    base_url = 'http://books.toscrape.com/'

    def parse(self, response):
        all_books = response.xpath('//article[@class="product_pod"]')

        for book in all_books:
            book_url = self.start_urls[0] + \
                book.xpath('.//h3/a/@href').extract_first()

            yield scrapy.Request(book_url, callback=self.parse_book)

    def parse_book(self, response):
        title = response.xpath('//div/h1/text()').extract_first()

        relative_image = response.xpath('//div[@class="item active"]/img/@src').extract_first()
        final_image = self.base_url + relative_image.replace('../..', '')

        price = response.xpath(
            '//div[contains(@class, "product_main")]/p[@class="price_color"]/text()').extract_first()
        stock = response.xpath(
            '//div[contains(@class, "product_main")]/p[contains(@class, "instock")]/text()').extract()[1].strip()
        stars = response.xpath(
            '//div/p[contains(@class, "star-rating")]/@class').extract_first().replace('star-rating ', '')
        description = response.xpath(
            '//div[@id="product_description"]/following-sibling::p/text()').extract_first()
        upc = response.xpath(
            '//table[@class="table table-striped"]/tr[1]/td/text()').extract_first()
        price_excl_tax = response.xpath(
            '//table[@class="table table-striped"]/tr[3]/td/text()').extract_first()
        price_inc_tax = response.xpath(
            '//table[@class="table table-striped"]/tr[4]/td/text()').extract_first()
        tax = response.xpath(
            '//table[@class="table table-striped"]/tr[5]/td/text()').extract_first()

        yield {
            'Title': title,
            'Image': final_image,
            'Price': price,
            'Stock': stock,
            'Stars': stars,
            'Description': description,
            'Upc': upc,
            'Price excl tax': price_excl_tax,
            'Price incl tax': price_inc_tax,
            'Tax': tax,
        }

Since this is currently working, we just need to check if there is a 'Next' button after the for loop is finished. Right-click on the next button:

The next page URL is inside an a tag, within a li tag. You know how to extract it, so create a next_page_url we can navigate to. Beware, it is a partial URL, so you need to add the base URL. As we did it before, you can do it yourself. Give it a try.

This is how I did it:

        for book in all_books:
            book_url = self.start_urls[0] + \
                book.xpath('.//h3/a/@href').extract_first()

            yield scrapy.Request(book_url, callback=self.parse_book)

        # New code:
        next_page_partial_url = response.xpath(
            '//li[@class="next"]/a/@href').extract_first()

        next_page_url = self.base_url + next_page_partial_url
        yield scrapy.Request(next_page_url, callback=self.parse)

Run the code with scrapy crawl spider -o next_page.json and check the result.

What's going on? There are only 20 elements in the file! Let's check the log to see what happened.

We managed to get the first 20 books, but then, suddenly, we can't get more books…

books.toscrape.com is a website made by Scrapinghub to train people on web scraping, and it has little traps you need to notice. Compare the successful URLs (blue underline) with the failed ones (red underline). There is a /catalogue missing from each failed route. They didn't add it to make you fail.
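
One way to see where the missing /catalogue comes from: the links on the page are relative, and a relative link is resolved against the URL of the page it appears on. Here is a minimal sketch with the standard library's urljoin. The hrefs are illustrative values in the shape the site seems to use (some-book_123 is a hypothetical book), not lines copied from the log.

```python
from urllib.parse import urljoin

# Illustrative values (assumptions), shaped like the URLs in this lesson.
home_page = "http://books.toscrape.com/"
second_page = "http://books.toscrape.com/catalogue/page-2.html"

# On the home page, the book href already carries the catalogue/ segment.
book_from_home = urljoin(home_page, "catalogue/some-book_123/index.html")

# On page 2, the href is shorter; it resolves relative to /catalogue/.
book_from_second = urljoin(second_page, "some-book_123/index.html")

# Gluing that same short href onto the site root loses /catalogue.
naive = home_page + "some-book_123/index.html"

print(book_from_home)
print(book_from_second)
print(naive)
```

Both urljoin calls end up at the same absolute URL, while the naive concatenation points to a page that does not exist. Scrapy responses offer response.urljoin, which resolves a link against the response's own URL in the same way; this lesson sticks to the manual fix so you can see each step.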

Let's solve that problem.

Solving the 'book' routing problem

As /catalogue is missing from some URLs, let's add a check: if the route doesn't have it, let's prefix it to the partial URL. As simple as that.

Try it on your own before continuing. You can check my code here:

        for book in all_books:
            book_url = book.xpath('.//h3/a/@href').extract_first()

            if 'catalogue/' not in book_url:
                book_url = 'catalogue/' + book_url

            book_url = self.base_url + book_url
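
The check above uses not in, which looks for 'catalogue/' anywhere in the string. That is enough for this site. A slightly stricter variant, sketched below, only looks at the start of the partial URL. The function name and the sample hrefs are hypothetical, chosen just for the illustration.

```python
def with_catalogue_prefix(partial_url, prefix="catalogue/"):
    """Return partial_url, adding the prefix only if it does not start with it."""
    if partial_url.startswith(prefix):
        return partial_url
    return prefix + partial_url


# Hypothetical hrefs, one with the prefix and one without.
print(with_catalogue_prefix("catalogue/some-book_123/index.html"))
print(with_catalogue_prefix("some-book_123/index.html"))

# A made-up href that contains 'catalogue/' in the middle: the 'not in'
# check would leave it alone, while startswith still adds the prefix.
print(with_catalogue_prefix("old-catalogue/some-book_123/index.html"))
```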

Let's run the code again! It should work, right? scrapy crawl spider -o next_page.json

Now we have more books! But only 40. We managed to get the first 20, then the next 20. Then, something happened. We didn't get the third page from the second one. Let's go to the second page and see what's going on with the next button and compare it with the first one (and its link to the second one)

We have the same problem we had with the books: Some links have /catalogue, some others don't.

Solving the 'next' routing problem

As we have the same problem, we have the same solution. One you can solve easily. Why don't you try? Again, you just need to check the link and prefix /catalogue in case that sub-string isn't there.

If you couldn't solve it, this is my solution:

        next_page_partial_url = response.xpath(
            '//li[@class="next"]/a/@href').extract_first()

        if next_page_partial_url:
            if 'catalogue/' not in next_page_partial_url:
                next_page_partial_url = "catalogue/" + next_page_partial_url

            next_page_url = self.base_url + next_page_partial_url
            yield scrapy.Request(next_page_url, callback=self.parse)

You can see the pattern: we get the partial URL, we check if /catalogue is missing and, if it is, we add it. Then we add the base_url and we have our absolute URL.
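
The solution also wraps everything in if next_page_partial_url:. On the last page there is no 'Next' button, so extract_first() returns None, and adding None to a string raises a TypeError. Here is a small sketch of the same pattern as a standalone function; build_next_page_url is a hypothetical helper name, not part of the spider.

```python
BASE_URL = "http://books.toscrape.com/"


def build_next_page_url(partial_url, base_url=BASE_URL):
    """Return the absolute 'next' URL, or None when there is no next page."""
    if not partial_url:
        # extract_first() gives None when the XPath matches nothing,
        # which is the case on the last page.
        return None
    if "catalogue/" not in partial_url:
        partial_url = "catalogue/" + partial_url
    return base_url + partial_url


print(build_next_page_url("catalogue/page-2.html"))  # link as found on page 1
print(build_next_page_url("page-3.html"))            # link as found on page 2
print(build_next_page_url(None))                     # last page: nothing to follow
```

When the function returns None, the spider simply yields no new request and the crawl ends on its own.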

Run the spider again: scrapy crawl spider -o next_page.json.

Now we have our 1000 books. Every single one. 🙂
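
If you want to confirm the number yourself, a few lines of Python can count the items in the exported file. This is a minimal sketch, assuming the export is a single JSON list like the one scrapy crawl spider -o next_page.json writes on a fresh run; count_items is a hypothetical helper name.

```python
import json


def count_items(path):
    """Return how many scraped items a JSON export contains."""
    with open(path, encoding="utf-8") as handle:
        items = json.load(handle)
    return len(items)
```

Calling count_items("next_page.json") from the project folder should give the number of books scraped. If the file was left over from an earlier run, it may no longer be a single valid JSON list, so deleting it before running the spider again is an easy precaution.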

This is the final code:

# -*- coding: utf-8 -*-
import scrapy


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']
    base_url = 'http://books.toscrape.com/'

    def parse(self, response):
        all_books = response.xpath('//article[@class="product_pod"]')

        for book in all_books:
            book_url = book.xpath('.//h3/a/@href').extract_first()

            if 'catalogue/' not in book_url:
                book_url = 'catalogue/' + book_url

            book_url = self.base_url + book_url

            yield scrapy.Request(book_url, callback=self.parse_book)

        next_page_partial_url = response.xpath(
            '//li[@class="next"]/a/@href').extract_first()

        if next_page_partial_url:
            if 'catalogue/' not in next_page_partial_url:
                next_page_partial_url = "catalogue/" + next_page_partial_url

            next_page_url = self.base_url + next_page_partial_url
            yield scrapy.Request(next_page_url, callback=self.parse)


    def parse_book(self, response):
        title = response.xpath('//div/h1/text()').extract_first()

        relative_image = response.xpath(
            '//div[@class="item active"]/img/@src').extract_first()
        final_image = self.base_url + relative_image.replace('../..', '')

        price = response.xpath(
            '//div[contains(@class, "product_main")]/p[@class="price_color"]/text()').extract_first()
        stock = response.xpath(
            '//div[contains(@class, "product_main")]/p[contains(@class, "instock")]/text()').extract()[1].strip()
        stars = response.xpath(
            '//div/p[contains(@class, "star-rating")]/@class').extract_first().replace('star-rating ', '')
        description = response.xpath(
            '//div[@id="product_description"]/following-sibling::p/text()').extract_first()
        upc = response.xpath(
            '//table[@class="table table-striped"]/tr[1]/td/text()').extract_first()
        price_excl_tax = response.xpath(
            '//table[@class="table table-striped"]/tr[3]/td/text()').extract_first()
        price_inc_tax = response.xpath(
            '//table[@class="table table-striped"]/tr[4]/td/text()').extract_first()
        tax = response.xpath(
            '//table[@class="table table-striped"]/tr[5]/td/text()').extract_first()

        yield {
            'Title': title,
            'Image': final_image,
            'Price': price,
            'Stock': stock,
            'Stars': stars,
            'Description': description,
            'Upc': upc,
            'Price excl tax': price_excl_tax,
            'Price incl tax': price_inc_tax,
            'Tax': tax,
        }

Conclusion

You hit a milestone today. Now you are able to extract every single element from a website.

You have learnt that you need to get all the elements on the first page, scrape them individually, and then go to the next page to repeat the process. Let me show the diagram once again:

And not only that. This example was a tricky one as we had to check if the partial URL had /catalogue to add it.

Normally, paginating websites with Scrapy is easier as the 'next' button contains the full URL, so this example was even harder than normal and yet you managed to get it!

But… what if I tell you that this can be even easier than what we did?

Instead of grabbing your pitchfork and heading to my home, go to the fourth lesson where you will learn how to scrape every single item in an even easier way using crawlers.

My Youtube tutorial videos

Final code on Github

Reach me on Twitter

Previous lesson: 02 – Creating your first spider