
7 Tips for Building an Amazon Scraper

In this post I want to share some lessons learned while scraping Amazon product pages. For this project, I used the Scrapy framework to get a huge head start, but there were still plenty of pitfalls to overcome.

If you want to go straight to the code, you can find it on GitHub, along with instructions for setting it up.

My goal was to scrape thousands of product detail pages across several categories. The scraper stores product info in a PostgreSQL database. This data includes the name and price of the product, but also covers unstructured metadata that varies by category, such as CPU/RAM specs in the Computer category.

So, what do I wish I'd known at the beginning?

Build a spider to handle each product category

Starting out with a single spider class seemed like the natural choice. If you spend any time jumping between product categories, however, you'll soon realize that many of them have their own distinct layout.

Even when two product categories seem similar, their markup can have subtle differences. The path to madness lies in maintaining a single spider class that is aware of all these variations. Pretty soon, fixing a bug in one product category just introduces new bugs for other categories.

It's better to think of Amazon as many separate, if similar, websites. This may lead to some code duplication across spiders, but it makes each spider class much cleaner and more focused. Each product category is a start_url in Scrapy terms, making it easy to split up the crawling between different spider classes.

    # Map each category start URL to the spider class that handles it.
    start_urls = {
        "https://www.amazon.com/Exercise-Equipment-Gym-Equipment/b?ie=UTF8&node=3407731": ExerciseEquipmentSpider,
        "https://www.amazon.com/computer-pc-hardware-accessories-add-ons/b?ie=UTF8&node=541966": ComputerPCHardwareSpider,
    }

    # Assuming crawler is a CrawlerProcess: schedule every spider first,
    # then start the reactor once.
    for start_url, spider_class in start_urls.items():
        crawler.crawl(
            spider_class, start_urls=[start_url], allowed_domains=["amazon.com"],
        )

    crawler.start()

Use a residential proxy network

Like many big websites, Amazon has countermeasures in place to prevent scraping.

An IP address that sends thousands of requests will quickly end up on a blacklist. You'll need to route your requests through a proxy network to scrape at scale.

Luckily, setting this up with Scrapy is easy. Registering a middleware class that adds a proxy value to the request metadata is all it takes to route every request through a proxy.

import os


class CustomProxyMiddleware:
    def process_request(self, request, spider):
        request.meta["proxy"] = os.environ["PROXY_URL"]
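The middleware only takes effect once it's enabled in the project settings. Here's a minimal sketch, assuming the project module is called amazon_scraper:

# settings.py -- module path and priority are illustrative assumptions
DOWNLOADER_MIDDLEWARES = {
    "amazon_scraper.middlewares.CustomProxyMiddleware": 350,
}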

Proxy networks mostly come in two flavors: data-center and residential. Roughly speaking, data-center proxies are fast and cheap, while residential networks are slower and more expensive. Although a data-center proxy was my first choice, I soon found that many of its IP addresses were already blacklisted.

I had far better luck with a residential proxy network. Your mileage may vary with different proxy vendors, but in my case the data-center network proved impractical.

Rely on Scrapy middleware to filter requests

Preventing Scrapy from going off the rails and crawling unnecessary pages was a bigger part of the challenge than I expected. On a big website like Amazon, it's very easy for Scrapy to follow one wrong link into oblivion.

Category pages have very complex markup, making it difficult to come up with a selector that targets only the correct links. XPath is flexible enough to do the job, but, like regular expressions, it quickly becomes unreadable. Instead of forcing the selectors to do all the work, I used a middleware class to filter out unwanted requests.

The built-in IgnoreRequest exception can be raised to avoid making a request. I quickly built up a list of "rabbit hole" links that were getting queued up and causing my crawl to ride off into the sunset instead of visiting product pages.

from scrapy.linkextractors import IGNORED_EXTENSIONS
from scrapy.exceptions import IgnoreRequest


class IgnoredExtensionsMiddleware:
    # URL fragments that lead away from product detail pages.
    IGNORE_PATTERNS = (
        "/stores/",
        "gp/profile",
        "gp/product",
        "gp/customer-reviews",
        "product-reviews",
        "ask/answer",
    )

    def process_request(self, request, spider):
        for pattern in self.IGNORE_PATTERNS:
            if pattern in request.url:
                raise IgnoreRequest()

        # Also skip links that end in known media/archive file extensions.
        if request.url.lower().endswith(tuple(IGNORED_EXTENSIONS)):
            raise IgnoreRequest()
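Like the proxy middleware, this class only runs once it's registered in DOWNLOADER_MIDDLEWARES. Extending the earlier sketch (module paths and priorities are still assumptions):

# settings.py -- module paths and priorities are illustrative assumptions
DOWNLOADER_MIDDLEWARES = {
    "amazon_scraper.middlewares.IgnoredExtensionsMiddleware": 340,
    "amazon_scraper.middlewares.CustomProxyMiddleware": 350,
}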

Scrapy Rules and LinkExtractors are your friend

Cramming all of the logic into a single parse method was a recipe for a jumbled mess. Refactoring the crawling logic to make use of rules and link extractors cleaned up the spider class considerably. Link extractors define what part(s) of the page Scrapy should examine for links to follow. This logic, in combination with the middleware, controls how the crawler moves through the website.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ComputerPCHardwareSpider(CrawlSpider):
    name = "computer_pc_hardware_spider"

    rules = (
        # Follow the image tiles on category landing pages.
        Rule(
            LinkExtractor(
                restrict_xpaths="//div[contains(@class, 'bxc-grid__container')]//*[img]"
            )
        ),
        # Follow "See all results" links into the full product listings.
        Rule(LinkExtractor(restrict_text="See all results")),
        # Product links in the search results grid; these are handed to parse_item.
        Rule(
            LinkExtractor(restrict_xpaths="//div[contains(@class, 's-main-slot')]//h2"),
            callback="parse_item",
        ),
    )

The XPath expression passed to restrict_xpaths determines which part of the page to extract links from. Any link tag that is a descendant of a node satisfying the XPath condition will be pulled into the crawl.

As long as you subclass CrawlSpider, it's unnecessary to define a parse callback for every request. Only the final rule, which covers links to product pages, specifies a callback in this case.
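For reference, a parse_item callback on the spider can be as simple as a generator that pulls out a few fields and yields a dictionary. The selectors below are illustrative placeholders, not the exact ones the project uses:

    def parse_item(self, response):
        # Placeholder selectors; real product pages need category-specific XPath.
        yield {
            "name": response.xpath("//span[@id='productTitle']/text()").get(default="").strip(),
            "price": response.xpath("//span[contains(@class, 'a-price')]//span/text()").get(),
            "url": response.url,
        }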

Save data using an item pipeline

The crawler saves product details to a PostgreSQL database. To do that, I use an item pipeline, which allows for a nice separation of concerns: the logic for saving data stays apart from the crawling logic.

The pipeline.py file contains all of the database-related code for the project.

import os

import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import sessionmaker

# Base and InventoryItem (the declarative base and product model) are defined
# elsewhere in the project's database code.


class DatabasePipeline:
    def __init__(self):
        engine = sa.create_engine(
            "postgresql://{}:{}@{}:5432/{}".format(
                os.environ["POSTGRES_USER"],
                os.environ["POSTGRES_PASSWORD"],
                os.environ["POSTGRES_HOST"],
                os.environ["POSTGRES_DB"],
            )
        )

        Base.metadata.create_all(engine)
        Session = sessionmaker(bind=engine)

        self.session = Session()

    def process_item(self, item, spider):
        try:
            # Insert the scraped item, silently skipping products already stored.
            self.session.execute(
                insert(InventoryItem)
                .values([item])
                .on_conflict_do_nothing()
            )
            self.session.commit()
        except Exception:
            self.session.rollback()

        return item

The process_item method is invoked whenever the spider's parse callback yields something other than a request. By convention, Scrapy expects parse callbacks to yield either an item or another request to be issued. When a product page is scraped, the spider yields a dictionary, which in turn is inserted into the database.

Initially, I thought calling commit for each item would be too slow, but it was insignificant compared to the network requests.
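As with the middleware, the pipeline only runs once it's listed in ITEM_PIPELINES; a sketch, with the module path assumed as before:

# settings.py -- module path and priority are illustrative assumptions
ITEM_PIPELINES = {
    "amazon_scraper.pipelines.DatabasePipeline": 300,
}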

Use AUTOTHROTTLE early on while debugging

Scrapy is fast, and early on when I wasn't sure if my crawler was following the right links, this meant burning proxy bandwidth on the wrong pages. By the time you realize something is wrong, the crawler is already pretty far off course.

Luckily, the AUTOTHROTTLE_ENABLED setting is an easy way to slow things down until your crawler is rock solid. Autothrottling adapts the request rate to the response time of the domain you're crawling, which results in a significant slowdown, especially if you're behind a proxy.

There is also the DOWNLOAD_DELAY setting, which makes Scrapy wait a specified number of seconds between consecutive requests to the same domain. Either one will work if you just need to slow things down while you're debugging.
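Both are one-line changes in settings.py. The values below are just plausible starting points for a debugging run, not tuned recommendations:

# settings.py -- throttle values here are arbitrary debugging defaults
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5

# Or wait a fixed number of seconds between requests to the same domain.
DOWNLOAD_DELAY = 2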

Use XPath for selecting elements

CSS selectors simply don't cut it when it comes to scraping sites like Amazon. Class names are often repeated on elements you don't want to include, and useful IDs are a rarity. Quite often, the only way to target the correct links is to pattern match against a sub-tree of elements, instead of one element in isolation.

        Rule(
            LinkExtractor(restrict_xpaths="//div[contains(@class, 's-main-slot')]//h2"),
            callback="parse_item",
        )

This rule, for instance, doesn't just match against a class name; it also restricts link extraction to h2 elements nested inside the matching container.

XPath can be a little daunting, but the flexibility is worth spending time to become familiar with it.
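A quick way to build that familiarity is to test expressions against a scrap of markup with Scrapy's Selector before wiring them into a rule. The HTML here is a made-up stand-in for a results page:

from scrapy.selector import Selector

# Toy markup, standing in for a real search results page.
html = """
<div class="s-main-slot s-result-list">
  <div><h2><a href="/dp/B000TEST01">Example product</a></h2></div>
  <div class="ad-banner"><a href="/gp/ads">Sponsored link</a></div>
</div>
"""

selector = Selector(text=html)

# Only the link nested under an h2 inside the results container is matched.
print(selector.xpath("//div[contains(@class, 's-main-slot')]//h2//a/@href").getall())
# ['/dp/B000TEST01']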

Check out the Amazon scraper code

The code is freely available for anyone to use or alter. You can check it out on GitHub for additional usage instructions.
