How To Scrape Amazon at Scale With Python Scrapy, And Never Get Banned
Ian Kerins

With thousands of companies offering products and price monitoring solutions for Amazon, scraping Amazon is big business.

But anyone who's tried to scrape it at scale knows how quickly you can get blocked.

So in this article, I’m going to show you how I built a Scrapy spider that searches Amazon for a particular keyword, then visits every product it returns and scrapes all the main information:

  • ASIN
  • Product name
  • Image url
  • Price
  • Description
  • Available sizes
  • Available colors
  • Ratings
  • Number of reviews
  • Seller rank

With this spider as a base, you will be able to adapt it to scrape whatever data you need and scale it to scrape thousands or millions of products per month. The code for the project is available on GitHub here.

What Will We Need?

Obviously, you could build your scraper from scratch using basic libraries like requests and BeautifulSoup, but I chose to build it with Scrapy, the open-source web crawling framework written in Python, as it is by far the most powerful and popular web scraping framework amongst large-scale web scrapers.

Compared to other web scraping libraries such as BeautifulSoup, Selenium or Cheerio, which are great libraries for parsing HTML data, Scrapy is a full web scraping framework with a large community that has loads of built-in functionality to make web scraping as simple as possible:

  • XPath and CSS selectors for HTML parsing
  • data pipelines
  • automatic retries
  • proxy management
  • concurrent requests
  • etc.

This makes it really easy to get started, and very simple to scale up.

Proxies

The second must-have, if you want to scrape Amazon at any kind of scale, is a large pool of proxies and the code to automatically rotate IPs and headers, along with dealing with bans and CAPTCHAs. This can be very time consuming if you build the proxy management infrastructure yourself.
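To make that concrete, here is a minimal (and deliberately naive) sketch of what DIY rotation might look like as a Scrapy downloader middleware. The proxy URLs and user-agent strings below are placeholders, and a real setup would also need ban detection, CAPTCHA handling and proxy health checks, which is exactly the work a proxy API takes off your plate:

## middlewares.py (illustrative sketch only - placeholder proxies and user agents)
import random

PROXIES = ['http://proxy1.example.com:8000', 'http://proxy2.example.com:8000']
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

class RotateProxyAndHeadersMiddleware:
    def process_request(self, request, spider):
        ## pick a random proxy and User-Agent for every outgoing request
        request.meta['proxy'] = random.choice(PROXIES)
        request.headers['User-Agent'] = random.choice(USER_AGENTS)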

For this project I opted to use Scraper API, a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response.

Scraper API has a free plan that allows you to make up to 1,000 requests per month, which makes it ideal for the development phase, but it can be easily scaled up to millions of pages per month if need be.

Monitoring

Lastly, we will need some way to monitor our scraper in production to make sure that everything is running smoothly. For that we're going to use ScrapeOps, a free monitoring tool specifically designed for web scraping.

Live demo here: ScrapeOps Demo

ScrapeOps Dashboard


Getting Started With Scrapy

Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:

pip install scrapy

Then navigate to the folder where you want your project to live, run the “startproject” command along with the project name (“amazon_scraper” in this case), and Scrapy will build a web scraping project folder for you, with everything already set up:

scrapy startproject amazon_scraper

Here is what you should see (note: the inner module folder takes whatever name you pass to startproject; the folder structure and pipeline paths in this article use a project named "tutorial", so swap in "amazon_scraper" if that's what you named yours):

├── scrapy.cfg                # deploy configuration file
└── tutorial                  # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── amazon.py         # the spider we'll create shortly

Similar to Django, when you create a project with Scrapy it automatically creates all the files you need, each of which has its own purpose:

  1. items.py is useful for creating your base dictionary (the item) that you import into the spider (see the sketch after this list).
  2. settings.py is where all your request settings live and where pipelines and middlewares are activated. Here you can change the delays, concurrency, and lots more.
  3. pipelines.py is where the items yielded by the spider get passed; it’s mostly used to clean the text and connect to databases (Excel, SQL, etc.).
  4. middlewares.py is useful when you want to modify how requests are made and how Scrapy handles responses.
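To illustrate point 1: this tutorial just yields plain dictionaries, but if you wanted to formalize the fields we're about to scrape, a minimal items.py sketch (the field names here are just my own choice) could look like this:

## items.py (optional sketch - this tutorial yields plain dicts instead)
import scrapy

class AmazonProductItem(scrapy.Item):
    asin = scrapy.Field()
    Title = scrapy.Field()
    Price = scrapy.Field()
    Rating = scrapy.Field()
    NumberOfReviews = scrapy.Field()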

Creating Our Amazon Spider

Okay, we’ve created the general project structure. Now, we’re going to develop our spiders that will do the scraping.

Scrapy provides a number of different spider types; however, in this tutorial we will cover the most common one, the basic Spider.

To create a new spider, simply run the “genspider” command:

# syntax is --> scrapy genspider name_of_spider website.com 
scrapy genspider amazon amazon.com

And Scrapy will create a new file, with a spider template.

In our case, we will get a new file in the spiders folder called “amazon.py”.

import scrapy

class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    allowed_domains = ['amazon.com']
    start_urls = ['http://www.amazon.com/']

    def parse(self, response):
        pass

We're going to remove the default code from this (allowed_domains, start_urls, parse function) and start writing our own code.

We’re going to create four functions:

  1. start_requests - will send a search query to Amazon with a particular keyword.
  2. parse_keyword_response - will extract the ASIN value for each product returned in the Amazon keyword query, then send a new request to Amazon to return the product page of that product. It will also move to the next page and repeat the process.
  3. parse_product_page - will extract all the target information from the product page.
  4. get_url - will send the request to Scraper API so it can retrieve the HTML response.

With a plan made, now let’s get to work…
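Before diving in, here is a rough skeleton of where we're headed, with each of the four functions stubbed out (we'll fill them in one by one below):

## amazon.py (skeleton - each stub gets filled in below)
import scrapy
from urllib.parse import urlencode

def get_url(url):
    ## will be replaced later so requests route through Scraper API
    return url

class AmazonSpider(scrapy.Spider):
    name = 'amazon'

    def start_requests(self):
        ## send a search query to Amazon for each keyword
        pass

    def parse_keyword_response(self, response):
        ## extract ASINs, request each product page, follow pagination
        pass

    def parse_product_page(self, response):
        ## extract the target data from the product page
        pass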

Send Search Queries To Amazon

The first step is building start_requests, our function that sends search queries to Amazon with our keywords, which is pretty simple…

First let’s quickly define a list variable with our search keywords outside the AmazonSpider.

queries = ['tshirt for men', 'tshirt for women']

Then let's create our start_requests function within the AmazonSpider that will send the requests to Amazon.

To access Amazon’s search functionality via a URL, we need to send a search query parameter “k=SEARCH_KEYWORD”:

https://www.amazon.com/s?k=<SEARCH_KEYWORD>

When implemented in our start_requests function, it looks like this.

## amazon.py
import scrapy
from urllib.parse import urlencode

queries = ['tshirt for men', 'tshirt for women']

class AmazonSpider(scrapy.Spider):

    def start_requests(self):
        for query in queries:
            url = 'https://www.amazon.com/s?' + urlencode({'k': query})
            yield scrapy.Request(url=url, callback=self.parse_keyword_response)

For every query in our queries list, we will urlencode it so that it is safe to use as a query string in a URL, and then use scrapy.Request to request that URL.

Since Scrapy is asynchronous, we will use yield instead of return, which means our functions should either yield a request or a completed dictionary. If a new request is yielded, it will go to the callback method; if an item is yielded, it will go to the pipeline for data cleaning.

In our case, the scrapy.Request will activate our parse_keyword_response callback function, which will then extract the ASIN for each product.


Scraping Amazon’s Product Listing Page

The cleanest and most popular way to retrieve Amazon product pages is to use their ASIN ID.

ASINs are unique IDs that every product on Amazon has. We can use this ID as part of our URL to retrieve the product page of any Amazon product, like this...

https://www.amazon.com/dp/<ASIN>

We can extract the ASIN value from the product listing page by using Scrapy’s built-in XPath selector extractor methods.

XPath is a big subject and there are plenty of techniques associated with it, so I won’t go into detail on how it works or how to create your own XPath selectors. If you would like to learn more about XPath and how to use it with Scrapy then you should check out the documentation here.

Using Scrapy Shell, I was able to develop an XPath selector that grabs the ASIN value for every product on the product listing page and creates a URL for each product:

products = response.xpath('//*[@data-asin]')

for product in products:
    asin = product.xpath('@data-asin').extract_first()
    product_url = f"https://www.amazon.com/dp/{asin}"
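If you want to follow along, you can test selectors like this interactively in Scrapy Shell before putting them in the spider (keep in mind that hitting Amazon directly without a proxy may return a blocked or CAPTCHA page, so the output below is only indicative):

scrapy shell "https://www.amazon.com/s?k=tshirt+for+men"
>>> response.xpath('//*[@data-asin]/@data-asin').extract_first()
'B0...'   ## example value - the first ASIN found on the page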

Next, we will configure the function to send a request to this URL and then call the parse_product_page callback function when we get a response. We will also add the meta parameter to this request which is used to pass items between functions (or edit certain settings).

def parse_keyword_response(self, response):
    products = response.xpath('//*[@data-asin]')

    for product in products:
        asin = product.xpath('@data-asin').extract_first()
        product_url = f"https://www.amazon.com/dp/{asin}"
        yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})

Extracting Product Data From Product Page

Now, we’re finally getting to the good stuff!

So after the parse_keyword_response function requests the product page’s URL, it passes the response it receives from Amazon to the parse_product_page callback function, along with the ASIN ID in the meta parameter.

Now, we want to extract the data we need from a product page like this.

Amazon Product Page

To do so we will have to write XPath selectors to extract each field we want from the HTML response we receive back from Amazon.

def parse_product_page(self, response):
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
    image = re.search('"large":"(.*?)"',response.text).groups()[0]
    rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
    number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
    bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
    seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()

For scraping the image url, I’ve gone with a regex selector over an XPath selector, as the XPath was extracting the image in base64. (Note that this means we need import re at the top of amazon.py.)

With very big websites like Amazon, which have various types of product pages, you will notice that a single XPath selector sometimes isn’t enough, as it might work on some pages but not on others.

In cases like these, you will need to write numerous XPath selectors to cope with the various page layouts. I ran into this issue when trying to extract the product price, so I needed to give the spider 3 different XPath options:

def parse_product_page(self, response):
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
    image = re.search('"large":"(.*?)"',response.text).groups()[0]
    rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
    number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
    bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
    seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()

    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()

    if not price:
        price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()

If the spider can't find a price with the first XPath selector then it moves onto the next one, etc.

If we look at the product page again, we will see that it contains variations of the product in different sizes and colors. To extract this data we will write a quick test to see if this section is present on the page, and if it is we will extract it with a regex selector and parse the embedded JSON (so import json also needs to be at the top of amazon.py).

temp = response.xpath('//*[@id="twister"]')
sizes = []
colors = []
if temp:
    s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
    json_acceptable = s.replace("'", "\"")
    di = json.loads(json_acceptable)
    sizes = di.get('size_name', [])
    colors = di.get('color_name', [])

Putting it all together, the parse_product_page function will look like this, and will yield an item which will be sent to the pipelines.py file for data cleaning (we will discuss this later).

def parse_product_page(self, response):
    asin = response.meta['asin']
    title = response.xpath('//*[@id="productTitle"]/text()').extract_first()
    image = re.search('"large":"(.*?)"',response.text).groups()[0]
    rating = response.xpath('//*[@id="acrPopover"]/@title').extract_first()
    number_of_reviews = response.xpath('//*[@id="acrCustomerReviewText"]/text()').extract_first()
    price = response.xpath('//*[@id="priceblock_ourprice"]/text()').extract_first()

    if not price:
        price = response.xpath('//*[@data-asin-price]/@data-asin-price').extract_first() or \
                response.xpath('//*[@id="price_inside_buybox"]/text()').extract_first()

    temp = response.xpath('//*[@id="twister"]')
    sizes = []
    colors = []
    if temp:
        s = re.search('"variationValues" : ({.*})', response.text).groups()[0]
        json_acceptable = s.replace("'", "\"")
        di = json.loads(json_acceptable)
        sizes = di.get('size_name', [])
        colors = di.get('color_name', [])

    bullet_points = response.xpath('//*[@id="feature-bullets"]//li/span/text()').extract()
    seller_rank = response.xpath('//*[text()="Amazon Best Sellers Rank:"]/parent::*//text()[not(parent::style)]').extract()
    yield {'asin': asin, 'Title': title, 'MainImage': image, 'Rating': rating, 'NumberOfReviews': number_of_reviews,
           'Price': price, 'AvailableSizes': sizes, 'AvailableColors': colors, 'BulletPoints': bullet_points,
           'SellerRank': seller_rank}

Iterating Through Product Listing Pages

We’re looking good now…

Our spider will search Amazon based on the keyword we give it and scrape the details of the products it returns on page 1. However, what if we want our spider to navigate through every page and scrape the products of each one?

To implement this, all we need to do is add a small bit of extra code to our parse_keyword_response function:

def parse_keyword_response(self, response):
    products = response.xpath('//*[@data-asin]')

    for product in products:
        asin = product.xpath('@data-asin').extract_first()
        product_url = f"https://www.amazon.com/dp/{asin}"
        yield scrapy.Request(url=product_url, callback=self.parse_product_page, meta={'asin': asin})

    next_page = response.xpath('//li[@class="a-last"]/a/@href').extract_first()
    if next_page:
        url = urljoin("https://www.amazon.com", next_page)
        yield scrapy.Request(url=url, callback=self.parse_keyword_response)

After the spider has scraped all the product pages on the first page, it will check whether there is a next page button. If there is, it will retrieve the relative URL and join it with the base URL using urljoin (imported from urllib.parse) to create the URL for the next page. Example:

https://www.amazon.com/s?k=tshirt+for+men&page=2&qid=1594912185&ref=sr_pg_1


From there, the callback will run the parse_keyword_response function again, extracting the ASIN IDs for each product and all the product data like before.

Testing The Spider

Now that we’ve developed our spider it is time to test it. Here we can use Scrapy’s built-in CSV exporter:

scrapy crawl amazon -o test.csv

All going well, you should now have items in test.csv, but you will notice there are 2 issues:

  1. the text is messy and some values are lists
  2. we are getting 429 responses from Amazon, which means Amazon has detected that our requests are coming from a bot and is blocking our spider

Issue number two is the far bigger one: if we keep going like this, Amazon will quickly ban our IP address and we won’t be able to scrape Amazon at all.

In order to solve this, we will need to use a large proxy pool and rotate our proxies and headers with every request. For this we will use Scraper API.
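As a side note, depending on your Scrapy version, 429 may not be in the default list of status codes that the retry middleware retries. You can add it explicitly in settings.py, but treat this as a band-aid; without rotating proxies the retries will mostly just get blocked again:

## settings.py (optional - retry 429 responses instead of dropping them)
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]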


Connecting Your Proxies With Scraper API

As discussed at the start of this article, Scraper API is a proxy API designed to take the hassle out of using web scraping proxies.

Instead of finding your own proxies and building your own infrastructure to rotate proxies and headers with every request, detect bans, and bypass anti-bots, you just send the URL you want to scrape to Scraper API and it will take care of everything for you.

To use Scraper API you need to sign up to a free account here and get an API key which will allow you to make 1,000 free requests per month and use all the extra features like Javascript rendering, geotargeting, residential proxies, etc.

Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.

For this project I integrated the API by configuring my spiders to send all our requests to their API endpoint.

To do so, I just needed to create a simple function that sends a GET request to Scraper API with the URL we want to scrape.

API = '<YOUR_API_KEY>'

def get_url(url):
    payload = {'api_key': API, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url


And then modify our spider functions so as to use the Scraper API proxy by setting the url parameter in scrapy.Request to get_url(url).

def start_requests(self):
    ...
    yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)

def parse_keyword_response(self, response):
    ...
    yield scrapy.Request(url=get_url(product_url), callback=self.parse_product_page, meta={'asin': asin})
    ...
    yield scrapy.Request(url=get_url(url), callback=self.parse_keyword_response)

A really cool feature with Scraper API is that you can enable Javascript rendering, geotargeting, residential IPs, etc. by simply adding a flag to your API request.

As Amazon changes the pricing data and supplier data shown based on the country you are making the request from, we're going to use Scraper API's geotargeting feature so that Amazon thinks our requests are coming from the US. To do this we need to add the flag "&country_code=us" to the request, which we can do by adding another parameter to the payload variable.

def get_url(url):
    payload = {'api_key': API, 'url': url, 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

You can check out Scraper API's other functionality here in their documentation.

Next, we have to go into the settings.py file and change the number of concurrent requests we’re allowed to make based on the concurrency limit of our Scraper API plan, which for the free plan is 5 concurrent requests.

## settings.py

CONCURRENT_REQUESTS = 5

Concurrency is the number of requests you are allowed to make in parallel at any one time. The more concurrent requests you can make the faster you can scrape.

Also, we should set RETRY_TIMES to tell Scrapy to retry any failed requests (to 5 for example) and make sure that DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren’t enabled as these will lower your concurrency and are not needed with Scraper API.

## settings.py

CONCURRENT_REQUESTS = 5
RETRY_TIMES = 5

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY

Setting Up Monitoring

To monitor our scraper we're going to use ScrapeOps, a free monitoring and alerting tool dedicated to web scraping.

With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.

Live demo here: ScrapeOps Demo

Getting setup with ScrapeOps is simple. Just install the Python package:

pip install scrapeops-scrapy

And add 3 lines to your settings.py file:

## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

From there, our scraping stats will be automatically logged and shipped to our dashboard.

ScrapeOps Dashboard


Cleaning Data With Pipelines

The final step is to do a bit of data cleaning in the pipelines.py file, as the text is messy and some values are lists.

class TutorialPipeline:

    def process_item(self, item, spider):
        for k, v in item.items():
            if not v:
                item[k] = ''  # replace empty list or None with empty string
                continue
            if k == 'Title':
                item[k] = v.strip()
            elif k == 'Rating':
                item[k] = v.replace(' out of 5 stars', '')
            elif k == 'AvailableSizes' or k == 'AvailableColors':
                item[k] = ", ".join(v)
            elif k == 'BulletPoints':
                item[k] = ", ".join([i.strip() for i in v if i.strip()])
            elif k == 'SellerRank':
                item[k] = " ".join([i.strip() for i in v if i.strip()])
        return item

After the spider yields an item, it is passed to the pipeline to be cleaned.

To enable the pipeline we need to add it to the settings.py file.

## settings.py

ITEM_PIPELINES = {'tutorial.pipelines.TutorialPipeline': 300}

Now we are good to go. You can test the spider again by running the spider with the crawl command.

scrapy crawl amazon -o test.csv

This time you should see that the spider was able to scrape all the available products for your keyword without getting banned.

If you would like to run the spider for yourself or modify it for your particular Amazon project then feel free to do so. The code is on GitHub here. Just remember that you need to get your own Scraper API key by signing up here.

Top comments (7)

Pacharapol Withayasakpunt

As discussed at the start of this article, Scraper API is a proxy API designed to take the hassle out of using web scraping proxies.

This is probably the most important thing when scraping a website that doesn't have a robots.txt, or when you want to go beyond it (hence the proxy rotation and User-Agent spoofing).

Web scraping makes sense when the web admin does not provide a public API, but as an admin myself, I can see that security and server load control come first, sometimes even ahead of access by end users (hence the occasionally poor human user experience).

I can see that there is JavaScript rendering as well, which is nice for web automation, like handling JavaScript forms.

429 responses

Without a proxy, this is as simple as knowing how to rate limit, though. That is very important when you access a public API as well (i.e. the web admin fully allows you access but doesn't want their server overloaded, which is not web scraping).

I very recently had to send ~500 PUT requests (not GET) to an API server, and I still had to wait 10 minutes for them to finish...

smaug

Hi everyone,
I'm interested in this project, but I can't send the product data to a MySQL DB.
I tried different code but it didn't work.

Can you share the MySQL pipeline code for this project?

Ian Kerins

You need to create an Item Pipeline like this in your pipelines.py file.

# pipelines.py 

import mysql.connector

class SaveMySQLPipeline:

    def __init__(self):
        self.conn = mysql.connector.connect(
            host = 'localhost',
            user = 'root',
            password = '*******',
            database = 'dbname'
        )

        ## Create cursor, used to execute commands
        self.cur = self.conn.cursor()

        ## Create products table if none exists
        self.cur.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id int NOT NULL auto_increment, 
            asin text,
            title text,
            image VARCHAR(255),
            PRIMARY KEY (id)
        )
        """)



    def process_item(self, item, spider):

        ## Define insert statement
        self.cur.execute(""" insert into products (asin, title, image) values (%s,%s,%s)""", (
            item["asin"],
            item["Title"],
            item["MainImage"]
        ))

        ## Execute insert of data into database
        self.conn.commit()


    def close_spider(self, spider):

        ## Close cursor & connection to database 
        self.cur.close()
        self.conn.close()

And then enable it in your settings.py file.

# settings.py

ITEM_PIPELINES = {
   'tutorial.pipelines.TutorialPipeline': 300
   'tutorial.pipelines.SaveMySQLPipeline': 350,
}

smaug

Firstly, thank you for the answer.

'tutorial.pipelines.TutorialPipeline': 300 <<< I guess a comma is needed here.

Secondly, the price information cannot be scraped with the above code. What could be the reason for this?

For example, can we store in MySQL the 3 different prices scraped for the following product?

amazon.com/dp/B07KSJLQCD

List Price: $25.00
Price: $17.45
Lightning deal "if any"

MS -> My Telegram t.me/smesut

Ian Kerins

To scrape those extra pricing details you will need to find the selectors for them and add those fields to the item.

When I open that page, I don't see those fields, as Amazon is probably only showing them based on the geography you are in.

So if you create new selectors for those prices you want and add them to the item, then you can update the mysql storage pipeline to store that data as well.

alex24409331

Also, as a side solution, I am using the on-demand web scraping service e-scraper.com. It extracts data from Amazon in an eCommerce-friendly format.

kegaua

When I start a project with Scrapy, the "tutorial" folder is missing. Why is that?