(I'm not condoning anything illegal, this is for educational purposes only)
Introduction
Scrapy is one of the best web scraping frameworks in Python: it's easy to use, fast, and packed with features.
But what if we wanted to scrape multiple pages recursively, such as product pages?
Well, the easiest way is to add a simple callback to a Request.
Here's a code snippet from a Scrapy project crawling a website with product listings, such as Amazon, eBay, or Etsy (the selectors below target eBay's markup):
from scrapy import Request

def parse(self, response):
    # Grab the link to every product listed on the page
    links = response.css('a.s-item__link::attr(href)').getall()
    for link in links:
        yield Request(url=link, callback=self.parse_item)

    # Follow the pagination link, if there is one
    next_page = response.css('a.pagination__next.icon-link::attr(href)').get()
    if next_page:
        print('Next page: %s' % next_page)
        yield Request(url=next_page, callback=self.parse)

def parse_item(self, response):
    # Extract the product's title and price from its page
    title = response.xpath('//h1[@class="x-item-title__mainTitle"]/span/text()').get()
    price = response.xpath('//span[@id="prcIsum"]/text()').get()
    yield {'title': title, 'price': price}
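For context, here's a minimal sketch of how those two methods might sit inside a complete spider. The spider name and start URL are placeholders I've added for illustration, not part of the original snippet:

import scrapy
from scrapy import Request

class ProductsSpider(scrapy.Spider):
    # Hypothetical spider name and start URL -- point start_urls
    # at the first page of the product listing you want to crawl
    name = 'products'
    start_urls = ['https://www.example.com/products?page=1']

    def parse(self, response):
        for link in response.css('a.s-item__link::attr(href)').getall():
            yield Request(url=link, callback=self.parse_item)
        next_page = response.css('a.pagination__next.icon-link::attr(href)').get()
        if next_page:
            yield Request(url=next_page, callback=self.parse)

    def parse_item(self, response):
        yield {
            'title': response.xpath('//h1[@class="x-item-title__mainTitle"]/span/text()').get(),
            'price': response.xpath('//span[@id="prcIsum"]/text()').get(),
        }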
How it works
First, it gets the link of each item listed on the products page using this line of code:
links = response.css('a.s-item__link::attr(href)').getall()
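If you want to see what that line returns on its own, here's a tiny standalone example. The markup is made up for illustration, but the selector is the one from the snippet:

from scrapy import Selector

# Fake listing markup, just to show what getall() returns
html = '''
<a class="s-item__link" href="https://example.com/item/1">Item 1</a>
<a class="s-item__link" href="https://example.com/item/2">Item 2</a>
'''

links = Selector(text=html).css('a.s-item__link::attr(href)').getall()
print(links)  # ['https://example.com/item/1', 'https://example.com/item/2']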
It then loops through each of those links, sends a request with the yield statement, and registers a callback to our parse_item function via callback=self.parse_item:

for link in links:
    yield Request(url=link, callback=self.parse_item)
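One caveat: Request needs an absolute URL, and on some sites the hrefs are relative. Scrapy's response.follow helper resolves relative links against the current page automatically, so an equivalent, slightly more robust version of this loop would be:

for link in links:
    # response.follow resolves relative hrefs against the current page
    yield response.follow(link, callback=self.parse_item)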
Inside the parse_item function, it gets the title and price of the item and returns them with the yield statement:
def parse_item(self, response):
    title = response.xpath('//h1[@class="x-item-title__mainTitle"]/span/text()').get()
    price = response.xpath('//span[@id="prcIsum"]/text()').get()
    yield {'title': title, 'price': price}
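Note that .get() returns None when the selector matches nothing, which happens whenever a product page uses different markup. Scrapy's .get() also accepts a default value, which is a handy safeguard:

# .get() accepts a fallback for pages where the selector finds nothing
title = response.xpath('//h1[@class="x-item-title__mainTitle"]/span/text()').get(default='')
price = response.xpath('//span[@id="prcIsum"]/text()').get(default='')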
Since the parse function is still running, the code then gets the link to the next page, requests it with a yield statement, and registers the parse function itself as the callback with self.parse, starting the whole process over again on the next page:
next_page = response.css('a.pagination__next.icon-link::attr(href)').get()
if next_page:
    print('Next page: %s' % next_page)
    yield Request(url=next_page, callback=self.parse)
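As with the item links, the next-page href may be relative. Scrapy's response.urljoin resolves it against the current page URL before the new request goes out:

if next_page:
    # Resolve a possibly-relative href into an absolute URL
    yield Request(url=response.urljoin(next_page), callback=self.parse)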
Conclusion
There you go, it's that simple!
Scraping product pages recursively with Scrapy is as easy as adding a callback to a Request.
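And if you want to try the spider without setting up a whole project, a plain Python script using Scrapy's CrawlerProcess will run it and export the scraped items to JSON (ProductsSpider is the hypothetical class from the earlier sketch):

from scrapy.crawler import CrawlerProcess

# FEEDS writes every yielded item to products.json as the crawl runs
process = CrawlerProcess(settings={
    'FEEDS': {'products.json': {'format': 'json'}},
})
process.crawl(ProductsSpider)
process.start()  # blocks until the crawl finishes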