The goal of scraping is to extract data from websites. Without Scrapy Items, we return unstructured data as Python dictionaries, which makes it easy to introduce typos and return faulty data.
Luckily, Scrapy provides us with the Item class: a class we can inherit from to give our data a fixed structure, so the spider yields a well-defined Python object.
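To make the problem concrete, here is a minimal sketch in plain Python (no Scrapy needed). The title and price are made-up sample values; the point is only that a dictionary accepts a misspelled key without complaint.

```python
# A plain dict accepts any key, so a misspelled one goes unnoticed.
book = {}
book["title"] = "Sample Book"
book["pirce"] = "10.00"  # typo: this was meant to be "price"

print(book.get("price"))  # None, the real "price" key was never set
print(sorted(book))       # ['pirce', 'title']
```

Nothing fails here, so the faulty record would travel all the way to the output file.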
In this post you will learn how to:
- Create Scrapy Items
- Use them to return a structured object
While you can use your own Scrapy project for this tutorial, I recommend following along with the latest version of the code from this tutorial series, where we added Rules and a LinkExtractor to our spider.
Clone the GitHub repo, and you are set to go!
To use our Item, first we need to create it and… it is already done!
Inside the project, we have an items.py file with the skeleton of an item.
Besides the scrapy import and a valuable link, we have nothing there. Let's solve that!
Our BooksItem is the class we are going to use for every scraped element. Think of it as a blueprint that tells us which fields we are going to need.
And what do we need on each item? A title, an image, a price… Exactly what our spider.py file yields.
Copy those names into the class and assign scrapy.Field() to each one:
    class BooksItem(scrapy.Item):
        title = scrapy.Field()
        final_image = scrapy.Field()
        price = scrapy.Field()
        stock = scrapy.Field()
        stars = scrapy.Field()
        description = scrapy.Field()
        upc = scrapy.Field()
        price_excl_tax = scrapy.Field()
        price_inc_tax = scrapy.Field()
        tax = scrapy.Field()
That's it! We are done!
Our BooksItem class is created. The only fields we can set are the ones we explicitly declared inside the class. Let's test that theory.
Load the Scrapy shell (run scrapy shell in your terminal), import the item and create an object with some of the fields. Nothing goes wrong, as every field is optional.
But then, try to create another object with a field that does not exist. You'll get a KeyError.
Our theory is right: we can only set the fields that we declared on our Item.
We've talked enough about the Item; let's use it.
Open your spider.py (finally!) and add the import at the top of the file:
    # -*- coding: utf-8 -*-
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from ..items import BooksItem  # New line
    import scrapy
Then, inside the parsing method, create an instance of the item. For example, I created it after all the data is scraped:
    ....
        '//table[@class="table table-striped"]/tr/td/text()').extract_first()
    price_excl_tax = response.xpath(
        '//table[@class="table table-striped"]/tr/td/text()').extract_first()
    price_inc_tax = response.xpath(
        '//table[@class="table table-striped"]/tr/td/text()').extract_first()
    tax = response.xpath(
        '//table[@class="table table-striped"]/tr/td/text()').extract_first()
    book = BooksItem()  # New line
Right now, the method still ends with a yield that returns a dictionary with all the data.
Replace it: assign each value to the matching field of the book object, and then yield the instance:
    book = BooksItem()
    book['title'] = title
    book['final_image'] = final_image
    book['price'] = price
    book['stock'] = stock
    book['stars'] = stars
    book['description'] = description
    book['tax'] = tax
    yield book
That's enough. Let's run the code. Run scrapy crawl spider -o scrapy_item_version.json and wait until the spider is done.
As always, we have our 1000 books, this time with stronger, more solid code thanks to Items.
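To double-check the count, a small helper like this one can load the exported file and count its entries. It is only a sketch: count_items is a name invented here, and it assumes the feed is a single JSON array, which is what the .json export produces on a fresh file. The demo uses a tiny made-up feed so it runs without the crawl output.

```python
import json
import os
import tempfile


def count_items(path):
    """Return how many items a JSON feed file contains."""
    with open(path, encoding="utf-8") as feed:
        return len(json.load(feed))


# Demo with a tiny made-up feed written to a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.json")
    with open(sample_path, "w", encoding="utf-8") as sample:
        json.dump([{"title": "Sample Book"}, {"title": "Another Book"}], sample)
    sample_count = count_items(sample_path)

print(sample_count)  # 2
```

After the crawl, calling count_items("scrapy_item_version.json") should report the number of scraped books.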
It is easy to make your spiders less buggy, and one of the easiest improvements is using Scrapy Items. The Item class lets us define our own class that declares exactly which fields the scraped data can have. To use Items, we just need to:
- Create an Item by specifying the fields it is going to have
- Import the class created
- Create an instance of that class
- For every field extracted, add it to the Item instance
- Finally, yield the Item instance.
This opens the door to the Item Pipeline, which processes each scraped item. There we can tell Scrapy how to handle the item: for example, cleaning it, validating its fields and more.
And of course, we'll learn that and more in our next lesson*.
*The sixth lesson is being built right now. Thanks for your patience.