Ajit Kumar

Automate Spider Creation in Scrapy with Jinja2 and JSON

Today I am going to share the web scraping automation I have been working on recently, for a project that plans to scrape data from thousands of websites. Yes, this is a large-scale scraping project.

Problem:
You have a new data scraping project that involves scraping data from many similar websites, for example, data about books from various book-related websites, or chocolate information from many sources.

You might think: what is the problem here? It can be done using Scrapy or any other similar tool or framework!
It is the same as any normal data scraping project, but wait! What if there are 100, or 1000 sources?
Are you still going to run "scrapy genspider" 1000 times and then edit and modify each spider one after another?

Solution:

Create a baseline Scrapy project with 1-2 spiders and all the other components like items, item loaders, pipelines, middlewares, and suitable settings, and then write a spider-creation automation script that uses a template file (based on your 1-2 spiders) and a JSON file holding the source details and the selectors for each item attribute (per source). Once you run the script, spiders for all the included sources will be created.

Steps:

The steps are very simple, as sketched abstractly in the solution above. Here is a brief list (including virtual environment creation):

  1. Create a virtual environment for Python (conda or venv)
  2. Install Scrapy
  3. Create a Scrapy project
  4. Create a first simple spider for one source
  5. Test it and create the other components like items, item loaders, pipelines, etc.

Once your scraping project is ready with 1-2 spiders and the other components, move on to automating spider creation (not the scraping process itself):

  1. Create a JSON file in which each attribute of the item (for example: title, author, and price for books data) is a key and its selector (e.g. div.booktitle) is the value. Also include source information like the site name and URLs, which will be used to create the spider classes.
  2. Set the proper paths for storing the spiders, the template, and the JSON file, and run your script; if all goes well, you will have spiders for all the added sources.
  3. Test 1-2 spiders created by the script by running them; if anything is wrong, fix it, and if the issue is template-wise, reflect the fix in the template and rerun the script.
  4. You have automated spider creation, and you can keep adding spiders for new sources in batches.

Some of the steps may look confusing or intimidating, but don't worry: that was just a summary, and if you want, you can stop here and test your skills. Otherwise, I am now going to explain each step in detail.

Let's break the article into three main parts:
1) Environment setup and basic spider
2) Advanced Scrapy components
3) Spiders automation

If you are familiar with the first two parts, you can jump directly to the third part, Spiders Automation (you can get the code for the first two parts from the GitHub repo).

1) Environment setup and basic spider

1. Create a virtual environment

The Python ecosystem has many options for creating virtual environments; conda and venv are two popular approaches.

You can create a virtual environment using either conda or venv with Python 3.10 or later. In the examples below, my environment name is scrapy-demo.

Using venv:

$python -m venv myenv

$python -m venv scrapy-demo

Using conda:

$conda create --name myenv python=3.10

$ conda create --name scrapy-demo python=3.10

Once you have created the environment, you can activate it accordingly:

#if you are using conda

$conda activate scrapy-demo

#if you are using venv

$source scrapy-demo/bin/activate


2. Install scrapy
Once you have created and activated the virtual environment, it is time to install Scrapy and start the project. Before that, let's create a folder for our new Scrapy project and move inside it for all future operations.

$mkdir scrapy-spiders-automation-demo 
$cd scrapy-spiders-automation-demo 

Install Scrapy (official website) either using pip or conda (follow the official instructions for details):

#pip 
(scrapy-demo)~/projects/scrapy-spiders-automation-demo$pip install scrapy

#conda
(scrapy-demo)~/projects/scrapy-spiders-automation-demo$conda install conda-forge::scrapy


3. Create a scrapy project

Creating a Scrapy project is not mandatory for creating and running a spider; however, it is better to have a project for larger work. So, let's create a project using scrapy startproject (from inside the project folder scrapy-spiders-automation-demo):

(scrapy-demo)$scrapy startproject spiders_automation_demo


This will create a project:

New Scrapy project 'spiders_automation_demo',

using template directory '.../anaconda3/envs/scrapy-demo/lib/python3.10/site-packages/scrapy/templates/project',

created in:
..../scrapy-spiders-automation-demo/spiders_automation_demo

You can start your first spider with:

$cd spiders_automation_demo
$scrapy genspider example example.com

Here, the automated message about the template directory is very important: Scrapy uses a template to create the boilerplate code files for a project. The same mechanism is used when creating spiders with commands like scrapy genspider example example.com. In part 3, we will use the same concept to automate spider creation with our own values and code.
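
As a side note, you can ask Scrapy to list its built-in spider templates and pick one explicitly; these are standard scrapy genspider flags:

#list the available spider templates (basic, crawl, csvfeed, xmlfeed)
$scrapy genspider -l

#generate a spider from a specific template, e.g. the 'crawl' template
$scrapy genspider -t crawl example example.com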

spiders_automation_demo/
├── scrapy.cfg
└── spiders_automation_demo
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

3 directories, 7 files

4. Create a spider

Now it is time to create the first spider. Before that, let's state the scraping task again: we want to scrape data about quotes from a website. For this demo, we are going to use the popular source https://quotes.toscrape.com/ and, to demonstrate the use of a template and the automation of spiders, I have created a simple website similar to it. We can use the following link as the other source: https://quotes-scrape.netlify.app/

We are going to use genspider from the Scrapy command-line tool to create the first spider.

(scrapy-demo)$scrapy genspider Quotes https://quotes.toscrape.com/


When the above command is executed within the project folder and the environment, it creates a spider with the given name (Quotes) and initializes the allowed_domains and start_urls variables within the spider. You will see the following information after the command executes:

Created spider 'Quotes' using template 'basic' in module:
spiders_automation_demo.spiders.Quotes

And a file named Quotes.py will be created under the spiders/ folder, containing the following boilerplate code:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "Quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        pass

You can observe that it has very limited code and leaves a lot to be written by the developer. You may also have noticed that this code was created using Scrapy's basic template ("Created spider 'Quotes' using template 'basic' in module:"). Later we will use the same approach, but with our own template populated with more code specific to our scraping task, so that minimal modifications are needed to make a spider work for a specific source.

Now, let's go ahead and complete the first spider to scrape quotes. We are going to get only two values for each quote: title and author. The spider code will look like below after adding the required code.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "Quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        #each quote is within <div class="quote" ...>
        quotes = response.css("div.quote")
        for quote in quotes:
            #each quote text is within <span class="text" ...>
            title = quote.css("span.text::text").get()
            #each author info is within <small class="author" ...>
            author = quote.css("small.author::text").get()
            yield {
                'title': title,
                'author': author
            }


You can download the code from the repo Link or copy it from the gist Link.

The above code is based on an understanding of the HTML, obtained via the 'inspect' option in Chrome. You can see the source of the page in the screenshot below:

The source code of a quote: blue box (individual quote), red box (title/text of the quote), and green box (author information)

You can observe that each quote is inside a <div class='quote'></div> (blue rectangle), so we can get the HTML of all quotes using the "div.quote" selector; the line quotes = response.css("div.quote") provides all quotes, and then, within the loop, we extract the text and author of each quote using the "span.text::text" and "small.author::text" selectors respectively. You may have noticed the "::text" on title and author but not on quotes: there we just want the text, i.e., the value inside each tag.
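
Before wiring selectors into the spider, it is worth verifying them interactively with scrapy shell (a standard Scrapy tool); the expected values below correspond to the first quote on the page:

(scrapy-demo)$scrapy shell https://quotes.toscrape.com/
>>> response.css("div.quote span.text::text").get()
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> response.css("div.quote small.author::text").get()
'Albert Einstein'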

Once your spider is ready, you can run it to scrape the data using the scrapy crawl command, passing the spider name as the argument (Quotes in the following example). You must be inside the Scrapy project folder so Scrapy knows the path of the spiders:

(scrapy-demo)$scrapy crawl Quotes


You should see output like the following on your terminal:

2024-07-27 18:56:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'title': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein'}
2024-07-27 18:56:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'title': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling'}
2024-07-27 18:56:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'title': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}
2024-07-27 18:56:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://quotes.toscrape.com/>
{'title': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably



Congratulations! Your first spider is ready and scraping data.

2) Advanced scrapy components

5. Test and create other components
Now we have a basic spider for scraping quotes from the given web source. Let's make the scraping system more robust by creating the other components of a scraper, like items, item loaders, pipelines, etc.

5.1. items

The Item in Scrapy is a kind of structure for the data we are crawling. It is very helpful for enforcing cleaning and schema-like constraints across the whole scraping process, and it helps reduce errors in data scraping.

The process of creating an item is very easy. Your project folder already has an items.py file (spiders_automation_demo/items.py), so you just need to add a class (QuoteItem()) for your item, which you do by inheriting from scrapy.Item. The items.py file for the Quotes.py spider will look like below (for more detail about Item, see the documentation):

#spiders_automation_demo/items.py
import scrapy

class QuoteItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()

You can name your item class anything; in this example, we have used QuoteItem().

Now your item is ready. To use it in your spider (Quotes.py), you need to make a few changes to the previous code, as follows:

import scrapy
from spiders_automation_demo.items import QuoteItem


class QuotesSpider(scrapy.Spider):
    name = "Quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        #each quote is within <div class="quote" ...>
        quotes = response.css("div.quote")
        for quote in quotes:
            #create your item object
            quote_item = QuoteItem()
            #each quote text is within <span class="text" ...>
            title = quote.css("span.text::text").get()
            #each author info is within <small class="author" ...>
            author = quote.css("small.author::text").get()

            #add your selectors' output to your item
            quote_item['title'] = title
            quote_item['author'] = author
            #yield your item
            yield quote_item


You can observe that we have made the following changes to the previous spider code:

  • import your item into the spider (note: the import path may change as per your project name)
    from spiders_automation_demo.items import QuoteItem

  • Create an object for your item
    quote_item = QuoteItem()

  • Add your selector output to the item (you can do it directly instead of using the title and author variables, like this):

title = quote.css("span.text::text").get()
quote_item['title'] = title
#the above two lines can be replaced by a single line
quote_item['title'] = quote.css("span.text::text").get()
  • The last change is to yield the item object instead of individual attributes: yield quote_item

That's all; your spider now uses the item and is stricter about the data being scraped and named in the process. To test this, change 'title' to 'text' in items.py and run the spider. You will see the following error:

KeyError: 'QuoteItem does not support field: title'

The error is self-explanatory: in our spider we are trying to set a value for title, but there is no title field in our QuoteItem. Such checks help reduce errors in a large data scraping project. Later you will see that the same item definition can be used for data cleaning, enforcing a data schema, monitoring, etc.

You can copy the code for items.py and the modified spider from the gists: items.py and spider.

5.2. itemloader

The item loader provides a kind of pre-processing of scraped data before storing it. For example, suppose you want all quotes in title case or uppercase, or want special characters removed. You could do this after scraping and storing the data; however, such processing is easier before storing, and ItemLoader provides an efficient way of doing it neatly.

Now, the question is how to use ItemLoader. There are three places/arrangements where it can live:

  1. Along with the item code (you can even merge the ItemLoader logic and the item together) in the items.py file.
  2. Along with the spider code in Quotes.py.
  3. In a separate file within your project, i.e., itemsloader.py.

For a smaller project, either the first or the second option is fine, but for a larger project it is better to have a separate file for the ItemLoader, and that is the method I am going to use.

So, create a Python file within your project at the same level as the items.py file, i.e., spiders_automation_demo/itemsloader.py, and add the following code:

#spiders_automation_demo/itemsloader.py
#required imports
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, MapCompose

#define the methods to apply to item values
def to_title_case(title):
    return title.title()

#define your loader class
class QuoteLoader(ItemLoader):
    default_output_processor = TakeFirst()
    title_in = MapCompose(str.strip, to_title_case)
    author_in = MapCompose(str.strip, to_title_case)
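
Note: on recent Scrapy releases the processors have moved to the separate itemloaders package, so if the import above fails, this variant should work (same classes, different module):

#alternative import on newer Scrapy versions
from itemloaders.processors import TakeFirst, MapCompose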

So, your ItemLoader is ready; now you need to modify your spider Quotes.py to use it.

import scrapy
from spiders_automation_demo.items import QuoteItem
from spiders_automation_demo.itemsloader import QuoteLoader


class QuotesSpider(scrapy.Spider):
    name = "Quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        #each quote is within <div class="quote" ...>
        quotes = response.css("div.quote")
        for quote in quotes:
            #create your item object
            #quote_item = QuoteItem()
            #create your loader
            loader = QuoteLoader(item=QuoteItem(), selector=quote)
            #add your selector to your item
            #each quote text is within <span class="text" ...>
            loader.add_css('title', "span.text::text")
            #each author info is within <small class="author" ...>
            loader.add_css('author', "small.author::text")

            #yield your item via the loader using .load_item()
            yield loader.load_item()
  • import your item loader into your spider:
    from spiders_automation_demo.itemsloader import QuoteLoader

  • create the loader object using your itemloader
    loader = QuoteLoader(item=QuoteItem(), selector=quote)

  • Make changes to add the selectors through the loader:


loader.add_css('title',"span.text::text")
loader.add_css('author',"small.author::text")

  • Change your yield to use the loader: yield loader.load_item()

Now run and test your updated spider; you will observe that all quote texts are in title case. For example, the original quote text

“Try not to become a man of success. Rather become a man of value.”

is converted to TitleCase as

“Try Not To Become A Man Of Success. Rather Become A Man Of Value.”

5.3. pipelines

We have made good progress with our data scraping task and spiders. Now let's update the project to enable pipelines.

The pipelines option in Scrapy is a great place to perform various operations after an item is scraped and loaded by the item loader. Pipelines can apply a more advanced level of data cleaning and processing during the scraping process, for example dropping an item based on a given condition, such as duplicate items (same title for the quote) or missing values (no value for author). Such requirements can be configured via a pipeline. You can read more about pipelines in the Scrapy documentation at Link.
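
For instance, the duplicate case could look like the minimal sketch below (a hypothetical DropDuplicateTitlesPipeline, not part of the demo project):

from scrapy.exceptions import DropItem

class DropDuplicateTitlesPipeline:
    #a sketch: drop any item whose title was already seen in this crawl
    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        if item['title'] in self.seen_titles:
            raise DropItem(f"Duplicate quote title: {item['title']}")
        self.seen_titles.add(item['title'])
        return item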

For our example, I am going to create a pipeline that keeps only those quotes which have either 'love' or 'life' in the title; if neither word is present, the quote will not be saved or processed further.

A pipeline is created within the pipelines.py file that was generated as boilerplate when the project was created, i.e., spiders_automation_demo/pipelines.py:

from scrapy.exceptions import DropItem

class FilterQuotesPipelineLoveOrLife:
    def process_item(self, item, spider):
        # Check if 'love' or 'life' is in the text
        if 'love' in item['title'].lower() or 'life' in item['title'].lower():
            return item
        else:
            raise DropItem(f"Quote without 'love' or 'life' in title text: {item['title']}")


The pipeline code is simple: you create a class for each individual pipeline, such as FilterQuotesPipelineLoveOrLife, and add the logic in its process_item() method.

Enable the pipeline: unlike the item or the item loader, a pipeline does not take effect by importing it into the spider; you need to enable it in settings.py (again, this file already exists in your project folder, created during project initialization). You can add the newly created pipeline as follows (find the commented part in settings.py and modify it accordingly):

# Enable the custom pipeline
ITEM_PIPELINES = {
    'spiders_automation_demo.pipelines.FilterQuotesPipelineLoveOrLife': 300,
}


That's all; your pipeline for filtering quote texts with 'love' or 'life' is ready, and you don't need to modify your previous spider. Run the spider to test the pipeline. In the Scrapy log you will see the following "Warning" message (and if you store your data, you will find only two quotes are stored):

WARNING: Dropped: Quote without 'love' or 'life' in title text: “It Is Our Choices, Harry, That Show What We Truly Are, Far More Than Our Abilities.”
{'author': 'J.K. Rowling',
'title': '“It Is Our Choices, Harry, That Show What We Truly Are, Far More '
'Than Our Abilities.”'}

5.4. Feeds

Until now we have been viewing the output of our spider runs on the terminal, which is fine while testing, but in a real project you need to store the crawled/scraped data.

The Scrapy tool accepts an argument (-o or -O) on the command line to pass a filename for storing the output, such as:

$scrapy crawl Quotes -o quotes.json

$scrapy crawl Quotes -O quotes.json


There is a difference between -o and -O (i.e., lowercase and uppercase):

  • -o: appends data to the file if it already exists.
  • -O: overwrites the file if it exists.
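
The export format is inferred from the file extension, so the same flags work for the other built-in exporters too, for example:

$scrapy crawl Quotes -O quotes.csv
$scrapy crawl Quotes -O quotes.jl   #JSON Lines, one item per line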

However, in real projects we want to specify the filename, path, and file type in a better way so it is easy to monitor and maintain. Scrapy provides an easy way of achieving this by configuring FEEDS in settings.py (similar to the pipeline configuration). Add the following Python dictionary to settings.py, and it will persist the scraped data as specified.

We are using the spider name in the output filename (to track data by spider) and allowing overwrite if the file already exists (similar to the -O argument on the command line).

FEEDS = {
    '%(name)s_quotes.json': {
        'format': 'json',
        'overwrite': True,
    },
}


Explanation

  • %(name)s_quotes.json: This uses the spider's name as a prefix for the output file.
  • format: 'json': Specifies that the output format is JSON.
  • overwrite: True: Ensures the file is overwritten if it already exists.
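
The demo keeps the simple configuration above, but FEEDS accepts a few more useful keys (a sketch based on the Scrapy feed-exports options; %(time)s is a built-in placeholder for the run timestamp):

FEEDS = {
    'data/%(name)s_%(time)s.json': {
        'format': 'json',
        'encoding': 'utf8',
        'indent': 4,
    },
}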

Now, if you run your spider, at the end there will be one JSON file created (Quotes_quotes.json) containing all the scraped and filtered quotes, such as:

[
  {
    "title": "“There Are Only Two Ways To Live Your Life. One Is As Though Nothing Is A Miracle. The Other Is As Though Everything Is A Miracle.”",
    "author": "Albert Einstein"
  },
  {
    "title": "“It Is Better To Be Hated For What You Are Than To Be Loved For What You Are Not.”",
    "author": "André Gide"
  }
]


5.5. Settings

By now you have already used settings.py for pipelines and feeds; however, there are many more configurations that can be set or unset via settings.py, and they apply project-wide. You also have the option to create settings specific to a spider within the spider code using the custom_settings option. We are not going to use this in the demo, but you can explore it on your own; a small sketch follows.
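
For reference, a per-spider override is just a class attribute (a minimal sketch; the demo project does not use it):

class QuotesSpider(scrapy.Spider):
    name = "Quotes"
    #these values override settings.py for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 1,
        'ROBOTSTXT_OBEY': False,
    }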

If you open settings.py, you will see several options enabled by default, such as:


BOT_NAME = "spiders_automation_demo"

SPIDER_MODULES = ["spiders_automation_demo.spiders"]
NEWSPIDER_MODULE = "spiders_automation_demo.spiders"

# Obey robots.txt rules
ROBOTSTXT_OBEY = True


We are working with sandbox or dummy source websites, so we don't need many configuration changes, including ROBOTSTXT_OBEY. Many sites don't allow crawling, and they signal this using a robots.txt file; if ROBOTSTXT_OBEY = True, Scrapy will not scrape such a site, and if you want to crawl anyway you can set it to False (ROBOTSTXT_OBEY = False).

robots.txt is a convention, so on its own it doesn't stop scraping; obeying it is simply the ethical approach of respecting robots.txt.

3) Spiders Automation

Now your scraping project is ready, with one spider and all the other required components. Let's move on to understanding the automation process and how to create spiders to scrape quotes from other sources, such as the other dummy site:

Manual process:

Let's first go through the manual process. You have two options:

  1. Create a new spider using scrapy genspider and copy the parse method:
$scrapy genspider QuotesAjitSource https://quotes-scrape.netlify.app/

import scrapy

class QuoteajitsourceSpider(scrapy.Spider):
    name = "QuotesAjitSource"
    allowed_domains = ["quotes-scrape.netlify.app"]
    start_urls = ["https://quotes-scrape.netlify.app/"]

    def parse(self, response):
        pass

  2. Create a Python file and copy-paste the code from the previous spider.

With either approach, you will be copying, pasting, and modifying the code for every new spider, i.e., for every new source. And if you remember, we started with the problem of a scraping project with more than 1000 sources.

Let's follow the manual process as we did for the previous source, using the inspect option in Chrome.

A quote and the respective HTML for use in the selectors

#spiders_automation_demo/spiders/QuotesAjitSource.py
import scrapy
from spiders_automation_demo.items import QuoteItem
from spiders_automation_demo.itemsloader import QuoteLoader

class QuoteajitsourceSpider(scrapy.Spider):
    name = "QuotesAjitSource"
    allowed_domains = ["quotes-scrape.netlify.app"]
    start_urls = ["https://quotes-scrape.netlify.app/"]

    def parse(self, response):
        #each quote is within <div class="quote-container" ...>
        quotes = response.css("div.quote-container")
        for quote in quotes:
            #create your loader with the item
            loader = QuoteLoader(item=QuoteItem(), selector=quote)
            #each quote text is within <h2 class="title" ...>
            loader.add_css('title', "h2.title::text")
            #each author info is within <p class="author" ...>
            loader.add_css('author', "p.author::text")

            #yield your item via the loader using .load_item()
            yield loader.load_item()

The output JSON file will be empty because there are no quotes with 'love' or 'life' in them. So, let's modify our filter pipeline and add 'work' to it.

from scrapy.exceptions import DropItem

class FilterQuotesPipelineLoveOrLife:
    def process_item(self, item, spider):
        # Check if 'love', 'life' or 'work' is in the text
        if 'love' in item['title'].lower() or 'life' in item['title'].lower() or 'work' in item['title'].lower():
            return item
        else:
            raise DropItem(f"Quote without 'love' or 'life' or 'work' in title text: {item['title']}")



Now, open your output JSON file and you will have the following quotes:

[
  {
    "title": "If I Find 10,000 Ways Something Won'T Work, I Haven'T Failed. I Am Not Discouraged, Because Every Wrong Attempt Discarded Is Another Step Forward.",
    "author": "— Thomas Edison"
  },
  {
    "title": "As A Cure For Worrying, Work Is Better Than Whisky.",
    "author": "— Thomas Edison"
  }
]


Now you have two spiders, so let's explore the differences between them. We have used a diff tool to highlight the differences; you can also check through the link.

The differences between the two spiders for the two sources.

If you observe the differences between the spiders' code, you will notice the following:

  1. Spider-specific values / metadata
  • Spider class name
  • name of the spider
  • value in allowed_domains
  • value in start_urls

  2. Source-specific values, mainly selector values
  • Selector for the main div holding an individual quote
  • Selector for the title text of an individual quote
  • Selector for the author text of an individual quote

Automatic process: using a template for the spider and a JSON file

1. Create a json file

Create a JSON file in which each attribute of the item (for example: title and author) is a key and its selector (e.g. div.title) is the value. Also include source information like the site name and URLs, which will be used to create the spider class. Note that the file holds a list of source objects, because the generation script iterates over it:

[
  {
    "spidername": "Quotes Ajit Source2",
    "start_urls": ["https://quotes-scrape.netlify.app/"],
    "quote_div_main": "div.quote-container",
    "title_selector": "h2.title",
    "author_selector": "p.author"
  }
]


2. Create a template file

We are going to use a Jinja2 template (Documentation), so let's start with one of the spiders' code and modify it into a Jinja2 template.

#spiders_automation_demo/templates/spiders/quotespider_template.jinja2
import scrapy
from spiders_automation_demo.items import QuoteItem
from spiders_automation_demo.itemsloader import QuoteLoader

class {{ spiderclass }}(scrapy.Spider):
    name = "{{spidername}}"
    allowed_domains = ["{{allowed_domains}}"]
    start_urls = {{start_urls}}

    def parse(self, response):
        #each quote is within the source-specific container div
        quotes = response.css("{{quote_div_main}}")
        for quote in quotes:
            #create your loader with the item
            loader = QuoteLoader(item=QuoteItem(), selector=quote)
            #add the source-specific selectors via the loader
            loader.add_css('title', "{{title_selector}}::text")
            loader.add_css('author', "{{author_selector}}::text")

            #yield your item via the loader using .load_item()
            yield loader.load_item()

3. Create a Python script to generate spiders (using the template and the JSON file created in the previous steps)

#bulk_gen_spiders.py
import json
from jinja2 import Environment, FileSystemLoader
from urllib.parse import urlparse
import os

# Load JSON configuration
with open('sources.json') as f:
    sources = json.load(f)

# Set up Jinja2 environment
env = Environment(loader=FileSystemLoader('spiders_automation_demo/templates/spiders/'))
template = env.get_template('quotespider_template.jinja2')   

# Folder where you want to save the spider file
folder_name = 'spiders_automation_demo/spiders'

for config in sources:
    # Extract values and manipulate strings
    spidername = config['spidername'].title().replace(" ", "")
    spiderclass = spidername.capitalize()+'Spider'
    start_urls = config['start_urls']
    parsed_url = urlparse(start_urls[0])
    allowed_domains = parsed_url.netloc

    # Render the template with values
    output = template.render(
        spiderclass=spiderclass,
        spidername=spidername,
        allowed_domains=allowed_domains,
        start_urls=start_urls,
        quote_div_main=config['quote_div_main'],
        title_selector=config['title_selector'],
        author_selector=config['author_selector']
    )

    # Ensure the folder exists
    os.makedirs(folder_name, exist_ok=True)

    # Save the rendered template to a new Python file in the specified folder
    file_path = os.path.join(folder_name, f'{spiderclass}.py')
    with open(file_path, 'w') as f:
        f.write(output)

    print(f"Spider generated: {file_path}")



Give the proper path for storing spiders (folder_name = 'spiders_automation_demo/spiders') and the paths of the template (env = Environment(loader=FileSystemLoader('spiders_automation_demo/templates/spiders/'))) and the JSON file, then run your script. For example:

(scrapy-demo)$python bulk_gen_spiders.py 


If all goes well, you will have spiders for all the added sources under the spiders folder of the project (be careful with the script location: run it from the project root so the relative paths resolve).

Now you can run and test 1-2 spiders created by the generation script; if anything is wrong, fix it (either in the generation script or in the template file) and rerun the script.
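
Once the generated spiders look right, you can also run the whole batch from a single script instead of invoking scrapy crawl once per spider. Below is a sketch using Scrapy's CrawlerProcess (run it from the project root so the project settings are found; the filename run_all_spiders.py is just an example):

#run_all_spiders.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

#schedule every spider registered in the project, then start the crawl
for spider_name in process.spider_loader.list():
    process.crawl(spider_name)
process.start()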

Now, out of curiosity, have a look at both spiders' code, the manual one versus the one generated by the script using the template and JSON (diff link). You can see that there is no difference in the code (the single vs. double quotes in start_urls could be fixed too :))

Difference between the source code of the manual vs. automated spiders.

Congratulations!! You have automated spider creation, and you can keep adding spiders for new sources in batches by just adding five values to sources.json.

For example, if you find two other sources for quotes and want to add spiders for each:
1) Brainy Quote URL

2) AZ Quotes [URL]

Then you just need to inspect the respective webpages and modify the sources.json file as follows:

[
  {
    "spidername": "Quote Ajit Source",
    "start_urls": ["https://quotes-scrape.netlify.app/"],
    "quote_div_main": "div.quote-container",
    "title_selector": "h2.title",
    "author_selector": "p.author"
  },
  {
    "spidername": "Brainy Quote",
    "start_urls": ["https://www.brainyquote.com/topics/life-quotes"],
    "quote_div_main": "div.grid-item.qb.clearfix.bqQt",
    "title_selector": "div",
    "author_selector": "a.bq-aut.qa_109542.oncl_a"
  },

  {
    "spidername": "AZ Quotes",
    "start_urls": ["https://www.azquotes.com/quotes/topics/life.html"],
    "quote_div_main": "div.wrap-block",
    "title_selector": "p a.title",
    "author_selector": "div.author a"
  }
]


Now run your spider generation script again as follows, and you will have three new spiders created (you can add code to skip creating a spider if its file already exists, as sketched below).

(scrapy-demo)$python bulk_gen_spiders.py 
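
As sketched here, a small guard in the generation script keeps reruns from overwriting spiders you may have fixed by hand (this goes inside the for-loop of bulk_gen_spiders.py, before the file is written; a suggested addition, not in the original script):

    # Skip sources whose spider file already exists
    file_path = os.path.join(folder_name, f'{spiderclass}.py')
    if os.path.exists(file_path):
        print(f"Skipping existing spider: {file_path}")
        continue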

Note: the newly created spider for Brainy Quote will give a 403 error (crawling restriction), so its data can't be scraped with the current settings; bypassing such restrictions is a topic for another day.

Run the spider for AZ Quotes and you will get the following data in the JSON file (AzQuotes_quotes.json):

(scrapy-demo)$scrapy crawl AzQuotes

[
  {
    "title": "The Happiness Of Your Life Depends Upon The Quality Of Your Thoughts.",
    "author": "Marcus Aurelius"
  },
  {
    "title": "Without Forgiveness Life Is Governed By... An Endless Cycle Of Resentment And Retaliation.",
    "author": "Roberto Assagioli"
  },
  {
    "title": "The Tragedy Of Life Is What Dies Inside A Man While He Lives.",
    "author": "Albert Schweitzer"
  },

......truncated 
]


Note: This article is in revision, so there may be some typos, grammatical mistakes, or technical errors. Let me know via a comment if you find any technical errors, and I will fix them in a revised version.
