Let’s Create an End-to-End Web Scraping Pipeline With Scrapy!

Introduction

Web scraping has become an indispensable tool for gathering data, giving developers and data enthusiasts access to valuable information from the web. Libraries like BeautifulSoup4 and Selenium make this task as simple as possible, especially for one-off scripts and basic workflows.

However, web scraping is often just the first step in a broader Extract, Transform, Load (ETL) process. As your needs grow, so does the number of custom scripts, and with no framework to organize these one-off scripts, you'll inevitably end up with a confusing mess down the road.

As Albert Einstein once said, “Everything should be made as simple as possible, but not simpler.” That’s where Scrapy comes in!

In this tutorial, I’m going to walk you through a web scraping ETL process using Scrapy that gathers quotes, like that Einstein quote, and loads them into an SQLite database. We’ll be using Quotes to Scrape as our target scraping site.


We’ll go over the following:

  • Creating a virtual environment for Python.
  • Setting up Scrapy, a Python web scraping framework.
  • Building a web scraper using Scrapy to extract quotes from a website.
  • Configuring a Scrapy pipeline to process and store scraped data.
  • Creating a SQLite database using Python.
  • Storing scraped data in an SQL database.
  • (For Fun) Analyzing scraped data with Pandas and Matplotlib

Step 1: Creating a Virtual Environment

Before we dive in, it’s a good idea to create a clean and isolated Python environment using a virtual environment. This ensures that the packages and dependencies for your Scrapy project won’t interfere with your system-wide Python installation, and it automatically adds Scrapy’s CLI to the virtual environment’s PATH for maximum ease of use.

Here are the steps to create a virtual environment:

  1. Open a terminal or command prompt on your computer.
  2. Navigate to the directory where you want to create your Scrapy project.
  3. Once you’re in the desired directory, run the following command to create a virtual environment named quotesenv (you can replace quotesenv with your preferred name):
python -m venv quotesenv

After running this command, you’ll have a new directory named quotesenv (or your chosen name) in your project directory. This directory contains a clean Python environment where you can install packages without affecting your system-wide Python installation.

Now that you have a virtual environment set up, you can proceed to the next step: installing Scrapy and creating your Scrapy project.

Step 2: Installing Scrapy and Creating Your First Project

It’s time to install Scrapy and create a Scrapy project for our web scraping endeavor.

  1. Activate your virtual environment if it’s not already activated. You can do this by running:
source quotesenv/bin/activate

Replace quotesenv with the name of your virtual environment if it's different.
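
If you’re on Windows, the activation command is slightly different; in a Command Prompt it’s typically:

quotesenv\Scripts\activate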

  2. Now, you can install Scrapy inside your virtual environment using pip:
pip install scrapy
  3. Once Scrapy is installed, you can create a new Scrapy project using the following command:
scrapy startproject quotes_project

This command will create a directory structure for your Scrapy project, including all the necessary files and boilerplate to get you started:

quotes_project/
    scrapy.cfg
    quotes_project/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            quotes_spider.py   (we'll create this file in Step 4)

Here’s how they all work together:

  • scrapy.cfg: This is the Scrapy project configuration file. It contains settings and configurations for your Scrapy project.
  • quotes_project/ (directory): This is the Python package for your Scrapy project.
  • __init__.py: This is an empty Python file that makes the directory a Python package.
  • items.py: This is where you define the structure of the items that will hold the scraped data. You create item classes with fields that correspond to the data you want to scrape.
  • middlewares.py: This file is used to define custom middleware components for your Scrapy project. Middleware can modify requests and responses during the scraping process.
  • pipelines.py: Here, you define data processing pipelines to process and store the scraped data. You can implement actions like storing data in databases, exporting data to files, or performing additional processing.
  • settings.py: This is the main configuration file for your Scrapy project. You can set various project-specific settings, including the user agent, concurrency, and more (see the small example after this list).
  • spiders/ (directory): This directory contains your web-scraping spiders.
  • __init__.py: This is an empty Python file that makes the directory a Python package.
  • quotes_spider.py: This is the spider file where you define how to crawl and scrape data from websites. You create classes that inherit from scrapy.Spider and define how the spider navigates and extracts data from web pages. We'll create ours together in Step 4.

This folder and file structure provides a clear organization for your Scrapy project, separating configuration, item definitions, spider code, and data processing logic.
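
As a quick illustration of the kind of settings you’d tweak in settings.py, here’s a small, optional sketch. The setting names are standard Scrapy options; the values (and the placeholder contact URL) are just examples I’ve chosen, not something this tutorial requires:

# settings.py (illustrative values only; not required for this tutorial)
USER_AGENT = 'quotes_project (+https://example.com/contact)'  # identify your scraper politely
ROBOTSTXT_OBEY = True       # respect the site's robots.txt
DOWNLOAD_DELAY = 0.5        # wait half a second between requests
CONCURRENT_REQUESTS = 8     # cap the number of parallel requests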

Now that you have Scrapy installed and your project set up, let’s move on to defining a spider to scrape quotes from a website. Stay tuned for Step 3!

Step 3: Creating the Item Class For Our Data Structure

In this step, we’ll define the structure of the item that will hold the scraped data. Scrapy uses Items to structure and store the data you extract from websites.

Open the items.py file in your Scrapy project directory. This file is automatically generated when you create your Scrapy project using the scrapy startproject command.

In items.py, you'll notice the CLI has already created a default item class that inherits from scrapy.Item (it's named after your project, e.g. QuotesProjectItem). Rename it to QuotesScraperItem, remove the pass statement, and add the following fields to define its structure:

import scrapy  
class QuotesScraperItem(scrapy.Item):  
    text = scrapy.Field()  
    author = scrapy.Field()

This code snippet defines an item class with two fields: text and author. You could also grab the tags data, but that's beyond the scope of this tutorial. These fields correspond to the data we'll scrape from the website.

With the item structure defined, we’re ready to move on to creating the spider, which is the class responsible for the extract portion of our ETL pipeline. I’ve done the work of inspecting the target website and finding the selectors for the data we’ll be scraping, too.
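
If you'd like to verify those selectors yourself (or hunt for new ones, like the tags), Scrapy's interactive shell is a handy, optional way to do it. This is just a sketch of the commands I'd run; the selectors match the ones the spider below uses:

scrapy shell 'http://quotes.toscrape.com/page/1/'

>>> response.css('div.quote span.text::text').get()   # the first quote's text
>>> response.css('div.quote span small::text').get()  # the first quote's author
>>> response.css('li.next a::attr(href)').get()       # relative URL of the next page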

Step 4: Defining a Scrapy Spider to Scrape Quotes

A Scrapy spider is a class that contains the rules for how to navigate and extract data from a website. Let’s get started:

  1. Create a new file named quotes_spider.py inside the quotes_project/quotes_project/spiders directory. This is where we'll define our spider.
  2. Edit quotes_spider.py and add the code below:
import scrapy
from quotes_project.items import QuotesScraperItem  # defined in the previous step


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuotesScraperItem()
            # getall() returns a list of text nodes, so join them into one string
            item['text'] = ''.join(quote.css('span.text::text').getall())
            item['author'] = quote.css('span small::text').get()
            yield item

        # Follow the "Next" link until there are no more pages
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

This code defines a Scrapy spider named “quotes” that starts at the specified URL and scrapes each quote and its author, then follows the “Next” link until there are no more pages. The ''.join() call is needed for the text field because getall() returns a list of matching text strings, and joining them gives us a single string.

  3. To run the spider and see it in action, use the following command in your project’s root directory:
scrapy crawl quotes

Now, your spider will start scraping quotes from the website. But you have nowhere to put it… Fear not, that comes next!
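
As an optional aside, if you just want to eyeball the scraped items right now, Scrapy’s built-in feed exports can dump them to a file without any extra code. This isn’t part of our pipeline, just a quick sanity check:

scrapy crawl quotes -o quotes.json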

Step 5: Saving Scraped Data to a SQLite Database

Now that you’ve successfully scraped data from the web, it’s time to store that data in a database for future use and analysis. In this step, we’ll set up a Scrapy pipeline that creates an SQLite database and loads the scraped data into it automatically whenever the spider runs.

In Scrapy, pipelines are responsible for processing scraped data. We’ll create a custom pipeline that inserts our scraped items into an SQLite database.

  1. In your Scrapy project folder, navigate to the quotes_project directory (or your project name) and open the pipelines.py file.
  2. Define the following pipeline class at the end of the file.
import sqlite3
from itemadapter import ItemAdapter


class QuotesToSQLitePipeline:
    def __init__(self):
        # Connect to (or create) the SQLite database and make sure the table exists
        self.conn = sqlite3.connect('quotes.db')
        self.cursor = self.conn.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS quotes (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                text TEXT,
                author TEXT
            )
        ''')

    def process_item(self, item, spider):
        # ItemAdapter gives us dict-like access to the item's fields
        adapter = ItemAdapter(item)
        self.cursor.execute('''
            INSERT INTO quotes (text, author)
            VALUES (?, ?)
        ''', (adapter['text'], adapter['author']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

This pipeline class connects to the SQLite database (creating it, and the quotes table, if they don’t yet exist), inserts each scraped item into the table, and closes the connection when the spider is done. The ItemAdapter allows us to access the data within an Item without having to import that Item class itself (less tightly coupled, yay!).

Step 3: Enabling the Pipeline

To enable your custom pipeline, add it to the settings.py file in your Scrapy project folder:

# settings.py
# ...

# Configure item pipelines
ITEM_PIPELINES = {
    'quotes_project.pipelines.QuotesToSQLitePipeline': 300,  # Adjust the priority if needed
}

# ...

In this example, I’ve set the pipeline’s priority to 300, but you can adjust it as necessary.
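
For context, that number controls the order in which pipelines run when more than one is enabled: lower values run earlier. As a purely hypothetical sketch (the DuplicatesFilterPipeline below doesn’t exist in our project; it’s only there to show the ordering), it would look like this:

# settings.py (hypothetical second pipeline, shown only to illustrate ordering)
ITEM_PIPELINES = {
    'quotes_project.pipelines.DuplicatesFilterPipeline': 200,  # hypothetical; would run first
    'quotes_project.pipelines.QuotesToSQLitePipeline': 300,    # our pipeline; runs second
}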

Step 4: Running Your Spider and Saving Data

Now that everything is set up, you can run your spider again and see how Scrapy saves scraped data to the SQLite database:

scrapy crawl quotes

Scrapy will execute your spider and use the custom pipeline to save the scraped quotes and authors to the quotes.db database, which will now be in your project folder.
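
If you have the sqlite3 command-line tool installed, you can sanity-check the result right away without writing any Python (optional):

sqlite3 quotes.db "SELECT COUNT(*) FROM quotes;"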

That’s it for Step 5! You’ve successfully set up an SQLite database and configured a pipeline to save your scraped data. In the next steps, we’ll cover how to retrieve and analyze the data.

Step 6: Retrieving Data from the SQLite Database

You’ve successfully scraped and saved data to an SQLite database using Scrapy. In this step, we’ll explore how to retrieve that data from the database and perform basic queries.

Step 1: Connecting to the Database

To retrieve data from the SQLite database, we need to establish a connection and create a cursor object, then perform a select, and finally print what’s returned from the cursor. Create a file called retrieve_quotes.py in the project’s root folder and copy in the following:

import sqlite3


def retrieve_quotes():
    # Connect to the SQLite database inside a context manager
    with sqlite3.connect('quotes.db') as conn:
        # Create a cursor within the context manager
        cursor = conn.cursor()
        # Execute an SQL query to retrieve all quotes
        cursor.execute('SELECT * FROM quotes')
        # Fetch all the results from the cursor; fetchall() returns a list of rows
        quotes = cursor.fetchall()
        # Display the retrieved data
        for quote in quotes:
            print(quote)


if __name__ == "__main__":
    retrieve_quotes()

Step 2: Running the Script

To run the script and retrieve data from the database, use the following command or run it in your IDE:

python retrieve_quotes.py

The script will connect to the database, retrieve the data, and display it in the terminal.
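
From here, you can run more targeted queries in exactly the same way. As a small optional sketch, here’s how you might filter by a specific author using a parameterized query (the author name is just an example):

import sqlite3

with sqlite3.connect('quotes.db') as conn:
    cursor = conn.cursor()
    # The ? placeholder keeps the query safe and lets us swap in any author name
    cursor.execute('SELECT text FROM quotes WHERE author = ?', ('Albert Einstein',))
    for (text,) in cursor.fetchall():
        print(text)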

That’s it for Step 6! You’ve learned how to retrieve data from the SQLite database where you saved your scraped quotes and authors. In the next steps, we’ll dive deeper into data analysis and visualization.

Step 7: (For Fun) Data Analysis and Visualization

In this optional step, you can explore your scraped data further by performing data analysis and visualization. We’ll use Python libraries like Pandas and Matplotlib to accomplish this. This is barely enough data to bother with, but I thought it’d be nice to show this step to complete the most minimal of MVPs for a complete Scrapy pipeline use case!

Step 1: Importing Required Libraries

First, make sure you have the necessary Python libraries installed. You can install them using pip:

pip install pandas matplotlib

Step 2: Analyzing Data

Now, let’s perform some basic analysis of the scraped data. We’ll use Pandas to load the data from the SQLite database into a DataFrame and then calculate some statistics. In the project’s root folder, create a Python script called data_analysis.py and add the following code. This will print the total number of records using pandas.

import pandas as pd
import sqlite3

# Connect to the SQLite database
conn = sqlite3.connect('quotes.db')

# Load data into a Pandas DataFrame
df = pd.read_sql_query('SELECT * FROM quotes', conn)

# Calculate the number of quotes
total_quotes = len(df)

# Display the total number of quotes
print(f'Total Number of Quotes: {total_quotes}')

Step 3: Visualizing Data

Next, let’s create a simple bar chart to visualize the distribution of quotes by author. We’ll use Matplotlib for this purpose. Add the matplotlib import to the top of data_analysis.py and append the rest of the code below to the end of the file:

import matplotlib.pyplot as plt

# Group quotes by author and count the occurrences
author_counts = df['author'].value_counts()

# Create a bar chart
plt.figure(figsize=(12, 6))
author_counts.plot(kind='bar')
plt.title('Quotes by Author')
plt.xlabel('Author')
plt.ylabel('Number of Quotes')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Step 4: Running the Script

Run the script using the following command:

python data_analysis.py

After running the script, you’ll see the total number of quotes displayed in the console, and a bar chart showing the distribution of quotes by author will pop up.
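
One small note: if you’re running this somewhere without a display (a remote server, for instance), the chart window may not appear; in that case you can save the chart to a file instead by replacing plt.show() with:

plt.savefig('quotes_by_author.png')  # writes the bar chart to an image file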

That concludes Step 7, where you have the option to perform data analysis and visualization on your scraped data.

Recap and Next Steps

Congratulations! You’ve completed my tutorial on the web scraping and data extraction process using Scrapy. At this point, you have:

  1. Created a virtual environment for your Python project to manage dependencies.
  2. Set up a Scrapy project and defined a web scraping spider.
  3. Extracted quotes and authors from a website using Scrapy.
  4. Created an SQLite database and stored the scraped data.
  5. Implemented a Scrapy pipeline to automate data storage.
  6. Performed data analysis and visualization on the scraped data.

Next Steps

Now that you have a solid foundation in web scraping with Scrapy, you can explore more advanced topics and real-world applications. In the coming weeks, I will be writing and linking to tutorials that build on this simple pipeline. They’ll cover topics such as:

  • Scraping data from multiple websites and combining it.
  • Handling different data formats and structures.
  • Scheduling web scraping tasks to run at specific intervals.
  • Implementing user authentication for scraping behind login pages.
  • Scaling up your scraping efforts with distributed systems.

Feel free to apply the knowledge and techniques you’ve learned here to gather data for your projects, research, or data analysis tasks. If you make something cool out of a Scrapy pipeline after reading this, please let me know! I’d love to see it!

Scrape responsibly, and have fun!
