Steadylearner

Posted on May 31, 2021 • Edited on Jun 23, 2023 • Originally published at steadylearner.com

How to use Python Scrapy to scrape a website with examples

#webscraping #tutorial #python #beginners

In this post, we will learn how to use Python Scrapy.

We will use the Rust news website This Week in Rust as an example. If you are a Rust developer, you will find that you can easily extract only the parts you want from its pages.

Otherwise, use any other website you want.

Prerequisites

I assume you already have experience with Python.

It will also help to spend a few hours reading the Scrapy documentation.

1. Set up Python development environment
2. Inspect the website
3. Write Python Scrapy code
4. Conclusion

You can skip step 1 if you already have a Scrapy development environment ready.

1. Set up Python development environment

We will start by setting up a Scrapy development environment with venv and pip. Use this command.

$python3 -m venv scrapy

It will create a directory named scrapy on your machine with a structure similar to this.

bin  include  lib  lib64  pyvenv.cfg  share

We can ignore most of these. We only need the bin/activate file, which activates the virtual environment, and we should activate it before working on a project.

You will have more Scrapy projects later, so making an alias for it will save you time. Use this command.

$vim ~/.bashrc

Then, add a line similar to this.

alias usescrapy="source /home/<youraccount>/Desktop/code/scrapy/bin/activate"

Replace /home/<youraccount>/Desktop/code/ with the directory where you created the virtual environment. You can find it with the $pwd command. Then, run $source ~/.bashrc and you can activate this Python development environment with just $usescrapy whenever you want.

Type $usescrapy and then $pip install ipython scrapy. It will install the minimal dependencies needed to use Python Scrapy.

If you want to reuse exactly the same packages later, use these commands.

$pip freeze > requirements.txt to extract the list of them.
$pip install -r requirements.txt to install them later.

2. Inspect the website

I hope you have already visited the Rust notification website or the other website you want to crawl.

Follow the process used here alongside the Scrapy Tutorial and apply it later to the website you want to scrape.

I will assume you already know how to use the browser inspector and are familiar with CSS and HTML.

The purpose of This Week In Rust is to give you useful links relevant to Rust for each week.

Its homepage lists links to the recent issues.

When you visit each of them, you will see lists of links for blog posts, crates (packages in Rust), calls for participation, events, jobs, etc.

Go back to the homepage, open your browser inspector with CTRL+SHIFT+I, and look at how its HTML is structured. You can see that it is just a simple static website with a CSS framework.

Inspect the part of the page that lists the This Week in Rust publications. You will find many HTML tags similar to this.

<a href="https://this-week-in-rust.org/blog/this-week-in-rust/">This Week in Rust</a>

Collecting those links to follow will be our main job for this page. They are the entry points to the pages with the target information that we will scrape.

Visit one of them. When you inspect the jobs section and the other sections you want to scrape, you will see that they share a similar structure.

Our main targets are the text and href of each a tag, which give us the job titles and job links. Each a tag is wrapped in an li, whose parent element is a ul.

You can see that each ul is also preceded by an h1 or h2 tag with an id. Knowing how the HTML tags are organized for the data we want to scrape will help you test the Scrapy code we will write in the next part.
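To make that structure concrete, here is a minimal sketch that uses only Python's standard library, not Scrapy. The HTML fragment, job names, and example.com links are hypothetical and much simpler than the real page. Only the rust-jobs id is taken from the selectors used later in this post.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: a heading with an id, a paragraph,
# then a ul whose li items each wrap one a tag.
FRAGMENT = """
<div>
  <h1 id="rust-jobs">Rust Jobs</h1>
  <p>Job offers for this week.</p>
  <ul>
    <li><a href="https://example.com/jobs/1">Rust Engineer at Example Corp</a></li>
    <li><a href="https://example.com/jobs/2">Backend Developer at Example Labs</a></li>
  </ul>
</div>
"""

root = ET.fromstring(FRAGMENT)

# The heading's id tells us which section we are in.
heading = root.find("./h1[@id='rust-jobs']")

# Each job is an a tag inside ul > li: its text is the title, its href is the link.
jobs = [(a.text, a.get("href")) for a in root.findall("./ul/li/a")]

print(heading.text)
for title, link in jobs:
    print(title, "->", link)
```

Real pages are rarely well-formed XML, so this is only a way to see the heading, ul, li, a nesting. Scrapy's own selectors are what we will use for the actual website.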

3. Write Python Scrapy code

We set up the development environment and gathered the information we need in the previous parts. What is left is to write the Python code for Scrapy.

Before that, use the shell command from the Scrapy CLI to test how a Scrapy program will see the webpage.

$scrapy shell https://this-week-in-rust.org

Use another website you want to scrape if you have one. Then, the console will switch to IPython mode with information similar to this.

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at>
[s]   item       {}
[s]   request    <GET https://this-week-in-rust.org>
[s]   response   <200 https://this-week-in-rust.org>
[s]   settings   <scrapy.settings.Settings>
[s]   spider     <DefaultSpider 'default'>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

Use view(response) first to verify that your target website can be read by Scrapy. For example, if the website is rendered with JavaScript, it may not work well and you will need to look up further documentation for that case.

With This Week In Rust, there will be no problem because it is just a normal static website.

You can play with the Scrapy shell using request, response, etc. For example, try response.body or response.css("title::text").get(). Then, exit with quit() and start your Scrapy project.

Use $scrapy startproject notification rust.

It will automatically generate a Scrapy project named notification inside a folder named rust and show a message similar to this in your console.

    cd rust
    scrapy genspider example example.com

You can use $scrapy startproject -h for more information.

Follow the instructions.

Then, use a command similar to $scrapy genspider this_week_in_rust this-week-in-rust.org/.

It should have created spiders/this_week_in_rust.py on your machine. Next, we will write the code for the spider (this_week_in_rust.py).

Edit it to look similar to this.

import scrapy


class ThisWeekInRustSpider(scrapy.Spider):
    name = 'this_week_in_rust'
    start_urls = ['https://this-week-in-rust.org/']

    # 1. Collect the link to each publication from the homepage.
    def parse(self, response):
        # Or test it with $scrapy shell https://this-week-in-rust.org/
        for href in response.css("div.custom-xs-text-left > a::attr(href)").getall():
            yield response.follow(href, self.parse_jobs)

    # 2. Extract the jobs from each publication page.
    def parse_jobs(self, response):
        # Parts 4 to 6 of the split URL are joined with "-" to form the date.
        date = "-".join(response.url.split("/")[4:7])

        # Or test it with $scrapy shell https://this-week-in-rust.org/blog/<date>/<text>
        job_titles = response.css("#rust-jobs ~ p ~ ul > li > a::text").getall()
        job_urls = response.css("#rust-jobs ~ p ~ ul > li > a::attr(href)").getall()
        jobs = {"total_jobs": len(job_titles), **dict(zip(job_titles, job_urls))}

        # 3. Return one item per publication.
        yield {
            "date": date,
            **jobs,
        }

        # Alternatively, nest the jobs under a single key.
        # yield {
        #     "date": date,
        #     "jobs": jobs,
        # }

We just converted the information we gathered in the previous part into Python code with Scrapy.

1. We extract the links to the publication pages with CSS selectors. div.custom-xs-text-left narrows the selection so that we only take the href of the a tags we want.

We want every link to follow, so we use getall().

Then, we define what to do with them in the parse_jobs callback function.

2. This is the payload of the whole process. We extract the date of the publication, the total number of jobs, and the titles and links of the Rust jobs to make the information useful.
Then, we collect them into a Python dict that Scrapy can export as JSON.

You can see the pattern: only the id part, such as #news-blog-posts or #rust-jobs, differs between sections, and the rest of the selector is repeated.

You can easily include events, calls for participation, etc. from the website if you want to scrape other parts.

3. We return the data we want to use here.
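As a rough sketch of steps 2 and 3 outside of Scrapy, the function below builds the same kind of dict from plain strings and lists. The name build_item, the issue URL, and the job data are hypothetical and only serve as an illustration. It assumes that parts 4 to 6 of the split URL hold the year, month, and day.

```python
def build_item(url, job_titles, job_urls):
    """Mirror the dict built in parse_jobs, without needing a Scrapy response."""
    # Parts 4..6 of the split URL are assumed to be year, month, and day.
    date = "-".join(url.split("/")[4:7])
    return {
        "date": date,
        "total_jobs": len(job_titles),
        # Pair each title with its link: title becomes the key, link the value.
        **dict(zip(job_titles, job_urls)),
    }


# Hypothetical issue URL and job data, for illustration only.
item = build_item(
    "https://this-week-in-rust.org/blog/2021/05/26/example-issue/",
    ["Rust Engineer at Example Corp", "Backend Developer at Example Labs"],
    ["https://example.com/jobs/1", "https://example.com/jobs/2"],
)
print(item)
```

Because the job titles become dict keys, two jobs with exactly the same title would collapse into one entry while total_jobs still counts both. Keep that in mind if the numbers do not seem to match.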

Your code will differ from this if you used another website, but the main process for finding what you want will be similar.

Get the links to follow to visit the payload webpages.
Extract the information you want at each page.
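The first step works even when the page uses relative links, because response.follow resolves each href against the URL of the current page. The sketch below shows that kind of resolution with urljoin from the standard library. The hrefs are hypothetical.

```python
from urllib.parse import urljoin

page_url = "https://this-week-in-rust.org/"

# Hypothetical hrefs as they might appear in a tags on the page.
hrefs = [
    "/blog/2021/05/26/example-issue/",  # relative link
    "https://this-week-in-rust.org/blog/2021/05/19/example-issue/",  # absolute link
]

# Relative links are completed with the page URL, absolute links stay as they are.
absolute_urls = [urljoin(page_url, href) for href in hrefs]
for url in absolute_urls:
    print(url)
```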

Test that it works with $scrapy crawl this_week_in_rust -o rust_jobs.json.

Then, you can verify that the result has a structure similar to this.

[
  {"date": "", "total_jobs": "", "job_name": "job_link"}
]

It may not be ordered by date. If you want it sorted, make a Python file similar to this.

# sort_json.py
import json
import sys

target = sys.argv[1]

# Read the list of items exported by Scrapy.
with open(target, 'r') as reader:
    items = json.load(reader)

# Sort newest first. This only works for a list of dicts with a "date" key.
sorted_items = sorted(items, key=lambda i: i["date"], reverse=True)

# Overwrite the same file with the sorted list.
with open(target, 'w') as writer:
    json.dump(sorted_items, writer, indent=4)

print(f"Sort {target} by dates.")

Use it with $python sort_json.py rust_jobs.json and it will sort the JSON file by date.

You should comment out or remove sort_json.py from your Scrapy project if you want to use this project later.
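Sorting by the date string is enough here, assuming the dates come out as zero-padded year-month-day strings, because then alphabetical order matches chronological order. A small sketch with hypothetical items:

```python
# Hypothetical items shaped like the exported JSON.
items = [
    {"date": "2021-05-19", "total_jobs": 3},
    {"date": "2021-12-01", "total_jobs": 1},
    {"date": "2021-05-26", "total_jobs": 2},
]

# Plain string comparison works for zero-padded YYYY-MM-DD dates.
newest_first = sorted(items, key=lambda i: i["date"], reverse=True)
print([i["date"] for i in newest_first])
```

If your dates are not zero-padded or use another order, parse them with datetime before sorting instead.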

4. Conclusion

In this post, we learnt how to use Python Scrapy. If you followed this post well, all you need to do later is run $scrapy genspider and edit the Python file (the spider) it generates.

I hope this post is helpful for other Rust developers who wait for This Week In Rust every week and also for people who want to learn Python Scrapy.

If you need to hire a developer, you can contact me.

Thanks.