Scrapfly for Scrapfly

Posted on Mar 6, 2024

Web Scraping Dynamic Websites With Scrapy Playwright

#webscraping #scrapy #playwright #headlessbrowsers

Scrapy is a widely used web scraping library with a convenient and comprehensive architecture for common web scraping tasks. However, it lacks a major feature: JavaScript rendering.

In this tutorial, we'll explore Scrapy Playwright, a Scrapy integration that allows scraping dynamic web pages with Scrapy. We'll explain web scraping with Scrapy Playwright through an example project and show how to use it for common scraping use cases, such as clicking elements, scrolling and waiting for elements. Let's dive in!

What is Scrapy Playwright?

scrapy-playwright is an integration between Scrapy and Playwright. It enables scraping dynamic web pages with Scrapy by processing the web scraping requests using a Playwright instance.

Scrapy Playwright also gives access to the underlying Playwright pages, which enables most of the Playwright features, such as:

Simulating mouse and keyboard actions.
Waiting for events, load states and HTML elements.
Taking screenshots.
Executing custom JavaScript code.

How to Install Scrapy Playwright?

To web scrape with Scrapy Playwright, we'll have to install a few Python libraries:

Scrapy: For creating a Scrapy project and executing the scraping spiders.
scrapy-playwright: A middleware for processing the requests using Playwright.
Playwright: The Playwright Python API for automating the headless browsers.

The above libraries can be installed using the pip command:

pip install scrapy scrapy-playwright playwright

After running the above command, install the Playwright headless browser binaries and dependencies:

playwright install chromium
playwright install-deps chromium

The above commands will install the Chromium binaries and their system dependencies. We can also specify other browser engines: firefox or webkit.

🙋‍ Note that scrapy-playwright relies on the asyncio SelectorEventLoop. So, to use Scrapy Playwright on Windows, we have to use WSL, an interface for running Linux environments in Windows. For the installation instructions, refer to the official Microsoft guide.

How to Scrape with Scrapy Playwright?

In this section, we'll go over a step-by-step tutorial on creating a Scrapy project, integrating it with Playwright and creating a scraping Spider to extract data using Playwright.

This Scrapy Playwright tutorial will briefly cover the basics of Scrapy. For further details, refer to our dedicated guide on Scrapy.

Setting Up Scrapy Project

Let's start by creating a new Scrapy project with the startproject command:

$ scrapy startproject reviewgather reviewgather-scraper
#                     ^ name       ^ project directory

Executing the above command will create a Scrapy project in the reviewgather-scraper folder. Let's navigate to its directory and inspect the created project files:

$ cd reviewgather-scraper
$ tree
.
├── reviewgather
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py 
│   └── spiders
│       ├── __init__.py 
└── scrapy.cfg

Now that the Scrapy project is ready, let's power it with Playwright!

Integrating Playwright With Scrapy

Setting up Playwright with Scrapy is fairly straightforward. All we have to do is add the following download handlers to the settings.py file in the Scrapy project:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

Also, enable the AsyncioSelectorReactor by making sure the following line exists in the same file, adding it if it doesn't:

TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
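
A quick sanity check can catch a missing or mistyped entry before the first crawl. The sketch below is only an illustration: find_missing_settings is a hypothetical helper, and it takes a plain dict standing in for the values defined in settings.py.

```python
# Values taken from the settings shown above
HANDLER = "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler"
REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"


def find_missing_settings(settings: dict) -> list[str]:
    """Return the names of the Playwright-related settings that are absent or wrong.

    Hypothetical helper: `settings` is a plain dict, not a Scrapy Settings object.
    """
    problems = []
    handlers = settings.get("DOWNLOAD_HANDLERS", {})
    for scheme in ("http", "https"):
        if handlers.get(scheme) != HANDLER:
            problems.append(f"DOWNLOAD_HANDLERS[{scheme!r}]")
    if settings.get("TWISTED_REACTOR") != REACTOR:
        problems.append("TWISTED_REACTOR")
    return problems


complete = {
    "DOWNLOAD_HANDLERS": {"http": HANDLER, "https": HANDLER},
    "TWISTED_REACTOR": REACTOR,
}
print(find_missing_settings(complete))  # an empty list means nothing is missing
```

An empty list means both the download handlers and the reactor are in place.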

Our Scrapy project can now use Playwright. Let's create the first Scrapy Playwright scraping spider to put it to the test!

Creating Scraping Spider

In this Scrapy Playwright tutorial, we'll scrape review data from web-scraping.dev:

[Image: webpage with review data. Caption: Reviews on web-scraping.dev]

To scrape the above review data, we have to create a Scrapy spider:

$ scrapy genspider reviews web-scraping.dev
#                  ^ name  ^ domain to scrape

The above Scrapy command will generate a spider file named reviews.py with boilerplate code:

import scrapy


class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]
    start_urls = ["https://web-scraping.dev"]

    def parse(self, response):
        pass

The starting point of the above spider is start_urls, which is used for crawling purposes. Since our scraping target is only one page, we'll change it to a start_requests function and request the target web page with Playwright:

import scrapy


class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True
            }
        )

    def parse(self, response):
        reviews = response.css("div.testimonial")
        for review in reviews:
            yield {
                "rate": len(review.css("span.rating > svg").getall()),
                "text": review.css("p.text::text").get()
            }

Let's go through the above spider changes:

Add a start_requests function along with the target page URL.
Request the URL using scrapy.Request and add the playwright key to the request metadata to process it with Playwright.
Update the parse() callback function to parse the review data on the page by iterating over the review elements and extracting their data using CSS selectors.

The next step is executing the reviews spider and saving the results:

scrapy crawl reviews --output reviews.json

The above command will create a reviews.json file with the extracted data:

[
    {"rate": 5, "text": "We've been using this utility for years - awesome service!"},
    {"rate": 5, "text": "This Python app simplified my workflow significantly. Highly recommended."},
    {"rate": 4, "text": "Had a few issues at first, but their support team is top-notch!"},
    {"rate": 5, "text": "A fantastic tool - it has everything you need and more."},
    {"rate": 5, "text": "The interface could be a little more user-friendly."},
    {"rate": 5, "text": "Been a fan of this app since day one. It just keeps getting better!"},
    {"rate": 4, "text": "The recent updates really improved the overall experience."},
    {"rate": 3, "text": "A decent web app. There's room for improvement though."},
    {"rate": 5, "text": "The app is reliable and efficient. I can't imagine my day without it now."},
    {"rate": 1, "text": "Encountered some bugs. Hope they fix it soon."}
]
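
To sanity-check an export like this, the file can be loaded with the standard library and summarized. This is a minimal sketch: summarize_reviews is a hypothetical helper, and the inline sample only mirrors the structure of reviews.json rather than the real file.

```python
import json
from collections import Counter


def summarize_reviews(raw_json: str) -> dict:
    """Return the review count, average rating and rating distribution."""
    reviews = json.loads(raw_json)
    rates = [review["rate"] for review in reviews]
    return {
        "count": len(rates),
        "average": round(sum(rates) / len(rates), 2) if rates else None,
        "distribution": dict(Counter(rates)),
    }


# Inline sample with the same keys as reviews.json;
# in practice, pass the contents of the file instead.
sample = '[{"rate": 5, "text": "awesome"}, {"rate": 3, "text": "decent"}]'
print(summarize_reviews(sample))
```

Comparing the count between runs is a simple way to confirm whether later changes to the spider actually load more reviews.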

Cool, our Scrapy Playwright scraping spider extracted the review data! However, it only contains the data from the first review page. To load and scrape more reviews, we have to scroll down the page. To do this, let's have a closer look at configuring Scrapy Playwright and automating the headless browser!

Implement Common Scraping Cases With Scrapy Playwright

In the following sections, we'll explore configuring Playwright with the Scrapy setup and controlling the Playwright headless browser for common web scraping use cases.

The scrapy-playwright middleware supports most of the Playwright methods. This means that we can apply the regular Playwright features in Scrapy. For further details on these features, refer to our dedicated guide on Playwright.

Configuring Scrapy Playwright

Before we explore using Scrapy Playwright to execute different web scraping tasks, let's have a look at configuring the Playwright browser and its context first.

The scrapy-playwright middleware allows for defining global Playwright configuration through the settings.py file in the Scrapy project:

#settings.py

PLAYWRIGHT_BROWSER_TYPE = "chromium"

PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": False, # run in the headful mode
    "timeout": 60 * 1000,  # 60 seconds
}

PLAYWRIGHT_CONTEXTS = {
    "some_context_name": {
        "viewport": {"width": 1280, "height": 720},
        "locale": "fr-FR",
        "timezone_id": "Europe/Paris",
    }
}

In the above code, we define two different configurations:

Launch options: A timeout for the browser instance and whether to run the browser in headless mode.
Browser context: The Playwright browser emulation settings, such as viewport and locality configuration.
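
Because PLAYWRIGHT_CONTEXTS is a dictionary keyed by profile name, more than one profile can be declared side by side. The fragment below is a sketch: the profile names desktop_fr and mobile_us and their values are made up for illustration, and is_mobile is a Playwright context option that is assumed to be used with the chromium browser type configured above.

```python
# settings.py (sketch): two hypothetical context profiles
PLAYWRIGHT_CONTEXTS = {
    "desktop_fr": {
        "viewport": {"width": 1280, "height": 720},
        "locale": "fr-FR",
        "timezone_id": "Europe/Paris",
    },
    "mobile_us": {
        "viewport": {"width": 390, "height": 844},
        "locale": "en-US",
        "timezone_id": "America/New_York",
        "is_mobile": True,  # emulate a mobile device
    },
}
```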

The Playwright context settings are global and can include several context profiles. They can be used by declaring the profile name in the request metadata:

        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_context": "some_context_name",
            }
        )
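
Since every request in this tutorial repeats the same metadata keys, they could be assembled in one place. The helper below is hypothetical (build_playwright_meta is not part of scrapy-playwright); it only builds a plain dict from the meta keys used in this tutorial.

```python
from typing import Any, Optional


def build_playwright_meta(
    context: Optional[str] = None,
    page_methods: Optional[list] = None,
    include_page: bool = False,
) -> dict[str, Any]:
    """Build a request meta dict from the keys used in this tutorial."""
    meta: dict[str, Any] = {"playwright": True}
    if context is not None:
        meta["playwright_context"] = context
    if page_methods:
        meta["playwright_page_methods"] = list(page_methods)
    if include_page:
        meta["playwright_include_page"] = True
    return meta


print(build_playwright_meta(context="some_context_name"))
```

The returned dictionary can then be passed as the meta argument of scrapy.Request.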

Scrapy Playwright also allows for defining custom Headers and Cookies that will be used across all the requests:

# settings.py

from playwright.async_api import Request
from scrapy.http.headers import Headers

def custom_headers(
    browser_type: str,
    playwright_request: Request,
    scrapy_headers: Headers,
) -> dict:
    return {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.35"}

PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers

Here, we define a custom_headers function that returns specific header values and assign it to PLAYWRIGHT_PROCESS_REQUEST_HEADERS so it's used across all Playwright requests. It will also override the default Scrapy headers and the headers passed to the Scrapy request.
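
If the goal is to keep the headers set on the Scrapy request and only override a few of them, the two sets have to be merged rather than replaced. The sketch below shows the merge step only: merge_headers is a hypothetical helper, and plain dicts stand in for Scrapy's Headers object, so converting that object to a dict is left out.

```python
def merge_headers(scrapy_headers: dict, overrides: dict) -> dict:
    """Merge two header mappings case-insensitively; overrides win."""
    merged = {name.lower(): value for name, value in scrapy_headers.items()}
    merged.update({name.lower(): value for name, value in overrides.items()})
    return merged


request_headers = {"Accept-Language": "en-US", "User-Agent": "Scrapy"}
overrides = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64)"}
print(merge_headers(request_headers, overrides))
```

Header names are lowercased before merging because HTTP header names are case-insensitive, so "User-Agent" and "user-agent" count as the same header.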

Scrolling

Let's update our previous Scrapy Playwright scraping spider to scroll down and load more reviews. For this, we'll use the scrapy-playwright PageMethod, which supports most of the default Playwright page methods.

We'll execute custom JavaScript code to simulate scroll actions, load all the review data and then parse it:

import scrapy
from scrapy_playwright.page import PageMethod


class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # execute the scroll script
                    PageMethod("evaluate", "for (let i = 0; i < 8; i++) setTimeout(() => window.scrollTo(0, document.body.scrollHeight), i * 2000);"),
                    # wait 15 seconds for the scheduled scrolls to finish
                    PageMethod("wait_for_timeout", 15000)
                ],
            }
        )

    async def parse(self, response):
        reviews = response.css("div.testimonial")
        for review in reviews:
            yield {
                "rate": len(review.css("span.rating > svg").getall()),
                "text": review.css("p.text::text").get()
            }

The above code is almost the same as the previous spider. We only add JavaScript code for scrolling and wait 15 seconds for the script to finish.
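
The 15-second wait follows from the script itself: 8 scrolls scheduled 2 seconds apart put the last one at 14 seconds, leaving a 1-second buffer. The sketch below makes that arithmetic explicit. build_scroll_script is a hypothetical helper, and the 1-second default buffer is an assumption chosen to match the values above.

```python
def build_scroll_script(
    iterations: int, interval_ms: int, buffer_ms: int = 1000
) -> tuple[str, int]:
    """Return the scroll JavaScript and a matching fixed wait in milliseconds."""
    script = (
        f"for (let i = 0; i < {iterations}; i++) "
        "setTimeout(() => window.scrollTo(0, document.body.scrollHeight), "
        f"i * {interval_ms});"
    )
    # The last scroll is scheduled at (iterations - 1) * interval_ms
    last_scroll_ms = (iterations - 1) * interval_ms
    return script, last_scroll_ms + buffer_ms


script, wait_ms = build_scroll_script(8, 2000)
print(wait_ms)  # 15000
```

The two returned values could be passed to the evaluate and wait_for_timeout page methods respectively, which keeps the wait in sync when the number of scrolls changes.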

If we run the above spider and look at the result file, we'll find all the review data scraped:

[
    ....
    {"rate": 5, "text": "I've tried many similar apps, but this one stands out with its exceptional performance and features."},
    {"rate": 2, "text": "The app's user interface is outdated and not intuitive. It needs a modern redesign."},
    {"rate": 5, "text": "I'm extremely satisfied with this app. It has exceeded my expectations in every way."},
    {"rate": 5, "text": "The app's documentation is comprehensive and easy to follow, making it easy to get started."},
    {"rate": 5, "text": "The app's performance has been flawless. I haven't experienced any issues or slowdowns."}    
]

We can successfully handle infinite scrolling with Scrapy Playwright. However, the script waits for a fixed timeout, which isn't advised. Let's wait for a specific element instead!

Timeouts and Waiting For Elements

Playwright provides support for different waiting types:

An event.
A function to finish.
A load state, such as load, domcontentloaded or networkidle.
A URL, in case of navigation.
A specific element to be present.
Fixed timeouts.

Relying on dynamic waits instead of fixed timeouts is more efficient in terms of performance, as it removes unnecessary delays between the script actions.
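
The difference can be illustrated with a small polling loop. This is a conceptual sketch using only the standard library, not how Playwright implements its waits: wait_until is a hypothetical helper that returns as soon as a condition becomes true instead of always sleeping for the full limit.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float, interval: float = 0.05) -> bool:
    """Poll `condition` until it is true or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return condition()


start = time.monotonic()
ready_at = start + 0.2  # pretend the element appears after 0.2 seconds
found = wait_until(lambda: time.monotonic() >= ready_at, timeout=5.0)
elapsed = time.monotonic() - start
print(found, f"{elapsed:.2f}s")  # finishes well before the 5-second limit
```

A fixed 5-second timeout would have spent the whole 5 seconds here, while the condition-based wait stops shortly after the condition is met.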

The load state and fixed timeouts are usually used to wait for the natural page loading without explicit actions from the scraper side:

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # fixed timeout wait
                    PageMethod("wait_for_timeout", 5000),
                    # wait for the document to load
                    PageMethod("wait_for_load_state", "domcontentloaded"),
                    # wait for the network to be idle
                    PageMethod("wait_for_load_state", "networkidle"),
                ],
            }
        )

In the context of our reviews scraper, we'll wait for the last review on the page to load after the scroll:

class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # execute the scroll script
                    PageMethod("evaluate", "for (let i = 0; i < 10; i++) setTimeout(() => window.scrollTo(0, document.body.scrollHeight), i * 2000);"),
                    # wait for the last review element to load
                    PageMethod("wait_for_selector", "div.testimonial:nth-child(60)"),
                ],
            }
        )

    async def parse(self, response):
        reviews = response.css("div.testimonial")
        for review in reviews:
            yield {
                "rate": len(review.css("span.rating > svg").getall()),
                "text": review.css("p.text::text").get()
            }

Here, we use the same JavaScript code to scroll and add another PageMethod to wait for the last review element to appear in the HTML.

Taking Screenshots

To capture a screenshot with Scrapy Playwright, we can utilize the screenshot page method:

class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    PageMethod(
                        "screenshot",
                        path="screenshot.png",
                        full_page=True # whether to capture the whole page
                    ),
                ],
            }
        )

Here, we pass the screenshot method to PageMethod to save the screenshot in the project directory. However, screenshots are usually captured after some browser actions. Luckily, we can capture the screenshot from the callback function instead:

class ReviewsSpider(scrapy.Spider):
    name = "reviews"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/testimonials"
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_include_page": True
            }
        )

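    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.screenshot(path="screenshot.png", full_page=True)
        # pages requested with playwright_include_page must be closed explicitly
        await page.close()

Here, we pass the Playwright page instance to the callback using the playwright_include_page meta key. Then, we access it from the response metadata, use it to take a screenshot directly, and close the page once we're done with it.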
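
Pages handed to the callback this way stay open until they are closed explicitly, so it helps to close them even when an action fails. The sketch below shows the try/finally pattern with the standard library only. FakePage is a hypothetical stand-in for a Playwright page whose screenshot call always fails, and capture is an illustrative helper rather than part of scrapy-playwright.

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a Playwright page, used only for this demo."""

    def __init__(self) -> None:
        self.closed = False

    async def screenshot(self, path: str, full_page: bool = False) -> bytes:
        raise RuntimeError("simulated screenshot failure")

    async def close(self) -> None:
        self.closed = True


async def capture(page, path: str) -> bool:
    """Take a screenshot and always close the page afterwards."""
    try:
        await page.screenshot(path=path, full_page=True)
        return True
    except RuntimeError:
        return False
    finally:
        # runs on success and on failure, so the page never stays open
        await page.close()


page = FakePage()
ok = asyncio.run(capture(page, "screenshot.png"))
print(ok, page.closed)  # False True
```

In a real spider, the same try/finally block would wrap the actions performed on response.meta["playwright_page"].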

Clicking Buttons And Filling Forms

Interacting with the DOM elements on a page is common in web scraping. In this Scrapy Playwright tutorial, we'll explain clicking buttons and filling forms by attempting to log in to the web-scraping.dev/login example.

We'll create a Scrapy Playwright spider to request the page URL, accept the cookies policy, fill in the login credentials, and then click the login button:

# spiders/login.py
# scrapy crawl login
import scrapy
from scrapy_playwright.page import PageMethod


class LoginSpider(scrapy.Spider):
    name = "login"
    allowed_domains = ["web-scraping.dev"]

    def start_requests(self):
        url = "https://web-scraping.dev/login?cookies="
        yield scrapy.Request(
            url=url,
            meta={
                "playwright": True,
                "playwright_page_methods": [
                    # wait for the page to fully load
                    PageMethod("wait_for_load_state", "networkidle"),
                    # accept the cookie policy
                    PageMethod("click", "button#cookie-ok"),
                    # fill in the login credentials
                    PageMethod("fill", "input[name='username']", "user123"),
                    PageMethod("fill", "input[name='password']", "password"),
                    # click submit button
                    PageMethod("click", "button[type='submit']"),
                    # wait for an element on the redirect page
                    PageMethod("wait_for_selector", "div#secret-message"),
                ]
            }
        )

    def parse(self, response):
        print(f"The secret message is {response.css('div#secret-message::text').get()}")
        "The secret message is 🤫"

Before you run the above spider, make sure to disable the default Scrapy headers by adding the following line to the settings.py file: PLAYWRIGHT_PROCESS_REQUEST_HEADERS = None

In the above Scrapy Playwright scraper, we use the click and fill methods to complete the login process, waiting for load states and elements between the steps to ensure a successful execution.

ScrapFly: Scrapy Playwright Alternative

ScrapFly is a web scraping API that supports scraping dynamic web pages using a JavaScript rendering feature. It also provides built-in JavaScript scenarios for controlling the headless browsers for common scraping use cases, such as waiting for elements, scrolling, filling and clicking elements.

Moreover, ScrapFly allows for scraping at scale by providing:

Anti-scraping protection bypass: For scraping any website without getting blocked.
Residential proxies in over 50 countries: For avoiding IP address blocking and throttling while also allowing for scraping from almost any geographical location.
Scrapy Integration, as well as Python and Typescript SDKs.
And much more!


ScrapFly is available as a Scrapy integration. Simply add the following lines to the settings.py file in the Scrapy project to authorize the API calls and set the concurrency limit:

SCRAPFLY_API_KEY = "Your ScrapFly API key"
CONCURRENT_REQUESTS = 2  # Adjust according to your plan's rate limit and your needs

Let's replicate the last Scrapy spider with the ScrapFly API. All we have to do is enable the asp parameter to avoid scraping blocking and control the headless browser through the JavaScript scenarios.

ScrapFly X Scrapy:

from scrapfly import ScrapeConfig
from scrapfly.scrapy import ScrapflyScrapyRequest, ScrapflySpider, ScrapflyScrapyResponse


class LoginSpider(ScrapflySpider):
    name = 'login'
    allowed_domains = ['web-scraping.dev']


    def start_requests(self):
        yield ScrapflyScrapyRequest(
            scrape_config=ScrapeConfig(
                # target website URL
                url="https://web-scraping.dev/login?cookies=",
                # bypass anti scraping protection
                asp=True,        
                # set the proxy location to a specific country
                country="US",
                # enable JavaScript rendering
                render_js=True,
                # scroll down the page automatically
                auto_scroll=True,
                # add JavaScript scenarios
                js_scenario=[
                    {"click": {"selector": "button#cookie-ok"}},
                    {"fill": {"selector": "input[name='username']","clear": True,"value": "user123"}},
                    {"fill": {"selector": "input[name='password']","clear": True,"value": "password"}},
                    {"click": {"selector": "form > button[type='submit']"}},
                    {"wait_for_navigation": {"timeout": 5000}}
                ],
                # take a screenshot
                screenshots={"logged_in_screen": "fullpage"}
            ),
            callback=self.parse
        )


    def parse(self, response: ScrapflyScrapyResponse):
        print(f"The secret message is {response.css('div#secret-message::text').get()}")
        "The secret message is 🤫"

ScrapFly SDK:

from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL
        url="https://web-scraping.dev/login?cookies=",
        # bypass anti scraping protection
        asp=True,        
        # set the proxy location to a specific country
        country="US",
        # # enable the cookies policy
        # headers={"cookie": "cookiesAccepted=true"},
        # enable JavaScript rendering
        render_js=True,
        # scroll down the page automatically
        auto_scroll=True,
        # add JavaScript scenarios
        js_scenario=[
            {"click": {"selector": "button#cookie-ok"}},
            {"fill": {"selector": "input[name='username']","clear": True,"value": "user123"}},
            {"fill": {"selector": "input[name='password']","clear": True,"value": "password"}},
            {"click": {"selector": "form > button[type='submit']"}},
            {"wait_for_navigation": {"timeout": 5000}}
        ],
        # take a screenshot
        screenshots={"logged_in_screen": "fullpage"},
        debug=True
    )
)

# get the HTML from the response
html = api_response.scrape_result['content']

# use the built-in Parsel selector
selector = api_response.selector
print(f"The secret message is {selector.css('div#secret-message::text').get()}")
"The secret message is 🤫"

FAQ

To wrap up this guide on web scraping with Scrapy Playwright, let's have a look at some frequently asked questions.

How to solve the error "NotImplementedError in ('twisted.internet.asyncioreactor.AsyncioSelectorReactor')"?

This is a common error that occurs while running scrapy-playwright in Windows. It happens due to the lack of support for the SelectorEventLoop in Windows. The alternative for using Scrapy Playwright in Windows is running it on WSL. For further details, refer to the official scrapy-playwright known issues.
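
Assuming this limitation applies to the installed scrapy-playwright version, a launcher script could detect the platform up front and print a clearer message than the raw NotImplementedError. This is a minimal sketch; running_on_windows is a hypothetical helper.

```python
import sys


def running_on_windows(platform: str = sys.platform) -> bool:
    """Return True for native Windows Python; WSL reports itself as 'linux'."""
    return platform.startswith("win")


if running_on_windows():
    print("Native Windows detected: consider running the spider inside WSL.")
```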

Can I scrape dynamic web pages with Scrapy?

Yes. Scrapy Playwright is a middleware integration that enables scraping dynamic pages with Scrapy by processing the requests using a Playwright instance.

Are there alternatives for Scrapy Playwright?

Yes, there are other integrations that allow Scrapy to scrape dynamic web pages, such as Scrapy Selenium and Scrapy Splash.

Summary

In this guide, we explored the scrapy-playwright integration, which allows scraping dynamic web pages with Scrapy using Playwright headless browsers.

We went through a step-by-step guide on installing Scrapy Playwright and using it through an example project. We also explained how to implement common web scraping use cases with Scrapy Playwright, such as:

Handling infinite scrolling while scraping.
Executing custom JavaScript code.
Applying timeout waits.
Taking screenshots.
Clicking buttons and filling out forms.
