Most websites nowadays are single-page applications (SPAs) that use JavaScript to load data, which can make web scraping difficult. Such dynamic rendering requires headless browsers, which are known to be resource-intensive and challenging to manage.
In this article, we'll explore cloud headless browsers. These services eliminate the effort and time needed to manage locally hosted headless browser solutions. We'll take a look at what cloud browsers are and then go through a practical example of web scraping with self-hosted cloud browsers using Selenium Grid. Let's dive in!
What Are Cloud Browsers?
Web browsers can be automated using popular tools such as Selenium, Playwright, and Puppeteer for web scraping and web testing purposes.
However, these tools are difficult to scale, and managing complex software like a web browser in each scraping process can be very challenging, slow, and error-prone.
In response, a pool of web browsers can be deployed and managed in a cloud environment and served to each web scraper as a service. These are called cloud browsers.
The primary goal of cloud browsers is to simplify and scale up web browser automation tasks like testing and scraping, and this can be approached in one of two ways:
- Running multiple browser instances on Docker or Kubernetes.
- Using existing Web Driver proxy servers like Selenium Grid.
In this guide, we'll look at the latter with a practical cloud-based web scraping example using Selenium Grid. First, though, let's see what the benefits of using cloud browsers are.
Why Scrape With Cloud Browsers?
Cloud browsers offer several advantages over traditional headless browsers run locally with tools like Selenium, Puppeteer, and Playwright. Here are some of the key benefits of using cloud browsers for web scraping:
Automatic scaling
Scraping dynamic websites at scale requires running multiple headless browsers in parallel. Such concurrent execution is complex and difficult to manage locally. Cloud browsers, in contrast, can be scaled up and down automatically based on demand.
Resource efficiency
Headless browsers are resource-intensive, and running them locally for each scraping process at scale can get expensive. Since scraping is an IO-intensive task, headless browser processes often sit idle waiting for network responses while still consuming resources. Aggregating these processes in a cloud environment can save a lot of resources.
Bypass scraping blocking
Websites use anti-bot mechanisms to detect headless browsers used for automation, such as web scrapers. Once a headless browser is detected, the website can block the request or respond with a CAPTCHA challenge.
Using a cloud browser pool, we can improve our bypass chances by constantly monitoring the health of the browsers and rotating them. We can warm up browsers with normal browsing patterns and distribute the load more evenly across multiple browser configurations, drastically decreasing blocking rates.
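As a simple illustration, here's a minimal sketch of rotating a few browser configurations when requesting sessions from a grid like the one we set up later in this article. The profile values and grid URL are illustrative examples:
import random
from selenium import webdriver

# example profiles to rotate through; the values are illustrative
PROFILES = [
    {"window_size": "1920,1080", "lang": "en-US"},
    {"window_size": "1366,768", "lang": "en-GB"},
    {"window_size": "1440,900", "lang": "de-DE"},
]

def get_rotated_driver(grid_url="http://localhost:4444/wd/hub"):
    """request a remote browser from the grid with a randomly chosen profile"""
    profile = random.choice(PROFILES)
    options = webdriver.ChromeOptions()
    options.add_argument(f"--window-size={profile['window_size']}")
    options.add_argument(f"--lang={profile['lang']}")
    return webdriver.Remote(command_executor=grid_url, options=options)
Each scraping task then gets a slightly different browser, spreading the traffic across more than one fingerprint.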
Cloud Browsers With Selenium Grid
Selenium Grid is a server that enables executing multiple Selenium WebDriver instances on a remote machine, typically hosted with a cloud provider. It consists of two main components:
- Hub: a remote server that accepts incoming requests with the WebDriver details in JSON and routes them to the nodes for execution.
- Node: a virtual device with a specified operating system, browser name, and version. It executes web browser instances based on the JSON instructions provided by the hub.
The Selenium Grid capabilities can be summarized into two points:
- Executing the desired number of headless browsers in parallel.
- Specifying the browser name, version, and operating system.
In the following sections, we'll go over a step-by-step tutorial on cloud browser scraping with Selenium Grid. First, have a look at the installation process.
Setup
The easiest way to install Selenium Grid is with Docker, following the official Docker installation guide.
Using Docker Compose makes this straightforward: create a docker-compose.yml file and add the following code:
version: '3.8'
services:
  hub:
    image: selenium/hub:4.13.0
    ports:
      - 4442:4442
      - 4443:4443
      - 4444:4444
    environment:
      GRID_MAX_SESSION: 8
  chrome_node_1:
    image: selenium/node-chrome:4.13.0
    depends_on:
      - hub
    environment:
      SE_EVENT_BUS_HOST: hub
      SE_EVENT_BUS_PUBLISH_PORT: 4442
      SE_EVENT_BUS_SUBSCRIBE_PORT: 4443
      SE_NODE_STEREOTYPE: "{\"browserName\":\"chrome\",\"browserVersion\":\"117\",\"platformName\": \"Windows 10\"}"
The above code represents the two main Selenium Grid components:
- Hub: represented by the hub service, based on the official Selenium hub image.
- Node: represented by the chrome_node_1 service. It pulls the Chrome node image, which can be replaced with other browser types, such as Firefox, Edge, or Opera.
The remaining configuration represents the required port mapping and browser details defined by the SE_NODE_STEREOTYPE variable.
Next, spin up the docker-compose file using the following command:
docker-compose up --build
To verify your installation, go to the Selenium Grid dashboard URL at http://localhost:4444. You should be able to access the following page:
We can see the declared Chrome instance ready to accept connections. A few variables are displayed; let's break them down:
- Queue size: the number of connection requests waiting in the queue for execution.
- Sessions: the number of WebDriver sessions currently connected to the node.
- Max concurrency: the maximum number of headless browsers that can run on each node, as defined in the docker-compose file.
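The same information is also exposed programmatically through the grid's /status endpoint, which is handy for health checks in scraping pipelines. Here's a minimal sketch using the third-party requests package (the exact JSON shape may vary slightly between Grid versions):
import requests  # pip install requests

# query the grid status endpoint
status = requests.get("http://localhost:4444/status").json()
print("grid ready:", status["value"]["ready"])
for node in status["value"]["nodes"]:
    print("node:", node["id"], "max sessions:", node["maxSessions"])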
The Selenium Grid server is ready to accept connections to execute cloud browsers. For this, we'll install Python Selenium for communication:
pip install selenium
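To confirm that Python can reach the grid, we can open a short remote session and request the exact browser declared in the node stereotype above. This is a minimal sketch assuming the grid runs on the default http://localhost:4444:
from selenium import webdriver

options = webdriver.ChromeOptions()
# request the browser version and platform declared in SE_NODE_STEREOTYPE
options.set_capability("browserVersion", "117")
options.set_capability("platformName", "Windows 10")
driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",
    options=options,
)
print(driver.capabilities["browserName"], driver.capabilities["browserVersion"])
driver.quit()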
Cloud Scraping With Selenium Grid
Let's go through a practical cloud web scraping example. For this, we'll use the remote Selenium Grid server to spin up a headless browser and automate the login page on web-scraping.dev/login:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def get_driver():
    """return a web driver instance from a selenium grid node"""
    options = webdriver.ChromeOptions()
    # disable sharing memory across the instances
    options.add_argument('--disable-dev-shm-usage')
    # initialize a remote WebDriver
    driver = webdriver.Remote(
        command_executor="http://127.0.0.1:4444/wd/hub",
        options=options
    )
    return driver

def scrape_form():
    """automate the login form using the remote web driver"""
    driver = get_driver()
    driver.get("https://web-scraping.dev/login?cookies=")
    # define a timeout
    wait = WebDriverWait(driver, timeout=5)
    # accept the cookie policy
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button#cookie-ok")))
    driver.find_element(By.CSS_SELECTOR, "button#cookie-ok").click()
    # wait for the login form
    wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "button[type='submit']")))
    # fill in the login credentials
    username_input = driver.find_element(By.CSS_SELECTOR, "input[name='username']")
    username_input.clear()
    username_input.send_keys("user123")
    password_input = driver.find_element(By.CSS_SELECTOR, "input[name='password']")
    password_input.clear()
    password_input.send_keys("password")
    # click the login submit button
    driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    # wait for an element on the login redirected page
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#secret-message")))
    secret_message = driver.find_element(By.CSS_SELECTOR, "div#secret-message").text
    print(f"The secret message is: {secret_message}")
    "The secret message is: 🤫"
    # close the browser
    driver.quit()

scrape_form()
The above code consists of two main functions. Let's break them down:
- get_driver: initiates a remote WebDriver instance on a Selenium Grid node.
- scrape_form: automates the remote cloud browser using the Selenium API functions and parses the scraped data.
The above code represents the core concept behind cloud browser scraping. The main difference is that we are running the server for cloud browsers locally. However, it should be deployed on a remote machine when extracting data in production.
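To get a feel for the parallelism the grid provides, here's a minimal sketch that runs several of the login scrapes above concurrently with a thread pool. Each call to scrape_form opens its own remote session, which the grid schedules across the available node slots; the task and worker counts are arbitrary examples:
from concurrent.futures import ThreadPoolExecutor

def scrape_many(n_tasks=4, max_workers=4):
    """run several scrape_form tasks in parallel against the grid"""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(scrape_form) for _ in range(n_tasks)]
        for future in futures:
            future.result()  # surface any scraping errors

scrape_many()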
Refer to our dedicated guide on Selenium Grid for further details on using it for concurrent web scraping.
How to Bypass Scraper Blocking with Cloud Browsers?
There are a few key differences between headless browsers and regular ones, and anti-bot services detect headless browsers by spotting these differences. For example, the navigator.webdriver value is only set to true in automated browsers:
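You can verify this yourself from Selenium by evaluating the property in the page context, for example with a driver obtained from the get_driver function above:
driver = get_driver()
driver.get("https://web-scraping.dev/")
# prints True in a default Selenium-driven browser session
print(driver.execute_script("return navigator.webdriver"))
driver.quit()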
There are open-source tools available that obfuscate these differences from the target website to bypass web scraping blocking.
To start, Undetected ChromeDriver can be used to patch the Selenium web driver with many fixes related to browser HTTP, TLS, and JavaScript fingerprint resilience.
Further, the Puppeteer-stealth plugin contains a lot of patches that remove headless browser indicators like the aforementioned navigator.webdriver variable. While it doesn't directly integrate with the Selenium Grid setup we've covered in this tutorial, it is trivial to replicate each patch in a Selenium environment.
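To give a taste of what such patches look like, here's a minimal sketch of a few commonly used ChromeOptions tweaks that remove some obvious automation hints before connecting to the grid. They reduce, but by no means eliminate, detection:
from selenium import webdriver

options = webdriver.ChromeOptions()
# drop the "enable-automation" switch and the automation extension
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option("useAutomationExtension", False)
# disable the Blink feature responsible for navigator.webdriver being true
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Remote(
    command_executor="http://127.0.0.1:4444/wd/hub",
    options=options,
)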
If that seems like too much work, why not give Scrapfly's cloud browsers a chance instead?
Cloud Browsers With ScrapFly
ScrapFly is a web scraping API with JavaScript rendering capabilities using cloud browsers.
Scrapfly's browsers can be automated using automation scenarios that allow full control of the browser, or controlled directly through JavaScript execution.
Each Scrapfly browser is managed with unique real fingerprints bypassing any anti-bot detection and efficiently scraping any target.
Furthermore, ScrapFly enables data extraction at scale by providing:
- Anti-scraping protection bypass - For bypassing anti-scraping protection mechanisms, such as Cloudflare.
- Millions of residential proxy IPs in 50+ countries - For preventing IP address blocking and throttling while also allowing for scraping from almost any geographical location.
- Easy to use Python and Typescript SDKs, as well as Scrapy integration.
- And much more!
ScrapFly service does the heavy lifting for you
Here's how to scrape with cloud browsers using ScrapFly. All we have to do is enable the render_js and asp parameters, select a proxy country, and declare the browser automation steps through the js_scenario parameter:
from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse
scrapfly = ScrapflyClient(key="Your ScrapFly API key")
api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL
        url="https://web-scraping.dev/login",
        # bypass anti scraping protection
        asp=True,
        # set the proxy location to a specific country
        country="US",
        # accept the cookie policy through headers
        headers={"cookie": "cookiesAccepted=true"},
        # enable JavaScript rendering (use a cloud browser)
        render_js=True,
        # scroll down the page automatically
        auto_scroll=True,
        # automate the browser
        js_scenario=[
            {"click": {"selector": "button#cookie-ok"}},
            {"fill": {"selector": "input[name='username']", "clear": True, "value": "user123"}},
            {"fill": {"selector": "input[name='password']", "clear": True, "value": "password"}},
            {"click": {"selector": "form > button[type='submit']"}},
            {"wait_for_navigation": {"timeout": 5000}}
        ],
        # take a screenshot
        screenshots={"logged_in_screen": "fullpage"},
        debug=True
    )
)
# get the HTML from the response
html = api_response.scrape_result['content']
# use the built-in Parsel selector
selector = api_response.selector
print(f"The secret message is {selector.css('div#secret-message::text').get()}")
"The secret message is 🤫"
FAQ
To wrap up this guide on web scraping with cloud browsers, let's have a look at some frequently asked questions.
What is the difference between headless and cloud browsers?
Headless browser is a common term for a browser run without a graphical interface through automation libraries such as Selenium, Playwright, and Puppeteer. Cloud browsers, on the other hand, are scalable WebDriver instances deployed on remote machines.
How to scale headless browsers?
To scale headless browsers, multiple instances of Selenium or Playwright can be run through Docker behind a round-robin load balancer, or dedicated scaling tools like Selenium Grid or Selenoid can be used.
What are managed cloud browsers?
Self-hosting cloud browsers can be difficult, and managed cloud browser services handle the infrastructure and scaling for you. Web scraping focused cloud browser services like ScrapFly also fortify the headless browser instances to bypass scraper blocking and increase browser rendering speeds through optimization patches, making them a much easier solution for smaller teams.
Summary
In this article, we explored web scraping with cloud browsers. We started by defining them and the advantages of using them in the data extraction context:
- Automatically scale to hundreds of browser instances.
- Effectively save resources by utilizing cloud infrastructure.
- Bypass web scraping blocking by hiding their automation traces.
Finally, we went through a step-by-step guide on creating a cloud browser server using Selenium Grid for web scraping.