In this tutorial, you will learn about the Bose framework, which provides an easier, more structured way of using Selenium for web scraping. Think of it as a Swiss Army knife for web scraping.
When using Selenium to scrape websites, there is usually a lot of boilerplate work involved such as:
- Downloading the appropriate Chrome driver for Selenium
- Creating the driver by specifying the driver's path, which can be challenging on Windows
- Specifying the correct ChromeOptions to make Selenium undetectable by bot protection sites
- Passing profiles and user agents to Selenium
- Debugging in case of errors.
The Bose framework solves all these problems and puts an end to these nuisances for developers. It is the Django of web scraping.
We will use the Bose framework to scrape quotes.toscrape.com, a website that lists quotes from famous authors.
Along the way, I will also share some great features of the Bose framework. So let's get started.
Getting Started
First, let's download the Bose starter project by cloning the starter template:
```shell
git clone https://github.com/omkarcloud/bose-starter my-bose-project
```
Then, change into the my-bose-project directory, install the dependencies, and start the project:
```shell
cd my-bose-project
python -m pip install -r requirements.txt
python main.py
```
Whenever we use Selenium, we need the Chrome driver version that matches our installed Chrome. In the past, you probably visited the Chrome website to download the correct driver by hand; thankfully, when you run a Bose project for the first time, Bose automatically downloads the correct driver into the build/ directory.
Scraping quotes.toscrape.com
We are going to scrape quotes.toscrape.com for quotes and their authors.
Write the following code in src/scraper.py:
```python
from selenium.webdriver.common.by import By
from bose import BaseTask, Wait, Output

class Task(BaseTask):
    def run(self, driver):
        driver.get("https://quotes.toscrape.com/")
        els = driver.get_elements_or_none_by_selector('div.quote', Wait.SHORT)
        items = []
        for el in els:
            text = driver.get_element_text(el.find_element(By.CSS_SELECTOR, "span.text"))
            author = driver.get_element_text(el.find_element(By.CSS_SELECTOR, "small.author"))
            item = {
                "text": text,
                "author": author,
            }
            items.append(item)
        Output.write_finished(items)
```
This code defines a Python class Task that inherits from BaseTask. All scraping tasks must inherit from BaseTask.
In the run method of Task, we receive the driver object as a parameter, which is an instance of BossDriver. BossDriver extends the Selenium WebDriver with powerful utility methods for scraping. For example, get_elements_or_none_by_selector finds all elements matching the selector div.quote and waits up to 4 seconds (Wait.SHORT) for them to appear. The same thing in raw Selenium would be quite verbose.
Then, the method initializes an empty list items and iterates over the els list, which contains the div elements with the class name quote. For each element, it extracts the text of the quote and the author name using the find_element and get_element_text methods.
Finally, it appends a dictionary item to the items list containing the quote text and author name.
Then we use the Output object from Bose, which simplifies reading and writing JSON and CSV files in Selenium projects. Using it, we write the items to output/finished.json via the Output.write_finished method.
Now, to run the project, execute:

```shell
python main.py
```
You will see it scrape quotes.toscrape.com and write the quotes to output/finished.json.
Furthermore, a tasks/1/ directory will also be generated.
Web scraping can often be fraught with errors, such as incorrect selectors or pages that fail to load. When debugging with raw Selenium, you may have to sift through logs to identify the issue. Fortunately, Bose makes it simple for you to debug by storing information about each run.
If you look at the tasks/1/ directory, you will notice that it contains the files listed below:
task_info.json
It contains information about the task run, such as the task's duration, the IP details, the user agent, the window size, and the profile used to execute the task.
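For illustration, a task_info.json might look something like the fragment below. The field names and values here are hypothetical, based only on the list above; consult your own generated file for the exact schema:

```json
{
    "duration_in_seconds": 12.4,
    "ip_details": { "ip": "203.0.113.7", "country": "US" },
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "window_size": "1280,720",
    "profile": 1
}
```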
final.png
This is a screenshot captured just before the driver was closed.
page.html
This is the HTML source captured just before the driver was closed. It is very useful for figuring out why your selectors failed to match elements.
error.log
If your task crashes due to an exception, Bose also stores error.log, which contains the error that crashed the task. This is very helpful for debugging.
Exception Handling
In Bose, when an exception occurs in a scraping task, the browser remains open instead of immediately closing. This is useful for debugging, as it lets you see the live browser state at the moment the exception occurred.
For example, suppose we replace the code in scraper.py with the following code, which queries a nonexistent selector and therefore raises an exception, and run it:
```python
from selenium.webdriver.common.by import By
from bose import BaseTask, Wait, Output

class Task(BaseTask):
    def run(self, driver):
        driver.get("https://quotes.toscrape.com/")
        # No element matches this selector, so els is None and the
        # for loop below raises a TypeError.
        els = driver.get_elements_or_none_by_selector('div.some-non-existing-selector', Wait.SHORT)
        items = []
        for el in els:
            text = driver.get_element_text(el.find_element(By.CSS_SELECTOR, "span.text"))
            author = driver.get_element_text(el.find_element(By.CSS_SELECTOR, "small.author"))
            item = {
                "text": text,
                "author": author,
            }
            items.append(item)
        Output.write_finished(items)
```
You will notice that the browser does not close; instead, Bose prompts you to press Enter to close it. This is very handy when you want to inspect the live browser state at the moment the exception occurred.
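The behavior can be approximated with a simple pattern. This is a minimal sketch of the general idea, not Bose's actual internals; the function name run_with_debug_pause is ours:

```python
def run_with_debug_pause(task_fn, driver):
    # On an unhandled exception, keep the browser alive and wait for
    # the user to press Enter before closing it, so the live page
    # state can be inspected.
    try:
        task_fn(driver)
    except Exception as exc:
        print(f"Task crashed: {exc!r}")
        input("Press Enter to close the browser...")
        raise
    finally:
        driver.quit()
```

The finally clause guarantees the driver is closed in every case, but only after the user has had a chance to look at the failed run.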
Browser Configuration
Bose makes it easy to configure the Selenium driver with different options, such as
- which profile to use
- which user agent to use
- which window size to use
You can easily configure these options using the BrowserConfig class. For example, here's how to configure the driver to use a Chrome 106 user agent, a window size of 1280x720, and profile 1:
```python
from bose import BaseTask, BrowserConfig, UserAgent, WindowSize

class Task(BaseTask):
    browser_config = BrowserConfig(
        user_agent=UserAgent.user_agent_106,
        window_size=WindowSize.window_size_1280_720,
        profile=1,
    )

    def run(self, driver):
        driver.get("https://quotes.toscrape.com/")
```
In this example, we set the BrowserConfig using the browser_config property of the Task class.
Using Undetected Driver
Bose also supports the undetected_driver
library, which provides a robust driver to help evade detection by anti-bot services like Cloudflare. Although it is slower to start, it is much less detectable. To use it, pass the use_undetected_driver
option to BrowserConfig
, like so:
```python
from bose import BaseTask, BrowserConfig, UserAgent, WindowSize

class Task(BaseTask):
    browser_config = BrowserConfig(
        use_undetected_driver=True,
        user_agent=UserAgent.user_agent_106,
        window_size=WindowSize.window_size_1280_720,
        profile=1,
    )
```
Outputting Data in Bose
Bose makes it easy to output data as CSV, Excel, or JSON using the Output class. To use it, call the write method for the type of file you want to save. All data is saved in the output/ folder:
```python
from bose import Output

data = [
    {
        "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d",
        "author": "Albert Einstein"
    },
    {
        "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d",
        "author": "J.K. Rowling"
    }
]

Output.write_json(data, "data.json")
Output.write_csv(data, "data.csv")
Output.write_xlsx(data, "data.xlsx")
```
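Writers like these are easy to picture in terms of the standard library. The sketch below is an assumption about how such helpers could work, not Bose's actual implementation (it skips the Excel writer, which would need a third-party package):

```python
import csv
import json
import os

OUTPUT_DIR = "output"  # Bose saves everything under output/

def write_json(data, filename):
    # Serialize a list of dicts to a pretty-printed JSON file.
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    with open(os.path.join(OUTPUT_DIR, filename), "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

def write_csv(data, filename):
    # Assumes a non-empty list of flat dicts sharing the same keys.
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    path = os.path.join(OUTPUT_DIR, filename)
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)
```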
Using LocalStorage
Just as modern browsers have a localStorage module, Bose incorporates the same concept into its framework.
You can import the LocalStorage object from Bose to persist data across browser runs, which is extremely useful when scraping large amounts of data.
The data is stored in a file named local_storage.json in the root directory of your project. Here's how you can use it:
```python
from bose import LocalStorage

LocalStorage.set_item("pages", 5)
print(LocalStorage.get_item("pages"))
```
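Conceptually, a JSON-file-backed store like this takes only a few lines. The sketch below illustrates the general mechanism under that assumption; it is not Bose's actual code, and the class name JsonLocalStorage is ours:

```python
import json
import os

class JsonLocalStorage:
    """Minimal key-value store persisted to a JSON file."""

    def __init__(self, path="local_storage.json"):
        self.path = path

    def _load(self):
        # Read the whole store, or start empty if the file is missing.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                return json.load(f)
        return {}

    def set_item(self, key, value):
        data = self._load()
        data[key] = value
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    def get_item(self, key, default=None):
        return self._load().get(key, default)
```

Because every value is re-read from disk, the data survives across separate browser runs, which is the point of the feature.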
Conclusion
In summary, Bose is an excellent framework that simplifies the boring parts of Selenium and web scraping. We encourage you to read the BossDriver reference, which documents the extended version of the Selenium WebDriver and the methods it adds to help you scrape.