In this tutorial, you will learn about the Bose framework, which provides an easier, more structured way of using Selenium for web scraping. Think of it as a Swiss Army knife for web scraping.
When using Selenium to scrape websites, there is usually a lot of boilerplate work involved such as:
- Downloading the appropriate Chrome driver for Selenium
- Creating the driver by specifying the driver's path, which can be challenging on Windows
- Specifying the correct ChromeOptions to make Selenium undetectable by bot protection sites
- Passing profiles and user agents to Selenium
- Debugging in case of errors.
The Bose framework solves all these problems and puts an end to these nuisances for developers. It is the Django of web scraping.
We will use the Bose framework to scrape quotes.toscrape.com, a website that lists quotes from famous authors.
Along the way, I will also share some great features of the Bose framework. So let's get started.
Getting Started
First, let's download the Bose starter project by cloning the starter template:
```shell
git clone https://github.com/omkarcloud/bose-starter my-bose-project
```
Then, change into the my-bose-project directory, install the dependencies, and start the project:
```shell
cd my-bose-project
python -m pip install -r requirements.txt
python main.py
```
Whenever we use Selenium, we need the Chrome driver version that matches our installed Chrome. In the past, you probably visited the Chrome website to download the correct driver by hand; thankfully, when you run a Bose project for the first time, Bose automatically downloads the correct driver into the build/ directory.
Scraping quotes.toscrape.com
We are going to scrape quotes.toscrape.com for quotes and their authors.
Write the following code in src/scraper.py:
```python
from selenium.webdriver.common.by import By
from bose import BaseTask, Wait, Output

class Task(BaseTask):
    def run(self, driver):
        driver.get("https://quotes.toscrape.com/")
        els = driver.get_elements_or_none_by_selector('div.quote', Wait.SHORT)
        items = []
        for el in els:
            text = driver.get_element_text(el.find_element(By.CSS_SELECTOR, "span.text"))
            author = driver.get_element_text(el.find_element(By.CSS_SELECTOR, "small.author"))
            item = {
                "text": text,
                "author": author,
            }
            items.append(item)
        Output.write_finished(items)
```
This code defines a Python class Task that inherits from BaseTask. All scraping tasks must inherit from BaseTask.
In the run method of Task, we receive the driver object as a parameter, which is an instance of BossDriver. BossDriver extends the Selenium WebDriver with powerful utility methods for scraping. For example, get_elements_or_none_by_selector finds all elements matching the selector div.quote and waits up to 4 seconds (Wait.SHORT) for them to appear. The same thing in raw Selenium would be quite verbose.
Then, the method initializes an empty list items and iterates over the els list, which contains the div elements with the class name quote. For each element, it extracts the text of the quote and the author name using the find_element and get_element_text methods.
Finally, it appends a dictionary item to the items list containing the quote text and author name.
Then we use the Output object from Bose, which simplifies reading and writing JSON and CSV files in Selenium projects. Using it, we write the items to output/finished.json via the Output.write_finished method.
Now, to run the project, execute:

```shell
python main.py
```
You will see it scrape quotes.toscrape.com and write the quotes to output/finished.json.
Furthermore, a tasks/1/ directory will also be generated.
Web scraping can often be fraught with errors, such as incorrect selectors or pages that fail to load. When debugging with raw Selenium, you may have to sift through logs to identify the issue. Fortunately, Bose makes it simple for you to debug by storing information about each run.
If you look at the tasks/1/ directory, you will notice that it contains the files listed below:
task_info.json
It contains information about the task run, such as the task's duration, the IP details, the user agent, the window size, and the profile used to execute the task.
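For illustration, a task_info.json might look something like the fragment below. The field names and values here are hypothetical, based only on the list above; consult your own generated file for the exact schema:

```json
{
    "duration_in_seconds": 12.4,
    "ip_details": { "ip": "203.0.113.7", "country": "US" },
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "window_size": "1280,720",
    "profile": 1
}
```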
final.png
This is a screenshot captured just before the driver was closed.
page.html
This is the HTML source captured just before the driver was closed. It is very useful for figuring out why your selectors failed to match elements.
error.log
If your task crashes due to an exception, Bose also stores error.log, which contains the error that crashed the task. This is very helpful for debugging.
Exception Handling
In Bose, when an exception occurs in a scraping task, the browser remains open instead of immediately closing. This is useful for debugging, as it lets you see the live browser state at the moment the exception occurred.
For example, suppose we replace the code in scraper.py with the following code, which queries a nonexistent selector and therefore raises an exception, and run it:
```python
from selenium.webdriver.common.by import By
from bose import BaseTask, Wait, Output

class Task(BaseTask):
    def run(self, driver):
        driver.get("https://quotes.toscrape.com/")
        # No element matches this selector, so els is None and the
        # for loop below raises a TypeError.
        els = driver.get_elements_or_none_by_selector('div.some-non-existing-selector', Wait.SHORT)
        items = []
        for el in els:
            text = driver.get_element_text(el.find_element(By.CSS_SELECTOR, "span.text"))
            author = driver.get_element_text(el.find_element(By.CSS_SELECTOR, "small.author"))
            item = {
                "text": text,
                "author": author,
            }
            items.append(item)
        Output.write_finished(items)
```
You will notice that the browser does not close; instead, Bose prompts you to press Enter to close it. This is very handy when you want to inspect the live browser state at the moment the exception occurred.
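The behavior can be approximated with a simple pattern. This is a minimal sketch of the general idea, not Bose's actual internals; the function name run_with_debug_pause is ours:

```python
def run_with_debug_pause(task_fn, driver):
    # On an unhandled exception, keep the browser alive and wait for
    # the user to press Enter before closing it, so the live page
    # state can be inspected.
    try:
        task_fn(driver)
    except Exception as exc:
        print(f"Task crashed: {exc!r}")
        input("Press Enter to close the browser...")
        raise
    finally:
        driver.quit()
```

The finally clause guarantees the driver is closed in every case, but only after the user has had a chance to look at the failed run.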
Browser Configuration
Bose makes it easy to configure the Selenium driver with different options, such as
- which profile to use
- which user agent to use
- which window size to use
You can easily configure these options using the BrowserConfig class. For example, here's how to configure the driver to use a Chrome 106 user agent, a window size of 1280x720, and profile 1:
```python
from bose import BaseTask, BrowserConfig, UserAgent, WindowSize

class Task(BaseTask):
    browser_config = BrowserConfig(
        user_agent=UserAgent.user_agent_106,
        window_size=WindowSize.window_size_1280_720,
        profile=1,
    )

    def run(self, driver):
        driver.get("https://quotes.toscrape.com/")
```
In this example, we set the BrowserConfig using the browser_config property of the Task class.
Using Undetected Driver
Bose also supports the undetected_driver
library, which provides a robust driver to help evade detection by anti-bot services like Cloudflare. Although it is slower to start, it is much less detectable. To use it, pass the use_undetected_driver
option to BrowserConfig
, like so:
```python
from bose import BaseTask, BrowserConfig, UserAgent, WindowSize

class Task(BaseTask):
    browser_config = BrowserConfig(
        use_undetected_driver=True,
        user_agent=UserAgent.user_agent_106,
        window_size=WindowSize.window_size_1280_720,
        profile=1,
    )
```
Outputting Data in Bose
Bose makes it easy to output data as CSV, Excel, or JSON using the Output class. To use it, call the write method for the type of file you want to save. All data is saved in the output/ folder:
```python
from bose import Output

data = [
    {
        "text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d",
        "author": "Albert Einstein"
    },
    {
        "text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d",
        "author": "J.K. Rowling"
    }
]

Output.write_json(data, "data.json")
Output.write_csv(data, "data.csv")
Output.write_xlsx(data, "data.xlsx")
```
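Writers like these are easy to picture in terms of the standard library. The sketch below is an assumption about how such helpers could work, not Bose's actual implementation (it skips the Excel writer, which would need a third-party package):

```python
import csv
import json
import os

OUTPUT_DIR = "output"  # Bose saves everything under output/

def write_json(data, filename):
    # Serialize a list of dicts to a pretty-printed JSON file.
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    with open(os.path.join(OUTPUT_DIR, filename), "w", encoding="utf-8") as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

def write_csv(data, filename):
    # Assumes a non-empty list of flat dicts sharing the same keys.
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    path = os.path.join(OUTPUT_DIR, filename)
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)
```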
Using LocalStorage
Just as modern browsers have a localStorage module, Bose incorporates the same concept into its framework.
You can import the LocalStorage object from Bose to persist data across browser runs, which is extremely useful when scraping large amounts of data.
The data is stored in a file named local_storage.json in the root directory of your project. Here's how you can use it:
```python
from bose import LocalStorage

LocalStorage.set_item("pages", 5)
print(LocalStorage.get_item("pages"))
```
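Conceptually, a JSON-file-backed store like this takes only a few lines. The sketch below illustrates the general mechanism under that assumption; it is not Bose's actual code, and the class name JsonLocalStorage is ours:

```python
import json
import os

class JsonLocalStorage:
    """Minimal key-value store persisted to a JSON file."""

    def __init__(self, path="local_storage.json"):
        self.path = path

    def _load(self):
        # Read the whole store, or start empty if the file is missing.
        if os.path.exists(self.path):
            with open(self.path, encoding="utf-8") as f:
                return json.load(f)
        return {}

    def set_item(self, key, value):
        data = self._load()
        data[key] = value
        with open(self.path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    def get_item(self, key, default=None):
        return self._load().get(key, default)
```

Because every value is re-read from disk, the data survives across separate browser runs, which is the point of the feature.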
Conclusion
In summary, Bose is an excellent framework that simplifies the boring parts of Selenium and web scraping. We encourage you to read the BossDriver reference, which documents the extended version of the Selenium WebDriver and the methods it adds to help you scrape.