For dynamic, JavaScript-heavy sites, Selenium is the tool of choice for extracting data. Read on to learn how to extract data from web pages using Selenium.
The easiest websites to scrape are static pages whose content is fully delivered with the initial request. Sadly, these types of sites are gradually fading out, and dynamic websites are taking over.
With dynamic sites, not all of a page's content is available when the page first loads; content is added dynamically after specific JavaScript events, which poses a problem for scraping tools designed for static websites. Fortunately, with tools like Selenium, you can trigger those JavaScript events and scrape any page you want, no matter how JavaScript-rich it is.
With Selenium, you are not tied to a single language as you are with some other tools; Selenium supports Python, Ruby, Java, C#, and JavaScript. In this article, we will use Selenium and Python to extract web data. Before we go into the details, it is worth looking at what Selenium is and when you should use it.
Selenium WebDriver – an Overview
Selenium was not initially developed for web scraping; it was built for testing web applications and has since found a use in web scraping. In technical terms, Selenium, or more precisely Selenium WebDriver, is a portable framework for testing web applications.
In simple terms, all Selenium does is automate web browsers, and as the team behind Selenium rightly puts it, what you do with that power is up to you. Selenium supports Windows, macOS, and Linux. In terms of browser support, you can use it to automate Chrome, Firefox, Internet Explorer, Edge, and Safari. Also important is the fact that Selenium can be extended with third-party plugins. With Selenium, you can automate filling out forms, clicking buttons, taking screenshots of pages, and other specific online tasks. One of those tasks is web data extraction. While you can use it for web scraping, it is certainly not a Swiss Army knife of web scraping; it has downsides that will make you avoid it for certain use cases.
The most notable of its downsides is its slow speed. If you have used Scrapy or the combination of Requests and BeautifulSoup, you have a speed benchmark against which Selenium will rank as slow. This is because it drives a real browser, and every page has to be fully rendered.
For this reason, developers only use Selenium for JavaScript-rich sites whose underlying APIs are difficult to call directly. With Selenium, all you do is automate the process, and the required events are triggered for you. For static sites whose API requests you can easily replicate and whose content is downloaded on page load, you will want the better option: Scrapy or the duo of Requests and BeautifulSoup.
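For context, this is roughly what that static-site alternative looks like; a minimal sketch using Requests and BeautifulSoup, with example.com standing in as a placeholder for any static page.

import requests
from bs4 import BeautifulSoup

# One GET request fetches the full page, then BeautifulSoup parses the HTML.
response = requests.get("https://example.com/")
soup = BeautifulSoup(response.text, "html.parser")
print(soup.find("h1").text)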
Installation Guide
Selenium is a third-party library, and as such, you will need to install it before you can use it. Before installing Selenium, make sure you already have Python installed; you can get it from the official Python download page. For Selenium to work, you need to install the Selenium package and then the driver for the specific browser you want to automate. You can install the library using pip.
pip install selenium
Browser drivers exist for Chrome, Firefox, and many others; our focus in this article is on Chrome. If you don't have Chrome installed on your computer, you can download it from the official Google Chrome page. With Chrome installed, you can then go ahead and download the ChromeDriver binary from the official ChromeDriver download page.
Make sure you download the driver that matches the version of Chrome you have installed. The download is a zip file with the actual driver inside. Extract the driver (chromedriver.exe on Windows) and place it in the same folder as any Selenium script you are writing.
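If you would rather not rely on the driver sitting next to your script, you can point Selenium at the binary explicitly. Below is a minimal sketch assuming the Selenium 3.x API used throughout this article, where webdriver.Chrome accepts an executable_path argument (Selenium 4 uses a Service object instead); the path shown is only an example.

from selenium import webdriver

# Point Selenium at the extracted driver binary explicitly.
# "./chromedriver.exe" is an example Windows path; on Linux/macOS it would be "./chromedriver".
driver = webdriver.Chrome(executable_path="./chromedriver.exe")
driver.get("https://www.google.com/")
print(driver.title)
driver.quit()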
Selenium Hello World
As is the tradition in coding tutorials, we are starting this Selenium guide with the classic hello world program. The code does not scrape any data at this point; all it does is attempt to log into an imaginary Twitter account. Let's take a look at the code below.
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

username = "concanated"
password = "djhhfhfhjdghsd"

driver = webdriver.Chrome()
driver.get("https://twitter.com/login")

name_form = driver.find_element_by_name("session[username_or_email]")
name_form.send_keys(username)
pass_form = driver.find_element_by_name("session[password]")
pass_form.send_keys(password)
pass_form.send_keys(Keys.RETURN)

time.sleep(5)
driver.quit()
The username and password variables' values are dummies. When you run the above code, it will launch Chrome and open the Twitter login page. The username and password will be typed in and then submitted.
Because the username and password are not correct, Twitter displays an error message, and after 5 seconds, the browser is closed. As you can see from the above, you need to specify which web browser to automate, which we did when creating the webdriver.Chrome() instance. The get method sends a GET request. After the page has loaded successfully, we use the driver.find_element_by_name method to find the username and password input elements, and then use .send_keys to fill the input fields with the appropriate data.
Sending Web Requests
Sending web requests using Selenium is one of the easiest tasks to do. Unlike other tools that differentiate between POST and GET requests, in Selenium they are sent the same way: all that is required is for you to call the get method on the driver, passing the URL as an argument. Let's see how that is done below.
from selenium import webdriver

driver = webdriver.Chrome()

# visit the Twitter homepage
driver.get("https://twitter.com/")

# print the page source
print(driver.page_source)

driver.quit()
Running the code above will launch Chrome in automation mode, visit the Twitter homepage, and print the HTML source code of the page using driver.page_source. You will see a notification below the address bar telling you that Chrome is being controlled by automated test software.
Chrome in Headless Mode
In the examples above, Chrome is launched visibly – this is the headful approach, used mainly for debugging. If you are ready to run your script on a server or in a production environment, you won't want Chrome to open a window; you will want it to work in the background. Running Chrome this way, without a visible browser window, is known as headless mode. Below is how to run Selenium Chrome in headless mode.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Pay attention to the code below
options = Options()
options.headless = True

driver = webdriver.Chrome(options=options)

# visit the Twitter homepage
driver.get("https://twitter.com/")

# print the page source
print(driver.page_source)

driver.quit()
Running the code above will not launch Chrome for you to see – all you see is the source code of the page visited. The only difference between this code and the one before it is that this one runs in headless mode.
Accessing Elements on a Page
There are basically 3 things involved in web scraping – sending web requests, parsing page source, and then processing or saving the parsed data. The first two are usually the focus as they present more challenges.
You have already learned how to send web requests. Now let me show you how to access elements in order to parse data out of them or carry out a task with them. In the code above, we used the page_source attribute to access the page source. This is only useful when you want to parse with BeautifulSoup or another parsing library. If you want to use Selenium itself for parsing, you do not have to use page_source. Below are the options available to you.
- driver.title is for retrieving the page title.
- driver.current_url is for retrieving the URL of the page in view.
- driver.find_element_by_name is for retrieving an element by its name attribute, e.g., a password input with the name "password".
- driver.find_element_by_tag_name is for retrieving an element by tag name, such as a, div, span, body, h1, etc.
- driver.find_element_by_class_name is for retrieving an element by class name.
- driver.find_element_by_id is for retrieving an element by id.
For each of the find_element_by_* methods, there is a corresponding method that retrieves a list of elements instead of a single one, with the exception of find_element_by_id. For instance, if you want to retrieve all elements with the "thin-long" class, you can use driver.find_elements_by_class_name("thin-long") instead of driver.find_element_by_class_name("thin-long"). The only difference is that "element" becomes plural ("elements") in the method name.
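To make the distinction concrete, here is a short sketch using the same find_element_by_* style this article uses throughout; example.com and the "thin-long" class are placeholders only.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get("https://example.com/")  # "thin-long" is just an example class name

# Plural form: returns a list of every match (an empty list if there is none).
all_matches = driver.find_elements_by_class_name("thin-long")
print("matches found:", len(all_matches))

# Singular form: returns the first match or raises NoSuchElementException.
try:
    first_match = driver.find_element_by_class_name("thin-long")
    print(first_match.text)
except NoSuchElementException:
    print("no element with that class on this page")

driver.quit()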
Interacting with Elements on a Page
With the above, you can find specific elements on a page. However, you do not do that for its own sake; you will need to interact with the elements, either to trigger certain events or to retrieve data from them. Let's take a look at some of the interactions you can have with elements on a page using Selenium and Python.
- element.text retrieves the text attached to an element.
- element.click() triggers the click action and any events that follow it.
- element.send_keys("test text") is for filling input fields.
- element.is_displayed() detects whether an element is visible to real users; this is perfect for honeypot detection.
- element.get_attribute("class") retrieves the value of an element's attribute. You can swap the "class" keyword for any other attribute.
With the above, you have what is required to start scraping data from web pages. I will be using it to scrape the list of US states, their capitals, populations (census), and estimated populations from the Britannica website. Take a look at the code below.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Pay attention to the code below
options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)

driver.get("https://www.britannica.com/topic/list-of-state-capitals-in-the-United-States-2119210")

list_states = []
trs = driver.find_element_by_tag_name("tbody").find_elements_by_tag_name("tr")

for i in trs:
    tr = i.find_elements_by_tag_name("td")
    tr_data = []
    for x in tr:
        tr_data.append(x.text)
    list_states.append(tr_data)

print(list_states)
driver.quit()
Looking at the above, we put into practice almost everything we discussed earlier. Pay attention to the trs variable. If you look at the source code of the page, you will discover that the list of states and the associated information is contained in a table. Neither the table nor its body has a class.
Interestingly, it is the only table on the page, and as such, we can use the find_element_by_tag_name("tbody") method to retrieve the tbody element. Each row (tr) in the tbody element represents a state, with each piece of information embedded in a td element. We called find_elements_by_tag_name("td") to retrieve the td elements.
The first loop iterates through the tr elements. The second one iterates through the td elements of each tr element. element.text was used to retrieve the text attached to each element.
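The third stage mentioned earlier, processing or saving the parsed data, is not shown in the script above. A minimal way to persist list_states, assuming it is still in memory right after the scraping loop, is to write it to a CSV file; the header names below are taken from this article's description of the table and may not match Britannica's exact column titles.

import csv

# list_states is the list of rows built by the scraping loop above.
with open("us_state_capitals.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["state", "capital", "population (census)", "population (estimated)"])
    writer.writerows(list_states)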
You Have Learned the Basics: Now What?
From the above, we have shown you how to scrape a page using Selenium and Python. However, you need to know that what you have learned is just the basics; there is more to learn. You will need to know how to carry out other mouse and keyboard actions, for example.
Sometimes, filling an input with the whole text string at once will give away that the traffic is bot-originated. In instances like that, you will have to mimic typing by sending each letter one after the other, as sketched after this paragraph. With Selenium, you can also take a screenshot of a page, execute custom JavaScript, and carry out a lot of other automation tasks. I would advise you to learn more about Selenium WebDriver on the official Selenium website.
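Here is one rough sketch of that character-by-character typing idea; the helper function, the delay values, and the reuse of the login field name from the earlier hello-world example are all illustrative rather than a fixed recipe.

import random
import time
from selenium import webdriver

def type_like_a_human(element, text, min_delay=0.05, max_delay=0.3):
    # Send one character at a time with a small random pause between keystrokes.
    for character in text:
        element.send_keys(character)
        time.sleep(random.uniform(min_delay, max_delay))

driver = webdriver.Chrome()
driver.get("https://twitter.com/login")
# Field name reused from the hello-world example; Twitter may change it at any time.
name_form = driver.find_element_by_name("session[username_or_email]")
type_like_a_human(name_form, "concanated")
driver.quit()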
Conclusion
Selenium has its own drawback in terms of speed: it is slow. However, it has proven to be the best option when you need to scrape data from a JavaScript-rich website.
One thing you will come to like about Selenium is that it makes the whole scraping process easy, as you do not have to deal with cookies or replicate hard-to-replicate web requests. It is also easy to use.
Source: https://www.bestproxyreviews.com/selenium-web-scraping-python/