Machine learning requires a lot of data, and it is not always easy to get the data you want. Have you ever wondered where the huge datasets on Kaggle and similar websites come from? Web scraping is one common answer. So, let us see how we can extract data from the web.
Let’s assume we are building a model that requires movie information such as the title, summary, and rating of a number of movies. When it comes to movies, IMDb has one of the largest databases. Let us dig into it.
There’s a pattern in everything. We need to observe the HTML code of the web page and find a pattern that lets us extract the relevant data. Let’s go step by step. We will do everything in Python and scrape the data from the IMDb search URL used in step 2.
1. Install dependencies
# To download the webpage
pip install requests
# To scrape data from the downloaded webpage
pip install beautifulsoup4
# Parser that BeautifulSoup is told to use in step 3
pip install lxml
2. Download the webpage
“Requests” is a popular HTTP library for Python. We will use it to download the webpage at the given URL.
import requests

url = "https://www.imdb.com/search/title?release_date=2019&sort=user_rating,desc&ref_=adv_nxt"
# get() method downloads the entire HTML of the provided url
response = requests.get(url)
# Get the text from the response object
response_text = response.text
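A plain get() call returns a response even when the server answers with an error page, and it can wait indefinitely on a slow connection. The sketch below shows one way to guard against both. The function name download_html, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of the original code.

```python
import requests


def download_html(url, get=requests.get, timeout=10):
    """Return the HTML text of `url`, raising an exception on HTTP errors.

    `get` defaults to requests.get; it is a parameter only so the function
    can be exercised without network access.
    """
    # Hypothetical User-Agent value; some sites reject requests without one.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}
    response = get(url, headers=headers, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes.
    response.raise_for_status()
    return response.text
```

Calling download_html(url) with the URL above would then give the same text as response.text, but fail loudly if the download did not succeed.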
3. Inspecting elements and finding the pattern
The HTML we have downloaded is what the server sends to the browser. For parts of a page that are not built later by JavaScript, it matches what you see when you right-click and choose Inspect in the browser. Let’s right-click on the rating and see how we can extract it.
When we look closely, we will see that the element with the class “ratings-bar” contains the rating of the movie. If we inspect other movies, we will find that all the movies on that page use the same class name for their ratings. Here, we have found a pattern to extract all the ratings from the page. Similarly, we can extract the summary, title, genre, etc.
You are not limited to classes: you can also select a specific part of the HTML code using ids, tag names, etc.
Let’s jump into the code!
BeautifulSoup allows us to extract data (more precisely, parse data) from HTML using the class name, id, tags, etc. Isn’t it Beautiful? :-D
from bs4 import BeautifulSoup

# Create a BeautifulSoup object
# response_text -> The downloaded webpage
# lxml -> Used for processing HTML and XML pages
soup = BeautifulSoup(response_text, 'lxml')
To select content from the page we use CSS selectors. CSS selectors allow us to select classes, ids, tags, and other HTML elements easily. The CSS selector prefix for a class is "." and for an id it is "#". To select a class, we prefix "." to the class name we want to extract; similarly, for an id we prefix "#".
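Before applying this to the IMDb page, here is a small sketch of the three selector kinds on a made-up HTML snippet. The snippet and the names sample_html and sample_soup are assumptions for illustration. It uses the built-in "html.parser" instead of lxml so that it runs without an extra install.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only; it is not taken from IMDb.
sample_html = """
<div id="main">
  <p class="note">first note</p>
  <p class="note">second note</p>
  <span>plain span</span>
</div>
"""

# html.parser ships with Python, so no separate parser package is needed.
sample_soup = BeautifulSoup(sample_html, "html.parser")

notes = sample_soup.select(".note")           # every element with class "note"
main = sample_soup.select_one("#main")        # the element whose id is "main"
spans = sample_soup.select("span")            # every <span> tag
nested = sample_soup.select("#main p.note")   # <p class="note"> inside #main

print([n.get_text() for n in notes])  # ['first note', 'second note']
print(main.name)                      # div
print(len(spans), len(nested))        # 1 2
```

select() always returns a list, even for a single match, while select_one() returns the first match or None.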
# As we saw, the rating's class name is "ratings-bar";
# we prefix "." since it's a class
rating_class_selector = ".ratings-bar"
# Extract all the elements with the ratings class
rating_list = soup.select(rating_class_selector)
This “rating_list” is a list of objects, one for each <div> element that has “ratings-bar” as its class name. We need to get the text from within each div element.
Here’s what a single rating object looks like:
<div class="ratings-bar">
    <div class="inline-block ratings-imdb-rating" data-value="10" name="ir">
        <span class="global-sprite rating-star imdb-rating"></span>
        <strong>10.0</strong>
    </div>
    ...
</div>
We need to get the rating value from the <strong> tag. We can find a tag using the find('tagName') method and get its text using getText().
# This list will store all the ratings
ratings = []
# Iterate through all the rating objects
for rating_object in rating_list:
    # Find the <strong> tag and get its text
    rating_text = rating_object.find('strong').getText()
    # Append the rating to the list
    ratings.append(rating_text)
print(ratings)
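The loop above assumes every “ratings-bar” block contains a <strong> tag. If one does not, find('strong') returns None and calling getText() on it raises an AttributeError. The following is a minimal sketch of a more defensive version that also converts the text to a number. The function name extract_ratings and the second, empty sample block are assumptions for illustration, and "html.parser" is used so that the sketch needs no extra parser.

```python
from bs4 import BeautifulSoup

# Two sample blocks modeled on the markup shown above. The second one has
# no <strong> tag and stands in for a title without a rating (an assumption).
sample_html = """
<div class="ratings-bar">
  <div class="inline-block ratings-imdb-rating" data-value="10" name="ir">
    <span class="global-sprite rating-star imdb-rating"></span>
    <strong>10.0</strong>
  </div>
</div>
<div class="ratings-bar"></div>
"""


def extract_ratings(html):
    """Return one float per ratings-bar block, or None where no rating exists."""
    soup = BeautifulSoup(html, "html.parser")
    ratings = []
    for rating_object in soup.select(".ratings-bar"):
        strong_tag = rating_object.find("strong")
        if strong_tag is None:
            # Keep a placeholder so positions still line up with the movies.
            ratings.append(None)
            continue
        ratings.append(float(strong_tag.get_text(strip=True)))
    return ratings


print(extract_ratings(sample_html))  # [10.0, None]
```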
And we are done. Similarly, you can extract the title, summary, and genre using the above method with the appropriate class names and tag names.
You can store the data in a CSV or Excel file and use it for your machine learning model.
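One way to extend the method to several fields is to loop over each movie's container element and look up every field inside it, so that the values of one movie stay together. The sketch below runs on made-up markup: the class names "movie", "title", and "genre" are placeholders, not IMDb's real class names, and the helper names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: "movie", "title" and "genre" are placeholder class
# names chosen for this example, not the ones IMDb uses.
sample_html = """
<div class="movie">
  <h3 class="title">First Movie</h3>
  <span class="genre"> Drama </span>
  <div class="ratings-bar"><strong>8.1</strong></div>
</div>
<div class="movie">
  <h3 class="title">Second Movie</h3>
  <span class="genre"> Comedy </span>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match inside parent, or None."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


def extract_movies(html):
    """Return one dict per movie container found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for card in soup.select(".movie"):
        movies.append({
            "title": text_or_none(card, ".title"),
            "genre": text_or_none(card, ".genre"),
            "rating": text_or_none(card, ".ratings-bar strong"),
        })
    return movies


for movie in extract_movies(sample_html):
    print(movie)
```

Collecting each field in its own page-wide list would misalign the data as soon as one movie lacks a field; looking fields up per container avoids that.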
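For the CSV case, Python's built-in csv module is enough. This is a minimal sketch with made-up rows; the function name write_movies_csv, the column names, and the file name movies.csv are assumptions for illustration.

```python
import csv
import io

# Hypothetical rows standing in for scraped data.
movies = [
    {"title": "First Movie", "rating": "8.1"},
    {"title": "Second Movie", "rating": ""},
]


def write_movies_csv(rows, file_obj):
    """Write the rows to file_obj as CSV with a header line."""
    writer = csv.DictWriter(file_obj, fieldnames=["title", "rating"])
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer so the example leaves no file behind.
buffer = io.StringIO()
write_movies_csv(movies, buffer)
print(buffer.getvalue())

# To write to disk instead:
# with open("movies.csv", "w", newline="", encoding="utf-8") as f:
#     write_movies_csv(movies, f)
```

Passing newline="" when opening the file is what the csv module's documentation recommends, since the writer emits its own line endings.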
The full code is available on my GitHub: