There are two main ways to extract data from a website:
- Use the website's API (if one exists). For example, Facebook provides the Facebook Graph API, which allows retrieval of data posted on Facebook.
- Access the HTML of the web page and extract useful information/data from it. This technique is called web scraping, web harvesting, or web data extraction. This blog discusses the steps involved in web scraping using a Python library called Beautiful Soup.
We are going to use Python as our scraping language, together with a simple and powerful library called BeautifulSoup.
- For Mac users, Python comes pre-installed on OS X. Open up Terminal and type python --version. You should see your Python version (3.6 in my case).
- For Windows users, please install Python through the official website.
- Next, we need to get the BeautifulSoup library using pip, a package management tool for Python.
In the command prompt, type:
pip install BeautifulSoup4
Note: If the above command fails, try prefixing it with sudo.
Before we start scraping, we need to know the rules and etiquette that apply to scraping any site. Scraping data from an arbitrary site is not always legal, so please read the following points before scraping random sites:
- You should check a website’s Terms and Conditions before you scrape it. Be careful to read the statements about legal use of data. Usually, the data you scrape should not be used for commercial purposes.
- Do not request data from the website too aggressively with your program (also known as spamming), as this may break the website. Make sure your program behaves in a reasonable manner (i.e. acts like a human). One request for one webpage per second is good practice.
- The layout of a website may change from time to time, so make sure to revisit the site and rewrite your code as needed.
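The "one request per second" guideline above can be sketched as a small helper. This is a hypothetical function, not part of any library; the fetch argument stands in for requests.get so the pacing logic is visible on its own:

```python
import time

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between
    requests so the scraper stays polite (hypothetical helper)."""
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:  # no need to sleep after the last request
            time.sleep(delay)
    return results

# With the requests library you would pass fetch=requests.get, e.g.:
# pages = fetch_all(urls, fetch=requests.get, delay=1.0)
```

A fixed delay like this is the simplest approach; a production scraper might also honor the site's robots.txt and back off on error responses.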
Send an HTTP request to the URL of the webpage you want to access. The server responds to the request by returning the HTML content of the webpage. For this task, we will use a third-party HTTP library for Python called requests. Install requests from the command prompt:
pip install requests
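A minimal request looks like this. The URL here is example.com, a stand-in address; swap in the page you actually want to scrape:

```python
import requests

# example.com is a stand-in URL; replace it with the page you want to scrape
url = 'https://example.com/'
page = requests.get(url)

# a 200 status code means the server returned the page successfully
print(page.status_code)
print(page.text[:60])  # the raw HTML of the page, as a string
```

page.text holds the entire HTML document as a string, ready to be handed to a parser in the next step.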
Once we have accessed the HTML content, we are left with the task of parsing the data. Since most HTML data is nested, we cannot extract it through simple string processing; we need a parser that can create a nested/tree structure from the HTML. There are many HTML parser libraries available, but one of the most lenient is html5lib. Install html5lib from the command prompt:
pip install html5lib
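What makes html5lib useful is that it parses even malformed HTML the way a browser would. A short sketch, using a deliberately broken fragment:

```python
from bs4 import BeautifulSoup

# html5lib parses malformed HTML the way a browser would,
# closing the unclosed <p> tags and adding the missing <html>/<body> structure
broken_html = '<p>First<p>Second'
soup = BeautifulSoup(broken_html, 'html5lib')

print([p.text for p in soup.find_all('p')])  # ['First', 'Second']
```

Stricter parsers may handle such fragments differently, which is why the choice of parser matters for messy real-world pages.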
Now, all we need to do is navigate and search the parse tree we created, i.e. tree traversal. For this task, we will use another third-party Python library, Beautiful Soup. It is a Python library for pulling data out of HTML and XML files. We already installed bs4 with pip above.
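Before the full scraper, here is tree traversal in miniature. The table below is a made-up fragment standing in for the page we will scrape:

```python
from bs4 import BeautifulSoup

# a tiny hypothetical table, standing in for the real page
html = """
<table>
  <tr><td>India</td><td>Asia</td></tr>
  <tr><td>France</td><td>Europe</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# find_all returns every matching tag; .text gives the cell contents
cells = [td.text for td in soup.find_all('td')]
print(cells)  # ['India', 'Asia', 'France', 'Europe']
```

Note that find_all('td') flattens the table row by row, cell by cell; the full scraper below relies on exactly this ordering.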
We will extract data in the form of a table from the Worldometers site. The code is presented below step by step, with descriptions:
# importing modules
import requests
from bs4 import BeautifulSoup
import texttable as tt  # for pretty-printing the table (pip install texttable)

# URL for scraping data
url = 'https://www.worldometers.info/coronavirus/countries-where-coronavirus-has-spread/'

# get the HTML of the page
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

data = []

# soup.find_all('td') will scrape every cell in the page's table;
# data_iterator is the iterator of the table
data_iterator = iter(soup.find_all('td'))

# This loop will keep repeating as long as there is data available in the iterator
while True:
    try:
        country = next(data_iterator).text
        confirmed = next(data_iterator).text
        deaths = next(data_iterator).text
        continent = next(data_iterator).text

        # For 'confirmed' and 'deaths', remove the commas and convert to int
        data.append((
            country,
            int(confirmed.replace(',', '')),
            int(deaths.replace(',', '')),
            continent
        ))
    # StopIteration is raised when there are no more elements left to iterate through
    except StopIteration:
        break

# Sort the data by the number of confirmed cases
data.sort(key=lambda row: row[1], reverse=True)

# create a texttable object
table = tt.Texttable()
table.add_rows([(None, None, None, None)] + data)  # add an empty row at the beginning for the headers
table.set_cols_align(('c', 'c', 'c', 'c'))  # 'l' denotes left, 'c' denotes center, 'r' denotes right
table.header((' Country ', ' Number of cases ', ' Deaths ', ' Continent '))
print(table.draw())
A really nice thing about the BeautifulSoup library is that it is built on top of HTML parsing libraries like html5lib, lxml, and html.parser, so the BeautifulSoup object and the parser library can be chosen at the same time.
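This interchangeability can be seen directly: the parser name is just the second argument to the BeautifulSoup constructor, and the rest of the code is unchanged. A small sketch, assuming both html.parser (built in) and html5lib (installed earlier) are available:

```python
from bs4 import BeautifulSoup

html = '<ul><li>one</li><li>two</li></ul>'

# pass the parser name as the second argument; the BeautifulSoup
# interface stays the same no matter which parser does the work
for parser in ('html.parser', 'html5lib'):
    soup = BeautifulSoup(html, parser)
    print(parser, [li.text for li in soup.find_all('li')])
```

For well-formed HTML like this the parsers agree; they mostly differ in speed and in how they repair broken markup.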
So, this was a simple example of how to create a web scraper in Python. From here, you can try to scrape any other website of your choice. In case of any queries, post them in the comments section below.
Happy Coding! Cheers.