Basics of web scraping
Web scraping is a powerful technique for collecting data from a website that does not offer an API. Many websites expose useful information in their HTML but provide no API for retrieving it. If you want data from one of these sites, you can scrape it: download the page's HTML, parse it, and extract the parts you need. Python packages such as requests, BeautifulSoup, and Scrapy help with this process. In this post, I will explore the basics of using BeautifulSoup to get data from a website.
Why utilize web scraping?
There are many use cases where web scraping makes a task easier and more efficient. For example, a company might use web scraping for market research, going through websites and extracting prices, reviews, and product details. Gathering the same data manually would be a far slower and more laborious process. Another use case is aggregating content from multiple websites without manual entry. The uses for web scraping extend far beyond these examples, which makes it an essential tool for data collection.
Web scraping example
Here is a very simple example showing how you can use BeautifulSoup to scrape through a website and extract information.
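After you install the necessary packages, import them, retrieve the HTML from your website, and use BeautifulSoup to parse it. Here, headers is a dictionary of HTTP request headers (such as a User-Agent) that you define beforehand:
import requests
from bs4 import BeautifulSoup
html = requests.get("https://information.com/", headers=headers)
doc = BeautifulSoup(html.text, 'html.parser')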
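For a version you can run end to end, here is a minimal sketch of the same fetch-and-parse step. The User-Agent value, the timeout, and the helper names fetch_html and parse_html are my own assumptions for illustration, not something the site requires. The demo at the bottom parses a small inline HTML string so it works without a network connection.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical headers: the User-Agent value is only an example.
HEADERS = {"User-Agent": "my-scraper/0.1 (learning project)"}


def fetch_html(url, headers=HEADERS, timeout=10):
    """Download a page and return its HTML as text (makes a network request)."""
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


def parse_html(html_text):
    """Turn raw HTML text into a BeautifulSoup document using the built-in parser."""
    return BeautifulSoup(html_text, "html.parser")


# Parsing works the same on any HTML string, so a made-up sample is used here
# instead of a live request.
sample = "<html><body><h1 class='heading'>Hello</h1></body></html>"
doc = parse_html(sample)
print(doc.h1.get_text())  # prints: Hello
```

With a real site, you would call parse_html(fetch_html("https://information.com/")) instead of passing the sample string.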
This gives us a doc variable from which we can filter out unneeded information, leaving only the parts we are looking for. There are a couple of BeautifulSoup methods we'll use to further filter and simplify the HTML.
print(doc.select('.heading')[0].contents)
In this line of code, the .select() method returns a list of all elements that match the given CSS selector (which can target a class, an id, a tag name, and so on); here, '.heading' matches every element with the class heading. We then take the first match with [0] and read its contents attribute, which is a list of the element's children (text and nested tags) rather than a plain string. The contents attribute exists only on a single element, so calling it directly on the list returned by .select() raises an AttributeError. If you only want the text, you can use the get_text() method instead, which returns all of the text inside the element as a single string.
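To see how .select(), contents, and get_text() differ, here is a small sketch that runs on a made-up HTML snippet (the markup and variable names are assumptions for illustration only):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; a real page would come from requests.get(...).text
html = """
<div>
  <h2 class="heading">First <b>title</b></h2>
  <h2 class="heading">Second title</h2>
</div>
"""
doc = BeautifulSoup(html, "html.parser")

headings = doc.select(".heading")  # list-like result holding every match
first = headings[0]                # a single element

print(len(headings))      # 2
print(first.contents)     # the element's children: a text node and a <b> tag
print(first.get_text())   # First title

# select_one() returns the first match directly, or None if nothing matches
same_first = doc.select_one(".heading")

# Asking the whole result list for .contents fails: it is not a single element
try:
    headings.contents
except AttributeError:
    print("select() returned a list; index into it first")

# To get the text of every match, loop over the list
all_text = [h.get_text() for h in headings]
print(all_text)           # ['First title', 'Second title']
```

Indexing with [0] raises an IndexError when nothing matches, so select_one() plus a check for None can be a safer choice on pages you do not control.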
Of course, there would be many further steps to filter, manipulate, and analyze the data you scraped, but for simplicity's sake, this example just prints the contents of the element.
Storing and using scraped data
There are many useful libraries that let you store and work with the data you scraped from the web. After you have scraped a website, you can store the gathered data in a SQL database for structured access to all of your data. Once the data is in a table, other packages let you explore and visualize it. For example, pandas is a versatile library for exploring, visualizing, and manipulating the data you have stored. With pandas, you can analyze data using functions that calculate the mean, median, and other statistics, as well as correlations within the data. You can also plot the data, because pandas integrates with Matplotlib, another Python library. You can use pandas to clean your data by filling in missing values, dropping duplicates, and removing unnecessary information. It also provides functions for merging tables, so you can consolidate different pieces of information into one. As you can see, there are many tools you can use to get the most from the information you scraped.
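As a rough sketch of that workflow, the example below stores a few rows in SQLite (through Python's built-in sqlite3 module) and then cleans, analyzes, and merges them with pandas. The product rows, table name, and column names are invented for illustration. Filling a missing price with the median is just one possible cleaning choice.

```python
import sqlite3

import pandas as pd

# Made-up rows standing in for scraped results
# (note the duplicate row and the missing price)
scraped = [
    ("Widget", 9.99, 4.5),
    ("Gadget", None, 3.8),
    ("Widget", 9.99, 4.5),
    ("Gizmo", 24.50, 4.9),
]

# Store the rows in a SQL table; ":memory:" keeps the database in RAM,
# use a file path instead to keep the data between runs
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, price REAL, rating REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?, ?)", scraped)
conn.commit()

# Load the table into a pandas DataFrame
df = pd.read_sql_query("SELECT * FROM products", conn)
conn.close()

# Clean: drop exact duplicate rows, then fill the missing price with the median
df = df.drop_duplicates().reset_index(drop=True)
df["price"] = df["price"].fillna(df["price"].median())

# Analyze: summary statistics and a correlation between two columns
mean_price = df["price"].mean()
correlation = df["price"].corr(df["rating"])

# Merge: combine with a second (also made-up) table that shares the name column
stock = pd.DataFrame(
    {"name": ["Widget", "Gadget", "Gizmo"], "in_stock": [True, False, True]}
)
merged = df.merge(stock, on="name", how="left")
print(merged)
```

If Matplotlib is installed, a call such as df.plot(x="name", y="price", kind="bar") would chart the same data.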
I look forward to using web scraping in more complex ways in the future, and I hope I have convinced you of its immense value!