Introduction
We will be talking about:
- Spidering/scraping
- How to do it elegantly in Python
- Limitations and restrictions
In the previous posts, I shared some methods for text mining and analytics, but one of the most important tasks before any analysis is getting the data you want to analyze.
Text data is present all over in the form of blogs, articles, news, social feeds, posts, etc., and most of it is distributed to users through APIs, RSS feeds, bulk downloads and subscriptions.
Some sites do not provide any means of pulling the data programmatically; this is where scraping comes into the picture.
Note: Scraping information from sites that are not free or not publicly available can have serious consequences.
Web scraping is a technique of fetching a web page as HTML and parsing it to get the desired information.
HTML is quite complex in itself, with loose rules and a large number of attributes. Information can be scraped in two ways:
- Manually filtering using regular expressions (see the sketch below)
- Python's way: Beautiful Soup
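To get a feel for the manual route, here is a minimal sketch of the regex approach (the markup is just an illustrative snippet). It works for clean, predictable strings, but it breaks as soon as attributes are reordered, quoting changes or a tag spans several lines.

import re

# a hand-rolled pattern that pulls href values out of anchor tags;
# fine for toy markup, fragile on real-world HTML
html_doc = '<a href="https://www.google.com">Google</a> <a href="https://www.apple.com">Apple</a>'
links = re.findall(r'<a\s+href="([^"]+)"', html_doc)
print(links)  # ['https://www.google.com', 'https://www.apple.com']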
In this post, we will be discussing Beautiful Soup's way of scraping.
Beautiful Soup
As per the definition in its documentation
"Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching and modifying the parse tree. It commonly saves programmers hours or days of work."
If you have ever tried parsing text and HTML documents by hand, you will understand how brilliantly this module is built and how much programmer time it saves.
Let's start with Beautiful Soup.
Installation
Assuming Python is already installed on your system, Beautiful Soup can be installed with pip:
pip install beautifulsoup4
Getting Started
Problem 1: Getting all the links from a page.
For this problem, we will use a sample HTML string containing a few links; our goal is to extract all of them.
html_doc = """ <html> <body> <h1>Sample Links</h1> <br> <a href="https://www.google.com">Google</a> <br> <a href="https://www.apple.com">Apple</a> <br> <a href="https://www.yahoo.com">Yahoo</a> <br> <a href="https://www.msdn.com">MSDN</a> </body> </html> """
# import the package
from bs4 import BeautifulSoup

# create a BeautifulSoup object, passing two parameters:
# 1) the HTML to be scanned
# 2) the parser to use (html.parser, lxml, etc.)
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list of matching tags, in this case anchors
# (use find to get only the first match)
anchors = soup.find_all('a')

# get the links from the anchor tags
for a in anchors:
    # get() reads an attribute of a tag; a['href'] works as well
    print(a.get('href'))
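If everything is wired up correctly, the loop above should print the four link targets from the sample string, one per line:

https://www.google.com
https://www.apple.com
https://www.yahoo.com
https://www.msdn.com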
That's it: just 5-6 lines to pull any tag out of the HTML, iterate over the matches and read their attributes.
Can you imagine doing this with regular expressions? On real-world pages it would be one heck of a job, which gives you an idea of how well this module is engineered.
Talking about parsers (the second argument we passed while creating the Beautiful Soup object), we have multiple choices of parsers.
This table summarizes the advantages and disadvantages of each parser library:
Parser | Typical usage | Advantages | Disadvantages
--- | --- | --- | ---
Python’s html.parser | BeautifulSoup(markup, "html.parser") | Batteries included, decent speed, lenient (as of Python 2.7.3 and 3.2) | Not as lenient as html5lib on older Python versions
lxml’s HTML parser | BeautifulSoup(markup, "lxml") | Very fast, lenient | External C dependency
lxml’s XML parser | BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml") | Very fast, the only currently supported XML parser | External C dependency
html5lib | BeautifulSoup(markup, "html5lib") | Extremely lenient, parses pages the same way a web browser does, creates valid HTML5 | Very slow, external Python dependency
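html.parser ships with Python, while lxml and html5lib are external packages that have to be installed separately (for example with pip install lxml or pip install html5lib). Once installed, you pick a parser simply by naming it when building the soup; a small sketch:

from bs4 import BeautifulSoup

# same markup, different parser; lxml is usually the fastest choice
soup_lxml = BeautifulSoup(html_doc, "lxml")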
Other Methods and Usage
Beautiful Soup is a vast library and can do, in a single line, things that would otherwise be quite difficult.
Some of the methods for searching tags in HTML are:
import re

# find by id
soup.find(id='abc')

# find through a regex (tag names starting with 'a'),
# limiting the result to 2 tags
soup.find_all(re.compile("^a"), limit=2)

# find multiple tags at once
soup.find_all(['a', 'h1'])

# find by custom or built-in attributes
soup.find_all(attrs={'data': 'abc'})
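Beautiful Soup also understands CSS selectors through select(), which is often the most compact way to express a query. A small sketch against the sample document from Problem 1:

# select() takes a CSS selector and returns a list of matching tags
soup.select('a')                   # all anchor tags
soup.select('a[href^="https"]')    # anchors whose href starts with "https"
soup.select('body > h1')           # h1 tags that are direct children of body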
Problem 2: Scraping a live web page.
In the above example we parsed an HTML string; now we will see how to hit a URL, get the HTML of that page and parse it in exactly the same way.
For this we will be using Python's urllib3 package. It can be installed with the following command:
pip install urllib3
Documentation for urllib3 can be seen here.
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

# hit the URL
r = http.request('GET', 'https://en.wikipedia.org/wiki/India')

# create a soup object from the HTML returned by the request
soup = BeautifulSoup(r.data, "html.parser")

# get the whole text of the wiki page
text = soup.text

# get all the links from the wiki page
links = soup.find_all('a')

# iterate over the linked pages and get their text;
# this can be done recursively to parse a large number of pages
for link in links:
    href = link.get('href')
    # skip anchors with no href or ones that don't point to another article
    if not href or not href.startswith('/wiki/'):
        continue
    new_url = 'https://en.wikipedia.org' + href
    r_new = http.request('GET', new_url)
    # do something with the new page
    new_text = BeautifulSoup(r_new.data, "html.parser").text

# get the source of the first image on the page
src = soup.find('img').get('src')
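One caveat about the URL join above: hrefs on a real page can be relative paths, fragments or full URLs, so plain string concatenation is fragile. A small sketch of a more robust way to resolve them with the standard library's urljoin:

from urllib.parse import urljoin

base_url = 'https://en.wikipedia.org/wiki/India'

# urljoin resolves root-relative, fragment and absolute hrefs correctly
print(urljoin(base_url, '/wiki/Himalayas'))      # https://en.wikipedia.org/wiki/Himalayas
print(urljoin(base_url, '#History'))             # https://en.wikipedia.org/wiki/India#History
print(urljoin(base_url, 'https://example.com'))  # https://example.com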
This was just a basic introduction to web scraping using Python. Much more can be achieved using the packages used in this tutorial. This article can serve as a starting point.
Points to Remember
Web scraping is very useful for gathering data for different purposes like data mining, knowledge creation and data analysis, but it should be done with care.
As a basic rule of thumb, we should not scrape anything that is paid content. That being said, we should also comply with the site's robots.txt file to know which areas may be crawled.
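The standard library can do the robots.txt check for you; a minimal sketch using urllib.robotparser, with the Wikipedia page from Problem 2 as an example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://en.wikipedia.org/robots.txt')
rp.read()

# can_fetch reports whether a given user agent may crawl a given URL
print(rp.can_fetch('*', 'https://en.wikipedia.org/wiki/India'))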
It is very important to look into the legal implications before scraping.
Hope the article was informative.
TechScouter (JSC)
Top comments (2)
Very good thoughts and an interesting topic. It would be fair, though, to mention Scrapy here? It's a kind of high-level lib for the aforementioned.
Hi,
This was a basic-level introduction to web scraping for those who want to start exploring. Scrapy is no doubt a high-level tool. Will cover it soon.