Jordan Kalebu

A beginner guide to webscraping in Python

Hi guys,

In this article, you'll learn the basics of web scraping in Python and build a demo project that scrapes quotes from a website, so I suggest you read it to the end.

What is web scraping?

Web scraping means extracting data from a website programmatically. With it you can extract the text inside HTML tags, download images and files, and do almost anything you would otherwise do by copying and pasting manually, only much faster.

Should you learn web scraping?

Yeah, absolutely. As a programmer, you will often need content found on other people's websites, and those websites don't always provide an API for it. That's when web scraping comes in handy.

Requirements

To follow this tutorial, you need the following libraries installed on your machine:

  • requests
  • beautifulsoup4

Installation

You can install both libraries with the pip command, as shown below:

$ pip install requests 
$ pip install beautifulsoup4

Basics of requests

Requests is an elegant and simple HTTP library for Python, built for human beings. It lets you send HTTP requests (GET, POST, PUT, DELETE) to a website in an easy way.

We are going to use the requests library in our demo project to send a GET request to the website and get its HTML source code.
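
As a minimal sketch of such a GET request, the helper below wraps requests.get and returns the page's HTML as a string. The function name fetch_html and the 10-second timeout are illustrative choices for this example, not something the library prescribes.

```python
import requests


def fetch_html(url, timeout=10):
    """Send a GET request and return the response body as text."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of
    # silently returning the HTML of an error page.
    response.raise_for_status()
    return response.text
```

Calling fetch_html('http://quotes.toscrape.com/') should give you the HTML string of that page, ready to be handed to a parser.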

Basics of BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works on top of a parser (such as html.parser from Python's standard library) and gives us a way to navigate the HTML source code and extract the content we need.

To pull data from HTML or XML, we first convert its string representation into a BeautifulSoup object, which provides plenty of methods for searching and manipulating it.
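
To make that conversion concrete, here is a small sketch that parses an inline HTML string (made up for this example) instead of a file, then reads a few values from the resulting object.

```python
from bs4 import BeautifulSoup

# A tiny HTML string invented for this example.
html = (
    "<html><head><title>Document</title></head>"
    "<body><p id='greeting'>Hello</p></body></html>"
)

# 'html.parser' is the parser that ships with Python's standard library.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.string    # text inside the <title> tag
first_paragraph = soup.find("p")  # first <p> tag, or None if there is none

print(title_text)
print(first_paragraph.text)
print(first_paragraph["id"])      # attributes are read like dictionary keys
```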

Let's get our hands dirty with some code

Let's use the BeautifulSoup library to extract data from the HTML file below, sample.html.

<!DOCTYPE html>
<html>
<head>
    <title>Document</title>
</head>
<body>
    <div id = 'quotes'>
        <p id = 'normal'>Time the time before the time times you</p>
        <p id = 'normal'>The Future is now </p>
        <p id = 'special'>Be who you wanted to be when you're younger</p>
        <p id = 'special'>The world is a reflection of who you're</p>
    </div>
    <div>
        <p id = 'Languages'>Programming Languages</p>
        <ul>
            <li>Python</li>
            <li>C+++</li>
            <li>Javascript</li>
            <li>Golang</li>
        </ul>
    </div>
</body>
</html>

Extracting all paragraphs in HTML

Let's extract all paragraphs from the sample.html shown above using BeautifulSoup:

from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for paragraph in soup.find_all('p'):
    print(paragraph.text)

Output

Running the program above produces the following result:

$ python app.py 
Time the time before the time times you
The Future is now 
Be who you wanted to be when you're younger
The world is a reflection of who you're
Programming Languages

Code Explanation

  • Importing the BeautifulSoup library
from bs4 import BeautifulSoup

  • Creating a BeautifulSoup object from the HTML string
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

These two lines read sample.html and create a BeautifulSoup object ready for extracting data.

  • Finding all paragraphs and printing them
for paragraph in soup.find_all('p'):
    print(paragraph.text)

We used BeautifulSoup's find_all() method to extract all the paragraphs in the HTML file. It accepts the name of an HTML tag as a parameter, searches the parsed document for every tag with that name, and returns them.
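
To see what find_all() hands back, the sketch below runs it on a small inline HTML string (invented for this example). It also shows the limit argument and the related find() method, which returns only the first match.

```python
from bs4 import BeautifulSoup

html = """
<div>
    <p>first</p>
    <p>second</p>
    <p>third</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraphs = soup.find_all("p")          # list-like collection of every <p> tag
texts = [p.text for p in paragraphs]

first_two = soup.find_all("p", limit=2)  # stop searching after two matches
first = soup.find("p")                   # first match only
missing = soup.find_all("table")         # no match gives an empty result, not an error

print(texts)
print(len(first_two), first.text, len(missing))
```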

Extracting all elements in the list from the HTML

To extract the list items instead of the paragraphs, pass the tag name li instead of p to the find_all() method, as shown below in app.py:

from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for item in soup.find_all('li'):
    print(item.text)

Output

$ python app.py
Python
C+++
Javascript
Golang

Extracting paragraphs with a specific id

Apart from returning every tag of a given name in the HTML string, we can also check the attributes of those tags so that we extract only specific ones, as shown below:

  • Extract paragraphs with an id of normal
from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for paragraph in soup.find_all('p'):
    # .get() returns None instead of raising KeyError when a tag has no id
    if paragraph.get('id') == 'normal':
        print(paragraph.text)

Output

$ python app.py 
Time the time before the time times you
The Future is now 
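
Instead of checking the attribute inside the loop, find_all() can also filter on attributes itself. The sketch below uses an inline copy of part of sample.html so that it runs on its own.

```python
from bs4 import BeautifulSoup

html = """
<div id='quotes'>
    <p id='normal'>Time the time before the time times you</p>
    <p id='normal'>The Future is now </p>
    <p id='special'>Be who you wanted to be when you're younger</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword arguments are matched against tag attributes.
normal = [p.text for p in soup.find_all("p", id="normal")]

# The attrs dictionary is an equivalent way to write the same filter.
special = [p.text for p in soup.find_all("p", attrs={"id": "special"})]

print(normal)
print(special)
```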

Demo Project

So far we have seen how to extract data from an HTML file in our local directory. Now let's see how to extract data from a live website.

Quotes spider

In this project, we are going to implement a web scraper that scrapes quotations from a website at a given URL.

We will use the requests library to pull the HTML from the website and then parse that HTML using BeautifulSoup.

Website of Interest (WOI)

In our demo project, we are going to scrape the quotes from quotes.toscrape.com

Demo project source code

Not much changes in the source code of our demo project, except that this time we obtain the HTML source code from a website using the requests module instead of reading it from a file.

import requests
from bs4 import BeautifulSoup

html = requests.get('http://quotes.toscrape.com/').text
soup = BeautifulSoup(html, 'html.parser')

for span in soup.find_all('span'):
    # .string is None when a span holds more than a single piece of text
    if span.string:
        print(span.string)

Output

$ python scraper.py 
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
"It is our choices, Harry, that show what we truly are, far more than our abilities."
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
"Try not to become a man of success. Rather become a man of value."
"It is better to be hated for what you are than to be loved for what you are not."
"I have not failed. I've just found 10,000 ways that won't work."
"A woman is like a tea bag; you never know how strong it is until it's in hot water."
"A day without sunshine is like, you know, night."
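
The loop above prints every span that holds a single string. A more targeted variant can select quotes by CSS class. The sketch below assumes each quote sits in a div with class quote, with the text in a span with class text and the author in a small tag with class author. Check these class names against the live page before relying on them. The function name parse_quotes and the inline sample markup are made up for illustration, so the example runs without a network connection.

```python
from bs4 import BeautifulSoup

# Inline sample that mimics the ASSUMED markup of the page (not fetched from the site).
SAMPLE = """
<div class="quote">
    <span class="text">Example quote.</span>
    <span>by <small class="author">Example Author</small></span>
</div>
<div class="quote">
    <span class="text">A quote block without an author is skipped.</span>
</div>
"""


def parse_quotes(html):
    """Return a list of (text, author) tuples found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # class_ (with a trailing underscore) filters on the HTML class attribute.
    for block in soup.find_all("div", class_="quote"):
        text = block.find("span", class_="text")
        author = block.find("small", class_="author")
        # Skip blocks that are missing either part.
        if text is not None and author is not None:
            results.append((text.get_text(strip=True), author.get_text(strip=True)))
    return results


quotes = parse_quotes(SAMPLE)
print(quotes)
```

To run it against the live site, pass requests.get('http://quotes.toscrape.com/').text to parse_quotes instead of SAMPLE.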

The original article can be found on kalebujordan.com

I hope you found it interesting. Now share it with your fellow developers on Twitter and other dev communities.

Top comments (3)

Petar Petrov

Scrapy and Selenium are also worth mentioning.

Specifically Selenium, for cases where the data is behind some JS or is browser-generated and BeautifulSoup isn't enough.

Jordan Kalebu

Yeah sure,
Thanks for pointing that out

Selenium deserves a special place when it comes to web scraping

Scrapy is also really great and well structured for crawling and scraping. I guess BeautifulSoup turns out to be friendlier for beginners.

Petar Petrov

I also forgot to mention that combining them is very powerful, for example Selenium with BeautifulSoup.