DEV Community

Alexander Maina

WEB SCRAPING USING PYTHON AND BEAUTIFUL SOUP.

I once submitted a back-end developer job application through a certain website. The website never sent me updates by email, and the chance to apply came up only occasionally. As a result, I had to visit the site every day and keep scrolling and checking. Until I discovered web scraping, I was unable to take advantage of an opportunity the following season.
Because of that, I feel the need to share what I know about web scraping with Python.

What is Web Scraping?

Web scraping means using a program to download and process content from the web.

Interestingly, copying and pasting content from a web page by hand is a basic form of web scraping. In practice, however, web scraping involves automation.

What is Beautiful Soup?

It is a Python module that parses (analyzes and identifies the parts of) HyperText Markup Language (HTML), the language in which web pages are written.

Beautiful Soup does not come installed with Python, so it needs to be installed before first use.

The BeautifulSoup module’s name is bs4 (for Beautiful Soup, version 4).

1.0 Installing Beautiful Soup and Requests Library.

Open the command line. The pip commands below are typed at the shell prompt, not inside the Python interpreter.

To install Beautiful Soup, type the following command on the command line:

pip install beautifulsoup4

While beautifulsoup4 is the name used for installation, we use bs4 to import Beautiful Soup.
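A quick way to confirm the installation worked is to import the package under its import name and print its version. This is only a sanity check, and it assumes beautifulsoup4 was installed into the same Python environment you are running.

```python
# The package is installed as "beautifulsoup4" but imported as "bs4".
import bs4
from bs4 import BeautifulSoup

# A version string starting with "4" confirms Beautiful Soup 4 is installed.
print(bs4.__version__)
```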

We also need to install the requests Python library by typing the following command on the command line:

pip install requests

2.0 Scraping the Page

In this section, we now need to get the contents of the candidate page. We will use https://www.xyz.com as an example of our candidate page.

import requests
URL = "https://www.xyz.com"
response = requests.get(URL)

This returns a Response object. The HTML of the https://www.xyz.com page, including all of its elements and attributes, is available as response.content (bytes) or response.text (a string).
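A slightly more defensive version of the request is sketched below. The helper name fetch_html, the 10-second timeout, and the choice to return None on failure are assumptions made for illustration; they are not part of the original example.

```python
from typing import Optional

import requests


def fetch_html(url: str, timeout: float = 10.0) -> Optional[str]:
    """Return the page's HTML as text, or None if the request fails.

    fetch_html is a hypothetical helper; the timeout value is an assumption.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions instead of parsing an error page.
        response.raise_for_status()
    except requests.exceptions.RequestException:
        # Covers connection errors, timeouts, bad status codes, and malformed URLs.
        return None
    return response.text
```

Calling fetch_html("https://www.xyz.com") would then give back the HTML string on success, and the caller can check for None before handing the result to a parser.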

3.0 Parsing the HTML page content.

The response is a long block of HTML. Parsing it with Python makes it easier to navigate and lets you select just the information you need.

First, import Beautiful Soup, then create a variable to store the parsed content.

import requests
from bs4 import BeautifulSoup

URL = "https://www.xyz.com"
response = requests.get(URL)
soup = BeautifulSoup(response.content, "html.parser")

"html.parser" selects the HTML parser from Python's standard library, so nothing extra needs to be installed. Beautiful Soup uses it to build a parse tree from the HTML content, which can then be used to extract information from the page.
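To see what the parse tree gives you without making a network request, the sketch below parses a small hand-written HTML string. The markup and its contents are made up for illustration.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet standing in for a downloaded page.
html = """
<html>
  <head><title>Job Board</title></head>
  <body>
    <h1>Open positions</h1>
    <p class="intro">Updated daily.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup object.
print(soup.title.string)   # text inside <title>
print(soup.h1.get_text())  # text inside the first <h1>
print(soup.p["class"])     # class is multi-valued, so this is a list
```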

4.0 Find element by attribute.

Suppose the job listings sit inside a div with id="available". We need to search the parsed page for the element with that id.

import requests
from bs4 import BeautifulSoup

URL = "https://www.xyz.com"
response = requests.get(URL)
soup = BeautifulSoup(response.content, "html.parser")  # parse the response fetched above
job = soup.find(id="available")

This returns the element with id="available", which contains all the available jobs, or None if no such element exists. Below is an example:

<div id = "available">
   <!--Job listings-->
</div>
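As a sketch of how this might look on a fuller page, the example below first finds the container by its id and then searches inside it. The inner markup (the job class, the h2 titles, and the second div) is an assumption, since the example above only shows the empty container.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the outer div with id="available" comes from the article.
html = """
<div id="available">
  <div class="job"><h2>Python Developer</h2></div>
  <div class="job"><h2>Back-end Engineer</h2></div>
</div>
<div id="closed">
  <div class="job"><h2>Data Analyst</h2></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching element, or None if nothing matches.
container = soup.find(id="available")

# Searching from the container limits results to its descendants.
titles = []
if container is not None:
    titles = [h2.get_text(strip=True) for h2 in container.find_all("h2")]

print(titles)
```

Calling find_all() on the container rather than on soup is what keeps the listing from the other div out of the results.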

You can also pass extra arguments to find_all() to make the search more specific. For example:

job = soup.find_all("p", string="Posts")

The above code scans the paragraphs for one whose text is exactly the string Posts. If the text is misspelt or typed in a different case, find_all() returns an empty list. To handle differences in case, we can pass a lambda function as follows:

python_job = soup.find_all(
    "h2",
    # text can be None when an h2 holds nested tags, so guard before lower()
    string=lambda text: text is not None and "python" in text.lower(),
)

The above code scans every h2 on the page, converts its text to lower case, and returns those that contain the substring "python".
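The difference between an exact string match and a function-based match can be seen in this small sketch. The HTML and the helper name mentions_python are made up. The None check is an addition: in at least some Beautiful Soup versions the value handed to the function can be None (for example, when an h2 contains nested tags alongside text), and the guard avoids an AttributeError in that case.

```python
from bs4 import BeautifulSoup

# Made-up headings, including one with nested tags.
html = """
<h2>Senior Python Developer</h2>
<h2>PYTHON Intern</h2>
<h2>Java Engineer</h2>
<h2><span>Python</span> and <b>Go</b> Developer</h2>
"""

soup = BeautifulSoup(html, "html.parser")

# Exact match: only an h2 whose whole text is "Python" would qualify.
exact = soup.find_all("h2", string="Python")


def mentions_python(text):
    """Case-insensitive substring check that tolerates a missing string."""
    return text is not None and "python" in text.lower()


flexible = soup.find_all("h2", string=mentions_python)

print(len(exact))
print([h2.get_text() for h2 in flexible])
```

Note that the last heading is skipped even though it mentions Python, because its text is split across nested tags. Filtering on get_text() in a list comprehension is one way to catch such cases.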

5.0 Display the available jobs.

Now print the available jobs.

print(job.text)

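You can also use len() to see the number of jobs found. Call it on a find_all() result, such as python_job above, because len(job.text) would count characters rather than jobs.

print(len(python_job))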
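Putting the display step together, the sketch below prints each title and then the number of matches. The HTML is again invented for illustration; on a live page the soup object would come from the parsed response.

```python
from bs4 import BeautifulSoup

# Invented listings inside the container used earlier in the article.
html = """
<div id="available">
  <h2>Python Developer</h2>
  <h2>Back-end Engineer</h2>
  <h2>Python Data Engineer</h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list-like result, one entry per matching tag.
jobs = soup.find(id="available").find_all("h2")

for job in jobs:
    print(job.get_text(strip=True))

# len() on the result counts tags, not characters.
print(len(jobs))
```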

I hope that this step-by-step guide has instilled new skills in you. Happy Coding!!!

Top comments (3)

Dev Geos

One error dear
The response variable must be put as parameter not page.content.
So the correct way is :

response = requests.get(URL)
soup = BeautifulSoup (response.content, "html.parser")
 
Alexander Maina

I agree with you.
Thank you for the correction.
Error rectified.

Alexander Maina

Any recommendation about the article will be highly appreciated.
