DEV Community

Betty Kamanthe


Simple Web Scraping Project Using Python and Beautiful Soup

Web scraping a shopping site

Introduction

Web scraping is an automated method for extracting large amounts of data from websites. Data on web pages is largely unstructured; web scraping collects it and stores it in a structured form.

In this project I will show you how to scrape data from Jumia (https://www.jumia.co.ke/), a Kenyan online shopping site. The data we gather can be used for price comparison.

Website Inspection

The aim of this project is to scrape all products along with their prices and ratings. First, we need to inspect the website:

1. Visit https://www.jumia.co.ke/all-products/

2. Right-click and select Inspect, or press Ctrl+Shift+I, to open the browser's developer tools.

3. Move the cursor around until a product is selected. Then look for the div tag that contains the name, price and rating of the product.

Write the code
We start by importing the necessary libraries:

from bs4 import BeautifulSoup
import requests

The requests library will make a GET request to a web server, which will download the HTML contents of a given web page for us.

jumia = requests.get('https://www.jumia.co.ke/all-products/')

Parsing a page using BeautifulSoup

soup = BeautifulSoup(jumia.content , 'html.parser')
products = soup.find_all('div' , class_ = 'info')

Use the find_all method, which will find all the instances of the div tag that has a class called 'info' on the page.
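
The difference between find_all and find can be tried out without contacting the site. The sketch below runs both on a small, made-up HTML string; the markup and product names are assumptions for illustration and may not match the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics the structure described above; the real page may differ.
SAMPLE_HTML = """
<div class="info"><h3 class="name">Phone A</h3><div class="prc">KSh 10,000</div></div>
<div class="info"><h3 class="name">Phone B</h3><div class="prc">KSh 12,500</div></div>
<div class="other">Not a product</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all returns a list-like collection of every matching tag.
products = soup.find_all("div", class_="info")

# find returns only the first match, or None when nothing matches.
first_name = soup.find("h3", class_="name")
missing = soup.find("div", class_="stars _s")

print(len(products))
print(first_name.text)
print(missing)
```

Only the two div tags with the class info are returned; the third div is skipped because its class is different.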

We now extract the name, price and rating. To get only the first instance of a tag, use the find method, which returns a single Tag object (or None if nothing matches):

# Take the first product on the page as an example
product = products[0]
Name = product.find('h3' , class_="name").text.replace('\n', '')
Price = product.find('div' , class_= "prc").text.replace('\n', '')
Rating = product.find('div', class_='stars _s').text.replace('\n', '')

replace() is a built-in Python string method that returns a copy of the string in which all occurrences of a substring are replaced with another substring. Here it strips newline characters from the extracted text.
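
Since the scraped prices are text, they need one more cleaning step before they can be compared. The sketch below shows replace() used for both jobs. The helper names clean_text and price_to_int, and the 'KSh 1,299' price format, are assumptions made for this example rather than something taken from the site.

```python
def clean_text(raw: str) -> str:
    """Remove newlines and surrounding whitespace from scraped text."""
    return raw.replace("\n", "").strip()


def price_to_int(price_text: str) -> int:
    """Convert a price such as 'KSh 1,299' to the integer 1299.

    Assumes a 'KSh' prefix and comma thousands separators; other
    formats (price ranges, decimals) are not handled here.
    """
    digits = price_text.replace("KSh", "").replace(",", "").strip()
    return int(digits)


name = clean_text("\nSample Phone 64GB\n")
price = price_to_int("KSh 1,299")
print(name, price)
```

Because replace() returns a new string each time, the calls can be chained as in price_to_int.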

We can now loop over all products on the page to extract the name, price and rating.

for product in products:
      Name = product.find('h3' , class_="name").text.replace('\n', '')
      Price = product.find('div' , class_= "prc").text.replace('\n', '')
      Rating = product.find('div', class_='stars _s').text.replace('\n', '')

      info = [ Name, Price,Rating]
      print(info)

Note that we are storing all these in a list called info.
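
One caveat with the loop above: find returns None when no matching tag exists, so calling .text on a product that lacks one of the elements raises an AttributeError. Below is a minimal sketch of a guard, run on made-up HTML rather than the live site. The helper name text_or_default and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; the second product has no rating element.
SAMPLE_HTML = """
<div class="info">
  <h3 class="name">Phone A</h3><div class="prc">KSh 10,000</div>
  <div class="stars _s">4.5 out of 5</div>
</div>
<div class="info">
  <h3 class="name">Phone B</h3><div class="prc">KSh 12,500</div>
</div>
"""


def text_or_default(parent, tag, css_class, default="None"):
    """Return the cleaned text of the first matching child, or a default."""
    element = parent.find(tag, class_=css_class)
    if element is None:
        return default
    return element.text.replace("\n", "").strip()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for product in soup.find_all("div", class_="info"):
    rows.append([
        text_or_default(product, "h3", "name"),
        text_or_default(product, "div", "prc"),
        text_or_default(product, "div", "stars _s"),
    ])

print(rows)
```

Checking for None explicitly is one option; wrapping the lookup in try/except is another and behaves the same for this case.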

Loop over all pages
So far we have only scraped data from the first page. The site has 50 pages, and when you click through to the second page you will notice that the URL changes to include the page number. To build the URL for a given page we do this:

url = "https://www.jumia.co.ke/all-products/" + "?page=" +str(page)+"#catalog-listing"

That is a simple string concatenation. The code to loop through all the pages is:

for page in range(1,51):
  url = "https://www.jumia.co.ke/all-products/" + "?page=" +str(page)+"#catalog-listing"
  furl = requests.get(url)
  jsoup = BeautifulSoup(furl.content , 'html.parser')
  products = jsoup.find_all('div' , class_ = 'info')

  for product in products:
      Name = product.find('h3' , class_="name").text.replace('\n', '')
      Price = product.find('div' , class_= "prc").text.replace('\n', '')
      try:
        Rating = product.find('div', class_='stars _s').text.replace('\n', '')
      except AttributeError:
        # find() returned None: this product has no rating
        Rating = 'None'

      info = [ Name, Price,Rating]
      print(info)

The range() function goes up to, but does not include, the stop value. Since the website has 50 pages, the range stops at 51.
Some products have no rating, so we wrap that lookup in a try/except block and store the string 'None' in that case.
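
The page range described above can be checked without making any requests. The sketch below only builds the URLs; the names BASE_URL and page_url are introduced here for illustration, and the f-string is an equivalent alternative to the string concatenation used in the loop.

```python
BASE_URL = "https://www.jumia.co.ke/all-products/"


def page_url(page: int) -> str:
    """Build the listing URL for one page, following the pattern shown above."""
    return f"{BASE_URL}?page={page}#catalog-listing"


# range(1, 51) yields 1, 2, ..., 50; the stop value 51 is excluded.
urls = [page_url(page) for page in range(1, 51)]

print(len(urls))
print(urls[0])
print(urls[-1])
```

The list holds exactly 50 URLs, one per page, which confirms that a stop value of 51 is what covers page 50.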

Saving to csv

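To save the results, first collect every info list in one list inside the loop (called rows here, a name introduced for this step, filled with rows.append(info)), then build a DataFrame from it:

import pandas as pd

# rows is the list of [Name, Price, Rating] lists collected in the loop
df = pd.DataFrame(rows, columns=['Product Name', 'Price', 'Rating'])
df.to_csv('products.csv', index=False, encoding='utf-8')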
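
If pandas is not available, the same file can be produced with Python's standard csv module. This is a swapped-in alternative, not the approach used in this post, and the sample rows and the function name write_products_csv are hypothetical.

```python
import csv
import io

# Hypothetical rows standing in for the [Name, Price, Rating] lists
# collected inside the scraping loop.
rows = [
    ["Phone A", "KSh 10,000", "4.5 out of 5"],
    ["Phone B", "KSh 12,500", "None"],
]


def write_products_csv(rows, file_obj):
    """Write a header line followed by one line per product."""
    writer = csv.writer(file_obj)
    writer.writerow(["Product Name", "Price", "Rating"])
    writer.writerows(rows)


# Writing to an in-memory buffer keeps the example free of side effects;
# pass open("products.csv", "w", newline="", encoding="utf-8") to write a file.
buffer = io.StringIO()
write_products_csv(rows, buffer)
print(buffer.getvalue())
```

Prices that contain a comma are quoted automatically by the writer, so they still load as a single column.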

The whole code

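from bs4 import BeautifulSoup
import pandas as pd
import requests

rows = []  # one [Name, Price, Rating] list per product

for page in range(1, 51):
    url = "https://www.jumia.co.ke/all-products/" + "?page=" + str(page) + "#catalog-listing"
    furl = requests.get(url)
    jsoup = BeautifulSoup(furl.content, 'html.parser')
    products = jsoup.find_all('div', class_='info')

    for product in products:
        Name = product.find('h3', class_="name").text.replace('\n', '')
        Price = product.find('div', class_="prc").text.replace('\n', '')
        try:
            Rating = product.find('div', class_='stars _s').text.replace('\n', '')
        except AttributeError:
            # find() returned None: this product has no rating
            Rating = 'None'

        info = [Name, Price, Rating]
        print(info)
        rows.append(info)

# Save everything that was collected to a CSV file
df = pd.DataFrame(rows, columns=['Product Name', 'Price', 'Rating'])
df.to_csv('products.csv', index=False, encoding='utf-8')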

Conclusion

This is a simple beginner's web scraping project and a first step into data analytics. All the best in your journey.

Discussion (8)

John Muriu

This is a great piece of work

Betty Kamanthe Author

Thank you

DANCAN SIKUKU

Good work @betty1999kamanthe

Betty Kamanthe Author

Thank you

Ess_codes

Amazing stuff, thanks for sharing

Betty Kamanthe Author

Thank you

Nashipai98

This is a very good piece of work nerd!! Keep up the good work

Betty Kamanthe Author

Thank you.