Dhruv Joshi for Quokka Labs


Web Scraping with Python: A Quick Guide to Scrape Data from Website

Web scraping is the process of extracting data from a website and saving it for further use. The technique has become increasingly popular due to the massive amount of data available on the internet. Python is the language most commonly used for web scraping, as it provides many libraries and tools for the purpose.


In this blog, we'll take a quick look at web scraping with Python and scrape data from websites using the most popular web scraping tools.

Efficient Web Scraping with Python: Your Quick Guide to Extracting Data from Websites

Discover the power of Python for efficient web scraping in our comprehensive guide. Learn how to easily extract valuable data from websites and supercharge your data collection process.

Why Use Web Scraping?

Web scraping can be helpful in many ways, including data analysis, lead generation, price comparison, and much more.

For example, if you're a data analyst, you may want to scrape data from websites to perform sentiment analysis or market research. If you're a business owner, you might scrape data from competitor websites to keep track of their pricing.

How Does Web Scraping Work?

Web scraping works by sending a request to a website's server, which returns the page's HTML code. The HTML code can then be parsed to extract the data you want. Python has many libraries and tools to make this process easier, such as BeautifulSoup, Selenium, and Scrapy.

Before you scrape data from a website, it's essential to check the website's terms of service to see if they allow web scraping. Some websites have strict policies against it, and you could face legal consequences if you scrape data from their site without permission.
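Beyond the terms of service, most sites also publish a robots.txt file describing which pages crawlers may fetch. Python's standard library can check it for you; here's a minimal sketch using urllib.robotparser (example.com is a placeholder):

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://www.example.com/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may request the URL
print(rp.can_fetch("*", "https://www.example.com/some-page"))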

Read more: Why Start Using FastAPI for Python?

Getting Started with BeautifulSoup

BeautifulSoup is a Python library used to parse HTML and XML files. It's a popular choice for scraping data from websites due to its simplicity and ease of use. To use BeautifulSoup, install it by running the following command in your terminal:

pip install beautifulsoup4 

Once you have BeautifulSoup installed, you can start scraping data from websites. The first step is to send a request to the website and get the HTML code. You can do this using the "requests" library. For example,

import requests

url = "https://www.example.com"
response = requests.get(url)
html_content = response.content

Next, you can parse the HTML code using BeautifulSoup. Here's an example:

from bs4 import BeautifulSoup 

soup = BeautifulSoup(html_content, "html.parser") 

Now that the HTML code is parsed, you can extract the data you want. BeautifulSoup provides many methods for finding elements in the HTML code, such as "find()" and "find_all()." For example,

titles = soup.find_all("h1")

for title in titles:
    print(title.text)

In this example, we're finding all the h1 elements in the HTML code and printing their text.
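find() and find_all() can also filter by attributes and read attribute values, not just text. A small sketch (the "title" class and markup here are assumed for illustration):

# Find every h2 with a hypothetical "title" class and print its link
for heading in soup.find_all("h2", attrs={"class": "title"}):
    link = heading.find("a")
    if link:
        print(link.get("href"), link.text)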

Getting Started with Selenium

Selenium is another popular Python web scraping tool. It's a browser automation library that can interact with websites and scrape data from them. The main advantage of using Selenium for web scraping is that it can handle JavaScript, which is often used to load dynamic website content.

To use Selenium, install it by running the following command in your terminal:

pip install selenium 

You'll also need a web driver for the browser you want to use. For example, if you use Google Chrome, you'll need ChromeDriver. (Recent versions of Selenium, 4.6 and later, can download a matching driver automatically via Selenium Manager, but you can still supply one yourself.)

Once you have Selenium and the web driver installed, you can start scraping data from websites. Here's an example:

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.example.com"

driver = webdriver.Chrome()
driver.get(url)

titles = driver.find_elements(By.TAG_NAME, "h1")
for title in titles:
    print(title.text)

driver.quit()

In this example, we're using the Chrome web driver to visit the website and find all the h1 elements on the page. The find_elements() method, paired with By.TAG_NAME, finds all the elements with the specified tag name; the older find_elements_by_tag_name() helper was removed in Selenium 4.
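Because Selenium's strength is handling JavaScript, it often pays to wait explicitly for dynamic content to appear before reading it. A minimal sketch using Selenium's built-in explicit waits (the ten-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.example.com")

# Wait up to 10 seconds for at least one h1 to be present before scraping
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "h1"))
)

for title in driver.find_elements(By.TAG_NAME, "h1"):
    print(title.text)

driver.quit()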

Getting Started with Scrapy

Scrapy is a robust Python framework for web scraping, and one of the most capable web scraping tools available. It's often used for large-scale web scraping projects because it provides many features and tools that make the process easier and more efficient.

To use Scrapy, install it by running the following command in your terminal:

pip install scrapy 

Once you have Scrapy installed, you can start creating your scraping project. Here's an example:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = [
        "https://www.example.com",
    ]

    def parse(self, response):
        titles = response.css("h1::text").getall()
        for title in titles:
            yield {"title": title}

In this example, we're creating a Scrapy spider called "ExampleSpider" that will scrape data from the website. The css() method selects elements on the page with a CSS selector, and the getall() method extracts the text from all matching elements.
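To try the spider without setting up a full Scrapy project, you can save it to a file and run it with the runspider command in your terminal, writing the scraped items to a JSON file (the file names here are placeholders):

scrapy runspider example_spider.py -o titles.json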

WebDrivers and Browsers

We have now seen the most widely used tools for scraping data from websites. Every Selenium-based scraper drives a browser to connect to the target URL, and a regular (headed) browser is recommended while you're learning, because you can watch what the scraper is doing.

You can switch to a headless browser later, once you've gained some experience; it's also helpful for more complex tasks. In this blog, we use the Chrome browser throughout, but the process is the same for Firefox.

To get started, search your preferred search engine for "webdriver for Chrome" and download the ChromeDriver release matching your browser version.

The final step is to pick a good coding environment. Many options exist, but Visual Studio Code and PyCharm are among the best; we'll use PyCharm, which is friendly for newcomers.

In PyCharm, right-click the project pane and select New > Python File. You can name the file anything you want.

Using and Importing Libraries

Now let's put the pieces we've installed to use by importing the libraries:

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

PyCharm may gray these imports out because they aren't used yet; don't let it remove them as "unused" libraries. Next, we define the browser. Note that Selenium 4 passes the driver path through a Service object instead of the old executable_path argument.


from selenium.webdriver.chrome.service import Service as ChromeService

driver = webdriver.Chrome(service=ChromeService(r'c:\path\to\windows\webdriver\executable.exe'))

# OR, for Firefox:

from selenium.webdriver.firefox.service import Service as FirefoxService

driver = webdriver.Firefox(service=FirefoxService('/nix/path/to/webdriver/executable'))

Choose a URL

Now, we have to pick the URL we want to scrape data from. Selenium requires the connection protocol to be provided, so always attach "https://" to the URL, like below.


driver.get('https://your.url/here?yes=brilliant') 

Building Lists and Defining Objects

You can create an object by giving it a name and a value. Here we define an empty list that will store our data:


# The object is "results"; the brackets make it an empty list.
# We will store our data here.
results = []

We can make more objects like the ones below.


# Add the page source to the variable `content`.
content = driver.page_source

# Load the contents of the page into a BeautifulSoup object, which parses
# the HTML as a nested data structure and lets us select its elements
# using various selectors.
soup = BeautifulSoup(content, 'html.parser')

Extracting Data with the Web Scraper

In this section, we will process each element and add it to the list.


# Loop over all elements returned by the `find_all` call. The `attrs` filter
# limits the results to elements with the given class only.
for element in soup.find_all(attrs={'class': 'list-item'}):
    ...  # we'll fill in the loop body below

Now let's visit the URL in a real browser. Press CTRL + U in Chrome (or right-click and choose View Page Source) to inspect the page's HTML, and find the closest class in which the data is nested. For example:

<h4 class="title">
    <a href="...">This is a Title</a>
</h4>

Now, let's get back and add the class we found in the source:


# Change 'list-item' to 'title'. 

Now we will process every element with the class "title", like below:

for element in soup.find_all(attrs={'class': 'title'}):
    name = element.find('a')

Now, let's complete our loop by storing each extracted name in the results list:

for element in soup.find_all(attrs={'class': 'title'}):
    name = element.find('a')
    if name and name.text not in results:
        results.append(name.text)

Exporting Data to CSV

Now we have to check whether the data was assigned to the right object and moved into the list correctly. A quick way to check is to "print" the results with a "for" loop. So far, our code will look like the below:


driver = webdriver.Chrome(service=ChromeService('/nix/path/to/webdriver/executable'))
driver.get('https://your.url/here?yes=brilliant')

results = []
content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')

for element in soup.find_all(attrs={'class': 'title'}):
    name = element.find('a')
    if name and name.text not in results:
        results.append(name.text)

for x in results:
    print(x)

Now we can remove the "print" loop and export the scraped data to a CSV file with pandas:


df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
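To verify the export, you can load the file back with pandas and inspect the first few rows (this assumes the names.csv file written above):

df = pd.read_csv('names.csv')
print(df.head())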

Best Practices: Web Scraping with Python Programming Language

Web scraping can be a powerful tool for extracting data from websites. Still, it's essential to follow best practices to avoid breaking websites and respect the terms of use. Here are some best practices for web scraping with Python:

Respect the website's terms of use

Some websites prohibit the scraping of their data. Before scraping a website, check its terms of service to see if it's allowed.

Use a "User-Agent" header

Websites can block scraping requests that look like they come from a bot. Many sites reject the default User-Agent string sent by HTTP libraries, so set a "User-Agent" header in your scraping requests that identifies your client the way a regular browser would.
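A minimal sketch with the requests library (the User-Agent string below is just an example of a typical browser value):

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
response = requests.get("https://www.example.com", headers=headers)
print(response.status_code)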

Don't scrape too quickly

Scraping too many pages can strain the website's server and slow it down for other users. To avoid this, add delays to your scraping code and be mindful of your request rate.
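The simplest approach is to sleep between requests; a minimal sketch (the one-second delay is an arbitrary example):

import time
import requests

urls = ["https://www.example.com/page1", "https://www.example.com/page2"]

for url in urls:
    response = requests.get(url)
    # ... process the response here ...
    time.sleep(1)  # pause between requests to be polite to the server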

Cache data

Scraping the same data multiple times can strain the website's server and slow it down for other users. To avoid this, cache the data you scrape to reuse later without making additional requests to the website.
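A minimal file-based cache sketch, assuming you only need the raw HTML (libraries such as requests-cache offer a more complete solution):

import os
import requests

def get_html(url, cache_file):
    # Reuse a previously saved copy instead of re-requesting the page
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as f:
            return f.read()
    html = requests.get(url).text
    with open(cache_file, "w", encoding="utf-8") as f:
        f.write(html)
    return html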

Be mindful of privacy

Some websites may contain personal information you don't have permission to scrape. Make sure only to scrape data that you have permission to use, and be mindful of privacy laws and regulations.

Use APIs

Many websites provide APIs that let you access their data in a more structured and efficient way. If an API is available, consider using it instead of scraping the website directly.
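Calling an API is usually just an HTTP request that returns JSON. A sketch (the endpoint and response shape here are hypothetical; consult the site's API documentation for real ones):

import requests

response = requests.get("https://api.example.com/v1/items")
response.raise_for_status()
for item in response.json():
    print(item)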

Monitor your code

Web scraping tools can be brittle and break easily when websites change. To avoid this, regularly monitor your scraping code to ensure it's still working as expected.
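Even a basic sanity check helps catch silent breakage, for example verifying that a scrape still returns data (a minimal sketch):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.example.com")
soup = BeautifulSoup(response.content, "html.parser")

titles = soup.find_all("h1")
if not titles:
    # An empty result often means the page structure has changed
    raise RuntimeError("Scraper found no titles; check the site's markup")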

By following these best practices, you can ensure that your web scraping projects are efficient, respectful, and compliant with legal and ethical standards.

Final Words

Web scraping with the Python programming language is a powerful way to scrape data from websites. Whether you're a data analyst, a business owner, or anyone looking to gather information from the web, Python provides several options to make the process easier and more efficient. With the correct web scraping tools and techniques, you can easily scrape data from websites and put it to use in your projects.

If you need more help with this domain or a related one, why not reach out to a Python developer to handle complex situations and ease the stress?

Like, share, and comment! Share this with anyone who needs it. Thanks for reading!
