Webscraping using pandas

#python #pandas #webscraping #datascience

Web scraping refers to the process of extracting data from websites using automated tools and scripts. Web scraping can be used for a variety of purposes, such as market research, competitor analysis, and data analysis.

Pandas is a popular data analysis library in Python that provides powerful tools for working with structured data. In this article, we will explore how to use Pandas for web scraping and how it can make the process easier and more efficient.

The Pandas read_html() Function

One of the key features of Pandas for web scraping is the read_html() function. It reads the HTML tables on a web page and converts each one into a Pandas DataFrame. read_html() accepts a URL (or a file-like object containing HTML) and returns a list of DataFrames, one per table found on the page. It relies on an HTML parser being installed, such as lxml, or BeautifulSoup together with html5lib.

Here's an example of how to use read_html() to scrape a table from a web page:

import pandas as pd
import matplotlib.pyplot as plt

# Wikipedia page for total wealth data
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_total_wealth'

# read HTML tables from URL
tables = pd.read_html(url)

# extract the first table (which contains the wealth data)
wealth_table = tables[0]

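In this example, we first import Pandas (and Matplotlib, which is used later for plotting) and specify the URL of the web page we want to scrape. We then call read_html() with the URL as input, which returns a list of all tables found on the page. We extract the first table from the list by indexing it with [0]. Which position holds the table you want depends on the page layout, so it is worth inspecting the list before relying on a fixed index.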
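Selecting a table by position can break when the page layout changes. As a minimal sketch, read_html() also takes a match argument that keeps only the tables containing a given piece of text. The HTML below is made up for illustration and stands in for a downloaded page; the country names and the wealth_example variable are hypothetical.

```python
from io import StringIO

import pandas as pd

# Made-up HTML with two tables, standing in for a downloaded page.
html = """
<table>
  <tr><th>Note</th></tr>
  <tr><td>Navigation box</td></tr>
</table>
<table>
  <tr><th>Country</th><th>Wealth</th></tr>
  <tr><td>Aland</td><td>120</td></tr>
  <tr><td>Borduria</td><td>45</td></tr>
</table>
"""

# Without match, every table in the document is returned.
all_tables = pd.read_html(StringIO(html))

# With match, only tables containing the given text are kept.
wealth_tables = pd.read_html(StringIO(html), match="Wealth")
wealth_example = wealth_tables[0]

print(len(all_tables), len(wealth_tables))
print(wealth_example)
```

Wrapping the HTML text in StringIO lets the same call work on content you have already downloaded by other means, not only on URLs.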

Data Cleaning and Manipulation with Pandas

Once you have scraped the data from a web page into a Pandas DataFrame, you can use the full power of Pandas to clean, manipulate, and analyze the data.

Here's an example of how to clean and manipulate data in a scraped DataFrame:

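# keep only the columns we need (copy so later assignments do not touch a view)
wealth_table = wealth_table[['Country (or area)', 'Total wealth (USD bn)']].copy()

# convert the wealth column to numbers: strip thousands separators,
# and turn placeholders such as "—" into missing values (NaN)
wealth = wealth_table['Total wealth (USD bn)'].astype(str).str.replace(',', '', regex=False)
wealth_table['Total wealth (USD bn)'] = pd.to_numeric(wealth, errors='coerce')

# remove rows with missing values
wealth_table = wealth_table.dropna()

# take the 10 largest values instead of relying on the row order of the page
top10 = wealth_table.nlargest(10, 'Total wealth (USD bn)')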

# plot a bar chart of the top 10 countries by total wealth
plt.bar(top10['Country (or area)'], top10['Total wealth (USD bn)'])
plt.xticks(rotation=90)
plt.ylabel('Total wealth (USD bn)')
plt.title('Top 10 Countries by Total Wealth')
plt.show()
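Scraped columns often arrive as text, so sorting or plotting them without converting to numbers can give misleading results. The following is a small sketch of the cleaning steps on a hand-made DataFrame; the rows, the column names Country and Wealth, and the variable names are all made up for illustration.

```python
import pandas as pd

# Made-up rows imitating a scraped table: numbers arrive as text,
# with thousands separators and a "—" placeholder for missing values.
raw = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Syldavia", "Freedonia"],
    "Wealth": ["1,200", "—", "950", "30,000"],
})

# Strip separators, then convert; anything unparseable becomes NaN.
cleaned = raw.copy()
cleaned["Wealth"] = pd.to_numeric(
    cleaned["Wealth"].str.replace(",", "", regex=False), errors="coerce"
)

# Drop the rows whose value could not be parsed.
cleaned = cleaned.dropna(subset=["Wealth"])

# Rank by value instead of relying on the order of rows on the page.
top2 = cleaned.nlargest(2, "Wealth")
print(top2)
```

Using errors="coerce" is a design choice: it silently turns unexpected values into NaN, so it is worth checking how many rows were dropped before trusting the result.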

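Note that read_html() only parses table elements present in the HTML that the server returns. It may not work for all web pages, especially those whose tables are built dynamically with JavaScript or whose data is laid out without HTML tables.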

Conclusion

Web scraping with Pandas can be a powerful tool for extracting and analyzing data from web pages. The read_html() function provides an easy way to scrape HTML tables, and Pandas provides a wide range of tools for cleaning, manipulating, and analyzing the data. However, it's important to be mindful of the legal and ethical implications of web scraping, as some websites may prohibit or restrict scraping activities.

GitHub link: https://gist.github.com/ksn-developer/bb541c1aa2c13b423cdef188b2444661
