Phylis Jepchumba, MSc

Posted on Feb 19, 2023

How to Perform Exploratory Data Analysis with Python

#python #datascience #womenintech #analytics

Introduction

Exploratory data analysis (EDA) is an essential step in the data science process. It combines statistical methods and data visualization techniques to uncover patterns and trends, understand relationships between variables, and gain insights that clarify the problem and support informed decisions. The amount of data generated worldwide is growing rapidly: one widely cited estimate is that by 2025 the world will generate 463 exabytes of data per day. Data scientists and data analysts therefore need a solid understanding of the EDA process.

Whether you’re an experienced data scientist or a beginner, this blog will walk you through the exciting process of EDA and how to use Python to perform this essential task.

So buckle up, it’s time to dive into the world of data and unearth some hidden insights.

“Data is like garbage. You’d better know what you are going to do with it before you collect it.” (attributed to Mark Twain)

Why Exploratory data analysis (EDA) is an essential task.

EDA helps data practitioners understand and gain insights from data before applying machine learning and statistical techniques.
EDA helps identify patterns, anomalies, and relationships within the data so as to make informed decisions and develop effective strategies.
The EDA process also aims to detect faulty data points, such as errors or missing values, so that they can be corrected before further analysis.

Exploratory data analysis (EDA) steps.

To carry out this critical task, the following steps need to be taken.

Import the necessary libraries.
Load the dataset.
View the dataset.
Check for duplicates.
Prepare the data (handle missing values and outliers).
Analyze the data (univariate, bivariate and multivariate analysis).
Visualize the data.

Wondering where to get datasets for practice? We got you covered. Take a look at the following resources.

Let's now perform EDA on the following sample dataset from Kaggle by working through these steps:

Dataset

Importing necessary Libraries

Python is a versatile programming language with robust libraries for data analysis, including Pandas, NumPy, seaborn, Matplotlib, Plotly, TensorFlow and Keras.
We will make use of the following modules:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Get to understand other Python libraries for data analysis by reading through the following article.

Data Science Libraries Every Data scientist should know.

Loading the Dataset.

The pandas library can be used to read various datasets from multiple formats such as csv, txt, Excel, JSON, SQL, etc.

# How to read datasets in various formats
import pandas as pd

# Read a CSV file
df = pd.read_csv('filename.csv')
# Read a tab-separated text file
df = pd.read_csv('filename.txt', sep="\t")
# Read an Excel file
df = pd.read_excel('filename.xlsx')
# Read a JSON file
df = pd.read_json('filename.json')
# Read the result of a SQL query; `connection` must be an open database connection
df = pd.read_sql('SELECT * FROM TableName', connection)
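The SQL example above relies on a `connection` object that is not defined in the snippet. As a minimal sketch of what that could look like, the block below uses Python's built-in sqlite3 module with an in-memory database. The table contents are made up for illustration and are not part of the article's dataset.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database, used only for illustration.
connection = sqlite3.connect(":memory:")

# Hypothetical table contents; not taken from the article's dataset.
pd.DataFrame({"TV": [100.0, 50.0], "Sales": [12.0, 8.0]}).to_sql(
    "TableName", connection, index=False
)

# Read the table back into a DataFrame through the open connection.
df_sql = pd.read_sql("SELECT * FROM TableName", connection)
connection.close()

print(df_sql)
```

For other databases the connection would typically come from the relevant driver or from SQLAlchemy, so treat the sqlite3 choice here as an assumption made to keep the example self-contained.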

Our dataset is in CSV format, so we load it into a DataFrame named data as follows:

data=pd.read_csv('https://raw.githubusercontent.com/PhylisKorir/SalesPrediction-using-Linear-Regression/main/Electronic_sales.csv')
data

Viewing the DataFrame.

We can quickly find out how many rows and columns there are in our dataset by using the shape attribute. It returns a tuple containing the number of rows and the number of columns.

Shape of the Data

data.shape

Previewing the Dataset.

Preview the first 5 rows.

data.head()

Preview last five rows.

data.tail()

The head() and tail() methods are used to preview the first five rows and the last five rows of the dataset respectively.
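Both methods also accept an optional number of rows when five is not what you want. The sketch below shows this on a small made-up DataFrame; the column names and values are assumptions for illustration, not the article's data.

```python
import pandas as pd

# Toy stand-in for the article's dataset; the values are made up.
toy = pd.DataFrame({"TV": range(10), "Sales": range(10, 20)})

first_three = toy.head(3)  # rows with index 0, 1, 2
last_two = toy.tail(2)     # rows with index 8, 9

print(first_three)
print(last_two)
```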

Column names

data.columns

The pandas.DataFrame.columns attribute is used to get the names of all the columns of a pandas DataFrame. It returns an Index object which holds the column labels.

Concise info of dataset

data.info()

The info() method allows us to obtain additional information about the dataset, such as the names of the columns, the data type of each column, and the number of non-null values.

Data types

data.dtypes

The pandas.DataFrame.dtypes attribute returns a pandas Series containing the data type of each column in the DataFrame.
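Knowing the data types is often useful for picking out the numeric columns before computing statistics. Here is a minimal sketch using select_dtypes on a made-up DataFrame; the "Region" and "Units" columns are hypothetical and do not come from the article's dataset.

```python
import pandas as pd

# Toy frame mixing numeric and text columns; names and values are made up.
toy = pd.DataFrame(
    {"TV": [230.1, 44.5], "Region": ["North", "South"], "Units": [3, 5]}
)

print(toy.dtypes)

# Keep only the numeric columns, for example before computing correlations.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```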

Descriptive statistics.

data.describe()

The describe() method in pandas generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. For numeric columns it provides the count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile and maximum of the data.
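By default describe() summarizes only numeric columns when a DataFrame contains any. The sketch below, on a made-up DataFrame with a hypothetical "Region" column, shows how include="all" can be used to summarize text columns as well.

```python
import pandas as pd

# Toy frame with one numeric and one text column; values are made up.
toy = pd.DataFrame(
    {"TV": [10.0, 20.0, 30.0, 40.0], "Region": ["N", "S", "N", "N"]}
)

# Default: numeric columns only.
numeric_summary = toy.describe()

# include="all" adds unique, top and freq for non-numeric columns.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```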

Checking Duplicates

data.duplicated().sum()

The duplicated() method returns a boolean Series that marks as True every row that repeats an earlier row. Calling sum() on that Series counts the True values, which gives the number of duplicate rows in the DataFrame.

The dataset has no duplicates.
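If a dataset does contain duplicates, they can be dropped with drop_duplicates(). The following is a small sketch on a made-up DataFrame that deliberately contains one repeated row; the values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame in which the second row repeats the first; values are made up.
toy = pd.DataFrame({"TV": [100.0, 100.0, 50.0], "Sales": [12.0, 12.0, 8.0]})

# Count rows that repeat an earlier row.
n_duplicates = toy.duplicated().sum()

# Keep the first occurrence of each row and drop the repeats.
deduplicated = toy.drop_duplicates()

print(n_duplicates)
print(deduplicated)
```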

Unique Values
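One way to inspect unique values is with the nunique(), unique() and value_counts() methods. The sketch below uses a small made-up DataFrame rather than the article's dataset, so the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with repeated values; names and values are made up.
toy = pd.DataFrame({"Region": ["N", "S", "N"], "Units": [3, 5, 3]})

# Number of distinct values in each column.
unique_counts = toy.nunique()

# The distinct values of a single column.
region_values = toy["Region"].unique()

# How often each distinct value occurs.
region_counts = toy["Region"].value_counts()

print(unique_counts)
print(region_values)
print(region_counts)
```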

DATA CLEANING
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. It includes handling missing values and handling outliers.

Missing values

We will begin by scanning the dataset for missing values. We can do this with the isna() method, which returns a DataFrame of boolean values indicating whether or not each field is null. We can then use the sum() method to count the missing values in each column.
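As a minimal sketch of that check, the block below chains isna() and sum() on a small made-up DataFrame that contains two missing values. The values are assumptions for illustration and are not the article's data.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing values; the numbers are made up.
toy = pd.DataFrame(
    {
        "TV": [230.1, np.nan, 17.2],
        "Radio": [37.8, 39.3, np.nan],
        "Sales": [22.1, 10.4, 9.3],
    }
)

# isna() gives a boolean DataFrame; sum() counts the True values per column.
missing_per_column = toy.isna().sum()

# Total number of missing cells in the whole frame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```

The same call on the article's DataFrame would be data.isna().sum().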

Our dataset has no null values and is ready for analysis.

It is common for datasets to contain errors, missing values, outliers, or other types of inconsistencies. In case your dataset has missing values, this blog (Handling Missing Values) will guide you on how to handle them.

"No data is clean, but most is useful." ~ Dean Abbott, Co-founder and Chief Data Scientist at SmarterHQ

Analyzing Data

With our cleansed dataset we can go ahead and begin the task of exploring the data.

Univariate Analysis.
Univariate analysis is a form of exploratory data analysis (EDA) that involves the examination of a single variable. It is used to summarize the data and gain insight into its distribution, central tendency, and variability. It can answer questions such as: what is the range of the data, what is the most common value, and are there any outliers? It is also used to identify any trends or patterns in the data.

We can visualize this information using box plots from Seaborn.

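# Box plots of TV, Newspaper and Radio to inspect their distributions and spot outliers
fig, axs = plt.subplots(3, figsize=(5, 5))
# Pass each column as x explicitly so the boxes are drawn horizontally
sns.boxplot(x=data['TV'], ax=axs[0])
sns.boxplot(x=data['Newspaper'], ax=axs[1])
sns.boxplot(x=data['Radio'], ax=axs[2])
plt.tight_layout()
plt.show()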

From the box plots above we can see the distribution of the data in each variable.
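Box plots conventionally draw points beyond 1.5 times the interquartile range (IQR) from the quartiles as outliers. The article does not say how outliers are handled, so the sketch below is only one possible approach: it flags values with that same 1.5 × IQR rule. The helper name iqr_outlier_mask and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]. Hypothetical helper for illustration."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Made-up values with one obvious extreme point.
newspaper = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 100.0], name="Newspaper")

mask = iqr_outlier_mask(newspaper)
outliers = newspaper[mask]

print(outliers)
```

Whether flagged points should be removed, capped or kept depends on the dataset, so a mask like this is best treated as a starting point for inspection rather than an automatic filter.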

Bivariate analysis

Bivariate analysis involves analyzing data with two variables or columns. This is usually a way to explore the relationships between these variables and how they influence each other, if at all.

# Let's see how Sales are related with other variables using scatter plot.
sns.pairplot(data, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, aspect=1, kind='scatter')
plt.show()

Multivariate analysis

Multivariate analysis is a type of data analysis that involves examining more than two variables at once in order to better understand the relationships between them. It is a powerful tool for exploratory data analysis, as it allows researchers to identify patterns and trends in large datasets that would otherwise be difficult to spot. It also allows researchers to explore the relationships between multiple variables and determine which ones are most important in predicting the outcome of interest.

# Let's see the correlation between different variables.
sns.heatmap(data.corr(), cmap="Greens", annot = True)
plt.show()

As is visible from the scatter plots and the heatmap, TV appears to be the variable most strongly correlated with Sales.
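The same comparison can be made numerically by ranking each variable's correlation with Sales. The sketch below does this on a small made-up DataFrame that reuses the column names from the plots. The values are invented so that TV is the strongest correlate, and they say nothing about the real dataset.

```python
import pandas as pd

# Made-up values; only the column names mirror the article's dataset.
toy = pd.DataFrame(
    {
        "TV": [1.0, 2.0, 3.0, 4.0, 5.0],
        "Radio": [2.0, 1.0, 4.0, 3.0, 5.0],
        "Newspaper": [5.0, 3.0, 4.0, 1.0, 2.0],
        "Sales": [2.0, 4.0, 6.0, 8.0, 10.0],
    }
)

# Pearson correlation of every other column with Sales, highest first.
corr_with_sales = (
    toy.corr()["Sales"].drop("Sales").sort_values(ascending=False)
)

# Variable with the largest absolute correlation with Sales.
strongest = corr_with_sales.abs().idxmax()

print(corr_with_sales)
print(strongest)
```

Using the absolute value matters here because a strong negative correlation is just as informative as a strong positive one.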

Conclusions

I hope this article has been informative and helpful in understanding how to perform Exploratory Data Analysis with Python. If you found it helpful, please share it with your colleagues and friends. Enjoy your exploration of data!

Top comments (11)

Chris Greening • Feb 19 '23

Hey Phylis thanks for sharing!

A really neat tool I found a while back for pandas is the ydata-profiling library, it takes your DataFrame and outputs a full report about the data with summary stats, visualizations, etc. it's a really fantastic EDA tool!

NOTE: the tool used to be called pandas-profiling but they recently changed it to ydata-profiling just FYI

Decent article about the topic