Phylis Jepchumba
How to Perform Exploratory Data Analysis with Python


Exploratory data analysis (EDA) is an essential step in the data science process. It combines statistical methods and data visualization techniques to uncover patterns and trends, understand relationships, and extract meaningful insights from data, so that you can understand the problem and make informed decisions. The amount of data generated worldwide is growing at an exponential rate: it is estimated that by 2025 the world will generate 463 exabytes of data per day, so data scientists and data analysts need a solid grasp of this process.

Whether you’re an experienced data scientist or a beginner, this blog will walk you through the exciting process of EDA and how to use Python to perform this essential task.

So buckle up, it’s time to dive into the world of data and unearth some hidden insights.

“Data is like garbage. You’d better know what you are going to do with it before you collect it.” — Mark Twain.

Why exploratory data analysis (EDA) is an essential task.

  • EDA helps data practitioners understand and gain insights from data before applying machine learning and statistical techniques.
  • EDA helps identify patterns, anomalies, and relationships within the data so as to make informed decisions and develop effective strategies.
  • The EDA process helps detect faults in the data, such as errors or missing values, which can then be corrected.

Exploratory data analysis (EDA) steps.

To carry out this critical task, the following steps need to be taken into consideration.

  • Import the necessary libraries.
  • Load the dataset.
  • View the dataset.
  • Check for duplicates.
  • Prepare the data (handle missing values and outliers).
  • Analyze the data (univariate, bivariate, and multivariate analysis).
  • Visualize the results.

Wondering where to get datasets for practice? Kaggle and the UCI Machine Learning Repository are good places to start.

Let's now perform EDA on a sample dataset from Kaggle, the Advertising dataset, which records TV, Radio, and Newspaper advertising budgets alongside Sales, by performing the following steps;


Importing necessary Libraries

Python is a versatile programming language with robust libraries for data analysis, including Pandas, NumPy, Seaborn, Matplotlib, Plotly, TensorFlow, and Keras.
We will make use of the following modules:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

To get to know other Python libraries for data analysis, read through the article Data Science Libraries Every Data Scientist Should Know.

Loading the Dataset.

The pandas library can be used to read various datasets from multiple formats such as csv, txt, Excel, JSON, SQL, etc.

# How to read datasets of various formats
import pandas as pd
# Read CSV
df = pd.read_csv('filename.csv')
# Read tab-separated text
df = pd.read_csv('filename.txt', sep="\t")
# Read Excel
df = pd.read_excel('filename.xlsx')
# Read JSON
df = pd.read_json('filename.json')
# Read from SQL (requires an open database connection)
df = pd.read_sql('SELECT * FROM TableName', connection)

Our dataset is in CSV format, so we load it into a DataFrame as follows.
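A minimal, self-contained sketch (the filename advertising.csv is an assumption; use the name of the file you downloaded from Kaggle):

```python
import pandas as pd

# For a self-contained demo, write a tiny sample of the data to disk first;
# in practice you would skip this step and read the file downloaded from Kaggle.
with open('advertising.csv', 'w') as f:
    f.write("TV,Radio,Newspaper,Sales\n"
            "230.1,37.8,69.2,22.1\n"
            "44.5,39.3,45.1,10.4\n")

# Load the CSV into a DataFrame
data = pd.read_csv('advertising.csv')
print(data.head())
```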

Viewing the DataFrame.

We can quickly find out how many rows and columns there are in our dataset by using the shape attribute. It returns a tuple containing the number of rows and columns.

Shape of the data
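For example, on a small in-memory sample of the Advertising data (first three rows, shown for illustration):

```python
import pandas as pd

# Small sample of the Advertising dataset, for illustration
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})

# shape is an attribute, not a method: (rows, columns)
print(data.shape)  # (3, 4)
```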

Previewing the Dataset.

The head() and tail() methods preview the first five rows and last five rows of the dataset, respectively; both accept an optional row count.
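A quick sketch, using a small in-memory sample of the Advertising data for illustration:

```python
import pandas as pd

# Small sample of the Advertising dataset, for illustration
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})

# head() returns up to the first five rows; tail(n) the last n rows
print(data.head())
print(data.tail(2))
```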

Column names

The DataFrame.columns attribute holds the names of all the columns of a pandas DataFrame. It is an Index object containing the column labels.
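For example, on a small sample of the Advertising data:

```python
import pandas as pd

# Small sample of the Advertising dataset, for illustration
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})

# columns is an Index holding the column labels
print(data.columns)
```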

Concise info of dataset

The info() method allows us to obtain additional information about the dataset, such as the names of the columns, the data type of each column, and the number of non-null values.

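For example, on a small sample of the Advertising data:

```python
import pandas as pd

# Small sample of the Advertising dataset, for illustration
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})

# info() prints the column names, dtypes, and non-null counts
data.info()
```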

Data types

The DataFrame.dtypes attribute returns the data type of each column in a pandas DataFrame as a pandas Series.
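For example, on a small sample of the Advertising data:

```python
import pandas as pd

# Small sample of the Advertising dataset, for illustration
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})

# dtypes is a Series mapping column name to data type
print(data.dtypes)
```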

Descriptive statistics.


The describe() function in pandas is used to generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. It provides the count, mean, standard deviation, minimum, maximum, 25th percentile, 50th percentile (median) and 75th percentile of the data.

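For example, on a small sample of the Advertising data:

```python
import pandas as pd

# Small sample of the Advertising dataset, for illustration
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})

# describe() summarizes count, mean, std, min, quartiles, and max per column
stats = data.describe()
print(stats)
```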

Checking Duplicates

Chaining DataFrame.duplicated() with sum() counts the duplicate rows: duplicated() returns a boolean Series marking repeated rows, and sum() adds up the True values.

The dataset has no duplicates.
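For example, on a small sample of the Advertising data:

```python
import pandas as pd

# Small sample of the Advertising dataset, for illustration
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})

# duplicated() flags repeated rows; sum() counts the True flags
print(data.duplicated().sum())  # 0
```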

Unique values

The nunique() method returns the number of distinct values in each column, which is useful for spotting categorical variables and constant columns.
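For example, on a small sample of the Advertising data:

```python
import pandas as pd

# Small sample of the Advertising dataset, for illustration
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})

# nunique() counts distinct values per column
print(data.nunique())
```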

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. It involves handling missing values and outliers.

Missing values

We will begin by scanning the dataset for missing values. We can do this with the isna() method, which returns a DataFrame of boolean values indicating whether or not each field is null, and then chain the sum() method to count the missing values per column.

Our dataset has no null values and is ready for analysis.
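For example, on a small sample of the Advertising data:

```python
import pandas as pd

# Small sample of the Advertising dataset, for illustration
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})

# isna() marks null fields; sum() counts them per column
print(data.isna().sum())
```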

It is common for datasets to contain errors, missing values, outliers, or other types of inconsistencies. In case your dataset has missing values, the blog Handling Missing Values will guide you on how to handle them.

"No data is clean, but most is useful."~ Dean Abbott, Co-founder and Chief Data Scientist at SmarterHQ

Analyzing Data

With our cleaned dataset, we can go ahead and begin the task of exploring the data.

Univariate Analysis.
Univariate analysis is a form of exploratory data analysis (EDA) that involves the examination of a single variable. It is used to summarize the data and gain insight into its distribution, central tendency, and variability. It can answer questions such as what the range of the data is, what the most common value is, and whether there are any outliers. It is also used to identify trends or patterns in the data.

We can visualize this information using boxplots from Seaborn.

# Boxplots to inspect the distribution and spot outliers in each feature
fig, axs = plt.subplots(3, figsize=(5, 5))
sns.boxplot(x=data['TV'], ax=axs[0])
sns.boxplot(x=data['Newspaper'], ax=axs[1])
sns.boxplot(x=data['Radio'], ax=axs[2])
plt.tight_layout()
plt.show()


With the above box plots, we can see the distribution of the data in each variable.

Bivariate analysis

Bivariate analysis involves analyzing data with two variables or columns. This is usually a way to explore the relationships between these variables and how they influence each other, if at all.

# Let's see how Sales are related with other variables using scatter plot.
sns.pairplot(data, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, aspect=1, kind='scatter')

Multivariate analysis

Multivariate analysis is a type of data analysis that involves examining more than two variables at once in order to better understand the relationships between them. It is a powerful tool for exploratory data analysis, as it allows researchers to identify patterns and trends in large datasets that would otherwise be difficult to spot. It also allows researchers to explore the relationships between multiple variables and determine which ones are most important in predicting the outcome of interest.

# Let's see the correlation between different variables.
sns.heatmap(data.corr(), cmap="Greens", annot = True)

As is visible from the scatter plots and the heatmap, the variable TV appears to be the most strongly correlated with Sales.


I hope this article has been informative and helpful in understanding how to perform Exploratory Data Analysis with Python. If you found it helpful, please share it with your fellow colleagues and friends. Enjoy your exploration of data!

Top comments (11)

Chris Greening

Hey Phylis thanks for sharing!

A really neat tool I found a while back for pandas is the ydata-profiling library, it takes your DataFrame and outputs a full report about the data with summary stats, visualizations, etc. it's a really fantastic EDA tool!

NOTE: the tool used to be called pandas-profiling but they recently changed it to ydata-profiling just FYI

Decent article about the topic

Phylis Jepchumba

Will check it out!

Matt Curcio

YES, Love Pandas-Profiling or Ydata-Profiling. Great starter.

vulcanwm

great explanations!

Phylis Jepchumba

Thanks...Glad you found it interesting and helpful

Phylis Jepchumba

Thank you

Rodgers Nyangweso

Great piece here, thanks. Very helpful

Phylis Jepchumba

Thank you..Am glad you found it helpful

Phylis Jepchumba

Noted! will check it out

Stephen Muruchi

so informative👏

Phylis Jepchumba

Thank you. Glad you found it helpful😊