Exploratory data analysis (EDA) is an essential step in the data science process. It combines statistical methods and data visualization techniques to uncover patterns and trends, understand relationships, and draw meaningful insights from data, so that you understand the problem and can make informed decisions. The amount of data generated worldwide is growing at an exponential rate: by one estimate, the world will be generating 463 exabytes of data per day by 2025, so data scientists and data analysts need a solid grasp of this process.
Whether you’re an experienced data scientist or a beginner, this blog will walk you through the exciting process of EDA and how to use Python to perform this essential task.
So buckle up, it’s time to dive into the world of data and unearth some hidden insights.
“Data is like garbage. You’d better know what you are going to do with it before you collect it.” — Mark Twain.
Why exploratory data analysis (EDA) is an essential task.
- EDA helps data practitioners understand and gain insights from data before applying machine learning and statistical techniques.
- EDA helps identify patterns, anomalies, and relationships within the data so as to make informed decisions and develop effective strategies.
- The EDA process aims to detect faulty points in the data, such as errors or missing values, which can then be corrected.
Exploratory data analysis (EDA) steps.
To achieve this critical task, the following steps need to be taken:
- Import the necessary libraries.
- Load the dataset.
- View the dataset.
- Check for duplicates.
- Prepare the data (handle missing values and outliers).
- Analyze the data (univariate, bivariate, and multivariate analysis).
Wondering where to get datasets for practice? We've got you covered; take a look at the following resources.
Let's now perform EDA on a sample dataset from Kaggle by working through the following steps.
Importing necessary Libraries
Python is a versatile programming language with robust libraries for data analysis, including pandas, NumPy, seaborn, Matplotlib, Plotly, TensorFlow, and Keras.
We will make use of the following modules:

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```
Get to understand other Python libraries for data analysis by reading through the following article.
Data Science Libraries Every Data scientist should know.
Loading the Dataset.
The pandas library can read datasets in multiple formats, such as CSV, TXT, Excel, JSON, and SQL.
```python
# How to read datasets of various formats
import pandas as pd

# read csv
df = pd.read_csv('filename.csv')
# read txt (tab-separated)
df = pd.read_csv('filename.txt', sep="\t")
# read excel
df = pd.read_excel('filename.xlsx')
# read json
df = pd.read_json('filename.json')
# read sql
df = pd.read_sql('SELECT * FROM TableName', connection)
```
Our dataset is in CSV format, so we load it into a DataFrame called `data` as follows.
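A minimal sketch of that load (the filename `advertising.csv` is an assumption; substitute the path of your downloaded Kaggle file — a tiny inline CSV stands in here so the snippet runs on its own):

```python
import pandas as pd
from io import StringIO

# With the real Kaggle file you would write:
#   data = pd.read_csv('advertising.csv')   # filename assumed
# A small inline CSV stands in so this snippet is self-contained.
csv_text = (
    "TV,Radio,Newspaper,Sales\n"
    "230.1,37.8,69.2,22.1\n"
    "44.5,39.3,45.1,10.4\n"
)
data = pd.read_csv(StringIO(csv_text))
print(data.shape)  # (2, 4)
```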
Viewing the DataFrame.
We can quickly find out how many rows and columns there are in our dataset by using the shape attribute. This returns a tuple containing the number of rows and columns.
Shape of the Data
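For example (using a tiny stand-in DataFrame in place of the Kaggle file):

```python
import pandas as pd

# Tiny stand-in for the advertising dataset (the real file has ~200 rows)
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})
print(data.shape)  # (rows, columns) tuple: (3, 4) for this stand-in
rows, cols = data.shape
```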
Previewing the Dataset.
Preview the first five rows.
Preview the last five rows.
The head() and tail() functions preview the first five rows and last five rows of the dataset, respectively.
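A quick sketch of both calls (on a small stand-in DataFrame):

```python
import pandas as pd

# Tiny stand-in for the advertising dataset
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})
print(data.head())        # first five rows (or fewer, as here)
print(data.tail())        # last five rows
first_two = data.head(2)  # both accept an optional row count
```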
The pandas.DataFrame.columns attribute holds the names of all the columns of a pandas DataFrame. It returns an Index object containing the column labels.
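For example:

```python
import pandas as pd

data = pd.DataFrame({
    'TV': [230.1], 'Radio': [37.8], 'Newspaper': [69.2], 'Sales': [22.1],
})
print(data.columns)          # Index(['TV', 'Radio', 'Newspaper', 'Sales'], ...)
labels = list(data.columns)  # convert to a plain Python list if needed
```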
Concise info of dataset
The info() method allows us to obtain additional information about the dataset, such as the names of the columns, the data type of each column, and the number of non-null values.
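For example:

```python
import pandas as pd

data = pd.DataFrame({
    'TV': [230.1, 44.5], 'Radio': [37.8, 39.3],
    'Newspaper': [69.2, 45.1], 'Sales': [22.1, 10.4],
})
data.info()  # prints column names, dtypes, non-null counts, and memory usage
```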
The pandas.DataFrame.dtypes attribute returns the data type of each column in a DataFrame, as a pandas Series indexed by column name.
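For example:

```python
import pandas as pd

data = pd.DataFrame({
    'TV': [230.1, 44.5], 'Radio': [37.8, 39.3],
    'Newspaper': [69.2, 45.1], 'Sales': [22.1, 10.4],
})
print(data.dtypes)  # a Series: column name -> dtype (all float64 here)
```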
The describe() function in pandas is used to generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. It provides the count, mean, standard deviation, minimum, maximum, 25th percentile, 50th percentile (median) and 75th percentile of the data.
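A small sketch of the call and its output shape:

```python
import pandas as pd

data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, 39.3, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})
stats = data.describe()  # rows: count, mean, std, min, 25%, 50%, 75%, max
print(stats)
```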
The pandas.DataFrame.duplicated() method returns a boolean Series marking rows that repeat an earlier row; chaining .sum() counts those True values, giving the total number of duplicate rows in the DataFrame.
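A quick sketch of the check, using a stand-in DataFrame with one row deliberately repeated so the count is visible (the advertising dataset itself contains none):

```python
import pandas as pd

# Stand-in with the first row deliberately repeated
data = pd.DataFrame({
    'TV': [230.1, 44.5, 230.1],
    'Radio': [37.8, 39.3, 37.8],
    'Newspaper': [69.2, 45.1, 69.2],
    'Sales': [22.1, 10.4, 22.1],
})
n_dupes = data.duplicated().sum()  # 1: the third row repeats the first
data = data.drop_duplicates()      # remove duplicates if any are found
```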
The dataset has no duplicates.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. It involves handling missing values and outliers.
We will begin by scanning the dataset for missing values. We can do this with the isna() method, which returns a DataFrame of boolean values indicating whether each field is null. Chaining the sum() method then totals the missing values per column.
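For example (using a stand-in DataFrame with one value deliberately left missing so the counts are visible):

```python
import pandas as pd
import numpy as np

# Stand-in with one missing Radio value
data = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'Radio': [37.8, np.nan, 45.9],
    'Newspaper': [69.2, 45.1, 69.3],
    'Sales': [22.1, 10.4, 9.3],
})
missing = data.isna().sum()  # missing-value count per column
print(missing)               # Radio shows 1, every other column 0
```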
Our dataset has no null values and is ready for analysis.
It is common for datasets to contain errors, missing values, outliers, or other types of inconsistencies. In case your dataset has missing values, this blog (Handling Missing Values) will guide you on how to handle them.
"No data is clean, but most is useful."~ Dean Abbott, Co-founder and Chief Data Scientist at SmarterHQ
With our cleansed dataset we can go ahead and begin the task of exploring the data.
Univariate analysis is a form of exploratory data analysis (EDA) that involves examining a single variable. It is used to summarize the data and gain insight into its distribution, central tendency, and variability. It can answer questions such as what the range of the data is, what the most common value is, and whether there are any outliers. It is also used to identify trends or patterns in the data.
We can visualize this information using box plots from seaborn.
```python
# Checking for outliers with box plots
fig, axs = plt.subplots(3, figsize=(5, 5))
plt1 = sns.boxplot(data['TV'], ax=axs[0])
plt2 = sns.boxplot(data['Newspaper'], ax=axs[1])
plt3 = sns.boxplot(data['Radio'], ax=axs[2])
plt.tight_layout()
```
The box plots above show the distribution of the data in each variable and make any outliers easy to spot.
Bivariate analysis involves analyzing data with two variables or columns. This is usually a way to explore the relationships between these variables and how they influence each other, if at all.
```python
# Let's see how Sales relates to the other variables using scatter plots.
sns.pairplot(data, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales',
             height=4, aspect=1, kind='scatter')
plt.show()
```
Multivariate analysis is a type of data analysis that involves examining more than two variables at once in order to better understand the relationships between them. It is a powerful tool for exploratory data analysis, as it allows researchers to identify patterns and trends in large datasets that would otherwise be difficult to spot. It also allows researchers to explore the relationships between multiple variables and determine which ones are most important in predicting the outcome of interest.
```python
# Let's see the correlation between the different variables.
sns.heatmap(data.corr(), cmap="Greens", annot=True)
plt.show()
```
As is visible from the scatter plots and the heatmap, the variable TV appears to be most strongly correlated with Sales.
I hope this article has been informative and helpful in understanding how to perform Exploratory Data Analysis with Python. If you found it helpful, please share it with your fellow colleagues and friends. Enjoy your exploration of data!