Hey reader 👋 Hope you are doing well!
In the last post we looked at some basics of Exploratory Data Analysis. Taking our discussion further, in this post we are going to see how EDA is performed on a dataset.
So let's get started🔥
Dataset used here:
(https://www.kaggle.com/datasets/sanyamgoyal401/customer-purchases-behaviour-dataset)
Step 1: Import the important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Pandas is used to load the dataset and manipulate the DataFrame.
NumPy is used to perform mathematical operations on the dataset, such as finding the mean, median, mode, etc.
Matplotlib and Seaborn are used for data visualization through graphs.
Step 2: Load the dataset
df=pd.read_csv('/kaggle/input/customer-purchases-behaviour-dataset/customer_data.csv')
Here read_csv() takes the path of the file and loads the CSV into our DataFrame df.
Step 3: Check the data
df.head()
Here head() returns the first five rows of the dataset; if you want to see more rows, just pass the number of rows as an argument to head(). Similarly, for seeing the last 5 rows we have the tail() method.
df.tail()
Get dimensions of dataset
df.shape
The shape property returns a tuple (x, y) in which x is the number of rows and y is the number of columns.
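To make this concrete, here is a tiny toy DataFrame (made up for illustration, not the Kaggle dataset):

```python
import pandas as pd

# A small toy DataFrame, just to show what shape returns
toy = pd.DataFrame({"age": [25, 32, 47], "income": [40000, 52000, 61000]})

rows, cols = toy.shape  # shape is a tuple (rows, columns)
print(toy.shape)        # (3, 2)
```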
Check data type of each column
df.dtypes
The dtypes property gives a Series containing the data type of each column, indexed by column name.
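For example, on a toy DataFrame (illustrative columns only, not from the dataset):

```python
import pandas as pd

# Toy frame: one integer column, one string column
toy = pd.DataFrame({"age": [25, 32], "name": ["Ann", "Ben"]})

print(toy.dtypes)  # age is an integer dtype, name is object (strings)
```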
Checking null values in each column
If your dataset contains any missing values, they are treated as NULL (NaN), and these values need special attention because they can directly impact our model's performance.
df.isnull().sum()
Here the isnull() method checks for missing values in every cell; it returns a DataFrame of the same dimensions as the dataset, with every cell filled with a boolean value. Chaining sum() then returns the count of missing values in every column.
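A quick toy example (with NaNs inserted deliberately, not from the real dataset) shows what isnull().sum() returns:

```python
import pandas as pd
import numpy as np

# Toy data with one missing value per column, for illustration
toy = pd.DataFrame({"age": [25, np.nan, 47], "income": [40000, 52000, np.nan]})

missing = toy.isnull().sum()  # per-column count of missing values
print(missing)                # age: 1, income: 1
```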
In this dataset we don't have any missing values, but in the upcoming blogs we are going to see how to handle missing values.
Checking for duplicate rows
Sometimes, due to errors in data collection, our dataset may contain duplicate rows, and these duplicates can be problematic as they can impact our model. So they need to be removed.
df.duplicated().sum()
So the duplicated() method returns a boolean Series that tells us whether each row is a duplicate or not.
We don't have duplicates in this dataset. But if they are present in a dataset, we can simply drop them using:
df.drop_duplicates(inplace=True)
The drop_duplicates() method drops all the duplicates, but by default it does not change the original DataFrame; passing inplace=True ensures the changes are reflected in the original DataFrame.
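Here is a small sketch on made-up data showing duplicated() and drop_duplicates(inplace=True) together:

```python
import pandas as pd

# Toy frame with one duplicated row, for illustration only
toy = pd.DataFrame({"age": [25, 25, 47], "income": [40000, 40000, 61000]})

print(toy.duplicated().sum())      # 1 duplicate row found
toy.drop_duplicates(inplace=True)  # inplace=True mutates toy itself
print(toy.shape)                   # (2, 2) after the duplicate is dropped
```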
Checking count of distinct values in each column
We can look at the number of distinct values in each column of the DataFrame.
df.nunique()
The nunique() method gives the number of unique values present in each column.
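A toy example (made-up columns, not the real dataset):

```python
import pandas as pd

# "Delhi" repeats, so 'city' has 2 distinct values while 'age' has 3
toy = pd.DataFrame({"city": ["Delhi", "Pune", "Delhi"], "age": [25, 32, 47]})

counts = toy.nunique()  # distinct values per column
print(counts["city"])   # 2
print(counts["age"])    # 3
```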
Getting a statistical summary of dataset
df.describe()
The describe() function in pandas provides summary statistics for the numerical columns in a DataFrame: count, mean, standard deviation, minimum, maximum, and quartile values for each numeric column.
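On toy data you can see how to read individual statistics out of describe()'s output:

```python
import pandas as pd

# Four evenly spaced ages, for illustration
toy = pd.DataFrame({"age": [20, 30, 40, 50]})
stats = toy.describe()  # rows: count, mean, std, min, 25%, 50%, 75%, max

print(stats.loc["mean", "age"])  # 35.0
print(stats.loc["50%", "age"])   # 35.0 (the median)
```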
Checking for outliers
Outliers are values in the dataset whose behavior is very different from the rest of the values. We have different techniques for detecting outliers in a dataset; one of the common techniques is detection using a BoxPlot.
Note: You will see a complete blog on the detection and handling of outliers.
What is a BoxPlot?
A BoxPlot is a graphical representation of the distribution of a dataset. It gives information about the maximum value, minimum value, median, 25th percentile, 75th percentile, and outliers.
In this image, the dots beyond the whiskers on the left and right represent the outliers, the two lines at the ends represent the maximum and minimum values, the middle line represents the median, and the edges of the box represent the 25th and 75th percentiles.
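A boxplot flags points lying more than 1.5×IQR beyond the quartiles, and the same rule can be checked numerically. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Toy series with one obvious outlier (999), for illustration only
s = pd.Series([21, 23, 25, 26, 28, 30, 999])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the boxplot "whisker" limits

outliers = s[(s < lower) | (s > upper)]  # anything beyond the whiskers
print(outliers.tolist())                 # [999]
```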
Step 4: Performing Univariate Analysis
Univariate analysis involves examining the distribution, central tendency, and variability of a single variable in isolation, without considering its relationship with other variables.
Here the countplot counts the number of Male and Female entries and shows the result graphically.
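The snippet for this plot isn't shown above; a minimal sketch of such a count plot on toy data (the 'gender' column name is my assumption about the dataset) could look like this:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy frame standing in for df; 'gender' is an assumed column name
toy = pd.DataFrame({"gender": ["Male", "Female", "Female", "Male", "Female"]})

ax = sns.countplot(x="gender", data=toy)  # one bar per category, height = count
plt.title("Count of Male and Female")
plt.show()
```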
x = df['education'].value_counts()
This line calculates the frequency of each unique value in the 'education' column of the DataFrame df and stores the result in the variable x: a Series whose index holds the unique education levels and whose values hold their frequencies.
plt.pie(x.values, labels=x.index, autopct='%1.1f%%')
This line creates a pie chart using Matplotlib's plt.pie() function, passing the frequencies (x.values) and the labels (x.index). The autopct='%1.1f%%' parameter displays the percentage of each category on the chart with one decimal place.
plt.show()
This line displays the pie chart.
plt.figure(figsize=(8, 6))
sns.histplot(df['age'], kde=True)
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()
plt.figure(figsize=(8, 6))
This line creates a new figure of 8 by 6 inches using Matplotlib's plt.figure() function, setting the dimensions of the plot.
sns.histplot(df['age'], kde=True)
This line creates a histogram using Seaborn's histplot() function. It takes the 'age' column from the DataFrame df and plots the distribution of ages. The kde=True parameter adds a kernel density estimate curve to the plot, providing a smooth representation of the distribution.
plt.title('Histogram of Age')
This line sets the title of the plot to 'Histogram of Age' using Matplotlib's plt.title() function.
plt.xlabel('Age') and plt.ylabel('Frequency')
These lines set the x-axis and y-axis labels using plt.xlabel() and plt.ylabel().
The histplot gives you insight into how your data is distributed (normal distribution, Poisson distribution, etc.). We can use histplot only for numerical values.
Step 5: Performing Bivariate Analysis
Bivariate analysis involves analyzing the relationship between two variables simultaneously. It aims to understand how the value of one variable changes with respect to the value of another variable. Common techniques used in bivariate analysis include scatter plots, correlation analysis, and cross-tabulation. Bivariate analysis helps in identifying patterns, trends, and associations between variables, providing insights into their relationship and potential dependencies.
Categorical V/S Categorical
Categorical V/S Numerical
Numerical V/S Numerical
We have different approaches for each of the types defined above. Here we will only consider Numerical V/S Numerical:
plt.figure(figsize=(8, 6))
sns.lineplot(x='age', y='income', data=df)
plt.title('Line Plot of Age vs. Income')
plt.xlabel('Age')
plt.ylabel('Income')
plt.show()
Here you can see how income varies with age; this can be useful for seeing the relationship between income and age.
We have another plot, named the scatterplot, which also shows the relationship between two numerical variables.
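For example, a scatterplot version of the age-vs-income view might look like this (toy data standing in for df, made up for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy frame with the same column names as the dataset
toy = pd.DataFrame({"age": [22, 30, 38, 45, 53],
                    "income": [30000, 42000, 50000, 58000, 64000]})

ax = sns.scatterplot(x="age", y="income", data=toy)  # one point per row
plt.title("Scatter Plot of Age vs. Income")
plt.xlabel("Age")
plt.ylabel("Income")
plt.show()
```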
We will see more about these plots in upcoming blogs. This was just an insight into how EDA is performed on datasets. I hope you have understood it well. Please leave some reactions and don't forget to follow me.
Thank you ❤