Exploratory Data Analysis (EDA) is an approach to analyzing data that was developed by John Tukey in the 1970s. Statistically, exploratory data analysis is the process of analyzing data sets to summarize their main characteristics and presenting them visually for proper observation. Basically, it is the step in which we explore the data set.
Why Exploratory Data Analysis (EDA)?
EDA is important in data analysis and machine learning because it helps you find out whether the selected features are good enough for modeling, whether they are all required, and whether there are any correlations, based on which we can either go back to the data pre-processing step or move on to modeling.
Generally, EDA is applied to investigate the data and summarize the key insights.
EDA does not only give us insight into our data; it also involves preprocessing the data for further analytics and model development by removing anomalies and outliers.
This makes data cleaner for use in machine learning processes.
EDA is also a source of information for making better business decisions.
Approach
When it comes to exploring data, there are two key approaches:
1. Non-graphical approach
In the non-graphical approach, we use functions such as shape, describe, isnull, info, dtypes and more.
2. Graphical approach
In the graphical approach, you will be using plots such as scatter, box, bar, density and correlation plots. A minimal sketch of both approaches follows below.
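As a quick, self-contained sketch of the two approaches (the tiny DataFrame below is made up purely for illustration), the non-graphical side relies on numeric summaries while the graphical side relies on plots:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"price": [200, 250, 180, 900, 220], "rooms": [3, 4, 2, 6, 3]})

# Non-graphical approach: numeric summaries
print(toy.shape)           # number of rows and columns
print(toy.describe())      # summary statistics
print(toy.isnull().sum())  # missing values per column

# Graphical approach: visual summaries
sns.boxplot(x=toy["price"])  # the value 900 shows up as a potential outlier
plt.show()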
Before EDA
Before we begin EDA, we first need to do:
1. Data Sourcing
2. Data Cleaning
1. Data Sourcing / Data Collection
Data sourcing is the very first step of EDA: before we can analyse data, we first need to have the data. The process of obtaining data is what we call data sourcing. Data can be obtained in two major ways, from public or private sources.
Public Data Sources:
These are data sources that we can obtain and use without any restrictions or need for special permissions. They are publicly available for anyone or any organization to use. Some common sources of public data are:
https://data.gov/
https://data.gov.uk/
https://data.gov.in/
https://www.kaggle.com/
https://github.com/awesomedata/awesome-public-datasets
Private Data Sources:
These are data sources that are private to individuals and organizations and cannot be accessed by just anyone without the proper authentication and permissions. They are mostly used within the organization for its internal data analysis and model building.
2. Data Cleaning
The second step before we begin the actual EDA is to clean our data. Data from the field may or may not be clean, so we need to inspect it and do some cleaning before moving on to analyzing it.
When it comes to data cleaning, we have already looked at some of the techniques we can use:
- Missing Values
- Incorrect Format
- Incorrect Headers/column names
- Anomalies/Outliers
- Re-index rows
- One thing I'll add about dealing with missing values is the different types of missing values:
- MCAR (Missing completely at random): These values do not depend on any other features in the dataset.
- MAR (Missing at random): These values may be dependent on some other features in the dataset.
- MNAR (Missing not at random): These missing values have some reason for why they are missing from the dataset.
Let's look at an example of EDA for better understanding:
To do exploratory data analysis in Python, we need some Python libraries such as NumPy, Pandas, and Seaborn. Seaborn, together with Matplotlib, will be used for visualization.
Make sure you import them before proceeding:
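A minimal import block might look like the one below (Matplotlib is included here as an assumption, since it is handy alongside Seaborn for showing figures):
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt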
The second step is loading our dataset for analysis:
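Assuming the data lives in a local CSV file (the file name below is just a placeholder), loading it with pandas could look like this:
# Load the dataset into a DataFrame (replace the file name with your own)
df = pd.read_csv("train.csv")
df.head()  # preview the first five rows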
Now we can begin our EDA:
1. Check data shape (number of rows & columns)
This can be done by simply using the code below:
df.shape
The output gives the number of rows and columns in your dataset. In the example above, there are 1460 rows and 81 columns in the data. Of these 81 columns, one is the target or dependent variable and the rest are mostly independent variables.
2. Check the data type of each column and missing values
df.info()
The info() method prints information about the DataFrame. The information contains the number of columns, column labels, column data types, memory usage, the range index, and the number of non-null values in each column.
3. Splitting values
On some occasions, we might want to split the value of a column. In case the column carries more than one value, e.g. country and city in one column, we split it into two columns, one for the city and one for the country.
df[['city', 'country']] = df['address'].str.split(',', expand=True)
resulting in two new columns, city and country.
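As a small self-contained sketch (the address values here are made up), the split behaves like this:
# Hypothetical example: an 'address' column holding "city, country" in one string
addresses = pd.DataFrame({"address": ["Nairobi, Kenya", "London, UK"]})
addresses[['city', 'country']] = addresses['address'].str.split(',', expand=True)
addresses['country'] = addresses['country'].str.strip()  # drop the space left after the comma
print(addresses)  # shows the original address alongside the new city and country columns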
4. Change the data type
We can use the astype() function from pandas. This is important because the specific data type of a column determines what values it can hold and what you can do with it (including what operations you can perform on it).
For example, to replace the data types of Customer Number, IsPurchased, Total Spend, and Dates, we run the code below:
#Replace Data Types to Integer
df["Customer Number"] = df['Customer Number'].astype('int')
#Replace Data Types to String
df["Customer Number"] = df['Customer Number'].astype('str')
#Replace Data Types to Boolean
df["IsPurchased"] = df['IsPurchased'].astype('bool')
#Replace Data Types to Float
df["Total Spend"] = df['Total Spend'].astype('float')
#Replace Data Types to Datetime with format= '%Y%m%d'
df['Dates'] = pd.to_datetime(df['Dates'], format='%Y%m%d')
5. Deal With Missing Values
We first check whether there are any missing values, then decide what to do next depending on the results.
df.isna().sum()
If there are no missing values, we can proceed with the analysis. However, if there are notable missing values, we compute the percentage of missing values per column. If the percentage of missing values in a column is high and it is not an important column, we can drop it.
total_missing = df.isna().sum().sort_values(ascending=False)
percentages_missing = (df.isna().sum()/df.isna().count()).sort_values(ascending=False)
missing_df = pd.concat([total_missing, percentages_missing], axis=1, keys=["Total_Missing", "Percentages_Missing"])
missing_df.head(25)
In cases where the number of missing values is not so large, we find ways to fill in the missing figures. There are many ways of dealing with missing values, which we shall look at in more detail later; a few common ones are sketched below.
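As a rough sketch, assuming hypothetical column names (replace them with columns from your own dataset), common strategies look like this:
# Hypothetical column names, for illustration only
df = df.drop(columns=["MostlyEmptyCol"])  # drop a column with a very high missing percentage
df["NumericCol"] = df["NumericCol"].fillna(df["NumericCol"].median())  # fill numeric gaps with the median
df["CategoryCol"] = df["CategoryCol"].fillna(df["CategoryCol"].mode()[0])  # fill categorical gaps with the mode
df = df.dropna(subset=["ImportantCol"])  # or drop the few rows missing a key column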
6. Summary Statistics
If the DataFrame contains numerical data, the description contains this information for each column:
count - the number of non-empty values.
mean - the average (mean) value.
std - the standard deviation.
min - the minimum value.
25% - the 25th percentile.
50% - the 50th percentile (the median).
75% - the 75th percentile.
max - the maximum value.
From this, you can already see the data distribution for each column and determine whether there are outliers or not.
df.describe()
7. Value counts for a specific column
Here we count the number of occurrences of each value in a column. For example, if it is a cars dataset, we may want to know how many times each car type appears in the dataset.
df.Col.value_counts()
8. Check for duplicate values
Here we check for duplicate values so that we can decide whether to drop or keep them, depending on the data and the goal we want to achieve.
# Example: inspect rows that appear more than once for the same value
df[df.Player == "john doe"]
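To check for rows that are exact duplicates across the whole DataFrame, a common sketch is:
df.duplicated().sum()          # number of rows that exactly repeat an earlier row
df[df.duplicated(keep=False)]  # inspect every row involved in duplication
df = df.drop_duplicates()      # drop duplicates if they add nothing to the analysis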
9. See the data distribution and data anomalies
Here, we want to see visually what the data distribution looks like, using the Seaborn library. From the summary statistics before, we might already know which columns potentially have data anomalies. Anomalies in data are also called outliers, noise, novelties, deviations, and exceptions.
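For example, a histogram and a box plot of a numeric column (SalePrice is reused from the correlation example further below; swap in any numeric column from your data) can reveal skew and outliers:
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
sns.histplot(df["SalePrice"], kde=True)  # distribution shape and skew
plt.subplot(1, 2, 2)
sns.boxplot(x=df["SalePrice"])           # points beyond the whiskers are potential outliers
plt.show()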
10. The correlation between variables in the data
This refers to the pairwise correlation of all columns in the DataFrame. Any NA values are automatically excluded, and non-numeric columns are ignored.
plt.figure(figsize=(12, 7))
sns.heatmap(df[["SalePrice", "OverallQual", "OverallCond"]].corr(), annot=True, cmap="Greens")
plt.title("Correlation Matrix Heatmap")
plt.show()
The correlation can tell us about the direction of the relationship, the form (shape) of the relationship, and the degree (strength) of the relationship between two variables.
CONCLUSION
The most important thing in analytics or data exploration is understanding the nature of the dataset and the problem statement, so as to know which part of the data is needed and how to go about it. The more you practice and the more different datasets you work with, the clearer this will become. Happy coding!