Introduction
In today's world, data has become one of the most important assets for businesses, organisations, governments and even individuals. It has become the norm to hear the phrase "data-driven decision-making". The world is generating data at unprecedented levels: Statista projects that the amount of data created, consumed and stored in 2025 will be about 180 zettabytes, up from 64.2 zettabytes in 2020.
Exploratory Data Analysis (EDA) is an approach to understanding your data in depth. It is a critical first step in any data analysis project, as it provides a foundation for further analysis and can help identify potential problems or biases in the data, detect outliers or anomalous events, or reveal interesting relationships among variables.
EDA is an iterative rather than a one-step process: it involves visualizing and summarizing the data in different ways to uncover hidden insights before making any assumptions.
Python is a go-to programming language for data analysis and EDA.
In this article, we will go through the EDA process for a data project I worked on. You can have a look at the project on my GitHub.
Types of EDA
There are four primary types of EDA:
- Univariate non-graphical: This is the simplest form of data analysis, where the data being analysed is just a single variable. The main purpose here is to describe the data and find patterns within it; it does not deal with causes or relationships.
- Univariate graphical: Graphical methods provide a fuller picture of the data. Common types include: histograms, box plots and stem-and-leaf plots.
- Multivariate non-graphical: This combines multiple variables and tries to show the relationship between two or more variables of the data through cross-tabulation or statistics.
- Multivariate graphical: This is the graphical version of the multivariate non-graphical. Most used graphs include bar plot/chart, scatter plot, multivariate chart, run chart, bubble chart and heat map.
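To make these four types concrete, here is a minimal sketch on a small, hypothetical toy DataFrame (illustration only, not the project data), showing one technique from each type:
# Illustration only: one technique from each type of EDA on toy data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
toy = pd.DataFrame({'amount': rng.normal(8000, 2500, 500),
                    'term': rng.choice([12, 36, 60], 500)})
toy['amount'].describe()                     # univariate non-graphical
plt.hist(toy['amount'], bins=30)             # univariate graphical
toy.groupby('term')['amount'].mean()         # multivariate non-graphical
sns.boxplot(data=toy, x='term', y='amount')  # multivariate graphical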
Steps Involved in EDA
1) Data Gathering - This is the process of finding, collecting and loading data into the analytical system.
2) Data Cleaning - To get quality findings, the data used must be of high quality. To achieve this, the data may need unwanted variables and irregularities removed, re-indexing and reformatting, feature engineering, etc.
3) Univariate Analysis - This involves analysing the data one variable at a time. A single variable refers to a single feature/column.
4) Bivariate Analysis - This involves analysing two variables together; for example, how one feature affects another.
Loan Analysis with EDA
You can access the dataset used here from Kaggle, Prosper Loan Dataset. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. See this data dictionary to understand the dataset's variables.
The main task here is to have a look at the basic structure of the loans given. This means looking at the amounts given, the interest and fees, the loan terms, and the borrowers' states and employment statuses, among other things. We will therefore look at features such as CreditGrade, Term, LoanStatus, BorrowerState, EmploymentStatus, LP_InterestandFees, LoanOriginalAmount and LoanOriginationDate, amongst others.
Please access the variable definitions of the dataset here.
1. Data Loading
Load the dataset into a Jupyter notebook using Python libraries. First, we import the necessary libraries for this project and then load the dataset.
# Import needed Python Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')
Load the ProsperLoan Dataset:
# Load the dataset and display first 5 rows
df_loans = pd.read_csv("prosperLoanData.csv")
# Create a working copy (df_loanscp) used in the exploration and cleaning cells below
df_loanscp = df_loans.copy()
df_loans.head()
2. Data Exploration
After loading the dataset, the next step is to explore the data to get an overview by examining the structure, identifying variables and checking for missing or incorrect values.
# Check the dataset's datatypes and nulls
df_loans.info()
Check the shape of the dataset (rows, columns):
# Check how many columns and rows the dataset contains
df_loans.shape
(113937, 81)
Check for the missing values:
# Check the count of missing values by column
# Show the first 10 columns of the result
missing_values = df_loans.isna().sum()
missing_values[:10]
ListingKey 0
ListingNumber 0
ListingCreationDate 0
CreditGrade 84984
Term 0
LoanStatus 0
ClosedDate 58848
BorrowerAPR 25
BorrowerRate 0
LenderYield 0
dtype: int64
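Since most of the 81 columns have no missing values at all, it can help to narrow the output down to just the columns that do. This small addition is not part of the original notebook:
# Show only the columns that actually contain missing values, largest first
missing_only = missing_values[missing_values > 0].sort_values(ascending=False)
missing_only.head(10)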
Using a for-loop, print the column headers
# Check for the header-columns
for col in df_loans:
    print(col)
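The same headers can also be pulled out without a loop; a one-line equivalent:
# Equivalent one-liner: list all column names
print(df_loans.columns.tolist())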
Check for duplicates in the dataset
# Check for duplicates
df_loans.duplicated().sum()
0
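There are no duplicate rows here, but if the count had been non-zero, a typical follow-up (sketch only) would be to drop them:
# Drop exact duplicate rows, keeping the first occurrence of each
df_loans = df_loans.drop_duplicates(keep='first')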
Check for the frequency count of loan amounts given
# Check the count of LoanOriginalAmount
df_loanscp.LoanOriginalAmount.value_counts()
4000 14333
15000 12407
10000 11106
5000 6990
2000 6067
...
7765 1
5652 1
9746 1
15889 1
12282 1
Name: LoanOriginalAmount, Length: 2468, dtype: int64
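With 2,468 distinct loan amounts the raw counts are hard to read, so an optional extra step (not in the original notebook) is to bin the amounts before counting:
# Bin loan amounts into $5,000-wide ranges and count loans per bin
bins = np.arange(0, 45000, 5000)
pd.cut(df_loanscp['LoanOriginalAmount'], bins=bins).value_counts().sort_index()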
Using "describe()" check the statistical summary of the LoanOriginalAmount
# Statistical view of LoanOriginalAmount
df_loanscp.LoanOriginalAmount.describe()
count 113937.00000
mean 8337.01385
std 6245.80058
min 1000.00000
25% 4000.00000
50% 6500.00000
75% 12000.00000
max 35000.00000
Name: LoanOriginalAmount, dtype: float64
Using "describe()" check the statistical summary of the LP_InterestandFees
# Statistical view of LP_InterestandFees
df_loanscp.LP_InterestandFees.describe()
count 113937.000000
mean 1077.542901
std 1183.414168
min -2.349900
25% 274.870000
50% 700.840100
75% 1458.540000
max 15617.030000
Name: LP_InterestandFees, dtype: float64
Using "value_counts()" check the count of loan terms
# Check count of Loan terms
df_loanscp.Term.value_counts()
36 87778
60 24545
12 1614
Name: Term, dtype: int64
Check the loan risk score and frequency count
# Checking the custom risk score on loans taken
# The score ranges from 1-10, with 10 being the best, or lowest risk score.
df_loanscp.ProsperScore.value_counts()
4.0 12595
6.0 12278
8.0 12053
7.0 10597
5.0 9813
3.0 7642
9.0 6911
2.0 5766
10.0 4750
11.0 1456
1.0 992
Name: ProsperScore, dtype: int64
Check the borrowers' occupations
# Breakdown of borrower occupations
df_loanscp.Occupation.value_counts()
Other 28617
Professional 13628
Computer Programmer 4478
Executive 4311
Teacher 3759
...
Dentist 68
Student - College Freshman 41
Student - Community College 28
Judge 22
Student - Technical School 16
Name: Occupation, Length: 67, dtype: int64
3. Data Cleaning
The next step is to clean the data. This involves handling missing or incorrect values, removing duplicates, and transforming the data into a format suitable for analysis.
- Convert Date Columns to Datetime Datatype
# Convert multiple columns to datetime using the .apply method
df_loanscp[['ClosedDate', 'DateCreditPulled', 'LoanOriginationDate']] = df_loanscp[['ClosedDate', 'DateCreditPulled', 'LoanOriginationDate']].apply(pd.to_datetime)
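A quick optional sanity check confirms the conversion took effect:
# Confirm the three columns are now datetime64 rather than object
df_loanscp[['ClosedDate', 'DateCreditPulled', 'LoanOriginationDate']].dtypes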
- To save on processing time, I will drop columns from the copied dataset that will not be needed in this phase.
# First create a list of all columns
all_cols = []
for col in df_loanscp:
    all_cols.append(col)
print(all_cols)
['ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade', 'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)', 'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState', 'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey', 'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines', 'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'BankcardUtilization', 'AvailableBankcardCredit', 'TotalTrades', 'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans', 'TotalProsperPaymentsBilled', 'OnTimeProsperPayments', 'ProsperPaymentsLessThanOneMonthLate', 'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing', 'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber', 'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount', 'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey', 'MonthlyLoanPayment', 'LP_CustomerPayments', 'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees', 'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss', 'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount', 'Investors']
# Drop the unneeded columns in a single call
df_loanscp.drop(['ListingCreationDate','ProsperRating (Alpha)','LenderYield','ListingCategory (numeric)','CurrentlyInGroup','DateCreditPulled','CreditScoreRangeLower','CreditScoreRangeUpper','FirstRecordedCreditLine','CurrentCreditLines','GroupKey','ProsperPrincipalBorrowed','ProsperPrincipalOutstanding','EstimatedEffectiveYield','EstimatedLoss','EstimatedReturn','ProsperRating (numeric)', 'TotalProsperLoans','TotalProsperPaymentsBilled','OnTimeProsperPayments','ProsperPaymentsLessThanOneMonthLate','ProsperPaymentsOneMonthPlusLate','ListingKey'], axis =1, inplace = True)
df_loanscp.head()
Please note, the columns can also be dropped using a for-loop, as shown below.
# Define the list of columns to drop
columns_to_drop = ['ListingCreationDate','ProsperRating (Alpha)','LenderYield','ListingCategory (numeric)','CurrentlyInGroup','DateCreditPulled','CreditScoreRangeLower','CreditScoreRangeUpper','FirstRecordedCreditLine','CurrentCreditLines','GroupKey','ProsperPrincipalBorrowed','ProsperPrincipalOutstanding','EstimatedEffectiveYield','EstimatedLoss','EstimatedReturn','ProsperRating (numeric)', 'TotalProsperLoans','TotalProsperPaymentsBilled','OnTimeProsperPayments','ProsperPaymentsLessThanOneMonthLate','ProsperPaymentsOneMonthPlusLate','ListingKey']
# Drop the columns one at a time using a loop
for col in columns_to_drop:
    df_loanscp.drop(col, axis=1, inplace=True)
# Display the updated DataFrame
df_loanscp.head()
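Note that if both the single-call drop above and this loop are run in the same notebook, the second attempt will raise a KeyError because the columns are already gone. One defensive variation (my own addition, not the original code) is to ignore labels that are missing:
# Drop the columns in one call, silently skipping any that were already removed
df_loanscp = df_loanscp.drop(columns=columns_to_drop, errors='ignore')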
- Feature Engineering based on existing data.
# Create a year column based on the LoanOriginationDate
df_loanscp['LoanOriginationYear'] = pd.DatetimeIndex(df_loanscp['LoanOriginationDate']).year
df_loanscp['LoanOriginationYear']
0 2007
1 2014
2 2007
3 2012
4 2013
...
113932 2013
113933 2011
113934 2013
113935 2011
113936 2014
Name: LoanOriginationYear, Length: 113937, dtype: int64
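Since LoanOriginationDate is already a datetime column, the same year feature can also be created with the .dt accessor; an equivalent alternative:
# Equivalent feature engineering using the .dt accessor
df_loanscp['LoanOriginationYear'] = df_loanscp['LoanOriginationDate'].dt.year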
Fill the missing values in the CreditGrade column with "No Grade"
# Filling the NaN values in the CreditGrade Column with "No Grade"
df_loanscp['CreditGrade'] = df_loanscp['CreditGrade'].fillna('No Grade')
Check for missing values after filling in the values
# Checking missing values in the CreditGrade column
missing_credits = df_loanscp.CreditGrade.isna().sum()
missing_credits
0
4. Data Analysis & Visualization
After cleaning the dataset, the next step is to analyse the data and create visual representations of it using graphs and other visual techniques. In this case, both of these steps are combined, but in other cases the two steps are separated. Sometimes an analyst/scientist may need to apply statistical and machine learning techniques to the data to identify patterns, relationships, and anomalies before visualizing the results.
Univariate Exploration
Using the "Question-Visualization-Observations" framework throughout the exploration. This framework involves asking a question from the data, creating a visualization to find answers, and then recording observations after each visualisation.
- What was the employment status for the borrowers?
bcolor = sns.color_palette()[0]
cate_order = df_loanscp.EmploymentStatus.value_counts().index
plt.figure(figsize=[10, 8])
sns.countplot(data=df_loanscp,y='EmploymentStatus',color=bcolor, order=cate_order)
plt.title('Borrowers-Employment Distribution')
plt.xlabel('Count of Borrowers');
The majority of the borrowers were employed, while retirees took out the fewest loans.
- Which year had the most and least count of loans given?
plt.figure(figsize = [20, 8])
bcolor = sns.color_palette()[0]
sns.countplot(data=df_loanscp, x='LoanOriginationYear', color=bcolor)
plt.title('Loan Origin Dates(Year) Distribution')
plt.xlabel('Year Loan was Given');
2013 had the highest number of loans given while 2005 had the least.
- Is the borrower a Homeowner?
homeowner_count = df_loanscp.IsBorrowerHomeowner.value_counts()
homeowner_count
True 57478
False 56459
Name: IsBorrowerHomeowner, dtype: int64
homeowner_count = df_loanscp.IsBorrowerHomeowner.value_counts()
plt.figure(figsize=[25,10])
plt.subplot(1, 2, 1)
plt.pie(homeowner_count, labels = homeowner_count.index, startangle = 90, counterclock = False, autopct='%1.1f%%', textprops={'color':"b"})
plt.title("Breakdown of whether a borrower is a homeowner");
plt.subplot(1, 2, 2)
base_color = sns.color_palette()[0]
sns.countplot(data=df_loanscp,x= 'IsBorrowerHomeowner',color=base_color);
plt.title('Borrower-Homeowner Status Distribution')
plt.xlabel('Borrower homeownership status');
The majority of the borrowers own a home, though the difference between the two groups is negligible at less than 1%.
Bivariate Exploration
For this section we will investigate relationships between pairs of variables in the data. The variables covered here have been introduced in some fashion in the previous section (univariate exploration).
- What is the relationship between Employment Status and Loan Amount Given?
cate_order = df_loanscp.EmploymentStatus.value_counts().index
plt.figure(figsize = [18, 12])
bcolor = sns.color_palette()[0]
sns.boxplot(data=df_loanscp,x='LoanOriginalAmount',y='EmploymentStatus',color=bcolor, order=cate_order)
plt.title('Employment Status & Loan Original Amount Distribution')
plt.xlabel('Loan Original Amount')
plt.ylabel('Employment Status');
Being employed gives a borrower access to larger loan amounts.
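To put a number behind the boxplot, an optional check is to compare median loan amounts across employment statuses:
# Median loan amount per employment status, largest first
df_loanscp.groupby('EmploymentStatus')['LoanOriginalAmount'].median().sort_values(ascending=False)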
- What is the relationship between credit grade and loan status?
# Group the loans by creditgrade and loan status
loan_gradestatus = df_loanscp.groupby(['CreditGrade', 'LoanStatus']).count()['LoanOriginationDate']
loan_gradestatus = loan_gradestatus.reset_index()
# Rename the loanoriginaldate column and pivot the columns
# Fill missing values with 0
loan_gradestatus = loan_gradestatus.rename(columns={'LoanOriginationDate': 'count'})
loan_gradestatus = loan_gradestatus.pivot(index = 'CreditGrade', columns = 'LoanStatus', values = 'count')
loan_gradestatus.fillna(0, inplace=True)
# Plot the data using a heatmap
# A heatmap can be used to show relationships between categorical variables
# Note similar colors across the row depict there is a relationship between variables and vice versa
plt.figure(figsize = [18, 10])
sns.heatmap(loan_gradestatus, cmap = 'rocket_r', annot = True, fmt = ".0f")
plt.title('Credit Grade & Loan Status Distribution')
plt.ylabel("Credit Grade")
plt.xlabel("Loan Status");
Among loans that had a credit grade, Grade C loans had the highest completed and defaulted counts, while NC loans had the lowest. However, looking at all loans with or without a credit grade, loans with No Grade cumulatively had the highest counts across all the loan statuses apart from Cancelled.
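As an aside, the groupby, rename, pivot and fillna steps above can be collapsed into a single call with pd.crosstab; an equivalent alternative for building the same counts table:
# Cross-tabulate CreditGrade against LoanStatus in one step
loan_gradestatus = pd.crosstab(df_loanscp['CreditGrade'], df_loanscp['LoanStatus'])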
Multivariate Exploration
- What is the relationship between these three variables: MonthlyLoanPayment, LoanStatus and ProsperScore?
# MonthlyLoanPayment and LoanStatus vs ProsperScore
plt.figure(figsize = [18, 10])
plt.scatter(data=df_loanscp,x='MonthlyLoanPayment',y = 'LoanStatus',c='ProsperScore',cmap = 'viridis_r')
plt.colorbar(label = 'ProsperScore');
plt.xlabel('MonthlyLoanPayment')
plt.ylabel('LoanStatus')
plt.title('MonthlyLoanPayment & LoanStatus vs ProsperRating');
Completed and current loans seemed to carry the lowest risk, while the rest, which were mostly past due, carried higher risk.
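A quick numeric check of this reading (my own addition) is to compare the average ProsperScore across loan statuses:
# Average risk score per loan status, lowest (riskiest) first
df_loanscp.groupby('LoanStatus')['ProsperScore'].mean().sort_values()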
You can have a look at the full project on my GitHub.
The more you know, the more you find out that you know nothing.
Exploring the Possibilities: Let's Collaborate on Your Next Data Venture! You can check me out at this Link