Introduction
In today's world, data has become one of the most important assets for businesses, organisations, governments and even individuals. It has become the norm to hear the phrase "data-driven decision-making". The world is generating data at unprecedented levels: Statista projects that the amount of data created, consumed and stored in 2025 will be about 180 zettabytes, up from 64.2 zettabytes in 2020.
Exploratory Data Analysis (EDA) is an approach to understanding your data in depth. It is a critical first step in any data analysis project, as it provides a foundation for further analysis and can help identify potential problems or biases in the data, detect outliers or anomalous events, or reveal interesting relationships among variables.
EDA is an iterative rather than a one-step process: it involves visualizing and summarizing the data in different ways to uncover hidden insights before making any assumptions.
Python is a go-to programming language for data analysis and EDA.
In this article, we will go through the EDA process for a data project I worked on. You can have a look at the project on my GitHub.
Types of EDA
There are four primary types of EDA:
- Univariate non-graphical: This is the simplest form of data analysis, where the data being analysed is just a single variable. The main purpose here is to describe the data and find patterns within it; it does not deal with causes or relationships.
- Univariate graphical: Graphical methods provide a fuller picture of the data. Common types include: histograms, box plots and stem-and-leaf plots.
- Multivariate non-graphical: This combines multiple variables and tries to show the relationship between two or more variables of the data through cross-tabulation or statistics.
- Multivariate graphical: This is the graphical version of the multivariate non-graphical. Most used graphs include bar plot/chart, scatter plot, multivariate chart, run chart, bubble chart and heat map.
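To make these four types concrete, here is a minimal sketch on a small, hypothetical toy DataFrame (illustration only, not the project data), showing one technique from each type:
# Illustration only: one technique from each type of EDA on toy data
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
toy = pd.DataFrame({'amount': rng.normal(8000, 2500, 500),
                    'term': rng.choice([12, 36, 60], 500)})
toy['amount'].describe()                     # univariate non-graphical
plt.hist(toy['amount'], bins=30)             # univariate graphical
toy.groupby('term')['amount'].mean()         # multivariate non-graphical
sns.boxplot(data=toy, x='term', y='amount')  # multivariate graphical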
Steps Involved in EDA
1) Data Gathering - This is the process of finding, collecting and loading data into the analytical system.
2) Data Cleaning - To get quality findings, the data used must be of high quality. To achieve this, the data may need unwanted variables and irregularities removed, re-indexing and reformatting, feature engineering, etc.
3) Univariate Analysis - This involves analysing the data one variable at a time. A single variable refers to a single feature/column.
4) Bivariate Analysis - This involves analysing two variables together; for example, how one feature affects another.
Loan Analysis with EDA
You can access the dataset used here from Kaggle, Prosper Loan Dataset. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. See this data dictionary to understand the dataset's variables.
The main task here is to have a look at the basic structure of the loans given. This means looking at the amounts given, the interest and fees, the loan terms, and the borrowers' states and employment statuses, among other things. We will therefore look at features such as CreditGrade, Term, LoanStatus, BorrowerState, EmploymentStatus, LP_InterestandFees, LoanOriginalAmount and LoanOriginationDate, amongst others.
Please access the variable definitions of the dataset here.
1. Data Loading
Load the dataset into a Jupyter notebook using Python libraries. First, we import the necessary libraries for this project and then load the dataset.
# Import needed Python Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')
Load the ProsperLoan Dataset:
# Load the dataset and display first 5 rows
df_loans = pd.read_csv("prosperLoanData.csv")
# Create a working copy (df_loanscp) used in the exploration and cleaning cells below
df_loanscp = df_loans.copy()
df_loans.head()
2. Data Exploration
After loading the dataset, the next step is to explore the data to get an overview by examining the structure, identifying variables and checking for missing or incorrect values.
# Check the dataset's datatypes and nulls
df_loans.info()
Check the shape of the dataset (rows, columns):
# Check how many columns and rows the dataset contains
df_loans.shape
(113937, 81)
Check for the missing values:
# Check the count of missing values by column
# Show the first 10 columns of the result
missing_values = df_loans.isna().sum()
missing_values[:10]
ListingKey 0
ListingNumber 0
ListingCreationDate 0
CreditGrade 84984
Term 0
LoanStatus 0
ClosedDate 58848
BorrowerAPR 25
BorrowerRate 0
LenderYield 0
dtype: int64
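Since most of the 81 columns have no missing values at all, it can help to narrow the output down to just the columns that do. This small addition is not part of the original notebook:
# Show only the columns that actually contain missing values, largest first
missing_only = missing_values[missing_values > 0].sort_values(ascending=False)
missing_only.head(10)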
Using a for-loop, print the column headers
# Check for the header-columns
for col in df_loans:
    print(col)
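The same headers can also be pulled out without a loop; a one-line equivalent:
# Equivalent one-liner: list all column names
print(df_loans.columns.tolist())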
Check for duplicates in the dataset
# Check for duplicates
df_loans.duplicated().sum()
0
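There are no duplicate rows here, but if the count had been non-zero, a typical follow-up (sketch only) would be to drop them:
# Drop exact duplicate rows, keeping the first occurrence of each
df_loans = df_loans.drop_duplicates(keep='first')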
Check for the frequency count of loan amounts given
# Check the count of LoanOriginalAmount
df_loanscp.LoanOriginalAmount.value_counts()
4000 14333
15000 12407
10000 11106
5000 6990
2000 6067
...
7765 1
5652 1
9746 1
15889 1
12282 1
Name: LoanOriginalAmount, Length: 2468, dtype: int64
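With 2,468 distinct loan amounts the raw counts are hard to read, so an optional extra step (not in the original notebook) is to bin the amounts before counting:
# Bin loan amounts into $5,000-wide ranges and count loans per bin
bins = np.arange(0, 45000, 5000)
pd.cut(df_loanscp['LoanOriginalAmount'], bins=bins).value_counts().sort_index()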
Using "describe()" check the statistical summary of the LoanOriginalAmount
# Statistical view of LoanOriginalAmount
df_loanscp.LoanOriginalAmount.describe()
count 113937.00000
mean 8337.01385
std 6245.80058
min 1000.00000
25% 4000.00000
50% 6500.00000
75% 12000.00000
max 35000.00000
Name: LoanOriginalAmount, dtype: float64
Using "describe()" check the statistical summary of the LP_InterestandFees
# Statistical view of LP_InterestandFees
df_loanscp.LP_InterestandFees.describe()
count 113937.000000
mean 1077.542901
std 1183.414168
min -2.349900
25% 274.870000
50% 700.840100
75% 1458.540000
max 15617.030000
Name: LP_InterestandFees, dtype: float64
Using "value_counts()" check the count of loan terms
# Check count of Loan terms
df_loanscp.Term.value_counts()
36 87778
60 24545
12 1614
Name: Term, dtype: int64
Check the loan risk score and frequency count
# Checking the custom risk score on loans taken
# The score ranges from 1-10, with 10 being the best, or lowest risk score.
df_loanscp.ProsperScore.value_counts()
4.0 12595
6.0 12278
8.0 12053
7.0 10597
5.0 9813
3.0 7642
9.0 6911
2.0 5766
10.0 4750
11.0 1456
1.0 992
Name: ProsperScore, dtype: int64
Check the borrowers' occupations
# Breakdown of borrower occupations
df_loanscp.Occupation.value_counts()
Other 28617
Professional 13628
Computer Programmer 4478
Executive 4311
Teacher 3759
...
Dentist 68
Student - College Freshman 41
Student - Community College 28
Judge 22
Student - Technical School 16
Name: Occupation, Length: 67, dtype: int64
3. Data Cleaning
The next step is to clean the data. This involves handling missing or incorrect values, removing duplicates, and transforming the data into a format suitable for analysis.
- Convert Date Columns to Datetime Datatype
# Convert multiple columns to datetime using the .apply method
df_loanscp[['ClosedDate', 'DateCreditPulled', 'LoanOriginationDate']] = df_loanscp[['ClosedDate', 'DateCreditPulled', 'LoanOriginationDate']].apply(pd.to_datetime)
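A quick optional sanity check confirms the conversion took effect:
# Confirm the three columns are now datetime64 rather than object
df_loanscp[['ClosedDate', 'DateCreditPulled', 'LoanOriginationDate']].dtypes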
- To save on processing time, I will drop columns from the copied dataset that will not be needed in this phase.
# First create a list of all columns
all_cols = []
for col in df_loanscp:
    all_cols.append(col)
print(all_cols)
['ListingKey', 'ListingNumber', 'ListingCreationDate', 'CreditGrade', 'Term', 'LoanStatus', 'ClosedDate', 'BorrowerAPR', 'BorrowerRate', 'LenderYield', 'EstimatedEffectiveYield', 'EstimatedLoss', 'EstimatedReturn', 'ProsperRating (numeric)', 'ProsperRating (Alpha)', 'ProsperScore', 'ListingCategory (numeric)', 'BorrowerState', 'Occupation', 'EmploymentStatus', 'EmploymentStatusDuration', 'IsBorrowerHomeowner', 'CurrentlyInGroup', 'GroupKey', 'DateCreditPulled', 'CreditScoreRangeLower', 'CreditScoreRangeUpper', 'FirstRecordedCreditLine', 'CurrentCreditLines', 'OpenCreditLines', 'TotalCreditLinespast7years', 'OpenRevolvingAccounts', 'OpenRevolvingMonthlyPayment', 'InquiriesLast6Months', 'TotalInquiries', 'CurrentDelinquencies', 'AmountDelinquent', 'DelinquenciesLast7Years', 'PublicRecordsLast10Years', 'PublicRecordsLast12Months', 'RevolvingCreditBalance', 'BankcardUtilization', 'AvailableBankcardCredit', 'TotalTrades', 'TradesNeverDelinquent (percentage)', 'TradesOpenedLast6Months', 'DebtToIncomeRatio', 'IncomeRange', 'IncomeVerifiable', 'StatedMonthlyIncome', 'LoanKey', 'TotalProsperLoans', 'TotalProsperPaymentsBilled', 'OnTimeProsperPayments', 'ProsperPaymentsLessThanOneMonthLate', 'ProsperPaymentsOneMonthPlusLate', 'ProsperPrincipalBorrowed', 'ProsperPrincipalOutstanding', 'ScorexChangeAtTimeOfListing', 'LoanCurrentDaysDelinquent', 'LoanFirstDefaultedCycleNumber', 'LoanMonthsSinceOrigination', 'LoanNumber', 'LoanOriginalAmount', 'LoanOriginationDate', 'LoanOriginationQuarter', 'MemberKey', 'MonthlyLoanPayment', 'LP_CustomerPayments', 'LP_CustomerPrincipalPayments', 'LP_InterestandFees', 'LP_ServiceFees', 'LP_CollectionFees', 'LP_GrossPrincipalLoss', 'LP_NetPrincipalLoss', 'LP_NonPrincipalRecoverypayments', 'PercentFunded', 'Recommendations', 'InvestmentFromFriendsCount', 'InvestmentFromFriendsAmount', 'Investors']
# Drop the unneeded columns in a single call
df_loanscp.drop(['ListingCreationDate','ProsperRating (Alpha)','LenderYield','ListingCategory (numeric)','CurrentlyInGroup','DateCreditPulled','CreditScoreRangeLower','CreditScoreRangeUpper','FirstRecordedCreditLine','CurrentCreditLines','GroupKey','ProsperPrincipalBorrowed','ProsperPrincipalOutstanding','EstimatedEffectiveYield','EstimatedLoss','EstimatedReturn','ProsperRating (numeric)', 'TotalProsperLoans','TotalProsperPaymentsBilled','OnTimeProsperPayments','ProsperPaymentsLessThanOneMonthLate','ProsperPaymentsOneMonthPlusLate','ListingKey'], axis =1, inplace = True)
df_loanscp.head()
Please note, the columns can also be dropped using a for-loop, as shown below.
# Define the list of columns to drop
columns_to_drop = ['ListingCreationDate','ProsperRating (Alpha)','LenderYield','ListingCategory (numeric)','CurrentlyInGroup','DateCreditPulled','CreditScoreRangeLower','CreditScoreRangeUpper','FirstRecordedCreditLine','CurrentCreditLines','GroupKey','ProsperPrincipalBorrowed','ProsperPrincipalOutstanding','EstimatedEffectiveYield','EstimatedLoss','EstimatedReturn','ProsperRating (numeric)', 'TotalProsperLoans','TotalProsperPaymentsBilled','OnTimeProsperPayments','ProsperPaymentsLessThanOneMonthLate','ProsperPaymentsOneMonthPlusLate','ListingKey']
# Drop the columns one at a time using a loop
for col in columns_to_drop:
    df_loanscp.drop(col, axis=1, inplace=True)
# Display the updated DataFrame
df_loanscp.head()
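Note that if both the single-call drop above and this loop are run in the same notebook, the second attempt will raise a KeyError because the columns are already gone. One defensive variation (my own addition, not the original code) is to ignore labels that are missing:
# Drop the columns in one call, silently skipping any that were already removed
df_loanscp = df_loanscp.drop(columns=columns_to_drop, errors='ignore')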
- Feature Engineering based on existing data.
# Create a year column based on the LoanOriginationDate
df_loanscp['LoanOriginationYear'] = pd.DatetimeIndex(df_loanscp['LoanOriginationDate']).year
df_loanscp['LoanOriginationYear']
0 2007
1 2014
2 2007
3 2012
4 2013
...
113932 2013
113933 2011
113934 2013
113935 2011
113936 2014
Name: LoanOriginationYear, Length: 113937, dtype: int64
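Since LoanOriginationDate is already a datetime column, the same year feature can also be created with the .dt accessor; an equivalent alternative:
# Equivalent feature engineering using the .dt accessor
df_loanscp['LoanOriginationYear'] = df_loanscp['LoanOriginationDate'].dt.year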
Fill the missing values in the CreditGrade column with "No Grade"
# Filling the NaN values in the CreditGrade Column with "No Grade"
df_loanscp['CreditGrade'] = df_loanscp['CreditGrade'].fillna('No Grade')
Check for missing values after filling in the values
# Checking missing values in the CreditGrade column
missing_credits = df_loanscp.CreditGrade.isna().sum()
missing_credits
0
4. Data Analysis & Visualization
After cleaning the dataset, the next step is to analyse the data and create visual representations of it using graphs and other visual techniques. In this case, both of these steps are combined, but in other cases the two steps are separated. Sometimes an analyst/scientist may need to apply statistical and machine learning techniques to the data to identify patterns, relationships, and anomalies before visualizing the results.
Univariate Exploration
Using the "Question-Visualization-Observations" framework throughout the exploration. This framework involves asking a question from the data, creating a visualization to find answers, and then recording observations after each visualisation.
- What was the employment status for the borrowers?
bcolor = sns.color_palette()[0]
cate_order = df_loanscp.EmploymentStatus.value_counts().index
plt.figure(figsize=[10, 8])
sns.countplot(data=df_loanscp,y='EmploymentStatus',color=bcolor, order=cate_order)
plt.title('Borrowers-Employment Distribution')
plt.xlabel('Count of Borrowers');
The majority of the borrowers were employed, while retirees took out the fewest loans.
- Which year had the most and least count of loans given?
plt.figure(figsize = [20, 8])
bcolor = sns.color_palette()[0]
sns.countplot(data=df_loanscp, x='LoanOriginationYear', color=bcolor)
plt.title('Loan Origin Dates(Year) Distribution')
plt.xlabel('Year Loan was Given');
2013 had the highest number of loans given while 2005 had the least.
- Is the borrower a Homeowner?
homeowner_count = df_loanscp.IsBorrowerHomeowner.value_counts()
homeowner_count
True 57478
False 56459
Name: IsBorrowerHomeowner, dtype: int64
homeowner_count = df_loanscp.IsBorrowerHomeowner.value_counts()
plt.figure(figsize=[25,10])
plt.subplot(1, 2, 1)
plt.pie(homeowner_count, labels = homeowner_count.index, startangle = 90, counterclock = False, autopct='%1.1f%%', textprops={'color':"b"})
plt.title("Breakdown of whether a borrower is a homeowner");
plt.subplot(1, 2, 2)
base_color = sns.color_palette()[0]
sns.countplot(data=df_loanscp,x= 'IsBorrowerHomeowner',color=base_color);
plt.title('Borrower-Homeowner Status Distribution')
plt.xlabel('Borrower homeownership status');
The majority of the borrowers own a home, though the difference between the two groups is negligible at less than 1%.
Bivariate Exploration
For this section we will investigate relationships between pairs of variables in the data. The variables covered here have been introduced in some fashion in the previous section (univariate exploration).
- What is the relationship between Employment Status and Loan Amount Given?
cate_order = df_loanscp.EmploymentStatus.value_counts().index
plt.figure(figsize = [18, 12])
bcolor = sns.color_palette()[0]
sns.boxplot(data=df_loanscp,x='LoanOriginalAmount',y='EmploymentStatus',color=bcolor, order=cate_order)
plt.title('Employment Status & Loan Original Amount Distribution')
plt.xlabel('Loan Original Amount')
plt.ylabel('Employment Status');
Being employed gives a borrower access to larger loan amounts.
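To put a number behind the boxplot, an optional check is to compare median loan amounts across employment statuses:
# Median loan amount per employment status, largest first
df_loanscp.groupby('EmploymentStatus')['LoanOriginalAmount'].median().sort_values(ascending=False)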
- What is the relationship between credit grade and loan status?
# Group the loans by creditgrade and loan status
loan_gradestatus = df_loanscp.groupby(['CreditGrade', 'LoanStatus']).count()['LoanOriginationDate']
loan_gradestatus = loan_gradestatus.reset_index()
# Rename the loanoriginaldate column and pivot the columns
# Fill missing values with 0
loan_gradestatus = loan_gradestatus.rename(columns={'LoanOriginationDate': 'count'})
loan_gradestatus = loan_gradestatus.pivot(index = 'CreditGrade', columns = 'LoanStatus', values = 'count')
loan_gradestatus.fillna(0, inplace=True)
# Plot the data using a heatmap
# A heatmap can be used to show relationships between categorical variables
# Note similar colors across the row depict there is a relationship between variables and vice versa
plt.figure(figsize = [18, 10])
sns.heatmap(loan_gradestatus, cmap = 'rocket_r', annot = True, fmt = ".0f")
plt.title('Credit Grade & Loan Status Distribution')
plt.ylabel("Credit Grade")
plt.xlabel("Loan Status");
Among loans that had a credit grade, Grade C loans had the highest completed and defaulted counts, while NC loans had the lowest. However, looking at all loans with or without a credit grade, loans with No Grade cumulatively had the highest counts across all the loan statuses apart from Cancelled.
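As an aside, the groupby, rename, pivot and fillna steps above can be collapsed into a single call with pd.crosstab; an equivalent alternative for building the same counts table:
# Cross-tabulate CreditGrade against LoanStatus in one step
loan_gradestatus = pd.crosstab(df_loanscp['CreditGrade'], df_loanscp['LoanStatus'])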
Multivariate Exploration
- What is the relationship between these three variables: MonthlyLoanPayment, LoanStatus and ProsperScore?
# MonthlyLoanPayment and LoanStatus vs ProsperScore
plt.figure(figsize = [18, 10])
plt.scatter(data=df_loanscp,x='MonthlyLoanPayment',y = 'LoanStatus',c='ProsperScore',cmap = 'viridis_r')
plt.colorbar(label = 'ProsperScore');
plt.xlabel('MonthlyLoanPayment')
plt.ylabel('LoanStatus')
plt.title('MonthlyLoanPayment & LoanStatus vs ProsperRating');
Completed and current loans seemed to carry the lowest risk, while the rest, which were mostly past due, carried higher risk.
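A quick numeric check of this reading (my own addition) is to compare the average ProsperScore across loan statuses:
# Average risk score per loan status, lowest (riskiest) first
df_loanscp.groupby('LoanStatus')['ProsperScore'].mean().sort_values()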
You can have a look at the full project on my GitHub.
The more you know, the more you find out that you know nothing.
Exploring the Possibilities: Let's Collaborate on Your Next Data Venture! You can check me out at this Link