Oluwafunmilola Obisesan

Loan Repayment Prediction using Machine Learning.

Machine learning (ML) is a subset of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.
Machine learning algorithms use historical data as input to predict new output values.
If you’re looking to read more about machine learning, check out this article I wrote for FreeCodeCamp.

In this project, I developed a machine learning model that predicts whether an individual will pay back a loan. I used two classification algorithms: Decision Tree and Random Forest.

I used both algorithms so I could compare their performance on the dataset.

Random Forest is often preferred over a single Decision Tree, particularly on high-dimensional data. It is an ensemble method: many decision trees are trained on random samples of the data and features, and their predictions are combined, which usually improves predictive accuracy.

Using Random Forest in this project reflects not just my personal preference but a data-driven approach: combining many trees reduces overfitting and makes classification more robust on real-world, diverse datasets.
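
To make the "combining trees" idea concrete, here is a minimal sketch of how scikit-learn's RandomForestClassifier combines its trees: the forest's class probabilities are the average of the individual trees' probabilities. The data comes from make_classification and is a synthetic stand-in (an assumption for illustration, not the loan dataset), and the parameter values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the loan dataset)
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# Fit a small forest of 25 trees
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree produces its own class-probability estimate
per_tree_proba = np.array([t.predict_proba(X) for t in forest.estimators_])

# Averaging over the trees reproduces the forest's own estimate
averaged = per_tree_proba.mean(axis=0)
forest_proba = forest.predict_proba(X)
print(np.allclose(averaged, forest_proba))
```

Because each tree sees a different random sample of the rows and features, their individual errors tend to differ, and averaging smooths them out.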

Data Description
The dataset is publicly available lending data that shows the profiles of people who applied for loans and whether or not they paid them back.

Here is what the columns of the dataset represent:

  1. credit.policy: 1 if the customer meets the credit underwriting criteria, and 0 otherwise.
  2. purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
  3. int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged to be more risky are assigned higher interest rates.
  4. installment: The monthly installments owed by the borrower if the loan is funded.
  5. The natural log of the self-reported annual income of the borrower.
  6. dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
  7. fico: The FICO credit score of the borrower.
  8. The number of days the borrower has had a credit line.
  9. revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
  10. revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
  11. inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
  12. delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
  13. pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).


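1. Importing the necessary libraries

import pandas as pd                # data loading and manipulation
import numpy as np                 # numerical operations
import matplotlib.pyplot as plt    # plotting
import seaborn as sns              # statistical visualisation
# Jupyter magic: render plots inside the notebook
%matplotlib inline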

2. Loading in the dataset

loan_dataset = pd.read_csv("loan-data.csv")

A peep into what the dataset looks like


Checking the number of rows and columns present in the dataset
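
The peek and the row and column count can be sketched with head() and shape. The code for these two steps is not shown here, so this is an assumption about one common way to do them, run on a tiny made-up stand-in for loan_dataset (only the column names mirror the real data):

```python
import pandas as pd

# Tiny hypothetical stand-in for loan_dataset; the values are made up
loan_dataset = pd.DataFrame({
    "credit.policy": [1, 1, 0],
    "purpose": ["credit_card", "debt_consolidation", "all_other"],
    "int.rate": [0.1189, 0.1071, 0.1357],
    "fico": [737, 707, 682],
    "not.fully.paid": [0, 0, 1],
})

first_rows = loan_dataset.head()       # up to the first five rows
n_rows, n_cols = loan_dataset.shape    # (number of rows, number of columns)
print(first_rows)
print(n_rows, n_cols)
```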


3. Data Cleaning
It is essential to clean and preprocess any dataset before building a model.
Data cleaning involves handling duplicates, null values, outliers, and other errors found in the dataset.

Checking for missing values
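
A minimal sketch of this kind of check, assuming pandas' isnull() and duplicated() are used (the code for this step is not shown here). The frame below is a hypothetical stand-in with one missing value and one duplicate row planted on purpose, so both checks have something to find:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in: one NaN and one repeated row, planted deliberately
df = pd.DataFrame({
    "fico": [737, 707, np.nan, 707],
    "dti": [19.48, 14.29, 11.63, 14.29],
})

missing_per_column = df.isnull().sum()   # number of NaN values in each column
n_duplicates = df.duplicated().sum()     # rows identical to an earlier row
print(missing_per_column)
print(n_duplicates)
```

On a clean dataset, every count in missing_per_column is zero.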


The dataset has no missing values.

4. Encoding Categorical Data
Categorical data has to be converted into numerical form before it can be used to train a model.
The “purpose” column is categorical, so I converted it into numeric dummy (one-hot) columns with pd.get_dummies.

cat_feats = ['purpose']  # categorical columns to encode
loan = pd.get_dummies(loan_dataset, columns=cat_feats, drop_first=True)
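
To see what pd.get_dummies with drop_first=True produces, here is a small sketch on a hypothetical three-row frame (the values are made up; only the column names mirror the dataset). Each category becomes its own indicator column, and the first category in sorted order is dropped because it is implied when all the other indicators are zero.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
df = pd.DataFrame({
    "purpose": ["credit_card", "all_other", "educational"],
    "fico": [737, 707, 682],
})

# "all_other" sorts first, so it becomes the dropped baseline category
encoded = pd.get_dummies(df, columns=["purpose"], drop_first=True)
print(list(encoded.columns))
```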

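5. Extracting dependent and independent variables and splitting the data

from sklearn.model_selection import train_test_split
X = loan.drop('not.fully.paid', axis=1)  # features: every column except the target
y = loan['not.fully.paid']               # target column
# Hold out 20% of the rows as a test set; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=101)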
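
As a quick illustration of what test_size=0.20 does, the sketch below splits a hypothetical array of 100 samples (an assumption, not the loan data) and checks the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 100 samples, 3 features, binary target
X = np.arange(300).reshape(100, 3)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=101
)
# 80% of the rows go to training, 20% to testing
print(X_train.shape, X_test.shape)
```

Fixing random_state means the same rows land in the test set on every run, which keeps accuracy figures comparable between models.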

6. Fitting the Decision Tree Model

from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)  # learn the tree from the training split

7. Checking the accuracy of the Decision Tree model using the test data

from sklearn.metrics import accuracy_score
y_pred = tree.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score {:.2f}%".format(accuracy * 100))
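
For reference, accuracy_score is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (not taken from the loan data):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels purely for illustration
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_hat = np.array([0, 1, 1, 0, 0, 1, 0, 0])

manual = (y_true == y_hat).mean()         # 6 of 8 labels match
library = accuracy_score(y_true, y_hat)   # same calculation via scikit-learn
print("Accuracy Score: {:.2f}%".format(library * 100))
```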

The Decision Tree model gave an accuracy score of 73.38%.
Not bad!

8. Fitting the Random Forest

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)  # ensemble of 100 trees
rfc.fit(X_train, y_train)

9. Checking the accuracy of the Random Forest Model using the test data

y_pred = rfc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score: {:.2f}%".format(accuracy * 100))

As expected, the Random Forest model outperformed the Decision Tree model with an accuracy score of 84.86%.

These results show that Random Forest outperforms a single Decision Tree on this particular problem, highlighting the value of ensemble techniques for improving model performance and generalization to unseen data.
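
A single train/test split can be noisy, so one way to double-check a comparison like this is k-fold cross-validation. This is not part of the original project. It is a sketch of a swapped-in technique, run on synthetic stand-in data from make_classification with arbitrary parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replace with X and y from the loan data)
X, y = make_classification(n_samples=400, n_features=12, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 folds for each model
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print("{}: {:.2f}% mean accuracy over 5 folds".format(name, cv_means[name] * 100))
```

If the gap between the two models holds across folds, the comparison is less likely to be an artifact of one lucky split.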

That’s it for this project!

For the entire code, check my GitHub profile:

Thank you for reading!
