Data science plays a major role in banking today. Banks all over the world analyze data to provide better experiences to their customers and to reduce risk.
In this post, you will learn about the importance and role of data science in the banking sector and how it raises a firm's earnings potential by reducing risk.
Here are a few applications across banking:
- Credit Decisions
- Risk Assessment
- Fraud Prevention
- Process Automation
According to a case study, JP Morgan Chase & Co. has applied the above use cases in its business operations and management. You can find more information in the case study at https://www.superiordatascience.com/jpmcasestudy.html
A PayPal use case: Here is another example, from PayPal. PayPal uses an AI-based model to decide which payment options (linked bank accounts and credit/debit cards) to list when it deducts money for a transaction. With PayPal, you may notice that your bank account sometimes does not show up as a payment method while you are performing a transaction. This tends to happen when you transfer funds under the Friends and Family or Goods and Services category. The reason is that an AI-based model ranks your linked payment options based on a risk assessment of your previous transactions.
Note: Across Europe, bank transactions are performed through a payment scheme called the SEPA Direct Debit mandate. To perform these transactions, PayPal has no way to check whether the user's bank account holds enough funds. It only learns whether the payment succeeded or was declined after it registers the transaction with the user's bank, which usually takes a couple of working days. For this reason, PayPal effectively has to extend credit to the user when a bank account is used, which can be high risk depending on the transaction amount. So, during risk assessment, your linked bank account is more likely to receive a high risk score than credit or debit card transactions.
You can look into this thread, started by a few PayPal users, for more information on this issue: https://www.paypal-community.com/t5/Transactions/Linked-bank-account-not-showing-as-payment-method/td-p/1808787
You can also find more information on how PayPal uses AI in its business at the following link: https://www.paypal.com/us/brc/article/enterprise-solutions-paypal-machine-learning-stop-fraud
A practical example with Python:
For a practical understanding, let's look at an example of loan eligibility prediction. We will use a publicly available dataset from Kaggle. We start with data processing, which includes handling missing data and data analysis, followed by training and testing machine learning models.
Loan eligibility identification is one of the most challenging problems in the banking sector. An applicant's eligibility for a loan depends on several factors such as credit history, salary, loan amount requested, repayment tenure, and a few others. To solve this problem, we train machine learning models on a sample of historical records and use them to predict outcomes for new applicants.
Steps Involved:
- Dataset Information
- Loading data
- Dealing with Missing Values
- Adding extra features
- Exploratory Data Analysis
- Correlation matrix and Outliers Detection
- Encoding Categorical to numerical data
- Model training and evaluation
- Conclusion
Here we use the pandas, matplotlib, seaborn, and scikit-learn modules for data processing and model development.
Dataset Information:
The dataset contains 614 records with the following 13 columns:
'Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'
Loading data:
You can load the data using the following code snippet:
import pandas as pd
import numpy as np                 # used later for the scores plot
import matplotlib.pyplot as plt
import seaborn as sns              # used later for the correlation heatmap
import warnings
warnings.filterwarnings("ignore")
data = pd.read_csv('train.csv')
Dealing with Missing Values:
After loading the dataset, we need to check whether any missing values exist. Missing values can be handled in different ways: we can drop the rows containing them (if the dataset is large enough for model training), or we can fill them using a statistic such as the mean, median, or mode, or even predict them with a clustering or machine learning model. The right choice depends on the type of variable and the amount of data available.
print(data.isnull().sum())
This gives us the count of missing values in each column of our dataframe.
Output:
Loan_ID 0
Gender 13
Married 3
Dependents 15
Education 0
Self_Employed 32
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 22
Loan_Amount_Term 14
Credit_History 50
Property_Area 0
Loan_Status 0
dtype: int64
In our dataset, Credit_History, Self_Employed, Dependents, Loan_Amount_Term, Gender, and Married are categorical columns, so we fill their missing values with the column mode, while LoanAmount is numerical, so we fill its missing values with the column median.
We can perform this operation with the following code snippets:
data['Gender'] = data['Gender'].fillna(data['Gender'].dropna().mode().values[0])
data['Married'] = data['Married'].fillna(data['Married'].dropna().mode().values[0])
data['Dependents'] = data['Dependents'].fillna(data['Dependents'].dropna().mode().values[0])
data['Self_Employed'] = data['Self_Employed'].fillna(data['Self_Employed'].dropna().mode().values[0])
data['LoanAmount'] = data['LoanAmount'].fillna(data['LoanAmount'].dropna().median())
data['Loan_Amount_Term'] = data['Loan_Amount_Term'].fillna(data['Loan_Amount_Term'].dropna().mode().values[0])
data['Credit_History'] = data['Credit_History'].fillna(data['Credit_History'].dropna().mode().values[0])
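As a side note, scikit-learn's SimpleImputer can perform the same fills in a more compact way. This is just an alternative sketch, assuming the same mode/median strategy per column as above:
from sklearn.impute import SimpleImputer
# String columns: fill missing values with the most frequent value (mode)
obj_cols = ['Gender', 'Married', 'Dependents', 'Self_Employed']
data[obj_cols] = SimpleImputer(strategy='most_frequent').fit_transform(data[obj_cols])
# Numeric columns: mode for the discrete ones, median for LoanAmount
data[['Loan_Amount_Term', 'Credit_History']] = SimpleImputer(strategy='most_frequent').fit_transform(data[['Loan_Amount_Term', 'Credit_History']])
data[['LoanAmount']] = SimpleImputer(strategy='median').fit_transform(data[['LoanAmount']])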
Adding extra features:
Adding extra features based on insights from data analysis can improve model accuracy.
Based on our dataset and goal, we add two extra columns: Total_Income and avg_income_met.
- Total_Income is calculated by adding ApplicantIncome and CoapplicantIncome for each applicant/record.
- avg_income_met is a 1/0 flag indicating whether the applicant's Total_Income is greater than the average income of all applicants with Loan_Status = Y (Yes), as sketched below.
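Here is a minimal sketch of how these two columns can be derived with pandas (the column names follow our dataset; the exact construction may differ from the original notebook):
# Total income of applicant and co-applicant
data['Total_Income'] = data['ApplicantIncome'] + data['CoapplicantIncome']
# Average total income of applicants whose loan was approved (Loan_Status == 'Y')
approved_avg_income = data.loc[data['Loan_Status'] == 'Y', 'Total_Income'].mean()
# 1 if the applicant's total income exceeds that average, else 0
data['avg_income_met'] = (data['Total_Income'] > approved_avg_income).astype(int)
print(data[['Total_Income', 'avg_income_met']].head())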
Exploratory Data Analysis:
A few data analysis steps were performed on the dataset to get insights into the data.
From the resulting graphs, the following observations can be noted:
- Applicants with fewer dependents are more likely to get a loan.
- More applications come from applicants in semi-urban areas.
- Graduates are more likely to get loans.
- Credit history is mostly positive for applicants with an average income.
- Married people are more likely to apply for loans.
- There are more male applicants than female applicants.
- Self-employed applicants are less likely to get a loan.
The EDA graphs can be generated individually using the following code snippets:
data['Gender'].value_counts(normalize=True).plot.bar(title='Gender')
plt.show()
data['Married'].value_counts(normalize=True).plot.bar(title='Married')
plt.show()
data['Self_Employed'].value_counts(normalize=True).plot.bar(title='Self_Employed')
plt.show()
data['Credit_History'].value_counts(normalize=True).plot.bar(title='Credit_History')
plt.show()
# Independent Variable (Ordinal)
data['Dependents'].value_counts(normalize=True).plot.bar(title='Dependents', color= 'cyan',edgecolor='black')
plt.show()
data['Education'].value_counts(normalize=True).plot.bar(title='Education', color= 'cyan',edgecolor='black')
plt.show()
data['Property_Area'].value_counts(normalize=True).plot.bar(title='Property_Area', color= 'cyan',edgecolor='black')
plt.show()
Correlation matrix and Outliers Detection:
- A correlation matrix is a table showing the correlation coefficients between variables. We can remove variables that have little correlation with the target as a dimensionality reduction technique.
- An outlier is a value in a data set that is very different from the other values, that is, a value unusually far from the middle. In most cases, outliers influence the mean, but not the median or the mode.
The correlation matrix and boxplot can be generated using the following code:
corr = data.corr(numeric_only=True)  # compute correlations over the numeric columns only
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
plt.show()
data.boxplot(column='Total_Income', by='Loan_Status')
plt.suptitle("")
plt.show()
From the heatmap, we can observe that the Married and Gender columns are negatively correlated, and that Loan_Status correlates well with Credit_History.
The code above also generates a box plot showing the distribution of Total_Income by Loan_Status.
From the boxplot, we can observe outliers with Total_Income greater than 60000. These should be handled, because extreme values inflate the spread of the data and can distort the correlations. We treat records with Total_Income greater than 30000 as outliers and remove them with the following code snippet:
data = data.drop(data[data.Total_Income > 30000].index)
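If you prefer not to hard-code a cutoff like 30000, a common alternative (not used in the original analysis) is the interquartile range (IQR) rule, sketched below:
# IQR rule: treat values more than 1.5 * IQR above the third quartile as outliers
q1 = data['Total_Income'].quantile(0.25)
q3 = data['Total_Income'].quantile(0.75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr
print('Outliers above', upper_bound, ':', (data['Total_Income'] > upper_bound).sum())
data = data[data['Total_Income'] <= upper_bound]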
Encoding Categorical to numerical data:
The Gender, Married, Education, Property_Area, Self_Employed, and Loan_Status columns contain categorical values, so we convert them to numerical values using the dictionary below.
cat2num = {'Male': 1, 'Female': 2,
'Yes': 1, 'No': 0,
'Graduate': 1, 'Not Graduate': 0,
'Rural': 1, 'Semiurban': 2,'Urban': 3,
'Y': 1, 'N': 0,
'3+': 3}
data = data.applymap(lambda item: cat2num.get(item, item))
Before training the machine learning algorithms, we have to decide which features will be used for training. In our data, Loan_ID has a unique value for each applicant, so we drop this column as it is not useful for model training and prediction.
data.drop('Loan_ID', axis = 1, inplace = True)
In our dataframe, the Dependents column has the values 0, 1, 2, and 3+ stored as strings. The dictionary above already maps '3+' to 3, and pd.to_numeric converts the remaining string values to a numerical datatype.
data['Dependents'] = pd.to_numeric(data['Dependents'])
Now we are ready to train a few machine learning models on the data, make predictions on the test set, and evaluate them with the f1_score metric.
Model training and evaluation:
All the models we train and test are imported from the scikit-learn module.
- Every model is imported from sklearn.
- The .fit() method is used to train a model: model.fit(X_train, y_train), where X_train contains the features after splitting the data and y_train contains the target variable Loan_Status.
- The .predict() method is used for prediction.
- We use f1_score as the evaluation metric to test model performance.
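As a quick refresher (not part of the original analysis), the F1 score is the harmonic mean of precision and recall, and f1_score computes it directly from the true and predicted labels:
from sklearn.metrics import f1_score
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
# precision = 3/4, recall = 3/4  ->  F1 = 2 * (0.75 * 0.75) / (0.75 + 0.75) = 0.75
print(f1_score(y_true, y_pred))  # 0.75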
To achieve the goal, we have selected five classifier models to train and test on our data.
- Support Vector Machine Classifier
- Decision Tree Classifier
- Random Forest Classifier
- K-Nearest Neighbors Classifier
- Naive Bayes Classifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
#splitting data (80% - for training, 20% - for validation)
X_train, X_test, y_train, y_test = train_test_split(data.drop('Loan_Status', axis = 1),
data['Loan_Status'], test_size=0.20, random_state=0)
# SVM classifier
from sklearn import svm
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train,y_train)
svm_prediction = classifier.predict(X_test)
evaluation_svm = f1_score(y_test, svm_prediction)
print('SVM classifier f1_score : ', evaluation_svm)
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
d_tree = DecisionTreeClassifier()
d_tree.fit(X_train, y_train)
d_pred = d_tree.predict(X_test)
evaluation_DT = f1_score(y_test, d_pred)
print('Decision Tree f1_score : ', evaluation_DT)
#Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier()
forest.fit(X_train, y_train)
forest_prediction = forest.predict(X_test)
evaluation_forest = f1_score(y_test, forest_prediction)
print('Random Forest Classifier f1_score : ', evaluation_forest)
#KNN classifier
from sklearn.neighbors import KNeighborsClassifier
KNN_model = KNeighborsClassifier(n_neighbors=3)
KNN_model.fit(X_train, y_train)
KNN_predicted= KNN_model.predict(X_test)
evaluation_KNN = f1_score(y_test, KNN_predicted)
print('KNN f1_score : ', evaluation_KNN)
# Naive Bayes classification
from sklearn.naive_bayes import GaussianNB
NB_model = GaussianNB()
NB_model.fit(X_train, y_train)
NB_predicted= NB_model.predict(X_test)
evaluation_NB = f1_score(y_test, NB_predicted)
print('NB f1_score : ', evaluation_NB)
# Plot the F1 scores of all five models
scores_N = ['svm', 'DT', 'forest', 'KNN', 'NB']
scores = [evaluation_svm, evaluation_DT, evaluation_forest, evaluation_KNN, evaluation_NB]
x = np.array([0, 1, 2, 3, 4])
plt.xticks(x, scores_N)
plt.plot(scores)
plt.show()
For small datasets, you might need the k-fold cross-validation technique to train and evaluate the models.
K-fold is a validation technique in which we split the data into k subsets and repeat the holdout method k times, so that each of the k subsets is used once as the test set while the other k-1 subsets are used for training.
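A minimal sketch of k-fold evaluation with scikit-learn's cross_val_score, shown here for the random forest (any of the five classifiers can be substituted):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X = data.drop('Loan_Status', axis=1)
y = data['Loan_Status']
# 5-fold cross-validation, scored with F1 to match the hold-out evaluation above
cv_scores = cross_val_score(RandomForestClassifier(), X, y, cv=5, scoring='f1')
print('Fold F1 scores:', cv_scores)
print('Mean F1 score :', cv_scores.mean())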
Conclusion:
Model evaluation was performed on 123 samples (the 20% validation split of the dataset).
From the scores plot generated with matplotlib above, we can observe that the Gaussian Naive Bayes classifier achieved the highest F1 score, 89.23%.
Future posts:
- How data science techniques can be used for malware detection on Android systems.
- A programmer's instinct towards data science (in this post, I would like to discuss how programming tricks can be used to perform data-processing tasks, model development, and training with minimal knowledge of data science concepts).
If you are interested in how data science can be used in any particular field, please comment with your area of interest and I will try to do a post with a practical example.
Please comment below if you have any questions.
Thank you for your time, I hope you enjoyed reading. Happy learning!