Finding the Best Classification Algorithm for Predicting Loan Payment
This project focuses on finding the best classifier to predict whether a loan will be paid off or not. We will use classifiers from scikit-learn: Logistic Regression, Random Forest, Decision Tree, and K-Nearest Neighbour.
Credit: IBM Cognitive Class
About Dataset
This dataset is about past loans. The Loan_train.csv data set includes details of 346 customers whose loans are already paid off or defaulted. It includes the following fields:
Field | Description |
Loan_status | Whether a loan is paid off or defaulted |
Principal | Basic principal loan amount |
Terms | Origination terms, which can be a weekly (7 days), biweekly, or monthly payoff schedule |
Effective_date | When the loan was originated and took effect |
Due_date | Since it’s a one-time payoff schedule, each loan has one single due date |
Age | Age of applicant |
Education | Education of applicant |
Gender | The gender of applicant |
Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Load Training Set
train_url = ''
df = pd.read_csv(train_url)
First off, let’s view the data.
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | |
0 | 0 | 0 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 45 | High School or Below | male |
1 | 2 | 2 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 33 | Bechalor | female |
2 | 3 | 3 | PAIDOFF | 1000 | 15 | 9/8/2016 | 9/22/2016 | 27 | college | male |
3 | 4 | 4 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 28 | college | female |
4 | 6 | 6 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 29 | college | male |
(346, 10)
Unnamed: 0 346
Unnamed: 0.1 346
loan_status 346
Principal 346
terms 346
effective_date 346
due_date 346
age 346
education 346
Gender 346
dtype: int64
# Checking missing values
df.isnull().sum()
Unnamed: 0 0
Unnamed: 0.1 0
loan_status 0
Principal 0
terms 0
effective_date 0
due_date 0
age 0
education 0
Gender 0
dtype: int64
Unnamed: 0 int64
Unnamed: 0.1 int64
loan_status object
Principal int64
terms int64
effective_date object
due_date object
age int64
education object
Gender object
dtype: object
The data contains 346 rows and 10 columns with no missing values. The columns are a mix of numeric and string types.
Data Cleaning
# Drop Insignificant Column
df.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis = 1, inplace = True)
loan_status | Principal | terms | effective_date | due_date | age | education | Gender | |
0 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 45 | High School or Below | male |
1 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 33 | Bechalor | female |
2 | PAIDOFF | 1000 | 15 | 9/8/2016 | 9/22/2016 | 27 | college | male |
3 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 28 | college | female |
4 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 29 | college | male |
#Renaming Column
df.rename(columns={'Principal': 'principal', "Gender": "gender"}, inplace = True)
loan_status | principal | terms | effective_date | due_date | age | education | gender | |
0 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 45 | High School or Below | male |
1 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 33 | Bechalor | female |
2 | PAIDOFF | 1000 | 15 | 9/8/2016 | 9/22/2016 | 27 | college | male |
3 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 28 | college | female |
4 | PAIDOFF | 1000 | 30 | 9/9/2016 | 10/8/2016 | 29 | college | male |
# Standardizing Text & Fixing Typos
['High School or Below' 'Bechalor' 'college' 'Master or Above']
['male' 'female']
df['loan_status'] = df['loan_status'].apply(lambda x: 'paid_off' if (x == 'PAIDOFF') else 'defaulted')
df.loc[df.education == 'High School or Below', 'education'] = 'high_school_or_below'
df.loc[df.education == 'college', 'education'] = 'college'
df.loc[df.education == 'Bechalor', 'education'] = 'bachelor'
df.loc[df.education == 'Master or Above', 'education'] = 'master_or_above'
loan_status | principal | terms | effective_date | due_date | age | education | gender | |
0 | paid_off | 1000 | 30 | 9/8/2016 | 10/7/2016 | 45 | high_school_or_below | male |
1 | paid_off | 1000 | 30 | 9/8/2016 | 10/7/2016 | 33 | bachelor | female |
2 | paid_off | 1000 | 15 | 9/8/2016 | 9/22/2016 | 27 | college | male |
3 | paid_off | 1000 | 30 | 9/9/2016 | 10/8/2016 | 28 | college | female |
4 | paid_off | 1000 | 30 | 9/9/2016 | 10/8/2016 | 29 | college | male |
# Convert to date time object
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
loan_status | principal | terms | effective_date | due_date | age | education | gender | |
0 | paid_off | 1000 | 30 | 2016-09-08 | 2016-10-07 | 45 | high_school_or_below | male |
1 | paid_off | 1000 | 30 | 2016-09-08 | 2016-10-07 | 33 | bachelor | female |
2 | paid_off | 1000 | 15 | 2016-09-08 | 2016-09-22 | 27 | college | male |
3 | paid_off | 1000 | 30 | 2016-09-09 | 2016-10-08 | 28 | college | female |
4 | paid_off | 1000 | 30 | 2016-09-09 | 2016-10-08 | 29 | college | male |
Feature Extraction
Let’s extract information from the effective date and the due date, and look at how each relates to loan payment.
We convert each date to its day of the week, numbered from 0 (Monday) to 6 (Sunday).
df['dayofweek_getloan'] = df['effective_date'].dt.dayofweek
df['dayofweek_dueloan'] = df['due_date'].dt.dayofweek
loan_status | principal | terms | effective_date | due_date | age | education | gender | dayofweek_getloan | dayofweek_dueloan | |
0 | paid_off | 1000 | 30 | 2016-09-08 | 2016-10-07 | 45 | high_school_or_below | male | 3 | 4 |
1 | paid_off | 1000 | 30 | 2016-09-08 | 2016-10-07 | 33 | bachelor | female | 3 | 4 |
2 | paid_off | 1000 | 15 | 2016-09-08 | 2016-09-22 | 27 | college | male | 3 | 3 |
3 | paid_off | 1000 | 30 | 2016-09-09 | 2016-10-08 | 28 | college | female | 4 | 5 |
4 | paid_off | 1000 | 30 | 2016-09-09 | 2016-10-08 | 29 | college | male | 4 | 5 |
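The day-of-week numbering can be checked against the first rows of the table using only the standard library. This is a minimal sketch, assuming the dates are written as month/day/year as in the raw file; `day_of_week` is a hypothetical helper for illustration and is not part of the notebook.

```python
from datetime import datetime

def day_of_week(date_text):
    """Return 0 for Monday through 6 for Sunday, the same numbering as dt.dayofweek."""
    return datetime.strptime(date_text, "%m/%d/%Y").weekday()

effective = day_of_week("9/8/2016")   # effective_date of the first row
due = day_of_week("10/7/2016")        # due_date of the first row
print(effective, due)                 # 3 4, i.e. Thursday and Friday
```

Both values agree with the dayofweek_getloan and dayofweek_dueloan columns of row 0.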
bins = np.linspace(df.dayofweek_getloan.min(), df.dayofweek_getloan.max(), 10)
g = sns.FacetGrid(df, col = "gender", hue = "loan_status", palette = "Set1", col_wrap = 2)
g.map(plt.hist, 'dayofweek_getloan', bins = bins, ec = "k")
The charts above show that people who get their loan during the weekend (Fri to Sun) tend not to pay it off.
bins = np.linspace(df.dayofweek_dueloan.min(), df.dayofweek_dueloan.max(), 10)
g = sns.FacetGrid(df, col = "gender", hue = "loan_status", palette = "Set1", col_wrap = 2)
g.map(plt.hist, 'dayofweek_dueloan', bins = bins, ec = "k")
For the due date, the charts show that the share of defaulted loans is higher on Monday and Sunday.
Data Preprocessing
Based on the information above, we will encode the weekend (Fri to Sun) of effective_date to 1, and the others to 0.
For due_date, we will encode Monday and Sunday to 1, and the others to 0.
Other categorical features will be encoded to 0 and 1 as well.
#encode effective_date weekend
df['weekend_getloan'] = df['dayofweek_getloan'].apply(lambda x: 1 if (x > 3) else 0)
#encode monday and sunday of due_date
df['startendweek_dueloan'] = df['dayofweek_dueloan'].apply(lambda x: 1 if (x == 0 or x == 6) else 0)
#encode gender
gender_dummy = pd.get_dummies(df.gender)
#encode education
edu_dummy = pd.get_dummies(df.education)
#combined all new encoded features to dataframe
df = pd.concat([df, gender_dummy, edu_dummy], axis = 1)
#encode loan_status
df['loan_stat'] = df['loan_status'].apply(lambda x: 1 if (x == 'paid_off') else 0)
#remove unused column.
df.drop(['loan_status','effective_date', 'due_date', 'dayofweek_getloan', 'dayofweek_dueloan', 'education','gender'], axis = 1, inplace = True)
principal | terms | age | weekend_getloan | startendweek_dueloan | female | male | bachelor | college | high_school_or_below | master_or_above | loan_stat | |
0 | 1000 | 30 | 45 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
1 | 1000 | 30 | 33 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
2 | 1000 | 15 | 27 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
3 | 1000 | 30 | 28 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
4 | 1000 | 30 | 29 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
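The two encodings used in this step can be sketched in plain Python to make the rules explicit. This is an illustration only: `weekend_flag` and `one_hot` are hypothetical helpers, and `one_hot` is a simplified stand-in for `pd.get_dummies`, assuming columns are ordered alphabetically.

```python
def weekend_flag(day_of_week):
    """Return 1 for Friday (4), Saturday (5) or Sunday (6), else 0."""
    return 1 if day_of_week > 3 else 0

def one_hot(values):
    """Build one 0/1 column per distinct value, with columns in sorted order."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# dayofweek_getloan and gender of the first five rows shown earlier
flags = [weekend_flag(d) for d in [3, 3, 3, 4, 4]]
columns, rows = one_hot(["male", "female", "male", "female", "male"])

print(flags)    # [0, 0, 0, 1, 1]
print(columns)  # ['female', 'male']
print(rows[0])  # [0, 1]
```

The results line up with the weekend_getloan, female and male columns in the table above.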
Let’s get an overview of how the features relate to one another using a correlation matrix.
plt.figure(figsize=(10, 5))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True)
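Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr()`. The sketch below computes one such coefficient by hand; `pearson` is a hypothetical helper, and the two input lists are the female and male dummy values of the first five rows, used here only as a small example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

female = [0, 1, 0, 1, 0]
male = [1, 0, 1, 0, 1]
r = pearson(female, male)
print(round(r, 2))  # -1.0
```

Complementary dummy columns such as female and male are perfectly negatively correlated, so one of each pair carries no extra information.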
Feature Selection
x = df.iloc[:,:-1].values
array([[1000, 30, 45, ..., 0, 1, 0],
[1000, 30, 33, ..., 0, 0, 0],
[1000, 15, 27, ..., 1, 0, 0],
[ 800, 15, 39, ..., 1, 0, 0],
[1000, 30, 28, ..., 1, 0, 0],
[1000, 30, 26, ..., 1, 0, 0]])
y = df.iloc[:,-1].values
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Split Train and Validation Set
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size = 0.3, random_state = 10)
print ('Train set:', x_train.shape, y_train.shape)
print ('Validation set:', x_val.shape, y_val.shape)
Train set: (242, 11) (242,)
Validation set: (104, 11) (104,)
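The shapes above follow from `test_size = 0.3` applied to 346 rows. As a quick arithmetic check, scikit-learn rounds a fractional test size up to a whole number of samples and assigns the remainder to the training set:

```python
import math

n_samples = 346
test_size = 0.3

n_val = math.ceil(test_size * n_samples)  # fractional test size is rounded up
n_train = n_samples - n_val               # the rest goes to the training set
print(n_train, n_val)                     # 242 104
```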
Data Standardization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_val = sc.transform(x_val)
/opt/conda/envs/Python36/lib/python3.6/site-packages/sklearn/utils/ DataConversionWarning: Data with input dtype int64 was converted to float64 by StandardScaler.
warnings.warn(msg, DataConversionWarning)
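Standardization rescales each feature to zero mean and unit variance using statistics learned from the training set only; the same statistics are then reused for the validation set and would normally be reused for any later data as well. The sketch below shows the idea for a single feature. `fit_scaler` and `transform` are hypothetical helpers rather than scikit-learn functions, and the age values are a small illustrative sample, not the real split.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Apply z = (x - mu) / sigma with previously learned statistics."""
    return [(value - mu) / sigma for value in column]

train_age = [45, 33, 27, 28, 29]  # illustrative training values
val_age = [50, 35]                # illustrative validation values

mu, sigma = fit_scaler(train_age)                # fit on training data only
train_scaled = transform(train_age, mu, sigma)
val_scaled = transform(val_age, mu, sigma)       # reuse the training statistics
print([round(z, 2) for z in val_scaled])
```

The population standard deviation is used because `StandardScaler` divides by the standard deviation computed with zero degrees of freedom.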
Algorithm 1: Logistic Regression
from sklearn.linear_model import LogisticRegression
classifier1 = LogisticRegression(solver = 'liblinear', random_state = 0)
classifier1.fit(x_train, y_train)
y_pred = classifier1.predict(x_val)
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
print(accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))
print(f1_score(y_val, y_pred))
[[ 5 17]
[ 3 79]]
precision recall f1-score support
0 0.62 0.23 0.33 22
1 0.82 0.96 0.89 82
micro avg 0.81 0.81 0.81 104
macro avg 0.72 0.60 0.61 104
weighted avg 0.78 0.81 0.77 104
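The per-class rows of the report can be recomputed from the confusion matrix, which makes it easier to see what each number means. This is a minimal sketch in plain Python; `class_report` is a hypothetical helper, and the matrix is the one printed above, with true classes as rows and predicted classes as columns.

```python
cm = [[5, 17],
      [3, 79]]

def class_report(cm, label):
    """Return precision, recall and F1 for one class of a square confusion matrix."""
    true_positive = cm[label][label]
    predicted = sum(row[label] for row in cm)  # column total: times the class was predicted
    actual = sum(cm[label])                    # row total: times the class really occurred
    precision = true_positive / predicted
    recall = true_positive / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

accuracy = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))
for label in (0, 1):
    p, r, f = class_report(cm, label)
    print(label, round(p, 2), round(r, 2), round(f, 2))
print(round(accuracy, 2))
```

Only 5 of the 22 defaulted loans are identified, so the recall for class 0 is low even though overall accuracy is about 0.81.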
Algorithm 2: Random Forest
from sklearn.ensemble import RandomForestClassifier
classifier2 = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier2.fit(x_train, y_train)
y_pred = classifier2.predict(x_val)
print(accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))
print(f1_score(y_val, y_pred))
[[ 9 13]
[18 64]]
precision recall f1-score support
0 0.33 0.41 0.37 22
1 0.83 0.78 0.81 82
micro avg 0.70 0.70 0.70 104
macro avg 0.58 0.59 0.59 104
weighted avg 0.73 0.70 0.71 104
Algorithm 3: Decision Tree
from sklearn.tree import DecisionTreeClassifier
classifier3 = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier3.fit(x_train, y_train)
y_pred = classifier3.predict(x_val)
print(accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))
print(f1_score(y_val, y_pred))
[[ 9 13]
[21 61]]
precision recall f1-score support
0 0.30 0.41 0.35 22
1 0.82 0.74 0.78 82
micro avg 0.67 0.67 0.67 104
macro avg 0.56 0.58 0.56 104
weighted avg 0.71 0.67 0.69 104
Algorithm 4: K-Nearest Neighbour
from sklearn.tree import DecisionTreeClassifier
classifier3 = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier3.fit(x_train, y_train)
y_pred = classifier3.predict(x_val)
print(accuracy_score(y_val, y_pred))
print(confusion_matrix(y_val, y_pred))
print(classification_report(y_val, y_pred))
print(f1_score(y_val, y_pred))
[[ 9 13]
[21 61]]
precision recall f1-score support
0 0.30 0.41 0.35 22
1 0.82 0.74 0.78 82
micro avg 0.67 0.67 0.67 104
macro avg 0.56 0.58 0.56 104
weighted avg 0.71 0.67 0.69 104
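The listing under this heading fits a `DecisionTreeClassifier`, so the report matches Algorithm 3. For reference, a K-nearest-neighbour classifier labels each point by majority vote among its k closest training points. The following is a from-scratch sketch of that rule on made-up, already standardized points. `knn_predict`, `k = 3` and the toy data are assumptions for illustration; this is neither scikit-learn's `KNeighborsClassifier` nor the code that produced the report above.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up two-feature points: three labelled 0 and three labelled 1
train_points = [(-1.0, -1.2), (-0.8, -0.9), (-1.1, -0.7),
                (1.0, 0.9), (1.2, 1.1), (0.9, 1.3)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.0, 1.0)))    # 1
print(knn_predict(train_points, train_labels, (-1.0, -1.0)))  # 0
```

Because the vote depends on Euclidean distance, the method is sensitive to feature scale, which is one reason the features are standardized before fitting.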
Evaluating Test Set
test_url = ''
df_test = pd.read_csv(test_url)
Unnamed: 0 | Unnamed: 0.1 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender | |
0 | 1 | 1 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 50 | Bechalor | female |
1 | 5 | 5 | PAIDOFF | 300 | 7 | 9/9/2016 | 9/15/2016 | 35 | Master or Above | male |
2 | 21 | 21 | PAIDOFF | 1000 | 30 | 9/10/2016 | 10/9/2016 | 43 | High School or Below | female |
3 | 24 | 24 | PAIDOFF | 1000 | 30 | 9/10/2016 | 10/9/2016 | 26 | college | male |
4 | 35 | 35 | PAIDOFF | 800 | 15 | 9/11/2016 | 9/25/2016 | 29 | Bechalor | male |
Preprocess Test Set
# Drop Insignificant Column
df_test.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis = 1, inplace = True)
#Renaming Column
df_test.rename(columns={'Principal': 'principal', "Gender": "gender"}, inplace = True)
# Standardizing Text & Fixing Typos
df_test['loan_status'] = df_test['loan_status'].apply(lambda x: 'paid_off' if (x == 'PAIDOFF') else 'defaulted')
df_test.loc[df_test.education == 'High School or Below', 'education'] = 'high_school_or_below'
df_test.loc[df_test.education == 'college', 'education'] = 'college'
df_test.loc[df_test.education == 'Bechalor', 'education'] = 'bachelor'
df_test.loc[df_test.education == 'Master or Above', 'education'] = 'master_or_above'
# Convert to date time object
df_test['due_date'] = pd.to_datetime(df_test['due_date'])
df_test['effective_date'] = pd.to_datetime(df_test['effective_date'])
df_test['dayofweek_getloan'] = df_test['effective_date'].dt.dayofweek
df_test['dayofweek_dueloan'] = df_test['due_date'].dt.dayofweek
#encode effective_date weekend
df_test['weekend_getloan'] = df_test['dayofweek_getloan'].apply(lambda x: 1 if (x > 3) else 0)
#encode monday and sunday of due_date
df_test['startendweek_dueloan'] = df_test['dayofweek_dueloan'].apply(lambda x: 1 if (x == 0 or x == 6) else 0)
#encode gender
gender_dummy = pd.get_dummies(df_test.gender)
#encode education
edu_dummy = pd.get_dummies(df_test.education)
#combined all new encoded features to dataframe
df_test = pd.concat([df_test, gender_dummy, edu_dummy], axis = 1)
#encode loan_status
df_test['loan_stat'] = df_test['loan_status'].apply(lambda x: 1 if (x == 'paid_off') else 0)
#remove unused column.
df_test.drop(['loan_status','effective_date', 'due_date', 'dayofweek_getloan', 'dayofweek_dueloan', 'education','gender'], axis = 1, inplace = True)
principal | terms | age | weekend_getloan | startendweek_dueloan | female | male | bachelor | college | high_school_or_below | master_or_above | loan_stat | |
0 | 1000 | 30 | 50 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
1 | 300 | 7 | 35 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
2 | 1000 | 30 | 43 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
3 | 1000 | 30 | 26 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
4 | 800 | 15 | 29 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
x_test = df_test.iloc[:,:-1].values
y_test = df_test.iloc[:,-1].values
Logistic Regression: Test Set
y_pred = classifier1.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(f1_score(y_test, y_pred))
[[ 0 14]
[ 0 40]]
precision recall f1-score support
0 0.00 0.00 0.00 14
1 0.74 1.00 0.85 40
micro avg 0.74 0.74 0.74 54
macro avg 0.37 0.50 0.43 54
weighted avg 0.55 0.74 0.63 54
/opt/conda/envs/Python36/lib/python3.6/site-packages/sklearn/metrics/ UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
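The averaged rows of this report can be recomputed from the per-class values, and doing so shows why the 0.74 accuracy is less informative than it looks. The sketch below uses the confusion matrix [[0, 14], [0, 40]] printed above; the variable names are chosen here for illustration.

```python
support = {0: 14, 1: 40}                 # true cases per class in the test set
precision = {0: 0.0, 1: 40 / (40 + 14)}  # class 0 is never predicted, so it is reported as 0.0

# Macro average: plain mean over classes
macro = sum(precision.values()) / len(precision)
# Weighted average: mean over classes weighted by support
weighted = sum(precision[c] * support[c] for c in support) / sum(support.values())
# Accuracy of always predicting the majority class
majority_baseline = max(support.values()) / sum(support.values())

print(round(macro, 2), round(weighted, 2), round(majority_baseline, 2))  # 0.37 0.55 0.74
```

A classifier that predicts "paid off" for every case scores exactly the share of paid-off loans in the test set, which is the accuracy reported here.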
Random Forest: Test Set
y_pred = classifier2.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(f1_score(y_test, y_pred))
[[10 4]
[11 29]]
precision recall f1-score support
0 0.48 0.71 0.57 14
1 0.88 0.72 0.79 40
micro avg 0.72 0.72 0.72 54
macro avg 0.68 0.72 0.68 54
weighted avg 0.77 0.72 0.74 54
Decision Tree: Test Set
y_pred = classifier3.predict(x_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(f1_score(y_test, y_pred))
[[10 4]
[11 29]]
precision recall f1-score support
0 0.48 0.71 0.57 14
1 0.88 0.72 0.79 40
micro avg 0.72 0.72 0.72 54
macro avg 0.68 0.72 0.68 54
weighted avg 0.77 0.72 0.74 54