Introduction
The analysis to create a decision making granting loans is one of the most important operations for financial institutions. By taking into account past results of a pertain, a model is trained to accurately predict future outcomes of individual loan approval success rate.
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggthemes)
library(corrplot)
## corrplot 0.84 loaded
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(shiny)
library(DT)
##
## Attaching package: 'DT'
## The following objects are masked from 'package:shiny':
##
## dataTableOutput, renderDataTable
library(DBI)
## Warning: package 'DBI' was built under R version 3.6.2
library(odbc)
## Warning: package 'odbc' was built under R version 3.6.2
library(rsconnect)
## Warning: package 'rsconnect' was built under R version 3.6.2
##
## Attaching package: 'rsconnect'
## The following object is masked from 'package:shiny':
##
## serverInfo
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(callr)
# Set the blank spaces to NA's
loan = read_csv("loan.csv" , na = "")
## Parsed with column specification:
## cols(
## .default = col_double(),
## id = col_logical(),
## member_id = col_logical(),
## term = col_character(),
## grade = col_character(),
## sub_grade = col_character(),
## emp_title = col_character(),
## emp_length = col_character(),
## home_ownership = col_character(),
## verification_status = col_character(),
## issue_d = col_character(),
## loan_status = col_character(),
## pymnt_plan = col_character(),
## url = col_logical(),
## desc = col_logical(),
## purpose = col_character(),
## title = col_character(),
## zip_code = col_character(),
## addr_state = col_character(),
## earliest_cr_line = col_character(),
## initial_list_status = col_character()
## # ... with 29 more columns
## )
## See spec(...) for full column specifications.
## Warning: 462349 parsing failures.
## row col expected actual file
## 92797 debt_settlement_flag_date 1/0/T/F/TRUE/FALSE Feb-2019 'loan.csv'
## 92797 settlement_status 1/0/T/F/TRUE/FALSE ACTIVE 'loan.csv'
## 92797 settlement_date 1/0/T/F/TRUE/FALSE Feb-2019 'loan.csv'
## 92797 settlement_amount 1/0/T/F/TRUE/FALSE 5443 'loan.csv'
## 92797 settlement_percentage 1/0/T/F/TRUE/FALSE 65 'loan.csv'
## ..... ......................... .................. ........ ..........
## See problems(...) for more details.
rsconnect::setAccountInfo(name='rode', token='FF0050CE796E6D152241FCDBD70150F1', secret='8CfkDnpsOULx5SVvf1OF/n+YvaLOvK7ImOz5/480')
From the analysis, the pertain contains information of age, annual income, grade of employee, home ownership that affects the probability of default to the borrower.
# Select only the columns mentioned above.
loan = loan %>%
select(loan_status , loan_amnt , int_rate , grade , emp_length , home_ownership ,
annual_inc , term)
loan
## # A tibble: 2,260,668 x 8
## loan_status loan_amnt int_rate grade emp_length home_ownership
## <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 Current 2500 13.6 C 10+ years RENT
## 2 Current 30000 18.9 D 10+ years MORTGAGE
## 3 Current 5000 18.0 D 6 years MORTGAGE
## 4 Current 4000 18.9 D 10+ years MORTGAGE
## 5 Current 30000 16.1 C 10+ years MORTGAGE
## 6 Current 5550 15.0 C 10+ years MORTGAGE
## 7 Current 2000 18.0 D 4 years RENT
## 8 Current 6000 13.6 C 10+ years RENT
## 9 Current 5000 18.0 D 10+ years MORTGAGE
## 10 Current 6000 14.5 C < 1 year OWN
## # ... with 2,260,658 more rows, and 2 more variables: annual_inc <dbl>,
## # term <chr>
Missing Values:
sapply(loan , function(x) sum(is.na(x)))
## loan_status loan_amnt int_rate grade emp_length
## 0 0 0 0 0
## home_ownership annual_inc term
## 0 4 0
# Remove the 4 rows with missing annual income, 49 rows where home ownership is 'NONE' or 'ANY' and rows where emp_length is 'n/a'.
loan = loan %>%
filter(!is.na(annual_inc) ,
!(home_ownership %in% c('NONE' , 'ANY')) ,
emp_length != 'n/a')
loan %>%
count(loan_status) %>%
ggplot(aes(x = reorder(loan_status , desc(n)) , y = n , fill = n)) +
geom_col() +
coord_flip() + labs(x = 'Loan Status' , y = 'Count')
The chart above fully illustrates a conversion variable to binary where (1 represents for default and 0 for non-defaults) but we have 10 different levels. Loans with status Current, Late payments was removed and transformed into a new variable called loan_outcome where
loan_outcome -> 1 if loan_status = ‘Charged Off’ or ‘Default’ loan_outcome -> 0 if loan_status = ‘Fully Paid’
loan = loan %>%
mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default') ,
1,
ifelse(loan_status == 'Fully Paid' , 0 , 'No info')
))
barplot(table(loan$loan_outcome) , col = 'lightblue')
Based on the diagram, 0, 1 and no info illustrates a descriptive analysis of each transactional binary value.
# Create the new dataset by filtering 0's and 1's in the loan_outcome column and remove loan_status column for the modelling
loan2 = loan %>%
select(-loan_status) %>%
filter(loan_outcome %in% c(0 , 1))
Let’s observe how useful these variables would be for credit risk modelling. The graph below illustrates that the better the grade, the lower the interest rate of a pretaining loan.
ggplot(loan2 , aes(x = grade , y = int_rate , fill = grade)) +
geom_boxplot() +
theme_igray() +
labs(y = 'Interest Rate' , x = 'Grade')
table(loan2$grade , factor(loan2$loan_outcome , c(0 , 1) , c('Fully Paid' , 'Default')))
##
## Fully Paid Default
## A 201708 12424
## B 311584 46655
## C 271318 76680
## D 128042 54771
## E 53267 33064
## F 16447 13413
## G 4285 4227
ggplot(loan2 , aes(x = grade , y = ..count.. , fill = factor(loan_outcome , c(1 , 0) , c('Default' , 'Fully Paid')))) +
geom_bar() +
theme(legend.title = element_blank())
Based on the diagram above, the report illustrates the impact of annuals income of borrower as other variables. As a result, a majority of credits which have been fully paid are higher than of default rates.
ggplot(loan2[sample(244179 , 10000) , ] , aes(x = annual_inc , y = loan_amnt , color = int_rate)) +
geom_point(alpha = 0.5 , size = 1.5) +
geom_smooth(se = F , color = 'darkred' , method = 'loess') +
xlim(c(0 , 300000)) +
labs(x = 'Annual Income' , y = 'Loan Ammount' , color = 'Interest Rate')
## Warning: Removed 57 rows containing non-finite values (stat_smooth).
## Warning: Removed 57 rows containing missing values (geom_point).
Based on the diagram, it shows a larger annual income than larger demanded ammounts by borrowers.
# Split dataset
loan2$loan_outcome = as.numeric(loan2$loan_outcome)
idx = sample(dim(loan2)[1] , 0.75*dim(loan2)[1] , replace = F)
trainset = loan2[idx , ]
testset = loan2[-idx , ]
# Fit logistic regression
glm.model = glm(loan_outcome ~ . , trainset , family = binomial(link = 'logit'))
summary(glm.model)
##
## Call:
## glm(formula = loan_outcome ~ ., family = binomial(link = "logit"),
## data = trainset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4570 -0.7119 -0.5306 -0.3213 6.8140
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.117e+00 2.005e-02 -155.437 < 2e-16 ***
## loan_amnt 1.171e-05 3.810e-07 30.727 < 2e-16 ***
## int_rate 3.737e-02 1.760e-03 21.231 < 2e-16 ***
## gradeB 6.567e-01 1.369e-02 47.983 < 2e-16 ***
## gradeC 1.080e+00 1.691e-02 63.865 < 2e-16 ***
## gradeD 1.293e+00 2.231e-02 57.971 < 2e-16 ***
## gradeE 1.420e+00 2.813e-02 50.483 < 2e-16 ***
## gradeF 1.474e+00 3.581e-02 41.173 < 2e-16 ***
## gradeG 1.509e+00 4.539e-02 33.248 < 2e-16 ***
## emp_length1 year -1.499e-02 1.372e-02 -1.093 0.274516
## emp_length10+ years -5.992e-02 1.046e-02 -5.728 1.02e-08 ***
## emp_length2 years -4.259e-02 1.272e-02 -3.348 0.000814 ***
## emp_length3 years -2.235e-02 1.311e-02 -1.706 0.088068 .
## emp_length4 years -3.675e-02 1.420e-02 -2.589 0.009635 **
## emp_length5 years -4.750e-02 1.407e-02 -3.377 0.000733 ***
## emp_length6 years -4.732e-02 1.537e-02 -3.078 0.002086 **
## emp_length7 years -6.510e-02 1.565e-02 -4.160 3.19e-05 ***
## emp_length8 years -1.008e-02 1.547e-02 -0.651 0.514797
## emp_length9 years -1.283e-02 1.640e-02 -0.782 0.434128
## home_ownershipOTHER 2.465e-01 2.553e-01 0.966 0.334282
## home_ownershipOWN 1.941e-01 9.512e-03 20.407 < 2e-16 ***
## home_ownershipRENT 3.492e-01 6.097e-03 57.270 < 2e-16 ***
## annual_inc -2.515e-06 7.339e-08 -34.274 < 2e-16 ***
## term60 months 4.259e-01 6.847e-03 62.202 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 912910 on 920912 degrees of freedom
## Residual deviance: 837103 on 920889 degrees of freedom
## AIC: 837151
##
## Number of Fisher Scoring iterations: 5
The coefficients of the following features are positive:
In contrast to this, the probability of defaulting on the given credit varies directly with these factors. As a result, the higher the amouunt of loan, the higher the risk of credit loss.
The coefficients of the following features are negative:
This means that the probability of defaulting is inversely proportional to the factors mentioned above.
# Prediction on test set
preds = predict(glm.model , testset , type = 'response')
# Density of probabilities
ggplot(data.frame(preds) , aes(preds)) +
geom_density(fill = 'lightblue' , alpha = 0.4) +
labs(x = 'Predicted Probabilities on test set')
As of now, the accuracy, sensitivity and specificity are transformed for given threshold.
k = 0
accuracy = c()
sensitivity = c()
specificity = c()
for(i in seq(from = 0.01 , to = 0.5 , by = 0.01)){
k = k + 1
preds_binomial = ifelse(preds > i , 1 , 0)
confmat = table(testset$loan_outcome , preds_binomial)
accuracy[k] = sum(diag(confmat)) / sum(confmat)
sensitivity[k] = confmat[1 , 1] / sum(confmat[ , 1])
specificity[k] = confmat[2 , 2] / sum(confmat[ , 2])
}
If we plot our results we get this visualization.
threshold = seq(from = 0.01 , to = 0.5 , by = 0.01)
data = data.frame(threshold , accuracy , sensitivity , specificity)
head(data)
## threshold accuracy sensitivity specificity
## 1 0.01 0.1962133 0.8493151 0.1960580
## 2 0.02 0.1965196 0.9190751 0.1961121
## 3 0.03 0.1973405 0.9343545 0.1962416
## 4 0.04 0.2026862 0.9619220 0.1971906
## 5 0.05 0.2455208 0.9586278 0.2048655
## 6 0.06 0.2919843 0.9509678 0.2135462
# Gather accuracy , sensitivity and specificity in one column
ggplot(gather(data , key = 'Metric' , value = 'Value' , 2:4) ,
aes(x = threshold , y = Value , color = Metric)) +
geom_line(size = 1.5)
Based on the chart, a threshold of 25% - 30% seems ideal cause further increase of the cut off percentage does not have significant impact on the accuracy of the model. The Confusion Matrix for cut off point at 30% will be this,
preds.for.30 = ifelse(preds > 0.3 , 1 , 0)
confusion_matrix_30 = table(Predicted = preds.for.30 , Actual = testset$loan_outcome)
confusion_matrix_30
## Actual
## Predicted 0 1
## 0 211301 38017
## 1 35490 22164
## [1] "Accuracy : 0.7605"
The ROC (Receiver Operating Characteristics) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
library(pROC)
## Warning: package 'pROC' was built under R version 3.6.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Area Under Curve
auc(roc(testset$loan_outcome , preds))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7006
# Plot ROC curve
plot.roc(testset$loan_outcome , preds , main = "Confidence interval of a threshold" , percent = TRUE ,
ci = TRUE , of = "thresholds" , thresholds = "best" , print.thres = "best" , col = 'blue')
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
The logistic regression model was used to predict the loan status. Different cut off’s were used to decide if the loan should be granted or not. Cut off of 30% - 35% gave a good accuracy of 76.05%. The decision to set a cut off is arbitrary and higher levels of threshold has increased the risk. The Area Under Curve also gives a measure of accuracy, which came out to be 70.06%. Therefore, the credit of this pertaining user is good as it is more than the users loss credit. Therefore, the chance of a loan that will be approve is good for this individual.