Introduction

The analysis to create a decision making granting loans is one of the most important operations for financial institutions. By taking into account past results of a pertain, a model is trained to accurately predict future outcomes of individual loan approval success rate.

library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggthemes)
library(corrplot)
## corrplot 0.84 loaded
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(shiny)
library(DT)
## 
## Attaching package: 'DT'
## The following objects are masked from 'package:shiny':
## 
##     dataTableOutput, renderDataTable
library(DBI)
## Warning: package 'DBI' was built under R version 3.6.2
library(odbc)
## Warning: package 'odbc' was built under R version 3.6.2
library(rsconnect)
## Warning: package 'rsconnect' was built under R version 3.6.2
## 
## Attaching package: 'rsconnect'
## The following object is masked from 'package:shiny':
## 
##     serverInfo
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(callr)
# Set the blank spaces to NA's
loan = read_csv("loan.csv" , na = "")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   id = col_logical(),
##   member_id = col_logical(),
##   term = col_character(),
##   grade = col_character(),
##   sub_grade = col_character(),
##   emp_title = col_character(),
##   emp_length = col_character(),
##   home_ownership = col_character(),
##   verification_status = col_character(),
##   issue_d = col_character(),
##   loan_status = col_character(),
##   pymnt_plan = col_character(),
##   url = col_logical(),
##   desc = col_logical(),
##   purpose = col_character(),
##   title = col_character(),
##   zip_code = col_character(),
##   addr_state = col_character(),
##   earliest_cr_line = col_character(),
##   initial_list_status = col_character()
##   # ... with 29 more columns
## )
## See spec(...) for full column specifications.
## Warning: 462349 parsing failures.
##   row                       col           expected   actual       file
## 92797 debt_settlement_flag_date 1/0/T/F/TRUE/FALSE Feb-2019 'loan.csv'
## 92797 settlement_status         1/0/T/F/TRUE/FALSE ACTIVE   'loan.csv'
## 92797 settlement_date           1/0/T/F/TRUE/FALSE Feb-2019 'loan.csv'
## 92797 settlement_amount         1/0/T/F/TRUE/FALSE 5443     'loan.csv'
## 92797 settlement_percentage     1/0/T/F/TRUE/FALSE 65       'loan.csv'
## ..... ......................... .................. ........ ..........
## See problems(...) for more details.
rsconnect::setAccountInfo(name='rode', token='FF0050CE796E6D152241FCDBD70150F1', secret='8CfkDnpsOULx5SVvf1OF/n+YvaLOvK7ImOz5/480')

Feature Selection & Engineering

From the analysis, the pertain contains information of age, annual income, grade of employee, home ownership that affects the probability of default to the borrower.

# Select only the columns mentioned above.
loan = loan %>%
        select(loan_status , loan_amnt , int_rate , grade , emp_length , home_ownership , 
               annual_inc , term)
loan
## # A tibble: 2,260,668 x 8
##    loan_status loan_amnt int_rate grade emp_length home_ownership
##    <chr>           <dbl>    <dbl> <chr> <chr>      <chr>         
##  1 Current          2500     13.6 C     10+ years  RENT          
##  2 Current         30000     18.9 D     10+ years  MORTGAGE      
##  3 Current          5000     18.0 D     6 years    MORTGAGE      
##  4 Current          4000     18.9 D     10+ years  MORTGAGE      
##  5 Current         30000     16.1 C     10+ years  MORTGAGE      
##  6 Current          5550     15.0 C     10+ years  MORTGAGE      
##  7 Current          2000     18.0 D     4 years    RENT          
##  8 Current          6000     13.6 C     10+ years  RENT          
##  9 Current          5000     18.0 D     10+ years  MORTGAGE      
## 10 Current          6000     14.5 C     < 1 year   OWN           
## # ... with 2,260,658 more rows, and 2 more variables: annual_inc <dbl>,
## #   term <chr>

Missing Values:

sapply(loan , function(x) sum(is.na(x)))
##    loan_status      loan_amnt       int_rate          grade     emp_length 
##              0              0              0              0              0 
## home_ownership     annual_inc           term 
##              0              4              0
# Remove the 4 rows with missing annual income, 49 rows where home ownership is 'NONE' or 'ANY' and rows where emp_length is 'n/a'.

loan = loan %>%
        filter(!is.na(annual_inc) , 
               !(home_ownership %in% c('NONE' , 'ANY')) , 
               emp_length != 'n/a')

Exploratory Data Analysis

loan %>%
       count(loan_status) %>%
        ggplot(aes(x = reorder(loan_status , desc(n)) , y = n , fill = n)) + 
        geom_col() + 
        coord_flip() + labs(x = 'Loan Status' , y = 'Count')

The chart above fully illustrates a conversion variable to binary where (1 represents for default and 0 for non-defaults) but we have 10 different levels. Loans with status Current, Late payments was removed and transformed into a new variable called loan_outcome where

loan_outcome -> 1 if loan_status = ‘Charged Off’ or ‘Default’ loan_outcome -> 0 if loan_status = ‘Fully Paid’

loan = loan %>%
        mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default') , 
                                     1, 
                                     ifelse(loan_status == 'Fully Paid' , 0 , 'No info')
                                     ))

barplot(table(loan$loan_outcome) , col = 'lightblue')

Based on the diagram, 0, 1 and no info illustrates a descriptive analysis of each transactional binary value.

# Create the new dataset by filtering 0's and 1's in the loan_outcome column and remove loan_status column for the modelling
loan2 = loan %>%
        select(-loan_status) %>%
        filter(loan_outcome %in% c(0 , 1))

Let’s observe how useful these variables would be for credit risk modelling. The graph below illustrates that the better the grade, the lower the interest rate of a pretaining loan.

ggplot(loan2 , aes(x = grade , y = int_rate , fill = grade)) + 
        geom_boxplot() + 
        theme_igray() + 
        labs(y = 'Interest Rate' , x = 'Grade')

table(loan2$grade , factor(loan2$loan_outcome , c(0 , 1) , c('Fully Paid' , 'Default')))
##    
##     Fully Paid Default
##   A     201708   12424
##   B     311584   46655
##   C     271318   76680
##   D     128042   54771
##   E      53267   33064
##   F      16447   13413
##   G       4285    4227
ggplot(loan2 , aes(x = grade , y = ..count.. , fill = factor(loan_outcome , c(1 , 0) , c('Default' , 'Fully Paid')))) + 
        geom_bar() + 
        theme(legend.title = element_blank())

Based on the diagram above, the report illustrates the impact of annuals income of borrower as other variables. As a result, a majority of credits which have been fully paid are higher than of default rates.

ggplot(loan2[sample(244179 , 10000) , ] , aes(x = annual_inc , y = loan_amnt , color = int_rate)) +
        geom_point(alpha = 0.5 , size = 1.5) + 
        geom_smooth(se = F , color = 'darkred' , method = 'loess') +
        xlim(c(0 , 300000)) + 
        labs(x = 'Annual Income' , y = 'Loan Ammount' , color = 'Interest Rate')
## Warning: Removed 57 rows containing non-finite values (stat_smooth).
## Warning: Removed 57 rows containing missing values (geom_point).

Based on the diagram, it shows a larger annual income than larger demanded ammounts by borrowers.

# Split dataset 
loan2$loan_outcome = as.numeric(loan2$loan_outcome)
idx = sample(dim(loan2)[1] , 0.75*dim(loan2)[1] , replace = F)
trainset = loan2[idx , ]
testset = loan2[-idx , ]

# Fit logistic regression
glm.model = glm(loan_outcome ~ . , trainset , family = binomial(link = 'logit'))
summary(glm.model)
## 
## Call:
## glm(formula = loan_outcome ~ ., family = binomial(link = "logit"), 
##     data = trainset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4570  -0.7119  -0.5306  -0.3213   6.8140  
## 
## Coefficients:
##                       Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)         -3.117e+00  2.005e-02 -155.437  < 2e-16 ***
## loan_amnt            1.171e-05  3.810e-07   30.727  < 2e-16 ***
## int_rate             3.737e-02  1.760e-03   21.231  < 2e-16 ***
## gradeB               6.567e-01  1.369e-02   47.983  < 2e-16 ***
## gradeC               1.080e+00  1.691e-02   63.865  < 2e-16 ***
## gradeD               1.293e+00  2.231e-02   57.971  < 2e-16 ***
## gradeE               1.420e+00  2.813e-02   50.483  < 2e-16 ***
## gradeF               1.474e+00  3.581e-02   41.173  < 2e-16 ***
## gradeG               1.509e+00  4.539e-02   33.248  < 2e-16 ***
## emp_length1 year    -1.499e-02  1.372e-02   -1.093 0.274516    
## emp_length10+ years -5.992e-02  1.046e-02   -5.728 1.02e-08 ***
## emp_length2 years   -4.259e-02  1.272e-02   -3.348 0.000814 ***
## emp_length3 years   -2.235e-02  1.311e-02   -1.706 0.088068 .  
## emp_length4 years   -3.675e-02  1.420e-02   -2.589 0.009635 ** 
## emp_length5 years   -4.750e-02  1.407e-02   -3.377 0.000733 ***
## emp_length6 years   -4.732e-02  1.537e-02   -3.078 0.002086 ** 
## emp_length7 years   -6.510e-02  1.565e-02   -4.160 3.19e-05 ***
## emp_length8 years   -1.008e-02  1.547e-02   -0.651 0.514797    
## emp_length9 years   -1.283e-02  1.640e-02   -0.782 0.434128    
## home_ownershipOTHER  2.465e-01  2.553e-01    0.966 0.334282    
## home_ownershipOWN    1.941e-01  9.512e-03   20.407  < 2e-16 ***
## home_ownershipRENT   3.492e-01  6.097e-03   57.270  < 2e-16 ***
## annual_inc          -2.515e-06  7.339e-08  -34.274  < 2e-16 ***
## term60 months        4.259e-01  6.847e-03   62.202  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 912910  on 920912  degrees of freedom
## Residual deviance: 837103  on 920889  degrees of freedom
## AIC: 837151
## 
## Number of Fisher Scoring iterations: 5

The coefficients of the following features are positive:

  1. Loan Ammount
  2. Interest Rate
  3. Home Ownership - Other
  4. Term
  5. The better the grade the more difficult to default

In contrast to this, the probability of defaulting on the given credit varies directly with these factors. As a result, the higher the amouunt of loan, the higher the risk of credit loss.

The coefficients of the following features are negative:

  1. Annual Income
  2. Home Ownership - Own
  3. Home Ownership - Rent
  4. Borrowers with 10+ years of experience are more likely to pay their debt
  5. There is no significant difference in the early years of employment

This means that the probability of defaulting is inversely proportional to the factors mentioned above.

# Prediction on test set
preds = predict(glm.model , testset , type = 'response')

# Density of probabilities
ggplot(data.frame(preds) , aes(preds)) + 
        geom_density(fill = 'lightblue' , alpha = 0.4) +
        labs(x = 'Predicted Probabilities on test set')

As of now, the accuracy, sensitivity and specificity are transformed for given threshold.

k = 0
accuracy = c()
sensitivity = c()
specificity = c()
for(i in seq(from = 0.01 , to = 0.5 , by = 0.01)){
        k = k + 1
        preds_binomial = ifelse(preds > i , 1 , 0)
        confmat = table(testset$loan_outcome , preds_binomial)
        accuracy[k] = sum(diag(confmat)) / sum(confmat)
        sensitivity[k] = confmat[1 , 1] / sum(confmat[ , 1])
        specificity[k] = confmat[2 , 2] / sum(confmat[ , 2])
}

If we plot our results we get this visualization.

threshold = seq(from = 0.01 , to = 0.5 , by = 0.01)

data = data.frame(threshold , accuracy , sensitivity , specificity)
head(data)
##   threshold  accuracy sensitivity specificity
## 1      0.01 0.1962133   0.8493151   0.1960580
## 2      0.02 0.1965196   0.9190751   0.1961121
## 3      0.03 0.1973405   0.9343545   0.1962416
## 4      0.04 0.2026862   0.9619220   0.1971906
## 5      0.05 0.2455208   0.9586278   0.2048655
## 6      0.06 0.2919843   0.9509678   0.2135462
# Gather accuracy , sensitivity and specificity in one column
ggplot(gather(data , key = 'Metric' , value = 'Value' , 2:4) , 
       aes(x = threshold , y = Value , color = Metric)) + 
        geom_line(size = 1.5)

Based on the chart, a threshold of 25% - 30% seems ideal cause further increase of the cut off percentage does not have significant impact on the accuracy of the model. The Confusion Matrix for cut off point at 30% will be this,

preds.for.30 = ifelse(preds > 0.3 , 1 , 0)
confusion_matrix_30 = table(Predicted = preds.for.30 , Actual = testset$loan_outcome)
confusion_matrix_30
##          Actual
## Predicted      0      1
##         0 211301  38017
##         1  35490  22164
## [1] "Accuracy : 0.7605"

The ROC (Receiver Operating Characteristics) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.

library(pROC)
## Warning: package 'pROC' was built under R version 3.6.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
# Area Under Curve
auc(roc(testset$loan_outcome , preds))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7006
# Plot ROC curve
plot.roc(testset$loan_outcome , preds , main = "Confidence interval of a threshold" , percent = TRUE , 
         ci = TRUE , of = "thresholds" , thresholds = "best" , print.thres = "best" , col = 'blue')
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

Conclusion

The logistic regression model was used to predict the loan status. Different cut off’s were used to decide if the loan should be granted or not. Cut off of 30% - 35% gave a good accuracy of 76.05%. The decision to set a cut off is arbitrary and higher levels of threshold has increased the risk. The Area Under Curve also gives a measure of accuracy, which came out to be 70.06%. Therefore, the credit of this pertaining user is good as it is more than the users loss credit. Therefore, the chance of a loan that will be approve is good for this individual.