Sliced Episode 12: Loan Defaults

Publish date: 2021-12-06
Tags: Sliced, xgboost, mean absolute error, tables, kable

Summary

Background: In the final (episode 12) of the 2021 series of Sliced, the competitors were given two hours in which to analyse a set of data on bank loans. The aim was to predict the size of the default on the loan (usually zero). The metric for evaluation was the mean absolute error.
My approach: I explored the data and decided to build two models: one to predict whether or not the customer would default, and the other to predict the size of the default for those customers who did default. The size of the default is obviously related to the amount loaned, so for the second model I predicted the proportion of the loan that was defaulted. Both models were fitted with xgboost.

Result: Predictions for the size of the default were taken from the second model, but were replaced by zero if the first model suggested that the customer would not default. The resulting predictions would have come top of the leaderboard by a relatively large margin.
Conclusion: In the previous two episodes I had experimented with mlr3, but for this analysis I reverted to dplyr plus base R code. I felt closer to the data and I think that this helped me to find the best model. The alternative approach of directly minimising mean absolute error in an automatic algorithm does not perform well.

Introduction

The data for the final of Sliced 2021 can be downloaded from www.kaggle.com/c/sliced-s01e12-championship/overview. The organisers selected a set of data on business loans made by a bank and asked the finalists to predict the size of any default, evaluating the models by mean absolute error.

The choice of mean absolute error was an interesting one, not because it is particularly appropriate to the problem, but more because it moved the competitors out of their comfort zone.

Reading the data

It is my practice to read the data as-is and to save it immediately in rds format within a directory called data/rData. For details of the way that I organise my analyses, see my post called Sliced Methods Overview.
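For completeness, this is roughly what that one-off step looks like; a sketch only, since the csv file names and their location under data/ are my assumptions and not taken from the competition page.

# --- one-off sketch: read the downloaded csv files, save as rds ----
library(tidyverse)

home <- "C:/Projects/kaggle/sliced/s01-e12"

# read the Kaggle csv files as-is and cache them in rds format
saveRDS( readr::read_csv( file.path(home, "data/train.csv") ),
         file.path(home, "data/rData/train.rds") )
saveRDS( readr::read_csv( file.path(home, "data/test.csv") ),
         file.path(home, "data/rData/test.rds") )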

# --- setup: libraries & options ------------------------
library(tidyverse)

theme_set( theme_light())

# --- set home directory -------------------------------
home <- "C:/Projects/kaggle/sliced/s01-e12"

# --- read downloaded data -----------------------------
trainRawDF <- readRDS( file.path(home, "data/rData/train.rds") )

testRawDF <- readRDS( file.path(home, "data/rData/test.rds") )

Data Exploration

I usually start by summarising the training data with the skimr package and, as always, I hide the output because of its length.

# --- summarise the training set -----------------------
skimr::skim(trainRawDF)

From skim() I learn that there are 83,656 loans and only 16 missing values, all for the variable NewExist, which measures whether the loan was to a new or an existing business.

The Response

The response is the default_amount, which I tabulate using kable.

library(kableExtra)

trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE), 
                           labels=c("Repaid", "Defaulted"))) %>%
  group_by( default) %>%
  summarise( n      = n(),
             mean   = mean(default_amount),
             median = median(default_amount),
             min    = min(default_amount),
             max    = max(default_amount),
                      .groups="drop") %>%
  mutate( pct = round(100*n/sum(n),1) ) %>%
  select( default, n, pct, mean, median, min, max) %>%
  kable(caption="Amount defaulted") %>%
  kable_classic(full_width=FALSE)
Table 1: Amount defaulted
default n pct mean median min max
Repaid 64457 77.1 0.00 0 0 0
Defaulted 19199 22.9 62879.67 27552 5 1895579

77% of loans are not defaulted, but defaults can be anything from $5 to $1.9 million.

Here is a histogram of the non-zero defaults.

# --- non-zero defaults ----------------------------
trainRawDF %>%
  filter( default_amount > 0 ) %>%
  ggplot( aes(x=default_amount)) +
  geom_histogram( bins=100, fill="steelblue") +
  labs(x = "Amount defaulted",
       title = "Size of default when loan not repaid")

The amounts defaulted are very skewed, so they are better viewed on a log scale.

# --- log10 non-zero defaults ----------------------
trainRawDF %>%
  filter( default_amount > 0 ) %>%
  ggplot( aes(x=log10(default_amount))) +
  geom_histogram( bins=100, fill="steelblue") +
  labs(x = "log10(Amount defaulted)",
       title = "Size of default when loan not repaid")

The Predictors

I consider each of the predictors starting with the business sector.

Sector

The rate of defaulting varies by business sector, from about 10% in Agriculture and Management to a rate that is 3 times higher in Finance and Real Estate.

trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  group_by(Sector, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default) %>%
  mutate( pct = 100*default/(default+repaid))
## # A tibble: 20 x 4
##    Sector                                                   repaid default   pct
##    <chr>                                                     <int>   <int> <dbl>
##  1 Accommodation and food services                            5865    1878 24.3 
##  2 Administrative and support and waste management and rem~   2904    1103 27.5 
##  3 Agriculture, forestry, fishing and hunting                  788      86  9.84
##  4 Arts, entertainment, and recreation                        1291     389 23.2 
##  5 Construction                                               5892    2142 26.7 
##  6 Educational services                                        612     232 27.5 
##  7 Finance and insurance                                       837     395 32.1 
##  8 Health care and social assistance                          5869     759 11.5 
##  9 Information                                                1031     396 27.8 
## 10 Management of companies and enterprises                      27       3 10   
## 11 Manufacturing                                              6386    1318 17.1 
## 12 Mining, quarrying, and oil and gas extraction               147      20 12.0 
## 13 Other services (except public administration)              6954    1970 22.1 
## 14 Professional, scientific, and technical services           6805    1874 21.6 
## 15 Public administration                                        26       5 16.1 
## 16 Real estate and rental and leasing                         1128     578 33.9 
## 17 Retail trade                                              11553    3853 25.0 
## 18 Transportation and warehousing                             1791     873 32.8 
## 19 Utilities                                                    60       9 13.0 
## 20 Wholesale trade                                            4491    1316 22.7

NAICS

North American Industry Classification System (NAICS) is a numerical system for coding industries. The full code has 6 digits, but the first 2 digits are often used for a less detailed breakdown. The first two digits correspond very closely to Sector, the feature that I have just looked at.

As an example, take NAICS 541940: 54 refers to “Professional, Scientific, and Technical Services” and 541940 refers to “Veterinary Services”.

You can check the codes at www.census.gov/naics.
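As a quick check on the correspondence between the two-digit prefix and Sector (a sketch, not part of my original exploration), the prefix can be extracted and tabulated against Sector.

# --- sketch: two-digit NAICS prefix against Sector -------
trainRawDF %>%
  mutate( naics2 = floor(NAICS/10000) ) %>%
  count( Sector, naics2 ) %>%
  arrange( Sector, naics2 )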

Here are the 6-digit codes with the highest and lowest default rates.

# --- NAICS industry classification -----------------------
#
options(knitr.kable.NA = '..')

trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "defaulted"))) %>%
  group_by(NAICS, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( Percent = round(100*defaulted/(defaulted+repaid),1)) %>%
  # --- at least 50 loans in this sector ------------------
  filter( repaid + defaulted > 50) %>%
  # --- top and bottom 3 ----------------------------------
  arrange( Percent ) %>%
  filter(row_number() > max(row_number()) - 3 | 
           row_number() <= 3) %>%
  add_row( Percent = 20)  %>%
  arrange( Percent) %>%
  mutate( Percent = ifelse(Percent == 20, NA, Percent)) %>%
  kable(caption="NAICS codes with the highest and lowest default rates") %>%
  kable_classic(full_width=FALSE) 
Table 2: NAICS codes with the highest and lowest default rates
NAICS repaid defaulted Percent
514210 54 1 1.8
421830 211 6 2.8
421930 66 2 2.9
.. .. .. ..
522310 127 157 55.3
488999 39 52 57.1
236116 37 53 58.9

51 is the information sector and 23 is construction.

Location

You can look separately at the State of the loanee and the State of the bank. The ranges of default rates surprise me, but that probably reflects my not living in the USA.

The key message seems to be: don’t lend to businesses based in Florida.

# --- state of loanee --------------------------------
# --- top 3 and bottom 3 rates -----------------------
trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  group_by(State, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( Percent = round(100*default/(default+repaid),1)) %>%
  # --- at least 50 loans in this sector ------------------
  filter( repaid + default > 50) %>%
  # --- top and bottom 3 ----------------------------------
  arrange( Percent ) %>%
  filter(row_number() > max(row_number()) - 3 | 
           row_number() <= 3) %>%
  add_row( Percent = 20)  %>%
  arrange( Percent) %>%
  mutate( Percent = ifelse(Percent == 20, NA, Percent)) %>%
  kable(caption="Home State of the Loanee") %>%
  kable_classic(full_width=FALSE) 
Table 3: Home State of the Loanee
State repaid default Percent
ND 204 10 4.7
MT 541 46 7.8
WY 137 12 8.1
.. .. .. ..
GA 1340 649 32.6
DC 121 60 33.1
FL 2869 1573 35.4

Don’t borrow from a bank in Virginia (unless you plan to default).

# --- state of  Bank --------------------------------
# --- top 3 and bottom 3 rates -----------------------
trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  group_by(BankState, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( Percent = round(100*default/(default+repaid),1)) %>%
  # --- at least 50 loans in this sector ------------------
  filter( repaid + default > 50) %>%
  # --- top and bottom 3 ----------------------------------
  arrange( Percent ) %>%
  filter(row_number() > max(row_number()) - 3 | 
           row_number() <= 3) %>%
  add_row( Percent = 20)  %>%
  arrange( Percent) %>%
  mutate( Percent = ifelse(Percent == 20, NA, Percent)) %>%
  kable(caption="Home State of the Bank") %>%
  kable_classic(full_width=FALSE) 
Table 4: Home State of the Bank
BankState repaid default Percent
ME 163 4 2.4
DC 197 6 3.0
AZ 300 11 3.5
.. .. .. ..
DE 2268 900 28.4
NC 6841 3447 33.5
VA 2123 1663 43.9

An interesting feature is whether the business borrowed from a local bank, by which I mean a bank in its own state.

# --- local (in-state) loans ------------------------------
trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  mutate( inState = factor(State == BankState,
                           levels=c(FALSE, TRUE),
                           labels=c("No", "Yes"))) %>%
  group_by(inState, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( pct = 100*default/(default+repaid)) %>%
  arrange( pct )
## # A tibble: 2 x 4
##   inState repaid default   pct
##   <fct>    <int>   <int> <dbl>
## 1 Yes      28827    3961  12.1
## 2 No       35630   15238  30.0

Borrowing out of state is more common and is more likely to result in a default.

Urban/Rural

We also know whether the company is urban or rural.

# --- urban=1 or rural=2 or undefined=0 ----------------
trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  mutate( UrbanRural = factor(UrbanRural,
                           levels=0:2,
                           labels=c("Undefined", "Urban", "Rural"))) %>%
  group_by(UrbanRural, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( pct = 100*default/(default+repaid)) %>%
  arrange( pct )
## # A tibble: 3 x 4
##   UrbanRural repaid default   pct
##   <fct>       <int>   <int> <dbl>
## 1 Undefined   15196     998  6.16
## 2 Rural        7951    2288 22.3 
## 3 Urban       41310   15913 27.8

Bank

There are over 300 different banks. Here are the top 5 and the bottom 5 for defaults.

# --- default rate by Bank ---------------------------
trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  group_by(Bank, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( Percent = round(100*default/(default+repaid),1)) %>%
  # --- at least 50 loans in this sector ------------------
  filter( repaid + default > 50) %>%
  # --- top and bottom 3 ----------------------------------
  arrange( Percent ) %>%
  filter(row_number() > max(row_number()) - 5 | 
           row_number() <= 5) %>%
  add_row( Percent = 20)  %>%
  arrange( Percent) %>%
  mutate( Percent = ifelse(Percent == 20, NA, Percent)) %>%
  kable(caption="Best and Worst Banks") %>%
  kable_classic(full_width=FALSE) 
Table 5: Best and Worst Banks
Bank repaid default Percent
BAY AREA EMPLOYMENT DEVEL CO 73 0 0.0
BUSINESS DEVEL FINAN CORP 108 0 0.0
BUSINESS FINANCE GROUP, INC. 97 0 0.0
CALIFORNIA STATEWIDE CERT. DEV 124 0 0.0
CAPITAL ACCESS GROUP, INC. 58 0 0.0
.. .. .. ..
HSBC BK USA NATL ASSOC 343 264 43.5
BANCO POPULAR NORTH AMERICA 499 461 48.0
BBCN BANK 1294 2092 61.8
SUPERIOR FINANCIAL GROUP, LLC 162 473 74.5
CAPITAL ONE BK (USA) NATL ASSO 8 53 86.9

Some of these rates are so poor that they are hard to credit. How can these banks still be in business? The data come from the years 1990-2010, so perhaps they reflect the impact of the 2008 financial crisis. Even so!

Year of Loan

The plot of default rate by financial year confirms the importance of the 2007-2008 crisis. It is interesting that these data make it look as though the financial crisis was predictable in 2005.

trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  group_by(ApprovalFY, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( pct = 100*default/(default+repaid)) %>%
  arrange( pct ) %>%
  ggplot( aes(x=ApprovalFY, y=pct)) +
  geom_line() +
  labs( x= "Year loan approaved",
        y = "Percent defaulting",
        title = "Pattern of defaults over time")

New and Existing Businesses

The coding is 1 for an existing business and 2 for a new business. There are 16 missing codes and 60 businesses coded as 0; zero codes are not mentioned in the data dictionary.

trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  mutate( NewExist = factor(NewExist),
          NewExist = fct_explicit_na(NewExist)) %>%
  group_by(NewExist, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( pct = 100*default/(default+repaid)) %>%
  arrange( pct )
## # A tibble: 4 x 4
##   NewExist  repaid default   pct
##   <fct>      <int>   <int> <dbl>
## 1 0             56       4  6.67
## 2 (Missing)     14       2 12.5 
## 3 1          47694   13951 22.6 
## 4 2          16693    5242 23.9

I think that it would make sense to merge missing and zero.
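Something along these lines would do it; a minimal sketch (the same recoding reappears in the data cleaning step below).

# --- sketch: merge missing NewExist with the zero code ----
trainRawDF %>%
  mutate( NewExist = ifelse( is.na(NewExist), 0, NewExist) )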

Size of the Company

Companies with many employees are less likely to default.

# --- number of employees -----------------------
trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  mutate( NoEmp = cut(NoEmp, breaks=c(0, 1, 5, 10, 100, 10000 ), include.lowest=TRUE, dig.lab=5)) %>%
  group_by(NoEmp, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( pct = 100*default/(default+repaid)) 
## # A tibble: 5 x 4
##   NoEmp       repaid default   pct
##   <fct>        <int>   <int> <dbl>
## 1 [0,1]        12372    5032 28.9 
## 2 (1,5]        26738    9110 25.4 
## 3 (5,10]       11087    2908 20.8 
## 4 (10,100]     13813    2124 13.3 
## 5 (100,10000]    447      25  5.30

CreateJob measures the number of new jobs created by the loan. When the number is large the default rate is lower. Of course, this factor will not be independent of the original size of the company.

trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  mutate( CreateJob = cut(CreateJob, breaks=c(0, 1, 5, 10, 2000 ), include.lowest=TRUE, dig.lab=4)) %>%
  group_by(CreateJob, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( pct = 100*default/(default+repaid)) 
## # A tibble: 4 x 4
##   CreateJob repaid default   pct
##   <fct>      <int>   <int> <dbl>
## 1 [0,1]      48165   13758  22.2
## 2 (1,5]      10391    4166  28.6
## 3 (5,10]      3048     764  20.0
## 4 (10,2000]   2853     511  15.2

Loans that help companies retain jobs are more likely to be repaid.

trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  mutate( RetainedJob = cut(RetainedJob, 
                            breaks=c(0, 1, 5, 10, 100, 5000 ),
                            include.lowest=TRUE, dig.lab=4)) %>%
  group_by(RetainedJob, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( pct = 100*default/(default+repaid)) 
## # A tibble: 5 x 4
##   RetainedJob repaid default   pct
##   <fct>        <int>   <int> <dbl>
## 1 [0,1]        32486    6760 17.2 
## 2 (1,5]        17317    7885 31.3 
## 3 (5,10]        7001    2635 27.3 
## 4 (10,100]      7463    1899 20.3 
## 5 (100,5000]     190      20  9.52

Franchise

The Franchise code is not fully defined in the data dictionary; we are only told that 0 and 1 correspond to businesses that are not franchises. So I’ll just consider Franchise as Yes/No.

trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  mutate( Franchise = factor(FranchiseCode > 1,
                           levels=c(FALSE, TRUE),
                           labels=c("No", "Yes"))) %>%
  group_by(Franchise, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( pct = 100*default/(default+repaid)) %>%
  arrange( pct )
## # A tibble: 2 x 4
##   Franchise repaid default   pct
##   <fct>      <int>   <int> <dbl>
## 1 Yes         3201     809  20.2
## 2 No         61256   18390  23.1

So franchises are in the minority and they are slightly less likely to default.

Size of the loan

Disbursement refers to the amount loaned.

trainRawDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "default"))) %>%
  mutate( Disbursement = cut(DisbursementGross,
      breaks=c(0, quantile(DisbursementGross, 
                           probs=seq(0.1, 1.0, by=0.1))),
      dig.lab=7)) %>%
  group_by(Disbursement, default) %>%
  summarise( n = n(), .groups="drop") %>%
  pivot_wider( values_from=n, names_from=default, values_fill=0 ) %>%
  mutate( pct = 100*default/(default+repaid)) %>%
  arrange( pct )
## # A tibble: 10 x 4
##    Disbursement     repaid default   pct
##    <fct>             <int>   <int> <dbl>
##  1 (480659,5294366]   7638     728  8.70
##  2 (255000,480659]    7380     944 11.3 
##  3 (157108,255000]    7014    1393 16.6 
##  4 (110000,157108]    6632    1677 20.2 
##  5 (80000,110000]     6408    2007 23.9 
##  6 (54472,80000]      6254    2118 25.3 
##  7 (25000,41211.5]    6101    2172 26.3 
##  8 (41211.5,54472]    5866    2500 29.9 
##  9 (15000,25000]      5343    2351 30.6 
## 10 (0,15000]          5821    3309 36.2

Businesses are more likely to default if the loan is small. Presumably, banks are less risk averse when the amount is small.

When companies do default, the size of the default is linked to the size of the loan.

trainRawDF %>%
  filter( default_amount> 0) %>%
  ggplot( aes(x=DisbursementGross, y=default_amount) ) +
  geom_point() +
  geom_abline() +
  geom_smooth() +
  labs( x = "Size of the loan (disbursement)",
        y = "Size of the default",
        title = "Amount defaulted and the size of the loan")

Alternatively, we can look at the percentage of the loan that is defaulted. Presumably the numbers over 100% refer to businesses that run up interest that they do not repay. On average, companies default on about 60% of the loan.
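Before plotting, here is a quick check of that average; a small sketch that was not in my original code.

# --- sketch: average percent of the loan defaulted --------
trainRawDF %>%
  filter( default_amount > 0 ) %>%
  summarise( meanPct = mean( 100*default_amount/DisbursementGross ) )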

trainRawDF %>%
  filter( default_amount> 0) %>%
  ggplot( aes(x=DisbursementGross, y=100*default_amount/DisbursementGross) ) +
  geom_point() +
  geom_smooth() +
    labs( x = "Size of the loan (disbursment)",
        y = "Percent of loan defaulted",
        title = "Percent defaulted and the size of the loan")

GrAppv measures the amount that the bank originally approved. In most cases it is very similar to the Disbursement.

trainRawDF %>%
  ggplot( aes(x=GrAppv/100000, y=DisbursementGross/100000) ) +
  geom_point() +
  geom_abline()  +
    labs( y = "Size of the loan (disbursment)",
        x = "Amount originally approved",
        title = "Amount approved and amount eventually loaned in units of $100,000")

SBA_Appv is the amount guaranteed by the SBA (Small Business Administration), an organisation created to help small businesses.

trainRawDF %>%
  filter( SBA_Appv > 0 ) %>%
  ggplot( aes(y=SBA_Appv/100000, x=GrAppv/100000) ) +
  geom_point() +
  geom_abline() +
  labs( y = "SBA Guarantee ($100,000)",
        x = "Amount Approved ($100,000)",
        title = "SBA Guarantees")

Data Cleaning

There are many distinct codes for Bank, State, BankState and NAICS, so I have decided to use the first three digits of NAICS and to lump the smaller categories of these variables into an ‘Other’ category.

I have also decided to drop Sector as it duplicates NAICS and to ignore Zip code and City. Name is the name of the business and I’ve dropped that as well.

trainRawDF %>%
   mutate( NewExist = factor(ifelse( is.na(NewExist), 
                                     0, NewExist)),
           inState  = as.numeric( State == BankState),
           Bank     = factor(Bank),
           Bank     = fct_lump_prop(Bank, prop=0.001),
           State    = factor(State),
           State    = fct_lump_prop(State, prop=0.01),
           BankState = factor(BankState),
           BankState = fct_lump_prop(BankState, prop=0.01),
           FranchiseCode = ifelse( FranchiseCode <= 1, 0, 1),
           UrbanRural = factor(UrbanRural),
           NAICS      = factor(floor(NAICS/1000)),
           NAICS     = fct_lump_prop(NAICS, prop=0.001)) %>%
  select( -LoanNr_ChkDgt, -Name, -Sector, -City, -Zip ) -> trainDF

The test data must be coded in the same way. That is, I must use the levels of the training data after lumping and not lump the test data.

testRawDF %>%
   mutate( NewExist = factor(ifelse( is.na(NewExist), 0, NewExist)),
           inState  = as.numeric( State == BankState),
           Bank     = factor(Bank,
                             levels=levels(trainDF$Bank)),
           Bank     = fct_explicit_na(Bank, na_level="Other"),
           State    = factor(State,
                             levels=levels(trainDF$State)),
           State    = fct_explicit_na(State, na_level="Other"),
           BankState     = factor(BankState,
                             levels=levels(trainDF$BankState)),
           BankState     = fct_explicit_na(BankState,
                                           na_level="Other"),
           FranchiseCode = ifelse( FranchiseCode <= 1, 0, 1),
           UrbanRural = factor(UrbanRural),
           NAICS      = factor(floor(NAICS/1000),
                               levels=levels(trainDF$NAICS)),
           NAICS     = fct_explicit_na(NAICS,
                                           na_level="Other") ) %>%
  select( -Name, -Sector, -City, -Zip ) -> testDF

I plan to use xgboost, so I need to convert the factors into indicators. I’ll do that with my add_dummy() function (see Appendix).

# --- Add indicators to training set ------------------------
trainDF %>%
  add_dummy( Bank, "B") %>%
  add_dummy( State, "S") %>%
  add_dummy( UrbanRural, "U") %>%
  add_dummy( NewExist, "E") %>%
  add_dummy( NAICS, "N") %>%
  add_dummy( BankState, "T") %>%
  select( -Bank, -State, -BankState, -UrbanRural,
          -NewExist, -NAICS) -> trainDF

# --- Add indicators to test set ----------------------------
testDF %>%
  add_dummy( Bank, "B") %>%
  add_dummy( State, "S") %>%
  add_dummy( UrbanRural, "U") %>%
  add_dummy( NewExist, "E") %>%
  add_dummy( NAICS, "N") %>%
  add_dummy( BankState, "T") %>%
  select( -Bank, -State, -BankState, -UrbanRural,
          -NewExist, -NAICS) -> testDF

Model Building

I want to create two models, one for the probability of defaulting and the other for the amount defaulted when companies do default.

I’ll create an estimation and validation set so that I can assess performance.

# --- Split into estimation and validation --------------
set.seed(7028)
split <- sample(1:83656, size=20000, replace=FALSE)

estimateDF <- trainDF[-split, ]
validateDF <- trainDF[ split, ]

Probability of a Default

First save the features and response in a matrix and vector.

library(xgboost)

# --- Estimation data to matrices ------------------------
estimateDF %>%
  select( -default_amount) %>%
  as.matrix() -> X

estimateDF %>%
  mutate(y = as.numeric(default_amount>0)) %>%
  pull(y) -> Y

dtrain <- xgb.DMatrix(data = X, label = Y)

# --- Validation data to matrices ------------------------
validateDF %>%
  select( -default_amount) %>%
  as.matrix() -> XV

validateDF %>% 
  mutate(y = as.numeric(default_amount>0)) %>%
  pull(y) -> YV

dtest <- xgb.DMatrix(data = XV, label=YV)

Next I’ll investigate how many rounds are needed by fitting an xgboost model with what I hope is too many rounds.

# --- fit the xgboost model -----------------------------
xgb.train(data=dtrain, 
        watchlist=list(train=dtrain, test=dtest),
        objective="binary:logistic",
        nrounds=200, verbose=0) -> xgmod

We can picture the performance as I did for the Airbnb analysis.

# --- plot of train & validation metrics ----------------
xgmod$evaluation_log %>%
    ggplot( aes(x=iter, y=train_error)) +
    geom_line(colour="blue") +
    geom_line( aes(y=test_error), colour="red") +
    labs(x="Iteration", y="loss", 
         title="In-sample and out-of-sample loss")

100 iterations look reasonable. The loss here is the misclassification error, so we are misclassifying about 18.5% of the businesses.
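If you prefer to pick the number of rounds programmatically, the evaluation log can be queried directly; a one-line sketch, assuming the default test_error column that the binary:logistic objective produces.

# --- sketch: iteration with the lowest validation error ---
which.min( xgmod$evaluation_log$test_error )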

I’ll rerun with 100 rounds.

# --- fit the xgboost model -----------------------------
xgb.train(data=dtrain, 
        watchlist=list(train=dtrain, test=dtest),
        objective="binary:logistic",
        nrounds=100, verbose=0) -> xgmod

and look at the predictions.

# --- Performance in the Validation set -----------------
validateDF %>%
  mutate( default = factor(default_amount> 0, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "defaulted"))) %>%
  mutate( p = predict(xgmod, newdata=XV)) %>%
  mutate( predDefault = factor(p > 0.5, 
                           levels=c(FALSE, TRUE),
                           labels=c("repaid", "defaulted"))) %>%
  group_by( default, predDefault)  %>%
  summarise( n = n(), .groups="drop")
## # A tibble: 4 x 3
##   default   predDefault     n
##   <fct>     <fct>       <int>
## 1 repaid    repaid      14543
## 2 repaid    defaulted     902
## 3 defaulted repaid       2771
## 4 defaulted defaulted    1784

The model misclassifies 902 (6%) of those who actually repaid their loan and 2771 (61%) of those who defaulted. Not that great.

Out of interest, I look at the variable importance.

xgb.importance(model=xgmod) %>%
  as_tibble() 
## # A tibble: 217 x 4
##    Feature             Gain  Cover Frequency
##    <chr>              <dbl>  <dbl>     <dbl>
##  1 ApprovalFY        0.376  0.0754   0.104  
##  2 SBA_Appv          0.105  0.0350   0.0860 
##  3 DisbursementGross 0.0754 0.0299   0.148  
##  4 B13               0.0524 0.0128   0.00773
##  5 inState           0.0387 0.0124   0.0145 
##  6 GrAppv            0.0252 0.0143   0.0599 
##  7 NoEmp             0.0219 0.0137   0.0705 
##  8 CreateJob         0.0174 0.0175   0.0525 
##  9 RetainedJob       0.0160 0.0138   0.0534 
## 10 N64               0.0150 0.0140   0.00676
## # ... with 207 more rows

The year in which the loan was approved is the most important feature, reflecting the impact of the financial crisis.

Amount of the default

As before, I extract the data into matrices. This time I predict the proportion of the disbursement that was defaulted.

# --- Extract the Estimation data ---------------------
estimateDF %>%
  filter( default_amount > 0 ) %>%
  select( -default_amount) %>%
  as.matrix() -> X2

estimateDF %>%
  filter( default_amount > 0 ) %>%
  mutate( y = default_amount/DisbursementGross) %>%
  pull(y) -> Y2

dtrain <- xgb.DMatrix(data = X2, label = Y2)

# --- Extract the Validation data ---------------------
validateDF %>%
  filter( default_amount > 0 ) %>%
  select( -default_amount) %>%
  as.matrix() -> XV2

validateDF %>% 
  filter( default_amount > 0 ) %>%
  mutate( y = default_amount/DisbursementGross) %>%
  pull(y) -> YV2

dtest <- xgb.DMatrix(data = XV2, label=YV2)

I’ll fit xgboost with the mean squared error criterion, even though the final metric will be the mean absolute error.

First, I investigate the number of rounds.

# --- Model for proportion defaulted -----------------
xgb.train(data=dtrain, 
        watchlist=list(train=dtrain, test=dtest),
         objective="reg:squarederror",
        nrounds=200, verbose=0) -> xgmod2

xgmod2$evaluation_log %>%
    ggplot( aes(x=iter, y=train_rmse)) +
    geom_line(colour="blue") +
    geom_line( aes(y=test_rmse), colour="red") +
    labs(x="Iteration", y="loss", 
         title="In-sample and out-of-sample loss")

25 iterations would be enough.

# --- refit with 25 rounds --------------------------
xgb.train(data=dtrain, 
        watchlist=list(train=dtrain, test=dtest),
         objective="reg:squarederror",
        nrounds=25, verbose=0) -> xgmod2

Prediction

I’ll make predictions for every loan in the validation set: the probability of a default from the first model and the amount from the second. The final prediction will be zero if the first model predicts no default, and the predicted amount otherwise.

validateDF %>%
  mutate( p     = predict(xgmod, newdata=XV ) ) %>%
  mutate( yhat =  DisbursementGross*predict(xgmod2, newdata=XV )) %>%
  mutate( finalPrediction = (p>0.5)*yhat  ) %>%
  summarise( mean(abs(default_amount-finalPrediction)))
## # A tibble: 1 x 1
##   `mean(abs(default_amount - finalPrediction))`
##                                           <dbl>
## 1                                        13200.

A mean absolute error (MAE) of 13,200. This is promising, because the leading model at the time of the competition scored 13,395 and David Robinson won the final with 13,993.

Submission

I refit these models to the full training data.

# --- Probability of Default ---------------------

# --- Training data to matrices ------------------
trainDF %>%
  select( -default_amount) %>%
  as.matrix() -> X

trainDF %>%
  mutate(y = as.numeric(default_amount>0)) %>%
  pull(y) -> Y

dtrain <- xgb.DMatrix(data = X, label = Y)

# --- Model probability of default ---------------
xgb.train(data=dtrain, 
         objective="binary:logistic",
        nrounds=100) -> xgmod

# --- Size of Default ----------------------------

# --- Training data to matrices ------------------
trainDF %>%
  filter( default_amount > 0 ) %>%
  select( -default_amount) %>%
  as.matrix() -> X

trainDF %>%
  filter( default_amount > 0 ) %>%
  mutate( y = default_amount/DisbursementGross) %>%
  pull(y) -> Y

dtrain <- xgb.DMatrix(data = X, label = Y)

# --- Model size of default ---------------------
xgb.train(data=dtrain, 
         objective="reg:squarederror",
        nrounds=25) -> xgmod2

Now for the predictions.

# --- set the test data ---------------------------
testDF %>%
  select( -LoanNr_ChkDgt) %>%
  as.matrix() -> XT

# --- Make two-stage predictions ------------------
testDF %>%
  mutate( p     = predict(xgmod, newdata=XT ) ) %>%
  mutate( yhat =  DisbursementGross*predict(xgmod2, newdata=XT )) %>%
  mutate( default_amount = (p>0.5)*yhat  ) %>%
  select( LoanNr_ChkDgt, default_amount ) %>%
  write_csv( file.path(home, "temp/submission1.csv"))

The submission scored 13,145: top of the leaderboard, without any tuning other than the number of iterations.

What I learned from this analysis

The main point to come out of this analysis is that the head-on approach of fitting a model that directly minimises the mean absolute error does not work well. I have not shown the results because the post is already too long, but I did try it as a secondary analysis. The dream of machine learning is that the data are fed into a giant algorithm that makes highly accurate predictions without any human input. This is just a dream and, for the foreseeable future, it is likely to remain so. We are a long way from algorithms with the intelligence to realise that the bank default problem could be split in two: whether there is a default, and the proportion of the loan that is defaulted.

The second thing that strikes me is the contrast between this analysis and an analysis based on a machine learning pipeline. In my previous two posts, I used mlr3, an ecosystem for organising machine learning projects and creating analysis pipelines. It is similar in its aim to tidymodels, although I think that mlr3 is better. This time, I reverted to my usual way of working and it felt like I had been set free. Having completed the analysis, I could now turn it into a pipeline, but I doubt if I would have produced such a successful analysis had I tried to develop it from scratch in mlr3 or tidymodels.

Of course, you would be justified in being a little wary of these opinions. I was trained as a statistician and it is hard to break with those traditions. I taught my students the importance of the analyst’s skill when developing a good model and so I cling to the idea that computers will never quite replace humans. History is against me.

Appendix

Function for creating indicator variables

add_dummy <- function(thisDF, col, prefix="X") {
  # add one 0/1 indicator column per level of the factor `col`,
  # named prefix1, prefix2, ... in the order of the factor levels
  j <- 0
  sCol <- deparse(substitute(col))
  for( v in levels(thisDF[[sCol]])) {
    j <- j + 1
    dumName <- paste(prefix, j, sep="")
    thisDF[[dumName]] <- as.numeric(thisDF[[sCol]] == v )
  }
  return(thisDF)
}
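
As a toy illustration of how add_dummy() behaves (not taken from the analysis), a three-level factor gains one 0/1 column per level, named with the supplied prefix.

# --- toy example of add_dummy() ----------------------------
tibble( UrbanRural = factor(c("Urban", "Rural", "Undefined")) ) %>%
  add_dummy( UrbanRural, "U")    # adds indicator columns U1, U2, U3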