Methods: Introduction to mlr3
Introduction
This post introduces the ecosystem of packages known as mlr3. It is an alternative to tidymodels and one of my reasons for trying mlr3 was to compare the two.
mlr3 is built on R6, a package that enables Object Oriented Programming (OOP) in R. To understand the way that mlr3 works, it is helpful to know a little about OOP and R6, so that is where I will start.
If you want more detail than I provide, follow the links on the project’s website at https://mlr3.mlr-org.com/; the YouTube videos give a good flavour of what mlr3 can do and the developers’ mlr3 book is very comprehensive (https://mlr3book.mlr-org.com/).
Object Oriented Programming
mlr3 is written using Object Oriented Programming (OOP), which makes it somewhat different to use from base R or the tidyverse, both of which primarily rely on functional programming (FP).
The idea behind OOP is that the fundamental building block is an object. Objects not only contain data, but they also have their own, built-in functions. Typically, those functions control access to the object’s data, or they perform operations on the object’s data.
Good terminology is helpful in avoiding confusion, so I will refer to the data stored in an object as its fields and the built-in functions as methods.
Fields (and sometimes methods) may be kept private, that is, the user is not able to access them directly. The alternative is to make fields and methods public; usually an object contains a mix of the two. The most common practice is to make the fields private but to provide public methods that let the user set or change the data. In that way, the methods can check the requested operations and ensure that the data are not corrupted.
An object can also have what are known as active fields; these are accessed as if they were ordinary fields, but behind the scenes their values are calculated by a function rather than stored directly.
An example of an object used by mlr3 is a Task. A Task is an object for holding a set of data that is to be used in a machine learning project. The object has space for a data table, but it also contains metadata such as the name of the response variable and the names of the features that are to be used as predictors. When an analysis is started, the first job will be to define an object of class Task and to fill it with the data.
From the point of view of the programmer, a key feature of OOP is inheritance. As an example, suppose that we have created a class of object called a Task; we might also require more specialised objects, such as a classification Task or a regression Task. In a classification Task the response would have to be a factor and in a regression Task the response would have to be numeric, but otherwise they would be similar. Rather than creating these new classes from scratch, the programmer can make them both special cases of a general Task, in which case they will inherit a general Task’s fields and methods and the programmer will only need to add the extras.
R6
R6 is the package that enables full OOP in R and it is basic to everything in mlr3. There is a very informative chapter on R6 in Hadley Wickham’s Advanced R book (https://adv-r.hadley.nz/r6.html). R6 creates classes of object whose fields and methods are accessed as though they were saved in an R list. Consequently, the code for using an R6 object looks very similar to the base R code that would be used to access a list.
Suppose that we have a class of R6 object called a job and that amongst the methods associated with this class is new, which is a function used to create a new instance of job. Then a typical piece of code might read
myJob <- job$new( data = myDF, model = "xgboost")
This code uses the method new() from the class job with two arguments and creates an instance called myJob. myJob is an object of class job with the specified data and model fields.
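In case it helps to see where such an object comes from, here is a minimal sketch of how a class like job might be defined with R6. The fields, checks and method names are invented for illustration; they are not part of mlr3.
library(R6)
# --- a hypothetical job class -------------------------------------------
job <- R6Class("job",
  public = list(
    model = NULL,
    # $new() calls initialize() to build the instance
    initialize = function(data, model) {
      stopifnot(is.data.frame(data))
      private$data <- data          # data is kept private ...
      self$model   <- model         # ... model is public
    },
    # a public method giving controlled access to the private field
    nrows = function() nrow(private$data)
  ),
  private = list(
    data = NULL
  )
)
myJob <- job$new(data = data.frame(x = 1:3), model = "xgboost")
myJob$model     # public field, accessed directly
myJob$nrows()   # private field, reached through a method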
If the model field is public then the user will be able to access it directly. For instance, this code would print the contents of the model field.
print(myJob$model)
Internally, myJob behaves like an R list that contains fields (data) and methods (functions), so we access the fields as myJob$fieldName and the methods as myJob$methodName()
In my example, I have supposed that the model field stores a string, but more likely a model will itself be an R6 object that stores, as well as the name of the analysis, its default parameters. The nature of those parameters would depend on the model, so there will be a general class of models and special cases, such as xgboost models and linear regression models.
The code
print(myJob$model$params)
says that myJob, which is my instance of class job, contains an object of class model, and that from within that object I want the values in the params field.
To investigate the structure of an object we could ask for its class using class(myJob), or the names of its contents using names(myJob).
mlr3
Reading the data
We will use the animal adoption data from episode 10 of Sliced as our example. I’ll start with some standard tidyverse code that reads the data and selects some columns for the analysis.
library(mlr3verse)
library(tidyverse)
library(lubridate)
home <- "C:/Projects/kaggle/sliced/s01-e10"
# --- read data and select variable --------------------------
readRDS( file.path(home, "data/rData/train.rds")) %>%
mutate( outcome_type = factor(outcome_type),
animal_type = factor(animal_type),
date = as.numeric(date(datetime)) ) %>%
select( outcome_type, animal_type, date) %>%
print() -> rawDF
## # A tibble: 54,408 x 3
## outcome_type animal_type date
## <fct> <fct> <dbl>
## 1 adoption Cat 17018
## 2 no outcome Other 17464
## 3 transfer Dog 16137
## 4 transfer Cat 16408
## 5 transfer Dog 16816
## 6 adoption Dog 16318
## 7 adoption Cat 16214
## 8 adoption Dog 16490
## 9 adoption Dog 17150
## 10 adoption Dog 16452
## # ... with 54,398 more rows
Loading mlr3verse is similar to loading the tidyverse: it brings in the key packages from the mlr3 ecosystem.
Creating a Task
The first job is to create an object of class TaskClassif, a special type of Task designed for classification problems. The object will store the data and metadata for our training set. I’ll call the instance myTask.
An object of class TaskClassif can have an id (just a title), a backend (the data frame) and metadata on the target (response). The instance is created using a method called new. It is conventional to use the name new for the method that creates a new instance of an object.
myTask <- TaskClassif$new( id = "animal_adoption",
backend = rawDF,
target = "outcome_type")
Let’s see what I have created
class(myTask)
## [1] "TaskClassif" "TaskSupervised" "Task" "R6"
myTask has an inheritance trail. It started as a very general R6 object, a special case of which is a Task, a special case of which is TaskSupervised, a special case of which is TaskClassif.
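The chain is created with the inherit argument of R6Class(). Here is a minimal sketch of the idea; the class names echo mlr3’s but the definitions are invented and far simpler than the real ones.
library(R6)
# --- a general task class -------------------------------------------------
SimpleTask <- R6Class("SimpleTask",
  public = list(
    backend = NULL,
    target  = NULL,
    initialize = function(backend, target) {
      self$backend <- backend
      self$target  <- target
    }
  )
)
# --- a specialised task that inherits from it -----------------------------
SimpleTaskClassif <- R6Class("SimpleTaskClassif",
  inherit = SimpleTask,
  public = list(
    initialize = function(backend, target) {
      stopifnot(is.factor(backend[[target]]))   # the classification-specific extra
      super$initialize(backend, target)         # reuse the parent's code
    }
  )
)
class(SimpleTaskClassif$new(rawDF, "outcome_type"))   # "SimpleTaskClassif" "SimpleTask" "R6"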
What fields and methods are provided?
names(myTask)
## [1] ".__enclos_env__" "negative" "positive" "class_names"
## [5] "labels" "weights" "order" "groups"
## [9] "strata" "data_formats" "feature_types" "ncol"
## [13] "nrow" "col_roles" "row_roles" "properties"
## [17] "target_names" "feature_names" "row_names" "row_ids"
## [21] "hash" "extra_args" "man" "col_info"
## [25] "backend" "task_type" "id" "clone"
## [29] "droplevels" "truth" "data" "initialize"
## [33] "add_strata" "set_col_roles" "set_row_roles" "rename"
## [37] "cbind" "rbind" "select" "filter"
## [41] "missings" "levels" "head" "formula"
## [45] "print" "format" "help"
Unfortunately, names() just gives us a flat list of names, so we cannot tell which are fields and which are methods, though we can make intelligent guesses. Usually the fields come first, so my guess is that everything up to id is a field and everything from clone onwards is a method.
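One way to check that guess: an R6 object is really an environment, so we can pull each public member out by name and test whether it is a function. This is just an exploratory trick, not part of the mlr3 API.
# --- which names are methods (functions) and which are fields? -------------
isMethod <- sapply(names(myTask), function(nm) is.function(myTask[[nm]]))
names(isMethod)[isMethod]     # the methods
names(isMethod)[!isMethod]    # the fields (and active fields)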
Let’s try looking at the field nrow
myTask$nrow
## [1] 54408
If I try to print a method, I will just get a reference to the function
myTask$head
## function (n = 6L)
## .__Task__head(self = self, private = private, super = super,
## n = n)
## <environment: 0x0000000027fdc740>
and when I run the function, not surprisingly it prints the first 6 lines of the data frame.
myTask$head()
## outcome_type animal_type date
## 1: adoption Cat 17018
## 2: no outcome Other 17464
## 3: transfer Dog 16137
## 4: transfer Cat 16408
## 5: transfer Dog 16816
## 6: adoption Dog 16318
What about the field col_roles?
myTask$col_roles
## $feature
## [1] "animal_type" "date"
##
## $target
## [1] "outcome_type"
##
## $name
## character(0)
##
## $order
## character(0)
##
## $stratum
## character(0)
##
## $group
## character(0)
##
## $weight
## character(0)
You can see that col_roles is not a single value but prints like a named list, so it makes sense to ask for its names.
names(myTask$col_roles)
## [1] "feature" "target" "name" "order" "stratum" "group" "weight"
or to refer to specific items in the list
myTask$col_roles$target
## [1] "outcome_type"
or, since we are dealing with R lists, we can also request by position
myTask$col_roles[[1]]
## [1] "animal_type" "date"
Selecting a Learner
Next we will define a classifier. In mlr3 the model used for an analysis, in this case classification, is called a learner. The options available are stored in a dictionary called mlr_learners and as usual in R, the contents can be displayed by typing the name of the object
mlr_learners
## <DictionaryLearner> with 130 stored values
## Keys: classif.AdaBoostM1, classif.bart, classif.C50, classif.catboost,
## classif.cforest, classif.ctree, classif.cv_glmnet, classif.debug,
## classif.earth, classif.extratrees, classif.featureless, classif.fnn,
## classif.gam, classif.gamboost, classif.gbm, classif.glmboost,
## classif.glmnet, classif.IBk, classif.J48, classif.JRip, classif.kknn,
## classif.ksvm, classif.lda, classif.liblinear, classif.lightgbm,
## classif.LMT, classif.log_reg, classif.mob, classif.multinom,
## classif.naive_bayes, classif.nnet, classif.OneR, classif.PART,
## classif.qda, classif.randomForest, classif.ranger, classif.rfsrc,
## classif.rpart, classif.svm, classif.xgboost, clust.agnes, clust.ap,
## clust.cmeans, clust.cobweb, clust.dbscan, clust.diana, clust.em,
## clust.fanny, clust.featureless, clust.ff, clust.kkmeans,
## clust.kmeans, clust.MBatchKMeans, clust.meanshift, clust.pam,
## clust.SimpleKMeans, clust.xmeans, dens.hist, dens.kde, dens.kde_kd,
## dens.kde_ks, dens.locfit, dens.logspline, dens.mixed, dens.nonpar,
## dens.pen, dens.plug, dens.spline, regr.bart, regr.catboost,
## regr.cforest, regr.ctree, regr.cubist, regr.cv_glmnet, regr.earth,
## regr.extratrees, regr.featureless, regr.fnn, regr.gam, regr.gamboost,
## regr.gbm, regr.glm, regr.glmboost, regr.glmnet, regr.IBk, regr.kknn,
## regr.km, regr.ksvm, regr.liblinear, regr.lightgbm, regr.lm,
## regr.M5Rules, regr.mars, regr.mob, regr.randomForest, regr.ranger,
## regr.rfsrc, regr.rpart, regr.svm, regr.xgboost, surv.akritas,
## surv.blackboost, surv.cforest, surv.coxboost, surv.coxph,
## surv.coxtime, surv.ctree, surv.cv_coxboost, surv.cv_glmnet,
## surv.deephit, surv.deepsurv, surv.dnnsurv, surv.flexible,
## surv.gamboost, surv.gbm, surv.glmboost, surv.glmnet, surv.kaplan,
## surv.loghaz, surv.mboost, surv.nelson, surv.obliqueRSF,
## surv.parametric, surv.pchazard, surv.penalized, surv.ranger,
## surv.rfsrc, surv.rpart, surv.svm, surv.xgboost
We have a three-class classification problem, so a tree created by classif.rpart would be a sensible option. We can create an empty learner of that type by using get on the dictionary (get is another conventional method name), but mlr3 provides a helper function lrn() that does exactly the same job.
# --- set a learner using get -----------------------------
myLearner <- mlr_learners$get("classif.rpart")
# --- set a learner using the helper function -------------
myLearner <- lrn("classif.rpart")
# --- print the learner -----------------------------------
myLearner
## <LearnerClassifRpart:classif.rpart>
## * Model: -
## * Parameters: xval=0
## * Packages: rpart
## * Predict Type: response
## * Feature types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
## twoclass, weights
What are the fields and methods of an object of this class?
names(myLearner)
## [1] ".__enclos_env__" "encapsulate" "param_set"
## [4] "predict_type" "phash" "hash"
## [7] "errors" "warnings" "log"
## [10] "timings" "model" "man"
## [13] "fallback" "timeout" "parallel_predict"
## [16] "predict_sets" "packages" "data_formats"
## [19] "properties" "feature_types" "predict_types"
## [22] "task_type" "state" "id"
## [25] "clone" "selected_features" "importance"
## [28] "initialize" "base_learner" "reset"
## [31] "predict_newdata" "predict" "train"
## [34] "help" "print" "format"
Training the Learner
The names of myLearner show us that there is a method called train; this enables us to train the model using a set of data
myLearner$train(task = myTask)
Once trained, I can look at the field, model, to see the result
myLearner$model
## n= 54408
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 54408 21133 adoption (0.61158286 0.08702764 0.30138950)
## 2) animal_type=Bird,Cat,Dog,Livestock 51608 18500 adoption (0.64152845 0.04861649 0.30985506)
## 4) animal_type=Dog 30830 7785 adoption (0.74748621 0.03707428 0.21543951) *
## 5) animal_type=Bird,Cat,Livestock 20778 10715 adoption (0.48431033 0.06574261 0.44994706)
## 10) date>=16744.5 10321 4688 adoption (0.54578045 0.05483965 0.39937991) *
## 11) date< 16744.5 10457 5230 transfer (0.42363967 0.07650378 0.49985656) *
## 3) animal_type=Other 2800 574 no outcome (0.05964286 0.79500000 0.14535714) *
This is a decision tree fitted using the rpart() function from the rpart package. model gives us exactly the same list structure that is returned when we use rpart() directly. We can look at its structure with str()
str(myLearner$model)
## List of 15
## $ frame :'data.frame': 7 obs. of 9 variables:
## ..$ var : chr [1:7] "animal_type" "animal_type" "<leaf>" "date" ...
## ..$ n : int [1:7] 54408 51608 30830 20778 10321 10457 2800
## ..$ wt : num [1:7] 54408 51608 30830 20778 10321 ...
## ..$ dev : num [1:7] 21133 18500 7785 10715 4688 ...
## ..$ yval : num [1:7] 1 1 1 1 1 3 2
## ..$ complexity: num [1:7] 0.0974 0.0189 0 0.0189 0 ...
## ..$ ncompete : int [1:7] 1 1 0 1 0 0 0
## ..$ nsurrogate: int [1:7] 0 0 0 1 0 0 0
## ..$ yval2 : num [1:7, 1:8] 1 1 1 1 1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:8] "" "" "" "" ...
## $ where : Named int [1:54408] 5 7 3 6 3 3 6 3 3 3 ...
## ..- attr(*, "names")= chr [1:54408] "1" "2" "3" "4" ...
## $ call : language rpart::rpart(formula = task$formula(), data = task$data(), xval = 0L)
## $ terms :Classes 'terms', 'formula' language outcome_type ~ animal_type + date
## .. ..- attr(*, "variables")= language list(outcome_type, animal_type, date)
## .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "outcome_type" "animal_type" "date"
## .. .. .. ..$ : chr [1:2] "animal_type" "date"
## .. ..- attr(*, "term.labels")= chr [1:2] "animal_type" "date"
## .. ..- attr(*, "order")= int [1:2] 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, "predvars")= language list(outcome_type, animal_type, date)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "factor" "factor" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "outcome_type" "animal_type" "date"
## $ cptable : num [1:3, 1:3] 0.0974 0.0189 0.01 0 1 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:3] "CP" "nsplit" "rel error"
## $ method : chr "class"
## $ parms :List of 3
## ..$ prior: num [1:3(1d)] 0.612 0.087 0.301
## .. ..- attr(*, "dimnames")=List of 1
## .. .. ..$ : chr [1:3] "1" "2" "3"
## ..$ loss : num [1:3, 1:3] 0 1 1 1 0 1 1 1 0
## ..$ split: num 1
## $ control :List of 9
## ..$ minsplit : int 20
## ..$ minbucket : num 7
## ..$ cp : num 0.01
## ..$ maxcompete : int 4
## ..$ maxsurrogate : int 5
## ..$ usesurrogate : int 2
## ..$ surrogatestyle: int 0
## ..$ maxdepth : int 30
## ..$ xval : int 0
## $ functions :List of 3
## ..$ summary:function (yval, dev, wt, ylevel, digits)
## ..$ print :function (yval, ylevel, digits)
## ..$ text :function (yval, dev, wt, ylevel, digits, n, use.n)
## $ numresp : int 5
## $ splits : num [1:7, 1:5] 54408 54408 51608 51608 20778 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "animal_type" "date" "animal_type" "date" ...
## .. ..$ : chr [1:5] "count" "ncat" "improve" "index" ...
## $ csplit : int [1:4, 1:5] 1 3 1 1 1 3 3 3 1 1 ...
## $ variable.importance: Named num [1:2] 4004 132
## ..- attr(*, "names")= chr [1:2] "animal_type" "date"
## $ y : int [1:54408] 1 2 3 3 3 1 1 1 1 1 ...
## $ ordered : Named logi [1:2] FALSE FALSE
## ..- attr(*, "names")= chr [1:2] "animal_type" "date"
## - attr(*, "xlevels")=List of 1
## ..$ animal_type: chr [1:5] "Bird" "Cat" "Dog" "Livestock" ...
## - attr(*, "ylevels")= chr [1:3] "adoption" "no outcome" "transfer"
## - attr(*, "class")= chr "rpart"
So, I can access it as I would if I had run rpart() for myself. For example, I could code
myLearner$model$cptable %>%
as_tibble()
## # A tibble: 3 x 3
## CP nsplit `rel error`
## <dbl> <dbl> <dbl>
## 1 0.0974 0 1
## 2 0.0189 1 0.903
## 3 0.01 3 0.865
Here is the parameter set of the model
myLearner$param_set
## <ParamSet>
## id class lower upper nlevels default value
## 1: cp ParamDbl 0 1 Inf 0.01
## 2: keep_model ParamLgl NA NA 2 FALSE
## 3: maxcompete ParamInt 0 Inf Inf 4
## 4: maxdepth ParamInt 1 30 30 30
## 5: maxsurrogate ParamInt 0 Inf Inf 5
## 6: minbucket ParamInt 1 Inf Inf <NoDefault[3]>
## 7: minsplit ParamInt 1 Inf Inf 20
## 8: surrogatestyle ParamInt 0 1 2 0
## 9: usesurrogate ParamInt 0 2 3 2
## 10: xval ParamInt 0 Inf Inf 10 0
param_set has its own print method that creates this layout. The model has been fitted using default parameters, for instance the complexity parameter, cp, was set to 0.01. Let’s change cp.
myLearner$param_set$cp <- 0.001
No good! I get the error “cannot add bindings to a locked environment”, or in plain English, I cannot change it directly. The logic is that cp has to be a number between 0 and 1; if I could change it directly, I might set it to -5 or 52 or “rabbit”.
param_set is itself an object so we can look to see what it contains
names(myLearner$param_set)
## [1] ".__enclos_env__" "has_deps" "values" "has_trafo"
## [5] "trafo" "all_categorical" "all_numeric" "is_categ"
## [9] "is_number" "storage_type" "tags" "default"
## [13] "special_vals" "is_bounded" "nlevels" "levels"
## [17] "upper" "lower" "class" "is_empty"
## [21] "length" "set_id" "deps" "params_unid"
## [25] "params" "assert_values" "clone" "print"
## [29] "format" "add_dep" "assert_dt" "test_dt"
## [33] "check_dt" "assert" "test" "check"
## [37] "search_space" "subset" "get_values" "ids"
## [41] "add" "initialize"
There is a field called values, which contains parameter values set by the user. Remembering that everything is stored in lists, the code that we need is
myLearner$param_set$values <- list(cp = 0.001)
myLearner$param_set$print()
## <ParamSet>
## id class lower upper nlevels default value
## 1: cp ParamDbl 0 1 Inf 0.01 0.001
## 2: keep_model ParamLgl NA NA 2 FALSE
## 3: maxcompete ParamInt 0 Inf Inf 4
## 4: maxdepth ParamInt 1 30 30 30
## 5: maxsurrogate ParamInt 0 Inf Inf 5
## 6: minbucket ParamInt 1 Inf Inf <NoDefault[3]>
## 7: minsplit ParamInt 1 Inf Inf 20
## 8: surrogatestyle ParamInt 0 1 2 0
## 9: usesurrogate ParamInt 0 2 3 2
## 10: xval ParamInt 0 Inf Inf 10
myLearner$param_set$print() calls the built-in print method directly, but I would have got the same output had I called it indirectly with myLearner$param_set or via the R print function, print(myLearner$param_set).
Unfortunately, the developers have chosen to print the contents of the field values in a column labelled value, which is slightly confusing. Notice too that assigning a new list to values replaced its previous contents, so the xval=0 that was set when the learner was created has been dropped and xval reverts to its default of 10; that is why xerror columns appear in the cptable when the model is retrained below.
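Incidentally, hyperparameters can also be supplied when the learner is created; as far as I know, lrn() passes named values straight into param_set$values. I use a separate instance here so as not to disturb myLearner.
# --- set cp at creation time (altLearner is only for illustration) ---------
altLearner <- lrn("classif.rpart", cp = 0.001)
altLearner$param_set$values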
What has happened to the model?
myLearner$model$cptable %>%
as_tibble()
## # A tibble: 3 x 3
## CP nsplit `rel error`
## <dbl> <dbl> <dbl>
## 1 0.0974 0 1
## 2 0.0189 1 0.903
## 3 0.01 3 0.865
Nothing; we still have the model as it was when it was fitted. I can discover the value of cp that was used for the current model
myLearner$model$control$cp
## [1] 0.01
but we do need to be careful as this value is different from
myLearner$param_set$values$cp
## [1] 0.001
This is somewhat dangerous. Anyway, I can now retrain the model with my specified complexity parameter.
myLearner$train(task = myTask)
myLearner$model$cptable %>%
as_tibble()
## # A tibble: 10 x 5
## CP nsplit `rel error` xerror xstd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0974 0 1 1 0.00538
## 2 0.0189 1 0.903 0.903 0.00527
## 3 0.00658 3 0.865 0.867 0.00522
## 4 0.00364 4 0.858 0.856 0.00520
## 5 0.00249 7 0.847 0.850 0.00519
## 6 0.00221 14 0.828 0.833 0.00516
## 7 0.00189 18 0.819 0.826 0.00515
## 8 0.00132 19 0.817 0.824 0.00515
## 9 0.00109 20 0.816 0.823 0.00515
## 10 0.001 22 0.814 0.822 0.00515
The old model is over-written and lost.
Making predictions
I might want to make some predictions based on our model. I’ll start with in-sample predictions for the training data, using the method predict
myPredictions <- myLearner$predict(task = myTask)
myPredictions$print()
## <PredictionClassif> for 54408 observations:
## row_ids truth response
## 1 adoption adoption
## 2 no outcome no outcome
## 3 transfer adoption
## ---
## 54406 transfer transfer
## 54407 adoption adoption
## 54408 transfer transfer
What have I created? myPredictions is an object of class PredictionClassif, so it too will have its own methods
names(myPredictions)
## [1] ".__enclos_env__" "confusion" "prob" "response"
## [5] "missing" "truth" "row_ids" "man"
## [9] "predict_types" "task_properties" "task_type" "data"
## [13] "set_threshold" "initialize" "clone" "score"
## [17] "help" "print" "format"
There is a field called confusion, which contains the confusion matrix
myPredictions$confusion
## truth
## response adoption no outcome transfer
## adoption 29392 1769 10403
## no outcome 167 2226 407
## transfer 3716 740 5588
What is the class of this object?
class(myPredictions$confusion)
## [1] "table"
It is just a simple R table, so we can do anything with it that we can do with any other table in R
myTab <- myPredictions$confusion
prop.table(myTab)
## truth
## response adoption no outcome transfer
## adoption 0.540214674 0.032513601 0.191203499
## no outcome 0.003069402 0.040913101 0.007480518
## transfer 0.068298780 0.013600941 0.102705484
Selecting a Measure
We can see from the table of proportions that about 68% were correctly classified and 32% were misclassified. Rather than calculate this for ourselves, mlr3 offers a range of performance measures. They, of course, are stored in a dictionary, so we can list them
mlr_measures
## <DictionaryMeasure> with 84 stored values
## Keys: aic, bic, classif.acc, classif.auc, classif.bacc, classif.bbrier,
## classif.ce, classif.costs, classif.dor, classif.fbeta, classif.fdr,
## classif.fn, classif.fnr, classif.fomr, classif.fp, classif.fpr,
## classif.logloss, classif.mbrier, classif.mcc, classif.npv,
## classif.ppv, classif.prauc, classif.precision, classif.recall,
## classif.sensitivity, classif.specificity, classif.tn, classif.tnr,
## classif.tp, classif.tpr, clust.ch, clust.db, clust.dunn,
## clust.silhouette, debug, dens.logloss, oob_error, regr.bias,
## regr.ktau, regr.mae, regr.mape, regr.maxae, regr.medae, regr.medse,
## regr.mse, regr.msle, regr.pbias, regr.rae, regr.rmse, regr.rmsle,
## regr.rrse, regr.rse, regr.rsq, regr.sae, regr.smape, regr.srho,
## regr.sse, selected_features, surv.brier, surv.calib_alpha,
## surv.calib_beta, surv.chambless_auc, surv.cindex, surv.dcalib,
## surv.graf, surv.hung_auc, surv.intlogloss, surv.logloss, surv.mae,
## surv.mse, surv.nagelk_r2, surv.oquigley_r2, surv.rmse, surv.schmid,
## surv.song_auc, surv.song_tnr, surv.song_tpr, surv.uno_auc,
## surv.uno_tnr, surv.uno_tpr, surv.xu_r2, time_both, time_predict,
## time_train
Let’s try classification accuracy. We can either use get to extract it from the dictionary or use the helper function msr()
myMeasure <- mlr_measures$get("classif.acc")
myMeasure <- msr("classif.acc")
myMeasure
## <MeasureClassifSimple:classif.acc>
## * Packages: mlr3measures
## * Range: [0, 1]
## * Minimize: FALSE
## * Parameters: list()
## * Properties: -
## * Predict type: response
To use this measure, I note that myPredictions has a method called score that does the calculation
myPredictions$score(myMeasure)
## classif.acc
## 0.6838333
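Before moving on, it is worth noting that train() and predict() both accept a row_ids argument, so a quick out-of-sample check can be made by training on a random 80% of the rows and scoring on the rest. I clone the learner so that the model stored in myLearner is left untouched; the resampling tools described next do this sort of thing properly.
# --- a rough holdout check using row_ids ------------------------------------
set.seed(4569)
trainRows <- sample(myTask$nrow, size = round(0.8 * myTask$nrow))
testRows  <- setdiff(myTask$row_ids, trainRows)
splitLearner <- myLearner$clone()
splitLearner$train(myTask, row_ids = trainRows)
splitLearner$predict(myTask, row_ids = testRows)$score(myMeasure)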
Resampling
The measure confirms that 68% are correctly classified, but this is probably an optimistic estimate of performance because of overfitting. I’ll create a cross-validated estimate.
There are many ways to resample in mlr3. Let’s look at the dictionary
mlr_resamplings
## <DictionaryResampling> with 9 stored values
## Keys: bootstrap, custom, custom_cv, cv, holdout, insample, loo,
## repeated_cv, subsampling
I will use a basic cross-validation, cv. I create a resampling object, either with get or with the helper function rsmp()
myCV <- mlr_resamplings$get("cv")
myCV <- rsmp("cv")
myCV
## <ResamplingCV> with 10 iterations
## * Instantiated: FALSE
## * Parameters: folds=10
I get 10 folds, which is the default, but I only want 5 folds. I can set the folds with rsmp()
myCV <- rsmp("cv", folds=5)
myCV
## <ResamplingCV> with 5 iterations
## * Instantiated: FALSE
## * Parameters: folds=5
At present, the resampling object is not associated with any data so it cannot actually create the folds. The method for creating real folds has the ugly name instantiate
myCV$instantiate(task = myTask)
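Once instantiated, the object knows which rows belong to each fold; for example, here are the row ids in the test set of the first fold.
head(myCV$test_set(1))
length(myCV$test_set(1))   # roughly a fifth of the 54408 rows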
I can run the cross-validation with a helper function called resample()
rsFit <- resample( task = myTask,
learner = myLearner,
resampling = myCV)
## INFO [15:32:44.164] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:44.638] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:45.017] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:45.391] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:45.763] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
What is the cross-validated accuracy? Here are the 5 cross-validated values
rsFit$score(myMeasure)
## task task_id learner learner_id
## 1: <TaskClassif[47]> animal_adoption <LearnerClassifRpart[36]> classif.rpart
## 2: <TaskClassif[47]> animal_adoption <LearnerClassifRpart[36]> classif.rpart
## 3: <TaskClassif[47]> animal_adoption <LearnerClassifRpart[36]> classif.rpart
## 4: <TaskClassif[47]> animal_adoption <LearnerClassifRpart[36]> classif.rpart
## 5: <TaskClassif[47]> animal_adoption <LearnerClassifRpart[36]> classif.rpart
## resampling resampling_id iteration prediction
## 1: <ResamplingCV[19]> cv 1 <PredictionClassif[19]>
## 2: <ResamplingCV[19]> cv 2 <PredictionClassif[19]>
## 3: <ResamplingCV[19]> cv 3 <PredictionClassif[19]>
## 4: <ResamplingCV[19]> cv 4 <PredictionClassif[19]>
## 5: <ResamplingCV[19]> cv 5 <PredictionClassif[19]>
## classif.acc
## 1: 0.6775409
## 2: 0.6820437
## 3: 0.6827789
## 4: 0.6846797
## 5: 0.6776032
and here is the aggregated measure (mean)
rsFit$aggregate(myMeasure)
## classif.acc
## 0.6809293
Not as much overfitting as I was expecting.
Tuning the hyperparameters
Let’s try tuning the model’s hyperparameters. I will try a random search algorithm with cp between 0.001 and 0.1 and minsplit between 1 and 10.
I will be trying 10 random sets of parameters, each with 5-fold cross-validation, so to speed things up the calculations will be run in multiple R sessions. I’ll use the logloss as my performance measure; this requires that the learner predicts probabilities rather than a response category.
The tuning is set up using to_tune() and executed using tune()
# --- use the future package to create the sessions ---------------------
future::plan("multisession")
# --- set the hyperparameters to be tuned -------------------------------
myLearner$param_set$values$cp = to_tune(0.001, 0.1)
myLearner$param_set$values$minsplit = to_tune(1, 10)
myLearner$predict_type = "prob"
# --- choose the performance measure ------------------------------------
myMeasure <- mlr_measures$get("classif.logloss")
# --- pick 10 random combinations ---------------------------------------
set.seed(9830)
myTuner <- tune(
method = "random_search",
task = myTask,
learner = myLearner,
resampling = myCV,
measure = myMeasure,
term_evals = 10,
batch_size = 5
)
## INFO [15:32:48.672] [bbotk] Starting to optimize 2 parameter(s) with '<OptimizerRandomSearch>' and '<TerminatorEvals> [n_evals=10]'
## INFO [15:32:48.727] [bbotk] Evaluating 5 configuration(s)
## INFO [15:32:48.825] [mlr3] Running benchmark with 25 resampling iterations
## INFO [15:32:49.465] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:49.859] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:50.186] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:50.386] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:50.774] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:51.093] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:50.767] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:51.211] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:51.405] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:51.139] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:51.492] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:51.777] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:52.005] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:51.721] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:52.146] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:52.448] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:52.090] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:52.508] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:52.698] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:52.530] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:52.813] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:53.016] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:53.145] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:53.413] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:53.679] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:53.929] [mlr3] Finished benchmark
## INFO [15:32:54.351] [bbotk] Result of batch 1:
## INFO [15:32:54.353] [bbotk] cp minsplit classif.logloss uhash
## INFO [15:32:54.353] [bbotk] 0.083748776 5 0.7864746 61e4ef11-5e35-4f72-a30b-932356917cb3
## INFO [15:32:54.353] [bbotk] 0.011339234 5 0.7494300 957ecf8d-5617-4c0c-ab99-a50f432fc4fe
## INFO [15:32:54.353] [bbotk] 0.009854275 7 0.7494300 8a6efce0-c6e0-44de-b64e-29e3a8930f82
## INFO [15:32:54.353] [bbotk] 0.037606757 2 0.7864746 5b04f5ef-9906-440f-9562-97a65f5cc363
## INFO [15:32:54.353] [bbotk] 0.066829580 10 0.7864746 86da6032-a1e6-4e98-bc88-aba44feb336f
## INFO [15:32:54.356] [bbotk] Evaluating 5 configuration(s)
## INFO [15:32:54.414] [mlr3] Running benchmark with 25 resampling iterations
## INFO [15:32:54.435] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:55.008] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:55.592] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:54.450] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:55.024] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:55.609] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:54.467] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:55.530] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:56.092] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:54.489] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:55.328] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:55.957] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:56.404] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:54.512] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:55.302] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:56.332] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:54.538] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:55.188] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:56.265] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:54.566] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:55.318] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:56.064] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:54.604] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:55.686] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:56.225] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:56.831] [mlr3] Finished benchmark
## INFO [15:32:57.107] [bbotk] Result of batch 2:
## INFO [15:32:57.112] [bbotk] cp minsplit classif.logloss uhash
## INFO [15:32:57.112] [bbotk] 0.056833792 10 0.7864746 7ab661db-b0ef-4d11-a4d5-370feaab4ac6
## INFO [15:32:57.112] [bbotk] 0.027682699 2 0.7864746 ddc39800-e857-4b0c-bd27-06ab47d90f68
## INFO [15:32:57.112] [bbotk] 0.092669535 7 0.7864746 c405cb41-c165-4411-a43a-da5b04f278ee
## INFO [15:32:57.112] [bbotk] 0.069294124 7 0.7864746 a94dc5c1-7364-4079-a464-8fc8df224956
## INFO [15:32:57.112] [bbotk] 0.003608916 3 0.7482130 5b335d08-b42f-48a1-8d4f-8f43513c65fe
## INFO [15:32:57.124] [bbotk] Finished optimizing after 10 evaluation(s)
## INFO [15:32:57.124] [bbotk] Result:
## INFO [15:32:57.126] [bbotk] cp minsplit learner_param_vals x_domain classif.logloss
## INFO [15:32:57.126] [bbotk] 0.003608916 3 <list[2]> <list[2]> 0.748213
myTuner
## <TuningInstanceSingleCrit>
## * State: Optimized
## * Objective: <ObjectiveTuning:classif.rpart_on_animal_adoption>
## * Search Space:
## <ParamSet>
## id class lower upper nlevels default value
## 1: cp ParamDbl 0.001 0.1 Inf <NoDefault[3]>
## 2: minsplit ParamInt 1.000 10.0 10 <NoDefault[3]>
## * Terminator: <TerminatorEvals>
## * Terminated: TRUE
## * Result:
## cp minsplit learner_param_vals x_domain classif.logloss
## 1: 0.003608916 3 <list[2]> <list[2]> 0.748213
## * Archive:
## <ArchiveTuning>
## cp minsplit classif.logloss timestamp batch_nr
## 1: 0.0837 5 0.79 2021-11-02 15:32:54 1
## 2: 0.0113 5 0.75 2021-11-02 15:32:54 1
## 3: 0.0099 7 0.75 2021-11-02 15:32:54 1
## 4: 0.0376 2 0.79 2021-11-02 15:32:54 1
## 5: 0.0668 10 0.79 2021-11-02 15:32:54 1
## 6: 0.0568 10 0.79 2021-11-02 15:32:57 2
## 7: 0.0277 2 0.79 2021-11-02 15:32:57 2
## 8: 0.0927 7 0.79 2021-11-02 15:32:57 2
## 9: 0.0693 7 0.79 2021-11-02 15:32:57 2
## 10: 0.0036 3 0.75 2021-11-02 15:32:57 2
So the best cross-validated logloss corresponds to cp=0.0036 and minsplit=3. Notice that there are several other combinations that give an almost identical performance.
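To make use of the result, the winning values can be copied back into the learner and the model refitted on the full data. A sketch, assuming that result_learner_param_vals is the field of the tuning instance that stores the winning parameter values.
# --- refit with the tuned hyperparameters -----------------------------------
myLearner$param_set$values <- myTuner$result_learner_param_vals
myLearner$train(myTask)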
Pipelines
The final step is to create a complete pipeline including both preprocessing and model fitting. To this end, mlr3 offers a large range of pipe_ops. As usual their names are stored in a dictionary.
mlr_pipeops
## <DictionaryPipeOp> with 74 stored values
## Keys: boxcox, branch, chunk, classbalancing, classifavg, classweights,
## colapply, collapsefactors, colroles, compose_crank, compose_distr,
## compose_probregr, copy, crankcompose, datefeatures, distrcompose,
## encode, encodeimpact, encodelmer, featureunion, filter, fixfactors,
## histbin, ica, imputeconstant, imputehist, imputelearner, imputemean,
## imputemedian, imputemode, imputeoor, imputesample, kernelpca,
## learner, learner_cv, missind, modelmatrix, multiplicityexply,
## multiplicityimply, mutate, nmf, nop, ovrsplit, ovrunite, pca, proxy,
## quantilebin, randomprojection, randomresponse, regravg,
## removeconstants, renamecolumns, replicate, scale, scalemaxabs,
## scalerange, select, smote, spatialsign, subsample, survavg,
## targetinvert, targetmutate, targettrafoscalerange, textvectorizer,
## threshold, trafopred_regrsurv, trafopred_survregr,
## trafotask_regrsurv, trafotask_survregr, tunethreshold, unbranch,
## vtreat, yeojohnson
Some, such as learner, have already been used, but there are others such as pca, scale, imputemedian and removeconstants that can be used in preprocessing. The idea is to string these together into a complete pipeline.
To create a single pipe_op ready to go into the pipeline, there is a helper function po() that I can use in place of mlr_pipeops$get.
Once the individual pipe_ops have been defined, they are combined in an ordered chain using mlr3’s own pipe operator, %>>%. The resulting pipeline can be saved and even plotted as a graph with the individual pipe_ops as the nodes and edges that show the flow of data between them.
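As a hedged sketch of the idea, here is a small pipeline that imputes medians, scales the numeric features and then fits the tree; neither preprocessing step makes much difference to this particular tree, but they show how the chaining works. GraphLearner$new() wraps the graph so that it can be trained, resampled and tuned like any other learner.
# --- a small illustrative pipeline ------------------------------------------
myGraph <- po("imputemedian") %>>%
  po("scale") %>>%
  po("learner", lrn("classif.rpart"))
myGraph$plot()                            # view the pipeline as a graph
myGraphLearner <- GraphLearner$new(myGraph)
myGraphLearner$train(myTask)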
The big question is: why? What advantage is there in using pipe operators over running the preprocessing with dplyr?
If the preprocessing is completed before the analysis, then there is no real advantage; indeed dplyr is more flexible and would probably be a simpler option. The advantage comes when the preprocessing itself needs to be tuned. Suppose, for example, that the preprocessing includes variable selection. Should I use the top 10 features or the top 15 features in the decision tree? By building the variable selection step into the pipeline, it is possible to tune the number of features alongside the other hyperparameters.
A secondary advantage is that the pipe_ops will automatically save the state of the operation. Imagine that I were to run median imputation on the training data. The pipe_op would save those medians. Later, I might want to impute on the test data and those same medians can be recalled and used again.
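With the graph learner from the sketch above, those saved states can be inspected after training. I am assuming here that the trained graph is exposed as graph_model; the exact structure of a state varies between pipe_ops.
# --- the state saved by the median imputer ----------------------------------
myGraphLearner$graph_model$pipeops$imputemedian$state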
Extensions
An important feature of mlr3 is that users are able to add their own extensions, perhaps a learner, a measure, a tuner, or a pipe_op.
This can be done by writing a new R6 class that inherits from the corresponding mlr3 class.
To help create a new learner, there is a helper function called create_learner() that does most of the work for you. Once it is done, you can add the learner to the mlr_learners dictionary and use it just like any other learner.
Is mlr3 better than tidymodels?
Of course, the answer is, it depends. For me, mlr3’s main advantages over tidymodels are
- what it does is transparent
- the user remains in control
- all intermediate results can be accessed
- it is extendible
- there is a clear and coherent design
The main disadvantage compared with tidymodels is
- it is much more difficult to use
You need to have a grasp of OOP and R6 before you can make sensible use of mlr3.
In my opinion, mlr3 is better than tidymodels, but I accept that the learning curve will be too steep for many people.
For me, a more important choice is between mlr3 and R code written specifically for a given analysis. In their videos and posts, the authors of mlr3 make much of pipelines and the flexibility with which pipe_ops can be combined. I have yet to be convinced that I would make much use of this feature. I suspect that for the vast majority of my work, I would find it easier to run the preprocessing in dplyr, or in purpose-written R code.