Methods: Introduction to mlr3
Introduction
This post introduces the ecosystem of packages known as mlr3. It is an alternative to tidymodels and one of my reasons for trying mlr3 was to compare the two.
mlr3 is built on R6, a package that enables Object Oriented Programming (OOP) in R. To understand the way that mlr3 works, it is helpful to know a little about OOP and R6, so that is where I will start.
If you want more detail than I provide, follow the links on the project’s website at https://mlr3.mlr-org.com/; the YouTube videos give a good flavour of what mlr3 can do and the developers’ mlr3 book is very comprehensive (https://mlr3book.mlr-org.com/).
Object Oriented Programming
mlr3 is written using Object Oriented Programming (OOP), which makes it somewhat different to use from base R or the tidyverse, both of which primarily rely on functional programming (FP).
The idea behind OOP is that the fundamental building block is an object. Objects not only contain data, but they also have their own, built-in functions. Typically, those functions control access to the object’s data, or they perform operations on the object’s data.
Good terminology is helpful in avoiding confusion, so I will refer to the data stored in an object as its fields and the built-in functions as methods.
Fields (and sometimes methods) may be kept private, that is, the user is not able to access them directly. The alternative is to make fields and methods public; usually an object contains a mix of the two. The most common practice is to make the fields private but to provide public methods that let the user set or change the data. In that way, the methods can check the requested operations and ensure that the data are not corrupted.
An object can also have what are known as active fields; these are accessed as if they were ordinary fields, but behind the scenes their values are calculated by a function rather than stored directly.
An example of an object used by mlr3 is a Task. A Task is an object for holding a set of data that is to be used in a machine learning project. The object has space for a data table, but it also contains metadata such as the name of the response variable and the names of the features that are to be used as predictors. When an analysis is started, the first job will be to define an object of class Task and to fill it with the data.
From the point of view of the programmer, a key feature of OOP is inheritance. As an example, suppose that we have created a class of object called a Task; we might also require more specialised objects, such as a classification Task or a regression Task. In a classification Task the response would have to be a factor and in a regression Task the response would have to be numeric, but otherwise they would be similar. Rather than creating these new classes from scratch, the programmer can make them both special cases of a general Task, in which case they will inherit a general Task’s fields and methods and the programmer will only need to add the extras.
R6
R6 is the package that enables full OOP in R and it is basic to everything in mlr3. There is a very informative chapter on R6 in Hadley Wickham’s Advanced R book (https://adv-r.hadley.nz/r6.html). R6 creates classes of object whose fields and methods are accessed as though they were saved in an R list. Consequently, the code for using an R6 object looks very similar to the base R code that would be used to access a list.
Suppose that we have a class of R6 object called a job and that amongst the methods associated with this class is new, which is a function used to create a new instance of job. Then a typical piece of code might read
myJob <- job$new( data = myDF, model = "xgboost")
This code uses the method new() from the class job with two arguments and creates an instance called myJob. myJob is an object of class job with the specified data and model fields.
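In case it helps to see where such an object comes from, here is a minimal sketch of how a class like job might be defined with R6. The fields, checks and method names are invented for illustration; they are not part of mlr3.
library(R6)
# --- a hypothetical job class -------------------------------------------
job <- R6Class("job",
  public = list(
    model = NULL,
    # $new() calls initialize() to build the instance
    initialize = function(data, model) {
      stopifnot(is.data.frame(data))
      private$data <- data          # data is kept private ...
      self$model   <- model         # ... model is public
    },
    # a public method giving controlled access to the private field
    nrows = function() nrow(private$data)
  ),
  private = list(
    data = NULL
  )
)
myJob <- job$new(data = data.frame(x = 1:3), model = "xgboost")
myJob$model     # public field, accessed directly
myJob$nrows()   # private field, reached through a method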
If the model field is public then the user will be able to access it directly. For instance, this code would print the contents of the model field.
print(myJob$model)
Internally, myJob behaves like an R list that contains fields (data) and methods (functions), so we access the fields as myJob$fieldName and the methods as myJob$methodName()
In my example, I have supposed that the model field stores a string, but more likely a model will itself be an R6 object that stores, as well as the name of the analysis, its default parameters. The nature of those parameters would depend on the model, so there will be a general class of models and special cases, such as xgboost models and linear regression models.
The code
print(myJob$model$params)
says that myJob, which is my instance of class job, contains an object of class model, and that from within that object I want the values in the params field.
To investigate the structure of an object we could ask for its class using class(myJob), or the names of its contents using names(myJob).
mlr3
Reading the data
We will use the animal adoption data from episode 10 of Sliced as our example. I’ll start with some standard tidyverse code that reads the data and selects some columns for the analysis.
library(mlr3verse)
library(tidyverse)
library(lubridate)
home <- "C:/Projects/kaggle/sliced/s01-e10"
# --- read data and select variable --------------------------
readRDS( file.path(home, "data/rData/train.rds")) %>%
mutate( outcome_type = factor(outcome_type),
animal_type = factor(animal_type),
date = as.numeric(date(datetime)) ) %>%
select( outcome_type, animal_type, date) %>%
print() -> rawDF
## # A tibble: 54,408 x 3
## outcome_type animal_type date
## <fct> <fct> <dbl>
## 1 adoption Cat 17018
## 2 no outcome Other 17464
## 3 transfer Dog 16137
## 4 transfer Cat 16408
## 5 transfer Dog 16816
## 6 adoption Dog 16318
## 7 adoption Cat 16214
## 8 adoption Dog 16490
## 9 adoption Dog 17150
## 10 adoption Dog 16452
## # ... with 54,398 more rows
Loading mlr3verse is similar to loading the tidyverse: it brings in the key packages from the mlr3 ecosystem.
Creating a Task
The first job is to create an object of class TaskClassif, a special type of Task designed for classification problems. The object will store the data and metadata for our training set. I’ll call the instance myTask.
An object of class TaskClassif can have an id (just a title), a backend (the data frame) and metadata on the target (response). The instance is created using a method called new. It is conventional to use the name new for the method that creates a new instance of an object.
myTask <- TaskClassif$new( id = "animal_adoption",
backend = rawDF,
target = "outcome_type")
Let’s see what I have created
class(myTask)
## [1] "TaskClassif" "TaskSupervised" "Task" "R6"
myTask has an inheritance trail. It started as a very general R6 object, a special case of which is a Task, a special case of which is TaskSupervised, a special case of which is TaskClassif.
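The chain is created with the inherit argument of R6Class(). Here is a minimal sketch of the idea; the class names echo mlr3’s but the definitions are invented and far simpler than the real ones.
library(R6)
# --- a general task class -------------------------------------------------
SimpleTask <- R6Class("SimpleTask",
  public = list(
    backend = NULL,
    target  = NULL,
    initialize = function(backend, target) {
      self$backend <- backend
      self$target  <- target
    }
  )
)
# --- a specialised task that inherits from it -----------------------------
SimpleTaskClassif <- R6Class("SimpleTaskClassif",
  inherit = SimpleTask,
  public = list(
    initialize = function(backend, target) {
      stopifnot(is.factor(backend[[target]]))   # the classification-specific extra
      super$initialize(backend, target)         # reuse the parent's code
    }
  )
)
class(SimpleTaskClassif$new(rawDF, "outcome_type"))   # "SimpleTaskClassif" "SimpleTask" "R6"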
What fields and methods are provided?
names(myTask)
## [1] ".__enclos_env__" "negative" "positive" "class_names"
## [5] "labels" "weights" "order" "groups"
## [9] "strata" "data_formats" "feature_types" "ncol"
## [13] "nrow" "col_roles" "row_roles" "properties"
## [17] "target_names" "feature_names" "row_names" "row_ids"
## [21] "hash" "extra_args" "man" "col_info"
## [25] "backend" "task_type" "id" "clone"
## [29] "droplevels" "truth" "data" "initialize"
## [33] "add_strata" "set_col_roles" "set_row_roles" "rename"
## [37] "cbind" "rbind" "select" "filter"
## [41] "missings" "levels" "head" "formula"
## [45] "print" "format" "help"
Unfortunately, names() just gives us a flat list of names, so we cannot tell which are fields and which are methods, though we can make intelligent guesses. Usually the fields come first, so my guess is that everything up to id is a field and everything from clone onwards is a method.
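One way to check that guess: an R6 object is really an environment, so we can pull each public member out by name and test whether it is a function. This is just an exploratory trick, not part of the mlr3 API.
# --- which names are methods (functions) and which are fields? -------------
isMethod <- sapply(names(myTask), function(nm) is.function(myTask[[nm]]))
names(isMethod)[isMethod]     # the methods
names(isMethod)[!isMethod]    # the fields (and active fields)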
Let’s try looking at the field nrow
myTask$nrow
## [1] 54408
If I try to print a method, I will just get a reference to the function
myTask$head
## function (n = 6L)
## .__Task__head(self = self, private = private, super = super,
## n = n)
## <environment: 0x0000000027fdc740>
and when I run the function, not surprisingly it prints the first 6 lines of the data frame.
myTask$head()
## outcome_type animal_type date
## 1: adoption Cat 17018
## 2: no outcome Other 17464
## 3: transfer Dog 16137
## 4: transfer Cat 16408
## 5: transfer Dog 16816
## 6: adoption Dog 16318
What about the field col_roles?
myTask$col_roles
## $feature
## [1] "animal_type" "date"
##
## $target
## [1] "outcome_type"
##
## $name
## character(0)
##
## $order
## character(0)
##
## $stratum
## character(0)
##
## $group
## character(0)
##
## $weight
## character(0)
You can see that col_roles is not a single value but prints like a named list, so it makes sense to ask for its names.
names(myTask$col_roles)
## [1] "feature" "target" "name" "order" "stratum" "group" "weight"
or to refer to specific items in the list
myTask$col_roles$target
## [1] "outcome_type"
or, since we are dealing with R lists, we can also request by position
myTask$col_roles[[1]]
## [1] "animal_type" "date"
Selecting a Learner
Next we will define a classifier. In mlr3 the model used for an analysis, in this case classification, is called a learner. The options available are stored in a dictionary called mlr_learners and as usual in R, the contents can be displayed by typing the name of the object
mlr_learners
## <DictionaryLearner> with 130 stored values
## Keys: classif.AdaBoostM1, classif.bart, classif.C50, classif.catboost,
## classif.cforest, classif.ctree, classif.cv_glmnet, classif.debug,
## classif.earth, classif.extratrees, classif.featureless, classif.fnn,
## classif.gam, classif.gamboost, classif.gbm, classif.glmboost,
## classif.glmnet, classif.IBk, classif.J48, classif.JRip, classif.kknn,
## classif.ksvm, classif.lda, classif.liblinear, classif.lightgbm,
## classif.LMT, classif.log_reg, classif.mob, classif.multinom,
## classif.naive_bayes, classif.nnet, classif.OneR, classif.PART,
## classif.qda, classif.randomForest, classif.ranger, classif.rfsrc,
## classif.rpart, classif.svm, classif.xgboost, clust.agnes, clust.ap,
## clust.cmeans, clust.cobweb, clust.dbscan, clust.diana, clust.em,
## clust.fanny, clust.featureless, clust.ff, clust.kkmeans,
## clust.kmeans, clust.MBatchKMeans, clust.meanshift, clust.pam,
## clust.SimpleKMeans, clust.xmeans, dens.hist, dens.kde, dens.kde_kd,
## dens.kde_ks, dens.locfit, dens.logspline, dens.mixed, dens.nonpar,
## dens.pen, dens.plug, dens.spline, regr.bart, regr.catboost,
## regr.cforest, regr.ctree, regr.cubist, regr.cv_glmnet, regr.earth,
## regr.extratrees, regr.featureless, regr.fnn, regr.gam, regr.gamboost,
## regr.gbm, regr.glm, regr.glmboost, regr.glmnet, regr.IBk, regr.kknn,
## regr.km, regr.ksvm, regr.liblinear, regr.lightgbm, regr.lm,
## regr.M5Rules, regr.mars, regr.mob, regr.randomForest, regr.ranger,
## regr.rfsrc, regr.rpart, regr.svm, regr.xgboost, surv.akritas,
## surv.blackboost, surv.cforest, surv.coxboost, surv.coxph,
## surv.coxtime, surv.ctree, surv.cv_coxboost, surv.cv_glmnet,
## surv.deephit, surv.deepsurv, surv.dnnsurv, surv.flexible,
## surv.gamboost, surv.gbm, surv.glmboost, surv.glmnet, surv.kaplan,
## surv.loghaz, surv.mboost, surv.nelson, surv.obliqueRSF,
## surv.parametric, surv.pchazard, surv.penalized, surv.ranger,
## surv.rfsrc, surv.rpart, surv.svm, surv.xgboost
We have a three-class classification problem, so a tree created by classif.rpart would be a sensible option. We can create an empty learner of that type by using get on the dictionary (get is another conventional method name), but mlr3 provides a helper function lrn() that does exactly the same job.
# --- set a learner using get -----------------------------
myLearner <- mlr_learners$get("classif.rpart")
# --- set a learner using the helper function -------------
myLearner <- lrn("classif.rpart")
# --- print the learner -----------------------------------
myLearner
## <LearnerClassifRpart:classif.rpart>
## * Model: -
## * Parameters: xval=0
## * Packages: rpart
## * Predict Type: response
## * Feature types: logical, integer, numeric, factor, ordered
## * Properties: importance, missings, multiclass, selected_features,
## twoclass, weights
What are the fields and methods of an object of this class?
names(myLearner)
## [1] ".__enclos_env__" "encapsulate" "param_set"
## [4] "predict_type" "phash" "hash"
## [7] "errors" "warnings" "log"
## [10] "timings" "model" "man"
## [13] "fallback" "timeout" "parallel_predict"
## [16] "predict_sets" "packages" "data_formats"
## [19] "properties" "feature_types" "predict_types"
## [22] "task_type" "state" "id"
## [25] "clone" "selected_features" "importance"
## [28] "initialize" "base_learner" "reset"
## [31] "predict_newdata" "predict" "train"
## [34] "help" "print" "format"
Training the Learner
The names of myLearner show us that there is a method called train; this enables us to train the model using a set of data
myLearner$train(task = myTask)
Once trained, I can look at the field, model, to see the result
myLearner$model
## n= 54408
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 54408 21133 adoption (0.61158286 0.08702764 0.30138950)
## 2) animal_type=Bird,Cat,Dog,Livestock 51608 18500 adoption (0.64152845 0.04861649 0.30985506)
## 4) animal_type=Dog 30830 7785 adoption (0.74748621 0.03707428 0.21543951) *
## 5) animal_type=Bird,Cat,Livestock 20778 10715 adoption (0.48431033 0.06574261 0.44994706)
## 10) date>=16744.5 10321 4688 adoption (0.54578045 0.05483965 0.39937991) *
## 11) date< 16744.5 10457 5230 transfer (0.42363967 0.07650378 0.49985656) *
## 3) animal_type=Other 2800 574 no outcome (0.05964286 0.79500000 0.14535714) *
This is a decision tree fitted using the rpart() function from the rpart package. model gives us exactly the same list structure that is returned when we use rpart() directly. We can look at its structure with str()
str(myLearner$model)
## List of 15
## $ frame :'data.frame': 7 obs. of 9 variables:
## ..$ var : chr [1:7] "animal_type" "animal_type" "<leaf>" "date" ...
## ..$ n : int [1:7] 54408 51608 30830 20778 10321 10457 2800
## ..$ wt : num [1:7] 54408 51608 30830 20778 10321 ...
## ..$ dev : num [1:7] 21133 18500 7785 10715 4688 ...
## ..$ yval : num [1:7] 1 1 1 1 1 3 2
## ..$ complexity: num [1:7] 0.0974 0.0189 0 0.0189 0 ...
## ..$ ncompete : int [1:7] 1 1 0 1 0 0 0
## ..$ nsurrogate: int [1:7] 0 0 0 1 0 0 0
## ..$ yval2 : num [1:7, 1:8] 1 1 1 1 1 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:8] "" "" "" "" ...
## $ where : Named int [1:54408] 5 7 3 6 3 3 6 3 3 3 ...
## ..- attr(*, "names")= chr [1:54408] "1" "2" "3" "4" ...
## $ call : language rpart::rpart(formula = task$formula(), data = task$data(), xval = 0L)
## $ terms :Classes 'terms', 'formula' language outcome_type ~ animal_type + date
## .. ..- attr(*, "variables")= language list(outcome_type, animal_type, date)
## .. ..- attr(*, "factors")= int [1:3, 1:2] 0 1 0 0 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:3] "outcome_type" "animal_type" "date"
## .. .. .. ..$ : chr [1:2] "animal_type" "date"
## .. ..- attr(*, "term.labels")= chr [1:2] "animal_type" "date"
## .. ..- attr(*, "order")= int [1:2] 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, "predvars")= language list(outcome_type, animal_type, date)
## .. ..- attr(*, "dataClasses")= Named chr [1:3] "factor" "factor" "numeric"
## .. .. ..- attr(*, "names")= chr [1:3] "outcome_type" "animal_type" "date"
## $ cptable : num [1:3, 1:3] 0.0974 0.0189 0.01 0 1 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "1" "2" "3"
## .. ..$ : chr [1:3] "CP" "nsplit" "rel error"
## $ method : chr "class"
## $ parms :List of 3
## ..$ prior: num [1:3(1d)] 0.612 0.087 0.301
## .. ..- attr(*, "dimnames")=List of 1
## .. .. ..$ : chr [1:3] "1" "2" "3"
## ..$ loss : num [1:3, 1:3] 0 1 1 1 0 1 1 1 0
## ..$ split: num 1
## $ control :List of 9
## ..$ minsplit : int 20
## ..$ minbucket : num 7
## ..$ cp : num 0.01
## ..$ maxcompete : int 4
## ..$ maxsurrogate : int 5
## ..$ usesurrogate : int 2
## ..$ surrogatestyle: int 0
## ..$ maxdepth : int 30
## ..$ xval : int 0
## $ functions :List of 3
## ..$ summary:function (yval, dev, wt, ylevel, digits)
## ..$ print :function (yval, ylevel, digits)
## ..$ text :function (yval, dev, wt, ylevel, digits, n, use.n)
## $ numresp : int 5
## $ splits : num [1:7, 1:5] 54408 54408 51608 51608 20778 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "animal_type" "date" "animal_type" "date" ...
## .. ..$ : chr [1:5] "count" "ncat" "improve" "index" ...
## $ csplit : int [1:4, 1:5] 1 3 1 1 1 3 3 3 1 1 ...
## $ variable.importance: Named num [1:2] 4004 132
## ..- attr(*, "names")= chr [1:2] "animal_type" "date"
## $ y : int [1:54408] 1 2 3 3 3 1 1 1 1 1 ...
## $ ordered : Named logi [1:2] FALSE FALSE
## ..- attr(*, "names")= chr [1:2] "animal_type" "date"
## - attr(*, "xlevels")=List of 1
## ..$ animal_type: chr [1:5] "Bird" "Cat" "Dog" "Livestock" ...
## - attr(*, "ylevels")= chr [1:3] "adoption" "no outcome" "transfer"
## - attr(*, "class")= chr "rpart"
So, I can access it as I would if I had run rpart() for myself. For example, I could code
myLearner$model$cptable %>%
as_tibble()
## # A tibble: 3 x 3
## CP nsplit `rel error`
## <dbl> <dbl> <dbl>
## 1 0.0974 0 1
## 2 0.0189 1 0.903
## 3 0.01 3 0.865
Here is the parameter set of the model
myLearner$param_set
## <ParamSet>
## id class lower upper nlevels default value
## 1: cp ParamDbl 0 1 Inf 0.01
## 2: keep_model ParamLgl NA NA 2 FALSE
## 3: maxcompete ParamInt 0 Inf Inf 4
## 4: maxdepth ParamInt 1 30 30 30
## 5: maxsurrogate ParamInt 0 Inf Inf 5
## 6: minbucket ParamInt 1 Inf Inf <NoDefault[3]>
## 7: minsplit ParamInt 1 Inf Inf 20
## 8: surrogatestyle ParamInt 0 1 2 0
## 9: usesurrogate ParamInt 0 2 3 2
## 10: xval ParamInt 0 Inf Inf 10 0
param_set has its own print method that creates this layout. The model has been fitted using default parameters, for instance the complexity parameter, cp, was set to 0.01. Let’s change cp.
myLearner$param_set$cp <- 0.001
No good! I get the error “cannot add bindings to a locked environment”, or in plain English, I cannot change it directly. The logic is that cp has to be a number between 0 and 1; if I could change it directly, I might set it to -5 or 52 or “rabbit”.
param_set is itself an object so we can look to see what it contains
names(myLearner$param_set)
## [1] ".__enclos_env__" "has_deps" "values" "has_trafo"
## [5] "trafo" "all_categorical" "all_numeric" "is_categ"
## [9] "is_number" "storage_type" "tags" "default"
## [13] "special_vals" "is_bounded" "nlevels" "levels"
## [17] "upper" "lower" "class" "is_empty"
## [21] "length" "set_id" "deps" "params_unid"
## [25] "params" "assert_values" "clone" "print"
## [29] "format" "add_dep" "assert_dt" "test_dt"
## [33] "check_dt" "assert" "test" "check"
## [37] "search_space" "subset" "get_values" "ids"
## [41] "add" "initialize"
There is a field called values, which contains parameter values set by the user. Remembering that everything is stored in lists, the code that we need is
myLearner$param_set$values <- list(cp = 0.001)
myLearner$param_set$print()
## <ParamSet>
## id class lower upper nlevels default value
## 1: cp ParamDbl 0 1 Inf 0.01 0.001
## 2: keep_model ParamLgl NA NA 2 FALSE
## 3: maxcompete ParamInt 0 Inf Inf 4
## 4: maxdepth ParamInt 1 30 30 30
## 5: maxsurrogate ParamInt 0 Inf Inf 5
## 6: minbucket ParamInt 1 Inf Inf <NoDefault[3]>
## 7: minsplit ParamInt 1 Inf Inf 20
## 8: surrogatestyle ParamInt 0 1 2 0
## 9: usesurrogate ParamInt 0 2 3 2
## 10: xval ParamInt 0 Inf Inf 10
myLearner$param_set$print() calls the built-in print method directly, but I would have got the same output had I called it indirectly with myLearner$param_set or via the R print function, print(myLearner$param_set).
Unfortunately, the developers have chosen to print the contents of the field values in a column labelled value, which is slightly confusing. Notice too that assigning a new list to values replaced its previous contents, so the xval=0 that was set when the learner was created has been dropped and xval reverts to its default of 10; that is why xerror columns appear in the cptable when the model is retrained below.
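Incidentally, hyperparameters can also be supplied when the learner is created; as far as I know, lrn() passes named values straight into param_set$values. I use a separate instance here so as not to disturb myLearner.
# --- set cp at creation time (altLearner is only for illustration) ---------
altLearner <- lrn("classif.rpart", cp = 0.001)
altLearner$param_set$values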
What has happened to the model?
myLearner$model$cptable %>%
as_tibble()
## # A tibble: 3 x 3
## CP nsplit `rel error`
## <dbl> <dbl> <dbl>
## 1 0.0974 0 1
## 2 0.0189 1 0.903
## 3 0.01 3 0.865
Nothing; we still have the model as it was when it was fitted. I can discover the value of cp that was used for the current model
myLearner$model$control$cp
## [1] 0.01
but we do need to be careful as this value is different from
myLearner$param_set$values$cp
## [1] 0.001
This is somewhat dangerous. Anyway, I can now retrain the model with my specified complexity parameter.
myLearner$train(task = myTask)
myLearner$model$cptable %>%
as_tibble()
## # A tibble: 10 x 5
## CP nsplit `rel error` xerror xstd
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0974 0 1 1 0.00538
## 2 0.0189 1 0.903 0.903 0.00527
## 3 0.00658 3 0.865 0.867 0.00522
## 4 0.00364 4 0.858 0.856 0.00520
## 5 0.00249 7 0.847 0.850 0.00519
## 6 0.00221 14 0.828 0.833 0.00516
## 7 0.00189 18 0.819 0.826 0.00515
## 8 0.00132 19 0.817 0.824 0.00515
## 9 0.00109 20 0.816 0.823 0.00515
## 10 0.001 22 0.814 0.822 0.00515
The old model is over-written and lost.
Making predictions
I might want to make some predictions based on our model. I’ll start with in-sample predictions for the training data, using the method predict
myPredictions <- myLearner$predict(task = myTask)
myPredictions$print()
## <PredictionClassif> for 54408 observations:
## row_ids truth response
## 1 adoption adoption
## 2 no outcome no outcome
## 3 transfer adoption
## ---
## 54406 transfer transfer
## 54407 adoption adoption
## 54408 transfer transfer
What have I created? myPredictions is an object of class PredictionClassif, so it too will have its own methods
names(myPredictions)
## [1] ".__enclos_env__" "confusion" "prob" "response"
## [5] "missing" "truth" "row_ids" "man"
## [9] "predict_types" "task_properties" "task_type" "data"
## [13] "set_threshold" "initialize" "clone" "score"
## [17] "help" "print" "format"
There is a field called confusion, which contains the confusion matrix
myPredictions$confusion
## truth
## response adoption no outcome transfer
## adoption 29392 1769 10403
## no outcome 167 2226 407
## transfer 3716 740 5588
What is the class of this object?
class(myPredictions$confusion)
## [1] "table"
It is just a simple R table, so we can do anything with it that we can do with any other table in R
myTab <- myPredictions$confusion
prop.table(myTab)
## truth
## response adoption no outcome transfer
## adoption 0.540214674 0.032513601 0.191203499
## no outcome 0.003069402 0.040913101 0.007480518
## transfer 0.068298780 0.013600941 0.102705484
Selecting a Measure
We can see from the table of proportions that about 68% were correctly classified and 32% were misclassified. Rather than calculate this for ourselves, mlr3 offers a range of performance measures. They, of course, are stored in a dictionary, so we can list them
mlr_measures
## <DictionaryMeasure> with 84 stored values
## Keys: aic, bic, classif.acc, classif.auc, classif.bacc, classif.bbrier,
## classif.ce, classif.costs, classif.dor, classif.fbeta, classif.fdr,
## classif.fn, classif.fnr, classif.fomr, classif.fp, classif.fpr,
## classif.logloss, classif.mbrier, classif.mcc, classif.npv,
## classif.ppv, classif.prauc, classif.precision, classif.recall,
## classif.sensitivity, classif.specificity, classif.tn, classif.tnr,
## classif.tp, classif.tpr, clust.ch, clust.db, clust.dunn,
## clust.silhouette, debug, dens.logloss, oob_error, regr.bias,
## regr.ktau, regr.mae, regr.mape, regr.maxae, regr.medae, regr.medse,
## regr.mse, regr.msle, regr.pbias, regr.rae, regr.rmse, regr.rmsle,
## regr.rrse, regr.rse, regr.rsq, regr.sae, regr.smape, regr.srho,
## regr.sse, selected_features, surv.brier, surv.calib_alpha,
## surv.calib_beta, surv.chambless_auc, surv.cindex, surv.dcalib,
## surv.graf, surv.hung_auc, surv.intlogloss, surv.logloss, surv.mae,
## surv.mse, surv.nagelk_r2, surv.oquigley_r2, surv.rmse, surv.schmid,
## surv.song_auc, surv.song_tnr, surv.song_tpr, surv.uno_auc,
## surv.uno_tnr, surv.uno_tpr, surv.xu_r2, time_both, time_predict,
## time_train
Let’s try classification accuracy. We can either use get to extract it from the dictionary or use the helper function msr()
myMeasure <- mlr_measures$get("classif.acc")
myMeasure <- msr("classif.acc")
myMeasure
## <MeasureClassifSimple:classif.acc>
## * Packages: mlr3measures
## * Range: [0, 1]
## * Minimize: FALSE
## * Parameters: list()
## * Properties: -
## * Predict type: response
To use this measure, I note that myPredictions has a method called score that does the calculation
myPredictions$score(myMeasure)
## classif.acc
## 0.6838333
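Before moving on, it is worth noting that train() and predict() both accept a row_ids argument, so a quick out-of-sample check can be made by training on a random 80% of the rows and scoring on the rest. I clone the learner so that the model stored in myLearner is left untouched; the resampling tools described next do this sort of thing properly.
# --- a rough holdout check using row_ids ------------------------------------
set.seed(4569)
trainRows <- sample(myTask$nrow, size = round(0.8 * myTask$nrow))
testRows  <- setdiff(myTask$row_ids, trainRows)
splitLearner <- myLearner$clone()
splitLearner$train(myTask, row_ids = trainRows)
splitLearner$predict(myTask, row_ids = testRows)$score(myMeasure)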
Resampling
The measure confirms that 68% are correctly classified, but this is probably an optimistic estimate of performance because of overfitting. I’ll create a cross-validated estimate.
There are many ways to resample in mlr3. Let’s look at the dictionary
mlr_resamplings
## <DictionaryResampling> with 9 stored values
## Keys: bootstrap, custom, custom_cv, cv, holdout, insample, loo,
## repeated_cv, subsampling
I will use a basic cross-validation, cv. I create a resampling object, either with get or with the helper function rsmp()
myCV <- mlr_resamplings$get("cv")
myCV <- rsmp("cv")
myCV
## <ResamplingCV> with 10 iterations
## * Instantiated: FALSE
## * Parameters: folds=10
I get 10 folds, which is the default, but I only want 5 folds. I can set the folds with rsmp()
myCV <- rsmp("cv", folds=5)
myCV
## <ResamplingCV> with 5 iterations
## * Instantiated: FALSE
## * Parameters: folds=5
At present, the resampling object is not associated with any data so it cannot actually create the folds. The method for creating real folds has the ugly name instantiate
myCV$instantiate(task = myTask)
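Once instantiated, the object knows which rows belong to each fold; for example, here are the row ids in the test set of the first fold.
head(myCV$test_set(1))
length(myCV$test_set(1))   # roughly a fifth of the 54408 rows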
I can run the cross-validation with a helper function called resample()
rsFit <- resample( task = myTask,
learner = myLearner,
resampling = myCV)
## INFO [15:32:44.164] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:44.638] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:45.017] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:45.391] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:45.763] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
What is the cross-validated accuracy? Here are the 5 cross-validated values
rsFit$score(myMeasure)
## task task_id learner learner_id
## 1: <TaskClassif[47]> animal_adoption <LearnerClassifRpart[36]> classif.rpart
## 2: <TaskClassif[47]> animal_adoption <LearnerClassifRpart[36]> classif.rpart
## 3: <TaskClassif[47]> animal_adoption <LearnerClassifRpart[36]> classif.rpart
## 4: <TaskClassif[47]> animal_adoption <LearnerClassifRpart[36]> classif.rpart
## 5: <TaskClassif[47]> animal_adoption <LearnerClassifRpart[36]> classif.rpart
## resampling resampling_id iteration prediction
## 1: <ResamplingCV[19]> cv 1 <PredictionClassif[19]>
## 2: <ResamplingCV[19]> cv 2 <PredictionClassif[19]>
## 3: <ResamplingCV[19]> cv 3 <PredictionClassif[19]>
## 4: <ResamplingCV[19]> cv 4 <PredictionClassif[19]>
## 5: <ResamplingCV[19]> cv 5 <PredictionClassif[19]>
## classif.acc
## 1: 0.6775409
## 2: 0.6820437
## 3: 0.6827789
## 4: 0.6846797
## 5: 0.6776032
and here is the aggregated measure (mean)
rsFit$aggregate(myMeasure)
## classif.acc
## 0.6809293
Not as much overfitting as I was expecting.
Tuning the hyperparameters
Let’s try tuning the model’s hyperparameters. I will try a random search algorithm with cp between 0.001 and 0.1 and minsplit between 1 and 10.
I will be trying 10 random sets of parameters, each with 5-fold cross-validation, so to speed things up the calculations will be run in multiple R sessions. I’ll use the logloss as my performance measure; this requires that the learner predicts probabilities rather than a response category.
The tuning is set up using to_tune() and executed using tune()
# --- use the future package to create the sessions ---------------------
future::plan("multisession")
# --- set the hyperparameters to be tuned -------------------------------
myLearner$param_set$values$cp = to_tune(0.001, 0.1)
myLearner$param_set$values$minsplit = to_tune(1, 10)
myLearner$predict_type = "prob"
# --- choose the performance measure ------------------------------------
myMeasure <- mlr_measures$get("classif.logloss")
# --- pick 10 random combinations ---------------------------------------
set.seed(9830)
myTuner <- tune(
method = "random_search",
task = myTask,
learner = myLearner,
resampling = myCV,
measure = myMeasure,
term_evals = 10,
batch_size = 5
)
## INFO [15:32:48.672] [bbotk] Starting to optimize 2 parameter(s) with '<OptimizerRandomSearch>' and '<TerminatorEvals> [n_evals=10]'
## INFO [15:32:48.727] [bbotk] Evaluating 5 configuration(s)
## INFO [15:32:48.825] [mlr3] Running benchmark with 25 resampling iterations
## INFO [15:32:49.465] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:49.859] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:50.186] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:50.386] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:50.774] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:51.093] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:50.767] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:51.211] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:51.405] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:51.139] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:51.492] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:51.777] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:52.005] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:51.721] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:52.146] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:52.448] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:52.090] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:52.508] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:52.698] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:52.530] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:52.813] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:53.016] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:53.145] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:53.413] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:53.679] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:53.929] [mlr3] Finished benchmark
## INFO [15:32:54.351] [bbotk] Result of batch 1:
## INFO [15:32:54.353] [bbotk] cp minsplit classif.logloss uhash
## INFO [15:32:54.353] [bbotk] 0.083748776 5 0.7864746 61e4ef11-5e35-4f72-a30b-932356917cb3
## INFO [15:32:54.353] [bbotk] 0.011339234 5 0.7494300 957ecf8d-5617-4c0c-ab99-a50f432fc4fe
## INFO [15:32:54.353] [bbotk] 0.009854275 7 0.7494300 8a6efce0-c6e0-44de-b64e-29e3a8930f82
## INFO [15:32:54.353] [bbotk] 0.037606757 2 0.7864746 5b04f5ef-9906-440f-9562-97a65f5cc363
## INFO [15:32:54.353] [bbotk] 0.066829580 10 0.7864746 86da6032-a1e6-4e98-bc88-aba44feb336f
## INFO [15:32:54.356] [bbotk] Evaluating 5 configuration(s)
## INFO [15:32:54.414] [mlr3] Running benchmark with 25 resampling iterations
## INFO [15:32:54.435] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:55.008] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:55.592] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:54.450] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:55.024] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:55.609] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:54.467] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:55.530] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:56.092] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:54.489] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:55.328] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:55.957] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:56.404] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:54.512] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 4/5)
## INFO [15:32:55.302] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:56.332] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:54.538] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:55.188] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:56.265] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:54.566] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:55.318] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 3/5)
## INFO [15:32:56.064] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 2/5)
## INFO [15:32:54.604] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:55.686] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 5/5)
## INFO [15:32:56.225] [mlr3] Applying learner 'classif.rpart' on task 'animal_adoption' (iter 1/5)
## INFO [15:32:56.831] [mlr3] Finished benchmark
## INFO [15:32:57.107] [bbotk] Result of batch 2:
## INFO [15:32:57.112] [bbotk] cp minsplit classif.logloss uhash
## INFO [15:32:57.112] [bbotk] 0.056833792 10 0.7864746 7ab661db-b0ef-4d11-a4d5-370feaab4ac6
## INFO [15:32:57.112] [bbotk] 0.027682699 2 0.7864746 ddc39800-e857-4b0c-bd27-06ab47d90f68
## INFO [15:32:57.112] [bbotk] 0.092669535 7 0.7864746 c405cb41-c165-4411-a43a-da5b04f278ee
## INFO [15:32:57.112] [bbotk] 0.069294124 7 0.7864746 a94dc5c1-7364-4079-a464-8fc8df224956
## INFO [15:32:57.112] [bbotk] 0.003608916 3 0.7482130 5b335d08-b42f-48a1-8d4f-8f43513c65fe
## INFO [15:32:57.124] [bbotk] Finished optimizing after 10 evaluation(s)
## INFO [15:32:57.124] [bbotk] Result:
## INFO [15:32:57.126] [bbotk] cp minsplit learner_param_vals x_domain classif.logloss
## INFO [15:32:57.126] [bbotk] 0.003608916 3 <list[2]> <list[2]> 0.748213
myTuner
## <TuningInstanceSingleCrit>
## * State: Optimized
## * Objective: <ObjectiveTuning:classif.rpart_on_animal_adoption>
## * Search Space:
## <ParamSet>
## id class lower upper nlevels default value
## 1: cp ParamDbl 0.001 0.1 Inf <NoDefault[3]>
## 2: minsplit ParamInt 1.000 10.0 10 <NoDefault[3]>
## * Terminator: <TerminatorEvals>
## * Terminated: TRUE
## * Result:
## cp minsplit learner_param_vals x_domain classif.logloss
## 1: 0.003608916 3 <list[2]> <list[2]> 0.748213
## * Archive:
## <ArchiveTuning>
## cp minsplit classif.logloss timestamp batch_nr
## 1: 0.0837 5 0.79 2021-11-02 15:32:54 1
## 2: 0.0113 5 0.75 2021-11-02 15:32:54 1
## 3: 0.0099 7 0.75 2021-11-02 15:32:54 1
## 4: 0.0376 2 0.79 2021-11-02 15:32:54 1
## 5: 0.0668 10 0.79 2021-11-02 15:32:54 1
## 6: 0.0568 10 0.79 2021-11-02 15:32:57 2
## 7: 0.0277 2 0.79 2021-11-02 15:32:57 2
## 8: 0.0927 7 0.79 2021-11-02 15:32:57 2
## 9: 0.0693 7 0.79 2021-11-02 15:32:57 2
## 10: 0.0036 3 0.75 2021-11-02 15:32:57 2
So the best cross-validated logloss corresponds to cp=0.0036 and minsplit=3. Notice that there are several other combinations that give an almost identical performance.
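To make use of the result, the winning values can be copied back into the learner and the model refitted on the full data. A sketch, assuming that result_learner_param_vals is the field of the tuning instance that stores the winning parameter values.
# --- refit with the tuned hyperparameters -----------------------------------
myLearner$param_set$values <- myTuner$result_learner_param_vals
myLearner$train(myTask)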
Pipelines
The final step is to create a complete pipeline including both preprocessing and model fitting. To this end, mlr3 offers a large range of pipe_ops. As usual their names are stored in a dictionary.
mlr_pipeops
## <DictionaryPipeOp> with 74 stored values
## Keys: boxcox, branch, chunk, classbalancing, classifavg, classweights,
## colapply, collapsefactors, colroles, compose_crank, compose_distr,
## compose_probregr, copy, crankcompose, datefeatures, distrcompose,
## encode, encodeimpact, encodelmer, featureunion, filter, fixfactors,
## histbin, ica, imputeconstant, imputehist, imputelearner, imputemean,
## imputemedian, imputemode, imputeoor, imputesample, kernelpca,
## learner, learner_cv, missind, modelmatrix, multiplicityexply,
## multiplicityimply, mutate, nmf, nop, ovrsplit, ovrunite, pca, proxy,
## quantilebin, randomprojection, randomresponse, regravg,
## removeconstants, renamecolumns, replicate, scale, scalemaxabs,
## scalerange, select, smote, spatialsign, subsample, survavg,
## targetinvert, targetmutate, targettrafoscalerange, textvectorizer,
## threshold, trafopred_regrsurv, trafopred_survregr,
## trafotask_regrsurv, trafotask_survregr, tunethreshold, unbranch,
## vtreat, yeojohnson
Some, such as learner, have already been used, but there are others such as pca, scale, imputemedian and removeconstants that can be used in preprocessing. The idea is to string these together into a complete pipeline.
To create a single pipe_op ready to go into the pipeline, there is a helper function po() that I can use in place of mlr_pipeops$get.
Once the individual pipe_ops have been defined, they are combined in an ordered chain using mlr3’s own pipe operator, %>>%. The resulting pipeline can be saved and even plotted as a graph with the individual pipe_ops as the nodes and edges that show the flow of data between them.
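As a hedged sketch of the idea, here is a small pipeline that imputes medians, scales the numeric features and then fits the tree; neither preprocessing step makes much difference to this particular tree, but they show how the chaining works. GraphLearner$new() wraps the graph so that it can be trained, resampled and tuned like any other learner.
# --- a small illustrative pipeline ------------------------------------------
myGraph <- po("imputemedian") %>>%
  po("scale") %>>%
  po("learner", lrn("classif.rpart"))
myGraph$plot()                            # view the pipeline as a graph
myGraphLearner <- GraphLearner$new(myGraph)
myGraphLearner$train(myTask)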
The big question is: why? What advantage is there in using pipe operators over running the preprocessing with dplyr?
If the preprocessing is completed before the analysis, then there is no real advantage; indeed dplyr is more flexible and would probably be a simpler option. The advantage comes when the preprocessing itself needs to be tuned. Suppose, for example, that the preprocessing includes variable selection. Should I use the top 10 features or the top 15 features in the decision tree? By building the variable selection step into the pipeline, it is possible to tune the number of features alongside the other hyperparameters.
A secondary advantage is that the pipe_ops will automatically save the state of the operation. Imagine that I were to run median imputation on the training data. The pipe_op would save those medians. Later, I might want to impute on the test data and those same medians can be recalled and used again.
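With the graph learner from the sketch above, those saved states can be inspected after training. I am assuming here that the trained graph is exposed as graph_model; the exact structure of a state varies between pipe_ops.
# --- the state saved by the median imputer ----------------------------------
myGraphLearner$graph_model$pipeops$imputemedian$state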
Extensions
An important feature of mlr3 is that users are able to add their own extensions, perhaps a learner, a measure, a tuner, or a pipe_op.
This can be done by writing a new R6 class that inherits from the corresponding mlr3 class.
To help create a new learner, there is a helper function called create_learner() that does most of the work for you. Once it is done, you can add the learner to the mlr_learners dictionary and use it just like any other learner.
Is mlr3 better than tidymodels?
Of course, the answer is, it depends. For me, mlr3’s main advantages over tidymodels are
- what it does is transparent
- the user remains in control
- all intermediate results can be accessed
- it is extendible
- there is a clear and coherent design
The main disadvantage compared with tidymodels is
- it is much more difficult to use
You need to have a grasp of OOP and R6 before you can make sensible use of mlr3.
In my opinion, mlr3 is better than tidymodels, but I accept that the learning curve will be too steep for many people.
For me, a more important choice is between mlr3 and R code written specifically for a given analysis. In their videos and posts, the authors of mlr3 make much of pipelines and the flexibility with which pipe_ops can be combined. I have yet to be convinced that I would make much use of this feature. I suspect that for the vast majority of my work, I would find it easier to run the preprocessing in dplyr, or in purpose-written R code.