Sliced 2021 Episode Overview

Publish date: 2021-09-07

Tags: Sliced, Sliced episode overview

Introduction

Sliced is a data science competition hosted by Nick Wan and Meg Risdal on twitch ( https://www.twitch.tv/nickwan_datasci ). In each episode, 4 data scientists with very different backgrounds are given the same dataset and in 2 hours they must explore the data and build a predictive model. As well as the training data, they are given a test set that lacks the variable to be predicted. The model that best predicts in the test data according to a specified metric, wins the prediction element of the competition, but other points are available for data visualization, golden features and popularity as voted for by chat.

The datasets are all available on kaggle and while the competition is live on twitch, members of the audience can run their own analyses and submit them so that they are assessed alongside the models of the competitors. Afterwards, anyone can download the data and submit a late entry, but it will not be added to the leaderboard.

I have taken each of the datasets used by sliced and run my own analyses, which I present in a series of blog posts. I think that the main interest in my analyses lies in the fact that I was trained as a statistician and not as a data scientist, so my analyses highlight some interesting differences of approach between the two traditions.

I must stress that I did not analyse these data under competition conditions. In most cases, I spend less than 2 hours on the analysis, but much more time writing the blog posts that explain what I did.

In this post, I give an overview of each episode and summarise the methods that I used. If you are looking for examples of a particular form of analysis, this will show you which posts to go to.

My approach

I analyse all of the datasets in R, but prefer not to use tidymodels. The tidymodels package forces you into a machine learning mindset and it is my belief that this approach has several important weaknesses that lead to inferior results. Exposing those weaknesses is one of my main reasons for writing this blog.

I do make heavy use of the tidyverse because it produces quick and readable code. I supplement this with my own functions and a smattering of base R. For a couple of the later episodes I have used mlr3 a machine learning competitor of tidymodels that I have been interested in for some time, but not used before.

In my opinion statistics and machine learning are closely related and the key differences are not in the models that they use, but in the ways in which they use them. So, I am quite happy to use tree-based model including xgboost. What I find unsatisfactory is when tree-based model are used in an automated pipeline. I’ll discuss points such as these at the end of each post.

Here are some differences of approach between my analyses and the machine learning analyses favoured by most of the competitors

an emphasis on understanding what the data mean and how they were collected
data cleaning as a separate step carried out prior to modelling
a liking for moving from the simple to the complex
a preference for simple models
a preference for models that can be interpreted
a preference for models that would generalise
a hatred of black box methods
a hatred of long pipelines in which the analyst does not look at the intermediate results
emphasis on model checking and model interpretation
scepticism about obsessional hyperparameter tuning
scepticism about meaningless improvement is prediction accuracy (not a good attitude to have in a data science competition)

For each episode, I have tried to find a different approach so that the posts are not repetitive. Occasionally, the search for variety is at the expense of predictive performance, but on the whole my models are competitive with the submission made live during the competition and some of them would have won. Given that I was not under time pressure this is perhaps not surprising.

The episodes

Episode 1: Boardgames

Keywords

boardgame rating; transformed response; text features; user written functions; replacing impossible values; generalized additive models; splines; interactions;

Packages

tidyverse; broom; mgcv

The data

The data can be download from https://www.kaggle.com/c/sliced-s01e01.

The problem

The data were taken from the website https://boardgamegeek.com/. The training set has full details on 3499 boardgames. The object is to predict the geek rating for each game. Some predictors are numeric, such as number of players, time to play the game and the year that the game was released but others are textual, such as a description of the mechanics of the game.

The test data has the same predictor variables on 1500 games, but has the ratings for those games removed. Predictive models are assessed using the root mean square error (RMSE).

My Analysis

My exploratory analysis led me to work with a transformed response, log10(geek_rating-5.5). The models predicted this transformed response and then I back-transformed to get the predicted geek rating.

Data cleaning was minimal because the data are reasonably complete, with just a few missing values and a handful of extreme or unlikely looking values.

My analysis of the textual data is fairly basic. I extract 10 features relating to the game mechanics and 10 features related to the category of the game. To do this I wrote simple functions that are included as a appendix to the post.

There is quite a degree of non-linearity in the relationships between geek rating and the predictors, so I opted for splines in a GAM (generalized additive model). In R, GAMs can be fitted with the gam package, but I prefer a package called mgcv because it incorporates methods for estimating the degree of smoothness that is required. It also allows 2-dimensional splines that represent interactions.

My final model did well and have come close to the top of the leaderboard.

In the post, I try to show that predictions from linear models can be competitive (you do not always need to used XGBoost) and at the same time, the models are interpretable. I also emphasise that it is not necessary, or even advisable, to make all of your modelling decisions based on minimising the loss function.

Episode 2: Wildlife strikes

Keywords

wildlife strikes; data cleaning; missing data; imputation; categorising text features; user-written ggplot function; curly-curly {{}}; nonstandard evaluation; logistic regression; analysis of deviance; Hosmer-Leweshow plot;

Packages

tidyverse; broom; forcats; lubridate;

The data

The data can be download from https://www.kaggle.com/c/sliced-s01e02-xunyc5.

The problem

These data were collect by the Federal Aviation Authority (FAA) in the USA and contain details of wildlife strikes with aircraft. Everything from a commercial jet running into a flock of sparrows to a private plane hitting an elk. The objective is to predict whether or not the aircraft was damaged in the collision. The metric used for evaluation was the mean logloss.

As you might expect, these data are far from perfect; lots of missing information and factors with hundreds of levels.

My Analysis

This is primarily an exercise in data cleaning. I have a great deal of sympathy with the competitors who would have been under real pressure to cut short the cleaning and jump to the model fitting. It was a tough dataset to use in this type of competition.

Most of my code performs data cleaning and data exploration. Once the data were in a good form for analysis, I used a simple logistic regression and got perfectly good results.

For the data exploration, I wrote a function that creates a stacked bar chart with added annotation to give the percent of aircraft damaged. The code uses curly-curly {{}} to provide nonstandard evaluation (NSE), which allows the function to be incorporated into a pipe.

I decided against imputation of the missing data, because the exploration made it clear that the data were not missing at random. Instead, I added a missing category to each of the categorical variables and used missingness as a predictive feature.

I used an analysis of deviance table to look at contributions to the logistic regression model and plots based on the Hosmer-Leweshow statistic to assess the fit. I calculated a cross-validation estimate of model performance, but noted that, in this type of example, the in-sample estimate is perfectly adequate.

Episode 3: Superstore profits

Keywords

superstore profits; knowledge external to the data; linear models; offsets; weaknesses of machine learning;

Packages

tidyverse; broom;

The data

The data for this episode are available from https://www.kaggle.com/c/sliced-s01e03-DcSXes.

The problem

The dataset contains information on products sold by an on-line store in the USA and the objective is to predict the profit made on each item. Evaluation is by RMSE.

Compared with the previous episode, these data are a pleasure to analyse. There is very little processing required before we start modelling and the structure is simple.

The most important aspect of this problem is that we have data on profits made on individual items but the items are only categorised into very broad groups, such as tables or copiers. Obviously, some tables cost more than others and more importantly, some are sold at a discount. We are told the price of the individual items and the size of any discount that was applied.

My Analysis

There is a class of modelling problems were we know something about the relationship between the variables. The classic example is a physics experiment, where we know that all of the variables must obey the natural laws of physics. In such cases, it is vital that the model takes these known laws into account, otherwise it is unlikely to perform well. Analysts who jump in with their favourite algorithm are likely to do poorly on this type of problem.

In the case of the superstore, we know the relationship between sales price, discount and profit. If you use this known relationship then

it tells us what structure the model should have
it makes it obvious that most of the predictors can be ignored
it produces extremely good results

My model beats the submissions on the leaderboard by a wide margin, so I can only assume that those competitors ignored the specific structure of the problem.

The code used for this problem is very basic. All that you needed to win the competition were a few scatter plots and some simple linear models.

Episode 4: Rain tomorrow

Keywords

weather prediction; Bayesian estimates of a proportion; multiple imputation by chained equations (mice); logistic regression; combining many models; list columns; map() functions; user-written functions;

Packages

tidyverse; broom; mice; purrr;

The data

The data for this episode can be downloaded from https://www.kaggle.com/c/sliced-s01e04-knyna9.

The problem

The data consist of daily weather records from 49 different locations in Australia. The objective is to predict whether or not it rained on the following day and the metric was logloss.

Like episode 2 this is a large complex dataset that would be quite challenging under competition conditions. Australia is huge and the weather patterns vary enormously by location.

When the organisers set up this episode they must have taken weather records going back many years and randomly picked the days for each location so that a random sample went into the training set and another random sample went into the test set. However, they seem to have drawn the two sample independently because sometimes it happens that the day after for a test day is one of the days given in the training data. This means that for about a quarter of the test data be are actually told the thing that we are meant to predict.

One of the rules of the competition is ‘no exploits’. So I have dutifully ignored the known answer when making my predictions, but it was tempting.

Not only is this dataset large and location-specific but it contains a fair amount of missing data. Weather variables, such as humidity, pressure, wind speed, temperature and rainfall are all closely interrelated so it is important to allow for this when imputing the missing data. Like episode 2 the problem is really too difficult for a 2 hour competition.

My Analysis

I considered two ways of predicting the probability of rain

take long-term averages for that time of year
use today’s conditions to predict rain tomorrow

I started with the approach of long-term averages and used a Bayesian estimate of the proportion of rainy days in a given location during a given month that avoids the problem of a zero probability prediction.

Data exploration shows that there is considerable extra information in the daily weather pattern. Because of the amount of missing data, today’s conditions cannot be used without imputation.

I used multiple imputation based on chained equations as implemented by the mice package.

The data exploration shows that the impact of today’s conditions varies depending on location. This suggests that we will need a model with interactions between location and the other predictors. As an alternative to using interactions, I decided to fit separate models to each location. This is implemented in a tibble with list columns using the map functions from purrr.

Predictions are made separately for each imputed dataset and then averaged. To save time, I omitted some of the potentially predictive variables and this adversely affected performance. The model is competitive but definitely could be improved.

Episode 5: Airbnb prices

Keywords

Airbnb price per night prediction; transformed response; merging small areas; extraction of keywords from free text; user-written functions; boosted models; xgboost; model overfitting; hyperparameter tuning;

Packages

tidyverse; xgboost;

The data

The data for this episode can be downloaded from https://www.kaggle.com/c/sliced-s01e05-WXx7h8/overview

The problem

The dataset contains details of properties in New York City that were advertised on Airbnb. The objective is to predict the rental price of the property given details of the location, type of property, host and property reviews. There is also a short description of each property in free text.

My Analysis

I started with a simple exploratory analysis of the data and once I had a good idea of what features were likely to be important, I started data cleaning and feature extraction. I felt that location would have a strong influence on price and decided to use the 300+ local neighbourhoods as my measure of location. Some neighbourhoods only contain a few properties and I felt that this might lead to unstable price estimates, so I developed an algorithm for merging geographically close neighbourhoods until all had at least 25 properties. I used code similar to that which I developed for episode 1 in order to extract keywords from the free text.

I decided to use this example to illustrate the use of XGBoost, perhaps the most popular algorithm in machine learning. I wanted to get close to the algorithm in order to start to understand how it works, so I used the package directly and not through one of the machine learning packages, such as caret or tidymodels or mlr3.

My first attempt at an xgboost model worked well and would have come second on the Sliced leaderboard.

I started the process of looking at hyperparameter tuning (it is too big a topic for one post) by investigating the effects of the number of iterations (rounds) and the learning rate. Selecting a sensible number of iterations of the boosting algorithm is key to good performance, but the learning rate is much less critical.

I look at the way that estimates of the hyperparameters are themselves subject to error and consider how the data do not provide the precision needed to make choices between apparently very different values of the parameters. I use these data to demonstrate how, when it is misused, hyperparameter tuning can reduce to following meaningless random fluctuation in the estimates.

After the analysis, I have a small rant about bad practice in machine learning, in particular, I object to the misuse of hyperparameter tuning.

Episode 6: Ranking games on twitch

Keywords

predicting ranks;

Packages

tidyverse;

The data

The data for this episode can be downloaded from https://www.kaggle.com/c/sliced-s01e06-2ld97c

The problem

The competitors were given historical data on games streamed on twitch and were required to predict the exact ranks of the top 200 games in May 2021, where rank depends on the number of person-hours spent watching.

My Analysis

The description of the data provided on the kaggle webpage includes a rather vague definition of the something called the average viewer ratio. It says that it is the ratio of hours watched to hours streamed. However, we know the hours streamed and we want to predict the hours watched because that is the basis of the ranks!

Sure enough if you multiple the ratio by the hours streamed you get the hours watched exactly!! There is no prediction error, we can just calculate the answer in a couple of lines of R code.

If you look at the leaderboard you will see that three people scored a perfect 1.0, so presumably they must have either cheated by downloading the data from twitch, or they noticed the relationship for themselves or they stumbled on a model, such as a linear model on a log scale, that predicts the data exactly. The best of the rest were a long way behind with a score around 0.3.

If you read the data description carefully, you can solve the problem in about 5 minutes.

Episode 7: Customer churn

Keywords

predicting bank churn; distance-based exploratory analysis; multi-dimensional scaling (MDS); self-organising maps; xgboost; tree depth

Packages

tidyverse; gridExtra; class; xgboost

The data

The data for this episode can be downloaded from https://www.kaggle.com/c/sliced-s01e07-HmPsw2

The problem

The data describe the characteristics of bank customers such as their credit limit, their balance and the number of transactions that they made in a fixed (but unstated) time. From these features we have to predict whether or not the customers will churn (close their account and move to another bank). Model evaluation is based on logloss.

My Analysis

This is a rather routine problem, so to create some interest I decided to try multivariate data exploration using distance-based methods that highlight clusters of customers with similar characteristics. These analyses support by suspicion that customers churn for different reasons and that the model will need to identify groups of customers with different characteristics but the same outcome.

xgboost is a natural choice of algorithm for modelling these data. Boosting will create a range of trees such that some will be good at capturing one cluster while others will identify another. Tree-based boosting works well with these data and creates a model with prediction accuracy similar to the models near the top of the leaderboard.

Episode 8: Spotify popularity

Keywords

hexagonal scatter plots; merging files; feature extraction from text; random forest models

Packages

tidyverse; randomForest;

The data

The data for this episode can be downloaded from www.kaggle.com/c/sliced-s01e08-KJSEks

The problem

The data describe 21,000 tracks available through Spotify and include measures such as speechiness and tempo as well as data on the artists. Spotify gives each track a numerical popularity score between 0 and 100 and the objective is to predict that popularity from the track and artist data. Model evaluation is by RMSE.

My Analysis

I merged the track and artist data to create a single file of training data. A visualization of the data using hexagonal scatter plots showed that the continuous measures relating to the tracks are quite predictive, but the relationships are not linear. The popularity score over the training data is strongly bimodal. Together these feature suggest that standard regression models would not perform well.

I used standard text handling methods to identify musical genres that are predictive of popularity and ended with a large number (over 150) of indicator variables.

I decided to use these data to investigate the use of random forests, however the large number of potential features combined with the large number of tracks meant that random forests were very slow to fit. As a result, detailed modelling was impractical. Despite this, a default random forest model would have come fourth on the leaderboard.

Episode 9: Baseball home runs

Keywords

hexagonal scatter plots; xgboost; cross-validation; hyperparameter selection;

Packages

tidyverse; xgboost;

The data

The data for this episode can be downloaded from https://www.kaggle.com/c/sliced-s01e09-playoffs-1.

The problem

The data describe the 46244 hits made during the playoffs for the 2020 baseball season. The objective is to predict whether or not a hit resulted in a home run based on information about the hit, the pitch and the ball park. Evaluation was by logloss.

My Analysis

I performed a fairly lengthy exploratory analysis looking at the impact of different predictors on the probability of a home run. Based on this exploration I made a subjective selection of what I though would be the important predictors.

I fed my selected predictors into xgboost and use cross-validation to select a suitable number of iterations. With default values for the other hyperparameters the resulting model would have finished second on the leaderboard.

Based on my experience of using xgboost in earlier episodes of Sliced, I subjective choose what I though would be a better set of hyperparameters and I created a model that would have topped the leaderboard.

The top five models on the leaderboard all had similar performance and the difference between my model and anyone of these is meaningless. I argue that one should not take the position on the leaderboard too seriously.

Episode 10:Animal adoption

Keywords

mlr3; xgboost; cross-validation; hyperparameter tuning;

Packages

tidyverse; mlr3; mlr3verse; xgboost;

The data

The data for this episode can be downloaded from https://www.kaggle.com/c/sliced-s01e10-playoffs-2/overview/description.

The problem

The data describe the fates of 54,408 animals abandoned or rescued between 7th Nov 2015 and 1 Feb 2018 in, I think, USA. The objective is to predict whether the animal was adopted, transferred to another facility or put down. Evaluation was by logloss.

My Analysis

I decided to use these data to illustrate the use of the mlr3 ecosystem of packages. Although, tidymodels is well-known and much used, mlr3, which does much the same job, is hardly known at all outside of a small group of enthusiasts. This is a great shame because in many respects mlr3 is better than tidymodels.

In my analysis I analyse cats, dogs and other animals separately since it seems to me that the predictors that are relevant to the adoption of cats will be very different to those that are important for bats or squirrels.

Within mlr3 I use xgboost to model each animal type and I run cross-validations to estimate the likely performance of each model. I also demonstrate how mlr3 can be used to tune the hyperparameters of xgboost, even though I am not convinced that this is a sensible thing to do.

The model produced by my analysis does moderately well in comparison to the model on the leaderboard, but this is not the primary aim of the analysis. More important for me, is to illustrate the use of mlr3 and to make its potential better-known. At the end, I give my opinion on the relative merits of mlr3 and tidymodels. The fact that I have not used tidymodels for any of my analyses will give you a good clue as to which I prefer.

Episode 11: Austin House Prices

Keywords

mlr3; pipelines; pipe operators; tables; random forests; cross-validation; hyperparameter tuning;

Packages

tidyverse; mlr3; mlr3verse; randomForest; janitor; kableExtra; quanteda;

The data

The data for this episode can be downloaded from https://www.kaggle.com/c/sliced-s01e11-semifinals/.

The problem

The data consist of the prices of 10,000 properties from the area around Austin, Texas that were advertised by the on-line estate agent, Zillow. The house prices where categorised into 5 price bands and the objective was to predict the price band from variables that describe the property. Evaluation was by multiclass logloss.

My Analysis

I decided to use these data to continue my demonstration of the mlr3 ecosystem of packages that I started in Episode 10. For this analysis I concentrated on the production of analysis pipelines.

For the exploratory analysis, I decided to use descriptive tables in place of my more usual graphical visualizations. For many problems tables of counts or tables of means are the best way of conveying the data structure. It is my impression that tables are under-used in exploratory analyses.

The property descriptors include a piece of free text produced by the estate agents to summarise the appeal of the property. I used the package quanteda to extract keywords from the free text and then developed am mlr3 pipeline in which the other property descriptors were scaled while a filter was applied to the keywords so that those most significant in a Kruskal-Wallis test were kept. The combined set of scaled variables and filtered variables was then used in a random forest.

The entire pipeline was used in a cross-validation and I simultaneously tuned the number of filtered keywords and the hyperparameters of the random forest.

The analysis is a good example of the use of pipelines but it proved to be poor at predicting house price categories. I discuss why this might be.

Episode 12: Loan Defaults

Keywords

tables; xgboost; two-stage modelling;

Packages

xgboost; kableExtra;

The data

The data for this episode can be downloaded from www.kaggle.com/c/sliced-s01e12-championship/overview.

The problem

The data contain information on 83,656 bank loans made to US businesses. The objective was to predict the amount of any default. Evaluation was by mean absolute error.

My Analysis

I carried out a routine exploratory analysis based on tables and graphs and as a result I decided to create two models; one for the probability of a default and the other for the size of the default in those businesses that defaulted. For the second model, I took the size of the default as a proportion of the original loan as the response. Both models were based on xgboost.

My predictions for the size of the default were based on the second model but with the estimated default replaced by zero when the first model predicted that the business would not default. The approach is similar to that taken by zero-inflated models. I suspected that this approach would work better than directly modelling the size of the default and this hunch proved correct, even though neither of my models used the mean absolute error is its loss function.

One nice spin-off from the analysis is a demonstration of how to insert a blank line in a table to represent rows that you do not want to show.

Modelling with R

contrasting statistical and machine learning approaches

Sliced 2021 Episode Overview

Introduction

My approach

The episodes

Episode 1: Boardgames

Keywords

Packages

The data

The problem

My Analysis

Episode 2: Wildlife strikes

Keywords

Packages

The data

The problem

My Analysis

Episode 3: Superstore profits

Keywords

Packages

The data

The problem

My Analysis

Episode 4: Rain tomorrow

Keywords

Packages

The data

The problem

My Analysis

Episode 5: Airbnb prices

Keywords

Packages

The data

The problem

My Analysis

Episode 6: Ranking games on twitch

Keywords

Packages

The data

The problem

My Analysis

Episode 7: Customer churn

Keywords

Packages

The data

The problem

My Analysis

Episode 8: Spotify popularity

Keywords

Packages

The data

The problem

My Analysis

Episode 9: Baseball home runs

Keywords

Packages

The data

The problem

My Analysis

Episode 10:Animal adoption

Keywords

Packages

The data

The problem

My Analysis

Episode 11: Austin House Prices

Keywords

Packages

The data

The problem

My Analysis

Episode 12: Loan Defaults

Keywords

Packages

The data

The problem

My Analysis