Sliced Episode 6: Ranking Games on Twitch
Summary
Background: In episode 6 of the 2021 series of Sliced, the competitors were given two hours to analyse a set of data on the top 200 games broadcast on Twitch. The aim was to predict their exact rankings.
My approach: The ranking is based on the number of hours of streaming that were watched. Presumably the organisers did not notice that they provided two predictors which, when multiplied, give the number of hours watched. So the ranks can be predicted with 100% accuracy.
Result: I got a perfect score.
Conclusion: Always read the question.
Introduction
The sixth of the Sliced datasets asks the competitors to predict the rank order of the top 200 computer games featured on Twitch, using predictors such as the games' ranks in previous months, the number of people streaming each game, and so on.
The ranking of a game depends on the number of hours that people watch that game being streamed; the more hours the higher the rank. So we have the choice of predicting rank directly, or predicting hours watched and then calculating the rank.
The training data are given monthly from the start of 2016 until April 2021 and we are asked to predict the ranks for May 2021.
Evaluation is by simple accuracy: each exactly correct rank scores 1/200. So if game A is truly ranked 1 out of 200 and game B is ranked 2, you score 2/200 for predicting A=1, B=2; 1/200 for A=1, B=200; and 0/200 for A=2, B=1. Get all 200 ranks correct and you score a perfect 1.
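As a small sketch of the metric (the function name and toy data are my own, not part of the competition code), the score is just the proportion of ranks predicted exactly:

```r
# --- accuracy metric: proportion of exactly correct ranks ------
# predicted and true are vectors of ranks for the same games,
# in the same order
rank_accuracy <- function(predicted, true) {
  mean(predicted == true)
}

# toy example with 4 games: ranks 1 and 2 correct, 3 and 4 swapped
rank_accuracy(c(1, 2, 4, 3), c(1, 2, 3, 4))   # 0.5
```

Note that swapping two adjacent games costs you both of their marks, so near-misses are penalised as heavily as wild guesses.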
Data Exploration
Let’s first inspect the training data. I’ve followed my normal practice of downloading the raw data and saving it in rds files. I have chosen to refer to the training set as trainRawDF.
# --- setup the libraries etc. ---------------------------------
library(tidyverse)
theme_set( theme_light())
# --- the project folder ---------------------------------------
home <- "C:/Projects/kaggle/sliced/s01-e06"
# --- read the training data -----------------------------------
read.csv( file.path(home, "data/rawData/train.csv")) %>%
  as_tibble() %>%
  saveRDS( file.path(home, "data/rData/train.rds"))
trainRawDF <- readRDS(file.path(home, "data/rData/train.rds"))
# --- summarise with skimr -------------------------------------
skimr::skim(trainRawDF)
| Name | trainRawDF |
|---|---|
| Number of rows | 12750 |
| Number of columns | 10 |
| Column type frequency: | |
| character | 1 |
| numeric | 9 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Game | 0 | 1 | 0 | 128 | 1 | 1640 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Rank | 0 | 1 | 100.60 | 57.82 | 1.00 | 50.00 | 101.00 | 151.00 | 200.00 | ▇▇▇▇▇ |
| Month | 0 | 1 | 6.27 | 3.48 | 1.00 | 3.00 | 6.00 | 9.00 | 12.00 | ▇▅▅▅▇ |
| Year | 0 | 1 | 2018.20 | 1.55 | 2016.00 | 2017.00 | 2018.00 | 2020.00 | 2021.00 | ▇▅▅▅▂ |
| Hours_watched | 0 | 1 | 4275710.94 | 15067784.68 | 89811.00 | 332580.50 | 718087.00 | 1975239.50 | 344551979.00 | ▇▁▁▁▁ |
| Hours_Streamed | 0 | 1 | 141869.72 | 524824.72 | 19.00 | 10995.50 | 28237.00 | 79082.50 | 10245704.00 | ▇▁▁▁▁ |
| Peak_viewers | 0 | 1 | 49662.30 | 118284.13 | 441.00 | 7656.75 | 18349.00 | 41708.75 | 3123208.00 | ▇▁▁▁▁ |
| Peak_channels | 0 | 1 | 525.59 | 2543.71 | 1.00 | 47.00 | 109.00 | 286.75 | 129860.00 | ▇▁▁▁▁ |
| Streamers | 0 | 1 | 16073.08 | 53540.57 | 0.00 | 1345.00 | 3767.50 | 9889.00 | 1013029.00 | ▇▁▁▁▁ |
| Avg_viewer_ratio | 0 | 1 | 84.36 | 379.35 | 2.27 | 15.96 | 29.06 | 58.09 | 13601.87 | ▇▁▁▁▁ |
For once I have shown the output from skim(). This is a relatively small dataset with no missing data.
Ask a silly question
Before we launch into data exploration, it pays to look carefully at the definitions of the predictors.
One of the predictors that we are given is
Avg_viewer_ratio. The definition given on Kaggle is a little confusing. It reads
“The average viewers watching a given game divided by the average channels streaming a given game, both in the same month + year”
but it amounts to
Avg_viewer_ratio = Hours_watched / Hours_Streamed
where we are told Hours_Streamed and we are asked to predict Hours_watched in order to be able to calculate the ranks.
It follows that the exact rank can be calculated from just two of the predictors. There is no machine learning problem!
Just to confirm:
# --- plot measured vs calculated Hours_watched ---------
trainRawDF %>%
  mutate( yhat = Hours_Streamed * Avg_viewer_ratio) %>%
  ggplot( aes(y=Hours_watched, x=yhat)) +
  geom_point() +
  geom_abline( intercept=0, slope=1, colour="red") +
  labs( title="Hours watched can be calculated exactly",
        x="Hours_Streamed * Avg_viewer_ratio")

Of course, if you do not notice this, then a good machine learning algorithm will discover the relationship and make exact predictions. Indeed, if you opt to work on a log scale, a simple linear regression model will give perfect predictions.
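To illustrate why the log scale works (a sketch on simulated data obeying the identity, not the competition data), note that log(Hours_watched) = log(Hours_Streamed) + log(Avg_viewer_ratio), so a linear regression recovers an intercept of 0 and slopes of 1:

```r
# --- simulated check: linear regression on the log scale -------
set.seed(2021)

# simulate data that satisfies the identity exactly
simDF <- data.frame(
  Hours_Streamed   = exp(runif(100, 5, 12)),
  Avg_viewer_ratio = exp(runif(100, 1, 6))
)
simDF$Hours_watched <- simDF$Hours_Streamed * simDF$Avg_viewer_ratio

# the fit is exact: intercept ~0, both slopes ~1
mod <- lm(log(Hours_watched) ~ log(Hours_Streamed) + log(Avg_viewer_ratio),
          data = simDF)
round(coef(mod), 6)
```

With zero residual error the fitted values reproduce Hours_watched exactly, and hence the ranks.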
Let’s read the test data and see how we get on.
# --- read test data --------------------------------------
read.csv( file.path(home, "data/rawData/test.csv")) %>%
  as_tibble() %>%
  saveRDS( file.path(home, "data/rData/test.rds"))
testRawDF <- readRDS(file.path(home, "data/rData/test.rds"))
# --- create a submission ----------------------------------
testRawDF %>%
  mutate( yhat = Hours_Streamed * Avg_viewer_ratio) %>%
  # --- rank large to small -----------------
  mutate( Rank = rank(-yhat) ) %>%
  # --- format submission -----------------
  select( Game, Rank) %>%
  arrange( Rank) %>%
  print() %>%
  write.csv( file.path( home, "temp/submission1.csv"),
             row.names=FALSE)
## # A tibble: 200 x 2
## Game Rank
## <chr> <dbl>
## 1 Just Chatting 1
## 2 Grand Theft Auto V 2
## 3 League of Legends 3
## 4 VALORANT 4
## 5 Call of Duty: Warzone 5
## 6 Fortnite 6
## 7 Minecraft 7
## 8 Counter-Strike: Global Offensive 8
## 9 Apex Legends 9
## 10 Resident Evil Village 10
## # ... with 190 more rows
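One detail worth being aware of when ranking with `rank()` (an aside of my own, not an issue in this submission since the calculated hours had no ties): the default `ties.method = "average"` produces non-integer ranks for tied values, whereas `ties.method = "first"` guarantees integers:

```r
# --- rank() tie handling ---------------------------------------
x <- c(300, 200, 200, 100)

rank(-x)                          # ties averaged: 1.0 2.5 2.5 4.0
rank(-x, ties.method = "first")   # integer ranks: 1 2 3 4
```

If the evaluation expects integer ranks, a half-rank from a tie would cost you both games' marks.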
I submitted this and, of course, I scored a perfect 1.0. Three other competitors also scored 1.0, but apart from those entries the best score was 0.36. What can you say?
What this example shows
Fortunately for Sliced, the four on-air competitors did not notice that the correct predictions were so easy to obtain; otherwise it would have been a very short episode.