Sliced Episode 6: Ranking Games on Twitch

Publish date: 2021-09-28
Tags: Sliced, rank

Summary

Background: In episode 6 of the 2021 series of Sliced, the competitors were given two hours to analyse a set of data on the top 200 games broadcast on Twitch. The aim was to predict their exact rankings.
My approach: The ranking is based on the number of hours of streaming that were watched. Presumably the organisers did not notice that they provided two predictors, which when multiplied gave the number of hours watched. So the ranks can be predicted with 100% accuracy.
Result: I got a perfect score.
Conclusion: Always read the question.

Introduction

The sixth of the Sliced datasets asks the competitors to predict the rank order of the top 200 computer games featured on Twitch, using predictors such as the games' ranks in previous months, the number of people streaming each game, and so on.

The ranking of a game depends on the number of hours that people watch that game being streamed; the more hours, the higher the rank. So we have a choice: predict rank directly, or predict hours watched and then calculate the rank.

The training data are given monthly from the start of 2016 until April 2021 and we are asked to predict the ranks for May 2021.

Evaluation is by simple accuracy: each game whose predicted rank exactly matches its true rank scores 1/200. So if game A is truly ranked 1 out of 200 and game B is ranked 2, then predicting A=1, B=2 scores 2/200; A=1, B=200 scores 1/200; and A=2, B=1 scores 0/200. Get all 200 ranks correct and you score a perfect 1.
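This metric can be sketched in a couple of lines of base R. This is my reading of the rules with made-up toy values, not the organisers' actual scoring code:

```r
# Accuracy: the proportion of games whose predicted rank equals the true rank.
# Toy example with just two games, A and B (hypothetical values).
actual    <- c(A = 1, B = 2)
predicted <- c(A = 1, B = 200)               # A correct, B wrong

mean(predicted[names(actual)] == actual)     # 0.5, i.e. 1 correct out of 2
```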

Data Exploration

Let’s first inspect the training data. I’ve followed my normal practice of downloading the raw data and saving it in rds files. I have chosen to refer to the training set as trainRawDF.

# --- setup the libraries etc. ---------------------------------
library(tidyverse)

theme_set( theme_light())

# --- the project folder ---------------------------------------
home  <- "C:/Projects/kaggle/sliced/s01-e06"

# --- read the training data -----------------------------------
read.csv( file.path(home, "data/rawData/train.csv")) %>%
  as_tibble() %>%
  saveRDS( file.path(home, "data/rData/train.rds")) 

trainRawDF <- readRDS(file.path(home, "data/rData/train.rds"))

# --- summarise with skimr -------------------------------------
skimr::skim(trainRawDF)
Table 1: Data summary
Name trainRawDF
Number of rows 12750
Number of columns 10
_______________________
Column type frequency:
character 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Game 0 1 0 128 1 1640 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Rank 0 1 100.60 57.82 1.00 50.00 101.00 151.00 200.00 ▇▇▇▇▇
Month 0 1 6.27 3.48 1.00 3.00 6.00 9.00 12.00 ▇▅▅▅▇
Year 0 1 2018.20 1.55 2016.00 2017.00 2018.00 2020.00 2021.00 ▇▅▅▅▂
Hours_watched 0 1 4275710.94 15067784.68 89811.00 332580.50 718087.00 1975239.50 344551979.00 ▇▁▁▁▁
Hours_Streamed 0 1 141869.72 524824.72 19.00 10995.50 28237.00 79082.50 10245704.00 ▇▁▁▁▁
Peak_viewers 0 1 49662.30 118284.13 441.00 7656.75 18349.00 41708.75 3123208.00 ▇▁▁▁▁
Peak_channels 0 1 525.59 2543.71 1.00 47.00 109.00 286.75 129860.00 ▇▁▁▁▁
Streamers 0 1 16073.08 53540.57 0.00 1345.00 3767.50 9889.00 1013029.00 ▇▁▁▁▁
Avg_viewer_ratio 0 1 84.36 379.35 2.27 15.96 29.06 58.09 13601.87 ▇▁▁▁▁

For once I have shown the output from skim(). This is a relatively small dataset with no missing data.

Ask a silly question

Before we launch into data exploration, it pays to look carefully at the definitions of the predictors.

One of the variables that we are given for prediction is Avg_viewer_ratio. The definition given on Kaggle is a little confusing; it reads:
“The average viewers watching a given game divided by the average channels streaming a given game, both in the same month + year”

but it amounts to

Avg_viewer_ratio = Hours_watched / Hours_Streamed

where we are given Hours_Streamed, and Hours_watched is exactly the quantity that we must predict in order to calculate the ranks.

It follows that the exact ranks can be calculated from just two of the predictors. There is no machine learning problem!

Just to confirm it, I plot the measured hours watched against the calculated values.

# --- plot measured vs calculated Hours_watched ---------
trainRawDF %>%
  mutate( yhat = Hours_Streamed * Avg_viewer_ratio) %>%
  ggplot( aes(y=Hours_watched, x=yhat)) +
  geom_point() +
  geom_abline( intercept=0, slope=1, colour="red") +
  labs( title="Hours watched can be calculated exactly",
        x="Hours_Streamed * Avg_viewer_ratio")

Of course, if you do not notice this, then a good machine learning algorithm will discover the relationship and make exact predictions. Indeed, if you opt to work on a log scale, a simple linear regression model will give perfect predictions.
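To illustrate the log-scale point, here is a sketch on simulated data standing in for trainRawDF; the toy values are my own, not the competition data. Because Hours_watched = Hours_Streamed * Avg_viewer_ratio, taking logs turns the identity into a linear model that lm() recovers exactly:

```r
# Simulate toy data with the same multiplicative identity as the real set
set.seed(1)
toy <- data.frame(Hours_Streamed   = exp(runif(50, 3, 10)),
                  Avg_viewer_ratio = exp(runif(50, 1, 5)))
toy$Hours_watched <- toy$Hours_Streamed * toy$Avg_viewer_ratio

# On the log scale the identity is linear:
#   log(Hours_watched) = log(Hours_Streamed) + log(Avg_viewer_ratio)
fit <- lm(log(Hours_watched) ~ log(Hours_Streamed) + log(Avg_viewer_ratio),
          data = toy)

round(coef(fit), 4)       # intercept 0, both slopes 1
summary(fit)$r.squared    # 1, up to floating point
```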

Let’s read the test data and see how we go.

# --- read test data --------------------------------------
read.csv( file.path(home, "data/rawData/test.csv")) %>%
  as_tibble() %>%
  saveRDS( file.path(home, "data/rData/test.rds")) 

testRawDF <- readRDS(file.path(home, "data/rData/test.rds"))

# --- create a submission ----------------------------------
testRawDF %>%
  mutate( yhat = Hours_Streamed * Avg_viewer_ratio) %>%
  # --- rank large to small -----------------
  mutate( Rank = rank(-yhat) ) %>%
  # --- format submission -----------------
  select( Game, Rank) %>%
  arrange( Rank) %>% 
  print() %>%
  write.csv( file.path( home, "temp/submission1.csv"),
             row.names=FALSE)
## # A tibble: 200 x 2
##    Game                              Rank
##    <chr>                            <dbl>
##  1 Just Chatting                        1
##  2 Grand Theft Auto V                   2
##  3 League of Legends                    3
##  4 VALORANT                             4
##  5 Call of Duty: Warzone                5
##  6 Fortnite                             6
##  7 Minecraft                            7
##  8 Counter-Strike: Global Offensive     8
##  9 Apex Legends                         9
## 10 Resident Evil Village               10
## # ... with 190 more rows
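A small aside on the rank(-yhat) step above: rank() assigns 1 to the smallest value, so negating yhat flips the order and the most-watched game gets rank 1.

```r
# rank() gives 1 to the smallest value; negate to rank large-to-small
x <- c(10, 30, 20)
rank(x)    # 1 3 2
rank(-x)   # 3 1 2 : the largest value gets rank 1
```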

I submitted this and, of course, I scored a perfect 1.0. Three other competitors also scored 1.0, but apart from those entries the best score was 0.36. What can you say?

What this example shows

Fortunately for Sliced, the four competitors on the show did not notice that the correct predictions were this obvious; otherwise it would have been a very short episode.