Last Update: 2019-02-06 14:36:10

Libraries

Before we start, let’s load a few libraries.

rm(list = ls())

set.seed(100)

options(warn = -1)

library(knitr)
library(ggplot2)
library(caret)
library(doParallel)

registerDoParallel(cores = (detectCores() - 1))

We register all but one core so model training can run in parallel while leaving one core free for the rest of the system.
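
If you want to double-check the backend, foreach (which doParallel attaches) reports how many workers are registered; this check is optional and not part of the original pipeline.

# Number of parallel workers the foreach backend will use.
getDoParWorkers()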

Data Loading

Let’s read in our data.

data.2015 = read.csv("data/2015.csv")
data.2016 = read.csv("data/2016.csv")
data.2017 = read.csv("data/2017.csv")
data.2018 = read.csv("data/2018.csv")
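
As a quick, optional sanity check, we can confirm that each file actually produced rows (the exact counts depend on the CSVs themselves).

# Row counts per season, just to confirm the files loaded.
sapply(list(`2015` = data.2015, `2016` = data.2016,
            `2017` = data.2017, `2018` = data.2018), nrow)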

We will only deal with regular-season events, so let's remove the playoff games from our datasets.

get.regular.season = function(data) {
    subset(data, isPlayoffGame == 0)
}

season.2015 = get.regular.season(data.2015)
season.2016 = get.regular.season(data.2016)
season.2017 = get.regular.season(data.2017)
season.2018 = get.regular.season(data.2018)
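
A quick assertion (optional) confirms that no playoff games remain after the subsetting.

# Every remaining row should have isPlayoffGame == 0.
stopifnot(all(season.2018$isPlayoffGame == 0))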

Now let's remove extraneous columns and rename the ones we keep (I've shortened the names for convenience):

Old Column Name       New Column Name
xCordAdjusted         x
yCordAdjusted         y
shotAngleAdjusted     angle
shotDistance          dist
teamCode              team
goal                  goal

get.helpful.data = function(data) {
    # Keep only the columns we need, under shorter names.
    data.frame(x = data$xCordAdjusted,
               y = data$yCordAdjusted,
               angle = data$shotAngleAdjusted,
               dist = data$shotDistance,
               team = data$teamCode,
               goal = data$goal)
}

analysis.2015 = get.helpful.data(season.2015)
analysis.2016 = get.helpful.data(season.2016)
analysis.2017 = get.helpful.data(season.2017)
analysis.2018 = get.helpful.data(season.2018)
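
At this point each frame should carry only the six columns listed above; a quick look at the 2018 frame confirms the names.

# Column names after the rename.
names(analysis.2018)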

Some rows have missing values. Let's keep only the complete cases and drop the rest.

analysis.2015 = analysis.2015[complete.cases(analysis.2015),]
analysis.2016 = analysis.2016[complete.cases(analysis.2016),]
analysis.2017 = analysis.2017[complete.cases(analysis.2017),]
analysis.2018 = analysis.2018[complete.cases(analysis.2018),]
analysis.all = rbind(analysis.2017, analysis.2016, analysis.2015)
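
As an optional sanity check, none of the frames should contain missing values any more.

# No NA values should survive the filtering.
stopifnot(!any(is.na(analysis.all)), !any(is.na(analysis.2018)))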

We'll need a helper function to pull out a single team's data.

get.team.data = function(data, code) {
    subset(data, team == code)
}
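
For example, assuming "TOR" appears in the teamCode column (any valid team code works here), we could pull one team's 2018 shots like this:

# Illustrative usage; "TOR" is just an example team code.
head(get.team.data(analysis.2018, "TOR"))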

Creating the Models

With our data, we can start creating models. We'll be creating the following models: a single-hidden-layer neural network (method = "nnet") and a k-nearest neighbors model (method = "knn"), both tuned with 5-fold cross-validation repeated twice.

# 5-fold cross-validation, repeated twice, shared by both models.
control = trainControl(method = "repeatedcv", number = 5, repeats = 2)

# Single-hidden-layer neural network; team is excluded from the predictors.
model.nnet = train(goal ~ . -goal -team,
                   data = analysis.all,
                   method = "nnet",
                   trControl = control)
## # weights:  31
## initial  value 79308.617571 
## iter  10 value 21110.433003
## iter  20 value 19408.328843
## iter  30 value 18788.755236
## iter  40 value 18693.310231
## iter  50 value 18597.284636
## iter  60 value 18567.439924
## iter  70 value 18548.289994
## iter  80 value 18540.726000
## iter  90 value 18529.751641
## iter 100 value 18523.279147
## final  value 18523.279147 
## stopped after 100 iterations
# k-nearest neighbors on the same predictors and resampling scheme.
model.knn = train(goal ~ . -goal -team,
                  data = analysis.all,
                  method = "knn",
                  trControl = control)
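
Since both models were tuned with the same repeated cross-validation settings, caret's resamples() gives a quick side-by-side of their resampled training metrics. This is just a sketch; for a strictly paired comparison you would also fix the resampling indices via trainControl's index argument.

# Collect and summarise the cross-validation results of both models.
model.results = resamples(list(nnet = model.nnet, knn = model.knn))
summary(model.results)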

Extracting Predictions

Our predictions will come from analysis.2018. Here are the first few rows of that data:

head(analysis.2018)

Now we can call predict on each model with the 2018 data.

nnet.prediction = predict(model.nnet, newdata = analysis.2018)
knn.prediction = predict(model.knn, newdata = analysis.2018)

nnet.prediction.data = data.frame(analysis.2018)
nnet.prediction.data$predict = nnet.prediction

knn.prediction.data = data.frame(analysis.2018)
knn.prediction.data$predict = knn.prediction
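
As a rough, illustrative check (not part of the original write-up), and assuming goal is coded as numeric 0/1 so the predictions are numeric, we can compare the neural network's 2018 predictions against the actual outcomes with RMSE and a 0.5-threshold accuracy.

# RMSE of the neural network predictions on the 2018 data.
sqrt(mean((nnet.prediction.data$predict - nnet.prediction.data$goal)^2))
# Accuracy if we call anything above 0.5 a predicted goal.
mean((nnet.prediction.data$predict > 0.5) == (nnet.prediction.data$goal == 1))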

So, the first few rows of our Neural Network predictions look like:

head(nnet.prediction.data)

And the first few rows of our K-Nearest Neighbors predictions look like:

head(knn.prediction.data)