NHL Shot Effectiveness Model

Sasank Vishnubhatla

February 8th, 2019

Last Update: 2019-03-19 10:43:28

Libraries

Before we start, let’s load a few libraries.

rm(list = ls())

options(warn = -1)

library(knitr)
library(ggplot2)
library(caret)
library(doParallel)

registerDoParallel(cores = (detectCores() - 1))

With our libraries loaded we can start loading our data.

Data Loading

Let’s read in our data. Data was downloaded on March 13 at 10:24pm.

data.2015 = read.csv("data/2015.csv")
data.2016 = read.csv("data/2016.csv")
data.2017 = read.csv("data/2017.csv")
data.2018 = read.csv("data/2018.csv")

We will only deal with regular-season events, so let’s remove playoff games from our datasets.

get.regular.season = function(data) {
    subset(data, isPlayoffGame == 0)
}

season.2015 = get.regular.season(data.2015)
season.2016 = get.regular.season(data.2016)
season.2017 = get.regular.season(data.2017)
season.2018 = get.regular.season(data.2018)
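The four filter calls above follow the same pattern, so they can also be written over a list of seasons. The sketch below is a minimal illustration on a made-up data frame: the toy rows and the shotID column are invented, and it assumes, as the filter above does, that isPlayoffGame is 0 for regular-season events.

```r
# Same filter as get.regular.season above, repeated so the sketch runs alone.
get.regular.season = function(data) {
    subset(data, isPlayoffGame == 0)
}

# Invented rows standing in for one season's CSV (shotID is hypothetical).
toy = data.frame(shotID = 1:5,
                 isPlayoffGame = c(0, 0, 1, 0, 1))

# With real files, the list could be built with something like
#   lapply(c("data/2015.csv", "data/2016.csv"), read.csv)
toy.seasons = list(`2015` = toy, `2016` = toy[1:3, ])

# Apply the filter to every season in one call.
toy.regular = lapply(toy.seasons, get.regular.season)

# Number of regular-season rows kept per season.
sapply(toy.regular, nrow)
```

Keeping the seasons in a named list avoids repeating each step once per year, at the cost of slightly less explicit variable names.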

Here is a table of the columns we keep and the names we give them. In addition, goal keeps its original name, and typeNum is a numeric code derived from shotType.

Old Column Name      New Column Name
xCordAdjusted        x
yCordAdjusted        y
shotAngleAdjusted    angle
shotDistance         dist
teamCode             team
shotType             type
shooterName          shooter
goalieNameForShot    goalie

get.helpful.data = function(data) {
    data.frame(x = data$xCordAdjusted,
               y = data$yCordAdjusted,
               angle = data$shotAngleAdjusted,
               dist = data$shotDistance,
               type = data$shotType,
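               # Fixed levels keep the codes listed below stable across seasons,
               # and also work when read.csv returns strings instead of factors.
               typeNum = as.numeric(factor(data$shotType,
                   levels = c("", "BACK", "DEFL", "SLAP", "SNAP", "TIP", "WRAP", "WRIST"))),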
               goal = data$goal,
               team = data$teamCode,
               shooter = data$shooterName,
               goalie = data$goalieNameForShot)
}

# type:
# 1 -> empty
# 2 -> BACK
# 3 -> DEFL
# 4 -> SLAP
# 5 -> SNAP
# 6 -> TIP
# 7 -> WRAP
# 8 -> WRIST

analysis.2015 = get.helpful.data(season.2015)
analysis.2016 = get.helpful.data(season.2016)
analysis.2017 = get.helpful.data(season.2017)
analysis.2018 = get.helpful.data(season.2018)
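The numeric codes in the comment above follow from how R orders factor levels alphabetically, assuming shotType is stored as a factor. As a small illustration (the shot types here are typed in by hand, not read from the data), fixing the levels explicitly shows why the empty string becomes 1 and WRIST becomes 8, and why the codes can shift when a season happens to lack one of the types.

```r
shot.levels = c("", "BACK", "DEFL", "SLAP", "SNAP", "TIP", "WRAP", "WRIST")

# Hand-typed example values; DEFL, SNAP, WRAP and the empty type are absent.
types = c("WRIST", "SLAP", "BACK", "WRIST", "TIP")

# Levels inferred from the data: codes are assigned in alphabetical order
# among the values that happen to be present, so WRIST is not 8 here.
codes.inferred = as.numeric(factor(types))
codes.inferred

# Levels fixed in advance: codes match the mapping in the comment above.
codes.fixed = as.numeric(factor(types, levels = shot.levels))
codes.fixed
```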

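Now we can remove incomplete cases and combine the 2015, 2016 and 2017 seasons into one large training set for our machine learning models. The 2018 season is cleaned separately and kept aside for testing.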

analysis.2015 = analysis.2015[complete.cases(analysis.2015),]
analysis.2016 = analysis.2016[complete.cases(analysis.2016),]
analysis.2017 = analysis.2017[complete.cases(analysis.2017),]
analysis.all = rbind(analysis.2017, rbind(analysis.2016, analysis.2015))
analysis.all = analysis.all[complete.cases(analysis.all),]
analysis.all = droplevels(analysis.all)
analysis.2018 = analysis.2018[complete.cases(analysis.2018),]
analysis.2018 = droplevels(analysis.2018)
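As a quick illustration of the two cleaning steps, the sketch below applies complete.cases and droplevels to a hand-made data frame (all values invented): rows containing NA are dropped, and factor levels that no longer occur are removed.

```r
toy = data.frame(dist = c(12.5, NA, 40.0, 8.2),
                 shooter = factor(c("A", "B", "C", "A")))

# Keep only rows with no missing values; row 2 is dropped.
toy.clean = toy[complete.cases(toy), ]

# "B" no longer appears in the data but is still listed as a level.
levels.before = levels(toy.clean$shooter)
levels.before

# droplevels removes levels that have no remaining rows.
toy.clean = droplevels(toy.clean)
levels.after = levels(toy.clean$shooter)
levels.after
```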

Here’s what analysis.2018 looks like:

analysis.2018

Now, we need a few functions to help us select subsets of the data by team, shooter, or goalie. We’ll define three functions: get.team.data, get.shooter.data, and get.goalie.data.

get.team.data = function(data, code) {
    subset(data, team == code)
}

get.shooter.data = function(data, code) {
    subset(data, shooter == code)
}

get.goalie.data = function(data, code) {
    subset(data, goalie == code)
}

Calculating Statistics

We can calculate a few statistics, such as the goal percentage (effectiveness) of a given set of shots: the number of goals divided by the number of shots. Let’s write a function to do that right now.

calculate.goal.percentage = function(data) {
    goals = sum(data$goal == 1)
    total = nrow(data)
    goals / total
}
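Note that the function returns the proportion goals / shots, a value between 0 and 1 rather than a percentage out of 100. The sketch below runs it on five invented shots and also shows an edge case worth keeping in mind: an empty subset gives 0 / 0, which R reports as NaN.

```r
# Same function as above, repeated so the sketch runs on its own.
calculate.goal.percentage = function(data) {
    goals = sum(data$goal == 1)
    total = nrow(data)
    goals / total
}

# Five made-up shots, one of which is a goal.
toy = data.frame(goal = c(0, 1, 0, 0, 0))
eff.toy = calculate.goal.percentage(toy)
eff.toy

# A filter that matches nothing leaves zero rows, so the result is NaN.
eff.empty = calculate.goal.percentage(toy[toy$goal == 2, , drop = FALSE])
eff.empty
```

When looping over many teams, shooters or shot types, checking nrow(data) > 0 before dividing is one way to avoid carrying NaN values forward.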

So, for example, the Penguins’ goal percentage on slap shots would be calculated as follows:

penguins.2018 = get.team.data(analysis.2018, "PIT")
penguins.2018.slap = subset(penguins.2018, typeNum == 4)
penguins.2018.slap.eff = calculate.goal.percentage(penguins.2018.slap)

Their goal percentage is 0.0612245, or about 6.1%.

Here is Ovechkin’s backhand percentage:

ovechkin.2018 = get.shooter.data(analysis.2018, "Alex Ovechkin")
ovechkin.2018.back = subset(ovechkin.2018, typeNum == 2)
ovechkin.2018.back.eff = calculate.goal.percentage(ovechkin.2018.back)

His goal percentage is 0.0833333, or about 8.3%.

Now, let’s look at Carey Price’s goal percentage against wraparound shots.

price.2018 = get.goalie.data(analysis.2018, "Carey Price")
price.2018.wrap = subset(price.2018, typeNum == 7)
price.2018.wrap.eff = calculate.goal.percentage(price.2018.wrap)

His goal percentage against is 0.1052632: about 10.5% of the wraparound shots he faced went in.
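The three examples above each pick one shot type by hand. A per-type breakdown can be sketched with tapply, shown here on a small invented data frame with the same type and goal columns; with the real data, the same call could presumably be applied to objects such as penguins.2018.

```r
# Six made-up shots; only the column names match the analysis data.
toy = data.frame(type = c("SLAP", "SLAP", "WRIST", "WRIST", "WRIST", "BACK"),
                 goal = c(0, 1, 0, 0, 1, 0))

# The mean of a 0/1 column is the share of shots that were goals.
eff.by.type = tapply(toy$goal, toy$type, mean)
eff.by.type

# Number of shots behind each rate; small counts make the rate noisy.
shots.by.type = table(toy$type)
shots.by.type
```

Reporting the shot counts next to the rates helps when a rate rests on only a handful of shots.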

Creating the Models

Using the caret package, we can build machine learning models to help us determine which shot type is best.

Let’s start by defining the training control: 5-fold cross-validation, repeated 3 times.

control = trainControl(method = "repeatedcv", number = 5, repeats = 3)
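With method = "repeatedcv", number = 5 and repeats = 3, each candidate model is evaluated on 5 x 3 = 15 resamples, each holding out roughly one fifth of the rows. The base-R sketch below only illustrates that bookkeeping on 20 dummy rows; it is a hand-rolled illustration under those assumptions, not how caret builds its folds internally.

```r
set.seed(1)          # arbitrary seed so the shuffles are repeatable
n = 20               # dummy number of rows
k = 5                # corresponds to number = 5
repeats = 3          # corresponds to repeats = 3

holdouts = list()
for (r in seq_len(repeats)) {
    # Shuffle fold labels 1..k across the rows, giving k equally sized folds.
    fold.id = sample(rep(seq_len(k), length.out = n))
    for (f in seq_len(k)) {
        # Rows held out for validation in fold f of repeat r.
        holdouts[[paste0("Fold", f, ".Rep", r)]] = which(fold.id == f)
    }
}

length(holdouts)             # total number of resamples
unique(lengths(holdouts))    # size of each held-out fold
```

Repeating the cross-validation with different shuffles reduces the dependence of the performance estimate on one particular split, at the cost of proportionally more model fits.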

Now we will train two different types of models: a neural network (nnet) and k-nearest neighbors (knn).

model.nnet = train(goal ~ x + y + typeNum + dist + angle,
                   data = analysis.all,
                   method = "nnet",
                   trControl = control)

## # weights:  22
## initial  value 77737.560248 
## iter  10 value 20655.164118
## iter  20 value 18813.075704
## iter  30 value 18771.856595
## iter  40 value 18659.540051
## iter  50 value 18573.913604
## iter  60 value 18530.538848
## iter  70 value 18522.049872
## iter  80 value 18519.265512
## iter  90 value 18512.419443
## iter 100 value 18511.099883
## final  value 18511.099883 
## stopped after 100 iterations

model.knn = train(goal ~ x + y + typeNum + dist + angle,
                  data = analysis.all,
                  method = "knn",
                  trControl = control)
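One detail worth checking here: caret decides between regression and classification from the class of the outcome column. Because goal is stored as a number (0 or 1), train treats the formulas above as regression problems. If classification were wanted instead, the outcome could be converted to a factor first, as in this sketch on invented values (the labels "no" and "yes" and the column name goal.f are assumptions, not names from the original analysis).

```r
# Invented outcomes standing in for the goal column.
toy = data.frame(goal = c(0, 1, 0, 0, 1))

# Text labels are used because caret needs valid R names as class levels
# when class probabilities are requested.
toy$goal.f = factor(toy$goal, levels = c(0, 1), labels = c("no", "yes"))

class(toy$goal)      # numeric outcome: train() would fit a regression
class(toy$goal.f)    # factor outcome: train() would fit a classifier
table(toy$goal.f)
```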

Both models are now trained.

Testing the Models

Let’s test our models on the 2018 data. Here’s what the testing data looks like:

analysis.2018
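As a sketch of what the test step could look like: since goal is numeric, the models return a numeric prediction for each shot, and one simple check is the mean squared difference between prediction and outcome (often called the Brier score when the predictions are probabilities). The helper name brier.score and the numbers below are made up for illustration; with the fitted models, the predictions would presumably come from something like predict(model.knn, newdata = analysis.2018).

```r
# Hypothetical helper: mean squared error between predictions and 0/1 outcomes.
brier.score = function(pred, actual) {
    mean((pred - actual)^2)
}

actual = c(0, 0, 1, 0, 1)              # invented outcomes
pred = c(0.1, 0.2, 0.7, 0.05, 0.4)     # invented model predictions

score.model = brier.score(pred, actual)
score.model

# Baseline for comparison: always predict the overall goal rate.
score.baseline = brier.score(rep(mean(actual), length(actual)), actual)
score.baseline
```

A model is only useful here if its score is lower than the constant baseline, which is a reasonable first sanity check before comparing the two models with each other.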