Last Update: 2019-03-19 10:43:28

Before we start, let’s load a few libraries.

```
rm(list = ls())
options(warn = -1)
library(knitr)
library(ggplot2)
library(caret)
library(doParallel)
registerDoParallel(cores = (detectCores() - 1))
```

With our libraries loaded, let's read in our data. The data was downloaded on March 13 at 10:24pm.

```
data.2015 = read.csv("data/2015.csv")
data.2016 = read.csv("data/2016.csv")
data.2017 = read.csv("data/2017.csv")
data.2018 = read.csv("data/2018.csv")
```

Now, we will only deal with regular-season events, so let's remove the playoff games from our datasets.

```
get.regular.season = function(data) {
  subset(data, isPlayoffGame == 0)
}
season.2015 = get.regular.season(data.2015)
season.2016 = get.regular.season(data.2016)
season.2017 = get.regular.season(data.2017)
season.2018 = get.regular.season(data.2018)
```
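As a toy check (with fabricated rows, and restating the helper so the snippet stands alone), `subset()` keeps only the rows where `isPlayoffGame` is 0:

```
# Fabricated mini data set -- not the real shot data.
toy = data.frame(isPlayoffGame = c(0, 1, 0), shotID = 1:3)
get.regular.season = function(data) {
  subset(data, isPlayoffGame == 0)
}
get.regular.season(toy)   # keeps rows 1 and 3
```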

Here is a table of the columns we will keep and the names we will give them.

Old Column Name | New Column Name |
---|---|
`xCordAdjusted` | `x` |
`yCordAdjusted` | `y` |
`shotAngleAdjusted` | `angle` |
`shotDistance` | `dist` |
`teamCode` | `team` |
`shotType` | `type` |
`shooterName` | `shooter` |
`goalieNameForShot` | `goalie` |

```
get.helpful.data = function(data) {
  data.frame(x = data$xCordAdjusted,
             y = data$yCordAdjusted,
             angle = data$shotAngleAdjusted,
             dist = data$shotDistance,
             type = data$shotType,
             typeNum = as.numeric(data$shotType),
             goal = data$goal,
             team = data$teamCode,
             shooter = data$shooterName,
             goalie = data$goalieNameForShot)
}
# type:
# 1 -> empty
# 2 -> BACK
# 3 -> DEFL
# 4 -> SLAP
# 5 -> SNAP
# 6 -> TIP
# 7 -> WRAP
# 8 -> WRIST
analysis.2015 = get.helpful.data(season.2015)
analysis.2016 = get.helpful.data(season.2016)
analysis.2017 = get.helpful.data(season.2017)
analysis.2018 = get.helpful.data(season.2018)
```
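The 1-8 codes in the comment above come from how R numbers factor levels: `as.numeric()` on a factor returns each value's index among the alphabetically sorted levels. A minimal self-contained illustration (a fabricated vector, not the real data, which also has an empty-string level sorting first):

```
# Factor levels are sorted alphabetically: "BACK" "SLAP" "WRIST".
shot.types = factor(c("WRIST", "SLAP", "BACK"))
levels(shot.types)       # "BACK" "SLAP" "WRIST"
as.numeric(shot.types)   # 3 2 1
```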

Now, we can remove incomplete cases and create our machine learning model’s giant data set.

```
analysis.2015 = analysis.2015[complete.cases(analysis.2015),]
analysis.2016 = analysis.2016[complete.cases(analysis.2016),]
analysis.2017 = analysis.2017[complete.cases(analysis.2017),]
analysis.all = rbind(analysis.2017, analysis.2016, analysis.2015)
analysis.all = analysis.all[complete.cases(analysis.all),]
analysis.all = droplevels(analysis.all)
analysis.2018 = analysis.2018[complete.cases(analysis.2018),]
analysis.2018 = droplevels(analysis.2018)
```
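To see what those two cleanup steps do, here is a toy illustration (fabricated rows, not the real data): `complete.cases()` flags the rows with no `NA` values, and `droplevels()` discards factor levels that no longer occur after filtering.

```
# One row has a missing shot type and gets dropped.
toy = data.frame(type = factor(c("SLAP", "WRIST", NA)), goal = c(0, 1, 0))
clean = droplevels(toy[complete.cases(toy), ])
nrow(clean)          # 2
levels(clean$type)   # "SLAP" "WRIST"
```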

Here’s what `analysis.2018` looks like:

`analysis.2018`

Now, we need a few functions to help us select certain subsets of data. We’ll define three functions: `get.team.data`, `get.shooter.data`, and `get.goalie.data`.

```
get.team.data = function(data, code) {
  subset(data, team == code)
}
get.shooter.data = function(data, code) {
  subset(data, shooter == code)
}
get.goalie.data = function(data, code) {
  subset(data, goalie == code)
}
```

We can calculate a few statistics, like goal (effective) percentage for a certain shot type. Let’s write a function to do that right now.

```
calculate.goal.percentage = function(data) {
  goals = sum(data$goal == 1)
  total = nrow(data)
  goals / total
}
```

So, for example, the Penguins’ goal percentage against slap shots would be calculated as follows:

```
penguins.2018 = get.team.data(analysis.2018, "PIT")
penguins.2018.slap = subset(penguins.2018, typeNum == 4)
penguins.2018.slap.eff = calculate.goal.percentage(penguins.2018.slap)
```

Their goal percentage is 0.0612245.

Here is Ovechkin’s backhand percentage:

```
ovechkin.2018 = get.shooter.data(analysis.2018, "Alex Ovechkin")
ovechkin.2018.back = subset(ovechkin.2018, typeNum == 2)
ovechkin.2018.back.eff = calculate.goal.percentage(ovechkin.2018.back)
```

His goal percentage is 0.0833333.

Now, let’s look at Carey Price’s goal percentage against wraparound shots.

```
price.2018 = get.goalie.data(analysis.2018, "Carey Price")
price.2018.wrap = subset(price.2018, typeNum == 7)
price.2018.wrap.eff = calculate.goal.percentage(price.2018.wrap)
```

His goal percentage is 0.1052632.

Using the `caret` package, we can build machine learning models to help us determine which shot type is best.

Let’s start by setting up our training control: 5-fold cross-validation, repeated 3 times.

`control = trainControl(method = "repeatedcv", number = 5, repeats = 3)`
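A quick sketch of what this control means (assuming `caret` is installed; `createMultiFolds` is the same resampling machinery `train` uses internally): 5 folds repeated 3 times yields 5 × 3 = 15 resamples, so every candidate model is fit 15 times and its performance metric averaged across those fits.

```
library(caret)

# 15 training-index sets, each holding roughly 80% of the 100 rows.
folds = createMultiFolds(y = 1:100, k = 5, times = 3)
length(folds)   # 15
```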

Now we will train a few different types of models. Here is a list of the models we will train:

- Neural Network
- K Nearest Neighbors

```
model.nnet = train(goal ~ x + y + typeNum + dist + angle,
                   data = analysis.all,
                   method = "nnet",
                   trControl = control)
```

```
## # weights: 22
## initial value 77737.560248
## iter 10 value 20655.164118
## iter 20 value 18813.075704
## iter 30 value 18771.856595
## iter 40 value 18659.540051
## iter 50 value 18573.913604
## iter 60 value 18530.538848
## iter 70 value 18522.049872
## iter 80 value 18519.265512
## iter 90 value 18512.419443
## iter 100 value 18511.099883
## final value 18511.099883
## stopped after 100 iterations
```

```
model.knn = train(goal ~ x + y + typeNum + dist + angle,
                  data = analysis.all,
                  method = "knn",
                  trControl = control)
```

Our models are now trained.

Let’s test our models on the 2018 data. Here’s what the testing data looks like:

`analysis.2018`
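The evaluation itself isn’t shown above, so here is a minimal sketch of that next step. The `rmse` helper is my own addition, and the guarded block assumes `model.nnet`, `model.knn`, and `analysis.2018` from earlier are in scope (since `goal` was left numeric, both models are regressions, so RMSE is a natural score).

```
# Root-mean-square error between actual and predicted values.
rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Score both models on the held-out 2018 shots, if they exist in this session.
if (exists("model.nnet") && exists("model.knn")) {
  pred.nnet = predict(model.nnet, newdata = analysis.2018)
  pred.knn  = predict(model.knn,  newdata = analysis.2018)
  print(c(nnet = rmse(analysis.2018$goal, pred.nnet),
          knn  = rmse(analysis.2018$goal, pred.knn)))
}
```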