Sixteen Wins: Statistical Modeling of the Stanley Cup Playoffs: April 2015

It's that time of year again... time for a new model and new predictions.

Last year was a rough first go. The model had a very high error rate, despite what seemed like strong out-of-sample validation. I posited at the time that this was due at least in part to an abnormal number of key trades and injuries (e.g., Bishop, Gaborik, St. Louis), which were not included. I still believe this to be partly true.

However, in putting everything together this year, I also discovered a significant mistake in my data-compilation code from last year - one that likely was the main culprit in the discrepancy between the strong predictions and weak results (see Note 1 below). I suppose this kind of thing is bound to happen occasionally when one only has 4 days between the end of the season and start of the playoffs in which to compile the new data and run the models (and one has a day job during 3 of those days), but I apologize to my readers nonetheless.

So this year, I am cleaning house and moving on... like the Leafs.

The new data are in and thoroughly checked (needless to say, last year's error is corrected), and I have added some of the trendy new variables (like Corsi-for percentage) that are all the rage in the hockey stats world these days (data this year come from war-on-ice.com and hockey-reference.com; variables listed in Note 2 below). The drawback of this is that the data now only go back to 2003 (instead of 1968); but I tried running the analysis both with and without the new data, and the model seems to produce fewer errors when it includes the new data, despite having fewer observations (n = 165).

The prediction format is also a little different this year. First, I am using model averaging (with AIC weights) instead of picking a single model. In a nutshell, this procedure fits the full suite of possible models and then calculates average parameter estimates for each variable across all models, weighting the estimate from each model by its relative goodness of fit. The result is a 'committee model' that theoretically is a much better predictor than any of its component models (provided the errors are not perfectly correlated across models, which is unlikely). This is similar to the procedure Nate Silver uses on FiveThirtyEight (though he uses Bayesian model averaging instead of AIC model averaging, which I would like to do eventually, but do not yet have an easily importable script for that would work for my dataset..maybe next round or next season).

Additionally, because there are several important factors determining playoff outcomes that are not captured by my dataset (e.g., trades, injuries), and because the model suggests only moderate series predictability (see Figure 2015.1 below); I will also note where I disagree with the model predictions, and make my own bracket below the model bracket. I'll update both brackets at the start of each round (like last year) and keep a running tally of which bracket performs better. We'll see!

Without further ado, here are the predictions:

Model Predictions:
(Continuous response variable - playoff winning percentage:
1 = won in 4; 0.8 = won in 5; 0.67 = won in 6; 0.57 = won in 7; 0.43 = lost in 7; 0.33 = lost in 6; 0.2 = lost in 5; 0 = lost in 4)

Round 1:
Montreal - Ottawa (Montreal in 7 (Pred[W pct. = 0.55]))
Tampa Bay - Detroit (Detroit in 7 (Pred[W pct. = 0.52])
New York Rangers - Pittsburgh (New York in 7 - Pred[W pct. = 0.57])
New York Islanders - Washington (Washington in 7 Pred[W pct. = 0.51])

Anaheim - Winnipeg (Anaheim in 7 (Pred[W pct.] = 0.56))
Vancouver - Calgary (Calgary in 7 (Pred[W pct.] = 0.52))
St. Louis - Minnesota (St. Louis in 7 (Pred[W pct.] = 0.51))
Chicago - Nashville (Chicago in 7 (Pred[W pct.] = 0.59)

Round 2:
Montreal - Detroit (Montreal in 7 (Pred[W pct.] = 0.52))
New York - Washington (New York in 7 (Pred[W pct.] = 0.52))

Anaheim - Calgary (Anaheim in 7 (Pred[W pct.] = 0.59))
Chicago - St. Louis (Chicago in 7 (Pred[W pct.] = 0.53))

Round 3:
Montreal - New York Rangers (Montreal in 7 (Pred[W pct.] = 0.54))
Anaheim - Chicago (Chicago in 7 (Pred[W pct.] = 0.55))

Round 4:
Montreal - Chicago (Chicago in 7 (Pred[W pct.] = 0.59))

The fact that the model picks every series to be close reflects its moderate predictive ability in past series (see Figure 2015.1 below). The average (in-smaple) error-rate over the sample is 28%, meaning that the model is picking 3/10 series wrong on average. This high error-rate could be a symptom of still insufficient data; or perhaps it indicates that the Stanley Cup Playoffs are just difficult to predict in general. Interestingly, we now see a slight declining trend in average error rate - one that is likely spurious though.

Figure 2015.1. Historical in-sample error rates of model average (fraction of series for which Pred[W pct.] > 0.5 for the team that lost).

My Picks:

Round 1:
Montreal - Ottawa (Ottawa in 6):
I waffle a lot over this series. On the one hand, Ottawa bested Montreal in the season series and is the hottest team in the league right now. Max Pacioretty is also hurt, though it seems like he may be back before the end of the series. Moreover, it is worth noting that he was not a big factor in Montreal's success in the playoffs last year, despite a similarly strong season. On the other hand, Ottawa is a very young team being carried by a group of first- and second-year players, including their goalie, lacking playoff experience. I wouldn't be surprised to see inconsistent performances out of Andrew Hammond, similar to those of Carey Price in his early years. That said, Ottawa proved to be an incredibly clutch and resilient team in big games down the stretch at the end of the season. My gut says that the team that gets momentum early will win this series. All things considered, I have to go (with my head, against my heart) with Ottawa in 6.

Tampa Bay - Detroit (Tampa Bay in 5):
Here I 100% think the model is wrong. The weak goaltending of late in Detroit and Tampa Bay's strong season from start to finish are too important to ignore.

New York Rangers - Pittsburgh (New York in 5):
I like the model's pick here, but think it will be a shorter series.

New York Islanders - Washington (Washington in 7):
This should be a great series. It's tough to pick, but on balance I think the model has it right.

Anaheim - Winnipeg (Anaheim in 7):
Winnipeg is a great team with the best building in hockey right now, but I think Anaheim will close them out on home ice. I'm with the model.

Vancouver - Calgary (Calgary in 7):
This one is tough too, for many of the same reasons as the Montreal series. It's experience in Vancouver vs. a hotter young team in Calgary. I think it will go to 7, and I think Calgary will find a way. The clincher for me is the uncertain goaltending situation in Vancouver.

St. Louis - Minnesota (Minnesota in 7):
I'm going against the model again here. Minnesota has been a different team since they acquired Devan Dubnyk in the middle of the season. I think they are a much better team than their record suggests.

Chicago - Nashville (Chicago in 6):
I like the model's pick, but I think it will be a shorter series. Nashville is inexperienced and have had a lackluster second half of the season.

Round 2:
Tampa Bay - Ottawa (Tampa Bay in 7):
The model picks Tampa weakly here and I agree. Ottawa is hot, but I think Tampa will be too much down the stretch if they can stay healthy.

New York - Washington (New York in 7):
I'm with the model here. I think New York is the better team and will prevail, but Washington will get big performances out of their stars and will make it interesting.

Anaheim - Calgary (Anaheim in 7):
I'm with the model again here. Calgary will be pumped up and will make it interesting, but Anaheim will ultimately be too tough to handle.

Chicago - Minnesota (Chicago in 7):
The model picks Chicago here and I agree. Minnesota is a great team, but Chicago has their number in the playoffs.

Round 3:
New York Rangers - Tampa Bay (Tampa Bay in 6):
The model picks Tampa Bay here and I agree. Both teams play a similar speed game, but Tampa is a bit faster. They have dominated the season series (3-0-0). I think this series will look a lot like New York - Montreal last year.

Anaheim - Chicago (Chicago in 7):
I am going with the model here. I think Chicago's playoff poise will give them the edge over an Anaheim team with an inexperienced goalie.

Round 4:
Tampa Bay - Chicago (Chicago in 7):
The model picks Chicago. I waffle a bit here because Tampa Bay seems like a slightly stronger team and won the season series (1-0-1; Chicago's win being a shootout), but in the end I have to go with Chicago's playoff experience to get the job done.

Note 1: Last year's error:
When I was tallying goals-for-goals-against in match-ups for the season last year, the playoffs were inadvertently included as well. This means, for example, that the model was predicting a past year's playoff results based on the combination of season and playoff results from the same year. This is a huge mistake - one that introduced a large amount of circularity (or 'endogeneity', as economists like to call it), and in doing so created a facade of predictability.

Note 2: Variables included in the analysis:
Season Series (SS) Winning pct. (shootouts count as ties for both teams)
SS Mean Goal Differential
Transformed SS ((SS Winning pct. - 0.5)*Number of games in SS)
Total SS Goal Differential (SS GF - SS GA)
Difference in Average GF
Difference in Average GA
Difference in Season-Long Winning pct. (Home, Away, and Total)
Difference in Corsi-for percentage (both overall and 5 vs. 5)
Difference in Corsi-for per 60 minutes (both overall and 5 vs. 5)
Difference in on-ice unblocked shot attempts on goal (both overall and 5 vs. 5)
Difference in on-ice shooting percentage (both overall and 5 vs. 5)
Difference in on-ice save percentage (both overall and 5 vs. 5)
Difference in PDO (on-ice shooting percentage plus on-ice save percentage) (both overall and 5 vs. 5)
Difference in faceoff percentage (both overall and 5 vs. 5)
Difference in ZSO% (fraction of faceoffs in offensive zone vs. defensive zone) (both overall and 5 vs. 5)

Sixteen Wins: Statistical Modeling of the Stanley Cup Playoffs

Thursday, 30 April 2015

Round 2 Predictions: Chicago, Anaheim, New York, Tampa Bay

Tuesday, 14 April 2015

Sixteen Wins 2015: Cleaning house like the Leafs. Cup Pick: Chicago