It's all over. Final score:
Model Predictions: 12/15 (80%) = 6/8 (Rd. 1) + 3/4 (Rd. 2) + 2/2 (Rd. 3) + 1/1 (Rd. 4)
My Predictions: 14/15 (93%) = 7/8 (Rd. 1) + 4/4 (Rd. 2) + 2/2 (Rd. 3) + 1/1 (Rd. 4)
Both the model and I predicted the majority of series correctly, and both picked the Cup winner from day one. My bracket finished in the 100th percentile of nhl.com's bracket challenge (rank 720).
Did I get lucky? Yes. Partly. The model's in-sample error rate from previous seasons was 30%; this year it was only 20%. So the model beat its own average error rate this year. Next year could be different (and this year could have been different if a couple of the close series, e.g., Washington vs. New York Islanders, had turned out differently).
On the other hand, the success of both the model's predictions and my own this year suggests: (i) that the Stanley Cup Playoffs are at least somewhat predictable; and (ii) that, as useful as statistical models are, there will likely always be room for human expertise so long as some factors contributing to teams' playoff success remain difficult to measure (I beat the model on 3/4 series in which we disagreed).
Lastly, it's worth acknowledging the competition. SAP had an impressive prediction interface on nhl.com that predicted the outcome of each series. Their record was:
SAP (nhl.com): 10/15 (67%) = 4/8 (Rd. 1) + 3/4 (Rd. 2) + 2/2 (Rd. 3) + 1/1 (Rd. 4)
SAP got every series in the East wrong in the first round (picking Ottawa, Detroit, Pittsburgh, NYI), and picked Montreal over Tampa Bay in Round 2. Does this mean their model is worse than mine? Not necessarily. Their 33% error rate is well within the expected error range of my model. Of the two series our models picked differently (Pittsburgh-NYR, Washington-NYI), one (Washington-NYI) was very close; the only pick that seems genuinely strange for SAP's model is Pittsburgh over the Rangers.
However, the success of the Sixteen Wins predictions this year is a sign of hope for hacks like me - devoted fans with publicly available data - wanting to try our hands at predicting the playoffs on tight time and financial budgets.
Sixteen Wins: Statistical Modeling of the Stanley Cup Playoffs
Tuesday, 16 June 2015
Saturday, 30 May 2015
Final pick: Chicago in 7
Both the model and I went 2/2 this round, so nothing has changed going forward:
Final pick: Chicago over Tampa Bay in 7 (Model Pred[Wpct] = 0.56).
I think this is going to be a great series with a decent number of goals.
Prediction scores so far:
Model Predictions: 11/14 (79%) = 6/8 (Rd. 1) + 3/4 (Rd. 2) + 2/2 (Rd. 3)
My Predictions: 13/14 (93%) = 7/8 (Rd. 1) + 4/4 (Rd. 2) + 2/2 (Rd. 3)
Thursday, 14 May 2015
Round 3 Predictions: Chicago, Tampa Bay. I'm 11/12 so far.
Sad to see the Habs go, but I (and many others) predicted it. All of my picks were right this past round; the model very weakly picked Montreal, so it got one wrong. So far, the model and I have agreed on 8/12 series, of which we are 8/8; and we have disagreed on 4 series, of which I am 3/4.
Cumulative totals are:
Model Predictions: 9/12 (75%) = 6/8 (Rd. 1) + 3/4 (Rd. 2)
My Predictions: 11/12 (92%) = 7/8 (Rd. 1) + 4/4 (Rd. 2)
Here are the predictions for the next round (mine, needless to say, have not changed). The model and I now agree on a Chicago over Tampa Bay Stanley Cup Final.
Model Predictions:
Round 3:
Tampa Bay over New York Rangers (Pred[W pct] = 0.51)
Chicago over Anaheim (Pred[W pct] = 0.55)
Stanley Cup Final:
Chicago over Tampa Bay (Pred[W pct] = 0.56)
My Predictions:
Round 3:
Tampa Bay over New York Rangers in 6
Chicago over Anaheim in 7
Stanley Cup Final:
Chicago over Tampa Bay in 7
Thursday, 30 April 2015
Round 2 Predictions: Chicago, Anaheim, New York, Tampa Bay
Not a bad first round!
All five series that the model and I agreed on were right (Chicago, Anaheim, Calgary, Washington, New York Rangers). Of the three we disagreed on, I got two right (Minnesota, Tampa Bay); the model got one (Montreal). I underestimated my Habs - emotions clearly clouding my judgment. So all in all:
Model Predictions: 6/8
My Predictions: 7/8
As a result, the predictions for the remainder of the playoffs are largely the same:
Model Predictions
Round 2:
Montreal over Tampa Bay (barely: Pred[W pct.] = 0.5008; it should also be noted that a model average using only the variables that span the full dataset (1968-2014) picks Tampa Bay (Pred[W pct.] = 0.53)).
Chicago over Minnesota (Pred[W pct.] = 0.58)
The rest of the predictions are the same as before (see Round 1 post for predicted winning percentages):
New York Rangers over Washington
Anaheim over Calgary
Round 3:
Montreal over New York Rangers
Chicago over Anaheim
Stanley Cup Final:
Chicago over Montreal
My Predictions
Round 2:
Tampa Bay over Montreal in 6. I really hope I am underestimating the Habs again, but with the Lightning healthy this time, their season series record against the Habs (5-0-0) is too decisive to ignore.
The rest of my predictions are the same as the last round's as well.
New York Rangers over Washington in 7
Chicago over Minnesota in 7
Anaheim over Calgary in 7
Round 3:
Tampa Bay over New York Rangers in 6
Chicago over Anaheim in 7
Stanley Cup Final:
Chicago over Tampa Bay in 7
Tuesday, 14 April 2015
Sixteen Wins 2015: Cleaning house like the Leafs. Cup Pick: Chicago
It's that time of year again... time for a new model and new predictions.
Last year was a rough first go. The model had a very high error rate, despite what seemed like strong out-of-sample validation. I posited at the time that this was due at least in part to an abnormal number of key trades and injuries (e.g., Bishop, Gaborik, St. Louis), which were not included. I still believe this to be partly true.
However, in putting everything together this year, I also discovered a significant mistake in my data-compilation code from last year - one that likely was the main culprit in the discrepancy between the strong predictions and weak results (see Note 1 below). I suppose this kind of thing is bound to happen occasionally when one only has 4 days between the end of the season and start of the playoffs in which to compile the new data and run the models (and one has a day job during 3 of those days), but I apologize to my readers nonetheless.
So this year, I am cleaning house and moving on... like the Leafs.
The new data are in and thoroughly checked (needless to say, last year's error is corrected), and I have added some of the trendy new variables (like Corsi-for percentage) that are all the rage in the hockey stats world these days (data this year come from war-on-ice.com and hockey-reference.com; variables listed in Note 2 below). The drawback of this is that the data now only go back to 2003 (instead of 1968); but I tried running the analysis both with and without the new data, and the model seems to produce fewer errors when it includes the new data, despite having fewer observations (n = 165).
The prediction format is also a little different this year. First, I am using model averaging (with AIC weights) instead of picking a single model. In a nutshell, this procedure fits the full suite of possible models and then calculates average parameter estimates for each variable across all models, weighting the estimate from each model by its relative goodness of fit. The result is a 'committee model' that theoretically is a much better predictor than any of its component models (provided the errors are not perfectly correlated across models, which is unlikely). This is similar to the procedure Nate Silver uses on FiveThirtyEight (though he uses Bayesian model averaging instead of AIC model averaging, which I would like to do eventually but do not yet have an easily importable script that would work for my dataset... maybe next round or next season).
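The AIC-weighting step described above can be sketched as follows. This is a minimal illustration in Python with made-up AIC scores and component predictions (the actual candidate models are fit elsewhere); only the weighting arithmetic is shown.

```python
import math

def akaike_weights(aics):
    """Akaike weights: w_i = exp(-0.5 * dAIC_i) / sum_j exp(-0.5 * dAIC_j),
    where dAIC_i = AIC_i - min(AIC). Better-fitting (lower-AIC) models get
    proportionally more weight."""
    best = min(aics)
    rel = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(rel)
    return [r / total for r in rel]

def model_averaged_prediction(predictions, aics):
    """'Committee' prediction: each component model's prediction weighted by
    its Akaike weight."""
    return sum(w * p for w, p in zip(akaike_weights(aics), predictions))

# Made-up example: three candidate models predicting a team's series W pct.
preds = [0.58, 0.52, 0.61]
aics = [100.0, 101.5, 104.0]
committee = model_averaged_prediction(preds, aics)  # lands between 0.52 and 0.61
```

The same weights can be applied per-coefficient instead of per-prediction; for linear models the two give the same averaged forecast.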
Additionally, because there are several important factors determining playoff outcomes that are not captured by my dataset (e.g., trades, injuries), and because the model suggests only moderate series predictability (see Figure 2015.1 below), I will also note where I disagree with the model's predictions and make my own bracket below the model bracket. I'll update both brackets at the start of each round (like last year) and keep a running tally of which bracket performs better. We'll see!
Without further ado, here are the predictions:
Model Predictions:
(Continuous response variable - playoff winning percentage:
1 = won in 4; 0.8 = won in 5; 0.67 = won in 6; 0.57 = won in 7; 0.43 = lost in 7; 0.33 = lost in 6; 0.2 = lost in 5; 0 = lost in 4)
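The mapping above is simply wins divided by games played (the series winner always has 4 wins; the loser has games minus 4). A quick sketch:

```python
def playoff_w_pct(won_series, games):
    """Playoff series winning percentage: wins / games played.
    The series winner always has 4 wins; the loser has games - 4."""
    if games not in (4, 5, 6, 7):
        raise ValueError("a best-of-seven series lasts 4 to 7 games")
    wins = 4 if won_series else games - 4
    return wins / games

# won in 5 -> 4/5 = 0.8; lost in 6 -> 2/6 = 0.33; lost in 4 -> 0/4 = 0
```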
Round 1:
Montreal - Ottawa (Montreal in 7 (Pred[W pct.] = 0.55))
Tampa Bay - Detroit (Detroit in 7 (Pred[W pct.] = 0.52))
New York Rangers - Pittsburgh (New York in 7 (Pred[W pct.] = 0.57))
New York Islanders - Washington (Washington in 7 (Pred[W pct.] = 0.51))
Anaheim - Winnipeg (Anaheim in 7 (Pred[W pct.] = 0.56))
Vancouver - Calgary (Calgary in 7 (Pred[W pct.] = 0.52))
St. Louis - Minnesota (St. Louis in 7 (Pred[W pct.] = 0.51))
Chicago - Nashville (Chicago in 7 (Pred[W pct.] = 0.59))
Round 2:
Montreal - Detroit (Montreal in 7 (Pred[W pct.] = 0.52))
New York - Washington (New York in 7 (Pred[W pct.] = 0.52))
Anaheim - Calgary (Anaheim in 7 (Pred[W pct.] = 0.59))
Chicago - St. Louis (Chicago in 7 (Pred[W pct.] = 0.53))
Round 3:
Montreal - New York Rangers (Montreal in 7 (Pred[W pct.] = 0.54))
Anaheim - Chicago (Chicago in 7 (Pred[W pct.] = 0.55))
Round 4:
Montreal - Chicago (Chicago in 7 (Pred[W pct.] = 0.59))
The fact that the model picks every series to be close reflects its moderate predictive ability in past series (see Figure 2015.1 below). The average (in-sample) error rate over the sample is 28%, meaning the model picks about 3 in 10 series wrong on average. This high error rate could be a symptom of still-insufficient data, or perhaps it indicates that the Stanley Cup Playoffs are just difficult to predict in general. Interestingly, there is now a slight declining trend in the average error rate, though it is likely spurious.

Figure 2015.1. Historical in-sample error rates of model average (fraction of series for which Pred[W pct.] > 0.5 for the team that lost).
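The error-rate definition in the caption can be computed directly; a sketch, with hypothetical (prediction, outcome) pairs:

```python
def in_sample_error_rate(series_results):
    """Error rate as defined in the figure caption: the fraction of series
    for which the model gave Pred[W pct.] > 0.5 to a team that lost
    (equivalently, <= 0.5 to a team that won). Input: (pred, won) pairs,
    one per series, from the perspective of one team in each series."""
    wrong = sum(1 for pred, won in series_results if (pred > 0.5) != won)
    return wrong / len(series_results)

# Made-up sample: model right in 7 series, wrong in 3 -> error rate 0.3
sample = [(0.6, True)] * 7 + [(0.6, False)] * 3
```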
My Picks:
Round 1:
Montreal - Ottawa (Ottawa in 6):
I waffle a lot over this series. On the one hand, Ottawa bested Montreal in the season series and is the hottest team in the league right now. Max Pacioretty is also hurt, though it seems he may be back before the end of the series; it is worth noting that he was not a big factor in Montreal's playoff success last year, despite a similarly strong season. On the other hand, Ottawa is a very young team carried by a group of first- and second-year players (including their goalie) who lack playoff experience. I wouldn't be surprised to see inconsistent performances out of Andrew Hammond, similar to those of Carey Price in his early years. That said, Ottawa proved to be an incredibly clutch and resilient team in big games down the stretch at the end of the season. My gut says the team that gets momentum early will win this series. All things considered, I have to go (with my head, against my heart) with Ottawa in 6.
Tampa Bay - Detroit (Tampa Bay in 5):
Here I 100% think the model is wrong. The weak goaltending of late in Detroit and Tampa Bay's strong season from start to finish are too important to ignore.
New York Rangers - Pittsburgh (New York in 5):
I like the model's pick here, but think it will be a shorter series.
New York Islanders - Washington (Washington in 7):
This should be a great series. It's tough to pick, but on balance I think the model has it right.
Anaheim - Winnipeg (Anaheim in 7):
Winnipeg is a great team with the best building in hockey right now, but I think Anaheim will close them out on home ice. I'm with the model.
Vancouver - Calgary (Calgary in 7):
This one is tough too, for many of the same reasons as the Montreal series. It's experience in Vancouver vs. a hotter young team in Calgary. I think it will go to 7, and I think Calgary will find a way. The clincher for me is the uncertain goaltending situation in Vancouver.
St. Louis - Minnesota (Minnesota in 7):
I'm going against the model again here. Minnesota has been a different team since they acquired Devan Dubnyk in the middle of the season. I think they are a much better team than their record suggests.
Chicago - Nashville (Chicago in 6):
I like the model's pick, but I think it will be a shorter series. Nashville is inexperienced and has had a lackluster second half of the season.
Round 2:
Tampa Bay - Ottawa (Tampa Bay in 7):
The model picks Tampa weakly here and I agree. Ottawa is hot, but I think Tampa will be too much down the stretch if they can stay healthy.
New York - Washington (New York in 7):
I'm with the model here. I think New York is the better team and will prevail, but Washington will get big performances out of their stars and will make it interesting.
Anaheim - Calgary (Anaheim in 7):
I'm with the model again here. Calgary will be pumped up and will make it interesting, but Anaheim will ultimately be too tough to handle.
Chicago - Minnesota (Chicago in 7):
The model picks Chicago here and I agree. Minnesota is a great team, but Chicago has their number in the playoffs.
Round 3:
New York Rangers - Tampa Bay (Tampa Bay in 6):
The model picks Tampa Bay here and I agree. Both teams play a similar speed game, but Tampa is a bit faster. They have dominated the season series (3-0-0). I think this series will look a lot like New York - Montreal last year.
Anaheim - Chicago (Chicago in 7):
I am going with the model here. I think Chicago's playoff poise will give them the edge over an Anaheim team with an inexperienced goalie.
Round 4:
Tampa Bay - Chicago (Chicago in 7):
The model picks Chicago. I waffle a bit here because Tampa Bay seems like a slightly stronger team and won the season series (1-0-1; Chicago's win being a shootout), but in the end I have to go with Chicago's playoff experience to get the job done.
Note 1: Last year's error:
When I was tallying goals for and goals against in match-ups for the season last year, the playoffs were inadvertently included as well. This means, for example, that the model was predicting a past year's playoff results based on the combination of regular-season and playoff results from that same year. This is a huge mistake - one that introduced a large amount of circularity (or 'endogeneity', as economists like to call it), and in doing so created a facade of predictability.
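As an illustration of the fix (the game-record field names here are hypothetical), the tally simply needs to skip playoff games so that only regular-season information enters the predictors:

```python
def season_goal_diff(games, team):
    """Tally a team's regular-season goal differential, skipping playoff
    games. Each game is a dict with hypothetical keys: 'type'
    ('regular'/'playoff'), 'home', 'away', 'home_goals', 'away_goals'."""
    diff = 0
    for g in games:
        if g["type"] != "regular":  # the 2014 bug: playoff games leaked in here
            continue
        if g["home"] == team:
            diff += g["home_goals"] - g["away_goals"]
        elif g["away"] == team:
            diff += g["away_goals"] - g["home_goals"]
    return diff
```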
Note 2: Variables included in the analysis:
Season Series (SS) Winning pct. (shootouts count as ties for both teams)
SS Mean Goal Differential
Transformed SS ((SS Winning pct. - 0.5)*Number of games in SS)
Total SS Goal Differential (SS GF - SS GA)
Difference in Average GF
Difference in Average GA
Difference in Season-Long Winning pct. (Home, Away, and Total)
Difference in Corsi-for percentage (both overall and 5 vs. 5)
Difference in Corsi-for per 60 minutes (both overall and 5 vs. 5)
Difference in on-ice unblocked shot attempts on goal (both overall and 5 vs. 5)
Difference in on-ice shooting percentage (both overall and 5 vs. 5)
Difference in on-ice save percentage (both overall and 5 vs. 5)
Difference in PDO (on-ice shooting percentage plus on-ice save percentage) (both overall and 5 vs. 5)
Difference in faceoff percentage (both overall and 5 vs. 5)
Difference in ZSO% (fraction of faceoffs in offensive zone vs. defensive zone) (both overall and 5 vs. 5)
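For concreteness, here is how a few of the variables above are defined, as a sketch (shootout handling follows the first bullet; PDO per its bullet):

```python
def ss_win_pct(wins, losses, shootout_games):
    """Season Series (SS) winning pct., counting shootout games as ties
    (half a win) for both teams."""
    games = wins + losses + shootout_games
    return (wins + 0.5 * shootout_games) / games

def transformed_ss(ss_wpct, n_games):
    """Transformed SS = (SS winning pct. - 0.5) * number of games in the SS,
    so a 3-1 season-series edge counts for more than a 1-0 edge."""
    return (ss_wpct - 0.5) * n_games

def pdo(on_ice_shooting_pct, on_ice_save_pct):
    """PDO: on-ice shooting percentage plus on-ice save percentage."""
    return on_ice_shooting_pct + on_ice_save_pct

# e.g. a 2-1-1 season series (one shootout game): ss_win_pct(2, 1, 1) = 0.625
```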
Wednesday, 4 June 2014
Final Pick: NYR in 6
Eeek! In the craziness of preparing for my PhD defense (which is on Monday), I almost forgot to post predictions for the finals! Anyway, the model (which struggled again last round) picks NYR quite strongly. And in case anyone thinks I'm only saying this because they're up 2-1 right now, you can see it in my last post: NYR's predicted odds against Montreal and their odds of winning the cup were very similar, precisely because the model strongly picks NYR over either LA or Chicago.
I will post a more detailed post-mortem on last round after my defense, but here are the picks in the meantime:
NYR over LA in 6 (Logit model: 91% chance of NYR win; Continuous model: E[W%] = 0.65 (CI: 0.32, 0.98))
Despite my model's struggles this year (which I think are in large part due to key injuries and trade-deadline moves involving key teams, e.g. NYR, LAK, MTL, Tampa Bay), my hockey gut tells me this prediction will be right, though I could see it going to 7 games again. I don't think Quick has been solid enough these playoffs to weather the NYR storm, but I think Hank is up to the challenge.
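For reference, the logit model's percentage is just the inverse-logit of the fitted linear predictor. A minimal sketch (the 2.31 coefficient value is hypothetical, back-solved from the 91% figure quoted above):

```python
import math

def inv_logit(eta):
    """Inverse-logit: map a logit model's linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# A linear predictor of about 2.31 (hypothetical, back-solved) corresponds
# to roughly the 91% NYR win probability quoted above.
p_nyr = inv_logit(2.31)
```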
#BecauseItsTheCup
Saturday, 17 May 2014
Conference Finals: Updated Predictions
With Round 2 over and Round 3 starting later today, here are the quick updated predictions. The model got 3/4 series right in Round 2 (it only got Anaheim-LAK wrong), so most of the predictions are the same as last time:
Revised Picks (Model):
Montreal vs. NYR [92% (Logit), predicted W pct = 0.66 (CI: 0.33, 0.99)]
Chicago vs. LAK [99% (Logit), predicted W pct = 0.80 (CI: 0.47, 1)]
Chicago vs. Montreal [66% (Logit), predicted W pct = 0.53 (CI: 0.20, 0.86)]
Based on the predicted winning percentages, here are the revised predicted series lengths:
Chicago over LAK in 5
Montreal over NYR in 6
Montreal over Chicago in 7
And here are the predicted probabilities for each team of winning the cup (summed over all brackets):
Montreal: 60%
Chicago: 31%
NYR: 8%
LAK: 1%
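The "summed over all brackets" calculation works as follows: with four teams left, a team's cup probability is the chance it wins its own series times the probability-weighted chance it beats each possible final opponent. A sketch (the Montreal-LAK probability of 0.35 is an assumed placeholder, and the 66% is read as Montreal's chance against Chicago, consistent with the "Montreal over Chicago" pick; the other numbers are from the model output above):

```python
def cup_probability(p_own, p_vs_c, p_vs_d, p_c_wins_other):
    """With four teams left, a team's cup probability = P(win own series) *
    [P(C wins other series) * P(beat C) + P(D wins other series) * P(beat D)].
    This is the sum over the possible brackets for the final four."""
    return p_own * (p_c_wins_other * p_vs_c + (1 - p_c_wins_other) * p_vs_d)

# Montreal's cup odds: 0.92 to beat NYR; Chicago beats LAK with prob 0.99;
# Montreal beats Chicago with prob 0.66 and (assumed) LAK with prob 0.35.
# This roughly reproduces the 60% figure above.
p_mtl = cup_probability(0.92, 0.66, 0.35, 0.99)
```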
Obviously, I'm a little disappointed that the model narrowly missed a perfect round. But the model did predict that it would get an average of about 1 series wrong last round, given how narrow the predicted margins were (see my last post for details), so I still see this round as redemption for the model compared to the first round. One could argue that the Gibson start in Game 7 (which I think was a huge mistake, as much as I think Gibson is a great goalie - you can't start a rookie goalie with fewer than 10 games of NHL experience in a Game 7 against a veteran team like that and not expect some mistakes) and the Gaborik acquisition at the deadline are two key things the model did not consider when it picked the Ducks. But, as I said in my last post, I don't think we should read too much into model errors in Game 7s, as single-game outcomes come with a lot of stochasticity.
And I'm of course thrilled that the Habs prediction is still intact. The model does strongly pick LA over the Habs though, so I'm also rooting for Chicago this round. Go Habs Go!