Wednesday, 16 April 2014

Comparing Models with Binary & Continuous Response Variables

In case people are skeptical of the high probabilities the model is predicting for some series, two things:

1) These predicted probabilities are point estimates produced by the model, so they carry some error. One way to estimate that error is to bootstrap, i.e. re-fit the model many times on versions of the dataset resampled with replacement and compare the probability estimates you get. I haven't done this yet because it's a little complicated in the stats software I use, but it can definitely be done, and I'll do it if people are interested and I can find the time in the next few days.
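
For readers curious what the bootstrap looks like in practice, here is a minimal Python sketch. It is not the model from this post: the single predictor, the toy data, the small ridge penalty and the function names (`predict`, `fit_logit`, `bootstrap_interval`) are all assumptions made for illustration, and the interval is a simple percentile interval.

```python
import math
import random

def predict(a, b, x):
    """Logit model: predicted P(win) for predictor value x."""
    z = max(-30.0, min(30.0, a + b * x))  # clamp so exp() cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(xs, ys, iters=15, ridge=1e-3):
    """Newton-Raphson fit of a one-predictor logit; returns (intercept, slope).

    The tiny ridge penalty is an added assumption: it keeps the fit finite
    if a resample happens to be perfectly separable.
    """
    a = b = 0.0
    for _ in range(iters):
        g0, g1 = -ridge * a, -ridge * b    # gradient of penalized log-likelihood
        h00, h01, h11 = ridge, 0.0, ridge  # negative Hessian (positive definite)
        for x, y in zip(xs, ys):
            p = predict(a, b, x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det   # Newton step: solve H * delta = g
        b += (h00 * g1 - h01 * g0) / det
    return a, b

def bootstrap_interval(xs, ys, x_new, n_boot=200, level=0.95, seed=0):
    """Percentile interval for the predicted probability at x_new."""
    rng = random.Random(seed)
    n = len(xs)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        a, b = fit_logit([xs[i] for i in idx], [ys[i] for i in idx])
        estimates.append(predict(a, b, x_new))
    estimates.sort()
    tail = (1.0 - level) / 2.0
    lo_i = int(tail * n_boot)
    hi_i = min(n_boot - 1, int((1.0 - tail) * n_boot))
    return estimates[lo_i], estimates[hi_i]

# Toy data (hypothetical): x is some team-strength difference, y is win/loss.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [1 if rng.random() < predict(0.0, 1.5, x) else 0 for x in xs]

a_hat, b_hat = fit_logit(xs, ys)
point = predict(a_hat, b_hat, 1.0)
low, high = bootstrap_interval(xs, ys, x_new=1.0)
print(f"point estimate {point:.3f}, bootstrap 95% interval ({low:.3f}, {high:.3f})")
```

The spread of the resampled estimates around the point estimate is what would show how much to trust a prediction like 94.7%.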

2) For comparison, here are the original predictions from my last post side by side with predictions from a model with identical explanatory variables but a continuous response variable: series winning percentage (W%), which runs from 0 (being swept) to 1 (sweeping). I've also included 95% confidence intervals on the predicted winning percentages, which might help in picking the series length:

Predicted Winner in Bold (Predicted p[W] for winner - Logit Model) (Predicted W% for winner (CI) - Continuous Model):
Atlantic Division:
Boston vs. Detroit (94.7%) (0.69 (0.36,1))
Tampa Bay vs. Montreal (75.2%) (0.58 (0.25,0.91))
Tampa Bay vs. Detroit (90.6%) (0.65 (0.32,0.98))

Metropolitan Division:
Pittsburgh vs. Columbus (97.4%) (0.74 (0.41,1))
New York vs. Philadelphia (84.4%) (0.62 (0.29,0.95))
Pittsburgh vs. NYR (72.4%) (0.57 (0.24,0.90))

Central Division:
Colorado vs. Minnesota (88.1%) (0.63 (0.30,0.97))
St. Louis vs. Chicago (96.9%) (0.73 (0.40,1))
Colorado vs. Chicago (71.7%) (0.55 (0.22,0.88))

Pacific Division:
Anaheim vs. Dallas (76.7%) (0.59 (0.26,0.92))
San Jose vs. LA (82.4%) (0.61 (0.28,0.94))
LA vs. Dallas (60.9%) (0.54 (0.21,0.87))

Eastern Conference Final:
Tampa Bay vs. NYR (99.8%) (0.93 (0.60,1))

Western Conference Final:
Colorado vs. Dallas (97%) (0.72 (0.39,1))

Stanley Cup Final:
Colorado vs. Tampa Bay (68.6%) (0.54 (0.21,0.87))

As you can see, the continuous model makes the same picks, but the only series in which the 95% confidence interval on the pick's W% lies entirely above 0.5 is Tampa Bay over NYR. The observed vs. predicted plot for the continuous model is shown below. It is very similar to the plot shown for the logit model in the previous post.
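
The reading of the table above can be checked mechanically. The Python sketch below copies the numbers from the table, lists the series whose interval stays above 0.5, and maps each predicted W% to the nearest best-of-7 result. That last mapping is an assumption about how the W% values might be used to pick a series length (it treats W% as games won divided by games played); the function names are illustrative only.

```python
# (matchup, logit p[W] in %, predicted W%, CI low, CI high), copied from the table above
PICKS = [
    ("Boston vs. Detroit", 94.7, 0.69, 0.36, 1.00),
    ("Tampa Bay vs. Montreal", 75.2, 0.58, 0.25, 0.91),
    ("Tampa Bay vs. Detroit", 90.6, 0.65, 0.32, 0.98),
    ("Pittsburgh vs. Columbus", 97.4, 0.74, 0.41, 1.00),
    ("New York vs. Philadelphia", 84.4, 0.62, 0.29, 0.95),
    ("Pittsburgh vs. NYR", 72.4, 0.57, 0.24, 0.90),
    ("Colorado vs. Minnesota", 88.1, 0.63, 0.30, 0.97),
    ("St. Louis vs. Chicago", 96.9, 0.73, 0.40, 1.00),
    ("Colorado vs. Chicago", 71.7, 0.55, 0.22, 0.88),
    ("Anaheim vs. Dallas", 76.7, 0.59, 0.26, 0.92),
    ("San Jose vs. LA", 82.4, 0.61, 0.28, 0.94),
    ("LA vs. Dallas", 60.9, 0.54, 0.21, 0.87),
    ("Tampa Bay vs. NYR", 99.8, 0.93, 0.60, 1.00),
    ("Colorado vs. Dallas", 97.0, 0.72, 0.39, 1.00),
    ("Colorado vs. Tampa Bay", 68.6, 0.54, 0.21, 0.87),
]

# Winner's W% for each possible series length (assumes W% = wins / games played)
SERIES_WPCT = {4: 4 / 4, 5: 4 / 5, 6: 4 / 6, 7: 4 / 7}

def interval_above_half(picks):
    """Matchups whose 95% interval on the pick's W% lies entirely above 0.5."""
    return [name for name, _, _, low, _ in picks if low > 0.5]

def nearest_series_length(w_pct):
    """Number of games whose winner's W% is closest to the predicted W%."""
    return min(SERIES_WPCT, key=lambda games: abs(SERIES_WPCT[games] - w_pct))

print(interval_above_half(PICKS))
for name, _, w_pct, _, _ in PICKS:
    print(f"{name}: predicted W% {w_pct:.2f}, closest to a {nearest_series_length(w_pct)}-game series")
```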

Figure 2014.2. Observed vs. predicted winning percentages from continuous model. 
 

Model v 1.0 and 2014 Playoff Picks!

Hello Blogosphere!

For those who don't know me, my name is Matt Burgess. By day, I'm an ecologist/applied economist finishing up my PhD at the University of Minnesota this semester, and heading to UC Santa Barbara in the summer to start a post-doc. By night, I'm a die-hard hockey fan. I grew up in Brossard, just outside of Montreal, Canada, so more specifically, I'm a die-hard Habs fan. Inspired by Nate Silver's popular FiveThirtyEight blog (I'm also a political junkie), I decided last season to try my hand at statistically modeling the NHL Playoffs - the idea being that 7-game series should be reasonably predictable (compared to, say, the NFL playoffs or March Madness).

So this past year, I downloaded team and player stats from every NHL season and playoffs going back to 1967-1968 (the beginning of the modern era), and have been playing around with models. Here are predictions from a logit model (the response variable is binary W/L) that I selected based on AIC. I won't give away exactly which effects are included yet, but I will say that both the overall team stats and the season series stats are very important.
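
For anyone unfamiliar with AIC-based selection: AIC = 2k - 2 ln(L), where k is the number of fitted parameters and L is the maximized likelihood, and the candidate with the lowest AIC is preferred. The Python sketch below compares two toy logit models. The counts and the predictor name are hypothetical and are not taken from the dataset described here; it relies on the fact that a logit model with only an intercept, or an intercept plus one yes/no predictor, has fitted probabilities equal to the observed win fractions.

```python
import math

def bernoulli_loglik(wins, n, p):
    """Log-likelihood of `wins` successes in `n` trials at win probability p."""
    return wins * math.log(p) + (n - wins) * math.log(1.0 - p)

def aic(loglik, k):
    """Akaike information criterion for a model with k fitted parameters."""
    return 2 * k - 2 * loglik

# Hypothetical counts: (series won, series played), split by a yes/no predictor.
groups = {
    "won_season_series": (70, 100),
    "lost_season_series": (45, 100),
}

total_wins = sum(w for w, _ in groups.values())
total_n = sum(n for _, n in groups.values())

# Model A: intercept only (k = 1); the fitted probability is the overall win fraction.
loglik_a = bernoulli_loglik(total_wins, total_n, total_wins / total_n)

# Model B: intercept + season-series dummy (k = 2); fitted probabilities are group fractions.
loglik_b = sum(bernoulli_loglik(w, n, w / n) for w, n in groups.values())

aic_a, aic_b = aic(loglik_a, 1), aic(loglik_b, 2)
best = "B" if aic_b < aic_a else "A"
print(f"AIC A = {aic_a:.2f}, AIC B = {aic_b:.2f}, preferred: model {best}")
```

The extra parameter in model B only pays off if it improves the log-likelihood by more than 1, which is the trade-off AIC encodes.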

The modeling approach I am currently using is very much a beta version, as there are many things I would like to add but haven't yet. For example, the models do not currently consider multi-season trends in a particular matchup, injuries and other within-season roster trends, or newer measures such as Corsi for percentage, but I hope to add some of these by next year.

Sixteen Wins Stanley Cup Playoff Predictions 2014:
April 16, 2014 (I will post revised predictions after each round)

Descriptive Stats:
Time period considered: Modern era (1968-present)
n = 594 (599 - 5 series with no season series)
DF: 593 (Total) = 5 (Fitted) + 588 (Lack of Fit)
Model Failure Rate (all years)* = 11.2% (i.e. the fraction of series in which the team the model gave p[W] > 50% went on to lose)
Series model would have gotten wrong in the last 3 seasons**:
2013: Anaheim vs. Detroit (Pred. p[W] = 82.3% for ANA) (DET won in 7)
2012: New Jersey vs. NYR (Pred. p[W] = 55.3% for NYR) (NJD won in 6)
2011: Chicago vs. Vancouver (Pred. p[W] = 98.1% for CHI) (VAN won in 7) 
          NYR vs. Washington (Pred. p[W] = 71.8% for NYR) (WSH won in 5) 
          Detroit vs. San Jose (Pred. p[W] = 63.4% for DET) (SJS won in 7) 

*based on in-sample predictions.
**based on out-of-sample predictions (model re-fit excluding years 2011-2013)
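
The failure rate defined above is simple to compute from (predicted p[W], actual result) pairs. Here is a short Python sketch; the function name is illustrative, and the out-of-sample figure at the end assumes that all 15 series in each of the 2011, 2012 and 2013 playoffs were scored, which the list above does not state explicitly.

```python
def failure_rate(series):
    """Fraction of series in which the team given p[W] > 50% went on to lose.

    `series` is an iterable of (p_w, won) pairs, one per series, where p_w is the
    predicted win probability for one team and won says whether that team won.
    """
    series = list(series)
    misses = sum(1 for p_w, won in series if (p_w > 0.5) != won)
    return misses / len(series)

# The five out-of-sample misses listed above: p[W] for the predicted winner, who lost.
misses_2011_2013 = [
    (0.823, False),  # 2013: ANA predicted over DET
    (0.553, False),  # 2012: NYR predicted over NJD
    (0.981, False),  # 2011: CHI predicted over VAN
    (0.718, False),  # 2011: NYR predicted over WSH
    (0.634, False),  # 2011: DET predicted over SJS
]

# Assumption: 3 playoff years x 15 series per year were all scored out of sample.
series_scored = 3 * 15
out_of_sample_rate = len(misses_2011_2013) / series_scored
print(f"out-of-sample failure rate: {out_of_sample_rate:.1%}")
```

Under that assumption, the out-of-sample rate for the last three seasons comes out close to the 11.2% in-sample figure.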

Figure 2014.1. Comparison of model predictions and observed series winning percentages (n = 594). W% = 0.5 (dashed line) separates teams that won their series (W% > 0.5) from those that lost (W% < 0.5).

Model Predictions for 2014:
Predicted Winner in Bold (Predicted p[W] for winner):
Atlantic Division:
Boston vs. Detroit (94.7%)
Tampa Bay vs. Montreal (75.2%)
Tampa Bay vs. Detroit (90.6%)

Metropolitan Division:
Pittsburgh vs. Columbus (97.4%)
New York vs. Philadelphia (84.4%)
Pittsburgh vs. NYR (72.4%)

Central Division:
Colorado vs. Minnesota (88.1%)
St. Louis vs. Chicago (96.9%)
Colorado vs. Chicago (71.7%)

Pacific Division:
Anaheim vs. Dallas (76.7%)
San Jose vs. LA (82.4%)
LA vs. Dallas (60.9%)

Eastern Conference Final:
Tampa Bay vs. NYR (99.8%)

Western Conference Final:
Colorado vs. Dallas (97%)

Stanley Cup Final:
Colorado vs. Tampa Bay (68.6%)

Coming Soon: Predicted Odds of Each Team Winning the Cup (integrated over all possible brackets)
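
One way "integrating over all possible brackets" could work is sketched below in Python: carry forward, for every slot in a fixed bracket, the probability that each team emerges from it, summing over every possible opponent. The four-team bracket, the team labels and the pairwise probabilities are all hypothetical, and the sketch assumes that series outcomes are independent given the matchup. It is not necessarily the method the upcoming post will use.

```python
def p_beats(p_win, a, b):
    """P(a beats b), looked up from a table that stores each pair once."""
    return p_win[(a, b)] if (a, b) in p_win else 1.0 - p_win[(b, a)]

def title_odds(bracket, p_win):
    """Championship odds for a fixed bracket.

    `bracket` lists teams so that neighbours meet in round one, the winners of
    neighbouring series meet in round two, and so on (length is a power of 2).
    """
    # Each slot maps team -> probability that this team comes out of the slot.
    slots = [{team: 1.0} for team in bracket]
    while len(slots) > 1:
        next_round = []
        for left, right in zip(slots[0::2], slots[1::2]):
            merged = {}
            for a, pa in left.items():
                for b, pb in right.items():
                    meet = pa * pb  # probability that a and b actually meet
                    pab = p_beats(p_win, a, b)
                    merged[a] = merged.get(a, 0.0) + meet * pab
                    merged[b] = merged.get(b, 0.0) + meet * (1.0 - pab)
            next_round.append(merged)
        slots = next_round
    return slots[0]

# Hypothetical pairwise series win probabilities for four made-up teams.
p_win = {
    ("A", "B"): 0.6, ("C", "D"): 0.7,
    ("A", "C"): 0.5, ("A", "D"): 0.8,
    ("B", "C"): 0.4, ("B", "D"): 0.6,
}

odds = title_odds(["A", "B", "C", "D"], p_win)
for team, p in sorted(odds.items(), key=lambda kv: -kv[1]):
    print(f"{team}: {p:.3f}")
```

For a real 16-team bracket, the same function would need a win probability for every pair of teams that could meet, not just the matchups listed above.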