by Jizhou Wang, Abhay Puri, Binulal Narayanan, Qilin Wang
In the previous post we showed how to obtain, tidy and visualize NHL games data. Now, let us experiment with the data to predict goals.
The experiments described in this blog are logged to this comet.ml workspace.
For feature engineering, we visualized the data to build intuition about outliers and biases that could affect feature selection and imputation later on.
Although teams tend to shoot from around -30, 0 (directly in front of the net) and 30 degrees, goals follow a bell-shaped distribution, with shots at 0 degrees having the highest probability of scoring.
Scoring is harder when the shot distance exceeds 25 feet, yet teams' shots are fairly evenly distributed across distances. Shots past 70 feet are rare, which makes sense since those are taken from the defensive zone.
Shots rarely turn into goals when the shot angle exceeds 50 degrees or the distance exceeds 40 feet.
The goal ratio is highest within 20 feet, lower between 20 and 40 feet, and much lower beyond 40 feet. Interestingly, as shot distance increases past 100 feet, the goal ratio starts to rise again. This suggests that, although few shots are taken from long range, the chance of scoring on the ones that are taken is high. This makes sense especially when there is an empty net.
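The goal-ratio-by-distance analysis above can be sketched with pandas. The column names `shot_distance` and `is_goal` below are hypothetical stand-ins for whatever the tidied dataframe actually uses, and the data is synthetic:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the shot data; real column names may differ.
rng = np.random.default_rng(0)
shots = pd.DataFrame({
    "shot_distance": rng.uniform(0, 120, 5000),
    "is_goal": rng.random(5000) < 0.09,  # roughly 9% of shots score overall
})

# Bin shot distance in 20-foot buckets and compute the goal ratio per bin.
bins = pd.cut(shots["shot_distance"], bins=range(0, 130, 20))
goal_ratio = shots.groupby(bins, observed=True)["is_goal"].mean()
print(goal_ratio)
```

On the real data, the resulting series would show the drop past 20 feet and the uptick at long range described above.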
The goal ratio versus shot angle follows a bell-shaped curve around zero between -90 and 90 degrees. Interestingly, the goal ratio increases dramatically from -90 to -180 degrees. After inspecting the data and verifying with some video examples (sharp-angle-shot footage), this suggests that, although few shots are taken at these angles, the chance of scoring when they are taken can be high.
An example of wrong information (rinkSide):
For gameID 2018020953 and eventidx 368 (Jakub Voracek's goal), we found that the shot distance was recorded as much larger than it actually was, even though the goal was scored on a non-empty net, as shown here.
For our baseline, we used a logistic regression model with a single feature: distance to goal. After fitting on the shot data, it achieved an accuracy of 0.91 and an AUC of 0.69. However, the model pushes all its predictions towards the non-goal class, meaning it predicts every shot to be a non-goal. As a result, its precision, recall and F1-score for shots that are goals are all zero. This is a symptom of the heavy class imbalance between goals and non-goals; we want our model to also predict situations that would result in a goal.
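A minimal sketch of such a distance-only baseline on synthetic, imbalanced shot data (the decay rate and goal probabilities below are made up for illustration, not taken from the real dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in: goals become rarer with distance, and only a small
# fraction of shots are goals (heavy class imbalance).
rng = np.random.default_rng(42)
dist = rng.uniform(0, 100, 10000)
p_goal = 0.25 * np.exp(-dist / 30)           # scoring odds fall with distance
y = (rng.random(10000) < p_goal).astype(int)

clf = LogisticRegression().fit(dist.reshape(-1, 1), y)
pred = clf.predict(dist.reshape(-1, 1))

print("accuracy:", clf.score(dist.reshape(-1, 1), y))
print("AUC:", roc_auc_score(y, clf.predict_proba(dist.reshape(-1, 1))[:, 1]))
print("goals predicted:", pred.sum())  # 0: every shot is labelled "no goal"
```

Because no fitted probability ever crosses 0.5, accuracy looks high while the positive-class precision, recall and F1 are all zero, reproducing the baseline's behaviour.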
All simple features tested on the logistic regression model:
The goal rate curves of the different models. The angle-only logistic regression appears to predict a nearly constant goal rate at every shot-probability percentile, which is not reliable for determining goals.
The ROC curves show that logistic regression with both distance and angle (or with just the distance) as features produces the best results. Logistic regression with only the angle as a feature produces a result not much better than a random guess.
For the calibration curves, we can also see that logistic regression with angle and distance produces some improvement, but its probabilistic predictions are still far from those of a perfectly calibrated binary classifier.
Logistic regression, trained on distance only
Logistic regression, trained on angle only
Logistic regression, trained on both distance and angle
This time we have added many novel features from the shot data to improve our model prediction.
Feature list Description (before preprocessing)
The dataframe can be found in this link.
Using all the newly added features, we preprocessed the data into numerical form.
After preprocessing, the data was split into training and validation sets with stratification. Grid search with cross-validation (GridSearchCV) was used to find the optimal model parameters during training and validation.
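A sketch of the stratified split and grid search using scikit-learn's `train_test_split` and `GridSearchCV`; the toy data and the parameter grid below are illustrative, not the grid we actually searched:

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (rng.random(2000) < 0.1).astype(int)   # imbalanced, like goals vs. shots

# stratify=y keeps the goal/no-goal ratio equal in both sets.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Cross-validated search over a small hyperparameter grid.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    scoring="roc_auc", cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```

Stratification matters here because goals are rare: without it, a random split could leave the validation set with a noticeably different goal rate.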
The ROC curve has an AUC of 0.76, and the calibration is much smoother than before: almost perfectly calibrated for mean predicted probabilities below 0.6, though noisier above 0.6 where predictions are sparse. This suggests that the hyperparameter search and additional features have improved the predictions significantly.
We used several feature selection methods: the correlation matrix, XGBoost feature importance weights, the chi-squared test, and SHAP values.
The selected features are the top 10 most important according to the picture listed below:
For reference, the correlation matrix is also included:
The SHAP value importances are as follows:
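As a rough sketch of two of these selection methods (correlation with the target and the chi-squared test), here is scikit-learn's `SelectKBest` on synthetic data; the feature names and the dependence on `feat_0`/`feat_3` are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.uniform(0, 1, size=(1000, 6)),
                 columns=[f"feat_{i}" for i in range(6)])
# Make the target depend on feat_0 and feat_3 so they should rank highest.
y = (rng.random(1000) < 0.2 * X["feat_0"] + 0.3 * X["feat_3"]).astype(int)

# Correlation of each feature with the target.
corr = X.assign(goal=y).corr()["goal"].drop("goal")

# Chi-squared scores (chi2 requires non-negative features).
selector = SelectKBest(chi2, k=2).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
print(corr.sort_values(ascending=False))
print("chi2 picks:", selected)
```

On the real shot data, XGBoost importances and SHAP values give complementary, model-based rankings on top of these statistical tests.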
The AUC is 0.76, showing no deterioration from the full model and suggesting that the slimmed-down model's predictive power is as good as the full model's.
For the logistic regression model, GridSearchCV was used to select the C parameter and the number of PCA components. In the pipeline, the data is first standardized, then PCA is applied, and finally the logistic regression model is fitted. The resulting best parameters were PCA n_components=90 and C=10000.
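A sketch of this pipeline, with toy data and a much smaller grid than the real search (which went up to 90 PCA components):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))              # stand-in for the preprocessed features
y = (rng.random(500) < 0.1).astype(int)

pipe = Pipeline([
    ("scale", StandardScaler()),            # standardize first
    ("pca", PCA()),                         # then reduce dimensionality
    ("lr", LogisticRegression(max_iter=1000)),
])

grid = GridSearchCV(pipe,
                    {"pca__n_components": [5, 10, 15],
                     "lr__C": [1, 100, 10000]},
                    scoring="roc_auc", cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

Putting the scaler and PCA inside the pipeline ensures they are refitted on each cross-validation fold, avoiding leakage from the validation folds.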
Below are the plots for the logistic regression model.
For the SVC model, GridSearchCV was used to select the C and kernel parameters. The data is first standardized in the pipeline, then PCA (n_components=85) is applied, and finally the SVC model is fitted. The resulting best parameters were C=10 and kernel='rbf'.
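The SVC pipeline with the tuned parameters might look like the following sketch (far fewer PCA components here, since the toy data has only 20 features):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (rng.random(300) < 0.1).astype(int)

# Standardize, reduce with PCA (85 components in the real run), then the
# tuned SVC; probability=True enables predict_proba for ROC/calibration plots.
svc = make_pipeline(StandardScaler(),
                    PCA(n_components=10),
                    SVC(C=10, kernel="rbf", probability=True))
svc.fit(X, y)
print(svc.predict_proba(X[:3]))
```

Note that `probability=True` fits an extra internal cross-validated calibration step, which makes training noticeably slower on large shot datasets.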
For the random forest model, the data is fed directly into the model with default parameters. The random forest model has an AUC of 0.76, the highest among all models.
For the AdaBoost model, GridSearchCV was used to select the n_estimators and learning_rate parameters. The data is fed directly into the model. The resulting best parameters were n_estimators=10 and learning_rate=0.01. The AdaBoost model has an AUC of 0.68.
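A sketch of the AdaBoost search on toy data, with grid values chosen to include the reported best parameters; tree-based models need no scaling or PCA, so the features go in raw:

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (rng.random(500) < 0.1).astype(int)

# Search over ensemble size and learning rate; data is used directly.
grid = GridSearchCV(AdaBoostClassifier(random_state=0),
                    {"n_estimators": [10, 50, 100],
                     "learning_rate": [0.01, 0.1, 1.0]},
                    scoring="roc_auc", cv=3)
grid.fit(X, y)
print(grid.best_params_)
```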
Below are the plots for the adaboost model.
For the MLP model, GridSearchCV was used to select hidden_layer_sizes. The data is first standardized in the pipeline, then PCA is applied, and finally the MLP model is fitted. The resulting best parameter was hidden_layer_sizes=(100, 100).
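A sketch of the MLP pipeline with the selected layer sizes, on toy data (the PCA component count and iteration limit here are illustrative):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (rng.random(400) < 0.1).astype(int)

# Standardize, reduce with PCA, then the two-hidden-layer MLP from the search.
mlp = make_pipeline(StandardScaler(),
                    PCA(n_components=10),
                    MLPClassifier(hidden_layer_sizes=(100, 100),
                                  max_iter=500, random_state=0))
mlp.fit(X, y)
print(mlp.predict_proba(X[:3])[:, 1])   # predicted goal probabilities
```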
The MLP model shows a smooth curve on the shot probability model percentile plots. It has an AUC of 0.74.
Below are the plots for the MLP model.
Finally, for the voting ensemble model, we combined the previous random forest, MLP, logistic regression and AdaBoost models with equal weights, fitted with soft voting. The ensemble was the final model used on the test set, in the hope of better generalization performance. It has an AUC of 0.76, as good as the random forest model.
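A sketch of the soft-voting ensemble on toy data, with simplified base-model hyperparameters relative to the tuned runs described above:

```python
import numpy as np
from sklearn.ensemble import (VotingClassifier, RandomForestClassifier,
                              AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (rng.random(500) < 0.1).astype(int)

# Equal-weight soft voting averages the predicted probabilities of the
# four base models instead of taking a majority vote on hard labels.
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("mlp", MLPClassifier(max_iter=500, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
        ("ada", AdaBoostClassifier(n_estimators=10, learning_rate=0.01,
                                   random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X, y)
print(ensemble.predict_proba(X)[:3, 1])
```

Soft voting requires every base estimator to expose `predict_proba`, which all four of these models do.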
Below are the plots for the voting ensemble model.
The precision for most of the models is close to 1, except for SVC (0.6). The F1-score is around 0.09 for most models, except for SVC (0.12), and the recall is around 0.05, except for SVC (0.067). These scores are low because of the large number of false negatives: the models find it difficult to learn which shots lead to a goal.
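A small worked example (with made-up counts) of why near-perfect precision can coexist with an F1 around 0.09: when a model flags very few positives, the low recall dominates the harmonic mean.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical counts mirroring the reported pattern: 100 true goals among
# 1000 shots; the model flags only 5 shots as goals, and all 5 are correct.
y_true = [1] * 100 + [0] * 900
y_pred = [1] * 5 + [0] * 95 + [0] * 900

print(precision_score(y_true, y_pred))  # 5/5 = 1.0
print(recall_score(y_true, y_pred))     # 5/100 = 0.05
print(f1_score(y_true, y_pred))         # 2*1.0*0.05/1.05 ≈ 0.095
```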
Experiment links on comet.ml for each of the models described above.
For models tested on the regular season test set, the results are the following:
The voting ensemble and XGBoost models performed similarly on the regular season test set as on the validation set, and in fact slightly better. Their AUC scores are 0.76 and 0.77 respectively, on par with our results on the training data. This shows that the models generalize satisfactorily.
For models tested on the playoffs season test set, the results are the following:
In conclusion, the voting ensemble model seems to generalize well: its performance on the playoff test set is comparable to its performance on the validation set drawn from regular season data. The ensemble's AUC is 0.70, a slight drop from our regular season prediction, but still a satisfactory performance. However, the fine-tuned XGBoost model produced a better ROC curve, with an AUC of 0.74, and thus appears to generalize better than the ensemble.
tags: Hockey - Goal - Predictions