Predicting NBA game outcomes using machine learning
The objective of every coach is to win. Pre-game and in-game, they must decide which players deserve more minutes so the team can improve in key performance areas, and by extension, which areas should be focused on.
My goal was to predict ‘win’ or ‘loss’, taking measures of performance as input. The final model correctly classified 84.44% of unseen game data.
Coaches can estimate team performance categories, use this model to predict the outcome, and then adjust their lineups to target the categories under their control that yield winning predictions.
My assumptions are that coaches can aggregate each player’s season averages to accurately estimate:
- The performance of the opposing lineup
- The performance of their own lineup
The dataset (games.csv) contains information on 23,195 NBA games from every season between 2003 and 2019.
This was randomly split into a train set (80% of data) to learn from, and a test set (20% of data) to evaluate performance. Further, each set was split into features (X_train, X_test) and targets (y_train, y_test).
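This split can be sketched with scikit-learn's train_test_split. The toy DataFrame below is a stand-in for games.csv (the real data has 23,195 rows), with column names taken from the features described in this post:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in for pd.read_csv("games.csv")
games = pd.DataFrame({
    "FG_PCT_home": [0.45, 0.50, 0.48, 0.52, 0.41, 0.47, 0.49, 0.44, 0.53, 0.46],
    "FG_PCT_away": [0.44, 0.42, 0.51, 0.40, 0.49, 0.45, 0.43, 0.50, 0.39, 0.48],
    "HOME_TEAM_WINS": [1, 1, 0, 1, 0, 1, 1, 0, 1, 0],
})

X = games.drop(columns=["HOME_TEAM_WINS"])  # features
y = games["HOME_TEAM_WINS"]                 # target

# 80/20 random split; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```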
For each game (row), we have the following features (columns) for both the home and away teams (suffixed _home and _away):
- FG_PCT_away: % of made shots (excluding free throws)
- FT_PCT_away: % of made free throws
- FG3_PCT_away: % of made 3-point shots
- AST_away: # of assists
- REB_away: # of rebounds
I wanted to predict the HOME_TEAM_WINS column (target), assuming that ours is always the home team. A 0 represents a loss, and a 1 represents a win.
The train set was missing data for 77 games. Estimating each feature for these games would detract from the analysis, so they were dropped. I also dropped one game with an erroneous 36–33 score.
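Dropping incomplete games can be sketched with pandas' dropna; the toy frame below stands in for the real train set:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the train set; NaN marks a missing feature value
train = pd.DataFrame({
    "FG_PCT_home": [0.48, np.nan, 0.51, 0.46],
    "REB_away":    [44.0, 39.0, np.nan, 41.0],
    "HOME_TEAM_WINS": [1, 0, 1, 0],
})

# Drop any game (row) with at least one missing value instead of estimating it
train = train.dropna()
```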
The following histograms display, for each feature, the count of values in each bin (value range). Most are centered, meaning there are no values far outside the middle that skew the distribution (potential outliers).
There are some games with very low FT_PCT, shifting the distribution right. These are real: NBA games typically have high FT_PCT, with occasional bad performances. They were kept, because eliminating them would artificially divorce our data from reality.
Some games had high REB, shifting the following distributions left, but they too were kept. Teams usually shoot a low FG3_PCT, but with few 3-point attempts an accuracy of 1.0 is possible. Likewise, overtime games allow teams to collect an unusually high number of rebounds.
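The binning behind these histograms can be sketched with NumPy. The FT_PCT values below are synthetic, with two artificially low performances mimicking the outliers just described:

```python
import numpy as np

# Synthetic FT_PCT values: mostly around 0.77, plus two genuinely bad nights
rng = np.random.default_rng(2)
ft_pct = np.concatenate([
    np.clip(rng.normal(0.77, 0.08, 500), 0.0, 1.0),
    [0.25, 0.30],  # low outliers kept in the data, as discussed above
])

# Count how many values fall in each of 20 bins (value ranges)
counts, bin_edges = np.histogram(ft_pct, bins=20)
```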
Outcome Splits by Feature
I then separated the histograms by outcome, where orange represents values recorded in wins, and blue values recorded in losses.
FG_PCT_home shows that games with high efficiency are overwhelmingly won (orange).
FG_PCT_away is the reverse; games where the opponents had high efficiency are overwhelmingly lost (blue). Similar splits appear for the following features, but the relationships are weaker.
Input transformations improve the model’s ability to learn.
preproc_pipe takes the inputs, fills in missing values with the median of that feature from the train set, and then rescales every feature to a mean of 0 and a standard deviation of 1. This second step (standardization) prevents the model from artificially favoring features like REB_away simply because their raw values are larger.
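A minimal sketch of what such a pipeline likely looks like, assuming scikit-learn's SimpleImputer and StandardScaler (the toy inputs below are made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumed shape of preproc_pipe: median imputation, then standardization
preproc_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill gaps with train medians
    ("scale", StandardScaler()),                   # mean 0, std 1 per feature
])

# Toy inputs: one FG_PCT-like column, one REB-like column, one missing value
X_train = np.array([
    [0.48, 44.0],
    [0.52, 39.0],
    [np.nan, 41.0],
])
X_scaled = preproc_pipe.fit_transform(X_train)
```

After fitting, every column of X_scaled has mean 0 and standard deviation 1, so no feature dominates purely by scale.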
After trying multiple approaches (such as RandomForestClassifier) and settings (hyperparameters) for each, the best model was LogisticRegression with default settings.
LogisticRegression passes a linear equation through a sigmoid function so that its outputs (y) lie between 0 and 1. Outputs ≥ 0.5 are classified as wins (1) and outputs < 0.5 as losses (0). During training, coefficients are iteratively tweaked to minimize the difference between actual and predicted values. The trained model can then multiply unseen inputs by its learned coefficients and predict.
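The mechanics can be sketched in a few lines; the coefficients below are made up for illustration, not taken from the trained model:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up coefficients (w) and intercept (b) for two features:
# FG_PCT_home pushes toward a win, FG_PCT_away pushes toward a loss
w = np.array([8.0, -8.0])
b = 0.0

x = np.array([0.52, 0.44])      # one game's inputs
p_win = sigmoid(w @ x + b)      # output between 0 and 1
prediction = int(p_win >= 0.5)  # 1 = win, 0 = loss
```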
The final line splits train into 10 chunks. For each of 10 iterations, the model preprocesses the inputs, then trains LogisticRegression on 9 of the 10 chunks before validating performance on the 10th (unseen) chunk. Using a different validation chunk in each iteration allows for more training on limited data, and increases confidence in the results.
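This 10-fold procedure can be sketched with scikit-learn's cross_validate. The data below is synthetic, so the scores are illustrative only; note that putting preprocessing and classifier in one pipeline ensures each fold is preprocessed using statistics from its own 9 training chunks:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Preprocessing + classifier in one pipeline
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Synthetic stand-in for the train set: 200 games, 5 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# 10 iterations, each validating on a different unseen chunk
scores = cross_validate(model, X, y, cv=10, scoring="accuracy",
                        return_train_score=True)
```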
This model should beat the baseline accuracy (% of correct predictions) of 59%, obtained by always predicting the most frequent outcome: a home win.
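That baseline corresponds to scikit-learn's DummyClassifier with the most_frequent strategy; the labels below are synthetic, constructed to reproduce the 59% majority class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels built so that 59% are home wins, matching the dataset's majority class
y = np.array([1] * 59 + [0] * 41)
X = np.zeros((100, 1))  # features are ignored by the dummy model

# Always predict the most frequent outcome (a home win)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)  # 0.59
```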
Above, we see that average scores on the validation chunks are close to average scores on the training chunks. This means our model has not overfit; it performs similarly on unseen data. On the train set:
- It correctly classified 83.83% of games (accuracy)
- Of ‘win’ classifications, 85.61% were truly won (precision)
- It identified 87.52% of actual wins (recall)
The variation in accuracy across validation chunks (standard deviation) was small: 0.007263.
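The three metrics reported above can be illustrated on a toy set of labels (the numbers here are made up, not the model's):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels for eight games: 1 = win, 0 = loss
y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # actual outcomes
y_pred = [1, 1, 0, 0, 1, 1, 0, 1]  # model classifications

acc = accuracy_score(y_true, y_pred)    # fraction of games classified correctly
prec = precision_score(y_true, y_pred)  # of predicted wins, fraction truly won
rec = recall_score(y_true, y_pred)      # of actual wins, fraction identified
```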
On the 20% of data the model had never seen, it correctly classified 84.44% of game outcomes, which was close (within one standard deviation) to the train set accuracy (83.83%). This again builds confidence that our model has not overfit.
The visualization below shows the impact of each feature (its coefficient) on the prediction, something that might interest coaches.
Larger positive coefficients push the model toward predicting wins, whereas negative coefficients push it toward predicting losses. As before, FG_PCT had the highest impact on predictions: greater FG_PCT_home influences the model to predict ‘win’, and greater FG_PCT_away influences it to predict ‘loss’.
Features such as AST_away were the least important because they had the smallest magnitude of impact.
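Extracting and ranking coefficients can be sketched as below. The feature list and data are synthetic, chosen so the signal mirrors the pattern described (FG_PCT matters, AST_away barely does):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature list; only the two FG_PCT columns carry signal here
features = ["FG_PCT_home", "FG_PCT_away", "AST_away"]
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=500) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
coefs = clf.coef_[0]  # one coefficient per feature

# Rank features by the magnitude of their coefficient (impact on prediction)
ranked = sorted(zip(features, coefs), key=lambda t: abs(t[1]), reverse=True)
```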
Given the starting assumptions, accurate models might not help decision-making if coaches poorly estimate the performances of lineups.
Other features could significantly impact outcomes, especially defensive statistics like steals and blocks. A richer dataset could mitigate this weakness.
New rules and strategies constantly redefine the NBA, like the recent popularity of the 3-point shot. Changes in gameplay trends can reduce future predictive power, as the data becomes fundamentally different.
Full code available here.