Winning Basketball

Sherwyn D’Souza
6 min read · Dec 1, 2020

Predicting NBA game outcomes using machine learning

📸 Photo by JC Gellidon on Unsplash

The objective of every coach is to win. Before and during a game, they must decide which players deserve more minutes so the team can improve in key performance areas and, by extension, which areas to focus on.

My goal was to predict ‘win’ or ‘loss’, taking measures of performance as input. The final model correctly classified 84.44% of unseen game data.

Coaches can estimate team performance categories, use this model to predict the outcome, then shuffle their lineups to target certain categories in their control that will output winning predictions.

My assumptions are that coaches can aggregate each player’s season averages to accurately estimate:

  1. The performance of the opposing lineup
  2. The performance of their own lineup

Dataset

The dataset (games.csv) contains information on 23,195 NBA games from every season from 2003 to 2019.

Sample of the dataset and its relevant features

This was randomly split into a train set (80% of data) to learn from, and a test set (20% of data) to evaluate performance. Each set was then split into features (X_train/X_test) and target (y_train/y_test).

Features

For each game (row), we have the following features (columns) for both home and away teams:

  • FG_PCT_home/FG_PCT_away: % of made shots (excluding free throws)
  • FT_PCT_home/FT_PCT_away: % of made free throws
  • FG3_PCT_home/FG3_PCT_away: % of made 3-point shots
  • AST_home/AST_away: # of assists
  • REB_home/REB_away: # of rebounds

I wanted to predict the HOME_TEAM_WINS column (target), assuming that ours is always the home team. A 0 represents a loss, and a 1 represents a win.
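The split described under Dataset can be sketched roughly as follows. This is a minimal illustration, not the article's original code: the data frame is a random stand-in for games.csv so the snippet runs on its own, and the seed value is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Column names as listed in the Features section.
FEATURES = [
    "FG_PCT_home", "FT_PCT_home", "FG3_PCT_home", "AST_home", "REB_home",
    "FG_PCT_away", "FT_PCT_away", "FG3_PCT_away", "AST_away", "REB_away",
]
TARGET = "HOME_TEAM_WINS"

# Synthetic stand-in for the real data (assumption, for illustration only).
# With the real file this would be: games = pd.read_csv("games.csv")
rng = np.random.default_rng(0)
games = pd.DataFrame(rng.random((1000, len(FEATURES))), columns=FEATURES)
games[TARGET] = rng.integers(0, 2, size=len(games))

# Random 80/20 split; random_state is an arbitrary seed for reproducibility.
train, test = train_test_split(games, test_size=0.2, random_state=42)

# Separate features (X) from the target (y) in each set.
X_train, y_train = train[FEATURES], train[TARGET]
X_test, y_test = test[FEATURES], test[TARGET]
```

Splitting before any cleaning or scaling keeps the test set untouched until the final evaluation.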

Dropped Games

The train set was missing data for 77 games. Estimating each feature of these games would detract from the analysis, so they were dropped. I also dropped one game with an erroneous 36–33 score.
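As a rough sketch of this cleaning step, the snippet below drops rows with missing features and one row with an implausible score. The tiny data frame is made up, and the score column names PTS_home and PTS_away are assumptions that do not appear in this article.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame standing in for the train set.
train = pd.DataFrame({
    "FG_PCT_home": [0.45, np.nan, 0.51, 0.30],
    "REB_home": [44.0, 41.0, np.nan, 20.0],
    "PTS_home": [101.0, 98.0, 110.0, 36.0],  # assumed column name
    "PTS_away": [95.0, 102.0, 99.0, 33.0],   # assumed column name
    "HOME_TEAM_WINS": [1, 0, 1, 1],
})
feature_cols = ["FG_PCT_home", "REB_home"]

# Drop games where any feature value is missing.
cleaned = train.dropna(subset=feature_cols)

# Drop the game recorded with a 36-33 final score.
bad_score = (cleaned["PTS_home"] == 36) & (cleaned["PTS_away"] == 33)
cleaned = cleaned[~bad_score]
```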

Outliers

The following histograms display, for each feature, the count of values in a bin (value range). Most are centered, meaning there are no values far outside the middle that shift the distribution (potential outliers).

There are some games with very low FT_PCT, which give the distribution a long left tail and push its bulk to the right. These are real; NBA games typically have high FT_PCT with occasional bad performances. They were kept because eliminating them would artificially divorce our data from reality.

Some games had high FG3_PCT or REB, which give the following distributions a long right tail and push their bulk to the left, but they too were kept. Teams usually shoot a low FG3_PCT, but with few 3-point attempts, an accuracy of 1.0 is possible. Likewise, overtime games allow teams to collect unusually many rebounds.

Outcome Splits by Feature

I then separated histograms by outcome, where orange represents values recorded in wins and blue represents values recorded in losses.

FG_PCT_home shows that games with high efficiency are overwhelmingly won (orange). FG_PCT_away is the reverse; games where the opponents had high efficiency are overwhelmingly lost (blue). Similar splits appear for the following features, but the relationship is weaker.

Model

Preprocessing

Input transformations improve the model’s ability to learn.

The preproc_pipe takes inputs, fills in missing values with the median of that feature from the train set, and then rescales every feature to have a mean of 0 and a standard deviation of 1 (standardization). This second step (scaling) changes only the scale of each feature, not the shape of its distribution, and prevents the model from artificially favoring features like REB_home/REB_away that have larger raw values.
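A pipeline like this could be written with scikit-learn as shown below. This is a sketch under assumptions rather than the original code: SimpleImputer and StandardScaler are one plausible way to implement the two steps, and the small arrays are hypothetical values (column 0 resembles FG_PCT, column 1 resembles REB).

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: fill missing values with the train-set median of each feature.
# Step 2: rescale each feature to mean 0 and standard deviation 1.
preproc_pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
)

# Hypothetical training values with one missing rebound count.
X_train = np.array([[0.40, 38.0], [0.45, np.nan], [0.50, 46.0], [0.55, 42.0]])
X_train_scaled = preproc_pipe.fit_transform(X_train)

# Unseen data is transformed with the medians, means, and standard
# deviations learned from the train set; nothing is refit here.
X_new = np.array([[np.nan, 42.0]])
X_new_scaled = preproc_pipe.transform(X_new)
```

Fitting the pipeline only on the train set avoids leaking information from unseen games into the preprocessing statistics.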

Sample of processed training features

Model Selection

After trying multiple approaches (LGBMClassifier, LogisticRegression, RandomForestClassifier) and settings (hyperparameters) for each, the best model was LogisticRegression with default settings.

LogisticRegression computes a weighted sum of the inputs and passes it through the logistic (sigmoid) function, so the outputs (y) lie between 0 and 1 and can be read as the estimated probability of a win. Outputs ≥ 0.5 are classified as wins (1) and outputs < 0.5 as losses (0). During training, the coefficients are adjusted iteratively to minimize the difference between actual and predicted values. The trained model can then apply those coefficients to unseen inputs and predict.
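To make the prediction step concrete, here is a small NumPy sketch of how a trained logistic regression turns inputs into a win or loss label. The coefficients, intercept, and inputs are made-up numbers chosen for illustration, and predict_win is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_win(X, coef, intercept, threshold=0.5):
    """Return win probabilities and 0/1 labels for each row of X."""
    z = X @ coef + intercept           # weighted sum of the features
    proba = sigmoid(z)                 # squash into (0, 1)
    labels = (proba >= threshold).astype(int)
    return proba, labels

# Made-up coefficients for two scaled features: FG_PCT_home, FG_PCT_away.
coef = np.array([1.5, -1.5])
intercept = 0.3

# Row 0: home team shot well, opponent poorly. Row 1: the reverse.
X = np.array([[1.0, -1.0], [-1.0, 1.0]])
proba, labels = predict_win(X, coef, intercept)
```

The first row produces a probability well above 0.5 and is labeled a win; the second falls well below 0.5 and is labeled a loss.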

The final line splits the train set into 10 chunks (10-fold cross-validation). For each of 10 iterations, the model preprocesses the inputs, then trains LogisticRegression on 9 of the 10 chunks before validating performance on the 10th (unseen) chunk. Using a different validation chunk each iteration allows for more training on limited data, and increased confidence in results.
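The procedure described here resembles scikit-learn's cross_validate with cv=10. The sketch below assumes that function and a SimpleImputer/StandardScaler preprocessing pipeline; the data comes from make_classification as a synthetic stand-in, so the scores it produces say nothing about NBA games.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption, illustration only).
X_train, y_train = make_classification(
    n_samples=500, n_features=10, random_state=0
)

# Preprocessing lives inside the pipeline, so it is refit on the
# 9 training chunks in every iteration and never sees the validation chunk.
model = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(),  # default settings
)

scores = cross_validate(
    model, X_train, y_train,
    cv=10,                                        # 10 chunks
    scoring=["accuracy", "precision", "recall"],
    return_train_score=True,                      # to compare against validation
)

# cross_validate names validation scores "test_*".
mean_val_accuracy = scores["test_accuracy"].mean()
std_val_accuracy = scores["test_accuracy"].std()
```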

Results

Train Set

This model should beat the baseline accuracy (% correct predictions) of 59% (obtained by always predicting the most frequent outcome, wins).
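A baseline of this kind can be computed with scikit-learn's DummyClassifier, as in the sketch below. The labels are hypothetical, constructed so that 59% of them are wins to mirror the figure above; they are not the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 59 wins (1) and 41 losses (0).
y = np.array([1] * 59 + [0] * 41)
X = np.zeros((len(y), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
```

Any useful model should score clearly above this number, since the baseline uses no information about the game at all.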

Above, we see average scores on the validation chunk are close to average scores on the training chunks. This means our model has not overfit; it performs similarly on unseen data. In the train set:

  • It correctly classified 83.83% of games (validation_accuracy)
  • Of ‘win’ classifications, 85.61% were truly won (validation_precision)
  • It identified 87.52% of actual wins (validation_recall)
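For readers less familiar with these three metrics, the following sketch computes them from confusion-matrix counts. The counts are invented for illustration and are unrelated to the model's actual results; classification_metrics is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion counts.

    tp: wins predicted as wins      fp: losses predicted as wins
    fn: wins predicted as losses    tn: losses predicted as losses
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all games classified correctly
    precision = tp / (tp + fp)                  # share of predicted wins that were real wins
    recall = tp / (tp + fn)                     # share of real wins that were found
    return accuracy, precision, recall

# Made-up counts for 100 games.
accuracy, precision, recall = classification_metrics(tp=50, fp=10, fn=5, tn=35)
```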

The standard deviation of accuracy across validation chunks was small: 0.007263.

Test Set

On the 20% of data the model had never seen, it correctly classified 84.44% of game outcomes, which was within one standard deviation of the validation accuracy on the train set (83.83%). This again builds confidence that our model has not overfit.

Feature Importances

The visualization below shows the impact of each feature (its coefficient) on prediction, something that might interest coaches.

Positive coefficients push the model toward predicting wins, whereas negative coefficients push it toward predicting losses; the larger the magnitude, the stronger the push. As in the outcome-split histograms, FG_PCT had the highest impact on predictions; greater FG_PCT_home influences the model to predict ‘win’, and greater FG_PCT_away influences the model to predict ‘loss’. AST_home/AST_away were the least important features because their coefficients had the smallest magnitude.
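Coefficients like these can be read straight off a fitted scikit-learn model through its coef_ attribute. The sketch below uses synthetic data in which the outcome is deliberately driven by two features and barely by a third; the feature names are borrowed from the article, but the numbers are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
names = ["FG_PCT_home", "FG_PCT_away", "AST_home"]

# Synthetic features and outcome (assumption, illustration only):
# strong positive effect, strong negative effect, and a weak effect.
X = rng.normal(size=(2000, 3))
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * X[:, 2]
y = (logits + rng.logistic(size=2000) > 0).astype(int)

# Scale first so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(X_scaled, y)

# One coefficient per feature; sort by absolute value to rank impact.
coefs = pd.Series(clf.coef_[0], index=names)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
```

Comparing magnitudes this way is only meaningful because the features were standardized; on raw values, a feature measured in larger units would receive a smaller coefficient for the same effect.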

Caveats

Assumptions

Given the starting assumptions, accurate models might not help decision-making if coaches poorly estimate the performances of lineups.

Features

Other features could significantly impact outcomes, especially defensive statistics like steals and blocks. A richer dataset could mitigate this weakness.

Fundamental Changes

New rules and strategies constantly redefine the NBA, like the recent popularity of the 3-point shot. Changes in gameplay trends can reduce future predictive power, as the data becomes fundamentally different.

Full code available here.

Sherwyn D’Souza

Software Engineer, CS and ML Student, and open-source enthusiast. Deeply passionate about algorithms, data science, fintech, and economics.