Predict for Score

library(tidyverse)
library(hexbin)
library(jsonlite)
library(httr)
library(patchwork)
library(ggplot2)
library(corrplot)
library(stringr)
library(caret)
library(plotly)
library(ggridges)
library(GGally)
library(modelr)

knitr::opts_chunk$set(
  fig.width = 6,
  fig.asp = .6,
  out.width = "90%"
)

theme_set(theme_minimal() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)

scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d

box_score_all = read_csv("./data2/box_score_all.csv")

regre_df = 
  box_score_all %>%
  select(-c(1:7)) %>%
  select(-ends_with("rank")) %>%
  mutate(wl = recode(wl, "W" = 1, "L" = 0),
         wl = as.factor(wl))

Regression Exploration

In this part, we explore each game in the past 20 years, try to find some important variables that might affect the result of the game. By this process, we also can get some insight on choosing potential parameters for model building. Since we use the smae data set as the data we use to build logistic regression, the Exploratory Analysis part can just click here to see the Exploratory Analysis.

Correlation Matrix

Further, we can draw a correlation map to exam correlation of variables to help us select variables when building model.

regre_df = 
  regre_df %>%
  mutate(
    min = as.numeric(min),
    fgm = as.numeric(fgm),
    fga = as.numeric(fga),
    fg_pct = as.numeric(fg_pct),
    fg3m = as.numeric(fg3m),
    fg3a = as.numeric(fg3a),
    fg3_pct = as.numeric(fg3_pct),
    ftm = as.numeric(ftm),
    fta = as.numeric(fta),
    ft_pct = as.numeric(ft_pct),
    oreb = as.numeric(oreb),
    dreb = as.numeric(dreb),
    reb = as.numeric(reb),
    ast = as.numeric(ast),
    tov = as.numeric(tov),
    stl = as.numeric(stl),
    blk = as.numeric(blk),
    blka = as.numeric(blka),
    pf = as.numeric(pf),
    pfd = as.numeric(pfd),
    pts = as.numeric(pts),
    plus_minus = as.numeric(plus_minus)
  )
corr <- cor(regre_df[-1])
corrplot(corr, method = "square", order = "FPC")

After ruling out variables with strong correlation, we include variables as follow: Dependent variable is the score of each game, denoted by pts (points). Independent variables are selected from both offensive aspect and defensive aspect.

For the offensive level, variables include:

fg3_pct: proportion of three points shooting
fg_pct: proportion of field goals attempted
fg_pct: proportion of free throw
oreb: average offensive rebounds per game
ast: average assists per games

As for the defensive level, variables include:

stl: steals of each game
blk: blocks of each game
dreb: defensive rebounds of each game
tov: turnovers of each game
pf: personal foul of each game

Modelling

Use step function to choose a model by AIC in a Stepwise algorithm.

ln_regre = lm(pts ~fg_pct+fg3_pct+ft_pct+oreb+dreb+ast+stl+blk+tov+pf,data = regre_df)
summary(ln_regre)
linear.step = step(ln_regre,direction="both")

ln_regre %>% broom::tidy() %>% knitr::kable()

term	estimate	std.error	statistic
(Intercept)	-25.2294230	0.5159129	-48.90248
fg_pct	138.5948965	0.8033069	172.53044
fg3_pct	14.5641804	0.3258922	44.69017
ft_pct	24.2905146	0.3318872	73.18906
oreb	0.8025245	0.0092283	86.96336
dreb	0.5638889	0.0064921	86.85749
ast	0.4874214	0.0081360	59.90906
stl	0.5657252	0.0118704	47.65846
blk	-0.1525314	0.0133510	-11.42469
tov	-0.6829004	0.0088988	-76.74036
pf	0.4049123	0.0076670	52.81257

The adjusted R square for the full model is 0.6916, that is to say 69.16% of variances in the response variable can be explained by the predictors.

Model diagnostic

1). to check if the error term is normally distributed with mean 0.

ggplot(data = ln_regre , aes(x = ln_regre$residuals)) + geom_histogram()

Condition 1 is met.

2). to check if the error term is independent of the dependent variable.

ggplot(data = ln_regre, aes(x = ln_regre$fitted.values, y = ln_regre$residuals)) + geom_point() + geom_smooth(method = "lm")

Condition 2 is met as we cannot see an obvious tendency of errors.

Interpretation of model coefficients

Our final model for predicting game result is showing below.

\[Score=-25.229423 + 138.594897(fg_pct)+14.564180(fg3_pct)+24.290515(ft_pct)+0.802525(oreb)+\\ 0.563889(dreb)-0.487421(ast)+0.565725(stl)-0.152531(blk)-0.682900(tov)+0.404912(pf)\]

All variables selected are significant in this linear regression model.

For each additional 0.1 of proportion of field goals attempted, the points will increase 13.9.

For each additional 0.1 of proportion three points shooting, the points will increase 1.45.

For each additional 0.1 of proportion of free throw, the points will increase 2.43.

For each additional 1 of offensive rebounds per game, the points will increase 0.8.

For each additional 1 of defensive rebounds per games, the points will increase 0.56.

For each additional 1 of steals per game, the points will increase 0.57.

For each additional 1 of assists per game, the points will increase 0.45.

For each additional 1 of blocks per game, the points will decrease 0.15.

For each additional 1 of turnovers per game, the points will decrease 0.68.

For each additional 1 of personal foul per game, the points will decrease 0.4.

Model Conclusion

We have built a linear and logistic regression based on the NBA data. The adjusted R square for the linear regression model is 0.6916, which can explain the game score in a large extent and can help us to predict the result of a game more accurately.