library(tidyverse)
library(hexbin)
library(jsonlite)
library(httr)
library(patchwork)
library(ggplot2)
library(corrplot)
library(stringr)
library(caret)
library(plotly)
library(ggridges)
library(GGally)
library(modelr)
knitr::opts_chunk$set(
fig.width = 6,
fig.asp = .6,
out.width = "90%"
)
theme_set(theme_minimal() + theme(legend.position = "bottom"))
options(
ggplot2.continuous.colour = "viridis",
ggplot2.continuous.fill = "viridis"
)
scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d
box_score_all = read_csv("./data2/box_score_all.csv")
regre_df =
box_score_all %>%
select(-c(1:7)) %>%
select(-ends_with("rank")) %>%
mutate(wl = recode(wl, "W" = 1, "L" = 0),
wl = as.factor(wl))
In this part, we examine every game from the past 20 years to identify variables that may affect the result of a game. This process also gives us some insight into choosing candidate predictors for model building. Because we use the same data set as for the logistic regression, the exploratory analysis is not repeated here; see the Exploratory Analysis page.
Further, we draw a correlation map to examine the correlations among the variables, which helps us select variables when building the model.
regre_df =
regre_df %>%
mutate(
# convert every box-score column to numeric in one step
across(
c(min, fgm, fga, fg_pct, fg3m, fg3a, fg3_pct, ftm, fta, ft_pct,
oreb, dreb, reb, ast, tov, stl, blk, blka, pf, pfd, pts, plus_minus),
as.numeric
)
)
corr <- cor(regre_df[-1])
corrplot(corr, method = "square", order = "FPC")
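To complement the visual map, the strongly correlated pairs can also be listed numerically. The following is a minimal base R sketch on made-up toy data, not the NBA data; the toy columns, the object names, and the 0.8 cutoff are all assumptions chosen for illustration.

```r
# Toy data standing in for box-score columns (values are made up).
set.seed(1)
n <- 200
toy <- data.frame(fga = rnorm(n, 85, 7))
toy$fgm <- 0.45 * toy$fga + rnorm(n, 0, 1)  # tied to fga by construction
toy$stl <- rnorm(n, 8, 3)                   # unrelated to the other two

corr_toy <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above a hypothetical cutoff.
cutoff <- 0.8
idx <- which(abs(corr_toy) > cutoff & upper.tri(corr_toy), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr_toy)[idx[, "row"]],
  var2 = colnames(corr_toy)[idx[, "col"]],
  r    = corr_toy[idx]
)
high_pairs
```

Applied to the real correlation matrix, a table like `high_pairs` would show which variables overlap enough that only one of each pair should enter the model.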
After ruling out strongly correlated variables, we include the following variables. The dependent variable is the score of each game, denoted by pts (points). The independent variables are selected from both the offensive and the defensive side of the game.
For the offensive level, variables include:
As for the defensive level, variables include:
We use the step function to select a model by AIC with a stepwise algorithm.
ln_regre = lm(pts ~ fg_pct + fg3_pct + ft_pct + oreb + dreb + ast + stl + blk + tov + pf, data = regre_df)
summary(ln_regre)
linear.step = step(ln_regre, direction = "both")
ln_regre %>% broom::tidy() %>% knitr::kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | -25.2294230 | 0.5159129 | -48.90248 | 0 |
fg_pct | 138.5948965 | 0.8033069 | 172.53044 | 0 |
fg3_pct | 14.5641804 | 0.3258922 | 44.69017 | 0 |
ft_pct | 24.2905146 | 0.3318872 | 73.18906 | 0 |
oreb | 0.8025245 | 0.0092283 | 86.96336 | 0 |
dreb | 0.5638889 | 0.0064921 | 86.85749 | 0 |
ast | 0.4874214 | 0.0081360 | 59.90906 | 0 |
stl | 0.5657252 | 0.0118704 | 47.65846 | 0 |
blk | -0.1525314 | 0.0133510 | -11.42469 | 0 |
tov | -0.6829004 | 0.0088988 | -76.74036 | 0 |
pf | 0.4049123 | 0.0076670 | 52.81257 | 0 |
The adjusted R-squared for the full model is 0.6916; that is, the predictors explain about 69.16% of the variance in the response variable.
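For readers who want to see where the adjusted figure comes from, the sketch below recomputes it by hand from the ordinary R-squared. It uses the built-in `mtcars` data purely as a stand-in (an assumption for illustration, not the NBA data), and the object names are hypothetical.

```r
# Built-in mtcars data used purely for illustration (not the NBA data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

n <- nrow(mtcars)           # number of observations
p <- length(coef(fit)) - 1  # number of predictors, excluding the intercept

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

c(r.squared = s$r.squared,
  adj_manual = adj_r2_manual,
  adj_summary = s$adj.r.squared)
```

With many thousands of games and only ten predictors, the penalty term is tiny, so the adjusted and unadjusted values should be nearly identical for the model above.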
1). Check whether the error term is normally distributed with mean 0.
ggplot(data = broom::augment(ln_regre), aes(x = .resid)) + geom_histogram()
Condition 1 is met.
2). Check whether the error term is independent of the fitted values.
ggplot(data = broom::augment(ln_regre), aes(x = .fitted, y = .resid)) + geom_point() + geom_smooth(method = "lm")
Condition 2 is met, as the residuals show no obvious trend across the fitted values.
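One caveat on these two checks: for a least-squares fit with an intercept, the residuals have mean zero and zero linear correlation with the fitted values by construction, so a straight-line smoother through the residual plot is flat for any model. The sketch below demonstrates this on the built-in `mtcars` data (a stand-in, not the NBA data) and adds two complementary views, a normal Q-Q plot and a lowess curve, that can reveal non-normality or curvature. The object names are hypothetical.

```r
# Built-in mtcars data used purely for illustration (not the NBA data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- resid(fit)
fit_vals <- fitted(fit)

# Both quantities are zero (up to rounding) for any OLS fit with an intercept.
mean_res <- mean(res)
cor_res_fit <- cor(fit_vals, res)

# Normal Q-Q plot: points close to the line suggest approximately normal residuals.
qqnorm(res)
qqline(res)

# Residuals vs fitted values with a lowess curve, which can bend where a line cannot.
plot(fit_vals, res, xlab = "Fitted values", ylab = "Residuals")
lines(lowess(fit_vals, res), col = "red")
abline(h = 0, lty = 2)
```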
Our final model for predicting the game score is shown below.
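\[
\begin{aligned}
\mathrm{Score} = {} & -25.229423 + 138.594897(\mathrm{fg\_pct}) + 14.564180(\mathrm{fg3\_pct}) + 24.290515(\mathrm{ft\_pct}) + 0.802525(\mathrm{oreb}) \\
& + 0.563889(\mathrm{dreb}) + 0.487421(\mathrm{ast}) + 0.565725(\mathrm{stl}) - 0.152531(\mathrm{blk}) - 0.682900(\mathrm{tov}) + 0.404912(\mathrm{pf})
\end{aligned}
\]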
All selected variables are significant in this linear regression model. Each interpretation below holds the other predictors constant.
For each additional 0.1 in field goal percentage, the predicted points increase by 13.9.
For each additional 0.1 in three-point percentage, the predicted points increase by 1.46.
For each additional 0.1 in free throw percentage, the predicted points increase by 2.43.
For each additional offensive rebound per game, the predicted points increase by 0.80.
For each additional defensive rebound per game, the predicted points increase by 0.56.
For each additional steal per game, the predicted points increase by 0.57.
For each additional assist per game, the predicted points increase by 0.49.
For each additional block per game, the predicted points decrease by 0.15.
For each additional turnover per game, the predicted points decrease by 0.68.
For each additional personal foul per game, the predicted points increase by 0.40.
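As a worked illustration of how these coefficients combine, the sketch below plugs a single box score into the fitted equation using the estimates from the coefficient table. The game values are made up for illustration, and `predict_pts` is a hypothetical helper, not part of the original analysis.

```r
# Estimates copied from the coefficient table above.
coefs <- c(
  "(Intercept)" = -25.2294230,
  fg_pct  = 138.5948965,
  fg3_pct =  14.5641804,
  ft_pct  =  24.2905146,
  oreb    =   0.8025245,
  dreb    =   0.5638889,
  ast     =   0.4874214,
  stl     =   0.5657252,
  blk     =  -0.1525314,
  tov     =  -0.6829004,
  pf      =   0.4049123
)

# Hypothetical single-game box score (made-up values, for illustration only).
game <- c(fg_pct = 0.45, fg3_pct = 0.35, ft_pct = 0.75, oreb = 10, dreb = 32,
          ast = 22, stl = 8, blk = 5, tov = 14, pf = 20)

# Hypothetical helper: intercept plus the sum of coefficient * value.
predict_pts <- function(stats, coefs) {
  unname(coefs["(Intercept)"] + sum(coefs[names(stats)] * stats))
}

pred <- predict_pts(game, coefs)

# Effect of raising fg_pct by 0.1 with everything else unchanged.
game_hot <- game
game_hot["fg_pct"] <- game_hot["fg_pct"] + 0.1
delta <- predict_pts(game_hot, coefs) - pred

round(c(predicted_pts = pred, delta_fg_pct = delta), 2)
```

The difference `delta` equals 0.1 times the `fg_pct` coefficient, which is the "per 0.1" interpretation given for field goal percentage.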
We have built linear and logistic regression models based on the NBA data. The adjusted R-squared of the linear regression model is 0.6916, so the model explains a large share of the variation in game scores and can help us predict the result of a game more accurately.