In this post, we will examine the results obtained with the glance function of the broom package for linear regression models fitted with the lm function. Let’s remember first that linear regression estimates the values of a dependent variable \(y\) from a set of independent variables \(x_j\):
\[ y_i = \beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip} + \varepsilon_i \]
When we estimate the regression coefficients from a dataset we obtain:
\[ y_i = b_0 + b_1x_{i1} + \dots + b_px_{ip} + e_i = \hat{y}_i + e_i \]
The lm function obtains ordinary least squares (OLS) estimates of the linear regression model. Under the conditions described in this post, the OLS estimates coincide with the maximum likelihood estimates.
Some models
Let’s fit five regression models using the mtcars dataset:

- mod0: the null model, with no predictors for fuel consumption mpg.
- mod1: mpg as a function of weight wt.
- mod2: mpg as a function of wt and the square of wt. I have used the poly function to obtain linearly independent polynomial predictors.
- mod3: mpg as a function of wt and type of transmission am, considering the interaction between the two variables.
- mod4: mpg as a function of the rest of the variables of the dataset.
mod0 <- lm(mpg ~ 1, mtcars)
mod1 <- lm(mpg ~ wt, mtcars)
mod2 <- lm(mpg ~ poly(wt, 2), mtcars)
mod3 <- lm(mpg ~ wt*am, mtcars)
mod4 <- lm(mpg ~ ., mtcars)
Collecting summary statistics
Let’s examine the summary statistics of model fit obtained with the glance function of the broom package applied to a linear regression obtained with lm.
library(broom)      # glance()
library(dplyr)      # bind_rows(), mutate(), relocate() and the pipe
library(kableExtra) # kbl(), kable_paper()

fit_models <- bind_rows(lapply(list(mod0, mod1, mod2, mod3, mod4), glance)) %>%
  mutate(mod = c("mod0", "mod1", "mod2", "mod3", "mod4")) %>%
  relocate(mod, r.squared)
fit_models %>%
kbl(digits = 3) %>%
kable_paper("hover", full_width = F)
mod | r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|
mod0 | 0.000 | 0.000 | 6.027 | NA | NA | NA | -102.378 | 208.756 | 211.687 | 1126.047 | 31 | 32 |
mod1 | 0.753 | 0.745 | 3.046 | 91.375 | 0 | 1 | -80.015 | 166.029 | 170.427 | 278.322 | 30 | 32 |
mod2 | 0.819 | 0.807 | 2.651 | 65.638 | 0 | 2 | -75.024 | 158.048 | 163.911 | 203.745 | 29 | 32 |
mod3 | 0.833 | 0.815 | 2.591 | 46.567 | 0 | 3 | -73.738 | 157.476 | 164.805 | 188.008 | 28 | 32 |
mod4 | 0.869 | 0.807 | 2.650 | 13.932 | 0 | 10 | -69.855 | 163.710 | 181.299 | 147.494 | 21 | 32 |
We can group those statistics into three categories:

- statistic, p.value, df and df.residual provide information about the test of overall significance of the model.
- r.squared and adj.r.squared are measures of the coefficient of determination.
- logLik, AIC, BIC and deviance are fit measures based on maximum likelihood estimation.
sigma and nobs return an estimate of the standard deviation of the residuals and the number of observations, respectively.
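As a quick sketch (not part of the glance output itself), we can verify this for mod1 with base R helpers: sigma is the square root of the sum of squared residuals divided by the residual degrees of freedom.

# sigma is the residual standard error: sqrt(SSE / df.residual)
sse <- sum(residuals(mod1)^2)  # 278.322, the deviance of mod1 in the table
sqrt(sse / df.residual(mod1))  # 3.046, the sigma reported for mod1
glance(mod1)$sigma             # same value returned by glance
nobs(mod1)                     # 32 observations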
Test of overall significance
This test evaluates the null hypothesis that the regression model does not explain the variability of the dependent variable better than its mean:
\[H_0 : \beta_1 = \beta_2 = \dots = \beta_p = 0\]
To test that hypothesis we compute:
- The sum of squares of errors or residuals \(SSE\):
\[SSE = \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2\]
- The sum of squares of the model, or the differences between mean and fitted values \(SSM\):
\[SSM = \sum_{i=1}^n \left( \hat{y}_i - \bar{y} \right)^2 \]
- The total sum of squares \(SST\), or differences with respect to the mean:
\[SST = \sum_{i=1}^n \left( y_i - \bar{y} \right)^2\]
These three magnitudes are related by the equality:
\[ SST = SSE + SSM \]
We can calculate the mean sum of squares dividing each by its degrees of freedom:
\[\begin{align} MSE &= \frac{SSE}{n-p-1} & MSM &= \frac{SSM}{p} & MST &= \frac{SST}{n-1} \end{align}\]
If the null hypothesis is true, the quotient between \(MSM\) and \(MSE\) will follow an \(F\) distribution:
\[ \frac{MSM}{MSE} \sim F_{p,n-p-1} \]
If the quotient is large enough, it would be an unusually large observation of this distribution under the null hypothesis. The probability of observing a value at least as large is the p.value listed in the table. If this p-value is smaller than 0.05, we usually prefer the alternative explanation that the null hypothesis is false, and therefore that the model is significant.
This analysis is presented at the bottom of the summary of the lm function:
summary(mod1)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
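We can reproduce the F-statistic and p-value shown above from the sums of squares defined earlier. The following lines are a small sketch for mod1 using only base R functions:

# overall significance test of mod1, computed by hand
n <- nobs(mod1)
p <- 1                                           # a single predictor, wt
sse <- sum(residuals(mod1)^2)                    # SSE
ssm <- sum((fitted(mod1) - mean(mtcars$mpg))^2)  # SSM
f <- (ssm / p) / (sse / (n - p - 1))             # MSM / MSE = 91.38
pf(f, p, n - p - 1, lower.tail = FALSE)          # p-value = 1.294e-10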
We can use a similar test to compare pairs of nested models, where the model with more variables contains all the variables of the other model. This is the case of mod1 and mod2:
anova(mod1, mod2)
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ poly(wt, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 278.32
## 2 29 203.75 1 74.576 10.615 0.00286 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value of the test is smaller than 0.05, we conclude that mod2 adds explanatory power to mod1.
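The same result can be obtained by hand: the F statistic of the nested comparison is the reduction of the residual sum of squares per added degree of freedom, divided by the MSE of the larger model. Here is a sketch of that computation:

# partial F test comparing mod1 (smaller) with mod2 (larger)
sse1 <- deviance(mod1)                                       # 278.32
sse2 <- deviance(mod2)                                       # 203.75
df_diff <- df.residual(mod1) - df.residual(mod2)             # 1
f <- ((sse1 - sse2) / df_diff) / (sse2 / df.residual(mod2))  # 10.615
pf(f, df_diff, df.residual(mod2), lower.tail = FALSE)        # 0.00286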
Coefficients of determination
Another parameter to evaluate the fit of a linear regression is the coefficient of determination. It is calculated as:
\[R^2 = 1- \frac{\sum \left( y_i - \hat{y}_i \right)^2 }{\sum \left( y_i - \bar{y} \right)^2} = 1- \frac{SSE}{SST}\]
If the model fits well to the data, the sum of squared residuals will be small and \(R^2\) will be close to one.
A problem with \(R^2\) is that it does not decrease when we add variables to the model, because we can always set the regression coefficients of the new variables equal to zero and obtain the same sum of squared residuals as the former model. To account for this, we define the adjusted coefficient of determination as:
\[ R^2_{adj.} = 1- \frac{\sum \left( y_i - \hat{y}_i \right)^2 / \left( n-p-1 \right) }{\sum \left( y_i - \bar{y} \right)^2 / \left( n-1 \right)} = 1- \frac{MSE}{MST} \]
\(R^2_{adj.}\) and \(R^2\) are related by the expression:
\[ R^2_{adj.} = 1 - \left( 1- R^2 \right) \frac{n-1}{n-p-1} \]
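Both formulas are easy to reproduce by hand. As a sketch, here are the r.squared and adj.r.squared values of mod1 recomputed from their definitions:

# coefficient of determination of mod1 from its definition
n <- nobs(mod1)
p <- 1                                               # number of predictors
sse <- sum(residuals(mod1)^2)                        # SSE
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)        # SST
r2 <- 1 - sse / sst                                  # 0.753
r2_adj <- 1 - (sse / (n - p - 1)) / (sst / (n - 1))  # 0.745
c(r2, r2_adj, 1 - (1 - r2) * (n - 1) / (n - p - 1))  # last value checks the relation above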
Looking at the table, we observe that r.squared increases with the number of variables, but adj.r.squared does not. The best value of the adjusted coefficient of determination goes to mod3.
Log-likelihood estimation metrics
In this post I discussed the meaning of the likelihood function, and how we can obtain estimates of a model by maximizing the log-likelihood function. For linear regression, this function is equal to:
\[ \mathcal{l} \left[ \left( \sigma, \mu \right), \mathbf{e} \right] = - \frac{n}{2}\ln\left( 2\pi \right) - \frac{n}{2}\ln \left( \sigma^2 \right) - \frac{1}{2\sigma^2} \sum_{i=1}^n e_i^2 \]
This value is (approximately) returned in the logLik column of the table. As we maximize the likelihood, the larger (less negative) its value, the better the fit. Similarly to \(R^2\), logLik tends to increase as we add variables.
We can obtain a more parsimonious indicator of fit based on likelihood with the Akaike information criterion (AIC). This parameter is equal to:
\[AIC = 2k - 2\mathcal{l}\]
where \(k\) is the number of parameters of the model. In this case \(k=p+2\) as we are estimating the \(p+1\) regression coefficients plus residual variance \(\sigma^2\). We prefer models with smaller values of AIC.
Another similar metric is the Bayesian information criterion (BIC):
\[BIC = k \ln\left(n\right) - 2\mathcal{l}\]
Like AIC, we will prefer models with lower values of BIC.
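As a sketch, we can recover the logLik, AIC and BIC values of mod1 from these formulas, plugging in the maximum likelihood estimate of the variance \(\hat{\sigma}^2 = SSE/n\):

# log-likelihood, AIC and BIC of mod1 computed from the formulas
n <- nobs(mod1)
k <- length(coef(mod1)) + 1  # p + 2 parameters: p + 1 coefficients plus sigma^2
sse <- sum(residuals(mod1)^2)
sigma2 <- sse / n            # maximum likelihood estimate of the residual variance
ll <- -n/2 * log(2*pi) - n/2 * log(sigma2) - sse / (2 * sigma2)  # -80.015
aic <- 2*k - 2*ll            # 166.029
bic <- k*log(n) - 2*ll       # 170.427
c(ll, logLik(mod1), aic, AIC(mod1), bic, BIC(mod1))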
AIC and BIC are suited to comparing models. Both favour parsimonious models, that is, models with high explanatory power and few variables. mod2 and mod3 are the best models according to these criteria.
The deviance has little interest in linear models estimated through OLS: the value presented by glance is just the sum of squared residuals \(SSE\).
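We can check this equivalence directly:

# for lm objects, deviance() is just the sum of squared residuals
deviance(mod1)          # 278.322, the value reported by glance
sum(residuals(mod1)^2)  # same quantity computed directly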
Which is the best model?
The summary statistics presented by the glance function for models estimated with lm measure goodness of fit and parsimony.
- A model with good fit has a high power to explain the variability of the dependent variable and to predict it. All the summary statistics presented here measure goodness of fit.
- Parsimony combines simplicity and goodness of fit. Parsimonious models are simple models with great explanatory or predictive power. The adjusted coefficient of determination \(R^2_{adj.}\) and the AIC and BIC metrics help to detect parsimonious models.
For the models presented here, we can say that mod0 and mod1 show low goodness of fit, while the other three models have better values of r.squared, adj.r.squared and logLik. AIC and BIC show that mod2 and mod3 have the best balance between simplicity and goodness of fit, so we can consider them the best models: mod3 has the best values of \(R^2_{adj.}\) and AIC, and mod2 the best value of BIC.