In this post, we will examine the results obtained with the function `glance`

of the `broom`

package for linear regression estimators obtained with the lm function. Let’s remember first that linear regression estimates the values of dependent variable `y`

from a set of independent variables \(x_j\):

\[ y_i = \beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip} + \varepsilon_i \]

When we estimate the regression coeficients from a dataset we obtain:

\[ y_i = b_0 + b_1x_{i1} + \dots + b_px_{ip} + e_i = \hat{y}_i + e_i \]

The `lm`

function obtains ordinary least squares estimators (OLS) for the linear regression model. Under conditions defined in this post, OLS estimates are the ones of maximum likelihood.

## Some models

Let’s fit four regression models using the `mtcars`

dataset:

`mod0`

: the null model, with no predictors for fuel consumption`mpg`

.`mod1`

:`mpg`

as a function of weight`wt`

.`mod2`

:`mpg`

as a function of`wt`

and the square of`wt`

. I have used the`poly`

function to obtain linearly independent polynomial predictors.`mod3`

:`mpg`

as a function of`wt`

and type of transmission`am`

, considering the interaction between the two variables.`mod4`

:`mpg`

as a function of the rest of the variables of the dataset.

```
mod0 <- lm(mpg ~ 1, mtcars)
mod1 <- lm(mpg ~ wt, mtcars)
mod2 <- lm(mpg ~ poly(wt, 2), mtcars)
mod3 <- lm(mpg ~ wt*am, mtcars)
mod4 <- lm(mpg ~ ., mtcars)
```

## Collecting summary statistics

Let’s examine the summary statistics of model fit obtained with the `glance`

function of the `broom`

package applied to a linear regression obtained with `lm`

.

```
fit_models <- bind_rows(lapply(list(mod0, mod1, mod2, mod3, mod4), function(x) glance(x))) %>%
mutate(mod =c("mod0", "mod1", "mod2", "mod3", "mod4")) %>%
relocate(mod, r.squared)
fit_models %>%
kbl(digits = 3) %>%
kable_paper("hover", full_width = F)
```

mod | r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|---|

mod0 | 0.000 | 0.000 | 6.027 | NA | NA | NA | -102.378 | 208.756 | 211.687 | 1126.047 | 31 | 32 |

mod1 | 0.753 | 0.745 | 3.046 | 91.375 | 0 | 1 | -80.015 | 166.029 | 170.427 | 278.322 | 30 | 32 |

mod2 | 0.819 | 0.807 | 2.651 | 65.638 | 0 | 2 | -75.024 | 158.048 | 163.911 | 203.745 | 29 | 32 |

mod3 | 0.833 | 0.815 | 2.591 | 46.567 | 0 | 3 | -73.738 | 157.476 | 164.805 | 188.008 | 28 | 32 |

mod4 | 0.869 | 0.807 | 2.650 | 13.932 | 0 | 10 | -69.855 | 163.710 | 181.299 | 147.494 | 21 | 32 |

We can group those statistics into three categories:

`statistic`

,`p.value`

,`df`

and`df.residual`

provide information about the**test of overall significance**of the model.`r.squared`

and`adj.r.squared`

are measures of the**coefficient of determination**.`logLik`

,`AIC`

,`BIC`

and`deviance`

are fit measures based on maximum likelihood estimation.

`sigma`

and `nobs`

retrieve the an estimation of variance of the residuals and the number of observations, respectively.

## Test of overall significance

This test consists in evaluating the null hypothesis that the regression model does not explain the variability of the dependent variable better than its mean:

\[H_0 : \beta_1 = \beta_2 = \dots = \beta_p = 0\]

To test that hypotesis we compute:

- The sum of squares of errors or residuals \(SSE\):

\[SSE = \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2\]

- The sum of squares of the model, or the differences between mean and fitted values \(SSM\):

\[SSM = \sum_{i=1}^n \left( \hat{y}_i - \bar{y_i} \right)^2 \]

- The total sum of squares \(SST\) or differences respect to the mean:

\[SST = \sum_{i=1}^n \left( y_i - \bar{y}_i \right)^2\]

These three magnitudes are related by the equality:

\[ SST = SSE + SSM \]

We can calculate the mean sum of squares dividing each by its degrees of freedom:

\[\begin{align} MSE &= \frac{SSE}{n-p-1} & MSM &= \frac{SSM}{p} & MST &= \frac{SSE}{n-1} \end{align}\]

If the null hypothesis is true, the quotient between \(MSM\) and \(MSE\) will follow a law:

\[ \frac{MSM}{MSE} \sim F_{p,n-p-1} \]

If the quotient is large enough, it can be interpreted as an abnormal observation of the underlying distribution (so the null hypothesis will be true). The probability of this is the `p.value`

listed on the table. If this *p*-value is smaller than 0.05, we usually prefer the alternative explanation that the null hypothesis is false, and therefore the model is significant.

This analysis is presented at the bottom of the `summary`

of the `lm`

function:

`summary(mod1)`

```
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
```

We can use a similar test to compare pairs of **nested models**, where the model with more variables contains all the variables of the other model. This is the case of `mod1`

and `mod2`

:

`anova(mod1, mod2)`

```
## Analysis of Variance Table
##
## Model 1: mpg ~ wt
## Model 2: mpg ~ poly(wt, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 278.32
## 2 29 203.75 1 74.576 10.615 0.00286 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

As the *p*-value of the test is smaller than 0.05, we conclude that `mod2`

adds explanatory power to `mod1`

.

## Coefficients of determination

Another parameter to evaluate the fit of a linear regression is the **coefficient of determination**. It is calculated as:

\[R^2 = 1- \frac{\sum \left( y_i - \hat{y}_i \right)^2 }{\sum \left( y_i - \bar{y}_i \right)^2} = 1- \frac{SEE}{SST}\]

If the model fits well to the data, the sum of squared residuals will be small and \(R^2\) will be close to one.

A problem of \(R^2\) is that it does not decrease when we add variables to the model, because we always can set all regression coefficients of the new variables equal to zero and obtain the same sum of squared residuals of the former model. To account for this, we define the **adjusted coefficient of determination** as:

\[ R^2_{adj.} = 1- \frac{\sum \left( y_i - \hat{y}_i \right)^2 / \left( n-p-1 \right) }{\sum \left( y_i - \bar{y}_i \right)^2 / \left( n-1 \right)} = 1- \frac{MSE}{MST} \]

\(R^2_{adj.}\) and \(R^2\) are related by the expression:

\[ R^2_{adj.} = 1 - \left( 1- R^2 \right) \frac{n-1}{n-p-1} \]

Looking at the table, we observe that `r.squared`

increases with number of variables, but not `adj.r.squared`

. The best value of adjusted coefficient of determination goes to `mod3`

.

## Log-likelihood estimation metrics

In this post I discussed the meaning of the likelihood function, and how can we obtain estimates of a model maximizing the log likelihood function. This function for linear regression is equal to:

\[ \mathcal{l} \left[ \left( \sigma, \mu \right), \mathbf{e} \right] = - \frac{n}{2}ln\left( 2\pi \right) - \frac{n}{2}ln \left( \sigma^2 \right) - \frac{1}{2\sigma^2} \sum_{i=1}^i e_i^2 \]

This value is (approximately) returned in the `logLik`

column of the table. As we maximize likelihood the larger (less negative) its value the better the fit. Similarly to \(R^2\), ´logLik` tends to increase as we add variables.

We can obtain a more parsimonious indicator of fit based on likelihood with the **Akaike information criterion (AIC)**. This parameter is equal to:

\[AIC = 2k - 2\mathcal{l}\]

where \(k\) is the number of parameters of the model. In this case \(k=p+2\) as we are estimating the \(p+1\) regression coefficients plus residual variance \(\sigma^2\). We prefer models with smaller values of *AIC*.

Another similar metric is the **Bayesian information criterion (BIC)**:

\[BIC = k ln\left(n\right) - \mathcal{l}\]

Like *AIC*, we will prefer models with lower values of *BIC*.

*AIC* and *BIC* are suited to compare models. Both favour **parsimonious** models, that is, models with high explanatory power and few variables. `mod2`

and `mod3`

are the best models according to these criteria.

The **deviance** has little sense in linear models estimated through OLS. The value presented by `glance`

is the sum of squared residuals \(SSE\).

## Which is the best model?

The summary statistics presented by the `glance`

function for models estimated with `lm`

measure goodness of fit and parsimony.

- A model with
**goodness of fit**has a high power to explain the variablity and to predict the dependent variable. All the summary statistics presented here measure goodness of fit. **Parsimony**combines simplicity and goodness of fit. Parsimonious models are simple models with great explanatory or predictive power. The adjusted coefficient of deteremination \(R^2_{adj.}\) and metrics*AIC*and*BIC*allow to detect parsimonious models.

For the models presented here, we can say that `mod0`

and `mod1`

show low goodness of fit, while the other three models have better values of `r.squared`

, `adj.r.squared`

and `logLik`

. *AIC* and *BIC* show that `mod2`

and `mod3`

have the best balance between simplicity and goodness of fit, so we can consider them as the best models. `mod3`

has the best values of \(R^2_{adj.}\) and *AIC*, and `mod2`

the best value of *BIC*.