**Linear regression** si about finding a linear relationship between:

- a
**dependent variable**, also called endogenous, response or criterion \(y\) - and a set \(p\) of
**independent variables**, sometimes called exogenous or predictor \(x_j\).

The posited relationship can be represented as:

\[ y_i = \beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip} + \varepsilon_i \]

Linear regression is one of the more powerful tools of statistical data analysis. In its many variants, we use linear regression:

- to
**explain**relationships between dependent and independent variables, trying to find if we can assert that regression coefficients \(\beta_j\) are significantly different from zero. - to
**predict**the dependent variable, in the context of supervised learning.

The above equations are defined for the whole population. We usually have a sample, from which we can make statistical inferences about the regression coefficients. We will never know \(\beta_j\), but its estimator \(b_j\):

\[ y_i = b_0 + b_1x_{i1} + \dots + b_px_{ip} + e_i \]

From these estimators, we can find predictors \(\hat{y}_i\) for each observation, as an alternative of estimating all observations through the mean \(\bar{y}\):

\[ \hat{y}_i = b_0 + b_1x_{i1} + \dots + b_px_{ip} + e_i \]

\[ y_i = \hat{y}_i + e_i \]

The most common way of obtaining the estimators \(b_j\) is through **ordinary least squares (OLS)**. These are the estimators that minimize:

\[ \sum e_i^2 \]

## Linear regression on mtcars

R has a built-in function `lm`

that calculates the OLS estimators for a set of dependent and independent variables. We don’t need any package for `lm`

, but we will use some packages:

- the
`tidyverse`

for data handling and plotting, `broom`

to tidy the output of lm- and
`ggfortify`

to present some plots for the linear model.

```
library(tidyverse)
library(broom)
library(ggfortify)
```

I will examine the `mtcars`

dataset, to find an explanatory model of fuel consumption `mpg`

.

`mtcars %>% glimpse()`

```
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
```

I will transform variables from `vs`

to `carb`

to factors, as they are categoric rather than numeric:

```
mtcars <- mtcars %>%
mutate_at(vars(vs:carb), as.factor)
```

## A regression model

Let’s examine a linear regression model that includes all the possible independent variables. The inputs of `lm`

are:

- a
**formula**`mpg ~ .`

meaning that`mpg`

is the dependent variable and the rest of variables of the dataset are the independent variables. - a
**data frame**`mtcars`

with data to build the model. Formula variables must match column names of the data frame.

Unlike other statistical software like Stata or SPSS, we store the result of estimating the model into a variable `mod0`

.

`mod0 <- lm(mpg ~ ., mtcars)`

The standard way of presenting the results of a linear model is through `summary`

:

`summary(mod0)`

```
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6533 -1.3325 -0.5166 0.7643 4.7284
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.31994 23.88164 1.060 0.3048
## cyl -1.02343 1.48131 -0.691 0.4995
## disp 0.04377 0.03058 1.431 0.1716
## hp -0.04881 0.03189 -1.531 0.1454
## drat 1.82084 2.38101 0.765 0.4556
## wt -4.63540 2.52737 -1.834 0.0853 .
## qsec 0.26967 0.92631 0.291 0.7747
## vs1 1.04908 2.70495 0.388 0.7032
## am1 0.96265 3.19138 0.302 0.7668
## gear4 1.75360 3.72534 0.471 0.6442
## gear5 1.87899 3.65935 0.513 0.6146
## carb2 -0.93427 2.30934 -0.405 0.6912
## carb3 3.42169 4.25513 0.804 0.4331
## carb4 -0.99364 3.84683 -0.258 0.7995
## carb6 1.94389 5.76983 0.337 0.7406
## carb8 4.36998 7.75434 0.564 0.5809
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.823 on 16 degrees of freedom
## Multiple R-squared: 0.8867, Adjusted R-squared: 0.7806
## F-statistic: 8.352 on 15 and 16 DF, p-value: 6.044e-05
```

The summary includes:

- The
**function call**. - A summary of the
**residuals**\(e_i\). - The
**estimators of coefficients**, and the*p*-value of the null hypothesis that each regression coefficient is zero. In this case all*p*-values are above 0.05, therefore we don’t know if any of the coefficients is different from zero. - Some parameters of
**overall significance**of the model: the coefficient of determination (raw and adjusted), and a F-test indicating if the regression model explains the dependent variable better than the mean. Values of the adjusted coefficient of determination close to one and small values of*p*-value suggest good fit.

In this case we observe that the model is significant (if we take the usual convention that the *p*-value should be below 0.05), but that none of the coefficients is. This may arise because lack of statistical power (`mtcars`

is a small dataset) or because of correlation among dependent variables.

The `broom`

package offers an alternative way of presenting the output of statistical analysis. The package works through three functions:

`tidy`

presents parameter estimators,`glance`

overall model significance estimators,- and
`augment`

adds information to the dataset.

The output of these functions is a tibble, with a content depending of the examined model. Let’s see the outcome of `tidy`

and `glance`

for this model:

`tidy(mod0)`

```
## # A tibble: 16 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 25.3 23.9 1.06 0.305
## 2 cyl -1.02 1.48 -0.691 0.500
## 3 disp 0.0438 0.0306 1.43 0.172
## 4 hp -0.0488 0.0319 -1.53 0.145
## 5 drat 1.82 2.38 0.765 0.456
## 6 wt -4.64 2.53 -1.83 0.0853
## 7 qsec 0.270 0.926 0.291 0.775
## 8 vs1 1.05 2.70 0.388 0.703
## 9 am1 0.963 3.19 0.302 0.767
## 10 gear4 1.75 3.73 0.471 0.644
## 11 gear5 1.88 3.66 0.513 0.615
## 12 carb2 -0.934 2.31 -0.405 0.691
## 13 carb3 3.42 4.26 0.804 0.433
## 14 carb4 -0.994 3.85 -0.258 0.799
## 15 carb6 1.94 5.77 0.337 0.741
## 16 carb8 4.37 7.75 0.564 0.581
```

`glance(mod0) %>% glimpse()`

```
## Rows: 1
## Columns: 12
## $ r.squared <dbl> 0.886746
## $ adj.r.squared <dbl> 0.7805703
## $ sigma <dbl> 2.823223
## $ statistic <dbl> 8.351689
## $ p.value <dbl> 6.044113e-05
## $ df <dbl> 15
## $ logLik <dbl> -67.52781
## $ AIC <dbl> 169.0556
## $ BIC <dbl> 193.9731
## $ deviance <dbl> 127.5294
## $ df.residual <int> 16
## $ nobs <int> 32
```

## A simpler regression model

Let’s try a explanatory model of fuel consumption `mpg`

considering car weight `wt`

only:

`mod1 <- lm(mpg ~ wt, mtcars)`

We see that the regression coefficients of the intercept and `wt`

are significant in this model:

`tidy(mod1)`

```
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 37.3 1.88 19.9 8.24e-19
## 2 wt -5.34 0.559 -9.56 1.29e-10
```

and that the overall significance of the model is higher than `mod0`

:

`glance(mod1) %>% select(r.squared, adj.r.squared, p.value)`

```
## # A tibble: 1 x 3
## r.squared adj.r.squared p.value
## <dbl> <dbl> <dbl>
## 1 0.753 0.745 1.29e-10
```

## Graphical diagnostics

This new, simpler model looks convincing: we expect higher fuel consumption for weighter vehicles. But to confirm that the model is adequate, we need to do **residual diagnostics** to confirm that:

- Residuals have mean zero and constant variance across predicted values \(\hat{y}_i\). We check that with a residuals vs fitted values plot.
- Residuals are distributed normally. We check that with a QQ-plot.

We can obtain these plots in R base doing `plot(mod1, which = 1)`

and `plot(mod2, which = 2)`

. But we can obtain prettier versions of them with the `autoplot`

function of `ggfortify`

:

`autoplot(mod1, which = 2:1, ncol = 2, label.size = 3)`

In the first plot we observe that residuals are reasonably normal. From the second plot we learn that the mean of residuals is larger for extreme values of \(\hat{y}_i\) than for central values. This suggests that the relationship between weight and fuel consumption is not linear. Let’s examine that with an alternative model.

## A quadratic model

We can examine nonlinear relationships with linear regressions adding powered values of independent variables. As these powers tend to be correlated with the original value, it is better to use **orthogonal polynomials** (orthogonal means non-correlated) using the `poly`

function.

`mod2 <- lm(mpg ~ poly(wt, 2), mtcars)`

Let’s examine the regression coefficients:

`tidy(mod2)`

```
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 20.1 0.469 42.9 8.75e-28
## 2 poly(wt, 2)1 -29.1 2.65 -11.0 7.52e-12
## 3 poly(wt, 2)2 8.64 2.65 3.26 2.86e- 3
```

We observe that the regression coefficients of the first- and second-order terms are significantly different from zero. Let’s examine the residual diagnostics again:

`autoplot(mod2, which = 1:2, ncol = 2, label.size = 3)`

Now we see that now the residuals behave more smoothly, although we observe some outliers in the first plot. We take advantage that `glance`

returns us a tibble to together the overall signficance metrics of the three models:

```
bind_rows(glance(mod0) %>% mutate(model = "all vars"),
glance(mod1) %>% mutate(model = "wt"),
glance(mod2) %>% mutate(model = "wt squared")) %>%
select(model, r.squared, adj.r.squared, p.value)
```

```
## # A tibble: 3 x 4
## model r.squared adj.r.squared p.value
## <chr> <dbl> <dbl> <dbl>
## 1 all vars 0.887 0.781 6.04e- 5
## 2 wt 0.753 0.745 1.29e-10
## 3 wt squared 0.819 0.807 1.71e-11
```

The adjusted coefficient of determination increases as we build more accurate linear regression models. So we can conclude that `mod2`

is the best explanatory model of `mpg`

among the ones we have tested.

*Built with R 4.0.3, tidyverse 1.3.0, broom 0.7.5 and ggoftify 0.4.11*