In linear regression, we estimate the value of a dependent variable as a linear function of a set of independent variables. The values of the dependent variable are unrestricted, meaning that it can take any real value. If some hypotheses regarding normality and variance stability of the residuals are met, the ordinary least squares estimates are the maximum likelihood estimates for linear regression.
There are many contexts, though, where the dependent variable can only take values equal to zero or one. In that context, we may want to estimate the probability that the dependent variable is equal to one. Unlike in linear regression, the predicted values are restricted to the \(\left[0,1\right]\) interval. For these models we cannot use ordinary least squares. Instead, we can obtain maximum likelihood estimates with other models.
A usual approach to this problem is logistic regression, sometimes called the logit model.
\[ P\left( y = 1 \vert \mathbf{x} \right) = \pi\left( z \right) = \frac{1}{1 + \exp\left(-z\right)} \]
\[ z_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} = \boldsymbol{\theta}' \mathbf{x}_i \]
In the logit model, we use the logistic function to transform an unrestricted real variable \(z_i\) into a probability \(\pi_i\). Here is a representation of the logistic function \(y = 1/\left(1 + e^{-z}\right)\).
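A minimal sketch to draw this curve with ggplot2 (the logistic_curve data frame is just an illustrative name introduced here):
library(ggplot2)

# evaluate the logistic function 1 / (1 + exp(-z)) on a grid of z values
logistic_curve <- data.frame(z = seq(-6, 6, by = 0.1))
logistic_curve$y <- 1 / (1 + exp(-logistic_curve$z))

ggplot(logistic_curve, aes(z, y)) +
  geom_line() +
  labs(y = "probability", title = "The logistic function")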
The variable \(z\) is the log odds of the dependent variable being equal to one:
\[ \ln\left( \frac{\pi_i}{1-\pi_i} \right) = \boldsymbol{\theta}' \mathbf{x}_i \]
The estimation of model coefficients is completely different from ordinary least squares. As each observation of the dependent variable follows a Bernoulli (binomial with a single trial) distribution with probability \(\pi_i\), the likelihood function of the parameters given the data is:
\[ \mathcal{L}\left( \boldsymbol{\theta} \right) = \prod_{i=1}^n \pi_i^{y_i} \left(1-\pi_i\right)^{1-y_i} \]
For convenience, we maximize the log likelihood function to obtain the estimates:
\[ \ell\left( \boldsymbol{\theta} \right) = \sum_{i=1}^n y_i \log \pi_i + \sum_{i=1}^n \left(1-y_i\right) \log\left(1-\pi_i\right) \]
As \(\pi_i\) is a function of the independent variables, we can obtain estimates \(b_0, \dots, b_p\) of \(\beta_0, \dots, \beta_p\) by maximizing the log likelihood function \(\ell\). Unlike ordinary least squares, there is no analytical solution for this maximization, so model estimates are obtained numerically.
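To make the numerical maximization concrete, here is a minimal sketch that recovers the coefficients of a one-predictor logit model by minimizing the negative log likelihood with base R's optim on simulated data (the names x, y and neg_loglik are introduced here purely for illustration):
# minimal sketch: numerical maximum likelihood for a logit model with one predictor
set.seed(1234)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = 1 / (1 + exp(-(-1 + 2 * x))))  # true beta0 = -1, beta1 = 2

neg_loglik <- function(theta) {
  pi_i <- 1 / (1 + exp(-(theta[1] + theta[2] * x)))
  -sum(y * log(pi_i) + (1 - y) * log(1 - pi_i))
}

optim(c(0, 0), neg_loglik)$par
# should be close to coef(glm(y ~ x, family = "binomial"))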
An example of logistic regression
Let’s use the iris dataset to build a logistic regression model with a dependent variable is_virginica equal to one if the observation belongs to this species and zero otherwise. I will remove the original Species variable and Petal.Width. The latter is highly correlated with the rest of the predictors.
# packages used throughout the post
library(tidyverse)
library(broom)

iris_logistic <- iris %>%
  mutate(is_virginica = ifelse(Species == "virginica", 1, 0)) %>%
  select(-c(Petal.Width, Species))
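Before fitting the model, we can run a quick sanity check on the new dependent variable (50 of the 150 iris observations are virginica):
# frequency table of the binary dependent variable
table(iris_logistic$is_virginica)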
We can fit a logit model in R using the base function glm for generalized linear models with family = "binomial". We can use summary to obtain a summary of model output.
mod1 <- glm(is_virginica ~ ., family = "binomial", data = iris_logistic)
summary(mod1)
##
## Call:
## glm(formula = is_virginica ~ ., family = "binomial", data = iris_logistic)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.83416  -0.01085   0.00000   0.00674   1.52165
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -38.2154    14.0532  -2.719 0.006541 **
## Sepal.Length  -3.8524     1.7091  -2.254 0.024190 *
## Sepal.Width   -0.6385     2.2906  -0.279 0.780424
## Petal.Length  13.1475     3.9057   3.366 0.000762 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 190.954 on 149 degrees of freedom
## Residual deviance: 23.772 on 146 degrees of freedom
## AIC: 31.772
##
## Number of Fisher Scoring iterations: 10
Model coefficients
The significance of model coefficients allows using the logistic regression model as an explanatory tool, detecting the antecedents of the probability of each observation being virginica. We can obtain information on model coefficients in tabular form using the tidy function of the broom package.
tidy(mod1)
## # A tibble: 4 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -38.2       14.1     -2.72  0.00654
## 2 Sepal.Length   -3.85       1.71    -2.25  0.0242
## 3 Sepal.Width    -0.639      2.29    -0.279 0.780
## 4 Petal.Length   13.1        3.91     3.37  0.000762
The p.value column presents the probability of observing a regression coefficient as large as the estimate (in absolute value) if the null hypothesis that the coefficient is zero holds. We usually reject that null hypothesis for p-values equal to or smaller than 0.05. Observing the sign of the significant coefficients, we can conclude that virginica flowers will tend to have small values of Sepal.Length and high values of Petal.Length.
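Since \(z_i\) is the log of the odds of being virginica, exponentiated coefficients can be read as multiplicative effects on those odds. A brief sketch of how we might obtain them, either with base R or with tidy and its exponentiate argument:
# exponentiated coefficients: multiplicative effect of each predictor on the odds
exp(coef(mod1))
tidy(mod1, exponentiate = TRUE)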
We can examine the predictions for each observation using the augment function of broom. The values of the log odds \(z_i\) are found in the .fitted column, so we need to transform them to obtain the predicted probability prob.
mod1_pred <- augment(mod1) %>%
  mutate(prob = 1/(1 + exp(-.fitted)),
         is_virginica = as.factor(is_virginica))
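As a check, the same probabilities can be obtained without the manual transformation, using predict with type = "response" or the base R logistic function plogis:
# two equivalent ways of obtaining the predicted probabilities
head(predict(mod1, type = "response"))
head(plogis(mod1_pred$.fitted))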
Let’s examine the relationship between Petal.Length and prob, together with the real classification of each observation:
mod1_pred %>%
  ggplot(aes(Petal.Length, prob, color = is_virginica)) +
  geom_point() +
  theme(panel.background = element_rect(fill = "#FFFFFA", color = "#F5DEB3"),
        legend.key = element_rect(fill = "#FFFFFA", color = "#F5DEB3"),
        legend.background = element_rect(fill = "#FFFFFF", color = "#FFFFFF"),
        legend.position = "bottom") +
  scale_color_manual(name = "class", labels = c("not virginica", "virginica"),
                     values = c("#FF6666", "#00CC66"))
We observe that we can separate many of the observations belonging to each class considering the values of Petal.Length. Let’s see how well Sepal.Length explains the probability of being virginica:
mod1_pred %>%
  ggplot(aes(Sepal.Length, prob, color = is_virginica)) +
  geom_point() +
  theme(panel.background = element_rect(fill = "#FFFFFA", color = "#F5DEB3"),
        legend.key = element_rect(fill = "#FFFFFA", color = "#F5DEB3"),
        legend.background = element_rect(fill = "#FFFFFF", color = "#FFFFFF"),
        legend.position = "bottom") +
  scale_color_manual(name = "class", labels = c("not virginica", "virginica"),
                     values = c("#FF6666", "#00CC66"))
We observe that the explanatory power of Sepal.Length is smaller. For values of Sepal.Length between 5.5 and 7 we can find observations belonging to both classes. This was to be expected, as its p-value is much higher than that of Petal.Length, although still below 0.05.
Model fit
To assess the overall significance of the model, we only have quantities related to log likelihood maximization. The F-test of overall significance and the coefficient of determination of linear regression are not available for logistic regression.
The most useful quantity to examine the model fit of a logistic regression is the deviance. The deviance \(D\) accounts for how much the model deviates from a model with perfect fit. The smaller the deviance, the better the model. For a binary dependent variable, the deviance is calculated as:
\[ D = -2\,\ell\left( \boldsymbol{\theta} \right) \]
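For ungrouped binary data like ours, the deviance reported by R is indeed minus two times the log likelihood. A quick check against the summary output above:
# deviance equals -2 times the log likelihood for this binary logit model
deviance(mod1)
-2 * as.numeric(logLik(mod1))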
The McFadden pseudo-R squared gives a fit measure similar to linear regression’s R squared, comparing the deviance of the model with the deviance of the null model \(D_{null}\). The null model has no predictors, so it is supposed to have the worst fit. The formula of the McFadden pseudo-R squared is:
\[ R^2_{pseudo} = 1 - \frac{D}{D_{null}} = 1 - \frac{\ell}{\ell_{null}} \]
Note that pseudo-R squared makes sense only if log likelihood is negative.
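As a check, we can compute the pseudo-R squared directly from the log likelihoods, fitting the null model with only an intercept (the mod0 name is introduced here for illustration):
# sketch: McFadden pseudo-R squared from the log likelihoods of mod1 and the null model
mod0 <- glm(is_virginica ~ 1, family = "binomial", data = iris_logistic)
1 - as.numeric(logLik(mod1)) / as.numeric(logLik(mod0))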
We can obtain fit indices for our model using the glance function of broom. We can easily add the pseudo-R squared using the deviance values.
glance(mod1) %>%
  mutate(pseudo.rsq = 1 - deviance/null.deviance) %>%
  glimpse()
## Rows: 1
## Columns: 9
## $ null.deviance <dbl> 190.9543
## $ df.null <int> 149
## $ logLik <dbl> -11.8859
## $ AIC <dbl> 31.7718
## $ BIC <dbl> 43.81434
## $ deviance <dbl> 23.7718
## $ df.residual <int> 146
## $ nobs <int> 150
## $ pseudo.rsq <dbl> 0.8755105
Predictions
We can use logistic regression as a predictive model for classification. We use a threshold value of probability to assign each observation to a class. Here we will assign to the class virginica the observations with a predicted probability higher than 0.5:
pred_iris <- augment(mod1) %>%
  mutate(prob = 1/(1 + exp(-.fitted))) %>%
  mutate(pred_virginica = ifelse(prob > 0.5, 1, 0))
Once we have predicted the class, we can obtain the confusion matrix with the conf_mat function of the tidymodels yardstick package:
library(yardstick)

pred_iris %>%
  mutate(is_virginica = as.factor(is_virginica),
         pred_virginica = as.factor(pred_virginica)) %>%
  conf_mat(truth = is_virginica,
           estimate = pred_virginica)
##           Truth
## Prediction  0  1
##          0 97  2
##          1  3 48
We observe that mod1 has a good performance as a classifier for this dataset.
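To quantify that performance, we could also compute summary metrics with yardstick’s metrics function, which for class predictions should return accuracy and Cohen’s kappa (a brief sketch):
# sketch: overall classification metrics with yardstick
pred_iris %>%
  mutate(is_virginica = as.factor(is_virginica),
         pred_virginica = as.factor(pred_virginica)) %>%
  metrics(truth = is_virginica, estimate = pred_virginica)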
Explaining and predicting with logistic regression
In this post, we have introduced logistic regression as a generalization of linear regression to estimate the probability of the dependent variable being equal to one when that variable is binary. Logistic regression models are estimated by maximizing the log likelihood.
Logistic regression can be used as an explanatory model of the antecedents of the dependent variable. This is the most common use of logistic regression in econometrics. We examine the significance of the model coefficients to find antecedents of the binary dependent variable.
Another use of logistic regression is as a predictive model. When we assign a category to each observation based on predicted probability, we have a classification model. Many classification techniques of machine learning, like neural networks, can be seen as extensions of logistic regression.
References
- FAQ: What are pseudo R-squareds? https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/
Built with R 4.1.0, tidyverse 1.3.1, broom 0.7.7 and yardstick 0.0.8