Logistic regression

Jose M Sallan 2021-06-19 7 min read

In linear regression, we estimate the value of a dependent variable as a linear function of a set of dependent variables. The values of the dependent variable are unrestricted, meaning that it can take any real value. If some hypothesis regarding normality and variance stability of residuals are met, the ordinary least squares are the maximum likelihood estimates for linear regression.

There are many contexts, though, where the dependent variable can take only values equal to zero and one. In that context, we may want to estimate the probability that the dependent variable be equal to one. Unlike linear regression, the values of the dependent variable are restricted to the \(\left[0,1\right]\) interval. For these models we cannot use ordinary least squares. Instead, we can obtain maximum likelihood estimates with other models.

A usual approach to this problem is the logistic regression, sometimes called logit model.

\[ P( y = 1 \vert \mathbf{x}) = \mathbf{\pi}\left( \mathbf{z} \right) = \frac{1}{1 + \text{exp}\left(-\mathbf{z}\right)} \]

\[ z_i = \beta_0 + \beta_1x_{i1} + \dots + \beta_px_{ip} = \mathcal{\theta}' \mathcal{x}_i \]

In the logit model, we use the logistic function to transform an unrestricted real variable \(z_i\) into a probability \(\pi_i\). Here is a representation of the logistic function \(y = 1/\left(1 + e^{\beta s}\right)\)

The variable \(z\) is the log odds ratio of the dependent variable being equal to one:

\[ ln\left( \frac{\pi_i}{1-\pi_i} \right) = \mathcal{\theta}' \mathcal{x}_i \]

The estimation of model coefficients is completely different from ordinary least squares. The likelihood function of the parameters given the data is a binomial distribution:

\[ \mathcal{L}\left( \mathcal{\theta} \right) = \prod_{i=1}^n \pi_i^{y_i} \left(1-\pi_i\right)^{1-y_i} \]

For simplicity, we optimize the log likelihood function to obtain the estimators:

\[ \mathcal{l}\left( \mathcal{\theta} \right) = \sum_{i=1}^n y_i\text{log} \pi_i + \sum_{i=1}^n\left(1-y_i\right) \text{log}\left(1-\pi_i\right) \]

As \(\pi_i\) is a function of dependent variables, we can obtain estimates \(b_0, \dots, b_p\) of \(\beta_0, \dots, \beta_p\) maximizing the log likelihood function \(\mathcal{l}\). Unlike ordinary least squares, there is no analytical solution for the maximization of log likelihood, so model estimates are obtained numerically.

An example of logistic regression

Let’s use the iris dataset to build a logistic regression model with a dependent variable is_virginica equal to one if the observation belongs to this species and zero otherwise. I will remove the original Species variable and Petal.Width. The later is highly correlated with the rest of predictors.

iris_logistic <- iris %>% 
  mutate(is_virginica = ifelse(Species == "virginica", 1, 0)) %>%
  select(-c(Petal.Width, Species))

We can fit a logit model in R using the base function glm for generalized linear models with family = "binomial". We can use summary to obtain a summary of model output.

mod1 <- glm(is_virginica ~ ., family = "binomial", data = iris_logistic)
## Call:
## glm(formula = is_virginica ~ ., family = "binomial", data = iris_logistic)
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.83416  -0.01085   0.00000   0.00674   1.52165  
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -38.2154    14.0532  -2.719 0.006541 ** 
## Sepal.Length  -3.8524     1.7091  -2.254 0.024190 *  
## Sepal.Width   -0.6385     2.2906  -0.279 0.780424    
## Petal.Length  13.1475     3.9057   3.366 0.000762 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 190.954  on 149  degrees of freedom
## Residual deviance:  23.772  on 146  degrees of freedom
## AIC: 31.772
## Number of Fisher Scoring iterations: 10

Model coefficients

Model coefficients significance allow using the logistic regression model as a explanatory tool, detecting the antecedents of the probability of each observation being virginica. We can obtain information on model coefficients in tabular form using the tidy function of the broom package.

## # A tibble: 4 x 5
##   term         estimate std.error statistic  p.value
##   <chr>           <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)   -38.2       14.1     -2.72  0.00654 
## 2 Sepal.Length   -3.85       1.71    -2.25  0.0242  
## 3 Sepal.Width    -0.639      2.29    -0.279 0.780   
## 4 Petal.Length   13.1        3.91     3.37  0.000762

The p.value column presents the probability of observing a regression coefficient equal or larger than the estimate if the null hypothesis that the regression coefficient is zero holds. We usually discard that null hypothesis for p-values equal or smaller than 0.05. Observing the sign of significant coefficients we can conclude that virginica flowers will tend to have small values of Sepal.Length and high values of Petal.Length.

We can examine the predictions for each observation using the augment function of broom. The values of the log odds \(z_i\) are found on the .fitted column, so we need to transform it to obtain the predicted probability prod.

mod1_pred <- augment(mod1) %>%
  mutate(prob = 1/(1 + exp(-.fitted)),
         is_virginica = as.factor(is_virginica)) 

Let’s examine the relationship between Petal.Length and prob, together with the real classification of each observation:

mod1_pred %>%
  ggplot(aes(Petal.Length, prob, color = is_virginica)) +
  geom_point() +
  theme(panel.background = element_rect(fill = "#FFFFFA", color = "#F5DEB3"), legend.key = element_rect(fill = "#FFFFFA", color = "#F5DEB3"), legend.background = element_rect(fill = "#FFFFFF", color = "#FFFFFF"), legend.position = "bottom") +
  scale_color_manual(name = "class", label = c("not virginica", "virginica"), values = c("#FF6666", "#00CC66"))

We observe that we can separate many of the observations belonging to each class considering the values of Petal.Length. Let’s see how good is Sepal.Length to explain the probability of being virginica:

mod1_pred %>%
  ggplot(aes(Sepal.Length, prob, color = is_virginica)) +
  geom_point() +
  theme(panel.background = element_rect(fill = "#FFFFFA", color = "#F5DEB3"), legend.key = element_rect(fill = "#FFFFFA", color = "#F5DEB3"), legend.background = element_rect(fill = "#FFFFFF", color = "#FFFFFF"), legend.position = "bottom") +
  scale_color_manual(name = "class", label = c("not virginica", "virginica"), values = c("#FF6666", "#00CC66"))

We observe that the explanatory power of Petal.Length is smaller. For values of Petal.Length between 5.5. and 7 we can find observations belonging to both classes. This was to be expected, as its p-value was much higher than Petal.Lenght, although still below 0.05.

Model fit

To assess overall significance of the model, we have only parameters related with log likelihood maximization. The F-test of overall significance and the coefficient of determination of linear regression are not available for logistic regression.

The most useful parameter to examine model fit of logistic regression is deviance. Deviance \(D\) aaccounts for how much does the model deviates from a model with perfect fit. The smaller the deviance, the better the model. Deviance is calculated as:

\[D = -2\mathcal{l}\left( \mathcal{\theta} \right)\]

The McFadden pseudo-R squared allows obtaining a fit parameter similar to linear regression’s R squared comparing the deviance of the model with the deviance of the null model \(D_{null}\). The null model does not have predictors, so it is supposed to have the worse value of fit. The formula of McFadden pseudo-R squared is:

\[R^2_{pseudo} = 1 - \frac{D}{D_{null}} = 1 - \frac{\mathcal{l}}{\mathcal{l}_{null}}\]

Note that pseudo-R squared makes sense only if log likelihood is negative.

We can obtain fit indices for our model using the glance function of broom. We can easily add the pseudo-R squared usign deviance values.

glance(mod1) %>%
  mutate(pseudo.rsq = 1 - deviance/null.deviance) %>%
## Rows: 1
## Columns: 9
## $ null.deviance <dbl> 190.9543
## $ df.null       <int> 149
## $ logLik        <dbl> -11.8859
## $ AIC           <dbl> 31.7718
## $ BIC           <dbl> 43.81434
## $ deviance      <dbl> 23.7718
## $ df.residual   <int> 146
## $ nobs          <int> 150
## $ pseudo.rsq    <dbl> 0.8755105


We can use logistic regression as a predictive model for classification. We use a threshold value of probability to assign each observation to each class. Here we will assign to class virginica observations with predicted probability higher than 0.5:

pred_iris <- augment(mod1) %>%
  mutate(prob = 1/(1 + exp(-.fitted))) %>%
  mutate(pred_virginica = ifelse(prob > 0.5, 1, 0))

Once we have predicted the class, we can obtain the confusion matrix with the conf_mat function of tidymodel’s yardstick package:

pred_iris %>%
  mutate(is_virginica = as.factor(is_virginica), pred_virginica = as.factor(pred_virginica)) %>%
conf_mat(truth = is_virginica, 
         estimate = pred_virginica) 
##           Truth
## Prediction  0  1
##          0 97  2
##          1  3 48

We observe that mod1 has a good performance as classifier for this dataset.

Explaining and predicting with logistic regression

In this post, we have introduced logistic regression as a generalization of linear regression to estimate the probability of an observation being equal to one when the dependent variable is binary. Logistic regression models are estimated maximizing log likelihood.

Logistic regression can be used as a explanatory model of the antecedents of the dependent variable. This is the most common use of logistic regression in econometrics. We examine the significance of model coefficients to find antecedents of the binary dependent variable.

Another use of logistic regression is as a predictive model. When we assign a category to each observation based on predicted probability, we have a classification model. Many classification techniques of machine learning, like neural networks, can be seen as extensions of logistic regression.


Built with R 4.1.0, tidyverse 1.3.1, broom 0.7.7 and yardstick 0.0.8