Regularized regression with glmnet

Jose M Sallan 2022-06-27 8 min read

In this post, I will introduce regularized regression, and then use the glmnet package to evaluate a regularized regression model on the InsuranceCharges dataset.

InsuranceCharges contains several features of individuals, such as age, physical/family condition and location, together with their current medical expenses. We intend to predict the future medical expenses of individuals, to help medical insurance companies decide on the premium to charge. Those expenses are stored in the charges variable.

The data are embedded in the BAdatasets package. I’ll also load recipes for data preprocessing, yardstick for performance metrics and glmnet for regularized regression.

library(recipes)
library(yardstick)
library(BAdatasets)
library(glmnet)

I have performed some transformations of the dataset using the recipes package:

  • replace the continuous variable bmi by a dummy variable bmi30, equal to one for individuals with BMI larger than 30 and zero otherwise.
  • add a quadratic term for age.
  • transform all factors (nominal variables) into dummies using one-hot encoding, and then remove one of the dummies of each factor.
  • add an interaction term between bmi30 and smoker_yes.

rec <- InsuranceCharges %>%
  recipe(charges ~ .) %>%
  step_mutate(bmi30 = ifelse(bmi <= 30, 0, 1)) %>%
  step_rm(bmi) %>%
  step_poly(age, degree = 2) %>%
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  step_rm(sex_male, smoker_no, region_northeast) %>%
  step_interact(terms = ~ bmi30:smoker_yes)

The transformed data are in the insurance data frame:

insurance <- rec %>%
  prep() %>%
  juice()

The result is a new dataset with the target variable charges and 10 predictors:

insurance %>%
  glimpse()
## Rows: 1,338
## Columns: 11
## $ children           <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, …
## $ charges            <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855,…
## $ bmi30              <dbl> 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, …
## $ age_poly_1         <dbl> -0.039333409, -0.041279930, -0.021814716, -0.012082…
## $ age_poly_2         <dbl> 0.036256381, 0.043000157, -0.010053464, -0.02459349…
## $ sex_female         <dbl> 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, …
## $ smoker_yes         <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, …
## $ region_northwest   <dbl> 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ region_southeast   <dbl> 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ region_southwest   <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, …
## $ bmi30_x_smoker_yes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …

Regularized regression

In datasets with many features, using all of them can lead to worse predictive performance than using only a subset. Choosing which features to include to get better accuracy is the feature selection problem. There are two classical techniques of feature selection:

  • Best subsets, which examines the fit of all possible subsets of features and chooses the best one. This can be computationally expensive, so heuristics such as simulated annealing are sometimes used to search for that subset.
  • Stepwise regression, which adds or removes one feature at each step until a satisfactory model is found (see the sketch after this list).
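
As a quick illustration, a stepwise selection with base R’s step() could look like this, using AIC (the default criterion) on the transformed insurance data built above:

# OLS model with all the predictors of the transformed dataset
full_model <- lm(charges ~ ., data = insurance)
# stepwise selection, adding or removing one feature at each step
step_model <- step(full_model, direction = "both", trace = 0)
summary(step_model)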

An alternative to these strategies is regularization, in which we limit the total size of the regression coefficients, so that the coefficients of some variables shrink or go to zero.

In the ordinary least squares (OLS) approach to linear regression, the regression coefficients are those that minimize the sum of squared errors:

\[ \displaystyle\sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 \]

Let’s examine two regularization approaches to OLS, lasso and ridge regression.

The name lasso stands for least absolute shrinkage and selection operator.

The approach of lasso is to force the regression coefficients to shrink, by requiring the sum of their absolute values (the 1-norm) to be smaller than a value \(t\):

\[ \displaystyle\sum_{j=1}^p |\beta_j| = \lVert \beta \rVert_1 \leq t\]

This is equivalent to minimizing:

\[ \displaystyle\sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 + \lambda \lVert \beta \rVert_1 \]

Lasso tends to set to zero some regression coefficients, so it works as an automated feature selection tool. As it uses the 1-norm of regression coefficients, lasso is also called L1 regularization.

Ridge regression uses a strategy analogous to lasso regression, but bounding the 2-norm (the square root of the sum of squares) of the vector of regression coefficients by a value \(t\):

\[ \sqrt{\displaystyle\sum_{j=1}^p \beta_j^2} = \lVert \beta \rVert_2 \leq t \]

This is equivalent to minimizing:

\[ \displaystyle\sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 + \lambda \lVert \beta \rVert_2^2 \]

As ridge regression uses the 2-norm, we call it L2 regularization.

Ridge regression can improve prediction accuracy by shrinking the coefficients of all variables, although it does not set any of them exactly to zero.

We can mix both regularizations by introducing a mixture parameter \(\alpha\), which ranges from \(\alpha = 0\) for ridge regression to \(\alpha = 1\) for lasso regression. These models are called elastic nets.

\[ \displaystyle\sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 + \lambda\left( \left(1 - \alpha \right) \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right) \]

The results of all these models depend on the \(\lambda\) parameter. As regularized regression is used mainly for prediction, we choose the value of \(\lambda\) that optimizes a prediction metric, usually the mean squared error. Let’s see how we can do that using glmnet.
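
As a preview of the glmnet interface, an elastic net with an intermediate mixture could be fitted like this (alpha = 0.5 is an arbitrary choice, and x and y stand for the predictor matrix and response vector that we will build for our dataset in the next section):

# elastic net: any alpha between 0 (ridge) and 1 (lasso)
enet <- glmnet(x = x, y = y, alpha = 0.5)
# cross-validation to choose lambda for this mixture
cv_enet <- cv.glmnet(x = x, y = y, alpha = 0.5)
cv_enet$lambda.min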

A ridge regression model

glmnet requires the inputs in matrix format, which is why I have generated dummies for the categorical variables. I obtain the inputs to glmnet with:

dep <- insurance %>%
  pull(charges)

vars <- insurance %>%
  select(-charges) %>%
  as.matrix()

Let’s obtain the ridge model:

ridge <- glmnet(x = vars, y = dep, alpha = 0)

glmnet has picked a range of values of \(\lambda\) and obtained the model for each. Let’s plot the evolution of coefficients with \(\lambda\):

plot(ridge, xvar = "lambda")

For high values of \(\lambda\), the coefficients of all variables shrink towards zero, while small values of \(\lambda\) result in coefficients similar to OLS. We need to select the value of \(\lambda\) that minimizes a prediction performance metric; cv.glmnet does that using cross-validation.

cv_ridge <- cv.glmnet(x = vars,
                      y = dep,
                      alpha = 0)

Here are the results as a function of \(\lambda\):

plot(cv_ridge)

We get two values of \(\lambda\): lambda.min, which minimizes the cross-validated error, and lambda.1se, the largest \(\lambda\) whose error is within one standard error of the minimum, which gives a similar fit with more shrinkage of the coefficients. We obtain the values predicted with lambda.min doing:

pred_lambdamin_ridge <- predict(cv_ridge, 
                          s = cv_ridge$lambda.min, 
                          newx = vars)[, 1]
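
To see how the two choices of \(\lambda\) differ, we can inspect the ridge coefficients at each value with coef() applied to the cv.glmnet object:

# ridge coefficients at the two cross-validated choices of lambda
coef(cv_ridge, s = "lambda.min")
coef(cv_ridge, s = "lambda.1se")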

Lasso regression

To fit a lasso regression we set alpha = 1:

lasso <- glmnet(x = vars, y = dep, alpha = 1)

Let’s plot the evolution of coefficients:

plot(lasso, xvar = "lambda")

Lasso sets the coefficients of some predictors to zero as \(\lambda\) increases, so it can be considered an automated feature selector. The number of selected variables for each \(\lambda\) is shown on the axis above the plot. Let’s select \(\lambda\) with cross-validation and plot the result:

cv_lasso <- cv.glmnet(x = vars,
                      y = dep,
                      alpha = 1,
                      penalty.factor = rep(1, 10))
plot(cv_lasso)

Finally, let’s obtain the predictions using lambda.min, as we did for ridge regression.

pred_lambdamin_lasso <- predict(cv_lasso, 
                          s = cv_lasso$lambda.min, 
                          newx = vars)[, 1]
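
As lasso can set coefficients exactly to zero, we can check how many predictors remain in the model at lambda.min (the exact count may vary with the cross-validation folds):

# lasso coefficients at lambda.min, converted to a regular matrix
lasso_coefs <- as.matrix(coef(cv_lasso, s = "lambda.min"))
# number of predictors with nonzero coefficient (excluding the intercept)
sum(lasso_coefs[-1, 1] != 0)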

Model performance

I will use yardstick to evaluate the performance of the lasso and ridge regressions. Let’s define a metric_set for numerical prediction.

np_metrics <- metric_set(rmse, mae, rsq)

Let’s store observed values and predictions in a data frame.

predictions <- tibble(real = dep,
                      ridge = pred_lambdamin_ridge,
                      lasso = pred_lambdamin_lasso) 

The performance of ridge regression:

predictions %>% 
  np_metrics(truth = real, estimate = ridge)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    4458.   
## 2 mae     standard    2574.   
## 3 rsq     standard       0.867

And the performance of lasso regression:

predictions %>% 
  np_metrics(truth = real, estimate = lasso)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    4423.   
## 2 mae     standard    2467.   
## 3 rsq     standard       0.867

Both models have quite similar performance in this case.
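
As a complementary check, we can also compare the cross-validated mean squared error of both models at lambda.min, stored in the cvm component of the cv.glmnet objects:

# minimum cross-validated MSE of each model (attained at lambda.min)
min(cv_ridge$cvm)
min(cv_lasso$cvm)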


Session info

## R version 4.2.0 (2022-04-22)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19.2
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] glmnet_4.1-4     Matrix_1.4-1     BAdatasets_0.1.0 yardstick_0.0.9 
## [5] recipes_0.2.0    dplyr_1.0.9     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3       lubridate_1.8.0    lattice_0.20-45    tidyr_1.2.0       
##  [5] listenv_0.8.0      class_7.3-20       assertthat_0.2.1   digest_0.6.29     
##  [9] ipred_0.9-12       foreach_1.5.2      utf8_1.2.2         parallelly_1.31.1 
## [13] R6_2.5.1           plyr_1.8.7         hardhat_0.2.0      evaluate_0.15     
## [17] highr_0.9          blogdown_1.9       pillar_1.7.0       rlang_1.0.2       
## [21] rstudioapi_0.13    jquerylib_0.1.4    rpart_4.1.16       rmarkdown_2.14    
## [25] splines_4.2.0      gower_1.0.0        stringr_1.4.0      compiler_4.2.0    
## [29] xfun_0.30          pkgconfig_2.0.3    shape_1.4.6        globals_0.14.0    
## [33] htmltools_0.5.2    nnet_7.3-17        tidyselect_1.1.2   tibble_3.1.6      
## [37] prodlim_2019.11.13 bookdown_0.26      codetools_0.2-18   fansi_1.0.3       
## [41] future_1.25.0      crayon_1.5.1       withr_2.5.0        MASS_7.3-57       
## [45] grid_4.2.0         jsonlite_1.8.0     lifecycle_1.0.1    DBI_1.1.2         
## [49] magrittr_2.0.3     pROC_1.18.0        future.apply_1.9.0 cli_3.3.0         
## [53] stringi_1.7.6      timeDate_3043.102  bslib_0.3.1        ellipsis_0.3.2    
## [57] generics_0.1.2     vctrs_0.4.1        lava_1.6.10        iterators_1.0.14  
## [61] tools_4.2.0        glue_1.6.2         purrr_0.3.4        parallel_4.2.0    
## [65] fastmap_1.1.0      survival_3.2-13    yaml_2.3.5         knitr_1.39        
## [69] sass_0.4.1