Just another starred correlation function in R

Jose M Sallan 2022-01-03 6 min read

When reporting correlation matrices, it is sometimes required to specify the significance level, that is, an upper bound of the p-value, the probability of rejecting the null hypothesis that the population correlation is zero when the null hypothesis is true. As I discussed here, p-values for correlations depend on sample size.

The most common convention for indicating significance levels is:

* p < 0.05
** p < 0.01
*** p < 0.001

p-values above 0.05 are considered usually as non-significant. These values are a convention, and other thresholds of significance can be adopted.

Although R base has a cor.test function that performs hypothesis testing for a single correlation, there is no standard way of obtaining a correlation matrix with significance levels similar to the delivered by SPSS and other software. Some functions performing this task can be found at the end of this post. As an alternative, here I present a cor_stars function that relies on tidyverse packages:

library(dplyr)
library(purrr)
library(tidyr)

The function takes as input a data frame dt of multivariate data and the number of digits of the correlation matrix. Then, it proceeds as follows:

Builds a data frame df with columns x and y containing the names of columns of df corresponding to a lower triangular correlation matrix in long format.
uses purrr:map2_dbl() to calculate correlations r and p-value p with cor and cor.test, respectively.
dplyr::case_when() assigns asterisks or stars to each correlation depending on the p-value. It is important to assign the same number of characters to each level of significance.
using format with digits and nsmall the function builds a formatted correlation variable fr with the same length in all cases, irrespective of the number of decimals of the actual correlation.
formatted correlation and stars are bound with paste0 in the cs variable.
the correlation matrix r_tab are the starred correlations presented in wide format. It is built with tidyr::pivot_wider and tidyr::replace_na.

cor_stars <- function(dt, digits = 3){
  
  names <- names(dt)
  n <- length(names)
  
  x <- character(0)
  for(i in 1:n) x <- c(x, rep(names[i], n-i))
  
  y <- character(0)
  for(i in 1:(n-1)) y <- c(y, names[(i+1):n])
  
  df <- tibble(x=x, y=y)
  
  df <- df %>%
    mutate(r = map2_dbl(.x = x, .y = y, ~ cor(dt %>% pull(.x), dt %>% pull(.y))),
           p =  map2_dbl(.x = x, .y = y, ~ cor.test(dt %>% pull(.x), dt %>% pull(.y))$p.value))
   
  df <- df %>%
    mutate(stars = case_when(p < 0.05 & p >= 0.01 ~ "*  ",
                             p < 0.01 & p >= 0.001 ~ "** ",
                             p < 0.001 ~ "***",
                             is.na(r) ~ "   ",
                             TRUE ~ "   "),
           fr = format(r, digits = digits-1, nsmall = digits-1))
  
  df <- df %>%
    mutate(cs = paste0(fr, stars))
  
  r_tab <- df %>%
    select(x, y, cs) %>%
    pivot_wider(id_cols = y, names_from = x, values_from = cs)
  
  r_tab <- r_tab %>%
    mutate(across(everything(), ~ replace_na(.x, ""))) %>%
    rename("term" = "y")
  
  
  r_tab <- data.frame(r_tab)
  
  return(r_tab)
  
}

Here is the default correlation table for the mtcars dataset:

cor_stars(mtcars)

##    term       mpg       cyl      disp        hp      drat        wt      qsec
## 1   cyl -0.852***                                                            
## 2  disp -0.848***  0.902***                                                  
## 3    hp -0.776***  0.832***  0.791***                                        
## 4  drat  0.681*** -0.700*** -0.710*** -0.449**                               
## 5    wt -0.868***  0.782***  0.888***  0.659*** -0.712***                    
## 6  qsec  0.419*   -0.591*** -0.434*   -0.708***  0.091    -0.175             
## 7    vs  0.664*** -0.811*** -0.710*** -0.723***  0.440*   -0.555***  0.745***
## 8    am  0.600*** -0.523**  -0.591*** -0.243     0.713*** -0.692*** -0.230   
## 9  gear  0.480**  -0.493**  -0.556*** -0.126     0.700*** -0.583*** -0.213   
## 10 carb -0.551**   0.527**   0.395*    0.750*** -0.091     0.428*   -0.656***
##           vs        am      gear
## 1                               
## 2                               
## 3                               
## 4                               
## 5                               
## 6                               
## 7                               
## 8   0.168                       
## 9   0.206     0.794***          
## 10 -0.570***  0.058     0.274

And the correlation table with two digits for the same dataset:

cor_stars(mtcars, digits = 2)

##    term      mpg      cyl     disp       hp     drat       wt     qsec       vs
## 1   cyl -0.85***                                                               
## 2  disp -0.85***  0.90***                                                      
## 3    hp -0.78***  0.83***  0.79***                                             
## 4  drat  0.68*** -0.70*** -0.71*** -0.45**                                     
## 5    wt -0.87***  0.78***  0.89***  0.66*** -0.71***                           
## 6  qsec  0.42*   -0.59*** -0.43*   -0.71***  0.09    -0.17                     
## 7    vs  0.66*** -0.81*** -0.71*** -0.72***  0.44*   -0.55***  0.74***         
## 8    am  0.60*** -0.52**  -0.59*** -0.24     0.71*** -0.69*** -0.23     0.17   
## 9  gear  0.48**  -0.49**  -0.56*** -0.13     0.70*** -0.58*** -0.21     0.21   
## 10 carb -0.55**   0.53**   0.39*    0.75*** -0.09     0.43*   -0.66*** -0.57***
##          am     gear
## 1                   
## 2                   
## 3                   
## 4                   
## 5                   
## 6                   
## 7                   
## 8                   
## 9   0.79***         
## 10  0.06     0.27

If we are presenting an html document we can use the kableExtra package:

library(kableExtra)
cor_stars(mtcars) %>%
  kbl() %>%
  kable_classic_2(full_width = F)

term	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear
cyl	-0.852***
disp	-0.848***	0.902***
hp	-0.776***	0.832***	0.791***
drat	0.681***	-0.700***	-0.710***	-0.449**
wt	-0.868***	0.782***	0.888***	0.659***	-0.712***
qsec	0.419*	-0.591***	-0.434*	-0.708***	0.091	-0.175
vs	0.664***	-0.811***	-0.710***	-0.723***	0.440*	-0.555***	0.745***
am	0.600***	-0.523**	-0.591***	-0.243	0.713***	-0.692***	-0.230	0.168
gear	0.480**	-0.493**	-0.556***	-0.126	0.700***	-0.583***	-0.213	0.206	0.794***
carb	-0.551**	0.527**	0.395*	0.750***	-0.091	0.428*	-0.656***	-0.570***	0.058	0.274

This is just another way of building a starred correlation matrix in R. The output is a data frame (a tibble does not yield a good output for this type of data) that can be displayed in plain text or in am html table. We can recycle most of that code to build a LaTeX table using the xtable package.

References

Covariance and Pearson correlation in R https://jmsallan.netlify.app/blog/covariance-and-pearson-correlation-in-r/
Elegant correlation table using xtable R package http://www.sthda.com/english/wiki/elegant-correlation-table-using-xtable-r-package
Correlation table with significance indicators https://gist.github.com/aL3xa/887249
Create Awesome HTML Tables with knitr::kable and kableExtra https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html

Session info

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 10 (buster)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] kableExtra_1.3.4 tidyr_1.1.4      purrr_0.3.4      dplyr_1.0.7     
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.9         pillar_1.6.4      bslib_0.2.5.1     compiler_4.1.2   
##  [5] jquerylib_0.1.4   tools_4.1.2       digest_0.6.27     viridisLite_0.4.0
##  [9] jsonlite_1.7.2    evaluate_0.14     lifecycle_1.0.0   tibble_3.1.5     
## [13] pkgconfig_2.0.3   rlang_0.4.12      DBI_1.1.1         rstudioapi_0.13  
## [17] yaml_2.2.1        blogdown_1.5      xfun_0.23         stringr_1.4.0    
## [21] httr_1.4.2        knitr_1.33        xml2_1.3.2        systemfonts_1.0.2
## [25] generics_0.1.0    vctrs_0.3.8       sass_0.4.0        webshot_0.5.2    
## [29] tidyselect_1.1.1  svglite_2.0.0     glue_1.4.2        R6_2.5.0         
## [33] fansi_0.5.0       rmarkdown_2.9     bookdown_0.24     magrittr_2.0.1   
## [37] scales_1.1.1      ellipsis_0.3.2    htmltools_0.5.1.1 assertthat_0.2.1 
## [41] rvest_1.0.2       colorspace_2.0-1  utf8_1.2.1        stringi_1.7.3    
## [45] munsell_0.5.0     crayon_1.4.1

The Jose M Sallan static website

Just another starred correlation function in R

References

Session info