When reporting correlation matrices, it is sometimes required to specify the significance level, that is, an upper bound of the p-value, the probability of rejecting the null hypothesis that the population correlation is zero when the null hypothesis is true. As I discussed here, p-values for correlations depend on sample size.
The most common convention for indicating significance levels is:
*p < 0.05**p < 0.01***p < 0.001
p-values above 0.05 are considered usually as non-significant. These values are a convention, and other thresholds of significance can be adopted.
Although R base has a cor.test function that performs hypothesis testing for a single correlation, there is no standard way of obtaining a correlation matrix with significance levels similar to the delivered by SPSS and other software. Some functions performing this task can be found at the end of this post. As an alternative, here I present a cor_stars function that relies on tidyverse packages:
library(dplyr)
library(purrr)
library(tidyr)
The function takes as input a data frame dt of multivariate data and the number of digits of the correlation matrix. Then, it proceeds as follows:
- Builds a data frame
dfwith columnsxandycontaining the names of columns ofdfcorresponding to a lower triangular correlation matrix in long format. - uses
purrr:map2_dbl()to calculate correlationsrand p-valuepwithcorandcor.test, respectively. dplyr::case_when()assigns asterisks orstarsto each correlation depending on the p-value. It is important to assign the same number of characters to each level of significance.- using
formatwithdigitsandnsmallthe function builds a formatted correlation variablefrwith the same length in all cases, irrespective of the number of decimals of the actual correlation. - formatted correlation and stars are bound with
paste0in thecsvariable. - the correlation matrix
r_tabare the starred correlations presented in wide format. It is built withtidyr::pivot_widerandtidyr::replace_na.
cor_stars <- function(dt, digits = 3){
names <- names(dt)
n <- length(names)
x <- character(0)
for(i in 1:n) x <- c(x, rep(names[i], n-i))
y <- character(0)
for(i in 1:(n-1)) y <- c(y, names[(i+1):n])
df <- tibble(x=x, y=y)
df <- df %>%
mutate(r = map2_dbl(.x = x, .y = y, ~ cor(dt %>% pull(.x), dt %>% pull(.y))),
p = map2_dbl(.x = x, .y = y, ~ cor.test(dt %>% pull(.x), dt %>% pull(.y))$p.value))
df <- df %>%
mutate(stars = case_when(p < 0.05 & p >= 0.01 ~ "* ",
p < 0.01 & p >= 0.001 ~ "** ",
p < 0.001 ~ "***",
is.na(r) ~ " ",
TRUE ~ " "),
fr = format(r, digits = digits-1, nsmall = digits-1))
df <- df %>%
mutate(cs = paste0(fr, stars))
r_tab <- df %>%
select(x, y, cs) %>%
pivot_wider(id_cols = y, names_from = x, values_from = cs)
r_tab <- r_tab %>%
mutate(across(everything(), ~ replace_na(.x, ""))) %>%
rename("term" = "y")
r_tab <- data.frame(r_tab)
return(r_tab)
}
Here is the default correlation table for the mtcars dataset:
cor_stars(mtcars)
## term mpg cyl disp hp drat wt qsec
## 1 cyl -0.852***
## 2 disp -0.848*** 0.902***
## 3 hp -0.776*** 0.832*** 0.791***
## 4 drat 0.681*** -0.700*** -0.710*** -0.449**
## 5 wt -0.868*** 0.782*** 0.888*** 0.659*** -0.712***
## 6 qsec 0.419* -0.591*** -0.434* -0.708*** 0.091 -0.175
## 7 vs 0.664*** -0.811*** -0.710*** -0.723*** 0.440* -0.555*** 0.745***
## 8 am 0.600*** -0.523** -0.591*** -0.243 0.713*** -0.692*** -0.230
## 9 gear 0.480** -0.493** -0.556*** -0.126 0.700*** -0.583*** -0.213
## 10 carb -0.551** 0.527** 0.395* 0.750*** -0.091 0.428* -0.656***
## vs am gear
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8 0.168
## 9 0.206 0.794***
## 10 -0.570*** 0.058 0.274
And the correlation table with two digits for the same dataset:
cor_stars(mtcars, digits = 2)
## term mpg cyl disp hp drat wt qsec vs
## 1 cyl -0.85***
## 2 disp -0.85*** 0.90***
## 3 hp -0.78*** 0.83*** 0.79***
## 4 drat 0.68*** -0.70*** -0.71*** -0.45**
## 5 wt -0.87*** 0.78*** 0.89*** 0.66*** -0.71***
## 6 qsec 0.42* -0.59*** -0.43* -0.71*** 0.09 -0.17
## 7 vs 0.66*** -0.81*** -0.71*** -0.72*** 0.44* -0.55*** 0.74***
## 8 am 0.60*** -0.52** -0.59*** -0.24 0.71*** -0.69*** -0.23 0.17
## 9 gear 0.48** -0.49** -0.56*** -0.13 0.70*** -0.58*** -0.21 0.21
## 10 carb -0.55** 0.53** 0.39* 0.75*** -0.09 0.43* -0.66*** -0.57***
## am gear
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9 0.79***
## 10 0.06 0.27
If we are presenting an html document we can use the kableExtra package:
library(kableExtra)
cor_stars(mtcars) %>%
kbl() %>%
kable_classic_2(full_width = F)
| term | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear |
|---|---|---|---|---|---|---|---|---|---|---|
| cyl | -0.852*** | |||||||||
| disp | -0.848*** | 0.902*** | ||||||||
| hp | -0.776*** | 0.832*** | 0.791*** | |||||||
| drat | 0.681*** | -0.700*** | -0.710*** | -0.449** | ||||||
| wt | -0.868*** | 0.782*** | 0.888*** | 0.659*** | -0.712*** | |||||
| qsec | 0.419* | -0.591*** | -0.434* | -0.708*** | 0.091 | -0.175 | ||||
| vs | 0.664*** | -0.811*** | -0.710*** | -0.723*** | 0.440* | -0.555*** | 0.745*** | |||
| am | 0.600*** | -0.523** | -0.591*** | -0.243 | 0.713*** | -0.692*** | -0.230 | 0.168 | ||
| gear | 0.480** | -0.493** | -0.556*** | -0.126 | 0.700*** | -0.583*** | -0.213 | 0.206 | 0.794*** | |
| carb | -0.551** | 0.527** | 0.395* | 0.750*** | -0.091 | 0.428* | -0.656*** | -0.570*** | 0.058 | 0.274 |
This is just another way of building a starred correlation matrix in R. The output is a data frame (a tibble does not yield a good output for this type of data) that can be displayed in plain text or in am html table. We can recycle most of that code to build a LaTeX table using the xtable package.
References
- Covariance and Pearson correlation in R https://jmsallan.netlify.app/blog/covariance-and-pearson-correlation-in-r/
- Elegant correlation table using xtable R package http://www.sthda.com/english/wiki/elegant-correlation-table-using-xtable-r-package
- Correlation table with significance indicators https://gist.github.com/aL3xa/887249
- Create Awesome HTML Tables with
knitr::kableandkableExtrahttps://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
Session info
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 10 (buster)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
##
## locale:
## [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.3.4 tidyr_1.1.4 purrr_0.3.4 dplyr_1.0.7
##
## loaded via a namespace (and not attached):
## [1] highr_0.9 pillar_1.6.4 bslib_0.2.5.1 compiler_4.1.2
## [5] jquerylib_0.1.4 tools_4.1.2 digest_0.6.27 viridisLite_0.4.0
## [9] jsonlite_1.7.2 evaluate_0.14 lifecycle_1.0.0 tibble_3.1.5
## [13] pkgconfig_2.0.3 rlang_0.4.12 DBI_1.1.1 rstudioapi_0.13
## [17] yaml_2.2.1 blogdown_1.5 xfun_0.23 stringr_1.4.0
## [21] httr_1.4.2 knitr_1.33 xml2_1.3.2 systemfonts_1.0.2
## [25] generics_0.1.0 vctrs_0.3.8 sass_0.4.0 webshot_0.5.2
## [29] tidyselect_1.1.1 svglite_2.0.0 glue_1.4.2 R6_2.5.0
## [33] fansi_0.5.0 rmarkdown_2.9 bookdown_0.24 magrittr_2.0.1
## [37] scales_1.1.1 ellipsis_0.3.2 htmltools_0.5.1.1 assertthat_0.2.1
## [41] rvest_1.0.2 colorspace_2.0-1 utf8_1.2.1 stringi_1.7.3
## [45] munsell_0.5.0 crayon_1.4.1