When reporting correlation matrices, it is sometimes required to specify the significance level, that is, an upper bound of the p-value, the probability of rejecting the null hypothesis that the population correlation is zero when the null hypothesis is true. As I discussed here, p-values for correlations depend on sample size.
The most common convention for indicating significance levels is:
*
p < 0.05**
p < 0.01***
p < 0.001
p-values above 0.05 are considered usually as non-significant. These values are a convention, and other thresholds of significance can be adopted.
Although R base has a cor.test
function that performs hypothesis testing for a single correlation, there is no standard way of obtaining a correlation matrix with significance levels similar to the delivered by SPSS and other software. Some functions performing this task can be found at the end of this post. As an alternative, here I present a cor_stars
function that relies on tidyverse
packages:
library(dplyr)
library(purrr)
library(tidyr)
The function takes as input a data frame dt
of multivariate data and the number of digits
of the correlation matrix. Then, it proceeds as follows:
- Builds a data frame
df
with columnsx
andy
containing the names of columns ofdf
corresponding to a lower triangular correlation matrix in long format. - uses
purrr:map2_dbl()
to calculate correlationsr
and p-valuep
withcor
andcor.test
, respectively. dplyr::case_when()
assigns asterisks orstars
to each correlation depending on the p-value. It is important to assign the same number of characters to each level of significance.- using
format
withdigits
andnsmall
the function builds a formatted correlation variablefr
with the same length in all cases, irrespective of the number of decimals of the actual correlation. - formatted correlation and stars are bound with
paste0
in thecs
variable. - the correlation matrix
r_tab
are the starred correlations presented in wide format. It is built withtidyr::pivot_wider
andtidyr::replace_na
.
cor_stars <- function(dt, digits = 3){
names <- names(dt)
n <- length(names)
x <- character(0)
for(i in 1:n) x <- c(x, rep(names[i], n-i))
y <- character(0)
for(i in 1:(n-1)) y <- c(y, names[(i+1):n])
df <- tibble(x=x, y=y)
df <- df %>%
mutate(r = map2_dbl(.x = x, .y = y, ~ cor(dt %>% pull(.x), dt %>% pull(.y))),
p = map2_dbl(.x = x, .y = y, ~ cor.test(dt %>% pull(.x), dt %>% pull(.y))$p.value))
df <- df %>%
mutate(stars = case_when(p < 0.05 & p >= 0.01 ~ "* ",
p < 0.01 & p >= 0.001 ~ "** ",
p < 0.001 ~ "***",
is.na(r) ~ " ",
TRUE ~ " "),
fr = format(r, digits = digits-1, nsmall = digits-1))
df <- df %>%
mutate(cs = paste0(fr, stars))
r_tab <- df %>%
select(x, y, cs) %>%
pivot_wider(id_cols = y, names_from = x, values_from = cs)
r_tab <- r_tab %>%
mutate(across(everything(), ~ replace_na(.x, ""))) %>%
rename("term" = "y")
r_tab <- data.frame(r_tab)
return(r_tab)
}
Here is the default correlation table for the mtcars
dataset:
cor_stars(mtcars)
## term mpg cyl disp hp drat wt qsec
## 1 cyl -0.852***
## 2 disp -0.848*** 0.902***
## 3 hp -0.776*** 0.832*** 0.791***
## 4 drat 0.681*** -0.700*** -0.710*** -0.449**
## 5 wt -0.868*** 0.782*** 0.888*** 0.659*** -0.712***
## 6 qsec 0.419* -0.591*** -0.434* -0.708*** 0.091 -0.175
## 7 vs 0.664*** -0.811*** -0.710*** -0.723*** 0.440* -0.555*** 0.745***
## 8 am 0.600*** -0.523** -0.591*** -0.243 0.713*** -0.692*** -0.230
## 9 gear 0.480** -0.493** -0.556*** -0.126 0.700*** -0.583*** -0.213
## 10 carb -0.551** 0.527** 0.395* 0.750*** -0.091 0.428* -0.656***
## vs am gear
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8 0.168
## 9 0.206 0.794***
## 10 -0.570*** 0.058 0.274
And the correlation table with two digits for the same dataset:
cor_stars(mtcars, digits = 2)
## term mpg cyl disp hp drat wt qsec vs
## 1 cyl -0.85***
## 2 disp -0.85*** 0.90***
## 3 hp -0.78*** 0.83*** 0.79***
## 4 drat 0.68*** -0.70*** -0.71*** -0.45**
## 5 wt -0.87*** 0.78*** 0.89*** 0.66*** -0.71***
## 6 qsec 0.42* -0.59*** -0.43* -0.71*** 0.09 -0.17
## 7 vs 0.66*** -0.81*** -0.71*** -0.72*** 0.44* -0.55*** 0.74***
## 8 am 0.60*** -0.52** -0.59*** -0.24 0.71*** -0.69*** -0.23 0.17
## 9 gear 0.48** -0.49** -0.56*** -0.13 0.70*** -0.58*** -0.21 0.21
## 10 carb -0.55** 0.53** 0.39* 0.75*** -0.09 0.43* -0.66*** -0.57***
## am gear
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9 0.79***
## 10 0.06 0.27
If we are presenting an html document we can use the kableExtra
package:
library(kableExtra)
cor_stars(mtcars) %>%
kbl() %>%
kable_classic_2(full_width = F)
term | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear |
---|---|---|---|---|---|---|---|---|---|---|
cyl | -0.852*** | |||||||||
disp | -0.848*** | 0.902*** | ||||||||
hp | -0.776*** | 0.832*** | 0.791*** | |||||||
drat | 0.681*** | -0.700*** | -0.710*** | -0.449** | ||||||
wt | -0.868*** | 0.782*** | 0.888*** | 0.659*** | -0.712*** | |||||
qsec | 0.419* | -0.591*** | -0.434* | -0.708*** | 0.091 | -0.175 | ||||
vs | 0.664*** | -0.811*** | -0.710*** | -0.723*** | 0.440* | -0.555*** | 0.745*** | |||
am | 0.600*** | -0.523** | -0.591*** | -0.243 | 0.713*** | -0.692*** | -0.230 | 0.168 | ||
gear | 0.480** | -0.493** | -0.556*** | -0.126 | 0.700*** | -0.583*** | -0.213 | 0.206 | 0.794*** | |
carb | -0.551** | 0.527** | 0.395* | 0.750*** | -0.091 | 0.428* | -0.656*** | -0.570*** | 0.058 | 0.274 |
This is just another way of building a starred correlation matrix in R. The output is a data frame (a tibble does not yield a good output for this type of data) that can be displayed in plain text or in am html table. We can recycle most of that code to build a LaTeX table using the xtable
package.
References
- Covariance and Pearson correlation in R https://jmsallan.netlify.app/blog/covariance-and-pearson-correlation-in-r/
- Elegant correlation table using xtable R package http://www.sthda.com/english/wiki/elegant-correlation-table-using-xtable-r-package
- Correlation table with significance indicators https://gist.github.com/aL3xa/887249
- Create Awesome HTML Tables with
knitr::kable
andkableExtra
https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html
Session info
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 10 (buster)
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
##
## locale:
## [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.3.4 tidyr_1.1.4 purrr_0.3.4 dplyr_1.0.7
##
## loaded via a namespace (and not attached):
## [1] highr_0.9 pillar_1.6.4 bslib_0.2.5.1 compiler_4.1.2
## [5] jquerylib_0.1.4 tools_4.1.2 digest_0.6.27 viridisLite_0.4.0
## [9] jsonlite_1.7.2 evaluate_0.14 lifecycle_1.0.0 tibble_3.1.5
## [13] pkgconfig_2.0.3 rlang_0.4.12 DBI_1.1.1 rstudioapi_0.13
## [17] yaml_2.2.1 blogdown_1.5 xfun_0.23 stringr_1.4.0
## [21] httr_1.4.2 knitr_1.33 xml2_1.3.2 systemfonts_1.0.2
## [25] generics_0.1.0 vctrs_0.3.8 sass_0.4.0 webshot_0.5.2
## [29] tidyselect_1.1.1 svglite_2.0.0 glue_1.4.2 R6_2.5.0
## [33] fansi_0.5.0 rmarkdown_2.9 bookdown_0.24 magrittr_2.0.1
## [37] scales_1.1.1 ellipsis_0.3.2 htmltools_0.5.1.1 assertthat_0.2.1
## [41] rvest_1.0.2 colorspace_2.0-1 utf8_1.2.1 stringi_1.7.3
## [45] munsell_0.5.0 crayon_1.4.1