Examining correlation matrices in R

Jose M Sallan 2021-12-27

In this post, I will introduce how to examine relationships between variables in a multivariate dataset using covariance and correlation matrices. These matrices are the input of techniques like exploratory and confirmatory factor analysis and structural equation modelling. In this post, I will be using the mtcars dataset, which shows both positive and negative relationships between variables.

In base R, we use cov and cor to obtain the covariance and correlation matrices of a multivariate sample. These functions take a data frame with the observations as input.

c_mat <- cov(mtcars)
r_mat <- cor(mtcars)

The covariance matrix \(\mathbf{S}\) includes the covariances \(s_{ij}\) between all pairs of variables \(\left(i, j\right)\) of the distribution. As \(s_{ij} = s_{ji}\), it is a symmetric matrix. Each diagonal element \(s_{ii}\) is the variance of variable \(i\).

c_mat
##              mpg         cyl        disp          hp         drat          wt
## mpg    36.324103  -9.1723790  -633.09721 -320.732056   2.19506351  -5.1166847
## cyl    -9.172379   3.1895161   199.66028  101.931452  -0.66836694   1.3673710
## disp -633.097208 199.6602823 15360.79983 6721.158669 -47.06401915 107.6842040
## hp   -320.732056 101.9314516  6721.15867 4700.866935 -16.45110887  44.1926613
## drat    2.195064  -0.6683669   -47.06402  -16.451109   0.28588135  -0.3727207
## wt     -5.116685   1.3673710   107.68420   44.192661  -0.37272073   0.9573790
## qsec    4.509149  -1.8868548   -96.05168  -86.770081   0.08714073  -0.3054816
## vs      2.017137  -0.7298387   -44.37762  -24.987903   0.11864919  -0.2736613
## am      1.803931  -0.4657258   -36.56401   -8.320565   0.19015121  -0.3381048
## gear    2.135685  -0.6491935   -50.80262   -6.358871   0.27598790  -0.4210806
## carb   -5.363105   1.5201613    79.06875   83.036290  -0.07840726   0.6757903
##              qsec           vs           am        gear        carb
## mpg    4.50914919   2.01713710   1.80393145   2.1356855 -5.36310484
## cyl   -1.88685484  -0.72983871  -0.46572581  -0.6491935  1.52016129
## disp -96.05168145 -44.37762097 -36.56401210 -50.8026210 79.06875000
## hp   -86.77008065 -24.98790323  -8.32056452  -6.3588710 83.03629032
## drat   0.08714073   0.11864919   0.19015121   0.2759879 -0.07840726
## wt    -0.30548161  -0.27366129  -0.33810484  -0.4210806  0.67579032
## qsec   3.19316613   0.67056452  -0.20495968  -0.2804032 -1.89411290
## vs     0.67056452   0.25403226   0.04233871   0.0766129 -0.46370968
## am    -0.20495968   0.04233871   0.24899194   0.2923387  0.04637097
## gear  -0.28040323   0.07661290   0.29233871   0.5443548  0.32661290
## carb  -1.89411290  -0.46370968   0.04637097   0.3266129  2.60887097
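As a quick check of the two properties of the covariance matrix stated above, here is a minimal base R sketch. It recomputes c_mat so that it runs on its own, then verifies that the matrix is symmetric and that its diagonal holds the variance of each variable.

```r
# Recompute the covariance matrix so the snippet is self-contained
c_mat <- cov(mtcars)

# s_ij = s_ji: the matrix equals its transpose
isSymmetric(c_mat)

# Diagonal elements s_ii are the variances of each variable
all.equal(diag(c_mat), sapply(mtcars, var))
```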

Covariance values depend on the scale of each pair of variables, and thus they are difficult to interpret. That’s why we usually examine the correlation matrix \(\mathbf{R}\) instead. Correlations \(r_{ij}\) are scaled between -1 and +1, and diagonal elements are equal to one.

r_mat
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
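The two matrices are linked by the rescaling \(r_{ij} = s_{ij} / \sqrt{s_{ii} s_{jj}}\). A minimal base R sketch of that relationship follows, first computed by hand and then with cov2cor from the stats package. The names s and r_manual are only illustrative.

```r
c_mat <- cov(mtcars)
r_mat <- cor(mtcars)

# Standard deviations are the square roots of the diagonal of S
s <- sqrt(diag(c_mat))

# Divide each covariance s_ij by the product of standard deviations s_i * s_j
r_manual <- c_mat / outer(s, s)

# Both routes should reproduce the output of cor()
all.equal(r_manual, r_mat)
all.equal(cov2cor(c_mat), r_mat)
```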

Many R packages help to visualize and interpret correlation matrices. Here I will present some functionalities of the corrr and corrplot packages.

library(corrr)
library(corrplot)

The corrr package

The corrr package is a part of the tidymodels ecosystem, and allows manipulating and presenting correlation matrices as data frames. We use correlate to obtain correlations with corrr.

r_df <- correlate(mtcars)

The outcome of correlate is a tibble instead of a matrix. The variable names of the rows are stored in an additional term column. By default, the diagonal values are set to NA.

r_df
## # A tibble: 11 × 12
##    term     mpg    cyl   disp     hp    drat     wt    qsec     vs      am
##    <chr>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
##  1 mpg   NA     -0.852 -0.848 -0.776  0.681  -0.868  0.419   0.664  0.600 
##  2 cyl   -0.852 NA      0.902  0.832 -0.700   0.782 -0.591  -0.811 -0.523 
##  3 disp  -0.848  0.902 NA      0.791 -0.710   0.888 -0.434  -0.710 -0.591 
##  4 hp    -0.776  0.832  0.791 NA     -0.449   0.659 -0.708  -0.723 -0.243 
##  5 drat   0.681 -0.700 -0.710 -0.449 NA      -0.712  0.0912  0.440  0.713 
##  6 wt    -0.868  0.782  0.888  0.659 -0.712  NA     -0.175  -0.555 -0.692 
##  7 qsec   0.419 -0.591 -0.434 -0.708  0.0912 -0.175 NA       0.745 -0.230 
##  8 vs     0.664 -0.811 -0.710 -0.723  0.440  -0.555  0.745  NA      0.168 
##  9 am     0.600 -0.523 -0.591 -0.243  0.713  -0.692 -0.230   0.168 NA     
## 10 gear   0.480 -0.493 -0.556 -0.126  0.700  -0.583 -0.213   0.206  0.794 
## 11 carb  -0.551  0.527  0.395  0.750 -0.0908  0.428 -0.656  -0.570  0.0575
## # … with 2 more variables: gear <dbl>, carb <dbl>

With stretch we can get correlations as a long table:

stretch(r_df)
## # A tibble: 121 × 3
##    x     y          r
##    <chr> <chr>  <dbl>
##  1 mpg   mpg   NA    
##  2 mpg   cyl   -0.852
##  3 mpg   disp  -0.848
##  4 mpg   hp    -0.776
##  5 mpg   drat   0.681
##  6 mpg   wt    -0.868
##  7 mpg   qsec   0.419
##  8 mpg   vs     0.664
##  9 mpg   am     0.600
## 10 mpg   gear   0.480
## # … with 111 more rows
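The long format makes it easy to filter pairs of variables. The sketch below assumes the na.rm and remove.dups arguments of stretch and the quiet argument of correlate, as I understand them from the corrr documentation. The 0.8 threshold and the names r_long and strong are arbitrary choices for illustration.

```r
library(corrr)

r_df <- correlate(mtcars, quiet = TRUE)

# One row per pair: drop the NA diagonal and the duplicated (y, x) pairs
r_long <- stretch(r_df, na.rm = TRUE, remove.dups = TRUE)

# Keep pairs with |r| above an arbitrary threshold, strongest first
strong <- r_long[abs(r_long$r) > 0.8, ]
strong[order(-abs(strong$r)), ]
```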

With focus we can examine a part of the correlation matrix. The variables selected in the second argument become the columns, and the remaining variables become the rows:

focus(r_df, c(mpg, cyl))
## # A tibble: 9 × 3
##   term     mpg    cyl
##   <chr>  <dbl>  <dbl>
## 1 disp  -0.848  0.902
## 2 hp    -0.776  0.832
## 3 drat   0.681 -0.700
## 4 wt    -0.868  0.782
## 5 qsec   0.419 -0.591
## 6 vs     0.664 -0.811
## 7 am     0.600 -0.523
## 8 gear   0.480 -0.493
## 9 carb  -0.551  0.527
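For comparison, the same slice can be taken from the base R matrix by indexing rows and columns by name. This is only a sketch of what focus returns, not of how corrr implements it. The names cols and rows are illustrative.

```r
r_mat <- cor(mtcars)

# Variables to focus on become the columns
cols <- c("mpg", "cyl")

# All remaining variables become the rows
rows <- setdiff(rownames(r_mat), cols)

r_focus <- r_mat[rows, cols]
round(r_focus, 3)
```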

fashion allows a pretty presentation of the correlation matrix. We can specify the number of decimals, and choose whether to print the leading zeros with leading_zeros. Here I am presenting the default output.

fashion(r_df)
##    term  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
## 1   mpg      -.85 -.85 -.78  .68 -.87  .42  .66  .60  .48 -.55
## 2   cyl -.85       .90  .83 -.70  .78 -.59 -.81 -.52 -.49  .53
## 3  disp -.85  .90       .79 -.71  .89 -.43 -.71 -.59 -.56  .39
## 4    hp -.78  .83  .79      -.45  .66 -.71 -.72 -.24 -.13  .75
## 5  drat  .68 -.70 -.71 -.45      -.71  .09  .44  .71  .70 -.09
## 6    wt -.87  .78  .89  .66 -.71      -.17 -.55 -.69 -.58  .43
## 7  qsec  .42 -.59 -.43 -.71  .09 -.17       .74 -.23 -.21 -.66
## 8    vs  .66 -.81 -.71 -.72  .44 -.55  .74       .17  .21 -.57
## 9    am  .60 -.52 -.59 -.24  .71 -.69 -.23  .17       .79  .06
## 10 gear  .48 -.49 -.56 -.13  .70 -.58 -.21  .21  .79       .27
## 11 carb -.55  .53  .39  .75 -.09  .43 -.66 -.57  .06  .27
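A short sketch of non-default formatting, assuming the decimals and leading_zeros arguments of fashion behave as described in the corrr documentation. The name r_fmt is illustrative.

```r
library(corrr)

r_df <- correlate(mtcars, quiet = TRUE)

# One decimal place, keeping the zero before the decimal point
r_fmt <- fashion(r_df, decimals = 1, leading_zeros = TRUE)
r_fmt
```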

To interpret a correlation matrix, it can be useful to change the default order of variables so that highly correlated variables sit together. We accomplish this with the rearrange function. The methods available to rearrange variables include principal components analysis "PCA" (the default) and hierarchical clustering "HC".

rearrange(r_df, method = "HC")
## # A tibble: 11 × 12
##    term      wt    cyl   disp     hp    carb    drat      am   gear    qsec
##    <chr>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>   <dbl>  <dbl>   <dbl>
##  1 wt    NA      0.782  0.888  0.659  0.428  -0.712  -0.692  -0.583 -0.175 
##  2 cyl    0.782 NA      0.902  0.832  0.527  -0.700  -0.523  -0.493 -0.591 
##  3 disp   0.888  0.902 NA      0.791  0.395  -0.710  -0.591  -0.556 -0.434 
##  4 hp     0.659  0.832  0.791 NA      0.750  -0.449  -0.243  -0.126 -0.708 
##  5 carb   0.428  0.527  0.395  0.750 NA      -0.0908  0.0575  0.274 -0.656 
##  6 drat  -0.712 -0.700 -0.710 -0.449 -0.0908 NA       0.713   0.700  0.0912
##  7 am    -0.692 -0.523 -0.591 -0.243  0.0575  0.713  NA       0.794 -0.230 
##  8 gear  -0.583 -0.493 -0.556 -0.126  0.274   0.700   0.794  NA     -0.213 
##  9 qsec  -0.175 -0.591 -0.434 -0.708 -0.656   0.0912 -0.230  -0.213 NA     
## 10 mpg   -0.868 -0.852 -0.848 -0.776 -0.551   0.681   0.600   0.480  0.419 
## 11 vs    -0.555 -0.811 -0.710 -0.723 -0.570   0.440   0.168   0.206  0.745 
## # … with 2 more variables: mpg <dbl>, vs <dbl>
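To see the idea behind reordering by hierarchical clustering, here is a base R sketch. It is an illustration under my own assumptions (dissimilarity 1 - |r| and complete linkage), not necessarily what rearrange does internally, so the resulting order may differ from the one shown above.

```r
r_mat <- cor(mtcars)

# Dissimilarity: strongly correlated variables (of either sign) are close
d <- as.dist(1 - abs(r_mat))

# Hierarchical clustering; "complete" linkage is an arbitrary choice here
hc <- hclust(d, method = "complete")

# Reorder rows and columns following the leaves of the dendrogram
ord <- hc$order
round(r_mat[ord, ord], 2)
```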

Correlation matrices are symmetric with ones in the diagonal, so it is common to present only their lower triangular part, without the diagonal elements. We get this with shave:

shave(r_df)
## # A tibble: 11 × 12
##    term     mpg    cyl   disp     hp    drat     wt   qsec     vs      am   gear
##    <chr>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
##  1 mpg   NA     NA     NA     NA     NA      NA     NA     NA     NA      NA    
##  2 cyl   -0.852 NA     NA     NA     NA      NA     NA     NA     NA      NA    
##  3 disp  -0.848  0.902 NA     NA     NA      NA     NA     NA     NA      NA    
##  4 hp    -0.776  0.832  0.791 NA     NA      NA     NA     NA     NA      NA    
##  5 drat   0.681 -0.700 -0.710 -0.449 NA      NA     NA     NA     NA      NA    
##  6 wt    -0.868  0.782  0.888  0.659 -0.712  NA     NA     NA     NA      NA    
##  7 qsec   0.419 -0.591 -0.434 -0.708  0.0912 -0.175 NA     NA     NA      NA    
##  8 vs     0.664 -0.811 -0.710 -0.723  0.440  -0.555  0.745 NA     NA      NA    
##  9 am     0.600 -0.523 -0.591 -0.243  0.713  -0.692 -0.230  0.168 NA      NA    
## 10 gear   0.480 -0.493 -0.556 -0.126  0.700  -0.583 -0.213  0.206  0.794  NA    
## 11 carb  -0.551  0.527  0.395  0.750 -0.0908  0.428 -0.656 -0.570  0.0575  0.274
## # … with 1 more variable: carb <dbl>

We can achieve a more satisfying presentation combining shave and fashion:

fashion(shave(r_df))
##    term  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb
## 1   mpg                                                       
## 2   cyl -.85                                                  
## 3  disp -.85  .90                                             
## 4    hp -.78  .83  .79                                        
## 5  drat  .68 -.70 -.71 -.45                                   
## 6    wt -.87  .78  .89  .66 -.71                              
## 7  qsec  .42 -.59 -.43 -.71  .09 -.17                         
## 8    vs  .66 -.81 -.71 -.72  .44 -.55  .74                    
## 9    am  .60 -.52 -.59 -.24  .71 -.69 -.23  .17               
## 10 gear  .48 -.49 -.56 -.13  .70 -.58 -.21  .21  .79          
## 11 carb -.55  .53  .39  .75 -.09  .43 -.66 -.57  .06  .27
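A comparable lower-triangular table can be sketched in base R with upper.tri, working on the matrix returned by cor. The name r_low is illustrative.

```r
r_mat <- cor(mtcars)

# Round first, then blank out the diagonal and the upper triangle
r_low <- round(r_mat, 2)
r_low[upper.tri(r_low, diag = TRUE)] <- NA

# Print NA cells as empty strings
print(r_low, na.print = "")
```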

We can also plot the correlation matrix with rplot. It is customary to rearrange and shave the matrix before plotting:

r_pretty <- shave(rearrange(r_df))
rplot(r_pretty)

The corrplot package

corrplot provides a visual exploratory tool of correlation matrices that supports automatic variable reordering to help detect hidden patterns among variables. It can be seen as a visual alternative to exploratory factor analysis.

The functionalities of corrplot are nicely explained in the package vignette. Here I will present some illustrative examples.

We specify how to present correlations with the method argument of the corrplot function:

corrplot(r_mat, method = "number")

We can specify how to order variables in the correlation matrix with the order argument. Methods available include angular order of eigenvectors "AOE", first principal component "FPC" and hierarchical clustering "hclust".

corrplot(r_mat, method = "circle", order = "hclust", diag = FALSE)
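To inspect the ordering itself rather than only the plot, the sketch below uses corrMatOrder and the addrect argument of corrplot, as I understand them from the corrplot documentation. The choice of three clusters is arbitrary and only for illustration.

```r
library(corrplot)

r_mat <- cor(mtcars)

# Permutation of variables derived from hierarchical clustering
ord <- corrMatOrder(r_mat, order = "hclust")
colnames(r_mat)[ord]

# Same ordering in the plot, with rectangles drawn around 3 clusters
corrplot(r_mat, order = "hclust", addrect = 3)
```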

The corrplot.mixed function allows presenting two different visualizations of the same correlation matrix in the upper and lower triangular parts of the matrix.

corrplot.mixed(r_mat, upper = "ellipse", lower = "shade", order = "hclust")

Examining covariance and correlation matrices in R

Covariance and correlation matrices express relationships between variables of a multivariate sample. As correlations are scaled between -1 and +1, correlation matrices are easier for humans to examine. With the corrr and corrplot packages we can examine correlation matrices, group highly correlated subsets of variables and present visualizations of the results.

After examining correlation matrices, we can engage in advanced techniques to examine correlational structures, like exploratory and confirmatory factor analysis or structural equation modelling.

Session info

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 10 (buster)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] corrplot_0.92 corrr_0.4.3  
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.9         bslib_0.2.5.1     compiler_4.1.2    pillar_1.6.4     
##  [5] jquerylib_0.1.4   iterators_1.0.13  tools_4.1.2       digest_0.6.27    
##  [9] jsonlite_1.7.2    evaluate_0.14     lifecycle_1.0.0   tibble_3.1.5     
## [13] gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.12      foreach_1.5.1    
## [17] registry_0.5-1    rstudioapi_0.13   cli_3.0.1         DBI_1.1.1        
## [21] yaml_2.2.1        seriation_1.3.1   blogdown_1.5      xfun_0.23        
## [25] TSP_1.1-11        stringr_1.4.0     dplyr_1.0.7       knitr_1.33       
## [29] generics_0.1.0    sass_0.4.0        vctrs_0.3.8       tidyselect_1.1.1 
## [33] grid_4.1.2        glue_1.4.2        R6_2.5.0          fansi_0.5.0      
## [37] rmarkdown_2.9     bookdown_0.24     farver_2.1.0      purrr_0.3.4      
## [41] ggplot2_3.3.5     magrittr_2.0.1    codetools_0.2-18  scales_1.1.1     
## [45] htmltools_0.5.1.1 ellipsis_0.3.2    assertthat_0.2.1  colorspace_2.0-1 
## [49] labeling_0.4.2    utf8_1.2.1        stringi_1.7.3     munsell_0.5.0    
## [53] crayon_1.4.1