Examining the Central Limit Theorem with purrr

Jose M Sallan 2024-05-27 7 min read

ggplot
purrr

In this post, I will present the possiblities of some of the family of map_*() functions from the tidyverse purrr package. The purrr package is included in the core tydiverse packages, so we can load it doing:

library(tidyverse)

I will apply these functions to present empirical evidence of the central limit theorem.

The central limit theorem states that if we collect a large enough sample of $n$ indenpendent obaservations of a random variable with population mean $μ$ and stardard deviation $σ$ , the sample mean follows a distribution:

$\bar{x} \sim N (μ, σ / \sqrt{n})$

The value $σ / \sqrt{n}$ is called standard error.

To examine this theorem empirically, we need to generate several samples of the random variable and obtain the sample mean $\bar{x}$ from each sample. Once obtained the set of sample values of $\bar{x}$ , we need to examine the mean, standard deviation and distribution. As the central limit theorem works with any distribution I will test it with:

A normal distribution.
A continuous uniform distribution.

Normal Distribution

The purrr::map() function iterates a function along the components of a list or vector and returns a list as output. It is largely an equivalent of the R base lapply() function.

Let’s use map() to create 1,000 samples of 2,000 observations each from a normal distribution of mean zero and three different values of standard deviation sd.

set.seed(11)
sd_05 <- map(1:1000, ~ rnorm(2000, mean = 0, sd = 0.5))
sd_10 <- map(1:1000, ~ rnorm(2000, mean = 0, sd = 1))
sd_20 <- map(1:1000, ~ rnorm(2000, mean = 0, sd = 2))

norm <- list(sd_05 = sd_05, sd_10 = sd_10, sd_20 = sd_20)

We can check the number of elements of each component of norm applying lenght() with purrr::map_int().

map_int(norm, length)

## sd_05 sd_10 sd_20 
##  1000  1000  1000

The output of this function is a vector of integer values of sample mean $\bar{x}$ . This output inherits the names of the original norm list.

Let’s calculate $\bar{x}$ of each of the samples of norm. The result will be a list of three components, each of them being a vector with the mean of each of the 1,000 samples. These vectors are obtained with the purrr::map_dbl() function. This function returns always a vector of double (floating point numeric) values.

mean_norm <- map(norm, ~ map_dbl(. , mean))

Now we have 1,000 values of the sample mean of normal distributions with different values of standard deviation. We can obtain an estimation of the mean of the sample mean for each distribution doing:

map_dbl(mean_norm, mean)

##        sd_05        sd_10        sd_20 
## 7.675551e-05 6.811647e-04 1.890621e-03

We observe that the mean of $\bar{x}$ is very close to $μ$ for the three sets of samples. Let’s look at the standard deviation:

map_dbl(mean_norm, ~ sd(.))

##      sd_05      sd_10      sd_20 
## 0.01088386 0.02154560 0.04467244

The standard deviation of $\bar{x}$ increases as the standard deviation of the distribution increases. Let’s compare the values obtained above with the population standard error.

c(0.5, 1, 2)/sqrt(2000)

## [1] 0.01118034 0.02236068 0.04472136

The sample values of standard error are close to the population values.

Let’s now plot the distribution of the sample mean. To do so, I will wrap all the values of mean_norm in a single tibble with columns value and name equal to means of each sample and the name of the list of samples, respectively. We obtain that with purrr::map2_dfr():

Being a map2*() function, map2_dfr() takes two lists of vectors as arguments, passed as .x and .y in the formula.
Being a map*_dfr() function, map2_dfr() binds the result into a tibble by row.

norm_table <- map2_dfr(mean_norm, names(mean_norm), ~ tibble(value = .x, name = .y))
norm_table

## # A tibble: 3,000 × 2
##        value name 
##        <dbl> <chr>
##  1  0.00155  sd_05
##  2  0.0108   sd_05
##  3  0.0119   sd_05
##  4 -0.000436 sd_05
##  5  0.00463  sd_05
##  6 -0.00281  sd_05
##  7  0.00336  sd_05
##  8 -0.0106   sd_05
##  9  0.0114   sd_05
## 10 -0.0168   sd_05
## # ℹ 2,990 more rows

Let’s draw an histogram for each distribution:

norm_table |>
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_grid(. ~ name) +
  theme_linedraw() +
  labs(title = "Normal distribution", x = NULL, y = NULL)

Each of the samples of $\bar{x}$ follows a normal distribution centered in $μ$ and standard deviation equal to the standard error. This is what the central limit theorem states, so we have tested it empirically.

I can provide additional graphical evidence of the normality of sample mean with a quantile-quantile (Q-Q) plot, obtained with stat_qq() and stat_qq_line() from ggplot.

norm_table |>
  ggplot(aes(sample = value)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  facet_grid(. ~ name) +
  theme_linedraw() +
  labs(title = "Normal distribution", x = NULL, y = NULL)

All points are aligned alog the red line, meaning that the sample follows a normal distribution.

Continuous Uniform Distribution

For the continuous uniform distribution, I will replicate the workflow of the normal distribution obtaining three samples of growing standard deviation. For a continuous distribution, the larger the difference betweem maximum and minimum, the larger the standard deviation of the distribution.

set.seed(55)
unif_01 <- map(1:1000, ~ runif(2000, min = -1, max = 1))
unif_02 <- map(1:1000, ~ runif(2000, min = -2, max = 2))
unif_04 <- map(1:1000, ~ runif(2000, min = -4, max = 4))

unif <- list(unif_01 = unif_01, unif_02 = unif_02, unif_04 = unif_04)

Like with normal distributions, let’s compute the sample means $\bar{x}$ for each sample.

mean_unif <- map(unif, ~ map_dbl(. , mean))

When computing the mean and standard deviation of the samples of $\bar{x}$ , we observe that mean is close to zero and that standard error grows:

map_dbl(mean_unif, mean)

##       unif_01       unif_02       unif_04 
## -0.0002933102 -0.0010177496 -0.0009016651

map_dbl(mean_unif, ~ sd(.))

##    unif_01    unif_02    unif_04 
## 0.01292212 0.02627455 0.05129232

Values of standard errors are close to the population values.

c(2, 4, 8)/sqrt(12*2000)

## [1] 0.01290994 0.02581989 0.05163978

Let’s create the tibble to plot histograms of $\bar{x}$ .

unif_table <- map2_dfr(mean_unif, names(mean_unif), ~ tibble(value = .x, name = .y))
unif_table

## # A tibble: 3,000 × 2
##       value name   
##       <dbl> <chr>  
##  1 -0.00411 unif_01
##  2  0.00394 unif_01
##  3  0.00867 unif_01
##  4 -0.0155  unif_01
##  5 -0.0350  unif_01
##  6 -0.0151  unif_01
##  7  0.00923 unif_01
##  8 -0.00413 unif_01
##  9 -0.00681 unif_01
## 10  0.00407 unif_01
## # ℹ 2,990 more rows

These histograms show that sample means of uniform distributions are normally distributed.

unif_table |>
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_grid(. ~ name) +
  theme_linedraw() +
  labs(title = "Uniform distribution", x = NULL, y = NULL)

As in the previous example, I am presenting Q-Q plots to provide additional evidence of the normality of the sample mean.

unif_table |>
  ggplot(aes(sample = value)) +
  stat_qq() +
  stat_qq_line(color = "red") +
  facet_grid(. ~ name) +
  theme_linedraw() +
  labs(title = "Uniform distribution", x = NULL, y = NULL)

Iterating with `purrr`

To demonstrate empirically the central limit theorem for the normal and uniform distributions, I have used some functions of purrr to iterate along lists and vectors. The map() function works like the R base lapply() function, but others functions of the family offer additional features:

Functions like map_int() and map_dbl() fix the type of the output as a vector of integer and floating point values, respectively. Thus, these functions allow controlling the type of output.
Functions like map2() allow iterating along two inputs. The pmap() family of functions allows iterating along multiple inputs.
Functions like map_dfr() and map_dfc() bind together by rows or columns the output of each iteration.

Many of the combinations of features are available in functions like map2_dfr(), so they facilitate iterating in R programming considerably.

References

Central limit theorem https://en.wikipedia.org/wiki/Central_limit_theorem
Q-Q plot https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot
purrr map family https://purrr.tidyverse.org/reference/index.html#map-family

Session Info

## R version 4.4.0 (2024-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Linux Mint 21.1
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Madrid
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
##  [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
##  [9] ggplot2_3.5.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5      jsonlite_1.8.8    highr_0.10        compiler_4.4.0   
##  [5] tidyselect_1.2.1  jquerylib_0.1.4   scales_1.3.0      yaml_2.3.8       
##  [9] fastmap_1.1.1     R6_2.5.1          labeling_0.4.3    generics_0.1.3   
## [13] knitr_1.46        bookdown_0.39     munsell_0.5.1     tzdb_0.4.0       
## [17] bslib_0.7.0       pillar_1.9.0      rlang_1.1.3       utf8_1.2.4       
## [21] stringi_1.8.3     cachem_1.0.8      xfun_0.43         sass_0.4.9       
## [25] timechange_0.3.0  cli_3.6.2         withr_3.0.0       magrittr_2.0.3   
## [29] digest_0.6.35     grid_4.4.0        rstudioapi_0.16.0 hms_1.1.3        
## [33] lifecycle_1.0.4   vctrs_0.6.5       evaluate_0.23     glue_1.7.0       
## [37] farver_2.1.1      blogdown_1.19     fansi_1.0.6       colorspace_2.1-0 
## [41] rmarkdown_2.26    tools_4.4.0       pkgconfig_2.0.3   htmltools_0.5.8.1

The Jose M Sallan static website