Introducing the beta probability distribution

Jose M Sallan 2022-10-03 7 min read

In this post, I will present the beta distribution of probability, and illustrate how to use the tidiyverse to represent probability distributions. Beta distribution functions are included in the R base, so I will only be using the tidyverse, complemented with the patchwork package.

library(tidyverse)
library(patchwork)

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval \(\left[0, 1\right]\). It is used to model distribution of the probability of a specific success.

The shape of the beta distribution is defined with two parameters \(\alpha \geq 0\) and \(\beta \geq 0\). The formula of the beta probability distribution is:

\[ d = \frac{x^{\alpha-1}\left(1-x\right)^{\beta-1}}{B\left(\alpha, \beta\right)} \]

The \(B\left(\alpha, \beta\right)\) is a fixed term to scale the area under the function to one, and it is defined using the gamma function:

\[B\left(\alpha, \beta\right) = \frac{\Gamma\left(\alpha\right)\Gamma\left(\beta\right)}{\Gamma\left(\alpha+\beta\right)}\]

With the tidyverse, we can use the dbeta function in stat_function to plot a probability distribution function. Parameters shape1 and shape2 of dbeta correspond with \(\alpha\) and \(\beta\).

  ggplot() +
  stat_function(fun = dbeta, args = list(shape1 = 10, shape2 = 40)) +
  theme_minimal()

Interpretation of the beta distribution

The beta distribution:

\[ \frac{1}{B\left(\alpha, \beta\right)} x^{\alpha-1}\left(1-x\right)^{\beta-1}\]

is similar to the binomial distribution:

\[{n\choose x} p^{x}\left(1-p\right)^{n-x}\]

While the binomial gives the distribution of number of successes given a probability, the beta gives the distribution of probability of success given the number of successes and failures. Then, \(\alpha - 1\) is the number of successes and \(\beta - 1\) is the number of failures. This interpretation of beta is valid for integer values of \(\alpha0\) and \(\beta0\).

Let’s represent some beta distributions to illustrate this interpretation. First I am creating the table_beta function, that draws a the beta distribution for a number of sucesses s and failures f. The other parameter is a label l. I am using To draw the probability distribution, I create the variable x as a sequence of 100 values between 0 and 1, and then apply the dbeta function to obtain the value of density of probability d.

table_beta <- function(s, f, l){
  t <- tibble(x = seq(0, 1, length.out = 100),
              s = s,
              f = f,
              sample = l) %>%
    mutate(d = dbeta(x, shape1 = s + 1, shape2 = f + 1))
  return(t)
}

The table_plot function plots a set of beta distributions. The value of s, f and l for each distribution is stored in a table t. Each element of the set needs a different value of the label l. l is passed as color in the aes of the plot, so that we obtain one line for each value of l.

table_plot <- function(t){
  
  l <- lapply(1:nrow(t), \(i) table_beta(t$s[i], t$f[i], t$l[i]))
  
  df <- bind_rows(l)
  
  ggplot(df, aes(x, d, color = sample)) +
    geom_line() +
    theme_minimal() +
    scale_color_brewer(palette = "Reds", direction = -1)
}

Let’s generate two sets of distributions. t1 is a set of distribution with 20% of successes with a growing number of observations. t2 is similar, but with a 80% of successes.

t1 <- tibble(s = c(10, 50, 100, 200),
             f = c(40, 200, 400, 800),
             l = factor(s + f))

t2 <- tibble(s = c(40, 200, 400, 800),
             f = c(10, 50, 100, 200),
             l = factor(s + f))

Here I am presenting the plots of both families of distributions, t1 on the left and t2 on the right. They are put together thanks to the patchwork package.

p1a <- table_plot(t = t1) + 
  geom_vline(xintercept = 0.2, color = "#004C99") +
  ggtitle("Samples with 20% of success")

p1b <- table_plot(t = t2) + 
  geom_vline(xintercept = 0.8, color = "#004C99") +
  ggtitle("Samples with 80% of success")

p1a + p1b

In both cases, the variability of the probability distribution decreases with the number of observations. This is an example of increase of statistical power with larger sample sizes.

We can see the effect of sample size on power calculating ranges of probabilities using pbeta. When we have 10 successes and 40 failures, the probability that the probability of success ranges between 0.21 and 0.19 is:

pbeta(0.21, 11, 41) - pbeta(0.19, 11, 41)
## [1] 0.1418756

But when we have 1000 successes and 4000 failures, the probability is:

pbeta(0.21, 1001, 4001) - pbeta(0.19, 1001, 4001)
## [1] 0.9228958

For additional clarification, I will examine a set of beta distributions with one hundred observations, but with different proportions of success and failure:

t3 <- tibble(s = c(10, 20, 50, 80, 90),
             f = c(90, 80, 50, 20, 10),
             l = as.factor(c(0.1, 0.2, 0.5, 0.8, 0.9)))

table_plot(t = t3) +
  ggtitle("Several samples of 100 observations")

As the number of successes increases, the density of probability distribution leans to the right.

Beta distribution with alpha or beta smaller than one

The formula of the beta distribution also admits values of shape parameters \(\alpha\) and \(\beta\) between one and zero. To examine these distributions, I will create a new table_beta2 function taking alpha and beta as arguments…

table_beta2 <- function(alpha, beta, l){
  t <- tibble(x = seq(0, 1, length.out = 100),
              alpha = alpha,
              beta = beta,
              sample = l) %>%
    mutate(d = dbeta(x, shape1 = alpha, shape2 = beta))
  return(t)
}

.. and its corresponding table_plot2 function:

table_plot2 <- function(t){
  
  l <- lapply(1:nrow(t), \(i) table_beta2(t$alpha[i], t$beta[i], t$l[i]))
  
  df <- bind_rows(l)
  
  ggplot(df, aes(x, d, color = sample)) +
    geom_line() +
    theme_minimal() +
    scale_color_brewer(palette = "Reds", direction = -1)
}

Let’s examine first a t4 set with a combination of fractional alpha and integer beta and a symmetrical t5 set.

t4 <- tibble(alpha = seq(0.1, 0.8, 0.2),
             beta = 2,
             l = as.factor(alpha))

t5 <- tibble(alpha = 2,
             beta = seq(0.1, 0.8, 0.2),
             l = as.factor(beta))


table_plot2(t4) + table_plot2(t5)

Being \(\alpha\) much smaller than \(\beta\), the probability distribution is skewed to the left. The swap of values of shape parameters would lead to a symmetrical distribution.

Let’s examine now two set of \(\alpha\) and \(beta\) smaller than one. In the t6 set (left plot), both shape parameters add one, while in the t7 set (right plot) are equal. In both plots, labels are the values of \(\alpha\).

t6 <- tibble(alpha = seq(0.1, 0.8, 0.2),
             beta = 1 - alpha,
             l = as.factor(alpha))

t7 <- tibble(alpha = seq(0.1, 0.8, 0.2),
             beta = alpha,
             l = as.factor(alpha))

table_plot2(t6) + table_plot2(t7)

We observe U-shaped distributions, meaning that the distribution is skewed to extreme values of probability. In the StackExchange question listed in the references, one respondent provides an interpretation of this distribution based on a extrapolation of Polya urns to fractional values.

References

Session info

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19.2
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.1.1 forcats_0.5.1   stringr_1.4.0   dplyr_1.0.9    
##  [5] purrr_0.3.4     readr_2.1.2     tidyr_1.2.0     tibble_3.1.6   
##  [9] ggplot2_3.3.5   tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] lubridate_1.8.0    assertthat_0.2.1   digest_0.6.29      utf8_1.2.2        
##  [5] R6_2.5.1           cellranger_1.1.0   backports_1.4.1    reprex_2.0.1      
##  [9] evaluate_0.15      httr_1.4.2         highr_0.9          blogdown_1.9      
## [13] pillar_1.7.0       rlang_1.0.2        readxl_1.4.0       rstudioapi_0.13   
## [17] jquerylib_0.1.4    rmarkdown_2.14     labeling_0.4.2     munsell_0.5.0     
## [21] broom_0.8.0        compiler_4.2.1     modelr_0.1.8       xfun_0.30         
## [25] pkgconfig_2.0.3    htmltools_0.5.2    tidyselect_1.1.2   bookdown_0.26     
## [29] fansi_1.0.3        crayon_1.5.1       tzdb_0.3.0         dbplyr_2.1.1      
## [33] withr_2.5.0        grid_4.2.1         jsonlite_1.8.0     gtable_0.3.0      
## [37] lifecycle_1.0.1    DBI_1.1.2          magrittr_2.0.3     scales_1.2.0      
## [41] cli_3.3.0          stringi_1.7.6      farver_2.1.0       fs_1.5.2          
## [45] xml2_1.3.3         bslib_0.3.1        ellipsis_0.3.2     generics_0.1.2    
## [49] vctrs_0.4.1        RColorBrewer_1.1-3 tools_4.2.1        glue_1.6.2        
## [53] hms_1.1.1          fastmap_1.1.0      yaml_2.3.5         colorspace_2.0-3  
## [57] rvest_1.0.2        knitr_1.39         haven_2.5.0        sass_0.4.1