In this post, I will present the functions of the purrr package for mapping (iterating) over vectors, lists or data frames using functional programming. purrr is part of the tidyverse, so it is loaded together with the tidyverse (meta-)package.
I will also be using kableExtra to present data frames nicely.
library(tidyverse)
library(kableExtra)
Mapping one list
In mathematics, a mapping is an operation that associates each element of a given set (the domain) with an element of a second set (the codomain). This is precisely what the map function of purrr does: it applies a function to each element of its input and collects the results.
map has two main arguments. The first one is a list, vector or data frame. The second is the function to be applied to each element of the first argument.
Let’s build a list of a vector and two data frames:
l <- list(a = LETTERS, b = iris, c = mtcars)
The outcome of map is always a list:
map(l, length)
## $a
## [1] 26
##
## $b
## [1] 5
##
## $c
## [1] 11
We can obtain a similar output using the base R function lapply:
lapply(l, length)
## $a
## [1] 26
##
## $b
## [1] 5
##
## $c
## [1] 11
The length function always returns an integer, so it makes sense to obtain a vector of integers instead of a list. We can achieve that with map_int.
map_int(l, length)
## a b c
## 26 5 11
Again, we can obtain a similar output using sapply:
sapply(l, length)
## a b c
## 26 5 11
All mapping functions of purrr include variants that allow specifying the class of the output. Using map_int, map_dbl, map_chr and map_lgl we obtain, if possible, outputs of class integer, double, character and logical respectively.
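The "if possible" matters: when the results cannot be represented in the requested type, the typed variants stop with an error instead of silently changing the output. A small sketch of this behaviour, using tryCatch only to capture the error for illustration:

```r
library(purrr)

# mean() returns a double, so map_dbl() succeeds
means <- map_dbl(mtcars, mean)

# The column means are not whole numbers, so map_int() refuses to convert them
res <- tryCatch(map_int(mtcars, mean), error = function(e) e)
inherits(res, "error")
```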
Let’s obtain the output of length for list l as a character vector:
map_chr(l, length)
## a b c
## "26" "5" "11"
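The call above relies on purrr converting integers to character automatically. This works in the purrr version listed in the session info of this post (0.3.5), but the automatic conversion is deprecated from purrr 1.0.0 onwards. A version-independent sketch is to do the conversion explicitly inside the function:

```r
library(purrr)

# Same list as in the post
l <- list(a = LETTERS, b = iris, c = mtcars)

# Convert each length to character before map_chr() collects the results
lengths_chr <- map_chr(l, \(x) as.character(length(x)))
lengths_chr
```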
The first argument of the map family functions can also be a data frame. Those functions treat a data frame as a list of columns. Let’s see how we can calculate the mean of each of the columns of mtcars.
map_dbl(mtcars, mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
Using base R, we can obtain the same result by applying mean across columns with apply:
apply(mtcars, 2, mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
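One difference is worth keeping in mind: apply first converts the data frame to a matrix, so a single non-numeric column turns every value into character. The sketch below shows two alternatives under that caveat: colMeans for all-numeric data frames, and purrr's keep to select numeric columns before mapping. The use of iris here is only an illustration.

```r
library(purrr)

# colMeans() is the dedicated base R function for column means
base_means <- colMeans(mtcars)

# iris has a factor column (Species); keep() retains only the numeric columns
iris_means <- map_dbl(keep(iris, is.numeric), mean)
iris_means
```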
In purrr functions we can use a shortcut for anonymous functions, where the function is introduced with ~. In map, the current element of the list or data frame is represented as . (or, equivalently, .x).
map_dbl(mtcars, ~ round(mean(.), 4))
## mpg cyl disp hp drat wt qsec vs
## 20.0906 6.1875 230.7219 146.6875 3.5966 3.2172 17.8487 0.4375
## am gear carb
## 0.4062 3.6875 2.8125
We can produce a similar output using base R, but not with the ~ shortcut. Instead, we write an anonymous function, here with the \(i) shorthand available since R 4.1.0:
sapply(mtcars, \(i) round(mean(i), 4))
## mpg cyl disp hp drat wt qsec vs
## 20.0906 6.1875 230.7219 146.6875 3.5966 3.2172 17.8487 0.4375
## am gear carb
## 0.4062 3.6875 2.8125
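The different ways of writing the function can be compared side by side. The sketch below checks that the ~ shortcut and an anonymous function give the same result inside map_dbl, and shows that extra arguments placed after the function are passed on to it (trim = 0.1 is just an illustrative choice):

```r
library(purrr)

# Formula shortcut, with the element referred to as .x
with_formula <- map_dbl(mtcars, ~ round(mean(.x), 4))

# Anonymous function shorthand (R >= 4.1.0)
with_function <- map_dbl(mtcars, \(x) round(mean(x), 4))

# Additional arguments are forwarded to the mapped function, here to mean()
with_dots <- map_dbl(mtcars, mean, trim = 0.1)

identical(with_formula, with_function)
```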
Mapping two lists
The map2 family of functions allows iterating a function with two arguments along two lists.
To illustrate how map2 functions work, let’s build a function that tells us if we have improved or worsened our performance when comparing past and present grades:
check_improvement <- function(past, present){
  if(past < present){
    report <- "improved"
  }else{
    report <- "not improved"
  }
  return(report)
}
We want to apply check_improvement to two vectors of past and present grades:
set.seed(1111)
past_grades <- sample(1:10, 10, replace = TRUE)
present_grades <- sample(1:10, 10, replace = TRUE)
We cannot apply check_improvement to the vectors past_grades and present_grades directly, as if only evaluates logical conditions of length one. We can iterate along these two vectors using map2. The two inputs of the function are labeled as .x and .y.
map2(past_grades, present_grades, ~ check_improvement(.x, .y))
## [[1]]
## [1] "not improved"
##
## [[2]]
## [1] "improved"
##
## [[3]]
## [1] "not improved"
##
## [[4]]
## [1] "not improved"
##
## [[5]]
## [1] "improved"
##
## [[6]]
## [1] "not improved"
##
## [[7]]
## [1] "not improved"
##
## [[8]]
## [1] "not improved"
##
## [[9]]
## [1] "not improved"
##
## [[10]]
## [1] "improved"
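Since every result is a single string, the typed variant map2_chr returns a character vector instead of a list. For a comparison as simple as this one, the vectorised base R function ifelse gives the same answer without explicit iteration. The sketch below hard-codes the grades shown in the tables of this post, so that it does not depend on the random number generator:

```r
library(purrr)

# Grades as they appear in the tables of this post
past_grades    <- c(6, 2, 10, 4, 1, 6, 6, 10, 7, 1)
present_grades <- c(6, 8, 2, 2, 7, 5, 1, 5, 4, 4)

check_improvement <- function(past, present) {
  if (past < present) "improved" else "not improved"
}

# map2_chr() returns a character vector instead of a list
reports <- map2_chr(past_grades, present_grades, check_improvement)

# ifelse() is vectorised, so it handles both vectors in a single call
reports_vec <- ifelse(past_grades < present_grades, "improved", "not improved")

identical(reports, reports_vec)
```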
Using functions *_dfr and *_dfc we can present the output as tibbles constructed by rows or columns, respectively. Let’s modify the function above to return a row of a data frame for each observation.
check_improvement2 <- function(past, present){
  if(past < present){
    report <- "improved"
  }else{
    report <- "not improved"
  }
  return(list(past = past, present = present, report = report))
}
Now we get the data frame binding rows with map2_dfr:
map2_dfr(past_grades, present_grades, ~ check_improvement2(.x, .y)) %>%
  kbl() %>%
  kable_styling(full_width = FALSE)
| past | present | report |
|---|---|---|
| 6 | 6 | not improved |
| 2 | 8 | improved |
| 10 | 2 | not improved |
| 4 | 2 | not improved |
| 1 | 7 | improved |
| 6 | 5 | not improved |
| 6 | 1 | not improved |
| 10 | 5 | not improved |
| 7 | 4 | not improved |
| 1 | 4 | improved |
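From purrr 1.0.0 onwards, the *_dfr and *_dfc variants are marked as superseded (the session info of this post lists purrr 0.3.5, where they are current). One way to get the same table without them, sketched here under the assumption that dplyr is available, is to build the list with map2 and bind it explicitly with bind_rows. In purrr 1.0.0 or later, list_rbind is another option when each element is already a data frame.

```r
library(purrr)
library(dplyr)

# Grades as they appear in the tables of this post
past_grades    <- c(6, 2, 10, 4, 1, 6, 6, 10, 7, 1)
present_grades <- c(6, 8, 2, 2, 7, 5, 1, 5, 4, 4)

check_improvement2 <- function(past, present) {
  report <- if (past < present) "improved" else "not improved"
  list(past = past, present = present, report = report)
}

# Step 1: one named list per observation
rows <- map2(past_grades, present_grades, check_improvement2)

# Step 2: stack the named lists as rows of a tibble
grades_tbl <- bind_rows(rows)
grades_tbl
```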
Mapping more than two lists
We can map functions taking three or more arguments using the pmap family. They work similarly to map2, but the arguments are referred to as ..1, ..2, ..3 and so on. The input of those functions is a list containing the vectors to iterate along. Let’s see how the pmap functions work with an example.
Let’s consider a quiz where you are betting on the results of football matches. If your result has the same winning team as the real match, or you correctly guess a tie, you get two points. If your bet matches the exact result, you get three points. If f1 and f2 are the forecasted goals for each team, and r1 and r2 the real result, we can get the points of the bet with the function:
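score <- function(f1, f2, r1, r2){
  points <- 0
  # Same outcome (same winner, or a tie): two points
  if(sign(f1 - f2) == sign(r1 - r2))
    points <- 2
  # Exact result: three points (&& because each condition has length one)
  if(f1 == r1 && f2 == r2)
    points <- 3
  return(points)
}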
Let’s test the function by applying pmap_dbl to matches_list, a list of bets and results:
matches_list <- list(mf1 = c(0, 0, 0, 1),
mf2 = c(0, 2, 3, 1),
mr1 = c(1, 0, 1, 1),
mr2 = c(1, 2, 0, 1))
pmap_dbl(matches_list, ~score(..1, ..2, ..3, ..4))
## [1] 2 3 0 3
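When the elements of the list are named after the arguments of the function, pmap matches them by name, so the ..1, ..2, ... placeholders are not needed. A minimal sketch, where the list bets is a renamed copy of matches_list introduced only for this illustration:

```r
library(purrr)

score <- function(f1, f2, r1, r2) {
  points <- 0
  if (sign(f1 - f2) == sign(r1 - r2)) points <- 2
  if (f1 == r1 && f2 == r2) points <- 3
  points
}

# Element names match the argument names of score()
bets <- list(f1 = c(0, 0, 0, 1),
             f2 = c(0, 2, 3, 1),
             r1 = c(1, 0, 1, 1),
             r2 = c(1, 2, 0, 1))

# The function is passed directly; arguments are matched by name
points <- pmap_dbl(bets, score)
points
```

Matching by name also makes the call robust to the order of the elements in the list.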
Mapping in a data frame
We can use the map, map2 and pmap families of functions inside a data frame or tibble using mutate. For the example above we can store bets and results in a tibble, and then use mutate to add a column with the results of the score function. Note that the argument of pmap can also be a data frame.
matches <- tibble(mf1 = c(0, 0, 0, 1),
mf2 = c(0, 2, 3, 1),
mr1 = c(1, 0, 1, 1),
mr2 = c(1, 2, 0, 1))
matches <- matches %>%
  mutate(result = pmap_dbl(matches, ~score(..1, ..2, ..3, ..4)))
matches %>%
  kbl() %>%
  kable_styling(full_width = FALSE)
| mf1 | mf2 | mr1 | mr2 | result |
|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 2 |
| 0 | 2 | 0 | 2 | 3 |
| 0 | 3 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 3 |
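Passing the whole data frame by name inside mutate refers to the original object, not to the data arriving through the pipe. If rows are filtered earlier in the pipeline, the lengths no longer match. A variant that avoids this, sketched here with an illustrative filter step, is to pass the columns themselves wrapped in a list:

```r
library(dplyr)
library(purrr)

score <- function(f1, f2, r1, r2) {
  points <- 0
  if (sign(f1 - f2) == sign(r1 - r2)) points <- 2
  if (f1 == r1 && f2 == r2) points <- 3
  points
}

matches <- tibble(mf1 = c(0, 0, 0, 1),
                  mf2 = c(0, 2, 3, 1),
                  mr1 = c(1, 0, 1, 1),
                  mr2 = c(1, 2, 0, 1))

# The columns are taken from the filtered data, so lengths always agree
scored <- matches %>%
  filter(mf1 == 0) %>%
  mutate(result = pmap_dbl(list(mf1, mf2, mr1, mr2), score))
scored
```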
If we use functions of the map or map2 family, the arguments of the functions are the columns of the data frame.
grades <- tibble(past = past_grades, present = present_grades)
grades <- grades %>%
  mutate(check = map2_chr(past, present, ~ check_improvement(.x, .y)))
grades %>%
  kbl() %>%
  kable_styling(full_width = FALSE)
| past | present | check |
|---|---|---|
| 6 | 6 | not improved |
| 2 | 8 | improved |
| 10 | 2 | not improved |
| 4 | 2 | not improved |
| 1 | 7 | improved |
| 6 | 5 | not improved |
| 6 | 1 | not improved |
| 10 | 5 | not improved |
| 7 | 4 | not improved |
| 1 | 4 | improved |
References
purrr page on the tidyverse website: https://purrr.tidyverse.org/
Chapter 21, "Iteration", in Wickham, H. and Grolemund, G. (in progress). R for Data Science. https://r4ds.had.co.nz/iteration.html
Session info
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 19.2
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
##
## locale:
## [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] kableExtra_1.3.4 forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10
## [5] purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 tibble_3.1.8
## [9] ggplot2_3.4.0 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] svglite_2.1.0 lubridate_1.9.0 assertthat_0.2.1 digest_0.6.30
## [5] utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 backports_1.4.1
## [9] reprex_2.0.2 evaluate_0.17 highr_0.9 httr_1.4.4
## [13] blogdown_1.9 pillar_1.8.1 rlang_1.0.6 readxl_1.4.1
## [17] rstudioapi_0.13 jquerylib_0.1.4 rmarkdown_2.14 webshot_0.5.3
## [21] munsell_0.5.0 broom_1.0.1 compiler_4.2.2 modelr_0.1.10
## [25] xfun_0.34 pkgconfig_2.0.3 systemfonts_1.0.4 htmltools_0.5.3
## [29] tidyselect_1.1.2 bookdown_0.26 fansi_1.0.3 viridisLite_0.4.1
## [33] crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1 withr_2.5.0
## [37] grid_4.2.2 jsonlite_1.8.3 gtable_0.3.0 lifecycle_1.0.3
## [41] DBI_1.1.2 magrittr_2.0.3 scales_1.2.1 cli_3.4.1
## [45] stringi_1.7.8 fs_1.5.2 xml2_1.3.3 bslib_0.3.1
## [49] ellipsis_0.3.2 generics_0.1.2 vctrs_0.5.0 tools_4.2.2
## [53] glue_1.6.2 hms_1.1.2 fastmap_1.1.0 yaml_2.3.6
## [57] timechange_0.1.1 colorspace_2.0-3 rvest_1.0.3 knitr_1.40
## [61] haven_2.5.1 sass_0.4.1