Merging with dtplyr

Jose M Sallan 2021-04-15 6 min read

In a recent post, I illustrated how to merge tables in R using the data.table package. This package handles large tabular data effectively, although some find its syntax somewhat arcane. That’s why Hadley Wickham has developed dtplyr “a data.table backend for dplyr, allowing you to write dplyr code that is automatically translated to the equivalent data.table code.”

For dtplyr to work, we need to load also dplyr, and it is convenient to load data.table as well:

library(data.table)
library(dplyr)
library(dtplyr)

Let’s compare the syntax of data.table and dplyr to merge tables. Let’s consider two data tables:

table1

##    id v       date
## 1:  a 1 2021-01-01
## 2:  a 2 2021-03-23
## 3:  b 3 2021-02-13
## 4:  b 4 2021-05-28
## 5:  c 5 2021-12-17
## 6:  e 6 2021-12-31

table2

##    id w
## 1:  a 1
## 2:  b 2
## 3:  c 3
## 4:  d 4

Inner join

The result of an inner join is a table with the rows with values of merging variables existing in both tables:

setkey(table1, "id")
setkey(table2, "id")
merge(table1, table2)

##    id v       date w
## 1:  a 1 2021-01-01 1
## 2:  a 2 2021-03-23 1
## 3:  b 3 2021-02-13 2
## 4:  b 4 2021-05-28 2
## 5:  c 5 2021-12-17 3

In dplyr we use inner_join for that. If the input are two data.table objects and dtplyr is loaded, we perform the same operation as above. We use as.data.table() to obtain a data.table object.

inner_join(table1, table2) %>%
  as.data.table()

##    id v       date w
## 1:  a 1 2021-01-01 1
## 2:  a 2 2021-03-23 1
## 3:  b 3 2021-02-13 2
## 4:  b 4 2021-05-28 2
## 5:  c 5 2021-12-17 3

The outcome of both functions is identical:

a <- merge(table1, table2)
b <- inner_join(table1, table2) %>%
  as.data.table()
identical(a,b)

## [1] TRUE

Let’s check time of execution:

rbenchmark::benchmark(merge(table1, table2),
                      inner_join(table1, table2) %>% as.data.table(),
                      inner_join(table1_tibble, table2_tibble),
                      order = "relative",
                      columns = c("test", "replications", "elapsed", "relative"))

##                                             test replications elapsed relative
## 1                          merge(table1, table2)          100   0.124    1.000
## 3       inner_join(table1_tibble, table2_tibble)          100   0.248    2.000
## 2 inner_join(table1, table2) %>% as.data.table()          100   0.457    3.685

We see that dtplyr is faster than dplyr, but slower than the original data.table code.

Left join

The result of a left join is a table with all the rows of the left table (the first to be entered):

merge(table1, table2, all.x = TRUE)

##    id v       date  w
## 1:  a 1 2021-01-01  1
## 2:  a 2 2021-03-23  1
## 3:  b 3 2021-02-13  2
## 4:  b 4 2021-05-28  2
## 5:  c 5 2021-12-17  3
## 6:  e 6 2021-12-31 NA

We can do the same with left_join in dplyr and dtplyr:

left_join(table1, table2) %>%
  as.data.table()

##    id  w v       date
## 1:  a  1 1 2021-01-01
## 2:  a  1 2 2021-03-23
## 3:  b  2 3 2021-02-13
## 4:  b  2 4 2021-05-28
## 5:  c  3 5 2021-12-17
## 6:  e NA 6 2021-12-31

Rigth join

In a right join we retain all the rows of the right table:

merge(table1, table2, all.y = TRUE)

##    id  v       date w
## 1:  a  1 2021-01-01 1
## 2:  a  2 2021-03-23 1
## 3:  b  3 2021-02-13 2
## 4:  b  4 2021-05-28 2
## 5:  c  5 2021-12-17 3
## 6:  d NA       <NA> 4

It’s easy to guess that we need to use right_join here:

right_join(table1, table2) %>%
  as.data.table()

##    id  v       date w
## 1:  a  1 2021-01-01 1
## 2:  a  2 2021-03-23 1
## 3:  b  3 2021-02-13 2
## 4:  b  4 2021-05-28 2
## 5:  c  5 2021-12-17 3
## 6:  d NA       <NA> 4

Outer or full join

The result of a full or outer join is a table containing all the rows of both tables:

merge(table1, table2, all.x = TRUE, all.y = TRUE)

##    id  v       date  w
## 1:  a  1 2021-01-01  1
## 2:  a  2 2021-03-23  1
## 3:  b  3 2021-02-13  2
## 4:  b  4 2021-05-28  2
## 5:  c  5 2021-12-17  3
## 6:  d NA       <NA>  4
## 7:  e  6 2021-12-31 NA

The dtplyr function for a full join is full_join:

full_join(table1, table2) %>%
  as.data.table()

##    id  v       date  w
## 1:  a  1 2021-01-01  1
## 2:  a  2 2021-03-23  1
## 3:  b  3 2021-02-13  2
## 4:  b  4 2021-05-28  2
## 5:  c  5 2021-12-17  3
## 6:  d NA       <NA>  4
## 7:  e  6 2021-12-31 NA

Cartesian merge

These two tables had many repeated values in the id column, so the result of merging them has many rows:

table3

##    id  g
## 1:  a 21
## 2:  a 22
## 3:  a 23
## 4:  b 24
## 5:  c 25
## 6:  f 26

table4

##    id  h
## 1:  a 31
## 2:  a 32
## 3:  a 33
## 4:  a 34
## 5:  a 35
## 6:  b 36
## 7:  b 37
## 8:  b 38
## 9:  c 39

In data.table, if the result of the merge has more rows than the sum of rows of both tables, we need to add allow.cartesian = TRUE.

merge(table3, table4, allow.cartesian = TRUE)

##     id  g  h
##  1:  a 21 31
##  2:  a 21 32
##  3:  a 21 33
##  4:  a 21 34
##  5:  a 21 35
##  6:  a 22 31
##  7:  a 22 32
##  8:  a 22 33
##  9:  a 22 34
## 10:  a 22 35
## 11:  a 23 31
## 12:  a 23 32
## 13:  a 23 33
## 14:  a 23 34
## 15:  a 23 35
## 16:  b 24 36
## 17:  b 24 37
## 18:  b 24 38
## 19:  c 25 39

In dtplyr allow.cartesian = TRUE is set by default:

inner_join(table3, table4) %>%
  as.data.table()

##     id  g  h
##  1:  a 21 31
##  2:  a 21 32
##  3:  a 21 33
##  4:  a 21 34
##  5:  a 21 35
##  6:  a 22 31
##  7:  a 22 32
##  8:  a 22 33
##  9:  a 22 34
## 10:  a 22 35
## 11:  a 23 31
## 12:  a 23 32
## 13:  a 23 33
## 14:  a 23 34
## 15:  a 23 35
## 16:  b 24 36
## 17:  b 24 37
## 18:  b 24 38
## 19:  c 25 39

References

Vignette of joining two tables in dplyr https://dplyr.tidyverse.org/reference/join.html
dtplyr package intro: https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/
A comparison of dplyr and data.table, including merges: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

I have used version 1.13.6 of data.table, version 1.0.5 of dplyr and 1.1.0 of dtplyr in R 4.0.3.

The Jose M Sallan static website