In a recent post, I illustrated how to merge tables in R using the data.table
package. This package handles large tabular data effectively, although some find its syntax somewhat arcane. That’s why Hadley Wickham has developed dtplyr
“a data.table
backend for dplyr
, allowing you to write dplyr
code that is automatically translated to the equivalent data.table
code.”
For dtplyr
to work, we need to load also dplyr
, and it is convenient to load data.table
as well:
library(data.table)
library(dplyr)
library(dtplyr)
Let’s compare the syntax of data.table
and dplyr
to merge tables. Let’s consider two data tables:
table1
## id v date
## 1: a 1 2021-01-01
## 2: a 2 2021-03-23
## 3: b 3 2021-02-13
## 4: b 4 2021-05-28
## 5: c 5 2021-12-17
## 6: e 6 2021-12-31
table2
## id w
## 1: a 1
## 2: b 2
## 3: c 3
## 4: d 4
Inner join
The result of an inner join is a table with the rows with values of merging variables existing in both tables:
setkey(table1, "id")
setkey(table2, "id")
merge(table1, table2)
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
In dplyr
we use inner_join
for that. If the input are two data.table
objects and dtplyr
is loaded, we perform the same operation as above. We use as.data.table()
to obtain a data.table
object.
inner_join(table1, table2) %>%
as.data.table()
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
The outcome of both functions is identical:
a <- merge(table1, table2)
b <- inner_join(table1, table2) %>%
as.data.table()
identical(a,b)
## [1] TRUE
Let’s check time of execution:
rbenchmark::benchmark(merge(table1, table2),
inner_join(table1, table2) %>% as.data.table(),
inner_join(table1_tibble, table2_tibble),
order = "relative",
columns = c("test", "replications", "elapsed", "relative"))
## test replications elapsed relative
## 1 merge(table1, table2) 100 0.124 1.000
## 3 inner_join(table1_tibble, table2_tibble) 100 0.248 2.000
## 2 inner_join(table1, table2) %>% as.data.table() 100 0.457 3.685
We see that dtplyr
is faster than dplyr
, but slower than the original data.table
code.
Left join
The result of a left join is a table with all the rows of the left table (the first to be entered):
merge(table1, table2, all.x = TRUE)
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
## 6: e 6 2021-12-31 NA
We can do the same with left_join
in dplyr
and dtplyr
:
left_join(table1, table2) %>%
as.data.table()
## id w v date
## 1: a 1 1 2021-01-01
## 2: a 1 2 2021-03-23
## 3: b 2 3 2021-02-13
## 4: b 2 4 2021-05-28
## 5: c 3 5 2021-12-17
## 6: e NA 6 2021-12-31
Rigth join
In a right join we retain all the rows of the right table:
merge(table1, table2, all.y = TRUE)
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
## 6: d NA <NA> 4
It’s easy to guess that we need to use right_join
here:
right_join(table1, table2) %>%
as.data.table()
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
## 6: d NA <NA> 4
Outer or full join
The result of a full or outer join is a table containing all the rows of both tables:
merge(table1, table2, all.x = TRUE, all.y = TRUE)
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
## 6: d NA <NA> 4
## 7: e 6 2021-12-31 NA
The dtplyr
function for a full join is full_join
:
full_join(table1, table2) %>%
as.data.table()
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
## 6: d NA <NA> 4
## 7: e 6 2021-12-31 NA
Cartesian merge
These two tables had many repeated values in the id
column, so the result of merging them has many rows:
table3
## id g
## 1: a 21
## 2: a 22
## 3: a 23
## 4: b 24
## 5: c 25
## 6: f 26
table4
## id h
## 1: a 31
## 2: a 32
## 3: a 33
## 4: a 34
## 5: a 35
## 6: b 36
## 7: b 37
## 8: b 38
## 9: c 39
In data.table
, if the result of the merge has more rows than the sum of rows of both tables, we need to add allow.cartesian = TRUE
.
merge(table3, table4, allow.cartesian = TRUE)
## id g h
## 1: a 21 31
## 2: a 21 32
## 3: a 21 33
## 4: a 21 34
## 5: a 21 35
## 6: a 22 31
## 7: a 22 32
## 8: a 22 33
## 9: a 22 34
## 10: a 22 35
## 11: a 23 31
## 12: a 23 32
## 13: a 23 33
## 14: a 23 34
## 15: a 23 35
## 16: b 24 36
## 17: b 24 37
## 18: b 24 38
## 19: c 25 39
In dtplyr allow.cartesian = TRUE
is set by default:
inner_join(table3, table4) %>%
as.data.table()
## id g h
## 1: a 21 31
## 2: a 21 32
## 3: a 21 33
## 4: a 21 34
## 5: a 21 35
## 6: a 22 31
## 7: a 22 32
## 8: a 22 33
## 9: a 22 34
## 10: a 22 35
## 11: a 23 31
## 12: a 23 32
## 13: a 23 33
## 14: a 23 34
## 15: a 23 35
## 16: b 24 36
## 17: b 24 37
## 18: b 24 38
## 19: c 25 39
References
- Vignette of joining two tables in
dplyr
https://dplyr.tidyverse.org/reference/join.html dtplyr
package intro: https://www.tidyverse.org/blog/2019/11/dtplyr-1-0-0/- A comparison of
dplyr
anddata.table
, including merges: https://atrebas.github.io/post/2019-03-03-datatable-dplyr/
I have used version 1.13.6 of data.table, version 1.0.5 of dplyr and 1.1.0 of dtplyr in R 4.0.3.