When performing tasks of exploratory data analysis, sometimes we need to join tables by common variables. This is an operation similar to the defined in relational algebra, although joining variables may not represent a unique ID for each row. Here I will illustrate how to merge tables using the data.table
package, that allows handling large datasets in R effectively.
library(data.table)
Let’s consider the following tables.
table1
## id v date
## 1: a 1 2021-01-01
## 2: a 2 2021-03-23
## 3: b 3 2021-02-13
## 4: b 4 2021-05-28
## 5: c 5 2021-12-17
## 6: e 6 2021-12-31
table2
## id w
## 1: a 1
## 2: b 2
## 3: c 3
## 4: d 4
table1
and table2
can be merged by the column id
. The most generic function to perform joins in data.table
is merge
, similar to the R base function of the same name. Let’s see how to perform different merges with these two tables.
Inner join
The result of an inner join is a table with the rows with values of merging variables existing in both tables. This is the default mode of merge
, with all.x = FALSE
and all.y = FALSE
.
merge(table1, table2)
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
merge
has chosen the variables to merge picking the ones that have the same name in both tables. In data.table
, we can control which are the merging variables using setkey
:
table1_keyed <- copy(table1)
table2_keyed <- copy(table2)
setkey(table1_keyed, "id")
setkey(table2_keyed, "id")
merge(table1_keyed, table2_keyed)
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
Let’s see if we gain speed setting the key variables:
rbenchmark::benchmark(merge(table1, table2),
merge(table1_keyed, table2_keyed),
order = "relative",
columns = c("test", "replications", "elapsed", "relative"))
## test replications elapsed relative
## 2 merge(table1_keyed, table2_keyed) 100 0.085 1.000
## 1 merge(table1, table2) 100 0.105 1.235
Seeing the results, we better set the keys of table1
and table2
.
setkey(table1, "id")
setkey(table2, "id")
Left join
The result of a left join is a table with all the rows of the first table (the one on the left). The columns of the second table (on the right) with no correspondence with the first have NA
in the merged table. We can perform a left join with data.table
using merge
with all.x = TRUE
.
merge(table1, table2, all.x = TRUE)
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
## 6: e 6 2021-12-31 NA
Right join
The right join is analogous to the left join, but the resulting table contains all the rows of the second table:
merge(table1, table2, all.y = TRUE)
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
## 6: d NA <NA> 4
In data.table
, we have an alternative syntax for right joins using the on
operator:
table1[table2, on = .(id)]
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
## 6: d NA <NA> 4
we can get rid of the on
operator if we define the keys of each table with setkey
:
table1[table2]
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
## 6: d NA <NA> 4
Outer or full join
The result of a full or outer join is a table containing all the rows of both tables:
merge(table1, table2, all.x = TRUE, all.y = TRUE)
## id v date w
## 1: a 1 2021-01-01 1
## 2: a 2 2021-03-23 1
## 3: b 3 2021-02-13 2
## 4: b 4 2021-05-28 2
## 5: c 5 2021-12-17 3
## 6: d NA <NA> 4
## 7: e 6 2021-12-31 NA
Cartesian join
A relational algebra requires that the columns used to merge variables have an unique value for each row. This is not always the case when we merge tables: we have seen than table1
has several rows with id
equal to a
or b
. Let’s see what happens when we perform an inner join with a table3 with repeated values in its id
column:
table1
## id v date
## 1: a 1 2021-01-01
## 2: a 2 2021-03-23
## 3: b 3 2021-02-13
## 4: b 4 2021-05-28
## 5: c 5 2021-12-17
## 6: e 6 2021-12-31
table3
## id g
## 1: a 21
## 2: a 22
## 3: a 23
## 4: b 24
## 5: c 25
## 6: f 26
setkey(table3, "id")
merge(table1, table3)
## id v date g
## 1: a 1 2021-01-01 21
## 2: a 1 2021-01-01 22
## 3: a 1 2021-01-01 23
## 4: a 2 2021-03-23 21
## 5: a 2 2021-03-23 22
## 6: a 2 2021-03-23 23
## 7: b 3 2021-02-13 24
## 8: b 4 2021-05-28 24
## 9: c 5 2021-12-17 25
merge
returns a table with all combinations of variables in the id
column. The resulting table has more rows than any of the two rows. This is something similar to the cartesian join in SQL, but here the cartesian product is only between rows of both tables with the same value in the merging columns.
Let’s consider a table4
to merge
with table3
.
table4
## id h
## 1: a 31
## 2: a 32
## 3: a 33
## 4: a 34
## 5: a 35
## 6: b 36
## 7: b 37
## 8: b 38
## 9: c 39
setkey(table4, "id")
If we try to perform merge(table3, table4)
, data.table
returns an error, because the number of rows of the merged table is larger than the sum of the rows of merged tables: table3
has 6 rows, table4
has 9 rows and the merged table has 19 rows. This is not exactly an error, but a warning that the size of the merged table can be too large. We must explicitly set allow.cartesian = TRUE
to tell data.table
that we are performing a cartesian join.
merge(table3, table4, allow.cartesian = TRUE)
## id g h
## 1: a 21 31
## 2: a 21 32
## 3: a 21 33
## 4: a 21 34
## 5: a 21 35
## 6: a 22 31
## 7: a 22 32
## 8: a 22 33
## 9: a 22 34
## 10: a 22 35
## 11: a 23 31
## 12: a 23 32
## 13: a 23 33
## 14: a 23 34
## 15: a 23 35
## 16: b 24 36
## 17: b 24 37
## 18: b 24 38
## 19: c 25 39
We can use the on syntax for cartesian merges (we table3[table4, on =.(id), allow.cartesian = TRUE]
if the merging variables are not defined with setkey
):
table3[table4, allow.cartesian = TRUE]
## id g h
## 1: a 21 31
## 2: a 22 31
## 3: a 23 31
## 4: a 21 32
## 5: a 22 32
## 6: a 23 32
## 7: a 21 33
## 8: a 22 33
## 9: a 23 33
## 10: a 21 34
## 11: a 22 34
## 12: a 23 34
## 13: a 21 35
## 14: a 22 35
## 15: a 23 35
## 16: b 24 36
## 17: b 24 37
## 18: b 24 38
## 19: c 25 39
Rolling join
data.table
has an additional feature to merge tables when we merge tables with dates or times. To illustrate that, let’s consider merging table1
and a new table5
:
table1
## id v date
## 1: a 1 2021-01-01
## 2: a 2 2021-03-23
## 3: b 3 2021-02-13
## 4: b 4 2021-05-28
## 5: c 5 2021-12-17
## 6: e 6 2021-12-31
setkey(table1, "id", "date")
table5
## id h date
## 1: a 31 2021-01-01
## 2: a 32 2021-02-04
## 3: a 33 2021-03-23
## 4: a 34 2021-06-03
## 5: a 35 2021-08-15
## 6: b 36 2021-02-13
## 7: b 37 2021-05-28
## 8: b 38 2021-07-12
## 9: c 39 2021-12-17
setkey(table5, "id", "date")
When we perform a merge with the on
syntax, we have NA
in the columns of table1
(what we have performed is equivalent to table3[table4, on =.(id), allow.cartesian = TRUE]
):
table1[table5]
## id v date h
## 1: a 1 2021-01-01 31
## 2: a NA 2021-02-04 32
## 3: a 2 2021-03-23 33
## 4: a NA 2021-06-03 34
## 5: a NA 2021-08-15 35
## 6: b 3 2021-02-13 36
## 7: b 4 2021-05-28 37
## 8: b NA 2021-07-12 38
## 9: c 5 2021-12-17 39
Both tables have a column of temporal values. If we interpret that date
is the moment when the variable v
changes in table1
, then the value of v
in the second row of the merged table is 1
, 2
in rows 4 and 5 and 4
in row 8. We can implement that doing roll = TRUE
:
table1[table5, roll = TRUE]
## id v date h
## 1: a 1 2021-01-01 31
## 2: a 1 2021-02-04 32
## 3: a 2 2021-03-23 33
## 4: a 2 2021-06-03 34
## 5: a 2 2021-08-15 35
## 6: b 3 2021-02-13 36
## 7: b 4 2021-05-28 37
## 8: b 4 2021-07-12 38
## 9: c 5 2021-12-17 39
I have used version 1.13.6 of data.table in R 4.0.3.