Merging with data.table

Jose M Sallan 2021-04-09 8 min read

When performing tasks of exploratory data analysis, sometimes we need to join tables by common variables. This is an operation similar to the defined in relational algebra, although joining variables may not represent a unique ID for each row. Here I will illustrate how to merge tables using the data.table package, that allows handling large datasets in R effectively.

library(data.table)

Let’s consider the following tables.

table1

##    id v       date
## 1:  a 1 2021-01-01
## 2:  a 2 2021-03-23
## 3:  b 3 2021-02-13
## 4:  b 4 2021-05-28
## 5:  c 5 2021-12-17
## 6:  e 6 2021-12-31

table2

##    id w
## 1:  a 1
## 2:  b 2
## 3:  c 3
## 4:  d 4

table1 and table2 can be merged by the column id. The most generic function to perform joins in data.table is merge, similar to the R base function of the same name. Let’s see how to perform different merges with these two tables.

Inner join

The result of an inner join is a table with the rows with values of merging variables existing in both tables. This is the default mode of merge, with all.x = FALSE and all.y = FALSE.

merge(table1, table2)

##    id v       date w
## 1:  a 1 2021-01-01 1
## 2:  a 2 2021-03-23 1
## 3:  b 3 2021-02-13 2
## 4:  b 4 2021-05-28 2
## 5:  c 5 2021-12-17 3

merge has chosen the variables to merge picking the ones that have the same name in both tables. In data.table, we can control which are the merging variables using setkey:

table1_keyed <- copy(table1)
table2_keyed <- copy(table2)
setkey(table1_keyed, "id")
setkey(table2_keyed, "id")
merge(table1_keyed, table2_keyed)

##    id v       date w
## 1:  a 1 2021-01-01 1
## 2:  a 2 2021-03-23 1
## 3:  b 3 2021-02-13 2
## 4:  b 4 2021-05-28 2
## 5:  c 5 2021-12-17 3

Let’s see if we gain speed setting the key variables:

rbenchmark::benchmark(merge(table1, table2),
                      merge(table1_keyed, table2_keyed),
                      order = "relative",
                      columns = c("test", "replications", "elapsed", "relative"))

##                                test replications elapsed relative
## 2 merge(table1_keyed, table2_keyed)          100   0.085    1.000
## 1             merge(table1, table2)          100   0.105    1.235

Seeing the results, we better set the keys of table1 and table2.

setkey(table1, "id")
setkey(table2, "id")

Left join

The result of a left join is a table with all the rows of the first table (the one on the left). The columns of the second table (on the right) with no correspondence with the first have NA in the merged table. We can perform a left join with data.table using merge with all.x = TRUE.

merge(table1, table2, all.x = TRUE)

##    id v       date  w
## 1:  a 1 2021-01-01  1
## 2:  a 2 2021-03-23  1
## 3:  b 3 2021-02-13  2
## 4:  b 4 2021-05-28  2
## 5:  c 5 2021-12-17  3
## 6:  e 6 2021-12-31 NA

Right join

The right join is analogous to the left join, but the resulting table contains all the rows of the second table:

merge(table1, table2, all.y = TRUE)

##    id  v       date w
## 1:  a  1 2021-01-01 1
## 2:  a  2 2021-03-23 1
## 3:  b  3 2021-02-13 2
## 4:  b  4 2021-05-28 2
## 5:  c  5 2021-12-17 3
## 6:  d NA       <NA> 4

In data.table, we have an alternative syntax for right joins using the on operator:

table1[table2, on = .(id)]

##    id  v       date w
## 1:  a  1 2021-01-01 1
## 2:  a  2 2021-03-23 1
## 3:  b  3 2021-02-13 2
## 4:  b  4 2021-05-28 2
## 5:  c  5 2021-12-17 3
## 6:  d NA       <NA> 4

we can get rid of the on operator if we define the keys of each table with setkey:

table1[table2]

##    id  v       date w
## 1:  a  1 2021-01-01 1
## 2:  a  2 2021-03-23 1
## 3:  b  3 2021-02-13 2
## 4:  b  4 2021-05-28 2
## 5:  c  5 2021-12-17 3
## 6:  d NA       <NA> 4

Outer or full join

The result of a full or outer join is a table containing all the rows of both tables:

merge(table1, table2, all.x = TRUE, all.y = TRUE)

##    id  v       date  w
## 1:  a  1 2021-01-01  1
## 2:  a  2 2021-03-23  1
## 3:  b  3 2021-02-13  2
## 4:  b  4 2021-05-28  2
## 5:  c  5 2021-12-17  3
## 6:  d NA       <NA>  4
## 7:  e  6 2021-12-31 NA

Cartesian join

A relational algebra requires that the columns used to merge variables have an unique value for each row. This is not always the case when we merge tables: we have seen than table1 has several rows with id equal to a or b. Let’s see what happens when we perform an inner join with a table3 with repeated values in its id column:

table1

##    id v       date
## 1:  a 1 2021-01-01
## 2:  a 2 2021-03-23
## 3:  b 3 2021-02-13
## 4:  b 4 2021-05-28
## 5:  c 5 2021-12-17
## 6:  e 6 2021-12-31

table3

##    id  g
## 1:  a 21
## 2:  a 22
## 3:  a 23
## 4:  b 24
## 5:  c 25
## 6:  f 26

setkey(table3, "id")
merge(table1, table3)

##    id v       date  g
## 1:  a 1 2021-01-01 21
## 2:  a 1 2021-01-01 22
## 3:  a 1 2021-01-01 23
## 4:  a 2 2021-03-23 21
## 5:  a 2 2021-03-23 22
## 6:  a 2 2021-03-23 23
## 7:  b 3 2021-02-13 24
## 8:  b 4 2021-05-28 24
## 9:  c 5 2021-12-17 25

merge returns a table with all combinations of variables in the id column. The resulting table has more rows than any of the two rows. This is something similar to the cartesian join in SQL, but here the cartesian product is only between rows of both tables with the same value in the merging columns.

Let’s consider a table4 to merge with table3.

table4

##    id  h
## 1:  a 31
## 2:  a 32
## 3:  a 33
## 4:  a 34
## 5:  a 35
## 6:  b 36
## 7:  b 37
## 8:  b 38
## 9:  c 39

setkey(table4, "id")

If we try to perform merge(table3, table4), data.table returns an error, because the number of rows of the merged table is larger than the sum of the rows of merged tables: table3 has 6 rows, table4 has 9 rows and the merged table has 19 rows. This is not exactly an error, but a warning that the size of the merged table can be too large. We must explicitly set allow.cartesian = TRUE to tell data.table that we are performing a cartesian join.

merge(table3, table4, allow.cartesian = TRUE)

##     id  g  h
##  1:  a 21 31
##  2:  a 21 32
##  3:  a 21 33
##  4:  a 21 34
##  5:  a 21 35
##  6:  a 22 31
##  7:  a 22 32
##  8:  a 22 33
##  9:  a 22 34
## 10:  a 22 35
## 11:  a 23 31
## 12:  a 23 32
## 13:  a 23 33
## 14:  a 23 34
## 15:  a 23 35
## 16:  b 24 36
## 17:  b 24 37
## 18:  b 24 38
## 19:  c 25 39

We can use the on syntax for cartesian merges (we table3[table4, on =.(id), allow.cartesian = TRUE] if the merging variables are not defined with setkey):

table3[table4, allow.cartesian = TRUE]

##     id  g  h
##  1:  a 21 31
##  2:  a 22 31
##  3:  a 23 31
##  4:  a 21 32
##  5:  a 22 32
##  6:  a 23 32
##  7:  a 21 33
##  8:  a 22 33
##  9:  a 23 33
## 10:  a 21 34
## 11:  a 22 34
## 12:  a 23 34
## 13:  a 21 35
## 14:  a 22 35
## 15:  a 23 35
## 16:  b 24 36
## 17:  b 24 37
## 18:  b 24 38
## 19:  c 25 39

Rolling join

data.table has an additional feature to merge tables when we merge tables with dates or times. To illustrate that, let’s consider merging table1 and a new table5:

table1

##    id v       date
## 1:  a 1 2021-01-01
## 2:  a 2 2021-03-23
## 3:  b 3 2021-02-13
## 4:  b 4 2021-05-28
## 5:  c 5 2021-12-17
## 6:  e 6 2021-12-31

setkey(table1, "id", "date")
table5

##    id  h       date
## 1:  a 31 2021-01-01
## 2:  a 32 2021-02-04
## 3:  a 33 2021-03-23
## 4:  a 34 2021-06-03
## 5:  a 35 2021-08-15
## 6:  b 36 2021-02-13
## 7:  b 37 2021-05-28
## 8:  b 38 2021-07-12
## 9:  c 39 2021-12-17

setkey(table5, "id", "date")

When we perform a merge with the on syntax, we have NA in the columns of table1 (what we have performed is equivalent to table3[table4, on =.(id), allow.cartesian = TRUE]):

table1[table5]

##    id  v       date  h
## 1:  a  1 2021-01-01 31
## 2:  a NA 2021-02-04 32
## 3:  a  2 2021-03-23 33
## 4:  a NA 2021-06-03 34
## 5:  a NA 2021-08-15 35
## 6:  b  3 2021-02-13 36
## 7:  b  4 2021-05-28 37
## 8:  b NA 2021-07-12 38
## 9:  c  5 2021-12-17 39

Both tables have a column of temporal values. If we interpret that date is the moment when the variable v changes in table1, then the value of v in the second row of the merged table is 1, 2 in rows 4 and 5 and 4 in row 8. We can implement that doing roll = TRUE:

table1[table5, roll = TRUE]

##    id v       date  h
## 1:  a 1 2021-01-01 31
## 2:  a 1 2021-02-04 32
## 3:  a 2 2021-03-23 33
## 4:  a 2 2021-06-03 34
## 5:  a 2 2021-08-15 35
## 6:  b 3 2021-02-13 36
## 7:  b 4 2021-05-28 37
## 8:  b 4 2021-07-12 38
## 9:  c 5 2021-12-17 39

I have used version 1.13.6 of data.table in R 4.0.3.

The Jose M Sallan static website