Joining relational tables in dplyr

Jose M Sallan 2022-04-04 10 min read

The dplyr package for tabular data handling includes a set of functions for joining data frames with a relational structure. By relational structure, I mean that there are columns in each table that can relate the contents of the two tables.

I will illustrate how these functions work using data from the World Bank. In addition to dplyr, I am using the wbstats package to retrieve those data directly from the web.

library(dplyr)
library(wbstats)

Mutating joins (also called merges) return a data frame with the columns of two data frames x and y:

  • inner join: returns all rows from x where there are matching values in y. If there are multiple matches between x and y, all combination of the matches are returned.

  • full (outer) join: returns all rows from both x and y. Where there are not matching values, returns NA for the ones missing.

  • left join: returns all rows from x. Rows of x with no match in y will have NA values in the y columns. If there are multiple matches between x and y, all combination of the matches are returned.

  • right join: returns all rows from y. Rows of y with no match in x will have NA values in the x columns. If there are multiple matches, all combinations are returned.

These four mutating joins are implemented in dplyr with the functions inner_join, full_join, left_join and right_join.

Getting data from the World Bank

Let’s use the wbstats package to retrieve values of two series:

pop_wb <- wb_data("SP.POP.TOTL", start_date = 2000, end_date = 2022)
pop_wb
## # A tibble: 4,557 × 9
##    iso2c iso3c country   date SP.POP.TOTL unit  obs_status footnote last_updated
##    <chr> <chr> <chr>    <dbl>       <dbl> <chr> <chr>      <chr>    <date>      
##  1 AF    AFG   Afghani…  2020    38928341 <NA>  <NA>       <NA>     2022-02-15  
##  2 AF    AFG   Afghani…  2019    38041757 <NA>  <NA>       <NA>     2022-02-15  
##  3 AF    AFG   Afghani…  2018    37171922 <NA>  <NA>       <NA>     2022-02-15  
##  4 AF    AFG   Afghani…  2017    36296111 <NA>  <NA>       <NA>     2022-02-15  
##  5 AF    AFG   Afghani…  2016    35383028 <NA>  <NA>       <NA>     2022-02-15  
##  6 AF    AFG   Afghani…  2015    34413603 <NA>  <NA>       <NA>     2022-02-15  
##  7 AF    AFG   Afghani…  2014    33370804 <NA>  <NA>       <NA>     2022-02-15  
##  8 AF    AFG   Afghani…  2013    32269592 <NA>  <NA>       <NA>     2022-02-15  
##  9 AF    AFG   Afghani…  2012    31161378 <NA>  <NA>       <NA>     2022-02-15  
## 10 AF    AFG   Afghani…  2011    30117411 <NA>  <NA>       <NA>     2022-02-15  
## # … with 4,547 more rows
tfrt_wb <- wb_data("SP.DYN.TFRT.IN", start_date = 2000, end_date = 2022)
tfrt_wb
## # A tibble: 4,557 × 9
##    iso2c iso3c country  date SP.DYN.TFRT.IN unit  obs_status footnote
##    <chr> <chr> <chr>   <dbl>          <dbl> <chr> <chr>      <chr>   
##  1 AW    ABW   Aruba    2000           1.87 <NA>  <NA>       <NA>    
##  2 AW    ABW   Aruba    2001           1.85 <NA>  <NA>       <NA>    
##  3 AW    ABW   Aruba    2002           1.82 <NA>  <NA>       <NA>    
##  4 AW    ABW   Aruba    2003           1.80 <NA>  <NA>       <NA>    
##  5 AW    ABW   Aruba    2004           1.79 <NA>  <NA>       <NA>    
##  6 AW    ABW   Aruba    2005           1.77 <NA>  <NA>       <NA>    
##  7 AW    ABW   Aruba    2006           1.76 <NA>  <NA>       <NA>    
##  8 AW    ABW   Aruba    2007           1.76 <NA>  <NA>       <NA>    
##  9 AW    ABW   Aruba    2008           1.76 <NA>  <NA>       <NA>    
## 10 AW    ABW   Aruba    2009           1.76 <NA>  <NA>       <NA>    
## # … with 4,547 more rows, and 1 more variable: last_updated <date>

I will edit the resulting tables removing missing values, retaining the columns with country, date and value and renaming the value.

pop <- pop_wb %>%
  filter(!is.na(SP.POP.TOTL)) %>%
  select(country, date, SP.POP.TOTL) %>%
  rename(population = SP.POP.TOTL)

tfrt <- tfrt_wb %>%
  filter(!is.na(SP.DYN.TFRT.IN)) %>%
  select(country, date, SP.DYN.TFRT.IN) %>%
  rename(fertility = SP.DYN.TFRT.IN)

Joins by one variable

I will start joining tables with data from year 2010:

pop_2010 <- pop %>%
  filter(date == 2010) %>%
  select(-date)

tfrt_2010 <- tfrt %>%
  filter(date == 2010) %>%
  select(-date)

Both tables have different number of observations:

  • pop_2010 has 217 rows.
  • tfrt_2010 has 201 rows.

It is important to track the number of rows of each table to examine the effect of each join. Let’s begin with the inner join. Like in the rest of joins, I have declared the joining columns with by. If no value of by is passed, the functions pick the columns of equal name in both tables.

inner_2010 <- inner_join(pop_2010, 
                         tfrt_2010, 
                         by = "country")
inner_2010
## # A tibble: 201 × 3
##    country             population fertility
##    <chr>                    <dbl>     <dbl>
##  1 Afghanistan           29185511      5.98
##  2 Albania                2913021      1.66
##  3 Algeria               35977451      2.86
##  4 Andorra                  84454      1.27
##  5 Angola                23356247      6.19
##  6 Antigua and Barbuda      88030      1.99
##  7 Argentina             40788453      2.35
##  8 Armenia                2877314      1.72
##  9 Aruba                   101665      1.77
## 10 Australia             22031750      1.93
## # … with 191 more rows

The inner join will have the least rows of all joins, as it only includes values of the joining columns present in both tables. It has 201 rows, the same as tfrt_2010.

full_2010 <- full_join(pop_2010, 
                       tfrt_2010, 
                       by = "country")
full_2010
## # A tibble: 217 × 3
##    country             population fertility
##    <chr>                    <dbl>     <dbl>
##  1 Afghanistan           29185511      5.98
##  2 Albania                2913021      1.66
##  3 Algeria               35977451      2.86
##  4 American Samoa           56084     NA   
##  5 Andorra                  84454      1.27
##  6 Angola                23356247      6.19
##  7 Antigua and Barbuda      88030      1.99
##  8 Argentina             40788453      2.35
##  9 Armenia                2877314      1.72
## 10 Aruba                   101665      1.77
## # … with 207 more rows

The full join has the maximum value of rows of all joins, as it includes the observations included in any of the tables. It has 201 rows, the same as pop_2010.

With the full join, we can examine which rows do not include fertility or population:

full_2010 %>%
  filter(is.na(fertility))
## # A tibble: 16 × 3
##    country                   population fertility
##    <chr>                          <dbl>     <dbl>
##  1 American Samoa                 56084        NA
##  2 British Virgin Islands         27796        NA
##  3 Cayman Islands                 56672        NA
##  4 Dominica                       70877        NA
##  5 Gibraltar                      33585        NA
##  6 Isle of Man                    84856        NA
##  7 Marshall Islands               56361        NA
##  8 Monaco                         35609        NA
##  9 Nauru                          10009        NA
## 10 Northern Mariana Islands       53971        NA
## 11 Palau                          17954        NA
## 12 San Marino                     31221        NA
## 13 Sint Maarten (Dutch part)      34056        NA
## 14 St. Kitts and Nevis            49011        NA
## 15 Turks and Caicos Islands       32658        NA
## 16 Tuvalu                         10521        NA
full_2010 %>%
  filter(is.na(population))
## # A tibble: 0 × 3
## # … with 3 variables: country <chr>, population <dbl>, fertility <dbl>

We observe that some countries that have populations have not fertility rate records, but not in reverse.

Left and rigth joins will have the same rows of the left and right tables in the join expression, respectively.

left_2010 <- left_join(pop_2010, 
                       tfrt_2010, 
                       by = "country")
left_2010
## # A tibble: 217 × 3
##    country             population fertility
##    <chr>                    <dbl>     <dbl>
##  1 Afghanistan           29185511      5.98
##  2 Albania                2913021      1.66
##  3 Algeria               35977451      2.86
##  4 American Samoa           56084     NA   
##  5 Andorra                  84454      1.27
##  6 Angola                23356247      6.19
##  7 Antigua and Barbuda      88030      1.99
##  8 Argentina             40788453      2.35
##  9 Armenia                2877314      1.72
## 10 Aruba                   101665      1.77
## # … with 207 more rows
right_2010 <- right_join(pop_2010, 
                         tfrt_2010, 
                         by = "country")
right_2010
## # A tibble: 201 × 3
##    country             population fertility
##    <chr>                    <dbl>     <dbl>
##  1 Afghanistan           29185511      5.98
##  2 Albania                2913021      1.66
##  3 Algeria               35977451      2.86
##  4 Andorra                  84454      1.27
##  5 Angola                23356247      6.19
##  6 Antigua and Barbuda      88030      1.99
##  7 Argentina             40788453      2.35
##  8 Armenia                2877314      1.72
##  9 Aruba                   101665      1.77
## 10 Australia             22031750      1.93
## # … with 191 more rows

Joins by two variables

Let’s join the two original pop and tfrt tables. We need to do the join by two variables: country and year.

inner <- inner_join(pop, 
                    tfrt, 
                    by = c("country", "date"))

full <- full_join(pop, 
                  tfrt, 
                  by = c("country", "date")) 

left <- left_join(pop, 
                  tfrt, 
                  by = c("country", "date"))

right <- right_join(pop, 
                    tfrt, 
                    by = c("country", "date"))

Let’s see how many rows we have in each join and the two original tables:

table number of rows
pop 4548
tfrt 4017
inner 4009
full 4556
left 4548
right 4017

Now the inner table has less rows than tfrt, while the full table has more rows than pop. This means that there are observations in pop missing in tfrt, and vice versa.

Filtering join

To check which observations of one table are missing in the other, we can use the anti_join filtering join: anti_join(x, y) returns observations of x with no match in y.

We can see which elements of pop are not present in tfrt doing:

anti_join(pop, tfrt, by = c("country", "date"))
## # A tibble: 539 × 3
##    country         date population
##    <chr>          <dbl>      <dbl>
##  1 Afghanistan     2020   38928341
##  2 Albania         2020    2837743
##  3 Algeria         2020   43851043
##  4 American Samoa  2020      55197
##  5 American Samoa  2019      55312
##  6 American Samoa  2018      55461
##  7 American Samoa  2017      55617
##  8 American Samoa  2016      55739
##  9 American Samoa  2015      55806
## 10 American Samoa  2014      55791
## # … with 529 more rows

We have 539 observations in this table. This is not surprising, as fertility records and not as exhaustive as population’s.

Let’s examine which elements of tfrt are not present in pop:

anti_join(tfrt, pop, by = c("country", "date"))
## # A tibble: 8 × 3
##   country  date fertility
##   <chr>   <dbl>     <dbl>
## 1 Eritrea  2012      4.41
## 2 Eritrea  2013      4.34
## 3 Eritrea  2014      4.27
## 4 Eritrea  2015      4.22
## 5 Eritrea  2016      4.16
## 6 Eritrea  2017      4.11
## 7 Eritrea  2018      4.06
## 8 Eritrea  2019      4.00

Here we have only 8 observations. None of them is from year 2010, so we did not observed them in the previous section.

dplyr offers a set of functions to join relational tables that allow performing mutating joins, because they return a new table with information from the two inputs. This allows doing SQL-like operations inside the R environment. The four mutating joins (inner, full, left and right) are a replacement for the R base merge function, hopefully in a more intuitive fashion. Filtering joins allow controlling for which rows have no match in the other table, so they do not return a new table.

References

Session info

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 10 (buster)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.8.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.8.0
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] wbstats_1.0.4    dplyr_1.0.7      kableExtra_1.3.4
## 
## loaded via a namespace (and not attached):
##  [1] highr_0.9         bslib_0.2.5.1     compiler_4.1.3    pillar_1.6.4     
##  [5] jquerylib_0.1.4   tools_4.1.3       digest_0.6.27     tibble_3.1.5     
##  [9] jsonlite_1.7.2    evaluate_0.14     lifecycle_1.0.0   viridisLite_0.4.0
## [13] pkgconfig_2.0.3   rlang_0.4.12      cli_3.0.1         DBI_1.1.1        
## [17] rstudioapi_0.13   curl_4.3.1        yaml_2.2.1        blogdown_1.5     
## [21] xfun_0.23         httr_1.4.2        stringr_1.4.0     xml2_1.3.2       
## [25] knitr_1.33        hms_1.1.1         generics_0.1.0    sass_0.4.0       
## [29] vctrs_0.3.8       systemfonts_1.0.2 tidyselect_1.1.1  webshot_0.5.2    
## [33] svglite_2.0.0     glue_1.4.2        R6_2.5.0          fansi_0.5.0      
## [37] rmarkdown_2.9     bookdown_0.24     tidyr_1.1.4       tzdb_0.1.2       
## [41] readr_2.0.2       purrr_0.3.4       magrittr_2.0.1    scales_1.1.1     
## [45] htmltools_0.5.1.1 ellipsis_0.3.2    assertthat_0.2.1  rvest_1.0.2      
## [49] colorspace_2.0-1  utf8_1.2.1        stringi_1.7.3     munsell_0.5.0    
## [53] crayon_1.4.1