Data manipulation with dplyr

Jose M Sallan 2021-02-19 4 min read

dplyr is a grammar of data manipulation for efficient transformation of rectangular data stored in data frames or tibbles. dplyr is part of the Tidyverse, and its functions expect tidy data.

library(dplyr)

An example of a tidy dataset is iris. Here we have defined a tibble version called iris_table for clarity, but you can use dplyr functions into iris straight.

library(tibble)
iris_tibble <- tibble(iris)
iris_tibble
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

The basic verbs for data manipulation

Basic dplyr functions are implemented as a set of verbs:

  • add new variables (columns) with mutate()
  • pick variables (columns) with select()
  • pick cases (rows) based on their values with filter()
  • obtain summary statistics of a varaible wiht summmarise()
  • order the rows according with variable values with arrange()

Here are some examples of usage of basic dplyr verbs:

# add a new variable
iris_tibble <- mutate(iris_tibble, new_var = Sepal.Length/Sepal.Width)

# select two columns
select(iris_tibble, Sepal.Length, Sepal.Width)
## # A tibble: 150 x 2
##    Sepal.Length Sepal.Width
##           <dbl>       <dbl>
##  1          5.1         3.5
##  2          4.9         3  
##  3          4.7         3.2
##  4          4.6         3.1
##  5          5           3.6
##  6          5.4         3.9
##  7          4.6         3.4
##  8          5           3.4
##  9          4.4         2.9
## 10          4.9         3.1
## # … with 140 more rows
# filter observations with Sepal.Length greater than 5.5
filter(iris_tibble, Sepal.Length > 5.5)
## # A tibble: 91 x 6
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species    new_var
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>        <dbl>
##  1          5.8         4            1.2         0.2 setosa        1.45
##  2          5.7         4.4          1.5         0.4 setosa        1.30
##  3          5.7         3.8          1.7         0.3 setosa        1.5 
##  4          7           3.2          4.7         1.4 versicolor    2.19
##  5          6.4         3.2          4.5         1.5 versicolor    2   
##  6          6.9         3.1          4.9         1.5 versicolor    2.23
##  7          6.5         2.8          4.6         1.5 versicolor    2.32
##  8          5.7         2.8          4.5         1.3 versicolor    2.04
##  9          6.3         3.3          4.7         1.6 versicolor    1.91
## 10          6.6         2.9          4.6         1.3 versicolor    2.28
## # … with 81 more rows
# obtain the mean of Sepal.Width
summarise(iris_tibble, m = mean(Sepal.Width))
## # A tibble: 1 x 1
##       m
##   <dbl>
## 1  3.06
# order by decreasing value of Sepal.Width
arrange(iris_tibble, desc(Sepal.Width))
## # A tibble: 150 x 6
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>     <dbl>
##  1          5.7         4.4          1.5         0.4 setosa     1.30
##  2          5.5         4.2          1.4         0.2 setosa     1.31
##  3          5.2         4.1          1.5         0.1 setosa     1.27
##  4          5.8         4            1.2         0.2 setosa     1.45
##  5          5.4         3.9          1.7         0.4 setosa     1.38
##  6          5.4         3.9          1.3         0.4 setosa     1.38
##  7          5.7         3.8          1.7         0.3 setosa     1.5 
##  8          5.1         3.8          1.5         0.3 setosa     1.34
##  9          5.1         3.8          1.9         0.4 setosa     1.34
## 10          5.1         3.8          1.6         0.2 setosa     1.34
## # … with 140 more rows

Piping operator

We can combine several dplyr verbs in a single instruction using the piping operator %>%:

# obtain the mean of Sepal.Width for observations with Sepal.Length greater than 5.5

iris_tibble %>% 
  filter(Sepal.Length > 5.5) %>% 
  summarise(m=mean(Sepal.Width))
## # A tibble: 1 x 1
##       m
##   <dbl>
## 1  2.96

Grouping

Sometimes we want to examine the properties of a dataset for each of the levels of a categorical variable (or combinations of levels). We can do that with group_by. It is often useful combining grouping with summarising:

# mean of variables for each species
iris_tibble %>% 
  group_by(Species) %>%
  summarise(m_sl = mean(Sepal.Length), m_sw = mean(Sepal.Width), m_pl = mean(Petal.Length), m_pw = mean(Petal.Width))
## # A tibble: 3 x 5
##   Species     m_sl  m_sw  m_pl  m_pw
## * <fct>      <dbl> <dbl> <dbl> <dbl>
## 1 setosa      5.01  3.43  1.46 0.246
## 2 versicolor  5.94  2.77  4.26 1.33 
## 3 virginica   6.59  2.97  5.55 2.03

More functions

dplyr has many other functions for data manipulation. You can find them in the dplyr tidyverse website or in the dplyr cheatsheet.