dplyr is a grammar of data manipulation for efficient transformation of rectangular data stored in data frames or tibbles. It is part of the tidyverse, and its functions expect tidy data.
library(dplyr)
An example of a tidy dataset is iris. Here we have defined a tibble version called iris_tibble for clarity, but you can apply dplyr functions directly to iris.
library(tibble)
iris_tibble <- as_tibble(iris)
iris_tibble
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows
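As noted above, dplyr verbs also accept the plain iris data frame. The following is a minimal sketch of that, assuming only that dplyr is installed; the name long_sepal_df is hypothetical and chosen just for this illustration.

```r
library(dplyr)

# Apply a dplyr verb to the base data frame, without converting it to a tibble
long_sepal_df <- filter(iris, Sepal.Length > 5.5)

# The result keeps the class of the input: a data frame, not a tibble
class(long_sepal_df)
## [1] "data.frame"
```

The main practical difference is printing: a data frame prints all of its rows, while a tibble shows only the first ten.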
The basic verbs for data manipulation
Basic dplyr functions are implemented as a set of verbs:
- add new variables (columns) with mutate()
- pick variables (columns) with select()
- pick cases (rows) based on their values with filter()
- obtain summary statistics of a variable with summarise()
- order the rows according to variable values with arrange()
Here are some examples of usage of basic dplyr verbs:
# add a new variable
iris_tibble <- mutate(iris_tibble, new_var = Sepal.Length/Sepal.Width)
# select two columns
select(iris_tibble, Sepal.Length, Sepal.Width)
## # A tibble: 150 x 2
## Sepal.Length Sepal.Width
## <dbl> <dbl>
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5 3.6
## 6 5.4 3.9
## 7 4.6 3.4
## 8 5 3.4
## 9 4.4 2.9
## 10 4.9 3.1
## # … with 140 more rows
# filter observations with Sepal.Length greater than 5.5
filter(iris_tibble, Sepal.Length > 5.5)
## # A tibble: 91 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 5.8 4 1.2 0.2 setosa 1.45
## 2 5.7 4.4 1.5 0.4 setosa 1.30
## 3 5.7 3.8 1.7 0.3 setosa 1.5
## 4 7 3.2 4.7 1.4 versicolor 2.19
## 5 6.4 3.2 4.5 1.5 versicolor 2
## 6 6.9 3.1 4.9 1.5 versicolor 2.23
## 7 6.5 2.8 4.6 1.5 versicolor 2.32
## 8 5.7 2.8 4.5 1.3 versicolor 2.04
## 9 6.3 3.3 4.7 1.6 versicolor 1.91
## 10 6.6 2.9 4.6 1.3 versicolor 2.28
## # … with 81 more rows
# obtain the mean of Sepal.Width
summarise(iris_tibble, m = mean(Sepal.Width))
## # A tibble: 1 x 1
## m
## <dbl>
## 1 3.06
# order by decreasing value of Sepal.Width
arrange(iris_tibble, desc(Sepal.Width))
## # A tibble: 150 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 5.7 4.4 1.5 0.4 setosa 1.30
## 2 5.5 4.2 1.4 0.2 setosa 1.31
## 3 5.2 4.1 1.5 0.1 setosa 1.27
## 4 5.8 4 1.2 0.2 setosa 1.45
## 5 5.4 3.9 1.7 0.4 setosa 1.38
## 6 5.4 3.9 1.3 0.4 setosa 1.38
## 7 5.7 3.8 1.7 0.3 setosa 1.5
## 8 5.1 3.8 1.5 0.3 setosa 1.34
## 9 5.1 3.8 1.9 0.4 setosa 1.34
## 10 5.1 3.8 1.6 0.2 setosa 1.34
## # … with 140 more rows
Piping operator
We can combine several dplyr verbs in a single instruction using the pipe operator %>%, which passes the result on its left as the first argument of the function on its right:
# obtain the mean of Sepal.Width for observations with Sepal.Length greater than 5.5
iris_tibble %>%
  filter(Sepal.Length > 5.5) %>%
  summarise(m = mean(Sepal.Width))
## # A tibble: 1 x 1
## m
## <dbl>
## 1 2.96
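To make the role of the pipe concrete, the sketch below writes the same computation twice, once as nested function calls and once as a pipeline. It uses the base iris data frame so that it runs on its own; the names nested and piped are hypothetical.

```r
library(dplyr)

# Nested form: has to be read from the inside out
nested <- summarise(filter(iris, Sepal.Length > 5.5), m = mean(Sepal.Width))

# Piped form: reads top to bottom, one verb per line
piped <- iris %>%
  filter(Sepal.Length > 5.5) %>%
  summarise(m = mean(Sepal.Width))

# Both forms should give the same result
all.equal(nested, piped)
```

The piped form tends to be easier to read and extend as more verbs are chained.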
Grouping
Sometimes we want to examine the properties of a dataset for each level of a categorical variable (or for each combination of levels of several variables). We can do that with group_by(). It is often useful to combine grouping with summarising:
# mean of variables for each species
iris_tibble %>%
  group_by(Species) %>%
  summarise(
    m_sl = mean(Sepal.Length),
    m_sw = mean(Sepal.Width),
    m_pl = mean(Petal.Length),
    m_pw = mean(Petal.Width)
  )
## # A tibble: 3 x 5
## Species m_sl m_sw m_pl m_pw
## * <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versicolor 5.94 2.77 4.26 1.33
## 3 virginica 6.59 2.97 5.55 2.03
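Grouping by combinations of levels, mentioned above, could look like the following sketch. It assumes dplyr 1.0.0 or later (for the .groups argument) and groups by Species together with a derived logical variable; the names long_sepal and counts are hypothetical and not part of the original example.

```r
library(dplyr)

counts <- iris %>%
  # group by Species and by whether Sepal.Length exceeds 5.5
  group_by(Species, long_sepal = Sepal.Length > 5.5) %>%
  # one row per combination that occurs in the data;
  # .groups = "drop" returns an ungrouped result
  summarise(n = n(), m_sw = mean(Sepal.Width), .groups = "drop")

counts
```

Each row of counts corresponds to one combination of Species and long_sepal present in the data, with the number of observations n and the mean Sepal.Width for that combination.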
More functions
dplyr has many other functions for data manipulation. You can find them on the dplyr page of the tidyverse website or in the dplyr cheatsheet.