dplyr
is a grammar of data manipulation for efficient transformation of rectangular data stored in data frames or tibbles. dplyr
is part of the Tidyverse, and its functions expect tidy data.
library(dplyr)
An example of a tidy dataset is iris
. Here we have defined a tibble version called iris_table
for clarity, but you can use dplyr
functions into iris
straight.
library(tibble)
iris_tibble <- tibble(iris)
iris_tibble
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows
The basic verbs for data manipulation
Basic dplyr
functions are implemented as a set of verbs:
- add new variables (columns) with
mutate()
- pick variables (columns) with
select()
- pick cases (rows) based on their values with
filter()
- obtain summary statistics of a varaible wiht
summmarise()
- order the rows according with variable values with
arrange()
Here are some examples of usage of basic dplyr
verbs:
# add a new variable
iris_tibble <- mutate(iris_tibble, new_var = Sepal.Length/Sepal.Width)
# select two columns
select(iris_tibble, Sepal.Length, Sepal.Width)
## # A tibble: 150 x 2
## Sepal.Length Sepal.Width
## <dbl> <dbl>
## 1 5.1 3.5
## 2 4.9 3
## 3 4.7 3.2
## 4 4.6 3.1
## 5 5 3.6
## 6 5.4 3.9
## 7 4.6 3.4
## 8 5 3.4
## 9 4.4 2.9
## 10 4.9 3.1
## # … with 140 more rows
# filter observations with Sepal.Length greater than 5.5
filter(iris_tibble, Sepal.Length > 5.5)
## # A tibble: 91 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 5.8 4 1.2 0.2 setosa 1.45
## 2 5.7 4.4 1.5 0.4 setosa 1.30
## 3 5.7 3.8 1.7 0.3 setosa 1.5
## 4 7 3.2 4.7 1.4 versicolor 2.19
## 5 6.4 3.2 4.5 1.5 versicolor 2
## 6 6.9 3.1 4.9 1.5 versicolor 2.23
## 7 6.5 2.8 4.6 1.5 versicolor 2.32
## 8 5.7 2.8 4.5 1.3 versicolor 2.04
## 9 6.3 3.3 4.7 1.6 versicolor 1.91
## 10 6.6 2.9 4.6 1.3 versicolor 2.28
## # … with 81 more rows
# obtain the mean of Sepal.Width
summarise(iris_tibble, m = mean(Sepal.Width))
## # A tibble: 1 x 1
## m
## <dbl>
## 1 3.06
# order by decreasing value of Sepal.Width
arrange(iris_tibble, desc(Sepal.Width))
## # A tibble: 150 x 6
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_var
## <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
## 1 5.7 4.4 1.5 0.4 setosa 1.30
## 2 5.5 4.2 1.4 0.2 setosa 1.31
## 3 5.2 4.1 1.5 0.1 setosa 1.27
## 4 5.8 4 1.2 0.2 setosa 1.45
## 5 5.4 3.9 1.7 0.4 setosa 1.38
## 6 5.4 3.9 1.3 0.4 setosa 1.38
## 7 5.7 3.8 1.7 0.3 setosa 1.5
## 8 5.1 3.8 1.5 0.3 setosa 1.34
## 9 5.1 3.8 1.9 0.4 setosa 1.34
## 10 5.1 3.8 1.6 0.2 setosa 1.34
## # … with 140 more rows
Piping operator
We can combine several dplyr verbs in a single instruction using the piping operator %>%
:
# obtain the mean of Sepal.Width for observations with Sepal.Length greater than 5.5
iris_tibble %>%
filter(Sepal.Length > 5.5) %>%
summarise(m=mean(Sepal.Width))
## # A tibble: 1 x 1
## m
## <dbl>
## 1 2.96
Grouping
Sometimes we want to examine the properties of a dataset for each of the levels of a categorical variable (or combinations of levels). We can do that with group_by
. It is often useful combining grouping with summarising:
# mean of variables for each species
iris_tibble %>%
group_by(Species) %>%
summarise(m_sl = mean(Sepal.Length), m_sw = mean(Sepal.Width), m_pl = mean(Petal.Length), m_pw = mean(Petal.Width))
## # A tibble: 3 x 5
## Species m_sl m_sw m_pl m_pw
## * <fct> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 5.01 3.43 1.46 0.246
## 2 versicolor 5.94 2.77 4.26 1.33
## 3 virginica 6.59 2.97 5.55 2.03
More functions
dplyr
has many other functions for data manipulation. You can find them in the dplyr tidyverse website or in the dplyr cheatsheet.