Data cleaning and data analysis tasks are easier if data frames are arranged as tidy data. According to Hadley Wickham, in tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This means that tidy data are usually presented as long tables, and that a dataset covering several types of observational unit is stored as a set of related tables, one per type.
Let’s examine this table, which presents the Spanish population by age group for 2015-2019 (the figures appear to be in thousands of people):
year | 0-16 | 16-19 | 20-24 | 25-34 | 35-44 | 45-54 | 55-64 | 65+ |
---|---|---|---|---|---|---|---|---|
2019 | 7366.4 | 1895.7 | 2324.5 | 5287.8 | 7269.0 | 7507.8 | 6235.2 | 8907.2 |
2018 | 7415.9 | 1841.3 | 2266.8 | 5297.9 | 7406.2 | 7412.8 | 6034.1 | 8760.4 |
2017 | 7418.0 | 1793.1 | 2239.3 | 5333.5 | 7524.0 | 7313.5 | 5875.0 | 8638.2 |
2016 | 7434.7 | 1756.7 | 2242.2 | 5449.8 | 7637.7 | 7216.5 | 5742.2 | 8539.9 |
2015 | 7468.2 | 1727.7 | 2270.3 | 5656.6 | 7740.3 | 7131.6 | 5594.0 | 8369.1 |
This table is not tidy, because each row contains eight observations, one per age group. To tidy the data, we need to put each observation in its own row, so each row of the original table becomes eight rows. We also need a new variable that records the age group of each row, and a single column that holds the population values:
year | age_group | population |
---|---|---|
2019 | 0-16 | 7366.4 |
2018 | 0-16 | 7415.9 |
2017 | 0-16 | 7418.0 |
2016 | 0-16 | 7434.7 |
2015 | 0-16 | 7468.2 |
2019 | 16-19 | 1895.7 |
2018 | 16-19 | 1841.3 |
2017 | 16-19 | 1793.1 |
2016 | 16-19 | 1756.7 |
2015 | 16-19 | 1727.7 |
2019 | 20-24 | 2324.5 |
2018 | 20-24 | 2266.8 |
2017 | 20-24 | 2239.3 |
2016 | 20-24 | 2242.2 |
2015 | 20-24 | 2270.3 |
2019 | 25-34 | 5287.8 |
2018 | 25-34 | 5297.9 |
2017 | 25-34 | 5333.5 |
2016 | 25-34 | 5449.8 |
2015 | 25-34 | 5656.6 |
2019 | 35-44 | 7269.0 |
2018 | 35-44 | 7406.2 |
2017 | 35-44 | 7524.0 |
2016 | 35-44 | 7637.7 |
2015 | 35-44 | 7740.3 |
2019 | 45-54 | 7507.8 |
2018 | 45-54 | 7412.8 |
2017 | 45-54 | 7313.5 |
2016 | 45-54 | 7216.5 |
2015 | 45-54 | 7131.6 |
2019 | 55-64 | 6235.2 |
2018 | 55-64 | 6034.1 |
2017 | 55-64 | 5875.0 |
2016 | 55-64 | 5742.2 |
2015 | 55-64 | 5594.0 |
2019 | 65+ | 8907.2 |
2018 | 65+ | 8760.4 |
2017 | 65+ | 8638.2 |
2016 | 65+ | 8539.9 |
2015 | 65+ | 8369.1 |
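The reshaping described above can be sketched in R. This is a minimal sketch, assuming the tibble and tidyr packages (both part of the tidyverse) are installed; the object names `wide` and `long` are illustrative and do not come from the original text.

```r
library(tibble)
library(tidyr)

# Wide table from the text: one row per year, one column per age group.
# Backticks are needed because the column names are not syntactic R names.
wide <- tribble(
  ~year, ~`0-16`, ~`16-19`, ~`20-24`, ~`25-34`, ~`35-44`, ~`45-54`, ~`55-64`, ~`65+`,
  2019, 7366.4, 1895.7, 2324.5, 5287.8, 7269.0, 7507.8, 6235.2, 8907.2,
  2018, 7415.9, 1841.3, 2266.8, 5297.9, 7406.2, 7412.8, 6034.1, 8760.4,
  2017, 7418.0, 1793.1, 2239.3, 5333.5, 7524.0, 7313.5, 5875.0, 8638.2,
  2016, 7434.7, 1756.7, 2242.2, 5449.8, 7637.7, 7216.5, 5742.2, 8539.9,
  2015, 7468.2, 1727.7, 2270.3, 5656.6, 7740.3, 7131.6, 5594.0, 8369.1
)

# Every column except year holds population values. Move the column names
# into age_group and the cell values into population, giving one row per
# combination of year and age group.
long <- pivot_longer(
  wide,
  cols = -year,
  names_to = "age_group",
  values_to = "population"
)

print(long)
```

`pivot_longer()` walks through the wide table row by row, so it emits the eight age groups of 2019 first, then those of 2018, and so on. The row order therefore differs from the table above, but the 40 rows are the same.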
The wide table shown at the beginning can be useful for humans to read, as it allows us to compare values across rows and columns. Software, however, generally works more easily with tidy data, which usually means long tables. The tidyverse is a set of R packages for data science that are designed to work with tidy data.
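Because both layouts hold the same information, a long table can be turned back into a wide one when a human-readable view is needed. The following is a minimal sketch using `pivot_wider()` from tidyr on a small excerpt of the data; the names `long_excerpt` and `wide_excerpt` are illustrative assumptions.

```r
library(tibble)
library(tidyr)

# A four-row excerpt of the long table: two years, two age groups.
long_excerpt <- tribble(
  ~year, ~age_group, ~population,
  2019, "0-16",  7366.4,
  2018, "0-16",  7415.9,
  2019, "16-19", 1895.7,
  2018, "16-19", 1841.3
)

# Spread the age groups back into columns: one row per year,
# one column per distinct value of age_group.
wide_excerpt <- pivot_wider(
  long_excerpt,
  names_from = age_group,
  values_from = population
)

print(wide_excerpt)
```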