Tidy data

Jose M Sallan 2021-02-12 2 min read

Data cleaning and data analysis tasks are easier if data frames are arranged as tidy data. According to Hadley Wickham, in tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

This means that tidy data are usually presented as long tables, and that a dataset covering several types of observational unit is stored as a set of relational tables, one per type.

Let’s examine this table, which presents the Spanish population from 2015 to 2019 by age group (the figures appear to be in thousands of people):

year    0-16   16-19   20-24   25-34   35-44   45-54   55-64     65+
2019  7366.4  1895.7  2324.5  5287.8  7269.0  7507.8  6235.2  8907.2
2018  7415.9  1841.3  2266.8  5297.9  7406.2  7412.8  6034.1  8760.4
2017  7418.0  1793.1  2239.3  5333.5  7524.0  7313.5  5875.0  8638.2
2016  7434.7  1756.7  2242.2  5449.8  7637.7  7216.5  5742.2  8539.9
2015  7468.2  1727.7  2270.3  5656.6  7740.3  7131.6  5594.0  8369.1

This table is not tidy data, as each row contains eight observations: one population value per age group. To tidy these data, we need to put each observation in its own row, so each row of the wide table becomes eight rows. We also need a new variable that records the age group of each row.

year age_group population
2019 0-16 7366.4
2018 0-16 7415.9
2017 0-16 7418.0
2016 0-16 7434.7
2015 0-16 7468.2
2019 16-19 1895.7
2018 16-19 1841.3
2017 16-19 1793.1
2016 16-19 1756.7
2015 16-19 1727.7
2019 20-24 2324.5
2018 20-24 2266.8
2017 20-24 2239.3
2016 20-24 2242.2
2015 20-24 2270.3
2019 25-34 5287.8
2018 25-34 5297.9
2017 25-34 5333.5
2016 25-34 5449.8
2015 25-34 5656.6
2019 35-44 7269.0
2018 35-44 7406.2
2017 35-44 7524.0
2016 35-44 7637.7
2015 35-44 7740.3
2019 45-54 7507.8
2018 45-54 7412.8
2017 45-54 7313.5
2016 45-54 7216.5
2015 45-54 7131.6
2019 55-64 6235.2
2018 55-64 6034.1
2017 55-64 5875.0
2016 55-64 5742.2
2015 55-64 5594.0
2019 65+ 8907.2
2018 65+ 8760.4
2017 65+ 8638.2
2016 65+ 8539.9
2015 65+ 8369.1
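
The reshaping described above can be sketched in R. This is one possible approach, not necessarily the one used to build the table: it assumes the tidyr and tibble packages are installed, and the object names pop_wide and pop_long are made up for illustration. The function pivot_longer() from tidyr gathers the eight age group columns into a pair of columns, age_group and population.

```r
library(tibble)
library(tidyr)

# Wide table from the text: one row per year, one column per age group.
# (pop_wide is a hypothetical name; backticks are needed because the
# age group labels are not syntactic R names.)
pop_wide <- tribble(
  ~year, ~`0-16`, ~`16-19`, ~`20-24`, ~`25-34`, ~`35-44`, ~`45-54`, ~`55-64`, ~`65+`,
  2019,  7366.4,  1895.7,   2324.5,   5287.8,   7269.0,   7507.8,   6235.2,   8907.2,
  2018,  7415.9,  1841.3,   2266.8,   5297.9,   7406.2,   7412.8,   6034.1,   8760.4,
  2017,  7418.0,  1793.1,   2239.3,   5333.5,   7524.0,   7313.5,   5875.0,   8638.2,
  2016,  7434.7,  1756.7,   2242.2,   5449.8,   7637.7,   7216.5,   5742.2,   8539.9,
  2015,  7468.2,  1727.7,   2270.3,   5656.6,   7740.3,   7131.6,   5594.0,   8369.1
)

# Every column except year holds a population value, so all of them are
# gathered: column names go to age_group, cell values go to population.
pop_long <- pivot_longer(
  pop_wide,
  cols = -year,
  names_to = "age_group",
  values_to = "population"
)

# pivot_longer() returns the rows year by year; reorder them by age group
# and descending year to match the layout of the long table in the text.
pop_long <- pop_long[order(pop_long$age_group, -pop_long$year), ]

print(pop_long, n = 40)
```

The result has 40 rows (5 years times 8 age groups) and three columns, one per variable.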

The wide table shown at the beginning can be useful for humans to read, as it allows us to compare across rows and columns. But machines prefer to work with tidy data, which usually means long tables. The tidyverse is a set of R packages for data science that expect tidy data as input.
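
As a small illustration of why tidy input is convenient, here is a sketch of a grouped summary with dplyr, one of the tidyverse packages. The choice of summary and the names pop_long and pop_summary are assumptions made for this example; the data are a four-row subset of the long table above (2018 and 2019, two age groups).

```r
library(tibble)
library(dplyr)

# A subset of the long table: one row per observation.
pop_long <- tribble(
  ~year, ~age_group, ~population,
  2019,  "0-16",     7366.4,
  2018,  "0-16",     7415.9,
  2019,  "65+",      8907.2,
  2018,  "65+",      8760.4
)

# Because age_group is a regular column, it can be used directly as a
# grouping variable; no column names need to be parsed or looped over.
pop_summary <- pop_long %>%
  group_by(age_group) %>%
  summarise(mean_population = mean(population))

print(pop_summary)
```

With the wide table, the same calculation would mean treating each age group column separately, whereas here adding more age groups or years only adds rows.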