Tidy data

Jose M Sallan 2021-02-12

Data cleaning and data analysis tasks are easier if data frames are arranged as tidy data. According to Hadley Wickham, in tidy data:

Each variable forms a column.
Each observation forms a row.
Each type of observational unit forms a table.

This means that tidy data are usually presented as long tables, and that a dataset covering several types of observational units is stored as a set of relational tables, one per type of unit.

Let’s examine this table, which shows the Spanish population from 2015 to 2019 by age group (the figures appear to be in thousands of people):

year	0-16	16-19	20-24	25-34	35-44	45-54	55-64	65+
2019	7366.4	1895.7	2324.5	5287.8	7269.0	7507.8	6235.2	8907.2
2018	7415.9	1841.3	2266.8	5297.9	7406.2	7412.8	6034.1	8760.4
2017	7418.0	1793.1	2239.3	5333.5	7524.0	7313.5	5875.0	8638.2
2016	7434.7	1756.7	2242.2	5449.8	7637.7	7216.5	5742.2	8539.9
2015	7468.2	1727.7	2270.3	5656.6	7740.3	7131.6	5594.0	8369.1

This table is not tidy, because each row contains eight observations, one per age group. To tidy the data, we need to put each observation in its own row, so each row of the table above becomes eight rows and the five original rows become 40. We also need a new variable that holds the age group of each row.

year	age_group	population
2019	0-16	7366.4
2018	0-16	7415.9
2017	0-16	7418.0
2016	0-16	7434.7
2015	0-16	7468.2
2019	16-19	1895.7
2018	16-19	1841.3
2017	16-19	1793.1
2016	16-19	1756.7
2015	16-19	1727.7
2019	20-24	2324.5
2018	20-24	2266.8
2017	20-24	2239.3
2016	20-24	2242.2
2015	20-24	2270.3
2019	25-34	5287.8
2018	25-34	5297.9
2017	25-34	5333.5
2016	25-34	5449.8
2015	25-34	5656.6
2019	35-44	7269.0
2018	35-44	7406.2
2017	35-44	7524.0
2016	35-44	7637.7
2015	35-44	7740.3
2019	45-54	7507.8
2018	45-54	7412.8
2017	45-54	7313.5
2016	45-54	7216.5
2015	45-54	7131.6
2019	55-64	6235.2
2018	55-64	6034.1
2017	55-64	5875.0
2016	55-64	5742.2
2015	55-64	5594.0
2019	65+	8907.2
2018	65+	8760.4
2017	65+	8638.2
2016	65+	8539.9
2015	65+	8369.1
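
The reshaping described above can be sketched in a few lines of code. The snippet below is only an illustration in plain Python, not the method used to build the table: the names `wide`, `age_groups` and `to_long` are hypothetical, and in R the same step is normally done with `pivot_longer()` from the tidyr package.

```python
# Wide table from the text: one entry per year, one value per age group.
age_groups = ["0-16", "16-19", "20-24", "25-34", "35-44", "45-54", "55-64", "65+"]
wide = {
    2019: [7366.4, 1895.7, 2324.5, 5287.8, 7269.0, 7507.8, 6235.2, 8907.2],
    2018: [7415.9, 1841.3, 2266.8, 5297.9, 7406.2, 7412.8, 6034.1, 8760.4],
    2017: [7418.0, 1793.1, 2239.3, 5333.5, 7524.0, 7313.5, 5875.0, 8638.2],
    2016: [7434.7, 1756.7, 2242.2, 5449.8, 7637.7, 7216.5, 5742.2, 8539.9],
    2015: [7468.2, 1727.7, 2270.3, 5656.6, 7740.3, 7131.6, 5594.0, 8369.1],
}


def to_long(wide, age_groups):
    """Return one {year, age_group, population} record per cell of the wide table."""
    rows = []
    # Looping over age groups first reproduces the row order of the tidy table.
    for i, group in enumerate(age_groups):
        for year, values in wide.items():
            rows.append({"year": year, "age_group": group, "population": values[i]})
    return rows


tidy = to_long(wide, age_groups)
print(len(tidy))  # 5 years x 8 age groups = 40 rows
print(tidy[0])    # first row of the tidy table: 2019, 0-16, 7366.4
```

The column names of the wide table become values of the new `age_group` variable, and every cell becomes a row of its own.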

The wide table shown at the beginning is convenient for humans to read, as it lets us compare values across rows and columns. Software, however, is easier to write against tidy data, which usually means long tables. The tidyverse is a set of R packages for data science designed to work with tidy data.
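
One way to see why long tables suit programs: once every observation is a row, filtering and grouping become generic operations on column values. The sketch below is a plain Python illustration rather than tidyverse code. The helper `total_by` is hypothetical, and the four rows are copied from the tidy table above, so the totals cover only that subset.

```python
from collections import defaultdict

# A few rows copied from the tidy table in the text.
tidy = [
    {"year": 2019, "age_group": "0-16", "population": 7366.4},
    {"year": 2015, "age_group": "0-16", "population": 7468.2},
    {"year": 2019, "age_group": "65+", "population": 8907.2},
    {"year": 2015, "age_group": "65+", "population": 8369.1},
]

# Filtering is a test on the value of one column.
seniors = [row for row in tidy if row["age_group"] == "65+"]


def total_by(rows, key):
    """Sum the population column for each distinct value of the column `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["population"]
    # Round to one decimal to hide floating-point noise in the sums.
    return {k: round(v, 1) for k, v in totals.items()}


print(seniors)
print(total_by(tidy, "year"))       # partial totals: only two age groups are included
print(total_by(tidy, "age_group"))  # partial totals: only two years are included
```

The same `total_by` function works for either grouping variable. With the wide table, summing by year and summing by age group would need two different pieces of code.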
