Plotting Age Pyramids with ggplot

Jose M Sallan 2025-02-07

A population pyramid is a graphical illustration of the distribution of a population by age groups and sex. When the population is growing, this representation takes the shape of a pyramid, whence its name. Males are usually shown on the left and females on the right, and they may be measured in absolute numbers or as a percentage of the total population.

If we have data by age group available, we can draw our own age pyramids using the tidyverse functions for plotting and handling data. This representation, though, requires some formatting and adjusting of the dataset.

library(tidyverse)

Let’s use the read_csv() function from the readr package to read a .csv file from Idescat with population data by nationality and age group for different years. I pass the result to janitor::clean_names() to get column names in snake case.

pop <- read_csv("censph-536-19792-cat.csv") |>
  janitor::clean_names()

## Rows: 3192 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Catalunya, nacionalitat, sexe, edat quinquennal, concepte
## dbl (2): any, valor
## lgl (1): estat
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pop

## # A tibble: 3,192 × 8
##      any catalunya nacionalitat sexe  edat_quinquennal concepte estat  valor
##    <dbl> <chr>     <chr>        <chr> <chr>            <chr>    <lgl>  <dbl>
##  1  1991 Catalunya espanyola    homes de 0 a 4 anys    població NA    143821
##  2  1991 Catalunya espanyola    homes de 5 a 9 anys    població NA    172906
##  3  1991 Catalunya espanyola    homes de 10 a 14 anys  població NA    235146
##  4  1991 Catalunya espanyola    homes de 15 a 19 anys  població NA    262054
##  5  1991 Catalunya espanyola    homes de 20 a 24 anys  població NA    246040
##  6  1991 Catalunya espanyola    homes de 25 a 29 anys  població NA    232504
##  7  1991 Catalunya espanyola    homes de 30 a 34 anys  població NA    217492
##  8  1991 Catalunya espanyola    homes de 35 a 39 anys  població NA    200722
##  9  1991 Catalunya espanyola    homes de 40 a 44 anys  població NA    199194
## 10  1991 Catalunya espanyola    homes de 45 a 49 anys  població NA    181117
## # ℹ 3,182 more rows
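To see what janitor::clean_names() does to the headers without downloading the file, here is a minimal sketch on a two-row inline CSV. The inline data are made up for illustration and only mimic some of the Idescat column names; wrapping the string in I() tells read_csv() to treat it as literal data.

```r
library(readr)

# Made-up sample mimicking the original headers (note the capital letter and the space)
csv_text <- "any,Catalunya,edat quinquennal,valor\n1991,Catalunya,de 0 a 4 anys,100\n1991,Catalunya,de 5 a 9 anys,120\n"

# I() tells read_csv() that the string is the data itself, not a file path
raw <- read_csv(I(csv_text), show_col_types = FALSE)

# clean_names() converts the headers to lower-case snake case
clean <- janitor::clean_names(raw)

names(raw)
names(clean)
```

Setting show_col_types = FALSE also silences the column specification message shown above.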

Formatting and Adjusting Data

Let’s format and adjust the data to draw an age pyramid. I will start by selecting the columns we need and filtering out the rows that hold totals across age groups or across genders.

pop <- pop |>
  select(any, nacionalitat, sexe, edat_quinquennal, valor) |>
  filter(edat_quinquennal != "total", sexe != "total")

I will present the value for each gender and age group as a share of the yearly total, so I obtain the prop variable by dividing each value by the sum of valor for its year. Note how I am using group_by() together with mutate() to do this.

pop <- pop |>
  group_by(any) |>
  mutate(prop = valor/sum(valor)) |>
  ungroup()
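The grouping variable decides the denominator. Since the filter above keeps the nacionalitat == "total" rows next to the rows for each nationality, grouping by any alone adds all of them into sum(valor). The following sketch, on made-up numbers, compares that with grouping by year and nationality. The second variant is an alternative to consider, not what the plots in this post use.

```r
library(dplyr)

# Made-up counts: the "total" rows repeat the sum of the other two nationalities
toy <- tibble(
  any          = 2011,
  nacionalitat = rep(c("espanyola", "estrangera", "total"), each = 2),
  sexe         = rep(c("homes", "dones"), times = 3),
  valor        = c(40, 50, 5, 5, 45, 55)
)

# Denominator: every row of the year (as in the code above)
by_year <- toy |>
  group_by(any) |>
  mutate(prop = valor / sum(valor)) |>
  ungroup()

# Denominator: rows of the same year and nationality
by_year_nat <- toy |>
  group_by(any, nacionalitat) |>
  mutate(prop = valor / sum(valor)) |>
  ungroup()

filter(by_year, nacionalitat == "total")      # props add up to 0.5
filter(by_year_nat, nacionalitat == "total")  # props add up to 1
```

With the first grouping the shape of the pyramid is unchanged, but the percentages on the axis are smaller than the shares of the actual population.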

Now I need to relabel the age groups so that they can be presented in a more compact way. I am generating a tab_groups table with the current and desired labels for age groups. The new labels in grups are typed in the same order as the sorted original labels.

tab_groups <- tibble(edat_quinquennal = sort(unique(pop$edat_quinquennal)))

grups <- c("85 +  ", "  0 - 4", "10 - 14", "15 - 19", "20 - 24", "25 - 29",
           "30 - 34", "35 - 39", "40 - 44", "45 - 49", "  5 - 9", "50 - 54",
           "55 - 59", "60 - 64", "65 - 69", "70 - 74", "75 - 79", "80 - 84")

tab_groups <- tab_groups |>
  mutate(age_groups = grups)

tab_groups

## # A tibble: 18 × 2
##    edat_quinquennal age_groups
##    <chr>            <chr>     
##  1 85 anys o més    "85 +  "  
##  2 de 0 a 4 anys    "  0 - 4" 
##  3 de 10 a 14 anys  "10 - 14" 
##  4 de 15 a 19 anys  "15 - 19" 
##  5 de 20 a 24 anys  "20 - 24" 
##  6 de 25 a 29 anys  "25 - 29" 
##  7 de 30 a 34 anys  "30 - 34" 
##  8 de 35 a 39 anys  "35 - 39" 
##  9 de 40 a 44 anys  "40 - 44" 
## 10 de 45 a 49 anys  "45 - 49" 
## 11 de 5 a 9 anys    "  5 - 9" 
## 12 de 50 a 54 anys  "50 - 54" 
## 13 de 55 a 59 anys  "55 - 59" 
## 14 de 60 a 64 anys  "60 - 64" 
## 15 de 65 a 69 anys  "65 - 69" 
## 16 de 70 a 74 anys  "70 - 74" 
## 17 de 75 a 79 anys  "75 - 79" 
## 18 de 80 a 84 anys  "80 - 84"
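The hand-typed grups vector only works if it lines up with the output of sort(), which may vary with the collation locale. As a hedged alternative, the compact labels can be derived from the numbers inside the original labels. This sketch uses a few of the labels above and stringr; the names tab_alt, lower and upper are mine, and the resulting labels are not padded with spaces like the ones in grups.

```r
library(dplyr)
library(stringr)

# A few of the original labels, in arbitrary order
tab_alt <- tibble(
  edat_quinquennal = c("85 anys o més", "de 10 a 14 anys",
                       "de 0 a 4 anys", "de 5 a 9 anys")
) |>
  mutate(
    # first number in the label
    lower = as.integer(str_extract(edat_quinquennal, "\\d+")),
    # number that follows "a "; NA for the open-ended "85 anys o més"
    upper = as.integer(str_extract(edat_quinquennal, "(?<=a )\\d+")),
    age_groups = if_else(is.na(upper),
                         paste0(lower, " +"),
                         paste0(lower, " - ", upper))
  ) |>
  arrange(lower)  # youngest group first

tab_alt
```

Sorting by lower also gives the natural age order directly, without indexing by position.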

Then, I attach the new age group labels by joining tab_groups with the original table.

pop <- inner_join(pop, tab_groups, by = "edat_quinquennal")
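Since inner_join() silently drops rows without a match, it can be worth checking that no age label is left out of the lookup table. A minimal sketch with made-up tables (data_toy and lookup_toy are hypothetical names), where one label has deliberately no match:

```r
library(dplyr)

# Made-up tables: the third age label in the data has no entry in the lookup
data_toy <- tibble(
  edat_quinquennal = c("de 0 a 4 anys", "de 5 a 9 anys", "de 5 a 10 anys"),
  valor = c(10, 20, 30)
)
lookup_toy <- tibble(
  edat_quinquennal = c("de 0 a 4 anys", "de 5 a 9 anys"),
  age_groups = c("  0 - 4", "  5 - 9")
)

# inner_join() keeps only the rows with a match in both tables
joined <- inner_join(data_toy, lookup_toy, by = "edat_quinquennal")

# anti_join() lists the rows of the first table that found no match
unmatched <- anti_join(data_toy, lookup_toy, by = "edat_quinquennal")

nrow(joined)
unmatched
```

An empty result from anti_join() means that the join has kept every row.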

Now let’s format the age_groups column as an ordered factor, so that the age groups will be plotted in the correct order.

grups_ord <- grups[c(2, 11, 3:10, 12:18, 1)]

pop <- pop |>
  mutate(age_groups = factor(age_groups, ordered = TRUE, levels = grups_ord))
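To illustrate what the ordered factor changes, here is a small sketch with four of the labels. The vectors labels_toy and levels_toy are made up for the example.

```r
labels_toy <- c("10 - 14", "  5 - 9", "85 +  ", "  0 - 4")

# Desired order, youngest first
levels_toy <- c("  0 - 4", "  5 - 9", "10 - 14", "85 +  ")

f <- factor(labels_toy, levels = levels_toy, ordered = TRUE)

levels(f)      # the order ggplot2 uses along a discrete axis
as.integer(f)  # position of each element within the levels
sort(f)        # sorts by level, not alphabetically
```

A plain character column would be placed on the axis in alphabetical order instead.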

Finally, we can remove the edat_quinquennal and valor columns, which have been replaced by age_groups and prop.

pop <- pop |>
  select(-c(edat_quinquennal, valor))

This is what the dataset looks like after formatting and adjusting:

pop

## # A tibble: 2,016 × 5
##      any nacionalitat sexe    prop age_groups
##    <dbl> <chr>        <chr>  <dbl> <ord>     
##  1  1991 espanyola    homes 0.0119 "  0 - 4" 
##  2  1991 espanyola    homes 0.0143 "  5 - 9" 
##  3  1991 espanyola    homes 0.0194 "10 - 14" 
##  4  1991 espanyola    homes 0.0216 "15 - 19" 
##  5  1991 espanyola    homes 0.0203 "20 - 24" 
##  6  1991 espanyola    homes 0.0192 "25 - 29" 
##  7  1991 espanyola    homes 0.0179 "30 - 34" 
##  8  1991 espanyola    homes 0.0166 "35 - 39" 
##  9  1991 espanyola    homes 0.0164 "40 - 44" 
## 10  1991 espanyola    homes 0.0149 "45 - 49" 
## # ℹ 2,006 more rows

Plotting an Age Pyramid

Let’s plot an age pyramid with these data. An age pyramid is a stacked, horizontal bar plot, with negative values for men and positive values for women, so that the bars for men grow to the left of zero and the bars for women to the right. Here I am selecting the 2011 data for the total population.

pop |>
  filter(any == 2011, nacionalitat == "total") |>
  mutate(valor_pir = ifelse(sexe == "homes", -prop, prop)) |>
  select(sexe, valor_pir, age_groups) |>
  ggplot(aes(valor_pir, age_groups, fill = sexe)) +
  geom_col()
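The sign flip is the whole trick, so it may help to see it in isolation. This is a minimal sketch with made-up proportions for three age groups; toy and its values are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Made-up proportions for three age groups
toy <- tibble(
  sexe = rep(c("homes", "dones"), each = 3),
  age_groups = factor(rep(c("0 - 4", "5 - 9", "10 - 14"), times = 2),
                      levels = c("0 - 4", "5 - 9", "10 - 14")),
  prop = c(0.20, 0.17, 0.15, 0.19, 0.16, 0.13)
) |>
  # men get negative values, so their bars extend to the left of zero
  mutate(valor_pir = if_else(sexe == "homes", -prop, prop))

p <- ggplot(toy, aes(valor_pir, age_groups, fill = sexe)) +
  geom_col()
p
```

Here I have used dplyr's if_else(), which is stricter about types than base ifelse(); both give the same result in this case.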

Once we have the pyramid, let’s edit it to make it more readable by using:

scale_fill_manual() to change the bar colors and edit the legend.
scale_x_continuous() to show percentages in absolute value on the x axis and remove the x axis label.
theme_minimal() and theme() to change the theme and place the legend at the bottom.
labs() to add a title and a caption to the plot, and remove the y axis label.

pop |>
  filter(any == 2011, nacionalitat == "total") |>
  mutate(valor_pir = ifelse(sexe == "homes", -prop, prop)) |>
  select(sexe, valor_pir, age_groups) |>
  ggplot(aes(valor_pir, age_groups, fill = sexe)) +
  geom_col() +
  scale_fill_manual(name = "sex", values = c("#FF0000", "#0000FF"), labels = c("female", "male"))  +
  scale_x_continuous(name = NULL,
                     breaks = seq(-0.05, 0.05, 0.01),
                     labels = \(x) paste0(abs(x*100), "%")) +
  theme_minimal() +
  labs(title = "Catalan age pyramid (2011)", y = NULL, caption = "source: Idescat") +
  theme(legend.position = "bottom")
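The labels argument of scale_x_continuous() receives the vector of breaks and must return one label per break. To check what the anonymous function returns, we can give it a name and call it on a few values; abs_percent is a name I have made up for this sketch.

```r
# Same function as in scale_x_continuous(), given a name here
abs_percent <- function(x) paste0(abs(x * 100), "%")

# A few of the breaks used in the plot
breaks <- c(-0.05, -0.03, 0, 0.03, 0.05)
abs_percent(breaks)
```

The call returns "5%" "3%" "0%" "3%" "5%", so both sides of the pyramid are labelled with positive percentages.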

Age Pyramid with Gender Excess

An alternative representation of age pyramids shows gender excess, that is, which of the two genders has more individuals in each age group. This requires creating new variables from the original table for each age group:

min_m and min_w, representing the minimum of the values of men and women. Both variables have the same value.
excess_m, equal to the difference between men and women if it is positive and zero otherwise.
excess_w, equal to the difference between women and men if it is positive and zero otherwise.

I am using pivot_wider() to get values of men and women of an age group in the same row, and pivot_longer() to put the table in long format after the calculations. I have also formatted the gender column as an ordered factor.

pop_excess <- pop |>
  pivot_wider(names_from = "sexe", values_from = "prop") |>
  mutate(min_m = ifelse(homes < dones, homes, dones),
         min_w = min_m,
         excess_m = ifelse(homes > dones, homes - dones, 0),
         excess_w = ifelse(homes < dones, dones - homes, 0)) |>
  select(-c(homes, dones)) |>
  pivot_longer(min_m:excess_w, 
               names_to = "gender", values_to = "prop") |>
  mutate(gender = factor(gender, 
                         levels = c("excess_m", "min_m", "excess_w", "min_w"), 
                         ordered = TRUE))
pop_excess

## # A tibble: 4,032 × 5
##      any nacionalitat age_groups gender       prop
##    <dbl> <chr>        <ord>      <ord>       <dbl>
##  1  1991 espanyola    "  0 - 4"  min_m    0.0110  
##  2  1991 espanyola    "  0 - 4"  min_w    0.0110  
##  3  1991 espanyola    "  0 - 4"  excess_m 0.000842
##  4  1991 espanyola    "  0 - 4"  excess_w 0       
##  5  1991 espanyola    "  5 - 9"  min_m    0.0133  
##  6  1991 espanyola    "  5 - 9"  min_w    0.0133  
##  7  1991 espanyola    "  5 - 9"  excess_m 0.000940
##  8  1991 espanyola    "  5 - 9"  excess_w 0       
##  9  1991 espanyola    "10 - 14"  min_m    0.0181  
## 10  1991 espanyola    "10 - 14"  min_w    0.0181  
## # ℹ 4,022 more rows
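The same four variables can also be written with the base functions pmin() and pmax(), which should be equivalent to the ifelse() calls for non-missing values. A minimal sketch on made-up shares, showing that the common part plus the excess gives back the original value of each gender:

```r
library(dplyr)

# Made-up shares for three age groups, already in wide format
wide <- tibble(
  age_groups = c("0 - 4", "40 - 44", "85 +"),
  homes = c(0.030, 0.040, 0.005),
  dones = c(0.028, 0.040, 0.012)
)

wide <- wide |>
  mutate(
    min_m    = pmin(homes, dones),      # common part, shown on the men's side
    min_w    = min_m,                   # same common part, shown on the women's side
    excess_m = pmax(homes - dones, 0),  # surplus of men, 0 if there are more women
    excess_w = pmax(dones - homes, 0)   # surplus of women, 0 if there are more men
  )

wide
```

In each row at most one of excess_m and excess_w is different from zero.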

Now we can plot the age pyramid from the pop_excess table. Instead of a legend, I have set an annotation to signal data for men and women.

pop_excess |>
  filter(any == 1991, nacionalitat == "total") |>
  mutate(prop = ifelse(gender %in% c("min_m", "excess_m"), -prop, prop)) |>
  ggplot(aes(prop, age_groups, fill = gender)) +
  geom_col() +
  scale_fill_manual(values = c("#0000FF", "#9999FF", "#FF0000", "#FF9999")) +
  scale_x_continuous(name = NULL,
                     breaks = seq(-0.05, 0.05, 0.01),
                     labels = \(x) paste0(abs(x*100), "%")) +
  theme_light(base_size = 12) +
  labs(title = "Catalan age pyramid (1991)", y = NULL) +
  theme(legend.position = "none") +
  annotate("text", x = -0.015, y = "70 - 74", label = "men", size = 5) +
  annotate("text", x = 0.015, y = "70 - 74", label = "women", size = 5)
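In the plot above the colors are assigned by position, following the order of the levels of the gender factor. A hedged alternative is to pass a named vector to scale_fill_manual(), so that each color is matched to its level by name. The sketch below uses a made-up one-row-per-level table; pyramid_cols and toy are hypothetical names.

```r
library(ggplot2)

# Names must match the levels of the gender factor
pyramid_cols <- c(excess_m = "#0000FF", min_m = "#9999FF",
                  excess_w = "#FF0000", min_w = "#FF9999")

# Made-up values for a single age group, already with negative sign for men
toy <- data.frame(
  gender = factor(names(pyramid_cols), levels = names(pyramid_cols)),
  age_groups = "0 - 4",
  prop = c(-0.002, -0.028, 0, 0.028)
)

p <- ggplot(toy, aes(prop, age_groups, fill = gender)) +
  geom_col() +
  scale_fill_manual(values = pyramid_cols) +
  theme(legend.position = "none")
p
```

With named values, reordering the factor levels changes the stacking order but not the color of each category.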

Faceted Plot of Age Pyramids

The original data has values for nationality and year, so we can present age pyramids for different years by faceting with facet_wrap().

pop_excess |>
  filter(any %in% c(1991, 2001, 2011, 2021), nacionalitat == "total") |>
  mutate(prop = ifelse(gender %in% c("min_m", "excess_m"), -prop, prop)) |>
  ggplot(aes(prop, age_groups, fill = gender)) +
  geom_col() +
  scale_fill_manual(values = c("#0000FF", "#9999FF", "#FF0000", "#FF9999")) +
  scale_x_continuous(name = NULL,
                     breaks = seq(-0.05, 0.05, 0.01),
                     labels = \(x) paste0(abs(x*100), "%")) +
  theme_light(base_size = 10) +
  labs(title = "Catalan age pyramid", y = NULL, caption = "source: Idescat") +
  theme(legend.position = "none") +
  facet_wrap(. ~ any, ncol = 2)

References

Idescat data of population by nationality (continents), sex and five-year age group. https://www.idescat.cat/pub/?id=censph&n=536

Session Info

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Linux Mint 21.1
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Madrid
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.4 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
##  [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
##  [9] ggplot2_3.5.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] utf8_1.2.4        sass_0.4.9        generics_0.1.3    blogdown_1.19    
##  [5] stringi_1.8.3     hms_1.1.3         digest_0.6.35     magrittr_2.0.3   
##  [9] evaluate_0.23     grid_4.4.2        timechange_0.3.0  bookdown_0.39    
## [13] fastmap_1.1.1     jsonlite_1.8.9    scales_1.3.0      jquerylib_0.1.4  
## [17] cli_3.6.2         rlang_1.1.5       crayon_1.5.2      bit64_4.0.5      
## [21] munsell_0.5.1     withr_3.0.0       cachem_1.0.8      yaml_2.3.8       
## [25] tools_4.4.2       parallel_4.4.2    tzdb_0.4.0        colorspace_2.1-0 
## [29] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   snakecase_0.11.1 
## [33] bit_4.0.5         vroom_1.6.5       janitor_2.2.0     pkgconfig_2.0.3  
## [37] pillar_1.10.1     bslib_0.7.0       gtable_0.3.5      glue_1.7.0       
## [41] highr_0.10        xfun_0.43         tidyselect_1.2.1  rstudioapi_0.16.0
## [45] knitr_1.46        farver_2.1.1      htmltools_0.5.8.1 labeling_0.4.3   
## [49] rmarkdown_2.26    compiler_4.4.2

Data retrieved on 29 January 2025.
