Getting Barcelona Sunset and Sunrise Hours

Jose M Sallan 2024-02-13 15 min read

data cleaning

Getting and cleaning data is one of the most time-consuming activities in data analysis. This job consists of acquiring data from an unstructured source (if you prefer, from a source with maintainers with different ideas of what a structured dataset is) and transforming it into a (correctly) structured dataset. Paraphrasing Tolstoy, we can say that “structured datasets are all alike; every unstructured dataset is unstructured in its own way”.

In this post, I will present how I have retrieved sunrise (orto in Spanish) and sunset (ocaso in Spanish) hours in Barcelona during 2019. Data is publicly available as a text file from the following link:

https://cdn.mitma.gob.es/portal-web-drupal/salidapuestasol/2019/Barcelona-2019.txt

Let’s save the link in a R variable.

url_oo <- "https://cdn.mitma.gob.es/portal-web-drupal/salidapuestasol/2019/Barcelona-2019.txt"

In this post, I will rely on the powers of the tidyverse.

library(tidyverse)

Among the packages loaded in the above instruction is the stringr package, which will be very useful in this job.

Looking at Data

We can take a look at the data using readLines. This function parses a text file, so that each line is the component of a vector. As all reading functions from R, input can be a location in your computer or a link to a remote resource.

oo <- readLines(url_oo)
oo

##  [1] "BARCELONA                                  SALIDA Y PUESTA DE SOL PARA 2019             Observatorio Astron\xf3mico Nacional"    
##  [2] "Latitud y longitud: 41 23  7, + 2 10 39                                                   Instituto Geogr\xe1fico Nacional"      
##  [3] "A\xf1o 2019                                Hora oficial en la pen\xednsula y Baleares           Ministerio de Fomento, Espa\xf1a"
##  [4] ""                                                                                                                                
##  [5] "Dia  Enero   Febrero    Marzo     Abril      Mayo     Junio     Julio     Agosto   Septiem.  Octubre   Noviemb.  Diciemb."       
##  [6] "   Ort Ocas  Ort Ocas  Ort Ocas  Ort Ocas  Ort Ocas  Ort Ocas  Ort Ocas  Ort Ocas  Ort Ocas  Ort Ocas  Ort Ocas  Ort Ocas"       
##  [7] "   h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m"       
##  [8] " 1 817 1732  803 1807  726 1841  735 2016  649 2048  620 2118  621 2129  646 2109  717 2025  748 1934  722 1747  757 1723"       
##  [9] " 2 817 1733  802 1808  725 1843  734 2017  648 2049  620 2119  622 2129  647 2108  718 2023  749 1932  723 1746  758 1722"       
## [10] " 3 818 1734  801 1810  723 1844  732 2018  647 2051  620 2120  622 2128  648 2107  719 2022  750 1930  725 1744  759 1722"       
## [11] " 4 818 1735  800 1811  722 1845  730 2019  645 2052  619 2120  623 2128  649 2105  720 2020  751 1929  726 1743  800 1722"       
## [12] " 5 818 1736  759 1812  720 1846  729 2020  644 2053  619 2121  624 2128  650 2104  721 2018  752 1927  727 1742  801 1722"       
## [13] " 6 817 1737  758 1813  719 1847  727 2021  643 2054  619 2122  624 2128  651 2103  722 2017  753 1925  728 1741  802 1722"       
## [14] " 7 817 1738  757 1815  717 1848  725 2023  642 2055  618 2122  625 2127  652 2102  723 2015  754 1924  729 1740  803 1722"       
## [15] " 8 817 1739  756 1816  715 1850  724 2024  640 2056  618 2123  625 2127  653 2100  724 2013  755 1922  731 1739  804 1722"       
## [16] " 9 817 1740  754 1817  714 1851  722 2025  639 2057  618 2123  626 2127  654 2059  725 2012  756 1921  732 1738  805 1722"       
## [17] "10 817 1741  753 1818  712 1852  720 2026  638 2058  618 2124  627 2126  655 2058  726 2010  757 1919  733 1737  806 1722"       
## [18] "11 817 1742  752 1820  710 1853  719 2027  637 2059  617 2125  628 2126  656 2057  727 2008  758 1917  734 1736  807 1722"       
## [19] "12 816 1743  751 1821  709 1854  717 2028  636 2100  617 2125  628 2125  657 2055  728 2006  759 1916  736 1735  808 1722"       
## [20] "13 816 1744  749 1822  707 1855  716 2029  635 2101  617 2125  629 2125  658 2054  729 2005  800 1914  737 1734  808 1722"       
## [21] "14 816 1745  748 1823  705 1856  714 2030  634 2102  617 2126  630 2124  659 2052  730 2003  801 1913  738 1733  809 1722"       
## [22] "15 815 1746  747 1825  704 1857  712 2031  633 2103  617 2126  631 2124  700 2051  731 2001  803 1911  739 1732  810 1722"       
## [23] "16 815 1747  745 1826  702 1859  711 2032  632 2104  617 2127  631 2123  701 2050  732 2000  804 1909  740 1731  811 1723"       
## [24] "17 814 1749  744 1827  700 1900  709 2033  631 2105  617 2127  632 2122  702 2048  733 1958  805 1908  742 1730  811 1723"       
## [25] "18 814 1750  743 1828  659 1901  708 2034  630 2106  617 2127  633 2122  703 2047  734 1956  806 1906  743 1730  812 1723"       
## [26] "19 813 1751  741 1830  657 1902  706 2035  629 2107  618 2128  634 2121  704 2045  735 1954  807 1905  744 1729  813 1724"       
## [27] "20 813 1752  740 1831  655 1903  705 2037  628 2108  618 2128  635 2120  705 2044  736 1953  808 1903  745 1728  813 1724"       
## [28] "21 812 1753  738 1832  654 1904  703 2038  628 2109  618 2128  636 2119  706 2042  737 1951  809 1902  746 1727  814 1725"       
## [29] "22 811 1755  737 1833  652 1905  702 2039  627 2110  618 2128  636 2119  707 2041  738 1949  811 1900  747 1727  814 1725"       
## [30] "23 811 1756  736 1834  650 1906  700 2040  626 2111  618 2129  637 2118  708 2039  739 1948  812 1859  749 1726  815 1726"       
## [31] "24 810 1757  734 1836  649 1907  659 2041  625 2112  619 2129  638 2117  709 2038  740 1946  813 1858  750 1726  815 1726"       
## [32] "25 809 1758  733 1837  647 1908  657 2042  625 2112  619 2129  639 2116  710 2036  741 1944  814 1856  751 1725  816 1727"       
## [33] "26 808 1759  731 1838  645 1910  656 2043  624 2113  619 2129  640 2115  711 2035  742 1942  815 1855  752 1725  816 1728"       
## [34] "27 808 1801  730 1839  644 1911  655 2044  623 2114  620 2129  641 2114  712 2033  743 1941  716 1753  753 1724  816 1728"       
## [35] "28 807 1802  728 1840  642 1912  653 2045  623 2115  620 2129  642 2113  713 2031  744 1939  718 1752  754 1724  817 1729"       
## [36] "29 806 1803            640 1913  652 2046  622 2116  620 2129  643 2112  714 2030  745 1937  719 1751  755 1723  817 1730"       
## [37] "30 805 1804            639 1914  650 2047  621 2117  621 2129  644 2111  715 2028  746 1936  720 1749  756 1723  817 1730"       
## [38] "31 804 1806            737 2015            621 2117            645 2110  716 2027            721 1748            817 1731"       
## [39] "   h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m  h m"       
## [40] ""                                                                                                                                
## [41] "Se ha considerado el horario adelantado desde el \xfaltimo domingo de marzo al \xfaltimo domingo de octubre. Las coordenadas"    
## [42] "vienen dadas en grados, minutos y segundos, siendo la longitud positiva al Este y negativa al Oeste del meridiano cero."

Taking a look at the results we observe that:

Data is between lines 8 and 38.
Column names are useless in this context.
Sunrise and sunset hours occur in the same positions of each line for all days.
We can obtain the final position of each sunrise and sunset hour with the m character in line 7.

To obtain the positions of the m character in line 7, we can use stringr::str_locate_all():

pos_e <- str_locate_all(oo[7], "m")[[1]][ ,1]
pos_e

##  [1]   6  11  16  21  26  31  36  41  46  51  56  61  66  71  76  81  86  91  96
## [20] 101 106 111 116 121

In pos_e is retrieved the end of each block of data. All sunrise hours have three characters and all sunset have four, so we can find the start of each block doing:

dec <- rep(c(2, 3), 12)
pos_s <- pos_e - dec
pos_s

##  [1]   4   8  14  18  24  28  34  38  44  48  54  58  64  68  74  78  84  88  94
## [20]  98 104 108 114 118

Therefore, pos_e and pos_s store the positions of each sunrise and sunset hour of each row.

Retrieving the Data

With pos_s and pos_e we can retrieve the data of each row. We can pick each block with stringr::str_sub(). This function picks the values of a string between start and end positions. These can be vectors, so it can separate each value of a row. Let’s illustrate it with the first row:

str_sub(oo[8], pos_s, pos_e)

##  [1] "817"  "1732" "803"  "1807" "726"  "1841" "735"  "2016" "649"  "2048"
## [11] "620"  "2118" "621"  "2129" "646"  "2109" "717"  "2025" "748"  "1934"
## [21] "722"  "1747" "757"  "1723"

We can iterate this expression along rows 8 to 38 of oo and stack all rows together into a data frame. We can do this in one step with purrr::map_dfr(), but we need to name each vector with the same column names. Let’s define the column names setting months as numbers and distinguishing between sunrise and sunset with orto and ocaso. Note how we are using rep() with different arguments to do this.

cn1 <- rep(1:12, each = 2)
cn2 <- rep(c("orto", "ocaso"), times = 12)
cn <- paste(cn1, cn2, sep = "_")
cn

##  [1] "1_orto"   "1_ocaso"  "2_orto"   "2_ocaso"  "3_orto"   "3_ocaso" 
##  [7] "4_orto"   "4_ocaso"  "5_orto"   "5_ocaso"  "6_orto"   "6_ocaso" 
## [13] "7_orto"   "7_ocaso"  "8_orto"   "8_ocaso"  "9_orto"   "9_ocaso" 
## [19] "10_orto"  "10_ocaso" "11_orto"  "11_ocaso" "12_orto"  "12_ocaso"

Now we are ready to arrange the data into a data frame:

table <- map_dfr(oo[8:38], ~ {v <- str_sub(. , pos_s, pos_e)
                              names(v) <- cn
                              return(v)})

Cleaning the Data

Let’s see what we’ve got so far:

table |>
  print(n = Inf)

## # A tibble: 31 × 24
##    `1_orto` `1_ocaso` `2_orto` `2_ocaso` `3_orto` `3_ocaso` `4_orto` `4_ocaso`
##    <chr>    <chr>     <chr>    <chr>     <chr>    <chr>     <chr>    <chr>    
##  1 817      1732      "803"    "1807"    726      1841      "735"    "2016"   
##  2 817      1733      "802"    "1808"    725      1843      "734"    "2017"   
##  3 818      1734      "801"    "1810"    723      1844      "732"    "2018"   
##  4 818      1735      "800"    "1811"    722      1845      "730"    "2019"   
##  5 818      1736      "759"    "1812"    720      1846      "729"    "2020"   
##  6 817      1737      "758"    "1813"    719      1847      "727"    "2021"   
##  7 817      1738      "757"    "1815"    717      1848      "725"    "2023"   
##  8 817      1739      "756"    "1816"    715      1850      "724"    "2024"   
##  9 817      1740      "754"    "1817"    714      1851      "722"    "2025"   
## 10 817      1741      "753"    "1818"    712      1852      "720"    "2026"   
## 11 817      1742      "752"    "1820"    710      1853      "719"    "2027"   
## 12 816      1743      "751"    "1821"    709      1854      "717"    "2028"   
## 13 816      1744      "749"    "1822"    707      1855      "716"    "2029"   
## 14 816      1745      "748"    "1823"    705      1856      "714"    "2030"   
## 15 815      1746      "747"    "1825"    704      1857      "712"    "2031"   
## 16 815      1747      "745"    "1826"    702      1859      "711"    "2032"   
## 17 814      1749      "744"    "1827"    700      1900      "709"    "2033"   
## 18 814      1750      "743"    "1828"    659      1901      "708"    "2034"   
## 19 813      1751      "741"    "1830"    657      1902      "706"    "2035"   
## 20 813      1752      "740"    "1831"    655      1903      "705"    "2037"   
## 21 812      1753      "738"    "1832"    654      1904      "703"    "2038"   
## 22 811      1755      "737"    "1833"    652      1905      "702"    "2039"   
## 23 811      1756      "736"    "1834"    650      1906      "700"    "2040"   
## 24 810      1757      "734"    "1836"    649      1907      "659"    "2041"   
## 25 809      1758      "733"    "1837"    647      1908      "657"    "2042"   
## 26 808      1759      "731"    "1838"    645      1910      "656"    "2043"   
## 27 808      1801      "730"    "1839"    644      1911      "655"    "2044"   
## 28 807      1802      "728"    "1840"    642      1912      "653"    "2045"   
## 29 806      1803      "   "    "    "    640      1913      "652"    "2046"   
## 30 805      1804      "   "    "    "    639      1914      "650"    "2047"   
## 31 804      1806      "   "    "    "    737      2015      "   "    "    "   
## # ℹ 16 more variables: `5_orto` <chr>, `5_ocaso` <chr>, `6_orto` <chr>,
## #   `6_ocaso` <chr>, `7_orto` <chr>, `7_ocaso` <chr>, `8_orto` <chr>,
## #   `8_ocaso` <chr>, `9_orto` <chr>, `9_ocaso` <chr>, `10_orto` <chr>,
## #   `10_ocaso` <chr>, `11_orto` <chr>, `11_ocaso` <chr>, `12_orto` <chr>,
## #   `12_ocaso` <chr>

There are several problems with this table:

There are empty spaces with days that do not exist, like 30 of February.
We need to add a column specifying the day.
Data is in wide format, and computers need long tables.
Time is stored as a three- of four-character string.

Let’s tackle those issues step by step. First, let’s put the times that do not exist as NA:

table <- table |>
  mutate(across(everything(), ~ case_when(. == "   " ~ NA,
                                          . == "    " ~ NA,
                                          TRUE ~ .)))

Let’s add the dia column for the day number:

table <- table |>
  mutate(dia = 1:31)

Now we are ready to set the table in long format. In this format, we don’t need to keep the non-existing days, so we remove them with tidyr::drop_na().

table <- table |>
  pivot_longer(-dia, names_to = "var", values_to = "tiempo") |>
  drop_na()

Let’s take a look at the result:

table

## # A tibble: 730 × 3
##      dia var     tiempo
##    <int> <chr>   <chr> 
##  1     1 1_orto  817   
##  2     1 1_ocaso 1732  
##  3     1 2_orto  803   
##  4     1 2_ocaso 1807  
##  5     1 3_orto  726   
##  6     1 3_ocaso 1841  
##  7     1 4_orto  735   
##  8     1 4_ocaso 2016  
##  9     1 5_orto  649   
## 10     1 5_ocaso 2048  
## # ℹ 720 more rows

We can split the values of var with tidyr::separate():

table <- table |>
  separate(var, sep = "_", into = c("mes", "evento"))
table

## # A tibble: 730 × 4
##      dia mes   evento tiempo
##    <int> <chr> <chr>  <chr> 
##  1     1 1     orto   817   
##  2     1 1     ocaso  1732  
##  3     1 2     orto   803   
##  4     1 2     ocaso  1807  
##  5     1 3     orto   726   
##  6     1 3     ocaso  1841  
##  7     1 4     orto   735   
##  8     1 4     ocaso  2016  
##  9     1 5     orto   649   
## 10     1 5     ocaso  2048  
## # ℹ 720 more rows

There are several ways of picking the hour and minute from this time format. I have chosen to transform it into numeric and to use the rounding floor() function to get the hour hora and get the value of minutes min from h:

table <- table |>
  mutate(tiempo = as.numeric(tiempo),
         hora = floor(tiempo/100),
         min = tiempo - hora*100)
table

## # A tibble: 730 × 6
##      dia mes   evento tiempo  hora   min
##    <int> <chr> <chr>   <dbl> <dbl> <dbl>
##  1     1 1     orto      817     8    17
##  2     1 1     ocaso    1732    17    32
##  3     1 2     orto      803     8     3
##  4     1 2     ocaso    1807    18     7
##  5     1 3     orto      726     7    26
##  6     1 3     ocaso    1841    18    41
##  7     1 4     orto      735     7    35
##  8     1 4     ocaso    2016    20    16
##  9     1 5     orto      649     6    49
## 10     1 5     ocaso    2048    20    48
## # ℹ 720 more rows

Finally, let’s use dplyr::select() to retrieve the columns we need:

table <- table |>
  select(mes, dia, evento, hora, min) |>
  arrange(mes, dia)
table

## # A tibble: 730 × 5
##    mes     dia evento  hora   min
##    <chr> <int> <chr>  <dbl> <dbl>
##  1 1         1 orto       8    17
##  2 1         1 ocaso     17    32
##  3 1         2 orto       8    17
##  4 1         2 ocaso     17    33
##  5 1         3 orto       8    18
##  6 1         3 ocaso     17    34
##  7 1         4 orto       8    18
##  8 1         4 ocaso     17    35
##  9 1         5 orto       8    18
## 10 1         5 ocaso     17    36
## # ℹ 720 more rows

Using the dataset

The way that the table has been arranged allows using it flexibly. We can collapse the date and time values into a timestamp with lubridate::makedatetime() or create a date with lubridate::makedate() and a time with hms::hms().

As an example, let’s plot sunrise and sunset time for each day.

library(hms)
table |>
  mutate(date = make_date(year = 2019, month = mes, day = dia),
         time = hms(minutes = min, hours = hora)) |>
  ggplot(aes(date, time, color = evento)) +
  geom_line(linewidth = 1) +
  theme_minimal() +
  scale_color_manual(values = c("#FF3333", "#FF9933")) +
  theme(legend.position = "top", 
        axis.title = element_blank(),
        plot.title.position = "plot") +
  ggtitle("Sunrise (orto) and sunset (ocaso) in Barcelona (2019)")

Here we can see the effect of the daylight saving time.

References

Instituto Geográfico Nacional (2024). Hora, salida y puesta de sol. https://astronomia.ign.es/hora-salidas-y-puestas-de-sol

Session Info

## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Linux Mint 21.1
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Madrid
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] hms_1.1.3       lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1  
##  [5] dplyr_1.1.4     purrr_1.0.2     readr_2.1.5     tidyr_1.3.0    
##  [9] tibble_3.2.1    ggplot2_3.4.4   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.3      jsonlite_1.8.8    highr_0.10        compiler_4.3.2   
##  [5] tidyselect_1.2.0  jquerylib_0.1.4   scales_1.2.1      yaml_2.3.7       
##  [9] fastmap_1.1.1     R6_2.5.1          labeling_0.4.2    generics_0.1.3   
## [13] knitr_1.42        bookdown_0.33     munsell_0.5.0     tzdb_0.3.0       
## [17] bslib_0.5.0       pillar_1.9.0      rlang_1.1.3       utf8_1.2.3       
## [21] stringi_1.7.12    cachem_1.0.7      xfun_0.39         sass_0.4.5       
## [25] timechange_0.2.0  cli_3.6.1         withr_2.5.0       magrittr_2.0.3   
## [29] digest_0.6.31     grid_4.3.2        rstudioapi_0.15.0 lifecycle_1.0.3  
## [33] vctrs_0.6.4       evaluate_0.20     glue_1.6.2        farver_2.1.1     
## [37] blogdown_1.16     fansi_1.0.4       colorspace_2.1-0  rmarkdown_2.21   
## [41] tools_4.3.2       pkgconfig_2.0.3   htmltools_0.5.5

The Jose M Sallan static website