k-means clusters of Barcelona Nightlife

Jose M Sallan 2025-12-13 10 min read

Cluster analysis, or clustering, is a data analysis technique aimed at partitioning a set of objects into groups such that objects within the same group or cluster exhibit greater similarity to one another than to those in other groups. One of the best known clustering techniques is k-means clustering. The aim of k-means clustering is to minimize the sum of within-clusters sum of squares. Objects are assigned to the cluster whose centroid is closer to the object. Usually k-means is implemented through a iterative heuristic. In R, this heuristic is implemented in the kmeans() R base function.

In this post, I will present an example of k-means clustering with geographical data, using the set of nightlife venues of Barcelona registered in 2024. Additionnally I will show how can we extract information from kmeans() using broom, how to obtain a reasonable value of the number of clusters \(k\) and how to interpret the results using data features, in this case the neoighbourhood where venues are located.

library(tidyverse)
library(broom)
library(sf)
library(BAdatasetsSpatial)
library(kselection)

Data

There are two relevant datasets for this analysis. First I will retrieve a Barcelona neighborhood map bcn_neigh from BAdatasetsSpatial.

bcn_neigh <- BCNNeigh |>
  select(c_barri, n_barri, c_distri, n_distri)

Second I will retrieve relevant information from each venue, specifically the geographical location with longitud and latitud, the neighborhood and the name and address. All is gathered in the nightlife_2024 data frame.

nightlife_2024 <- data_2024 |>
  select(nom_local, latitud, longitud, nom_barri, codi_barri, nom_via, porta)

Now I need to elaborate on this data a bit more. Instead of relying on dataset information, I will perform a spatial join to assign the neighborhood to each venue. To do so, I need to create the simple feature object nightlife_2024_sf and then do the spatial join with bcn_neigh.

nightlife_2024_sf <- nightlife_2024 |>
  st_as_sf(coords = c("longitud", "latitud"), crs = 4326, remove = FALSE)

nightlife_2024_sf  <- st_join(nightlife_2024_sf, 
                              bcn_neigh |> select(c_barri, n_barri),
                              join = st_intersects)

Finally, I am removing observations that fall out of the polygons of bcn_neigh.

nightlife_2024_sf <- nightlife_2024_sf |>
  filter(!is.na(c_barri))

k-means Clustering with broom

Let’s test how the kmeans() function works. In this case, we are using as features longitude and latitude. In this case, the distance between observations can be interpreted as geographical distance. First we need to prepare the nl_coords table retaining the relevant variables and removing the geometry column.

nl_coords <- nightlife_2024_sf |>
  st_drop_geometry() |>
  select(longitud, latitud)

Let’s do a k-means clustering with four centers. As the heuristics has random elements, I am setting a seed of random numbers for reproducibility. The result of the analysis is stored in test_kmeans.

set.seed(55)
test_kmeans <- kmeans(nl_coords, centers = 4)

Let’s apply the three functions of broom to test_kemans.

tidy(test_kmeans) # centers of clusters

## # A tibble: 4 × 5
##   longitud latitud  size withinss cluster
##      <dbl>   <dbl> <int>    <dbl> <fct>  
## 1     2.16    41.4    54  0.00403 1      
## 2     2.14    41.4    77  0.0125  2      
## 3     2.18    41.4    63  0.00541 3      
## 4     2.17    41.4    47  0.0119  4

glance(test_kmeans) # performance metrics

## # A tibble: 1 × 4
##   totss tot.withinss betweenss  iter
##   <dbl>        <dbl>     <dbl> <int>
## 1 0.128       0.0338    0.0938     3

augment(test_kmeans, data = nl_coords) # assigns cluster variable

## # A tibble: 241 × 3
##    longitud latitud .cluster
##       <dbl>   <dbl> <fct>   
##  1     2.15    41.4 2       
##  2     2.13    41.4 2       
##  3     2.16    41.4 1       
##  4     2.14    41.4 2       
##  5     2.17    41.4 1       
##  6     2.17    41.4 1       
##  7     2.15    41.4 2       
##  8     2.17    41.4 1       
##  9     2.17    41.4 4       
## 10     2.13    41.4 2       
## # ℹ 231 more rows

For k-means clustering the broom functions retrieve the following information:

tidy() returns the centroids of each of the clusters.
glance() presents some performance parameters. The more relevant is the within-clusters sum of squares tot.withinss.
augment() adds to data a .cluster column with the cluster assigned to each observation. For convenience, this column is encoded as factor.

I have used augment() and tidy() to plot the solution of the clustering with four centers, representing each point and the clusters centroids.

augment(test_kmeans, data = nl_coords) |>
  ggplot(aes(longitud, latitud, color = .cluster)) +
  geom_point() +
  geom_point(data = tidy(test_kmeans), aes(longitud, latitud, color = cluster), shape = 3, size = 10) +
  theme_minimal() +
  theme(legend.position = "none")

From the plot, we observe that the clusters obtained with k-means tend to be of circular shape, centered in the cluster centroid.

Number of Clusters

It is about time to select how many clusters to enter into the k-means algorithm. The first strategy is to evaluate the within-clusters sum of squares for a range of values of k. As k increases the sum of squares is smaller, but there is a moment where the gain of performance is not significant. We can appreciate that plotting the within-clusters sum of squares versus the number of clusters. The elbow_table contains the performance metrics obtained with glance() for values of k between one and ten.

set.seed(1111)
elbow_table <- 
  map_dfr(1:10, ~ kmeans(nl_coords, centers = .) |> 
            glance() |> 
            mutate(k = .) |> 
            relocate(k))

elbow_table |>
  ggplot(aes(k, tot.withinss)) +
  geom_line() +
  geom_point() +
  geom_vline(xintercept = 5, linetype = "dashed", 
             color ="red", linewidth = 1) +
  scale_x_continuous(breaks = 1:10) +
  theme_minimal(base_size = 12)

From the plot above, the elbow point from which within-clusters sum of squares is not reduced significantly is k = 5.

The kselection package implements the procedure described by Pahm et al. (2004) to determine the number of clusters in k-means clustering. Here I have just run the code presented in the package vignette.

set.seed(1313)
sol <- kselection(nl_coords)
num_clusters(sol)

## [1] 13

get_f_k(sol)

##  [1] 1.0000000 0.8644771 0.9132815 1.0342962 0.8466897 1.0194538 0.9463159
##  [8] 0.9961763 1.0007176 0.9894164 1.0666465 0.9523737 0.7859992 1.0725217
## [15] 0.8699234

plot(sol)

kselection recommends also k = 5 clusters.

Five Nightlife Clusters

Let’s find the k-means partition of nightlife venues into five clusters. The result is in nl_kmeans05.

set.seed(1212)
nl_kmeans_05 <- kmeans(nl_coords, centers = 5)

I am adding the cluster variable to nightlife_2024_sf as a vector. I cannot use glance() here, as it does not work with spatial objects.

nightlife_2024_sf <- nightlife_2024_sf |>
  mutate(cluster = as.factor(nl_kmeans_05$cluster)) # augment() cannot be used

Let’s do a draft of the map with the venues colored by cluster.

ggplot(bcn_neigh) +
  geom_sf() +
  geom_sf(data = nightlife_2024_sf, aes(color = cluster))

Interpreting Clusters

To interpret the clusters, I will count the number of venues by cluster and neighborhood. The results are in the clusters_neigh table.

clusters_neigh <- nightlife_2024_sf |>
  st_drop_geometry() |>
  group_by(cluster, n_barri) |>
  summarise(venues = n(), .groups = "drop")

Let’s examine clusters one by one:

clusters_neigh |>
  filter(cluster == 1) # gotic - la ribera

## # A tibble: 9 × 3
##   cluster n_barri                               venues
##   <fct>   <chr>                                  <int>
## 1 1       Sant Pere, Santa Caterina i la Ribera     16
## 2 1       el Barri Gòtic                            21
## 3 1       el Clot                                    1
## 4 1       el Fort Pienc                              3
## 5 1       el Parc i la Llacuna del Poblenou          8
## 6 1       el Poblenou                                1
## 7 1       la Barceloneta                             3
## 8 1       la Dreta de l'Eixample                     6
## 9 1       la Vila Olímpica del Poblenou              2

clusters_neigh |>
  filter(cluster == 2) # sant gervasi - gracia

## # A tibble: 6 × 3
##   cluster n_barri                            venues
##   <fct>   <chr>                               <int>
## 1 2       Sant Gervasi - Galvany                 29
## 2 2       el Camp d'en Grassot i Gràcia Nova      1
## 3 2       l'Antiga Esquerra de l'Eixample         8
## 4 2       la Dreta de l'Eixample                  8
## 5 2       la Nova Esquerra de l'Eixample          5
## 6 2       la Vila de Gràcia                      15

clusters_neigh |>
  filter(cluster == 3) #poble sec and raval, lower Eixample

## # A tibble: 6 × 3
##   cluster n_barri                         venues
##   <fct>   <chr>                            <int>
## 1 3       Sant Antoni                          7
## 2 3       el Barri Gòtic                       2
## 3 3       el Poble Sec                        23
## 4 3       el Raval                            16
## 5 3       l'Antiga Esquerra de l'Eixample      3
## 6 3       la Nova Esquerra de l'Eixample       1

clusters_neigh |>
  filter(cluster == 4) # guinardo - horta

## # A tibble: 10 × 3
##    cluster n_barri                       venues
##    <fct>   <chr>                          <int>
##  1 4       Horta                              3
##  2 4       Navas                              3
##  3 4       Vilapicina i la Torre Llobeta      1
##  4 4       el Baix Guinardó                   1
##  5 4       el Carmel                          1
##  6 4       el Congrés i els Indians           1
##  7 4       el Guinardó                        8
##  8 4       el Turó de la Peira                2
##  9 4       la Sagrada Família                 1
## 10 4       la Sagrera                         1

clusters_neigh |>
  filter(cluster == 5) # sants and les corts // covers also Sarrià and Les Tres Torres

## # A tibble: 10 × 3
##    cluster n_barri                        venues
##    <fct>   <chr>                           <int>
##  1 5       Hostafrancs                         2
##  2 5       Pedralbes                           2
##  3 5       Sant Gervasi - Galvany              3
##  4 5       Sants                               9
##  5 5       Sants - Badal                       2
##  6 5       Sarrià                              3
##  7 5       la Maternitat i Sant Ramon          3
##  8 5       la Nova Esquerra de l'Eixample      2
##  9 5       les Corts                          11
## 10 5       les Tres Torres                     3

The five clusters can be interpreted as follows:

Cluster 1 Gòtic - la Ribera. A cluster covering the north of the city centre. Also covers venues of close neighborhoods, extening to Port Olímpic.
Cluster 2 Sant Gervasi - Gràcia. Venues above the Diagonal street, more focused on a local, affluent public.
Cluster 3 Poble Sec - Raval: Venues of the south of the city centre. While Raval venues are targeted to tourists, Poble Sec honors a lasting tradition of nightlife.
Cluster 4 Guinardó - Horta: This part of the city comes from villages that were absorbed by Barcelona, and that they have still their own communities. The most relevant nightlife scenes are from Guinardó, Horta and Navas.
Cluster 5 Les Corts - Sarrià: even more posh than Sant Gervasi, this cluster covers the nightlife of the richest neighbourhoods.

The Five Nightlife Clusters

Here is the map of the five nightlife clusters, with some coloring and the clusters labels.

ggplot(bcn_neigh) +
  geom_sf(fill = "#FFE5CC") +
  geom_sf(data = nightlife_2024_sf, aes(color = cluster)) +
  scale_color_brewer(labels = c("Gòtic - la Ribera", "Sant Gervasi - Gràcia", "Poble Sec - Raval", "Guinardó - Horta",  "Les Corts - Sarrià"), 
                     palette = "Dark2") +
  theme_void() +
  theme(legend.position = "inside", 
        legend.position.inside = c(0.2, 0.3),
        panel.background = element_rect(fill = "white"))

As elements are added to the cluster by proximity to cluster centroid, these tend to be circular. This can be overriden by other clustering algorithms like DBSCAN.

References

Package kselection: https://github.com/drodriguezperez/kselection
D T Pham, S S Dimov, and C D Nguyen (2004). Selection of k in k-means clustering, Mechanical Engineering Science, 219(1), 103-119. https://doi.org/10.1243/095440605X8298

Session Info

## R version 4.5.2 (2025-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Linux Mint 21.1
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Europe/Madrid
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] kselection_0.2.1        BAdatasetsSpatial_0.1.0 sf_1.0-20              
##  [4] broom_1.0.10            lubridate_1.9.4         forcats_1.0.1          
##  [7] stringr_1.6.0           dplyr_1.1.4             purrr_1.2.0            
## [10] readr_2.1.5             tidyr_1.3.1             tibble_3.3.0           
## [13] ggplot2_4.0.0           tidyverse_2.0.0        
## 
## loaded via a namespace (and not attached):
##  [1] s2_1.1.7           utf8_1.2.4         sass_0.4.10        generics_0.1.3    
##  [5] class_7.3-23       KernSmooth_2.23-26 blogdown_1.21      stringi_1.8.7     
##  [9] hms_1.1.4          digest_0.6.37      magrittr_2.0.4     evaluate_1.0.3    
## [13] grid_4.5.2         timechange_0.3.0   RColorBrewer_1.1-3 bookdown_0.43     
## [17] fastmap_1.2.0      jsonlite_2.0.0     e1071_1.7-16       backports_1.5.0   
## [21] DBI_1.2.3          scales_1.4.0       jquerylib_0.1.4    cli_3.6.4         
## [25] rlang_1.1.6        units_0.8-7        withr_3.0.2        cachem_1.1.0      
## [29] yaml_2.3.10        tools_4.5.2        tzdb_0.5.0         vctrs_0.6.5       
## [33] R6_2.6.1           proxy_0.4-27       lifecycle_1.0.4    classInt_0.4-11   
## [37] pkgconfig_2.0.3    pillar_1.11.1      bslib_0.9.0        gtable_0.3.6      
## [41] Rcpp_1.1.0         glue_1.8.0         xfun_0.52          tidyselect_1.2.1  
## [45] rstudioapi_0.17.1  knitr_1.50         farver_2.1.2       htmltools_0.5.8.1 
## [49] labeling_0.4.3     rmarkdown_2.29     wk_0.9.4           compiler_4.5.2    
## [53] S7_0.2.0

The Jose M Sallan static website