On the performance of for loops in R

Jose M Sallan 2021-03-05 4 min read

A common piece of advice to R users is to avoid for loops. The reason is that the basic data types of R are vectors (even a single value is a vector of length one), so many operations can be applied to a whole vector at once. Let’s compare the performance of a vectorized operation with that of a for loop.

Filtering a vector

We need to find which positions of a numeric vector vec contain a number num. We may be tempted to do something like this:

f1 <- function(vec, num){
  n <- length(vec)
  k <- 1
  sol <- integer()  # grows by one element for each match found
  for(i in seq_len(n)){  # seq_len() also handles an empty vec safely
    if(vec[i] == num){
      sol[k] <- i
      k <- k+1
    }
  }
  return(sol)
}

But as R is a vectorized language, we can obtain the same result with:

f2 <- function(vec, num) return(which(vec == num))
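To see why the one-liner works, it can help to split it into its two steps. The following is a minimal sketch on a toy vector; the values and the names matches and positions are assumptions chosen only for illustration:

```r
# Toy input (hypothetical values, not taken from the benchmark)
vec <- c(4, 7, 4, 1, 9, 4)

# Step 1: the comparison is applied to every element at once
# and returns a logical vector of the same length as vec
matches <- vec == 4
print(matches)
## [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE

# Step 2: which() returns the positions of the TRUE values
positions <- which(matches)
print(positions)
## [1] 1 3 6
```

No explicit loop appears in the R code: the iteration over the elements happens inside the comparison operator and which.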

Let’s build a large vector, test both functions on it and check that they return the same result:

set.seed(1313)
large_vector <- sample(0:9, 10000, replace = TRUE)
s1 <- f1(large_vector, 4)
s2 <- f2(large_vector, 4)
identical(s1, s2)
## [1] TRUE

Let’s examine the speed of both functions with the rbenchmark package:

library(rbenchmark)
benchmark(f1(large_vector, 4), f2(large_vector, 4),
          columns=c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'),
          replications = 100,
          order='elapsed')
##                  test replications elapsed relative user.self sys.self
## 2 f2(large_vector, 4)          100   0.006      1.0     0.004    0.001
## 1 f1(large_vector, 4)          100   0.075     12.5     0.066    0.009

We see that the function with a for loop is about 12.5 times slower than the vectorized function in this run. This is empirical evidence in favour of using vectorized functions when operating on vectors.

Sum columns of a matrix

Let’s now sum the columns of a very large matrix. We have three ways of doing that: with a for loop, with an apply loop or with a specific function.

Three ways to do the job

First, we can define a function that sums each column with a for loop and stores each result in a component of a vector:

f3 <- function(m){
  n <- dim(m)[2]
  s <- numeric(n)
  for(i in 1:n) s[i] <- sum(m[,i])
  return(s)
}

A second possibility is to replace the explicit loop with an apply loop over the columns:

f4 <- function(m) return(apply(m, 2, sum))

Finally, we can use a built-in function called colSums that performs that task.
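Before moving to a large matrix, the three approaches can be compared on a matrix small enough to check by hand. This is only a sketch; m_small and the three result names are hypothetical and do not appear in the rest of the post:

```r
# A 2 x 3 integer matrix with columns (1, 2), (3, 4) and (5, 6)
m_small <- matrix(1:6, nrow = 2, ncol = 3)

# 1. Explicit for loop over the column indices
loop_sums <- numeric(ncol(m_small))
for (i in seq_len(ncol(m_small))) loop_sums[i] <- sum(m_small[, i])

# 2. apply() over the second margin (columns)
apply_sums <- apply(m_small, 2, sum)

# 3. Built-in colSums()
builtin_sums <- colSums(m_small)

print(builtin_sums)
## [1]  3  7 11
```

All three hold the values 3, 7 and 11, although they do not all have the same storage type.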

Testing performance

Let’s define a very large matrix M, and apply the three functions to it to obtain the sum of columns:

M <- matrix(sample(1:100, 1000000, replace = TRUE), 1000, 1000)
s3 <- f3(M)
s4 <- f4(M)
s5 <- colSums(M)

The s4 vector is of type integer, while the other two are of type double. As identical also compares types, we convert s4 before checking that the three vectors hold the same values:

s4 <- as.numeric(s4)
identical(s3, s4)
## [1] TRUE
identical(s4, s5)
## [1] TRUE
identical(s3, s5)
## [1] TRUE
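The type difference can be reproduced on a small case. The sketch below assumes a hypothetical 2 x 3 integer matrix and uses all.equal as one possible alternative to converting by hand, since it compares numeric values up to a tolerance and does not distinguish integer from double:

```r
# Hypothetical small integer matrix, for illustration only
m_int <- matrix(1:6, nrow = 2, ncol = 3)

x_int <- apply(m_int, 2, sum)  # sum() of integers stays integer
x_dbl <- colSums(m_int)        # colSums() returns doubles

print(typeof(x_int))
## [1] "integer"
print(typeof(x_dbl))
## [1] "double"

# identical() is strict about types, so the comparison fails
print(identical(x_int, x_dbl))
## [1] FALSE

# all.equal() compares the numeric values only
print(isTRUE(all.equal(x_int, x_dbl)))
## [1] TRUE
```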

Let’s check the speed of each function:

library(rbenchmark)
benchmark(f3(M), f4(M), colSums(M), columns=c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'), order='elapsed')
##         test replications elapsed relative user.self sys.self
## 3 colSums(M)          100   0.091    1.000     0.090    0.001
## 1      f3(M)          100   0.676    7.429     0.613    0.061
## 2      f4(M)          100   1.155   12.692     1.085    0.065

The best performance is achieved by the built-in function. In this benchmark the for loop implementation runs faster than apply (about 7.4 versus 12.7 times the elapsed time of colSums).

Compute the means of a list of vectors

Finally, we can consider the job of computing the means of a very large list of vectors. Let’s define a very large list of vectors:

vectors <- lapply(1:1000, function(x) sample(1:100, 10000, replace = TRUE))

We can compute the means with a function that loops over the list and calls the vectorized mean function on each element:

looping_mean <- function(lst){  # lst: a list of numeric vectors
  n <- length(lst)
  means <- numeric(n)  # preallocate the result vector
  for(i in seq_len(n)) means[i] <- mean(lst[[i]])
  return(means)
}

An alternative to the for loop is the sapply function:

sapply(vectors, mean)

Both functions return the same result:

identical(looping_mean(vectors), sapply(vectors, mean))
## [1] TRUE

Alternatively, if we want to store the results in a list instead of a vector, we can use lapply:

lapply(vectors, mean)
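The difference between the return values is easier to see on a short list. The following sketch uses a hypothetical small_list; it also shows vapply, another member of the apply family in base R that is not benchmarked in this post and that requires the type and length of each result to be declared:

```r
# Hypothetical short list of numeric vectors
small_list <- list(c(1, 2, 3), c(10, 20), c(5, 5, 5, 5))

means_vec  <- sapply(small_list, mean)  # simplified to a numeric vector
means_list <- lapply(small_list, mean)  # always a list

# vapply() fails with an error if any result is not a single number
means_checked <- vapply(small_list, mean, numeric(1))

print(means_vec)
## [1]  2 15  5
print(is.list(means_list))
## [1] TRUE
print(identical(means_vec, means_checked))
## [1] TRUE
```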

Let’s examine the performance of each implementation:

benchmark(looping_mean(vectors),
          sapply(vectors, mean),
          lapply(vectors, mean),
          columns=c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'), 
          order='elapsed',
          replications = 100)
##                    test replications elapsed relative user.self sys.self
## 3 lapply(vectors, mean)          100   1.284    1.000     1.276    0.005
## 2 sapply(vectors, mean)          100   1.313    1.023     1.301    0.007
## 1 looping_mean(vectors)          100   1.325    1.032     1.315    0.005

In this case, the three implementations perform similarly.

When to avoid for loops in R

This small experiment shows us when we need to avoid for loops in R: when the operation can be vectorized. As R is a vectorized language, we should avoid looping across the components of a vector, as when filtering a vector. This means that we should avoid explicit loops in operations such as computing summary statistics of a vector or subsetting rows of a data frame. In other contexts, the iterating functions of the apply family perform similarly to for loops. The relative merits of each function seem to depend on the type of iteration.
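As an illustration of the data frame case, which was not benchmarked above, here is a minimal sketch of subsetting rows with and without an explicit loop. The data frame df, its columns and the threshold are assumptions made up for this example:

```r
# Hypothetical data frame, for illustration only
df <- data.frame(id = 1:6, value = c(3, 8, 1, 9, 4, 7))

# Loop version: test each row in turn
keep <- logical(nrow(df))
for (i in seq_len(nrow(df))) keep[i] <- df$value[i] > 5
loop_subset <- df[keep, ]

# Vectorized version: one comparison over the whole column
vec_subset <- df[df$value > 5, ]

print(identical(loop_subset, vec_subset))
## [1] TRUE
print(vec_subset$id)
## [1] 2 4 6
```

Both versions select the same rows; the vectorized one is shorter and leaves the iteration to R's internal code.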