A common piece of advice for R users is to avoid `for` loops. The reason is that the basic data types of R are **vectors** (even a single value is a vector of length one), and many operations work on whole vectors at once. Let’s compare the performance of a vectorized operation with that of a `for` loop.
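As a minimal illustration of what "everything is a vector" means in practice, the sketch below uses only base R. The variable names are made up for the example.

```r
x <- 5
length(x)        # 1: a single number is a vector of length one
is.vector(x)     # TRUE

v <- c(1, 2, 3)
doubled <- v * 2   # arithmetic is applied element-wise, no loop needed
is_big  <- v > 1   # comparisons are element-wise too
```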

## Filtering a vector

We need to find which positions of a numeric vector `vec` contain a number `num`. We may be tempted to do something like this:

```
f1 <- function(vec, num){
  sol <- integer()            # positions found so far; grows inside the loop
  k <- 1                      # next free slot in sol
  for(i in seq_along(vec)){   # seq_along() is also safe for an empty vector
    if(vec[i] == num){
      sol[k] <- i
      k <- k+1
    }
  }
  return(sol)
}
```

But as R is a vectorized language, we can obtain the same result with:

`f2 <- function(vec, num) return(which(vec == num))`
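To see why the one-liner works, it may help to split it into its two steps on a tiny vector. This is only an illustrative sketch; `mask` and `pos` are names chosen for the example.

```r
vec  <- c(4, 7, 4, 1)
mask <- vec == 4     # element-wise comparison: TRUE FALSE TRUE FALSE
pos  <- which(mask)  # integer positions of the TRUE values: 1 3
```

The comparison produces a logical vector of the same length as `vec`, and `which` turns it into the positions that are `TRUE`, so no explicit loop is written in R code.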

Let’s build a large vector, run both functions on it and check that they return the same result:

```
set.seed(1313)
large_vector <- sample(0:9, 10000, replace = TRUE)
s1 <- f1(large_vector, 4)
s2 <- f2(large_vector, 4)
identical(s1, s2)
```

`## [1] TRUE`

Let’s compare the speed of both functions with the `rbenchmark` package:

```
library(rbenchmark)
benchmark(f1(large_vector, 4), f2(large_vector, 4), columns=c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'), replications = 100, order='elapsed')
```

```
## test replications elapsed relative user.self sys.self
## 2 f2(large_vector, 4) 100 0.006 1.0 0.004 0.001
## 1 f1(large_vector, 4) 100 0.075 12.5 0.066 0.009
```

We see that the function with a `for` loop is about 12.5 times slower than the vectorized function in this run. This is empirical evidence in favor of using vectorized functions when operating on vectors.
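Part of the cost of a loop like `f1` may come from growing the result vector one element at a time. The sketch below is a hypothetical variant, `f1_prealloc`, that is not part of the original experiment: it fills a preallocated logical mask inside the loop, which lets you time the loop separately from the growing. It uses base `system.time` rather than `rbenchmark`, and no timings are shown because they depend on the machine and R version.

```r
# Hypothetical variant of f1: fill a preallocated logical mask inside the
# loop instead of growing the result one element at a time.
f1_prealloc <- function(vec, num) {
  hit <- logical(length(vec))     # allocated once, all FALSE
  for (i in seq_along(vec)) {
    hit[i] <- vec[i] == num
  }
  which(hit)
}

set.seed(1313)
large_vector <- sample(0:9, 10000, replace = TRUE)

# Same positions as the vectorized version
same_result <- identical(f1_prealloc(large_vector, 4), which(large_vector == 4))

# Rough timings over 100 repetitions; results vary between machines
system.time(for (r in 1:100) f1_prealloc(large_vector, 4))
system.time(for (r in 1:100) which(large_vector == 4))
```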

## Sum columns of a matrix

Let’s now sum the columns of a very large matrix. We have three ways of doing that: with a `for` loop, with an `apply` loop or with a specific function.

### Three ways to do the job

First, we can define a function that sums each column with a `for` loop and stores each value in a component of a vector:

```
f3 <- function(m){
  n <- dim(m)[2]                            # number of columns
  s <- numeric(n)                           # preallocated result vector
  for(i in seq_len(n)) s[i] <- sum(m[,i])   # sum one column at a time
  return(s)
}
```

A second possibility is to run a set of vectorized operations, applying `sum` to each column with an `apply` loop over columns:

`f4 <- function(m) return(apply(m, 2, sum))`

Finally, we can use a built-in function called `colSums` that performs that task.
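A small example may make the behavior of `colSums` concrete. The matrix `m` below is made up for illustration; `rowSums` is the corresponding base R function for rows.

```r
m <- matrix(1:6, nrow = 2)   # filled by column: (1, 2), (3, 4), (5, 6)
col_totals <- colSums(m)     # 3 7 11
row_totals <- rowSums(m)     # 9 12
```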

### Testing performance

Let’s define a very large matrix `M`, and apply the three functions to it to obtain the sum of columns:

```
M <- matrix(sample(1:100, 1000000, replace = TRUE), 1000, 1000)
s3 <- f3(M)
s4 <- f4(M)
s5 <- colSums(M)
```

The `s4` vector is integer, while the other two are numeric (double), so we convert it before comparing. Let’s check that the three vectors hold the same values:

```
s4 <- as.numeric(s4)
identical(s3, s4)
```

`## [1] TRUE`

`identical(s4, s5)`

`## [1] TRUE`

`identical(s3, s5)`

`## [1] TRUE`
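The type difference matters because `identical` compares types as well as values. The sketch below reproduces the situation on a tiny integer matrix (the names are illustrative) and shows `all.equal` as a base R alternative that compares integer and double values without an explicit conversion.

```r
m <- matrix(1:4, nrow = 2)             # integer matrix
by_apply   <- apply(m, 2, sum)         # sum() of integers stays integer
by_colsums <- colSums(m)               # colSums() returns double

typeof(by_apply)                       # "integer"
typeof(by_colsums)                     # "double"
identical(by_apply, by_colsums)        # FALSE: same values, different types
isTRUE(all.equal(by_apply, by_colsums))  # TRUE: values agree
```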

Let’s check the speed of each function:

```
library(rbenchmark)
benchmark(f3(M), f4(M), colSums(M), columns=c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'), order='elapsed')
```

```
## test replications elapsed relative user.self sys.self
## 3 colSums(M) 100 0.091 1.000 0.090 0.001
## 1 f3(M) 100 0.676 7.429 0.613 0.061
## 2 f4(M) 100 1.155 12.692 1.085 0.065
```

The best performance is achieved by the built-in function. In this run, the `for` loop implementation is also faster than `apply` (0.676 versus 1.155 seconds elapsed).

## Compute the means of a list of vectors

Finally, we can consider the job of computing the means of a very large list of vectors. Let’s define a very large list of vectors:

`vectors <- lapply(1:1000, function(x) sample(1:100, 10000, replace = TRUE))`

We can compute the means with a function that loops over the list and calls the vectorized `mean` function on each element:

```
looping_mean <- function(list){
  n <- length(list)
  means <- numeric(n)                               # preallocated result vector
  for(i in seq_len(n)) means[i] <- mean(list[[i]])  # mean of the i-th vector
  return(means)
}
```

An alternative to the `for` loop is the `sapply` function:

`sapply(vectors, mean)`

Both functions return the same result:

`identical(looping_mean(vectors), sapply(vectors, mean))`

`## [1] TRUE`

Alternatively, if we want to store the results in a list instead of a vector, we can use `lapply`:

`lapply(vectors, mean)`
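The difference between the two return shapes is easier to see on a short list. The sketch below also includes `vapply`, a base R function that is not used in the benchmark here; it behaves like `sapply` but requires you to declare the type and length of each result. The list `small` is made up for the example.

```r
small <- list(c(1, 2, 3), c(10, 20))

as_vector <- sapply(small, mean)              # numeric vector: 2 15
as_list   <- lapply(small, mean)              # list of two numbers
checked   <- vapply(small, mean, numeric(1))  # errors if a result is not a single number
```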

Let’s examine the performance of each implementation:

```
benchmark(looping_mean(vectors),
          sapply(vectors, mean),
          lapply(vectors, mean),
          columns=c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'),
          order='elapsed',
          replications = 100)
```

```
## test replications elapsed relative user.self sys.self
## 3 lapply(vectors, mean) 100 1.284 1.000 1.276 0.005
## 2 sapply(vectors, mean) 100 1.313 1.023 1.301 0.007
## 1 looping_mean(vectors) 100 1.325 1.032 1.315 0.005
```

In this case, the three implementations perform similarly: the elapsed times differ by about 3%.

## When to avoid for loops in R

This small experiment shows when we need to avoid `for` loops in R: when a vectorized operation is available. As R is a vectorized language, **we must avoid looping across the components of a vector**, as when filtering a vector. This means that we should avoid explicit loops in operations such as computing summary statistics of a vector or subsetting rows of a data frame. In other contexts, the iterating functions of the `apply` family perform similarly to `for` loops. The relative merits of each function seem to depend on the type of iteration.
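As a closing sketch of the two operations mentioned above, the example below subsets rows of a data frame and computes a summary statistic without writing a loop. The data frame `df` and its columns are invented for illustration.

```r
df <- data.frame(id = 1:5, value = c(3, 8, 1, 9, 4))

# Vectorized subsetting: build a logical mask over the whole column, subset once
big <- df[df$value > 3.5, ]
big$id                        # 2 4 5

# Vectorized summary statistic over the whole column
avg <- mean(df$value)         # 5
```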