Intermediate programming with R
Loops
Learning Objectives
- Use a
for
loop to repeat operations - Avoid writing slow loops
- Use
apply
as an alternative tofor
loops
Using looping structures allows us to repeat operations, e.g. do something to every row of a data frame. In this lesson we will focus specifically on for
loops. Here is the basic structure of a for
loop:
for (variable in vector) {
do something
}
The loop is repeated for every element of the vector. Using the names above, in each iteration variable
takes the value of one of the elements of vector
. Here is a simple example:
for (i in 1:10) {
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
Loops have a bad reputation in R, and a common myth is that “loops in R are slow.” However, it would be more accurate to say that “poorly written loops in R are slow.”
We’ll start with a trivial example of a poorly written for loop to review the basics before we move on to more interesting examples. In a previous lesson we added a pseudocount of 1 to every citation count before taking the log. Here’s one way we could do that using a for
loop.
x <- numeric()
for (i in 1:length(counts_raw$wosCountThru2011)) {
x <- c(x, counts_raw$wosCountThru2011[i] + 1)
}
The function length
returns the total number of elements in the vector. Each time through the loop, the variable i
increases by 1. Furthermore, i
is used to index the vector of citation counts. Thus each time through the loop, the next element of vector gets 1 added to it, and this new value is appended to x
.
This is incredibly slow. The main culprit is because each time through the loop, the new vector x
grows in size. It is more memory-efficient to pre-allocate the new vector to a given size before starting the loop. To create a numeric vector of a given size, we use the function numeric
.
x <- numeric(length = length(counts_raw$wosCountThru2011))
for (i in 1:length(counts_raw$wosCountThru2011)) {
x[i] <- counts_raw$wosCountThru2011[i] + 1
}
This was much faster because the new vector x
did not grow in each iteration. Instead we used i
to index both the new vector and the old vector.
Lastly, you can avoid slow loops by using vectorized operations whenever possible. This is what we had done in the previous lesson. Because R is vectorized by default, we do not need to use a for
loop when doing something simple like adding a number to a vector. Many times the vectorized version is optimized to be faster than any loop you could write yourself.
x <- counts_raw$wosCountThru2011 + 1
So in summary, avoid slow loops by:
- Using vectorized operations when possible
- Pre-allocating the size of the new object before the loop begins
As a more useful example, let´s calculate the mean number of citations for articles in the different PLOS journals. There are seven journals in our data set, and they are stored as a factor in the column journal
.
levels(counts_raw$journal)
[1] "pbio" "pcbi" "pgen" "pmed" "pntd" "pone" "ppat"
Thus our result will need to be pre-allocated to a size of 7.
result <- numeric(length = length(levels(counts_raw$journal)))
In the first example, we used i
as the looping variable, and it took on the values from 1 to the length of the vector. Since we want to loop through the seven journals, this time our looping variable will take on these values. In order to index result
using this indexing variable, we need to name each element of result
.
names(result) <- levels(counts_raw$journal)
result
pbio pcbi pgen pmed pntd pone ppat
0 0 0 0 0 0 0
result["pone"]
pone
0
Now we construct the for
loop.
for (j in levels(counts_raw$journal)) {
print(j)
}
[1] "pbio"
[1] "pcbi"
[1] "pgen"
[1] "pmed"
[1] "pntd"
[1] "pone"
[1] "ppat"
Lastly, we need to calculate the mean number of citations, using a conditional statement to keep only citation counts for articles in a specific journal.
for (j in levels(counts_raw$journal)) {
result[j] <- mean(counts_raw$wosCountThru2011[counts_raw$journal == j])
}
result
pbio pcbi pgen pmed pntd pone ppat
28.705905 14.219258 22.928208 18.148110 7.348564 8.306972 20.892613
Alternative options
Because performing analyses on each level of a factor is such a common practice, R has built-in functions to do this like tapply
and aggregate
. Also, packages like dplyr, which we will see in future lessons, also provide this functionality.
Using apply
statements
R provides other methods for repeating operations. One useful function is apply
, which performs the same operation across all the rows or columns of a data frame.
Let’s create a new summary statistic for the articles that is the average of their citation count in 2011 (wosCountThru2011
), the number of tweets (backtweetsCount
), and the number of PLOS comments (plosCommentCount
). We subset the data frame by listing the 3 columns we want:
counts_sub <- counts_raw[, c("wosCountThru2011", "backtweetsCount",
"plosCommentCount")]
counts_sub[1:5, ]
wosCountThru2011 backtweetsCount plosCommentCount
1 33 0 0
2 181 0 0
3 0 0 0
4 0 0 0
5 371 0 0
apply
takes 3 arguments. The first is the name of the data frame, the second is the number 1 to specify rows or 2 to specify columns, and the third is the function to be applied. We’ll name the new summary statistic sum_stat
.
sum_stat <- apply(counts_sub, 1, mean)
summary(sum_stat)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.6667 2.0000 4.7060 5.0000 245.7000
Thus with just one line of code we were able to compute the mean across every row of the data frame.
Challenges
Using apply
Using apply
and sd
, calculate the standard deviation of each row of counts_sub
.
Using apply
and max
, calculate the maximum of each column of counts_sub
.