Control Flow in R

Often when we're coding we want to control the flow of our actions. Control flow is simply the order in which we code and have our statements evaluated. That can be done by setting things to happen only if a condition or a set of conditions are met. Alternatively, we can also set an action to be computed for a particular number of times. There are many ways these can be achieved in R. For conditional statements, the most commonly used approaches are the constructs:

if (and else)
while

Say, for example, that we want R to print a message depending on how a number x relates to 5:

x <- 5
if (x >= 5) {
    print(paste0(x, " is greater or equal than 5"))
} else {
    print(paste0(x, "is smaller than 5"))
}
## [1] "5 is greater or equal than 5"

We could also have a situation where we want an action to be executed while a condition is being matched:

x <- 0
vec <- vector()
while (x < 10) {
    vec <- c(vec, x^2)
    x <- x + 1
}
vec
##  [1]  0  1  4  9 16 25 36 49 64 81

Now, and this is very important. When R evaluates the condition inside if and while statements, it is looking for a logical element, i.e., TRUE or FALSE. For example:

if (4 == 3) {
    "4 equals 3"
}

The message was not printed because the condition 4 == 3 is FALSE

4 == 3
## [1] FALSE

R also offers a built-in function that is more intuitive to evaluate conditions.

mat <- matrix(0:19, 5, 4)
mat <- ifelse(mat < 5, 0, mat)

# OR

mat <- matrix(sample(c(NA, 1:10), 50, replace = TRUE), 5, 10)
mat <- ifelse(is.na(mat), 0, 1)

In the first example, ifelse looks for all the conditions there are TRUE in the statement mat < 5 and replaces them by the first action (transform in 0), otherwise does the altervative action (keep originals). In this case, I told R to keep the original mat values where the condition did not occur. Similarly, in the second example, I told R to replace all existing numbers with 1 and replace missing values with 0. VERY IMPORTANT: ifelse can be very handy, but for heavy and repetitive tasks, it is considerably slower than regular if and else statements.

As above, another way of controlling the flow of actions is to set them to occur for a specific number of times. We achieve that using the command for

salad <- c("lettuce", "carrots", "tomatos")
amounts <- c("1 bunch", "1 package", "1 bag")
for (vegetable in seq_along(salad)) {
    print(paste0("Our salad has ", amounts[vegetable], " of ", salad[vegetable]))
}
## [1] "Our salad has 1 bunch of lettuce"
## [1] "Our salad has 1 package of carrots"
## [1] "Our salad has 1 bag of tomatos"

Notice that at each loop, the index vegetable takes a value following the sequence of the vector salad. During the first loop, vegetable, in this particular case, is going to be 1, then 2, and finally 3.

for (index in 1:3) {
    print(index)
}
## [1] 1
## [1] 2
## [1] 3

The index can take any name you want (preferably names that do not overlap with existing objects or functions). Conventionally, when we deal with matrices or data.frames, we use the index i to denote rows and the index j to denote columns. Example:

mat <- matrix(1:4, 2, 2)
mat
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
for (i in 1:nrow(mat)) {
    for (j in 1:ncol(mat)) {
        print(mat[i, j])
    }
}
## [1] 1
## [1] 3
## [1] 2
## [1] 4

At this point you may be wondering if we can mix the different approaches and the answer is yes. For a certain for loop, you may want to compute your action only if a condition is matched.

set.seed(10)
mat <- matrix(rpois(16, lambda = 11), 4, 4)
mat
##      [,1] [,2] [,3] [,4]
## [1,]   11   12   14   13
## [2,]   10    6   13   11
## [3,]    6   10   10    7
## [4,]    8    8   14   13
for (i in 1:nrow(mat)) {
    for (j in 1:ncol(mat)) {
        if (mat[i, j] >= 10) {
            next
        } else {
            print(paste0("element at row ", i, ", column ", j, ", is smaller than 5"))
        }
    }
}
## [1] "element at row 2, column 2, is smaller than 5"
## [1] "element at row 3, column 1, is smaller than 5"
## [1] "element at row 3, column 4, is smaller than 5"
## [1] "element at row 4, column 1, is smaller than 5"
## [1] "element at row 4, column 2, is smaller than 5"

Do not worry that much about the the functions set.seed and rpois. All you need to know is that set.seed will allow us to generate the same random numbers in different computers for comparison matters and rpois generates positive integers randomly following a Poisson distribution. Notice that I used next to skip the loop in case the condition was TRUE. You can also use the command break if you want to stop the loop at once.

Exercise

Step 1: create a matrix called mat.ex with 3 rows and 10 columns using set.seed(20); rpois(30, lambda=12) as values;

Step 2: create an empty numeric vector called means with length = ncol(mat.ex)

Step 3: loop through the columns of mat.ex and calculate the respective mean values of each column;

Step 4: if the mean calculated value is greater than 11, store it in means;

Step 5: print vector means without zeros

solution:

set.seed(20)
mat.ex <- matrix(rpois(30, lambda = 12), 3, 10)
means <- vector(mode = "numeric", length = ncol(mat.ex))
for (j in 1:ncol(mat.ex)) {
    if (mean(mat.ex[, j]) > 11) 
        means[j] <- mean(mat.ex[, j])
}
print(means[means != 0])
## [1] 14.67 11.33 14.33 12.67 12.00 12.67 11.67 17.00

Speeding up using the family apply

One of the common features across most R beginners is to think everything as a loop. While this approach is valid and in many cases the fastest solution to think of, sometimes it can be very, very slow. Try the following:

set.seed(10)
x <- matrix(rpois(500, lambda = 25), 20, 25)
a <- vector(mode = "numeric", length = length(x))
for (index in 1:ncol(x)) {
    a[index] <- mean(x[, index])
}

That took < 1 second to run. Although it was effectively fast, the code was relatively big (5 lines) for its small task. Fortunately, R has a family of loop functions called the apply family, which are slightly different but generally as fast as or faster loops than for loops. For the moment, you do not need to fully understand the detailed differences between a regular for loop and it's equivalent using the apply family. Let's replicate the example above using the mother function apply:

set.seed(10)
x <- matrix(rpois(500, lambda = 25), 20, 25)
a <- apply(x, 2, mean)

a takes the same values as in the example before, but with much less code (3 lines) which is also simpler. In this case, apply evaluates the object x and then apply the function mean to its 2nd dimension, i.e., columns - dimension=1 would imply rows. apply assumes no margin for dimension for it's evaluations, so you always have to provide it. There are even more convinient and faster ways to calculate sum or mean of dimensions in a matrix or data.frame.

set.seed(10)
x <- matrix(rpois(500, lambda = 25), 20, 25)
a <- colMeans(x)  #or colSums or rowMeans or rowSums

We can also easily create tables using the function tapply from this same family.

set.seed(10)
dat <- data.frame(genes = rep(paste0("gene_", 1:3), each = 5), species = rep(paste0("species_", 
    1:5), 3), gc_content = runif(15), stringsAsFactors = FALSE)
dat <- dat[-nrow(dat), ]
dat
##     genes   species gc_content
## 1  gene_1 species_1    0.50748
## 2  gene_1 species_2    0.30677
## 3  gene_1 species_3    0.42691
## 4  gene_1 species_4    0.69310
## 5  gene_1 species_5    0.08514
## 6  gene_2 species_1    0.22544
## 7  gene_2 species_2    0.27453
## 8  gene_2 species_3    0.27231
## 9  gene_2 species_4    0.61583
## 10 gene_2 species_5    0.42967
## 11 gene_3 species_1    0.65166
## 12 gene_3 species_2    0.56774
## 13 gene_3 species_3    0.11351
## 14 gene_3 species_4    0.59593

Suppose we want to know what the mean gc content per gene is:

tapply(dat$gc_content, dat$genes, mean)
## gene_1 gene_2 gene_3 
## 0.4039 0.3636 0.4822

Sometimes, you may also have a very big and messy dataset and you want to obtain a table that makes things clearer to see. Say we want, for example, the gc_content of species (rows) X genes (columns).

tapply(dat$gc_content, list(dat$species, dat$genes), sum)
##            gene_1 gene_2 gene_3
## species_1 0.50748 0.2254 0.6517
## species_2 0.30677 0.2745 0.5677
## species_3 0.42691 0.2723 0.1135
## species_4 0.69310 0.6158 0.5959
## species_5 0.08514 0.4297     NA

It is very important to notice that the last value is a NA, simply because the species 5 did not have gene 3, and therefore no gc content. fortunately, we already know how to get rid of NAs in tables if we want to.

tab <- tapply(dat$gc_content, list(dat$species, dat$genes), sum)
tab <- ifelse(is.na(tab), 0, tab)  #try also tab[is.na(tab)]  <-  0

Finally, you can customize your own functions and tomorrow John is going to teach you how to embed your customized functions inside apply calls.

Dealing with lists

R offers a very powerful class of objects called lists. A list can virtually contain anything. You can have many data.frames inside a list, or you can mix a vector, a data.frame, any other type of array, other lists, and so on. Example:

set.seed(10)
our_list <- list(runif(10), matrix(1:10, 2, 5), data.frame(letters = letters, 
    order = 1:26))
class(our_list)
## [1] "list"
str(our_list)
## List of 3
##  $ : num [1:10] 0.5075 0.3068 0.4269 0.6931 0.0851 ...
##  $ : int [1:2, 1:5] 1 2 3 4 5 6 7 8 9 10
##  $ :'data.frame':    26 obs. of  2 variables:
##   ..$ letters: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ order  : int [1:26] 1 2 3 4 5 6 7 8 9 10 ...
dim(our_list)  #lists do not have dimensions as arrays, thay are measured by their number of elements
## NULL
length(our_list)
## [1] 3
names(our_list)  #create names for our list
## NULL
names(our_list) <- c("random_numbers", "a_matrix", "a_dataframe")
our_list
## $random_numbers
##  [1] 0.50748 0.30677 0.42691 0.69310 0.08514 0.22544 0.27453 0.27231
##  [9] 0.61583 0.42967
## 
## $a_matrix
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
## 
## $a_dataframe
##    letters order
## 1        a     1
## 2        b     2
## 3        c     3
## 4        d     4
## 5        e     5
## 6        f     6
## 7        g     7
## 8        h     8
## 9        i     9
## 10       j    10
## 11       k    11
## 12       l    12
## 13       m    13
## 14       n    14
## 15       o    15
## 16       p    16
## 17       q    17
## 18       r    18
## 19       s    19
## 20       t    20
## 21       u    21
## 22       v    22
## 23       w    23
## 24       x    24
## 25       y    25
## 26       z    26

If we want to extract an element from a list, we use double squared brackets [[]].

our_list[[1]]
##  [1] 0.50748 0.30677 0.42691 0.69310 0.08514 0.22544 0.27453 0.27231
##  [9] 0.61583 0.42967
our_list[["a_matrix"]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
class(our_list[[1]])
## [1] "numeric"
dim(our_list[[2]])
## [1] 2 5

Now that we understand the basics of a list, we can use use the function lapply from the apply family, which is a very powerful tool to loop over vectors and lists. Exmaple:

list_gcs <- list(gene_a = runif(3), gene_b = runif(3), gene_c = runif(3))
means <- lapply(list_gcs, mean)
class(means)
## [1] "list"
names(means)
## [1] "gene_a" "gene_b" "gene_c"
means[[1]]  #or means[['gene_a']], for example
## [1] 0.4443
means
## $gene_a
## [1] 0.4443
## 
## $gene_b
## [1] 0.4609
## 
## $gene_c
## [1] 0.2383

Using our_list, we can also use lapply to quickly explore the classes of our different objects within it.

lapply(our_list, class)
## $random_numbers
## [1] "numeric"
## 
## $a_matrix
## [1] "matrix"
## 
## $a_dataframe
## [1] "data.frame"

Notice that lapply returns a list as its default behavior. R offers a second function that simplifies the output whenever possible. It's called sapply, which stands for simplify lapply. If your input is a mtrix, the output will also be coerced into a matrix, if your input is a vector, R will try to coerce the output into a vector as well.

sap <- sapply(our_list, class)
sap
## random_numbers       a_matrix    a_dataframe 
##      "numeric"       "matrix"   "data.frame"
class(sap)
## [1] "character"