Intermediate programming with R

Functions

Learning Objectives

  • Functions have two parts: arguments and body
  • Functions have their own environment
  • Convert code into functions to repeat operations

In the last lesson we wrote loops to performs some calculations. But what if we wanted to perform similar calculations on different columns? We would have to copy-paste the loops and change all the variable names. This strategy would be both tedious, error prone, and difficult to update if we want to make a change. To avoid these problems, we will review how to write our own functions.

The parts of a function

We’ve already been using built-in R functions: read.delim, mean, apply, etc. These functions allow us to run the same routine with different inputs.

Let’s explore read.delim further. All functions in R have two parts: the input arguments and the body. We can see the arguments of a function with the args.

args(read.delim)
function (file, header = TRUE, sep = "\t", quote = "\"", dec = ".", 
    fill = TRUE, comment.char = "", ...) 
NULL

So when we pass a character vector like "data/counts-raw.txt.gz", this gets assigned to the argument file. All the other arguments have defaults set, so we do not need to assign them a value.

After the arguments have been assigned values, then the body of the function is executed. We can view the body of a function with body.

body(read.delim)
read.table(file = file, header = header, sep = sep, quote = quote, 
    dec = dec, fill = fill, comment.char = comment.char, ...)

read.delim is very short. It just calls another function, read.table, using the input file and the default arguments as the arguments passed to read.table.

When we define our own functions, we use the syntax below. We list the arguments, separated by commas, within the parentheses. The body follows, contained within curly brackets {}.

function_name <- function(args) {
  body
}

The principle of encapsulation

An important feature of functions is the principle of encapsulation: the environment inside the function is distinct from the environment outside the function. In other words, variables defined inside a function are separate from variables defined outside the function.

Here’s an small example to demonstrate this idea. The function ex_fun takes two input arguments, x and y. It calculates z and returns its value.

ex_fun <- function(x, y) {
  z <- x - y
  return(z)
}

When we run ex_fun, the only thing returned to the global environment is the value that was assigned to z. The variable z itself was only defined in the function environment, and does not exist in the global environment.

ex_fun(3, 10)
[1] -7
z
Error in eval(expr, envir, enclos): object 'z' not found

Examples

In the last lesson we wrote the following for loop to calculate the mean number of citations for each journal. Let’s generalize this code to a function so that we can perform a similar calculation for any of the metrics across any of the categorical variables.

result <- numeric(length = length(levels(counts_raw$journal)))
names(result) <- levels(counts_raw$journal)
for (j in levels(counts_raw$journal)) {
  result[j] <- mean(counts_raw$wosCountThru2011[counts_raw$journal == j])
}
result
     pbio      pcbi      pgen      pmed      pntd      pone      ppat 
28.705905 14.219258 22.928208 18.148110  7.348564  8.306972 20.892613 

We’ll name the function mean_metric_per_var, and it will take two input arguments: metric and variable. The outline of our function looks like this.

mean_metric_per_var <- function(metric, variable) {
  # body goes here
}

Now we can copy paste our loop code into the body of the function. We indent the code by two spaces as a convention to aid readability, it doesn’t actually affect the ability of the code to run (to indent in RStudio you can highlight all the lines and press Ctrl-I).

mean_metric_per_var <- function(metric, variable) {
  result <- numeric(length = length(levels(counts_raw$journal)))
  names(result) <- levels(counts_raw$journal)
  for (j in levels(counts_raw$journal)) {
    result[j] <- mean(counts_raw$wosCountThru2011[counts_raw$journal == j])
  }
  result
}

Now we need to replace the specific data we used, the journal and the 2011 citations, with the names of the function arguments. We’ll also add the return.

mean_metric_per_var <- function(metric, variable) {
  result <- numeric(length = length(levels(variable)))
  names(result) <- levels(variable)
  for (j in levels(variable)) {
    result[j] <- mean(metric[variable == j])
  }
  return(result)
}

Lastly, instead of naming the looping variable j for “journal”, let’s change it to v for “variable”

mean_metric_per_var <- function(metric, variable) {
  result <- numeric(length = length(levels(variable)))
  names(result) <- levels(variable)
  for (v in levels(variable)) {
    result[v] <- mean(metric[variable == v])
  }
  return(result)
}

Now we can run the same analysis we did before:

mean_metric_per_var(counts_raw$wosCountThru2011, counts_raw$journal)
     pbio      pcbi      pgen      pmed      pntd      pone      ppat 
28.705905 14.219258 22.928208 18.148110  7.348564  8.306972 20.892613 

Or a new analysis, like the mean number of tweets grouped by the type of article.

mean_metric_per_var(counts_raw$backtweetsCount, counts_raw$articleType)
                            Best Practice 
                               0.00000000 
         Book Review/Science in the Media 
                               0.00000000 
                              Case Report 
                               0.00000000 
                           Clinical Trial 
                               0.00000000 
                           Community Page 
                               0.10714286 
                               Correction 
                               0.00000000 
                           Correspondence 
                               0.00000000 
  Correspondence and Other Communications 
                               0.00000000 
                                Editorial 
                               0.82010582 
                                Education 
                               0.76470588 
                                    Essay 
                               0.52173913 
                        Expert Commentary 
                               0.00000000 
                                  Feature 
                               0.00000000 
           From Innovation to Application 
                               0.00000000 
                  Guidelines and Guidance 
                               0.00000000 
                         Health in Action 
                               0.07246377 
Historical and Philosophical Perspectives 
                               0.00000000 
     Historical Profiles and Perspectives 
                               0.00000000 
                                Interview 
                               0.03846154 
                             Journal Club 
                               0.25000000 
                           Learning Forum 
                               0.00000000 
                        Message from ISCB 
                               0.07142857 
                        Message from PLoS 
                               0.00000000 
                Message from the Founders 
                               0.00000000 
           Message from the PLoS Founders 
                               0.00000000 
                       Neglected Diseases 
                               0.00000000 
                                 Obituary 
                               0.00000000 
                                  Opinion 
                               0.32258065 
                                 Overview 
                               1.00000000 
                                   Pearls 
                               0.17391304 
                              Perspective 
                               0.10619469 
                               Photo Quiz 
                               0.00000000 
                             Policy Forum 
                               0.52380952 
                          Policy Platform 
                               0.00000000 
                                   Primer 
                               0.05755396 
                                     Quiz 
                               0.00000000 
                         Research Article 
                               0.35784035 
                  Research in Translation 
                               0.04081633 
                                   Review 
                               0.18354430 
                           Special Report 
                               0.00000000 
                            Student Forum 
                               0.00000000 
                                Symposium 
                               0.00000000 
                                 Synopsis 
                               0.02502980 
                         Technical Report 
                               0.00000000 
                               The Debate 
                               0.06666667 
                 The PLoS Medicine Debate 
                               0.33333333 
                         Unsolved Mystery 
                               0.05000000 
                               Viewpoints 
                               0.03225806 

The other loop we wrote used apply to calculate the mean of multiple metrics for each article, i.e. row, of the data frame.

counts_sub <- counts_raw[, c("wosCountThru2011", "backtweetsCount", "plosCommentCount")]
sum_stat <- apply(counts_sub, 1, mean)
summary(sum_stat)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0000   0.6667   2.0000   4.7060   5.0000 245.7000 

Let’s generalize this to a function where we can choose which columns to include in the mean summary statistic. We’ll call it calc_sum_stat, and it will take two input arguments: the data frame and a vector of the columns to select. Here’s the outline of the function.

calc_sum_stat <- function(df, cols) {
}

Now we copy-paste our previous code into the body of the function and indent.

calc_sum_stat <- function(df, cols) {
  counts_sub <- counts_raw[, c("wosCountThru2011", "backtweetsCount", "plosCommentCount")]
  sum_stat <- apply(counts_sub, 1, mean)
  summary(sum_stat)
}

Also, replace the specific variable names with the argument names and add return.

calc_sum_stat <- function(df, cols) {
  df_sub <- df[, cols]
  sum_stat <- apply(df_sub, 1, mean)
  return(sum_stat)
}

Now we can perform the same analysis as before:

sum_stat_1 <- calc_sum_stat(counts_raw, c("wosCountThru2011", "backtweetsCount", "plosCommentCount"))
summary(sum_stat_1)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.0000   0.6667   2.0000   4.7060   5.0000 245.7000 

Or choose different metrics to summarize:

sum_stat_2 <- calc_sum_stat(counts_raw, c("wosCountThru2010", "f1000Factor"))
summary(sum_stat_2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   1.000   4.116   4.000 315.500 

As we have seen, writing functions allows us to repeat operations without having to copy-paste code. In later lessons, we will learn how to debug functions when they are not working as expected.

Challenges

Write your own function

Write your own function to calculate the mean called my_mean. It should take one input argument, x, which is a numeric vector. Compare your results with the results from R’s function mean. Do you receive the same answer?