Our previous lessons have shown us how to manipulate data, define our own functions, and repeat things. However, the programs we have written so far always do the same things, regardless of what data they're given. We want programs to make choices based on the values they are manipulating.
To do that, we will write conditional statements with if and else branches and combine tests using & ("and") and | ("or").
So far, we have built a function analyze to plot summary statistics of the inflammation data:
analyze <- function(filename) {
  # Plots the average, min, and max inflammation over time.
  # Input is character string of a csv file.
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
}
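As a reminder of what the second argument to apply does, here is a small sketch on a made-up matrix (the name toy and its values are illustrative and not part of the inflammation data). A margin of 2 applies the function to each column, which in our data means each day:

```r
# Hypothetical 2 x 3 matrix: rows stand in for patients, columns for days.
toy <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

# MARGIN = 2 applies the function to each column in turn.
col_means <- apply(toy, 2, mean)
col_max <- apply(toy, 2, max)

print(col_means)  # 1.5 3.5 5.5
print(col_max)    # 2 4 6
```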
We also built the function analyze_all to automate the processing of each data file:
analyze_all <- function(pattern) {
  # Runs the function analyze for each file in the current working directory
  # that contains the given pattern.
  filenames <- list.files(pattern = pattern)
  for (f in filenames) {
    analyze(f)
  }
}
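To see which files a given pattern would pick up before looping over them, it can help to call list.files on its own. The sketch below uses a throwaway directory and made-up file names (demo_dir and the three files are assumptions for illustration only):

```r
# Hypothetical scratch directory with two csv files and one unrelated file.
demo_dir <- file.path(tempdir(), "list-files-demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- c("inflammation-01.csv", "inflammation-02.csv", "notes.txt")
invisible(file.create(file.path(demo_dir, demo_files)))

# pattern is treated as a regular expression and matched against file names.
matched <- list.files(path = demo_dir, pattern = "csv")
print(matched)  # "inflammation-01.csv" "inflammation-02.csv"
```

Because the pattern is a regular expression, "csv" matches anywhere in a file name, so a stricter pattern such as "\\.csv$" may be preferable when other files share the directory.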
While these are useful in an interactive R session, what if we want to send our results to our collaborators?
Since we currently have 12 data sets, running analyze_all
creates 36 plots.
Saving each of these individually would be tedious and error-prone.
And in the likely situation that we want to change how the data is processed or the look of the plots, we would have to once again save all 36 before sharing the updated results with our collaborators.
Here's how we can save all three plots of the first inflammation data set in a pdf file:
pdf("inflammation-01.pdf")
analyze("inflammation-01.csv")
dev.off()
The function pdf redirects all the plots generated by R into a pdf file, which in this case we have named "inflammation-01.pdf".
After we are done generating the plots to be saved in the pdf file, we stop R from redirecting plots with the function dev.off.
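The same pattern works without the inflammation data. The following sketch writes two throwaway plots to a file in the temporary directory (the path out_file is an assumption chosen for illustration) and then checks that the file was created:

```r
# Hypothetical output path inside the session's temporary directory.
out_file <- file.path(tempdir(), "pdf-demo.pdf")

pdf(out_file)          # start redirecting plots to the file
plot(1:10)             # drawn on the first page
plot(10:1)             # drawn on the second page
invisible(dev.off())   # stop redirecting and close the file

print(file.exists(out_file))  # TRUE
```

Until dev.off is called the file is still open, so it should be closed before the pdf is opened in a viewer or shared.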
We could update the analyze function so that it always saves the plots in a pdf.
But that would make it more difficult to interactively test out new changes.
It would be ideal if analyze would either save or not save the plots based on its input.
In order to update our function to decide between saving or not, we need to write code that automatically decides between multiple options. The tool R gives us for doing this is called a conditional statement, and looks like this:
num <- 37
if (num > 100) {
  print("greater")
} else {
  print("not greater")
}
print("done")
[1] "not greater"
[1] "done"
The second line of this code uses the keyword if to tell R that we want to make a choice.
If the test that follows it is true, the body of the if (i.e., the lines indented underneath it) is executed.
If the test is false, the body of the else is executed instead.
Only one or the other is ever executed.
In the example above, the test num > 100 returns the value FALSE, which is why the code inside the if block was skipped and the code inside the else block was run instead.
num > 100
[1] FALSE
And as you likely guessed, the opposite of FALSE is TRUE.
num < 100
[1] TRUE
Conditional statements don't have to include an else.
If there isn't one, R simply does nothing if the test is false:
num <- 53
if (num > 100) {
  print("num is greater than 100")
}
We can also chain several tests together when there are more than two options. This makes it simple to write a function that returns the sign of a number:
sign <- function(num) {
  if (num > 0) {
    return(1)
  } else if (num == 0) {
    return(0)
  } else {
    return(-1)
  }
}
sign(-3)
[1] -1
sign(0)
[1] 0
sign(2/3)
[1] 1
Note that the test for equality uses two equal signs, ==.
Tip: Other tests include greater than or equal to (>=), less than or equal to (<=), and not equal to (!=).
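A quick way to get a feel for these tests is to evaluate them directly. The value of num below is arbitrary and the variable names are ours, chosen only for this sketch:

```r
num <- 37

at_least <- num >= 37   # greater than or equal to
at_most  <- num <= 36   # less than or equal to
differs  <- num != 37   # not equal to
same     <- num == 37   # equal to

print(at_least)  # TRUE
print(at_most)   # FALSE
print(differs)   # FALSE
print(same)      # TRUE
```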
We can also combine tests.
An ampersand, &, symbolizes "and".
A vertical bar, |, symbolizes "or".
& is only true if both parts are true:
if (1 > 0 & -1 > 0) {
  print("both parts are true")
} else {
  print("at least one part is not true")
}
[1] "at least one part is not true"
while | is true if either part is true:
if (1 > 0 | -1 > 0) {
  print("at least one part is true")
} else {
  print("neither part is true")
}
[1] "at least one part is true"
In this case, "either" means "either or both", not "either one or the other but not both".
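If "one or the other but not both" is what we actually need, base R provides the function xor for that. The sketch below contrasts it with |; the variable names are made up for illustration:

```r
both_true <- (1 > 0) | (2 > 0)          # both parts are true
only_one  <- (1 > 0) | (-1 > 0)         # only the first part is true

exclusive_both <- xor(1 > 0, 2 > 0)     # both true, so "exclusive or" fails
exclusive_one  <- xor(1 > 0, -1 > 0)    # exactly one true, so it holds

print(both_true)       # TRUE
print(only_one)        # TRUE
print(exclusive_both)  # FALSE
print(exclusive_one)   # TRUE
```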
Write a function, plot_dist, that plots a boxplot if the length of the vector is greater than a specified threshold and a stripchart otherwise.
To do this you'll use the R functions boxplot and stripchart.
dat <- read.csv("inflammation-01.csv", header = FALSE)
plot_dist(dat[, 10], threshold = 10) # day (column) 10
plot_dist(dat[1:5, 10], threshold = 10) # samples (rows) 1-5 on day (column) 10
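One possible approach to this exercise, offered as a sketch rather than the only answer. Returning the name of the chosen plot invisibly is our own addition so that the decision can be checked; the exercise itself does not ask for it. The example vectors are made up so that the code runs without the data file:

```r
plot_dist <- function(x, threshold) {
  # Draws a boxplot for long vectors and a stripchart for short ones.
  # Returns (invisibly) which kind of plot was drawn; this return value
  # is an illustrative extra, not part of the exercise.
  if (length(x) > threshold) {
    boxplot(x)
    kind <- "boxplot"
  } else {
    stripchart(x)
    kind <- "stripchart"
  }
  invisible(kind)
}

long_kind  <- plot_dist(c(0, 1, 3, 2, 5, 4, 6, 8, 7, 9, 10, 12), threshold = 10)
short_kind <- plot_dist(c(0, 1, 3, 2, 5), threshold = 10)
print(long_kind)   # "boxplot"
print(short_kind)  # "stripchart"
```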
Now that we know how to have R make decisions based on input values, let's update analyze:
analyze <- function(filename, output = NULL) {
  # Plots the average, min, and max inflammation over time.
  # Input:
  #    filename: character string of a csv file
  #    output: character string of pdf file for saving
  if (!is.null(output)) {
    pdf(output)
  }
  dat <- read.csv(file = filename, header = FALSE)
  avg_day_inflammation <- apply(dat, 2, mean)
  plot(avg_day_inflammation)
  max_day_inflammation <- apply(dat, 2, max)
  plot(max_day_inflammation)
  min_day_inflammation <- apply(dat, 2, min)
  plot(min_day_inflammation)
  if (!is.null(output)) {
    dev.off()
  }
}
We added an argument, output, that by default is set to NULL.
An if statement at the beginning checks the argument output to decide whether or not to open a pdf file, and a matching if statement at the end closes that file once the plots have been drawn.
Let's break it down.
The function is.null returns TRUE if a variable is NULL and FALSE otherwise.
The exclamation mark, !, stands for "not".
Therefore the line in each if block is only executed if output is "not null".
output <- NULL
is.null(output)
[1] TRUE
!is.null(output)
[1] FALSE
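Note that only NULL itself counts as null. The sketch below wraps the test in a small helper (should_save is a hypothetical name used only here) and tries a few values that might look "empty" but are not NULL:

```r
# Hypothetical helper mirroring the test used inside analyze.
should_save <- function(output = NULL) {
  !is.null(output)
}

print(should_save())                       # FALSE: default is NULL
print(should_save("inflammation-01.pdf"))  # TRUE
print(should_save(NA))                     # TRUE: NA is not NULL
print(should_save(""))                     # TRUE: an empty string is not NULL
```

So leaving the argument out, or passing NULL explicitly, is the way to skip the pdf branch.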
Now we can use analyze both interactively:
analyze("inflammation-01.csv")
and to save plots:
analyze("inflammation-01.csv", output = "inflammation-01.pdf")
This now works well when we want to process one data file at a time, but how can we specify the output file in analyze_all?
We need to substitute the filename ending "csv" with "pdf", which we can do using the function sub:
f <- "inflammation-01.csv"
sub("csv", "pdf", f)
[1] "inflammation-01.pdf"
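One thing to keep in mind is that sub replaces the first match of its pattern, wherever it occurs in the string. For our file names that is the extension, but for a name that contains "csv" earlier on it would not be. A hedged sketch with a made-up file name, showing a stricter pattern that is anchored to the end of the string:

```r
# Hypothetical file name that contains "csv" twice.
tricky <- "csv-results.csv"

first_match <- sub("csv", "pdf", tricky)        # replaces the first "csv"
extension   <- sub("\\.csv$", ".pdf", tricky)   # replaces only a trailing ".csv"

print(first_match)  # "pdf-results.csv"
print(extension)    # "csv-results.pdf"
```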
Now let's update analyze_all:
analyze_all <- function(pattern) {
  # Runs the function analyze for each file in the current working directory
  # that contains the given pattern.
  filenames <- list.files(pattern = pattern)
  for (f in filenames) {
    pdf_name <- sub("csv", "pdf", f)
    analyze(f, output = pdf_name)
  }
}
Now we can save all of the results with just one line of code:
analyze_all("csv")
Now if we need to make any changes to our analysis, we can edit the analyze function and quickly regenerate all the figures with analyze_all.
Look up the arguments of plot by reading the documentation (?plot), update analyze, and then recreate all the figures with analyze_all.
Save a plot in a pdf file using pdf("name.pdf") and stop writing to the pdf file with dev.off().
Use if (condition) to start a conditional statement, else if (condition) to provide additional tests, and else to provide a default.
Surround the bodies of the branches of conditional statements with curly braces { }.
Use == to test for equality.
X & Y is only true if both X and Y are true.
X | Y is true if either X or Y, or both, are true.
We have now seen the basics of interactively building R code. The last thing we need to learn is how to build command-line programs that we can use in pipelines and shell scripts, so that we can integrate our tools with other people's work. This will be the subject of our next and final lesson.