Intermediate programming with R

Instructor’s Guide

library("dplyr")
library("ggplot2")
library("testit")

This page contains the solutions to the exercises. Since most of the exercises use counts-raw.txt.gz, it is loaded here at the beginning. Also, many use research, the subset of the data that only includes research articles.

counts_raw <- read.delim("data/counts-raw.txt.gz")
research <- filter(counts_raw, articleType == "Research Article")

01 Setting up a project

Create a README file

It is a convention to have a file named README in a project directory to explain what it contains (both for others and your future self). Use nano to create a README file. Include the date and explain that this directory was created for a Software Carpentry workshop.

We are looking for an answer something like the following:

nano README
cat README
This directory contains the files created during the Software Carpentry workshop
at X University on YYYY-MM-DD.

02 Inspecting a file

Q: What could be the reason for the discrepancy in the number of articles in our saved file?

A: The first command searches only in column 11. The second looks for matches in any of the columns, and thus returns more lines because these strings are also present in some other columns.

Largest number of Wikipedia cites

What is the largest number of Wikipedia cites that an article in this data set has received? Hint: The counts of Wikipedia cites are in column 28.

gunzip -c counts-raw.txt.gz | cut -f28 | sort -n | tail -n 1
17

Alternatively:

gunzip -c counts-raw.txt.gz | cut -f28 | sort -nr | head -n 1
17

Find articles in your field

Choose two PLOS subject tags to search for and save these articles to a new file. How many articles are there?

There are lots of possible subject tags to choose from:

gunzip -c counts-raw.txt.gz | cut -f11 | sort | uniq | wc -l
6717

As an example:

gunzip -c counts-raw.txt.gz | cut -f11 | grep "Cardiovascular Disorders" | grep "Nephrology" | wc -l
14

03 Using RStudio

Opening and closing RStudio projects

Using the same drop-down menu at the top right of RStudio, which you used to create the project, choose to “Close Project”. In the Console run getwd(). It should display your home folder. Furthermore, your home folder should be displayed in the Files pane. Now open the altmetrics project using the same menu and run getwd() again. The working directory should have changed to the altmetrics directory, and the Files pane should display its contents.

If the RStudio Project was set up correctly, this should be straightforward.
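
A quick check in the Console (the paths below are examples; yours will differ):

# After closing the project:
getwd()
[1] "/home/user"
# After opening the altmetrics project again:
getwd()
[1] "/home/user/altmetrics"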

04 Importing and inspecting data

Citations versus weeks since publication

Create a scatter plot where the x-axis is the number of weeks since publication and the y-axis is the log of the 2011 citations (use wosCountThru2011). Don’t forget to add a pseudocount of 1.

plot(counts_raw$daysSincePublished / 7,
     log(counts_raw$wosCountThru2011 + 1))

plot of chunk 04-01

05 Conditional statements

Filtering articles

How many articles with the subject tag (plosSubjectTags) “Evolutionary Biology” were published in either PLOS One (“pone”), PLOS Biology (“pbio”), or PLOS Medicine (“pmed”)?

dim(counts_raw[grepl("Evolutionary Biology", counts_raw$plosSubjectTags) &
                 counts_raw$journal %in% c("pone", "pbio", "pmed"), ])
[1] 2080   32
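
The first number reported by dim is the count of matching articles: 2080. Since dplyr is already loaded, the same subset can also be built with filter (a sketch; the name evolution_articles is just illustrative):

evolution_articles <- filter(counts_raw,
                             grepl("Evolutionary Biology", plosSubjectTags),
                             journal %in% c("pone", "pbio", "pmed"))
nrow(evolution_articles)
[1] 2080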

06 Loops

Using apply

Using apply and sd, calculate the standard deviation of each row of counts_sub.
Using apply and max, calculate the maximum of each column of counts_sub.

counts_sub <- counts_raw[, c("wosCountThru2011", "backtweetsCount",
                             "plosCommentCount")]
# MARGIN = 1 applies sd across each row
sum_stat_sd <- apply(counts_sub, 1, sd)
summary(sum_stat_sd)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.155   3.464   7.832   8.563 425.500 
# MARGIN = 2 applies max down each column
apply(counts_sub, 2, max)
wosCountThru2011  backtweetsCount plosCommentCount 
             737              363               49 

07 Functions

Write your own function

Write your own function to calculate the mean called my_mean. It should take one input argument, x, which is a numeric vector. Compare your results with the results from R’s function mean. Do you receive the same answer?

my_mean <- function(x) {
  result <- sum(x) / length(x)
  return(result)
}
my_mean(1:10)
[1] 5.5
mean(1:10)
[1] 5.5

11 R Markdown Output Options

Update analysis file

You need to share your initial results with your collaborators, but after you showed your report to your boss, they had a few suggestions.

  1. In your final report, your collaborators should see your histogram of authors per paper, but not the code that produced the plot.
  2. The figure is hard to see as is; resize it to 8x8 inches.
  3. Your collaborators are very interested in how popular articles are on Facebook. Add another histogram plotting the number of Facebook shares per article (facebookShareCount), ensuring there are respectable titles and axis labels. Also, just like the previous figure, make sure there is a legend and that the code to generate the figure does not appear in the final report.
  4. Additionally under the new figure, your collaborators should see a sentence that says “The average number of Facebook shares per paper in the data set is X”, where X is the mean number of Facebook shares per paper, as evaluated by inline code.

The answers:

  1. Use chunk option echo=FALSE.
  2. Use chunk options fig.width=8, fig.height=8.
  3. Use echo=FALSE again, and create the histogram with:
hist(counts_raw$facebookShareCount, xlab = "Number of shares on Facebook",
     ylab = "Number of articles", main = "Distribution of Facebook Shares")
  4. Use the inline code:
`r mean(counts_raw$facebookShareCount)`
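
Putting the pieces together, the chunk for the Facebook histogram might look like the following in the R Markdown source (a sketch; the chunk label facebook-hist is arbitrary):

```{r facebook-hist, echo=FALSE, fig.width=8, fig.height=8}
hist(counts_raw$facebookShareCount, xlab = "Number of shares on Facebook",
     ylab = "Number of articles", main = "Distribution of Facebook Shares")
```

The sentence below the figure would then use the inline code, e.g. “The average number of Facebook shares per paper in the data set is `r mean(counts_raw$facebookShareCount)`.”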

12 Subsetting with dplyr

How much did altmetrics numbers change by 2009?

How many articles were published in 2009 (year)? How many of these had at least one Tweet (backtweetsCount) or Facebook comment (facebookCommentCount)? How many were in at least one Mendeley library (mendeleyReadersCount)?

research_2009 <- filter(research, year == 2009)
nrow(research_2009)
[1] 6036
research_2009_fb_tweet <- filter(research, year == 2009,
                                 facebookCommentCount > 0 |
                                 backtweetsCount > 0)
nrow(research_2009_fb_tweet)
[1] 830
research_2009_mendeley <- filter(research, year == 2009,
                                 mendeleyReadersCount > 0)
nrow(research_2009_mendeley)
[1] 5078

What are people reading but not citing?

One potential use of altmetrics data is recognizing articles that are widely read among the scientific community but are not cited as highly as similarly influential papers. Compile a data set named low_cite that contains the journal, title, and year of each research article that meets the following criteria:

  • Published in 2008 or prior (year)
  • Has more than 1,000 pdf downloads (pdfDownloadsCount)
  • Is contained in more than 15 Mendeley libraries (mendeleyReadersCount)
  • Has fewer than 10 citations as of 2011 (wosCountThru2011)

How many articles did you find?

summary(research$pdfDownloadsCount)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0   152.0   261.0   432.5   481.0 35720.0 
summary(research$mendeleyReadersCount)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   4.128   5.000 171.000 
summary(research$wosCountThru2011)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00    7.00   14.54   16.00  737.00 
low_cite <- filter(research, pdfDownloadsCount > 1000,
                   mendeleyReadersCount > 15,
                   year < 2009, wosCountThru2011 < 10)
low_cite <- select(low_cite, journal, title, year)
nrow(low_cite)
[1] 25

13 Chaining commands with dplyr

Titles of most cited articles

Using a chain of pipes, output the titles of the three research articles with the largest 2011 citation count (wosCountThru2011).
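
One possible chain (a sketch; the titles themselves are omitted here):

research %>%
  arrange(desc(wosCountThru2011)) %>%
  slice(1:3) %>%
  select(title)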

Lots of authors

Using a chain of pipes, output the author count (authorsCount), title, journal, and subject tags (plosSubjectTags) of the three research articles with the largest number of authors.
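
A sketch along the same lines:

research %>%
  arrange(desc(authorsCount)) %>%
  slice(1:3) %>%
  select(authorsCount, title, journal, plosSubjectTags)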

14 Summarizing with dplyr

Summarizing the number of tweets per journal

Create a new data frame, tweets_per_journal, that for each journal contains the total number of articles, the mean number of tweets (backtweetsCount) received by articles in that journal, and the standard error of the mean (SEM) of the number of tweets. The SEM is the standard deviation divided by the square root of the sample size (i.e. the number of articles).

tweets_per_journal <- research %>%
  group_by(journal) %>%
  summarize(num = n(),
            mean = mean(backtweetsCount),
            sem = sd(backtweetsCount) / sqrt(num))
tweets_per_journal
Source: local data frame [7 x 4]

  journal   num       mean         sem
   <fctr> <int>      <dbl>       <dbl>
1    pbio  1325 0.05811321 0.020153395
2    pcbi  1351 0.12657291 0.052177184
3    pgen  1619 0.06547251 0.020408525
4    pmed   643 0.31104199 0.187868371
5    pntd   621 0.02576490 0.009057697
6    pone 14078 0.49303878 0.034484187
7    ppat  1459 0.02604524 0.008807428

15 Mapping data to plot aesthetics

Citations versus days since publication

Create a scatter plot with daysSincePublished mapped to the x-axis and wosCountThru2011 mapped to the y-axis. Include a loess fit of the data. Set the transparency level (alpha) of the points to 0.5 and color the points according to the journal where the article was published. Make the loess curve red.

p <- ggplot(research, aes(x = daysSincePublished,
                          y = wosCountThru2011)) +
  geom_point(aes(color = journal), alpha = 0.5) +
  geom_smooth(color = "red")
p

plot of chunk 15

16 Controlling the plot scales

Modifying the scales

Update the plot to use a square root transformation instead of log10. Also color the points using the ColorBrewer palette “Accent”.

p <- ggplot(research, aes(x = pdfDownloadsCount,
                          y = wosCountThru2011)) +
  geom_point(aes(color = journal)) +
  geom_smooth() +
  scale_x_sqrt() +
  scale_y_sqrt() +
  scale_color_brewer(palette = "Accent")
p

plot of chunk 16

17 Creating subplots with facets

Using facets

Add another variable to research called evolution, which is a logical vector indicating if the article has the PLOS subject tag “Evolutionary Biology”. Use facet_grid to create subplots based on the variables evolution and immuno.

# From earlier in the lesson
research <- mutate(research, immuno = grepl("Immunology", plosSubjectTags))
# Solution:
research <- mutate(research,
                   evolution = grepl("Evolutionary Biology",
                                     plosSubjectTags))
p <- ggplot(research, aes(x = log10(pdfDownloadsCount + 1),
                          y = log10(wosCountThru2011 + 1))) +
  geom_point(aes(color = journal)) +
  geom_smooth() +
  scale_x_continuous(breaks = c(1, 3), labels = c(10, 1000)) +
  scale_y_continuous(breaks = c(1, 3), labels = c(10, 1000)) +
  facet_grid(evolution~immuno)
p

plot of chunk 17

18 Creating different plots with geoms

Mean number of tweets per journal per year

Modify the dplyr code above to calculate the mean, SEM, and sample size of the number of article tweets per journal and per year. Use facet_wrap to make a separate subplot per year.

tweets_per_journal_year <- research %>%
  group_by(journal, year) %>%
  summarize(num = n(),
            mean = mean(backtweetsCount),
            sem = sd(backtweetsCount) / sqrt(num))
tweets_per_journal_year
Source: local data frame [42 x 5]
Groups: journal [?]

   journal  year   num        mean         sem
    <fctr> <int> <int>       <dbl>       <dbl>
1     pbio  2003    33 0.000000000 0.000000000
2     pbio  2004   174 0.000000000 0.000000000
3     pbio  2005   176 0.011363636 0.008012313
4     pbio  2006   184 0.010869565 0.007664915
5     pbio  2007   203 0.004926108 0.004926108
6     pbio  2008   197 0.030456853 0.018916893
7     pbio  2009   181 0.005524862 0.005524862
8     pbio  2010   177 0.367231638 0.146996329
9     pcbi  2005    55 0.000000000 0.000000000
10    pcbi  2006   136 0.000000000 0.000000000
..     ...   ...   ...         ...         ...
ggplot(tweets_per_journal_year, aes(x = journal, y = mean)) +
  geom_bar(stat = "identity") +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 0.1) +
  geom_text(aes(label = num), hjust = 0, vjust = 0) +
  facet_wrap(~year)

plot of chunk 18-01

Visualizing a single distribution

The geoms geom_histogram and geom_density can be used to create histograms and density plots, respectively. Using these geoms, visualize the distribution of 2011 citations (wosCountThru2011). Compare the raw distribution to log10 and square root transformations.

p <- ggplot(research, aes(x = wosCountThru2011))
p + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk 18-02

p + geom_histogram() + scale_x_log10() 
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1654 rows containing non-finite values (stat_bin).

plot of chunk 18-02

p + geom_histogram() + scale_x_sqrt()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk 18-02

p + geom_density()

plot of chunk 18-02

p + geom_density() + scale_x_log10() 
Warning: Removed 1654 rows containing non-finite values (stat_density).

plot of chunk 18-02

p + geom_density() + scale_x_sqrt()

plot of chunk 18-02

20 Debugging with debug

Limit to a subset of levels

What if we were only interested in the mean of the number of tweets in the journals PLOS Biology (pbio) and PLOS One (pone)? We could subset to only pass values for these journals to the function mean_metric_per_var.

mean_metric_per_var(counts_raw$backtweetsCount[counts_raw$journal %in% c("pbio", "pone")],
                    counts_raw$journal[counts_raw$journal %in% c("pbio", "pone")])
     pbio      pcbi      pgen      pmed      pntd      pone      ppat 
0.0555770       NaN       NaN       NaN       NaN 0.4942194       NaN 

Unfortunately this still gives us results for the other journals, and their result is NaN, a special value indicating “Not a Number”. This happens because subsetting a factor does not drop its unused levels, so the function still loops over all seven journals and computes the mean of an empty vector for the five that are absent.

Use debug to isolate and diagnose the problem.

As an added challenge, can you fix the bug?
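
One way to isolate the problem is to flag the function with debug and step through a call (a sketch; the browser interaction is abbreviated to comments):

debug(mean_metric_per_var)
mean_metric_per_var(counts_raw$backtweetsCount[counts_raw$journal %in% c("pbio", "pone")],
                    counts_raw$journal[counts_raw$journal %in% c("pbio", "pone")])
# Step through with n; printing levels(variable) inside the browser shows
# that all seven journal levels survive the subsetting, which explains the NaNs.
undebug(mean_metric_per_var)

A fix is to drop the unused factor levels inside the function: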

# Use droplevels
mean_metric_per_var <- function(metric, variable) {
  if (!is.factor(variable)) {
    variable <- as.factor(variable)
  }
  variable <- droplevels(variable)
  result <- numeric(length = length(levels(variable)))
  names(result) <- levels(variable)
  for (v in levels(variable)) {
    result[v] <- mean(metric[variable == v])
  }
  return(result)
}
mean_metric_per_var(counts_raw$backtweetsCount[counts_raw$journal %in% c("pbio", "pone")],
                    counts_raw$journal[counts_raw$journal %in% c("pbio", "pone")])
     pbio      pone 
0.0555770 0.4942194 

23 Defensive programming with stopifnot

Practice defensive programming

Use defensive programming techniques to make the function calc_sum_stat more robust.

calc_sum_stat <- function(df, cols) {
  df_sub <- df[, cols, drop = FALSE]
  sum_stat <- apply(df_sub, 1, mean)
  return(sum_stat)
}

Some specific ideas:

  • Assert that the input df is not empty (hint: use dim)
  • Assert that cols is a character vector
  • Assert that the columns listed in cols are in df (hint: use %in% and colnames)
  • Assert that df_sub is a data frame (hint: use is.data.frame)
  • Assert that sum_stat is not NA
  • Issue a warning if cols only contains one column (since taking the mean of one column isn’t very useful, the user may have made a mistake)

After you add your assertion statements, test out the following inputs to calc_sum_stat. Do your assertion statements catch these errors? Should we update the function based on some of these results?

# Empty data frame
sum_stat <- calc_sum_stat(data.frame(), c("wosCountThru2010", "f1000Factor"))
# Non-character cols
sum_stat <- calc_sum_stat(counts_raw, 1:3)
# Bad column names
sum_stat <- calc_sum_stat(counts_raw, c("a", "b"))
# Issue warning since only one column
sum_stat <- calc_sum_stat(counts_raw, "mendeleyReadersCount")
# NA output
sum_stat <- calc_sum_stat(counts_raw, c("wosCountThru2010", "facebookLikeCount"))
# Because of the `NA` result, use `na.rm = TRUE` with `mean`
calc_sum_stat <- function(df, cols) {
  stopifnot(dim(df) > 0,
            is.character(cols),
            cols %in% colnames(df))
  if (length(cols) == 1) {
    warning("Only one column specified. Calculating the mean will not change anything.")
  }
  df_sub <- df[, cols, drop = FALSE]
  stopifnot(is.data.frame(df_sub))
  sum_stat <- apply(df_sub, 1, mean, na.rm = TRUE)
  stopifnot(!is.na(sum_stat))
  return(sum_stat)
}
# Proper
sum_stat <- calc_sum_stat(counts_raw, c("wosCountThru2010", "f1000Factor"))

24 Testing with testit

Write some tests

Write unit tests for the function my_mean that you wrote in an earlier lesson. It should look something like this:

my_mean <- function(x) {
  result <- sum(x) / length(x)
  return(result)
}

The input x is a numeric vector, and the output is the mean of the vector of numbers. Some ideas to get started:

  • Pass a vector where you know what the mean is, and assert that the result is correct.
  • Add some assertion statements to check the input x. Use has_error to test that the function throws an error when given bad input.
  • Issue a warning if the user passes a vector of length one. Test that the warning is properly issued using has_warning.
  • Include an NA in a vector whose mean you know to see what happens. Do you need to modify the code to pass the test?
my_mean <- function(x) {
  assert("x is numeric", is.numeric(x))
  if (length(x) == 1) {
    warning("Input vector had only one element")
  }
  result <- sum(x, na.rm = TRUE) / length(x[!is.na(x)])
  return(result)
}

assert("Mean is calculated correctly",
       my_mean(1:3) == 2, my_mean(c(2, 4, 6)) == 4)
assert("Non-numeric input throws error",
       has_error(my_mean("hello")))
assertion failed: x is numeric
assert("Input vector with one element issues warning",
       has_warning(my_mean(5)))
assert("Mean is calculated correctly when given NAs",
       my_mean(c(2, 4, 6, NA)) == 4)