Last updated: 2015-07-07

Code version: dd7c5e7a33a14f1e73896fbe0b7bcea2e9219c42

Ding et al. 2015 developed GRM to perform ERCC-based normalization using a gamma regression model. I downloaded GRM-0.2.1.tgz on 2015-06-29.

cd ~/src
wget http://wanglab.ucsd.edu/star/GRM/files/GRM-0.2.1.tgz
tar xzf GRM-0.2.1.tgz

Unfortunately it is not an R package, but a standalone R script. Furthermore, it loads unnecessary libraries (needed for making the plots in their paper) and loads the main function via source using a relative path. Luckily it appears it will be easier to just use that function directly since the script loads the data as csv files in a specific format.

The main function is gammareg in include/GRM_lib.r. The only external library it requires is MASS (for the function gamma.dispersion).

I looked at the tutorial files. The ERCC data it uses are the relative concentrations from the ERCC documentation (it uses Mix 2, whereas our data uses Mix 1).

Setup

library("edgeR")
source("functions.R")

Input and prepare single cell data.

molecules <- read.table("../data/molecules.txt", header = TRUE,
                        stringsAsFactors = FALSE)
anno <- read.table("../data/annotation.txt", header = TRUE,
                   stringsAsFactors = FALSE)
molecules_single <- molecules[, grep("bulk", colnames(molecules), invert = TRUE)]
quality_single_cells <- scan("../data/quality-single-cells.txt",
                             what = "character")
# Remove low quality cells
molecules_single <- molecules_single[, colnames(molecules_single) %in%
                                       quality_single_cells]
anno_single <- anno[anno$sample_id %in% quality_single_cells, ]
# Remove unuexpressed genes
expressed_single <- rowSums(molecules_single) > 0
molecules_single <- molecules_single[expressed_single, ]
# Remove "overexpressed" genes
# This removes some ERCC b/c of 19098 batch 2
overexpressed_genes <- apply(molecules_single, 1, function(x) any(x >= 1024))
molecules_single <- molecules_single[!overexpressed_genes, ]
# Correct for collision probability
molecules_single <- -1024 * log(1 - molecules_single / 1024)
# Calculate cpm
molecules_single_cpm <- cpm(molecules_single)

The main function is gammareg.

source("GRM.R")

It has four required arguments.

formals(gammareg)
$ercc_response


$ercc


$gene


$filename


$SE
[1] FALSE

Below is documentation the authors provided interspersed with my code to prepare the data.

# ercc_response file: the fpkm reads for ERCC across all the samples
#             format: each row is each ERCC, each column is each sample
#                     especially sorted by the ERCC names
ercc_response <- molecules_single_cpm[grep("ERCC", rownames(molecules_single_cpm)), ]
stopifnot(nrow(ercc_response) <= 92)

77 ERCC spike-ins have greater than 0 molecules but less than 1,024.

# ercc file: the add-in standard ERCC concentration, gotten from the experiments directly
#    format: each row is each ERCC, each column is each sample
#            the first column is ERCC name, the second column is ERCC standard concentration.
#            especially sorted by the ERCC names
ercc <- read.table("../data/ercc-info.txt", header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE)
colnames(ercc) <- c("num", "id", "subgroup", "conc_mix1", "conc_mix2",
                    "expected_fc", "log2_mix1_mix2")
head(ercc)
  num         id subgroup conc_mix1  conc_mix2 expected_fc log2_mix1_mix2
1   1 ERCC-00130        A 30000.000 7500.00000           4              2
2   2 ERCC-00004        A  7500.000 1875.00000           4              2
3   3 ERCC-00136        A  1875.000  468.75000           4              2
4   4 ERCC-00108        A   937.500  234.37500           4              2
5   5 ERCC-00116        A   468.750  117.18750           4              2
6   6 ERCC-00092        A   234.375   58.59375           4              2
ercc <- ercc[, c("id", "conc_mix1")]
stopifnot(nrow(ercc) == 92)
# Sort by ERCC id
ercc <- ercc[order(ercc$id), ]
# Remove those the had been filtered
ercc <- ercc[ercc$id %in% rownames(ercc_response), ]
stopifnot(nrow(ercc) == nrow(ercc_response))
# gene file: the fpkm reads for genes across all the sampels
#    format: each row is each gene, each column is each sample
#            especially the column names should be in the same order with ercc_response file
gene <- molecules_single_cpm[grep("ERCC", rownames(molecules_single_cpm),
                                  invert = TRUE), ]
# filename: the vector for all the samples' names
#    format: it should be a 1*n vector in character variable type
filename <- colnames(molecules_single_cpm)

Gamma regression model

This takes a long time to run.

grm_rds <- "../data/grm.rds"
if (file.exists(grm_rds)) {
  results <- readRDS(grm_rds)
} else {
  results <- gammareg(ercc_response = ercc_response, ercc = ercc, gene = gene,
                      filename = filename)
  saveRDS(results, grm_rds)
}

PCA

invariant_genes <- apply(results$predictset, 1, var) == 0

0 genes are invariant.

pca_grm <- run_pca(results$predictset[!invariant_genes, ])
p <- plot_pca(pca_grm$PCs, explained = pca_grm$explained,
              metadata = anno_single, color = "individual",
              shape = "batch", factors = c("individual", "batch"))
p

p + facet_grid(batch~individual)

Session information

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] testit_0.4    ggplot2_1.0.1 MASS_7.3-40   edgeR_3.10.2  limma_3.24.9 
[6] knitr_1.10.5 

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6      magrittr_1.5     munsell_0.4.2    colorspace_1.2-6
 [5] stringr_1.0.0    plyr_1.8.2       tools_3.2.0      grid_3.2.0      
 [9] gtable_0.1.2     htmltools_0.2.6  yaml_2.1.13      digest_0.6.8    
[13] reshape2_1.4.1   formatR_1.2      evaluate_0.7     rmarkdown_0.6.1 
[17] labeling_0.3     stringi_0.4-1    scales_0.2.4     proto_0.3-10