Last updated: 2016-02-27

Code version: e913fc0a3f954acf436955c670b4cdf1d052c075

library("limma")
library("edgeR")
library("ggplot2")
theme_set(theme_bw(base_size = 12))
source("functions.R")
library("Humanzee")

This file uses a mixed effects model to remove technical batch effects. The modeling function is adapted from the package limma and is implemented in the package Humanzee.

Creates the following file:

molecules-final.txt - Molecules in high quality single cells after removing unwanted variation with mixed model

Input

Input filtered annotation.

anno_filter <- read.table("../data/annotation-filter.txt", header = TRUE,
                   stringsAsFactors = FALSE)
head(anno_filter)
  individual replicate well      batch      sample_id
1    NA19098        r1  A01 NA19098.r1 NA19098.r1.A01
2    NA19098        r1  A02 NA19098.r1 NA19098.r1.A02
3    NA19098        r1  A04 NA19098.r1 NA19098.r1.A04
4    NA19098        r1  A05 NA19098.r1 NA19098.r1.A05
5    NA19098        r1  A06 NA19098.r1 NA19098.r1.A06
6    NA19098        r1  A07 NA19098.r1 NA19098.r1.A07

Input Poisson GLM transformed molecule counts per million.

molecules_cpm_trans <- read.table("../data/molecules-cpm-trans.txt", header = TRUE,
                               stringsAsFactors = FALSE)
stopifnot(ncol(molecules_cpm_trans) == nrow(anno_filter),
          colnames(molecules_cpm_trans) == anno_filter$sample_id)

Mixed model for batch-effect correction

Because the Poisson transformation with the ERCC controls was not sufficient to remove all the unwanted technical variation, we used a mixed model to correct for batch effects.

We adapted limma’s algorithm for estimating variance components due to random effects. This analysis operates under the assumption that biological replicates (or batches within an individual in this case) share similar correlation across genes. Morever, the analysis permits negative correlation between replicates.

Crossed Model

For every single gene, we will fit a mixed model assuming differences between batches are not individual-specific as follows

\[ y_{ijk} = \mu + \alpha_i + b_j + \epsilon_{ijk} \],

where \(y_{ijk}\) is the log2 counts-per-million (cpm) for any gene in individual \(i\), batch \(j\), and cell \(k\), \(\mu\) is the gene-specific expression level across all cells, \(\alpha_i\) is the expression level specific to individual \(i\), \(b_j\) is batch \(j\)‘s deviation of expression level from gene-specific expression levels, and \(\epsilon_{ijk}\) is the models’ residual error.

We assume that \(b_j\) follows a normal distribution with \(b_j \sim N(0, \sigma^2_b)\) for \(j = 1, \dots, 9\), and \(\epsilon_{ijk} \sim N(0, \sigma^2_\epsilon)\) for \(i = 1, 2, 3; j = 1, \dots, 9; and k = 1, \dots, n_{ij}\), where \(n_ij\) denotes the number of cells in individual \(i\), batch \(j\).

Remove unwanted variation

Create design matrix and compute a consensus correlation coefficient using limma’s duplicateCorrelation function.

block <- anno_filter$batch
design <- model.matrix(~ 1 + individual, data = anno_filter)
dup_corrs <- duplicateCorrelation(molecules_cpm_trans, design = design,
                                  block = block)

Fit a mixed model with the 9 batches being the random effect.

gls_fit <- Humanzee::ruv_mixed_model(molecules_cpm_trans,
                                     ndups = 1,
                                     design = design,
                                     block = block,
                                     correlation = dup_corrs$cons)

Compute expression levels after removing variation due to random effects.

molecules_final <- t( design %*% t(gls_fit$coef) ) + gls_fit$resid

Output the cleaned data.

colnames(molecules_final) <- colnames(molecules_cpm_trans)
write.table(round(molecules_final, digits = 6), "../data/molecules-final.txt", quote = FALSE,
            sep = "\t", col.names = NA)

PCA

pca_final <- run_pca(molecules_final)
pca_final_plot <- plot_pca(pca_final$PCs, explained = pca_final$explained,
         metadata = anno_filter, color = "individual",
         shape = "replicate") +
  labs(title = "Batch corrected")
pca_final_plot

Session information

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] testit_0.4     Humanzee_0.1.0 ggplot2_1.0.1  edgeR_3.10.2  
[5] limma_3.24.9   knitr_1.10.5  

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.0      magrittr_1.5     MASS_7.3-40      munsell_0.4.2   
 [5] colorspace_1.2-6 stringr_1.0.0    httr_0.6.1       plyr_1.8.3      
 [9] tools_3.2.0      grid_3.2.0       gtable_0.1.2     htmltools_0.2.6 
[13] yaml_2.1.13      digest_0.6.8     reshape2_1.4.1   formatR_1.2     
[17] codetools_0.2-11 bitops_1.0-6     RCurl_1.95-4.6   evaluate_0.7    
[21] rmarkdown_0.6.1  labeling_0.3     stringi_0.4-1    scales_0.2.4    
[25] proto_0.3-10