PCA of highly variable genes

Last updated: 2018-01-29

Code version: f6b7f76

Setup

library("cowplot")
library("dplyr")
library("DT")
library("edgeR")
library("ggplot2")
library("knitr")
theme_set(theme_cowplot())
source("../code/functions.R")
library("Biobase") # has to be loaded last to use `combine`

Import data.

eset <- readRDS("../data/eset.rds")
dim(eset)

Features  Samples 
   54792     4992

Filter

Only keep high-quality single cells.

quality <- read.table("../data/quality-single-cells.txt", stringsAsFactors = FALSE)
colnames(quality) <- c("sample", "quality")
eset <- eset[, quality$quality]
dim(eset)

Features  Samples 
   54792     3910

Isolate the human genes.

eset <- eset[fData(eset)$source == "H. sapiens", ]
dim(eset)

Features  Samples 
   20327     3910

Remove zeros.

eset <- eset[rowSums(exprs(eset)) != 0, ]
dim(eset)

Features  Samples 
   19656     3910

Only keep genes which are observed in at least 50% of the samples.

# Function `present` is defined in ../code/functions.R
eset <- eset[apply(exprs(eset), 1, present), ]
dim(eset)

Features  Samples 
    6979     3910

Select most variable genes

Convert to log2 counts per million.

log2cpm <- cpm(exprs(eset), log = TRUE)
dim(log2cpm)

[1] 6979 3910

Calculate coefficient of variation.

compute_cv <- function(x) sd(x) / mean(x)
cv <- apply(log2cpm, 1, compute_cv)
summary(cv)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.02658 0.19235 0.28446 0.25889 0.33827 0.42273

Select 25% of genes with highest CV.

cutoff <- 0.25
summary(cv[rank(cv) / length(cv) > 1 - cutoff])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3383  0.3483  0.3560  0.3558  0.3628  0.4227

variable <- log2cpm[rank(cv) / length(cv) > 1 - cutoff, ]
dim(variable)

[1] 1745 3910

PCA

pca <- run_pca(variable)
head(pca$explained * 100)

[1] 3.0263690 0.9880011 0.6295410 0.4630944 0.3477797 0.3030006

plot_pca(pca$PCs, pcx = 1, pcy = 2, explained = pca$explained,
         metadata = pData(eset), color = "batch")

PCA versus covariates

Calculate the adjusted R-squared for various covariates and the first 6 principal components.

get_r2 <- function(x, y) {
  stopifnot(length(x) == length(y))
  model <- lm(y ~ x)
  stats <- summary(model)
  return(stats$adj.r.squared)
}
covariates <- pData(eset) %>% select(experiment, batch, concentration, tra1.60,
                                     index, raw:unmapped, starts_with("detect"),
                                     chipmix, freemix)
pcs <- pca$PCs[, 1:6]
r2 <- matrix(NA, nrow = ncol(covariates), ncol = ncol(pcs),
             dimnames = list(colnames(covariates), colnames(pcs)))
for (cov in colnames(covariates)) {
  for (pc in colnames(pcs)) {
    r2[cov, pc] <- get_r2(covariates[, cov], pcs[, pc])
  }
}

PC1 is most highly correlated with the percentage of detected genes and other metrics of sequencing depth, which is consistent with the observations of Hicks et al., 2017. PCs 2-6 most highly correlate with variation across C1 chips.

heatmap(r2)

datatable(r2)

Here is the description of all the experimental variables that were correlated with the PCs.

kable(varMetadata(eset)[colnames(covariates), , drop = FALSE])

	labelDescription
experiment	ID of C1 chip (i.e. processing date in MMDDYYYY)
batch	Batch the C1 chip was processed in (b1, b2, …)
concentration	The cDNA concentration of the well prior to library prep
tra1.60	Did the cell stain positive for TRA-1-60? (test of pluripotency)
index	The set of indexes used for library prep (of the 3 sets of 96)
raw	The number of raw reads
umi	The number of reads with a valid UMI
mapped	The number of reads with a valid UMI that mapped to a genome
unmapped	The number of reads with a valid UMI that did not map to a genome
detect_ce	The number of C. elegans genes with at least one molecule
detect_dm	The number of D. melanogaster genes with at least one molecule
detect_ercc	The number of ERCC genes with at least one molecule
detect_hs	The number of H. sapiens genes with at least one molecule
chipmix	verifyBamID: chipmix is a metric for detecting sample swaps
freemix	verifyBamID: freemix is a measure of contamination. 0 == good & 0.5 == bad

Session information

sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.2 (Nitrogen)

Matrix products: default
BLAS: /project2/gilad/jdblischak/miniconda3/envs/scqtl/lib/R/lib/libRblas.so
LAPACK: /project2/gilad/jdblischak/miniconda3/envs/scqtl/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  methods   stats     graphics  grDevices utils     datasets 
[8] base     

other attached packages:
 [1] testit_0.6          Biobase_2.38.0      BiocGenerics_0.24.0
 [4] knitr_1.16          edgeR_3.20.1        limma_3.34.1       
 [7] DT_0.2              dplyr_0.7.4         cowplot_0.9.1      
[10] ggplot2_2.2.1      

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13     highr_0.6        compiler_3.4.1   git2r_0.19.0    
 [5] plyr_1.8.4       bindr_0.1        tools_3.4.1      digest_0.6.12   
 [9] jsonlite_1.4     evaluate_0.10.1  tibble_1.3.3     gtable_0.2.0    
[13] lattice_0.20-34  pkgconfig_2.0.1  rlang_0.1.2      yaml_2.1.14     
[17] bindrcpp_0.2     stringr_1.2.0    htmlwidgets_0.9  locfit_1.5-9.1  
[21] rprojroot_1.2    grid_3.4.1       glue_1.1.1       R6_2.2.0        
[25] rmarkdown_1.6    magrittr_1.5     backports_1.0.5  scales_0.5.0    
[29] htmltools_0.3.6  assertthat_0.1   colorspace_1.3-2 labeling_0.3    
[33] stringi_1.1.2    lazyeval_0.2.0   munsell_0.4.3

This R Markdown site was created with workflowr