Last updated: 2015-09-28
Code version: 3f442a23f405d3564d9ab3197ec337df22fe9383
First, compared counts via three methods:
Then investigated the relationship between sequencing depth and total molecule count per sample. Found that sequencing depth affects the total molecule count, which in turn affects PC1. Will use TMM-normalize molecule counts per million mapped (cpm) for downstream analyses.
Therefore reran the original comparisons between reads and molecules, but this time using TMM-normalized counts per million for the molecules similar to the reads. The correlation of the mean expression improved.
library("dplyr")
library("ggplot2")
theme_set(theme_bw(base_size = 16))
library("edgeR")
source("functions.R")
Input annotation.
anno <- read.table("../data/annotation.txt", header = TRUE,
stringsAsFactors = FALSE)
head(anno)
individual batch well sample_id
1 19098 1 A01 NA19098.1.A01
2 19098 1 A02 NA19098.1.A02
3 19098 1 A03 NA19098.1.A03
4 19098 1 A04 NA19098.1.A04
5 19098 1 A05 NA19098.1.A05
6 19098 1 A06 NA19098.1.A06
Input read counts.
reads <- read.table("../data/reads.txt", header = TRUE,
stringsAsFactors = FALSE)
Input molecule counts.
molecules <- read.table("../data/molecules.txt", header = TRUE,
stringsAsFactors = FALSE)
Input molecule counts summed across lanes.
molecules_per_lane <- read.table("../data/molecules-per-lane.txt", header = TRUE,
stringsAsFactors = FALSE)
Input list of quality single cells.
quality_single_cells <- scan("../data/quality-single-cells.txt",
what = "character")
Keep only the single cells that passed the QC filters and the bulk samples.
reads <- reads[, grepl("bulk", colnames(reads)) |
colnames(reads) %in% quality_single_cells]
molecules <- molecules[, grepl("bulk", colnames(molecules)) |
colnames(molecules) %in% quality_single_cells]
molecules_per_lane <- molecules_per_lane[, grepl("bulk", colnames(molecules_per_lane)) |
colnames(molecules_per_lane) %in% quality_single_cells]
anno <- anno[anno$well == "bulk" | anno$sample_id %in% quality_single_cells, ]
stopifnot(dim(reads) == dim(molecules),
nrow(anno) == ncol(molecules_per_lane))
Remove genes with zero read or molecule counts in the single cell or bulk samples.
expressed <- rowSums(reads[anno$well == "bulk"]) > 0 &
rowSums(reads[anno$well != "bulk"]) > 0 &
rowSums(molecules[anno$well == "bulk"]) > 0 &
rowSums(molecules[anno$well != "bulk"]) > 0
reads <- reads[expressed, ]
molecules <- molecules[expressed, ]
molecules_per_lane <- molecules_per_lane[expressed, ]
Calculate the number of reads per molecule of each gene in each cell.
reads_per_molecule <- as.matrix(reads/molecules)
hist(reads_per_molecule, breaks=100)
plot(density(reads_per_molecule, na.rm = TRUE))
Calculate cpm for the reads data using TMM-normalization.
norm_factors_reads <- calcNormFactors(reads, method = "TMM")
reads_cpm <- cpm(reads, lib.size = colSums(reads) * norm_factors_reads)
And for the molecules.
norm_factors_mol <- calcNormFactors(molecules, method = "TMM")
molecules_cpm <- cpm(molecules, lib.size = colSums(molecules) * norm_factors_mol)
And for the molecules summed per lane.
norm_factors_mol_per_lane <- calcNormFactors(molecules_per_lane, method = "TMM")
molecules_per_lane_cpm <- cpm(molecules_per_lane,
lib.size = colSums(molecules_per_lane) *
norm_factors_mol_per_lane)
Compare the means of each gene obtained via the different methods.
mean_data <- data.frame(reads_cpm = rowMeans(reads_cpm),
molecules = rowMeans(molecules),
molecules_per_lane = rowMeans(molecules_per_lane))
cor(mean_data)
reads_cpm molecules molecules_per_lane
reads_cpm 1.0000000 0.8909973 0.8769356
molecules 0.8909973 1.0000000 0.9944794
molecules_per_lane 0.8769356 0.9944794 1.0000000
All three are highly correlated.
mean_data$type <- ifelse(grepl("ERCC", rownames(mean_data)), "ERCC", "gene")
ggplot(mean_data, aes(x = reads_cpm, y = molecules)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ type)
There are only a few genes with molecule counts greater than the number of UMIs.
rownames(molecules)[rowMeans(molecules) > 1024]
[1] "ENSG00000198712" "ENSG00000198938" "ENSG00000198886"
They are highly expressed mitochondrial genes.
ggplot(mean_data, aes(x = reads_cpm, y = molecules)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ type) +
ylim(0, 1100)
ggplot(mean_data, aes(x = reads_cpm, y = molecules_per_lane)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ type)
The molecule counts and the molecule counts summed per sequencing lane are highly correlated. This indicates that most of the bias is introduced in the library preparation step and not during sequencing.
ggplot(mean_data, aes(x = molecules_per_lane, y = molecules)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ type)
How dependent are the molecule counts on the total molecule count for a given sample? Should we standardize by the total molecule count per sample? Islam et al. 2014 argue that this is not necessary, “scales for molecule-counting scatterplots (Fig. 2d,e) are absolute and would not change appreciably if the number of reads were increased.” Let’s check this assumption.
Does the total number of molecules per sample vary with the total number of reads? If it is not necessary to standardize the molecule counts, the molecule counts should be consistent across varying read depths.
total_counts_data <- data.frame(total_reads = colSums(reads) / 10^6,
total_molecules = colSums(molecules) / 10^3,
anno)
str(total_counts_data)
'data.frame': 587 obs. of 6 variables:
$ total_reads : num 1.89 1.99 1.27 1.99 1.75 ...
$ total_molecules: num 93.1 95.1 74.7 104.7 98.2 ...
$ individual : int 19098 19098 19098 19098 19098 19098 19098 19098 19098 19098 ...
$ batch : int 1 1 1 1 1 1 1 1 1 1 ...
$ well : chr "A01" "A02" "A04" "A05" ...
$ sample_id : chr "NA19098.1.A01" "NA19098.1.A02" "NA19098.1.A04" "NA19098.1.A05" ...
total_counts_single <- ggplot(total_counts_data[total_counts_data$well != "bulk", ],
aes(x = total_reads, y = total_molecules)) +
geom_point(aes(col = as.factor(individual), shape = as.factor(batch))) +
geom_smooth(method = "lm") +
labs(x = "Total number of reads (x10^6)",
y = "Total number of molecules (x10^3)",
title = "Effect of read depth on single cells")
total_counts_single
total_counts_bulk <- total_counts_single %+%
total_counts_data[total_counts_data$well == "bulk", ] +
labs(x = "Total number of reads (x10^6)",
y = "Total number of molecules (x10^3)",
title = "Effect of read depth on bulk samples")
total_counts_bulk
So this is clearly not the case. Perhaps in the ideal case where all the cells are sequenced to saturation, then any increasing sequencing would not make a difference in the molecule counts.
Also, there is a difference in total molecule count between the three individuals.
total_molecules_ind <- ggplot(total_counts_data[total_counts_data$well != "bulk", ],
aes(x = as.factor(individual), y = total_molecules)) +
geom_boxplot(aes(fill = as.factor(batch))) +
scale_fill_brewer(type = "qual", palette = "Dark2", name ="Batch") +
labs(x = "Individual",
y = "Molecules (x10^3)",
title = "Total molecule counts vary across individuals")
total_molecules_ind
But not for the total number of reads, as expected from the plot above of the effect of read depth where all three individuals span the x-axis of the total number of reads.
total_reads_ind <- total_molecules_ind %+% aes(y = total_reads) +
labs(y = "Reads (x10^6)",
title = "Total read counts vary across individuals")
total_reads_ind
There is a clear difference between individuals. Is there a difference between the full 9 batches?
total_counts_single_batch <- ggplot(total_counts_data[total_counts_data$well != "bulk", ],
aes(x = total_reads, y = total_molecules,
col = paste(individual, batch, sep = "."))) +
geom_point() +
scale_color_brewer(palette = "Set1", name = "9 batches") +
labs(x = "Total number of reads (x10^6)",
y = "Total number of molecules (x10^3)",
title = "Effect of read depth on single cells")
total_counts_single_batch
It is diffcult to see when all 9 are plotted at once. Here are the batches split by individual.
total_counts_single_ind <- ggplot(total_counts_data[total_counts_data$well != "bulk", ],
aes(x = total_reads, y = total_molecules)) +
geom_point(aes(color = as.factor(batch))) +
facet_wrap(~individual, nrow = 3) +
scale_color_brewer(palette = "Dark2", name = "batch") +
labs(x = "Total number of reads (x10^6)",
y = "Total number of molecules (x10^3)",
title = "Effect of read depth on single cells")
total_counts_single_ind
Pairwise distance in (total reads, total molecules) between batches or individuals.
total_counts_single <- total_counts_data[total_counts_data$well != "bulk", ]
total_counts_single_dist <- as.matrix( dist(total_counts_single[ , c("total_reads", "total_molecules")]) )
rownames(total_counts_single_dist) <- with(total_counts_single,
paste(individual, batch, sep = "_"))
colnames(total_counts_single_dist) <- rownames(total_counts_single_dist)
total_counts_single[1:2, 1:2]
total_reads total_molecules
NA19098.1.A01 1.892562 93.117
NA19098.1.A02 1.988496 95.125
ind_index <- (total_counts_single$individual)
ind_batch_index <- with(total_counts_single,paste(individual, batch, sep = "_"))
same_ind_index <- outer(ind_index,ind_index,function(x,y) x==y)
same_batch_index <- outer(ind_batch_index,ind_batch_index,function(x,y) x==y)
dim_temp <- dim(total_counts_single_dist)
dist_index_matrix <- matrix("diff_ind",nrow=dim_temp[1],ncol=dim_temp[2])
dist_index_matrix[same_ind_index & !same_batch_index] <- "same_ind_diff_batch"
dist_index_matrix[same_batch_index] <- "same_batch"
ans <- lapply(unique(c(dist_index_matrix)),function(x){
temp <- c(total_counts_single_dist[(dist_index_matrix==x)&(upper.tri(dist_index_matrix,diag=FALSE))])
data.frame(dist=temp,type=rep(x,length(temp)))
})
ans1 <- do.call(rbind,ans)
boxplot(dist~type,data=ans1)
plot(density(ans1$dist[ans1$type=="same_batch"]))
lines(density(ans1$dist[ans1$type=="same_ind_diff_batch"]),col=2)
lines(density(ans1$dist[ans1$type=="diff_ind"]),col=3)
ggplot(ans1, aes(x= factor(type), y = dist, col = factor(type)), height = 600, width = 2000) +
geom_boxplot(outlier.shape = NA, alpha = .01, width = .2, position = position_dodge(width = .9)) +
ylim(0, 100) +
labs(title = "cell-cell distance (total molecule and total reads)") +
theme(axis.text.x = element_text(hjust=1, angle = 45))
Warning: Removed 1285 rows containing non-finite values (stat_boxplot).
Warning: Removed 292 rows containing missing values (geom_point).
Warning: Removed 613 rows containing missing values (geom_point).
Warning: Removed 1195 rows containing missing values (geom_point).
summary(lm(dist~type,data=ans1))
Call:
lm(formula = dist ~ type, data = ans1)
Residuals:
Min 1Q Median 3Q Max
-29.126 -16.646 -4.327 12.006 138.499
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.0283 0.1512 165.572 < 2e-16 ***
typesame_ind_diff_batch 0.8673 0.1883 4.606 4.1e-06 ***
typediff_ind 4.1509 0.1644 25.254 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.43 on 166750 degrees of freedom
Multiple R-squared: 0.006382, Adjusted R-squared: 0.00637
F-statistic: 535.5 on 2 and 166750 DF, p-value: < 2.2e-16
## calculate conversion rate
total_counts_single$conversion_rate <- total_counts_single$total_reads/ total_counts_single$total_molecules
plot(total_counts_single$conversion_rate,col=total_counts_single$batch)
## create 9 batches
total_counts_single$batch_num <- paste(total_counts_single$individual,total_counts_single$batch,sep="_")
## calculate individual mean for standardization
avg_conversion_rate <- do.call(rbind,
lapply(unique(total_counts_single$individual),function(x){
c(individual=x,avg_conv_rate=mean(total_counts_single$conversion_rate[total_counts_single$individual==x]))
}))
total_counts_single <-merge(total_counts_single,avg_conversion_rate,by="individual")
## create std_conv_rate within individual to remove individual effect
total_counts_single$std_conv_rate <- total_counts_single$conversion_rate - total_counts_single$avg_conv_rate
plot(total_counts_single$std_conv_rate,col=total_counts_single$batch)
summary(lm(std_conv_rate~batch_num,data=total_counts_single))
Call:
lm(formula = std_conv_rate ~ batch_num, data = total_counts_single)
Residuals:
Min 1Q Median 3Q Max
-0.017995 -0.001046 0.000454 0.001550 0.007489
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.439e-04 3.055e-04 -0.798 0.425037
batch_num19098_2 -3.580e-03 7.888e-04 -4.539 6.91e-06 ***
batch_num19098_3 1.614e-03 4.822e-04 3.347 0.000871 ***
batch_num19101_1 3.796e-05 4.431e-04 0.086 0.931757
batch_num19101_2 1.144e-03 4.583e-04 2.496 0.012834 *
batch_num19101_3 -6.282e-04 4.959e-04 -1.267 0.205748
batch_num19239_1 -1.306e-03 4.416e-04 -2.956 0.003242 **
batch_num19239_2 5.405e-04 4.601e-04 1.175 0.240640
batch_num19239_3 1.522e-03 4.402e-04 3.458 0.000584 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.002817 on 569 degrees of freedom
Multiple R-squared: 0.1427, Adjusted R-squared: 0.1307
F-statistic: 11.84 on 8 and 569 DF, p-value: 1.14e-15
## violin plots
ggplot(total_counts_single, aes(x= factor(batch_num), y = conversion_rate, fill = factor(batch_num)), height = 600, width = 2000) +
geom_violin(alpha = .5) +
geom_boxplot(alpha = .01, width = .2, position = position_dodge(width = .9)) +
labs(x = "batch", y = "conversion rate", title = "individual batch effect of conversion rate") +
theme(axis.text.x = element_text(hjust=1, angle = 45))
ggplot(total_counts_single, aes(x= factor(batch_num), y = std_conv_rate, fill = factor(batch_num)), height = 600, width = 2000) +
geom_violin(alpha = .5) +
geom_boxplot(alpha = .01, width = .2, position = position_dodge(width = .9)) +
labs(x = "batch", y = "std conversion rate", title = "batch effect of conversion rate") +
theme(axis.text.x = element_text(hjust=1, angle = 45))
What effect does this difference in total molecule count have in PCA?
pca_single <- run_pca(molecules[, anno$well != "bulk"])
plot_pca(pca_single$PCs, explained = pca_single$explained,
metadata = anno[anno$well != "bulk", ], color = "individual",
shape = "batch", factors = c("individual", "batch")) +
labs(title = "PCA single cells uncorrected for depth")
pc1_v_total_mol_uncorrected <- ggplot(cbind(total_counts_data[total_counts_data$well != "bulk", ], pca_single$PCs),
aes(x = total_molecules, y = PC1, col = as.factor(individual),
shape = as.factor(batch))) +
geom_point(alpha = 0.5) +
labs(x = "Total number of molecules (x10^3)")
pc1_v_total_mol_uncorrected
pc2_v_total_mol_uncorrected <- pc1_v_total_mol_uncorrected %+% aes(y = PC2)
pc2_v_total_mol_uncorrected
The total molecule depth per sample is highly correlated with PC1. However, it did not affect PC2, which captures the individual effect.
What happens to the PCA results when depth is properly accounted for using TMM-normalized counts per million?
norm_factors_mol_single <- calcNormFactors(molecules[, anno$well != "bulk"],
method = "TMM")
molecules_cpm_single <- cpm(molecules[, anno$well != "bulk"],
lib.size = colSums(molecules[, anno$well != "bulk"]) *
norm_factors_mol_single)
pca_single_cpm <- run_pca(molecules_cpm_single)
plot_pca(pca_single_cpm$PCs, explained = pca_single_cpm$explained,
metadata = anno[anno$well != "bulk", ], color = "individual",
shape = "batch", factors = c("individual", "batch")) +
labs(title = "PCA single cells *corrected* for depth")
pc1_v_total_mol_corrected <- pc1_v_total_mol_uncorrected %+%
cbind(total_counts_data[total_counts_data$well != "bulk", ], pca_single_cpm$PCs) +
labs(title = "PCA single cells *corrected* for depth")
pc1_v_total_mol_corrected
pc2_v_total_mol_corrected <- pc1_v_total_mol_corrected %+% aes(y = PC2)
pc2_v_total_mol_corrected
PC1 is no longer associated with sequencing depth!
This time standardize the molecule counts for the sequencing depth.
Compare the means of each gene obtained via the different methods.
mean_data_std <- data.frame(reads_cpm = rowMeans(reads_cpm),
molecules_cpm = rowMeans(molecules_cpm),
molecules_per_lane_cpm = rowMeans(molecules_per_lane_cpm))
cor(mean_data_std)
reads_cpm molecules_cpm molecules_per_lane_cpm
reads_cpm 1.0000000 0.9704740 0.9703383
molecules_cpm 0.9704740 1.0000000 0.9986011
molecules_per_lane_cpm 0.9703383 0.9986011 1.0000000
All three are even more highly correlated now that the molecules are standardized.
mean_data_std$type <- ifelse(grepl("ERCC", rownames(mean_data_std)), "ERCC", "gene")
ggplot(mean_data_std, aes(x = reads_cpm, y = molecules_cpm)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ type)
Examining the lower range where most genes are:
ggplot(mean_data_std, aes(x = reads_cpm, y = molecules_cpm)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ type) +
ylim(0, 10000)
ggplot(mean_data_std, aes(x = reads_cpm, y = molecules_per_lane_cpm)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ type)
And as above, the molecule counts and the molecule counts summed per sequencing lane are highly correlated, which indicates that most of the bias is introduced in the library preparation step and not during sequencing.
ggplot(mean_data_std, aes(x = molecules_per_lane_cpm, y = molecules_cpm)) +
geom_point() +
geom_smooth(method = "lm") +
facet_wrap(~ type)
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] testit_0.4 edgeR_3.10.2 limma_3.24.9 ggplot2_1.0.1 dplyr_0.4.2
[6] knitr_1.10.5
loaded via a namespace (and not attached):
[1] Rcpp_0.12.0 magrittr_1.5 MASS_7.3-40
[4] munsell_0.4.2 colorspace_1.2-6 R6_2.1.1
[7] stringr_1.0.0 httr_0.6.1 plyr_1.8.3
[10] tools_3.2.0 parallel_3.2.0 grid_3.2.0
[13] gtable_0.1.2 DBI_0.3.1 htmltools_0.2.6
[16] yaml_2.1.13 assertthat_0.1 digest_0.6.8
[19] RColorBrewer_1.1-2 reshape2_1.4.1 formatR_1.2
[22] bitops_1.0-6 RCurl_1.95-4.6 evaluate_0.7
[25] rmarkdown_0.6.1 labeling_0.3 stringi_0.4-1
[28] scales_0.2.4 proto_0.3-10