Last updated: 2015-07-13
Code version: a4d865fb33bba771e8c3bdcd182882556aedab55
The mean counts per million in a subsampled set of single cells (both sequencing depth and number of cells is varied) is compared to the mean counts per million of the bulk cells (mean across the three batches). The mean counts are log2 transformed with an added pseudocount of 1.
library("dplyr")
library("ggplot2")
theme_set(theme_bw(base_size = 14))
Run 10 iterations for each individual for each sequencing depth for each subsample of cells. The analysis is performed by correlate-single-to-bulk.R.
cd $ssd/subsampled
mkdir -p correlation correlation-quantiles
mkdir -p ~/log/correlate-single-to-bulk.R
for IND in 19098 19101 19239
do
for NUM in 200000 400000 600000 800000 1000000 1200000 1400000 1600000 1800000 2000000 2400000 2800000 3200000 3600000 4000000
do
for CELLS in 5 10 15 20 25 50 75 100 125 150
do
for SEED in {1..10}
do
# Correlation across all genes
CMD="correlate-single-to-bulk.R $CELLS $SEED molecule-counts-$NUM.txt $ssc/data/reads.txt --individual=$IND --good_cells=/mnt/lustre/home/jdblischak/singleCellSeq/data/quality-single-cells.txt"
DEST="correlation/$IND-$CELLS-$SEED-$NUM.txt"
echo "$CMD > $DEST" | qsub -l h_vmem=2g -cwd -V -N cor-$IND-$CELLS-$SEED-$NUM -j y -o ~/log/correlate-single-to-bulk.R -l 'hostname=!bigmem01'
sleep .01s
# Correlation for genes separated by the specified quantiles
CMD="correlate-single-to-bulk.R $CELLS $SEED molecule-counts-$NUM.txt $ssc/data/reads.txt --individual=$IND --good_cells=/mnt/lustre/home/jdblischak/singleCellSeq/data/quality-single-cells.txt -q .25 -q .5 -q .75"
DEST="correlation-quantiles/$IND-$CELLS-$SEED-$NUM.txt"
echo "$CMD > $DEST" | qsub -l h_vmem=2g -cwd -V -N cor-$IND-$CELLS-$SEED-$NUM-quantiles -j y -o ~/log/correlate-single-to-bulk.R -l 'hostname=!bigmem01'
sleep .01s
done
done
done
done
Convert to one file using Python. Run from $ssd/subsampled
.
import os
import glob
def gather(files, outfile):
out = open(outfile, "w")
out.write("ind\tdepth\tnum_cells\tseed\tquantiles\tr\tnum_genes\n")
for fname in files:
fname_parts = os.path.basename(fname).rstrip(".txt").split("-")
ind = fname_parts[0]
depth = fname_parts[3]
f = open(fname, "r")
for line in f:
out.write(ind + "\t" + depth + "\t" + line)
f.close()
out.close()
files = glob.glob("correlation/*txt")
gather(files, "correlation.txt")
files = glob.glob("correlation-quantiles/*txt")
gather(files, "correlation-quantiles.txt")
r_data <- read.table("/mnt/gluster/data/internal_supp/singleCellSeq/subsampled/correlation.txt",
header = TRUE, sep = "\t", stringsAsFactors = FALSE)
Calculate the mean and standard error of the mean (sem) for each of the 10 iterations.
r_data_plot <- r_data %>%
group_by(ind, depth, num_cells) %>%
summarize(mean = mean(r), sem = sd(r) / sqrt(length(r)))
p <- ggplot(r_data_plot, aes(x = num_cells, y = mean, color = as.factor(depth))) +
geom_line() +
geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 1) +
facet_wrap(~ind) +
labs(x = "Number of subsampled cells",
y = "Pearson's r (mean +/- sem)",
color = "Depth",
title = "Subsample: Correlation of single and bulk cells")
p
Focus on the lower end of the distribution of the number of subsampled cells.
p + xlim(0, 30)
Pollen et al. 2014 observed that the correlation between the average expression in a group of single cells and the expression in a bulk sample of 100 cells quickly saturates at about 10 single cells (Figure 2ab). Each point is the mean of 10 subsamples. The high coverage samples were sequenced at a depth of ~8.9 × 106 reads per cell, and the low coverage at ~2.7 × 105 reads per cell. The data comes from 46 K562 cells.
In the analysis above, all genes were used. By default, the expression level in the bulk cells is used to remove the bottom 25% of genes from the analysis. This basically removes genes that are clearly unexpressed. Starting from this list of “expressed” genes, I split them into their expression quartiles. Thus we can see how the correlation is affected by the expression level.
r_q_data <- read.table("/mnt/gluster/data/internal_supp/singleCellSeq/subsampled/correlation-quantiles.txt",
header = TRUE, sep = "\t", stringsAsFactors = FALSE)
Calculate the mean and standard error of the mean (sem) for each of the 10 iterations.
r_q_data_plot <- r_q_data %>%
group_by(ind, depth, num_cells, quantiles) %>%
summarize(mean = mean(r), sem = sd(r) / sqrt(length(r)), num_genes = num_genes[1])
Some single cells did not have enough reads for the higher subsamples.
r_q_data_plot <- na.omit(r_q_data_plot)
The full results.
p <- ggplot(r_q_data_plot, aes(x = num_cells, y = mean, color = as.factor(depth))) +
geom_line() +
geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 1) +
facet_grid(quantiles~ind) +
labs(x = "Number of subsampled cells",
y = "Pearson's r (mean +/- sem)",
color = "Depth",
title = "Subsample: Correlation of single and bulk cells")
p
Focus on one cell line, two sequencing depths, and the lower end of the number cells.
p %+% r_q_data_plot[r_q_data_plot$ind == 19098 &
r_q_data_plot$depth %in% c(2e5, 4e6), ] +
xlim(0, 75)
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_1.0.1 dplyr_0.4.1 knitr_1.10.5
loaded via a namespace (and not attached):
[1] Rcpp_0.11.6 magrittr_1.5 MASS_7.3-40 munsell_0.4.2
[5] colorspace_1.2-6 stringr_1.0.0 plyr_1.8.2 tools_3.2.0
[9] parallel_3.2.0 grid_3.2.0 gtable_0.1.2 DBI_0.3.1
[13] htmltools_0.2.6 yaml_2.1.13 lazyeval_0.1.10 assertthat_0.1
[17] digest_0.6.8 reshape2_1.4.1 formatR_1.2 evaluate_0.7
[21] rmarkdown_0.6.1 labeling_0.3 stringi_0.4-1 scales_0.2.4
[25] proto_0.3-10