Subsample: Correlation of single and bulk cells

Batch process each subsampled data set
Results
Pollen et al. 2014
Results by expression quantiles
Session information

Last updated: 2015-07-13

Code version: a4d865fb33bba771e8c3bdcd182882556aedab55

The mean counts per million in a subsampled set of single cells (both sequencing depth and number of cells is varied) is compared to the mean counts per million of the bulk cells (mean across the three batches). The mean counts are log₂ transformed with an added pseudocount of 1.

library("dplyr")
library("ggplot2")
theme_set(theme_bw(base_size = 14))

Batch process each subsampled data set

Run 10 iterations for each individual for each sequencing depth for each subsample of cells. The analysis is performed by correlate-single-to-bulk.R.

cd $ssd/subsampled
mkdir -p correlation correlation-quantiles
mkdir -p ~/log/correlate-single-to-bulk.R
for IND in 19098 19101 19239
do
  for NUM in 200000 400000 600000 800000 1000000 1200000 1400000 1600000 1800000 2000000 2400000 2800000 3200000 3600000 4000000
  do
    for CELLS in 5 10 15 20 25 50 75 100 125 150
    do
      for SEED in {1..10}
      do
        # Correlation across all genes
        CMD="correlate-single-to-bulk.R $CELLS $SEED molecule-counts-$NUM.txt $ssc/data/reads.txt --individual=$IND --good_cells=/mnt/lustre/home/jdblischak/singleCellSeq/data/quality-single-cells.txt"
        DEST="correlation/$IND-$CELLS-$SEED-$NUM.txt"
        echo "$CMD > $DEST" | qsub -l h_vmem=2g -cwd -V -N cor-$IND-$CELLS-$SEED-$NUM -j y -o ~/log/correlate-single-to-bulk.R -l 'hostname=!bigmem01'
        sleep .01s
        # Correlation for genes separated by the specified quantiles
        CMD="correlate-single-to-bulk.R $CELLS $SEED molecule-counts-$NUM.txt $ssc/data/reads.txt --individual=$IND --good_cells=/mnt/lustre/home/jdblischak/singleCellSeq/data/quality-single-cells.txt -q .25 -q .5 -q .75"
        DEST="correlation-quantiles/$IND-$CELLS-$SEED-$NUM.txt"
        echo "$CMD > $DEST" | qsub -l h_vmem=2g -cwd -V -N cor-$IND-$CELLS-$SEED-$NUM-quantiles -j y -o ~/log/correlate-single-to-bulk.R -l 'hostname=!bigmem01'
        sleep .01s
      done
    done
  done
done

Convert to one file using Python. Run from $ssd/subsampled.

import os
import glob

def gather(files, outfile):
    out = open(outfile, "w")
    out.write("ind\tdepth\tnum_cells\tseed\tquantiles\tr\tnum_genes\n")
    for fname in files:
        fname_parts = os.path.basename(fname).rstrip(".txt").split("-")
        ind = fname_parts[0]
        depth = fname_parts[3]
        f = open(fname, "r")
        for line in f:
            out.write(ind + "\t" + depth + "\t" + line)
        f.close()
    out.close()

files = glob.glob("correlation/*txt")
gather(files, "correlation.txt")

files = glob.glob("correlation-quantiles/*txt")
gather(files, "correlation-quantiles.txt")

Results

r_data <- read.table("/mnt/gluster/data/internal_supp/singleCellSeq/subsampled/correlation.txt",
                     header = TRUE, sep = "\t", stringsAsFactors = FALSE)

Calculate the mean and standard error of the mean (sem) for each of the 10 iterations.

r_data_plot <- r_data %>%
  group_by(ind, depth, num_cells) %>%
  summarize(mean = mean(r), sem = sd(r) / sqrt(length(r)))

p <- ggplot(r_data_plot, aes(x = num_cells, y = mean, color = as.factor(depth))) +
  geom_line() +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 1) +
  facet_wrap(~ind) +
  labs(x = "Number of subsampled cells",
       y = "Pearson's r (mean +/- sem)",
       color = "Depth",
       title = "Subsample: Correlation of single and bulk cells")
p

Focus on the lower end of the distribution of the number of subsampled cells.

p + xlim(0, 30)

Pollen et al. 2014

Pollen et al. 2014 observed that the correlation between the average expression in a group of single cells and the expression in a bulk sample of 100 cells quickly saturates at about 10 single cells (Figure 2ab). Each point is the mean of 10 subsamples. The high coverage samples were sequenced at a depth of ~8.9 × 10⁶ reads per cell, and the low coverage at ~2.7 × 10⁵ reads per cell. The data comes from 46 K562 cells.

Pollen et al. 2014 - Figure 2ab

Results by expression quantiles

In the analysis above, all genes were used. By default, the expression level in the bulk cells is used to remove the bottom 25% of genes from the analysis. This basically removes genes that are clearly unexpressed. Starting from this list of “expressed” genes, I split them into their expression quartiles. Thus we can see how the correlation is affected by the expression level.

r_q_data <- read.table("/mnt/gluster/data/internal_supp/singleCellSeq/subsampled/correlation-quantiles.txt",
                     header = TRUE, sep = "\t", stringsAsFactors = FALSE)

Calculate the mean and standard error of the mean (sem) for each of the 10 iterations.

r_q_data_plot <- r_q_data %>%
  group_by(ind, depth, num_cells, quantiles) %>%
  summarize(mean = mean(r), sem = sd(r) / sqrt(length(r)), num_genes = num_genes[1])

Some single cells did not have enough reads for the higher subsamples.

r_q_data_plot <- na.omit(r_q_data_plot)

The full results.

p <- ggplot(r_q_data_plot, aes(x = num_cells, y = mean, color = as.factor(depth))) +
  geom_line() +
  geom_errorbar(aes(ymin = mean - sem, ymax = mean + sem), width = 1) +
  facet_grid(quantiles~ind) +
  labs(x = "Number of subsampled cells",
       y = "Pearson's r (mean +/- sem)",
       color = "Depth",
       title = "Subsample: Correlation of single and bulk cells")
p

Focus on one cell line, two sequencing depths, and the lower end of the number cells.

p %+% r_q_data_plot[r_q_data_plot$ind == 19098 &
                    r_q_data_plot$depth %in% c(2e5, 4e6), ] +
  xlim(0, 75)

Session information

sessionInfo()

R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_1.0.1 dplyr_0.4.1   knitr_1.10.5 

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6      magrittr_1.5     MASS_7.3-40      munsell_0.4.2   
 [5] colorspace_1.2-6 stringr_1.0.0    plyr_1.8.2       tools_3.2.0     
 [9] parallel_3.2.0   grid_3.2.0       gtable_0.1.2     DBI_0.3.1       
[13] htmltools_0.2.6  yaml_2.1.13      lazyeval_0.1.10  assertthat_0.1  
[17] digest_0.6.8     reshape2_1.4.1   formatR_1.2      evaluate_0.7    
[21] rmarkdown_0.6.1  labeling_0.3     stringi_0.4-1    scales_0.2.4    
[25] proto_0.3-10