Last updated: 2016-06-29

Code version: c59e3a495e7257caa3180d1dfdaf315cdd79d715

PoYuan performed the original troubleshooting of the UMI protocol with LCLs from individual NA19239. One flow cell worked well and contains data that we can use. Lanes 1-4 each contain 24 single cells from a 96-well C1 chip. Lanes 5-8 each contain one single cell from a different C1 chip, so these cells have been extremely over-sequenced. We can use them to estimate how many sequenced reads are required before no new molecules are observed. To make these comparisons, we need to process them through the same pipeline as the iPSC data.

Setting up

The plan is to keep all the LCL data in a subdirectory of the main data directory. All the commands below are run from this new directory.

cd /mnt/gluster/home/jdblischak/ssd/
mkdir lcl
cd lcl

In order to keep the scripts simple, the paths to the genome file and the exons file are hard-coded as relative paths. Thus I created a symlink in the subdirectory that points to the directory genome which contains these files.

ln -s ../genome/ genome

Transfer fastq files

The fastq files are found here:


Conveniently, the new version of Casava sorts the fastq files by sample so there is no need to consult the sample sheet. The subdirectories for the 4 full lane single cells are:

The new version of Casava also splits the data into separate files such that each file contains at most 4 million reads. This has pros and cons. The con is that we will later have to manually combine the files from each sample in order to quantify molecules with UMIs. The pro is that it is easier to parallelize the processing of many small chunks.

zcat /rawdata/Illumina_Runs/150116_SN_0795_0416_AC5V7FACXX/Demultiplexed/Unaligned/Project_N/Sample_19239_LCL_A9E1/19239_LCL_A9E1_ATTAGACG_L005_R1_001.fastq.gz | grep "@D7L" | wc -l
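The grep above relies on the instrument-specific prefix of the header lines. A prefix-free alternative is to divide the number of lines by four. The following is a minimal sketch, assuming standard four-line FASTQ records; `count_reads` is a hypothetical helper and not one of the pipeline scripts.

```shell
# count_reads is a hypothetical helper, not part of the pipeline scripts.
# It assumes every record in the gzipped FASTQ file spans exactly four lines.
count_reads() {
  gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}
```

It would be called with a single file, e.g. `count_reads fastq/some_chunk.fastq.gz` (a placeholder name), and a full chunk should report at most 4 million reads.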

Creating symlinks in the new fastq directory.

mkdir fastq
for LANE in A9E1 B2E2 B4H1 D2H2
do
  find /rawdata/Illumina_Runs/150116_SN_0795_0416_AC5V7FACXX/Demultiplexed/Unaligned/Project_N/Sample_19239_LCL_${LANE} -name "*fastq.gz" -exec ln -s {} fastq/ \;
done

There are a total of 148 fastq files.

ls fastq | wc -l
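Beyond the total, it can be useful to see how the files are distributed across the four cells. The following is a rough sketch, assuming the file names follow the pattern of the example above (`19239_LCL_<WELL>_...fastq.gz`, so the well is the third underscore-separated field); `count_per_well` is a hypothetical helper.

```shell
# count_per_well is a hypothetical helper, not part of the pipeline scripts.
# It assumes file names look like 19239_LCL_<WELL>_<index>_<lane>_R1_<chunk>.fastq.gz
# and prints one "<WELL> <number of files>" pair per line.
count_per_well() {
  ls "$1" | grep 'fastq\.gz$' | cut -d_ -f3 | sort | uniq -c | awk '{ print $2, $1 }'
}
```

Running `count_per_well fastq` from the directory lcl should list the four wells, and the counts should sum to the total reported by `ls fastq | wc -l`.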

All processing scripts continue to be run from the directory lcl.

Trim UMI

2g fastq/*fastq.gz

To confirm that the jobs ran successfully:

ls trim/*fastq.gz | wc -l
grep -w success ~/log/* | wc -l
grep -w failure ~/log/* | wc -l

To re-run failed jobs, I re-ran the original command. If the output file already exists, the code is not run and “success” is not echo’d to the log file.
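The skip-if-present behavior described above can be illustrated with a small wrapper. This is only a sketch of the idea, not the code in the actual processing scripts; `run_step` and its log messages are hypothetical.

```shell
# run_step is a hypothetical wrapper illustrating the re-run behavior.
# Usage: run_step OUTPUT_FILE COMMAND [ARGS...]
# COMMAND is expected to create OUTPUT_FILE.
run_step() {
  out=$1
  shift
  if [ -e "$out" ]; then
    # Output already exists: skip the command and log nothing.
    return 0
  fi
  if "$@" && [ -e "$out" ]; then
    echo "success: $out"
  else
    echo "failure: $out"
    return 1
  fi
}
```

Because only jobs that actually run write to the log, re-submitting everything is safe: completed files are skipped silently, and only the previously failed inputs produce new "success" or "failure" lines. One caveat of testing with `-e` is that an empty file left behind by a crashed job would also be skipped; `-s` would be a stricter check.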

Quality trim 3’ end of reads

2g trim/*fastq.gz

To confirm that the jobs ran successfully:

ls sickle/*fastq.gz | wc -l
grep -w success ~/log/* | wc -l
grep -w failure ~/log/* | wc -l
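Rather than comparing the counts by eye, the number of output files can be compared directly to the number of input files. The following is a minimal sketch, assuming each step writes exactly one output file per input file (my assumption, not something the scripts guarantee); `same_count` is a hypothetical helper.

```shell
# same_count is a hypothetical helper, not part of the pipeline scripts.
# Usage: same_count INPUT_DIR OUTPUT_DIR SUFFIX
# Succeeds only if both directories hold the same number of files ending in SUFFIX.
same_count() {
  n_in=$(ls "$1"/*"$3" 2>/dev/null | wc -l | tr -d ' ')
  n_out=$(ls "$2"/*"$3" 2>/dev/null | wc -l | tr -d ' ')
  echo "$1: $n_in, $2: $n_out"
  [ "$n_in" -eq "$n_out" ]
}
```

For this step the call would be `same_count trim sickle fastq.gz`; a non-zero exit status means at least one job still needs to be re-run.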

Map to genome

12g sickle/*fastq.gz
ls bam/*bam | wc -l
grep -w success ~/log/* | wc -l
grep -w failure ~/log/* | wc -l

Process bam files

8g bam/*bam
ls bam-processed/*bam | wc -l
grep -w success ~/log/* | wc -l
grep -w failure ~/log/* | wc -l

Check for the presence of intermediate files output during sorting.

ls bam-processed/*sorted*0*bam
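When no intermediate files remain, the `ls` above simply reports that nothing matches. For use in a script, the same check can be turned into a pass/fail test. This is a small sketch using the same glob pattern; `leftover_sort_files` is a hypothetical helper.

```shell
# leftover_sort_files is a hypothetical helper, not part of the pipeline scripts.
# It lists files in the given directory that match the pattern used above and
# returns non-zero if at least one is found.
leftover_sort_files() {
  found=$(ls "$1"/*sorted*0*bam 2>/dev/null)
  if [ -n "$found" ]; then
    echo "$found"
    return 1
  fi
  return 0
}
```

Calling `leftover_sort_files bam-processed` prints nothing and succeeds when the directory is clean.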

Combine bam files per sample

Merge and index each single cell. Also update the names so that they match the iPSC naming scheme so that they can be processed similarly in later stages.

# From head node
mkdir -p bam-combined
mkdir -p ~/log/
for WELL in A9E1 B2E2 B4H1 D2H2
do
  # TARGET_FILE: output path under bam-combined/ for this well; it must be set before this line
  echo "samtools merge $TARGET_FILE bam-processed/19239_LCL_$WELL*trim.sickle.sorted.bam; samtools index $TARGET_FILE" | qsub -l h_vmem=32g -N $WELL.lcl.combine -cwd -o ~/log/ -j y -V -l 'hostname=!(bigmem01|bigmem02)'
done
ls bam-combined/*bam | wc -l
cat ~/log/*

Remove duplicate UMIs

16g bam-combined/*bam
ls bam-rmdup-umi/*bam | wc -l
grep -w success ~/log/* | wc -l
grep -w failure ~/log/* | wc -l

Count reads per gene

8g bam-combined/*bam bam-rmdup-umi/*bam
ls counts/*genecounts.txt | wc -l
grep -w success ~/log/* | wc -l
grep -w failure ~/log/* | wc -l

Remove the *.featureCounts files created by the -R flag. These report the assignment of each read, which is only useful for detailed diagnostics. Because each file is data from a whole lane, these files are large.

rm counts/*.featureCounts

Gather gene counts

This step gathers the counts for each gene for each sequencing lane. Have to use because has been specialized for the output from the full pipeline for the iPSCs.

mkdir counts-matrix
counts-matrix/ counts/*genecounts.txt