Process sequence data

Setting up
Transfer fastq files
Trim UMI
Quality trim 3’ end of reads
Map to genome
Process bam files
Combine bam files per sample
Remove duplicate UMIs
Count reads per gene
Gather gene counts

Last updated: 2016-06-29

Code version: c59e3a495e7257caa3180d1dfdaf315cdd79d715

PoYuan performed the original troubleshooting of the UMI protocol with LCLs from individual NA19239. One flow cell worked well and contains data that we can use. Lanes 1-4 each contain 24 single cells from a 96-well C1 chip. Lanes 5-8 each contain one single cell from a different C1 chip. Thus they have been extremely over sequenced. We can use these to address the number of sequenced reads required to completely exhaust the observation of any new molecules. In order to make these comparisons, we need to process them through the same pipeline as the iPSC data.

Setting up

The plan is to keep all the LCL data in a subdirectory of the main data directory. All the commands below are run from this new directory.

cd /mnt/gluster/home/jdblischak/ssd/
mkdir lcl
cd lcl

In order to keep the scripts simple, the paths to the genome file and the exons file are hard-coded as relative paths. Thus I created a symlink in the subdirectory that points to the directory genome which contains these files.

ln -s ../genome/ genome

Transfer fastq files

The fastq files are found here:

/rawdata/Illumina_Runs/150116_SN_0795_0416_AC5V7FACXX/Demultiplexed/Unaligned/Project_N/

Conveniently, the new version of Casava sorts the fastq files by sample so there is no need to consult the sample sheet. The subdirectories for the 4 full lane single cells are:

Sample_19239_LCL_A9E1
Sample_19239_LCL_B2E2
Sample_19239_LCL_B4H1
Sample_19239_LCL_D2H2

The new version of Casava also splits the data into separate files such that each file contains at most 4 million reads. This has its pros and cons. The con is that we will have to later manually combine these samples for the purpose of quantifying molecules with UMIs. The pro is that it will be easier to parallelize the processing of many small chunks.

zcat /rawdata/Illumina_Runs/150116_SN_0795_0416_AC5V7FACXX/Demultiplexed/Unaligned/Project_N/Sample_19239_LCL_A9E1/19239_LCL_A9E1_ATTAGACG_L005_R1_001.fastq.gz | grep "@D7L" | wc -l
4000000

Creating symlinks in the new fastq directory.

mkdir fastq
for LANE in A9E1 B2E2 B4H1 D2H2
do
  find /rawdata/Illumina_Runs/150116_SN_0795_0416_AC5V7FACXX/Demultiplexed/Unaligned/Project_N/Sample_19239_LCL_${LANE} -name "*fastq.gz" -exec ln -s {} fastq/ \;
done

There are a total of 148 fastq files.

ls fastq | wc -l
148

All processing scripts continue to be run from the directory lcl.

Trim UMI

submit-array.sh trim.sh 2g fastq/*fastq.gz

To confirm that the jobs ran successfully:

ls trim/*fastq.gz | wc -l
grep -w success ~/log/trim.sh/* | wc -l
grep -w failure ~/log/trim.sh/* | wc -l

To re-run failed jobs, I re-ran the original command. If the output file already exists, the code is not run and “success” is not echo’d to the log file.

Quality trim 3’ end of reads

submit-array.sh sickle.sh 2g trim/*fastq.gz

To confirm that the jobs ran successfully:

ls sickle/*fastq.gz | wc -l
grep -w success ~/log/sickle.sh/* | wc -l
grep -w failure ~/log/sickle.sh/* | wc -l

Map to genome

submit-array.sh map-subjunc.sh 12g sickle/*fastq.gz

ls bam/*bam | wc -l
grep -w success ~/log/map-subjunc.sh/* | wc -l
grep -w failure ~/log/map-subjunc.sh/* | wc -l

Process bam files

Sort bam
Index bam

submit-array.sh process-bam.sh 8g bam/*bam

ls bam-processed/*bam | wc -l
grep -w success ~/log/process-bam.sh/* | wc -l
grep -w failure ~/log/process-bam.sh/* | wc -l

Check for the presence of intermediate files output during sorting.

ls bam-processed/*sorted*0*bam

Combine bam files per sample

Merge and index each single cell. Also update the names so that they match the iPSC naming scheme so that they can be processed similarly in later stages.

# From head node
mkdir -p bam-combined
mkdir -p ~/log/combine.sh
for WELL in A9E1 B2E2 B4H1 D2H2
do
  TARGET_FILE=bam-combined/19239.1.$WELL.trim.sickle.sorted.combined.bam
  echo $TARGET_FILE
  echo "samtools merge $TARGET_FILE bam-processed/19239_LCL_$WELL*trim.sickle.sorted.bam; samtools index $TARGET_FILE" | qsub -l h_vmem=32g -N $WELL.lcl.combine -cwd -o ~/log/combine.sh -j y -V -l 'hostname=!(bigmem01|bigmem02)'
done

ls bam-combined/*bam | wc -l
cat ~/log/combine.sh/*

Remove duplicate UMIs

submit-array.sh rmdup-umi.sh 16g bam-combined/*bam

ls bam-rmdup-umi/*bam | wc -l
grep -w success ~/log/rmdup-umi.sh/* | wc -l
grep -w failure ~/log/rmdup-umi.sh/* | wc -l

Count reads per gene

submit-array.sh count-reads-per-gene.sh 8g bam-combined/*bam bam-rmdup-umi/*bam

ls counts/*genecounts.txt | wc -l
grep -w success ~/log/count-reads-per-gene.sh/* | wc -l
grep -w failure ~/log/count-reads-per-gene.sh/* | wc -l

Remove the *.featureCounts files created by the -R flag. These report the assignment of each read, which is only useful for detailed diagnostics. Because each file is data from a whole lane, these files are large.

rm counts/*.featureCounts

Gather gene counts

The counts for each gene for each sequencing lane. Have to use gather-gene-counts-subsample.py because gather-gene-counts.py has been specialized for the output from the full pipeline for the iPSCs.

mkdir counts-matrix
gather-gene-counts-subsample.py counts-matrix/ counts/*genecounts.txt

Process sequence data - LCLs

John Blischak

2015-04-11