Last updated: 2017-04-11
Code version: b26dce3b41a0b3d6e330e880422006017b6a5f48
All processing scripts were run from the data directory.
cd /mnt/gluster/data/internal_supp/singleCellSeq
The raw fastq files were write-protected to avoid accidental deletion.
chmod uga=r fastq/*fastq.gz
Most steps are saved in shell scripts. This code should be able to run in any Unix-like environment (after installing the necessary software, of course). To process the data efficiently on our HPC, we wrote a wrapper script, submit-array.sh, to combine the many parallel jobs into one array job submitted to the Sun Grid Engine (SGE) scheduler. Not only does this make it easier for the scheduler (and you) to manage thousands of tasks, the wrapper also skips jobs whose output file already exists. This is convenient when a small fraction of jobs fail and need to be re-run. However, if your HPC uses a scheduler other than SGE, you will have to decide how best to submit the jobs.
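To illustrate the idea, here is a minimal sketch of an SGE array-job wrapper. It is not the actual submit-array.sh; the file-list name, job name, and skip-if-done convention shown here are illustrative.

#!/bin/bash
# Sketch of the array-job idea behind submit-array.sh: one qsub call with -t
# covers all input files, and each task picks its own file via $SGE_TASK_ID.
# Usage: sketch-submit-array.sh <per-file script> <memory> <input files...>
SCRIPT=$1      # per-file processing script, e.g. run-fastqc.sh
MEMORY=$2      # per-task memory request, e.g. 2g
shift 2
printf "%s\n" "$@" > file-list.txt          # one input file per line
NUM_TASKS=$(wc -l < file-list.txt)
# Each per-file script checks for its own output and exits early if it already
# exists, so re-submitting the array only re-runs the failed tasks.
echo "bash $SCRIPT \$(sed -n \"\${SGE_TASK_ID}p\" file-list.txt)" | \
  qsub -S /bin/bash -t 1-"$NUM_TASKS" -l h_vmem="$MEMORY" -cwd -j y -V -N "$SCRIPT"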
create-genome.sh downloads the fasta files for human genome hg19 (chromosomes 1-22, X, Y, M) and the ERCC spike-ins, and indexes the combined genome with subread-buildindex. The output is saved in the directory genome.
submit-array.sh create-genome.sh 8g genome
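A rough sketch of what create-genome.sh does. The download URL pattern, the ERCC fasta file name, and the index name are assumptions; the script itself records the exact sources.

#!/bin/bash
# Sketch of create-genome.sh: fetch the hg19 chromosomes, then build a Subread
# index from the chromosomes plus the ERCC spike-in sequences.
mkdir -p genome
cd genome
for CHR in {1..22} X Y M
do
  wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr$CHR.fa.gz
done
gunzip chr*.fa.gz
# ERCC.fa is assumed to already be present in this directory (spike-in fasta).
subread-buildindex -o combined chr*.fa ERCC.fa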
run-fastqc.sh runs FastQC on each raw fastq file and also counts the number of reads. The output is saved in the directory fastqc.
submit-array.sh run-fastqc.sh 2g fastq/*fastq.gz
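A minimal sketch of the per-file work; the output file naming here is an assumption.

#!/bin/bash
# Sketch of run-fastqc.sh for one raw fastq file: run FastQC and record the
# number of reads. Skips the work if the FastQC zip already exists.
FILE=$1
BASE=$(basename "${FILE%.fastq.gz}")
mkdir -p fastqc
if [ ! -e fastqc/"$BASE"_fastqc.zip ]
then
  fastqc --outdir fastqc "$FILE"
  # A fastq record is 4 lines, so reads = lines / 4.
  zcat "$FILE" | wc -l | awk '{print $1 / 4}' > fastqc/"$BASE".count.txt
fi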
ls fastqc/*zip | wc -l
ls fastqc/*count.txt | wc -l
grep -w success ~/log/run-fastqc.sh/* | wc -l
grep -w failure ~/log/run-fastqc.sh/* | wc -l
trim.sh removes the 5 bp UMI at the 5’ end of the read using the program umitools trim. We used umitools v2.1.1. Output fastq files are written to the directory trim. Reads without a valid UMI are written to the directory invalid.
submit-array.sh trim.sh 2g fastq/*fastq.gz
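As an illustration of the transformation only (the pipeline itself uses umitools trim, not this snippet), the following moves the first 5 bases of each read into the read name; fastq/example.fastq.gz and the :UMI: naming are made up here.

# Illustration of UMI extraction: move the first 5 bases of each read (the UMI)
# into the read name and drop them from the sequence and quality strings.
zcat fastq/example.fastq.gz | awk '
  NR % 4 == 1 { name = $1 }                 # @read header
  NR % 4 == 2 { umi = substr($0, 1, 5)      # first 5 bases are the UMI
                seq = substr($0, 6) }
  NR % 4 == 3 { plus = $0 }
  NR % 4 == 0 { qual = substr($0, 6)        # drop the matching quality values
                print name ":UMI:" umi "\n" seq "\n" plus "\n" qual }
' | head -n 8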
To confirm that the jobs ran successfully:
ls trim/*fastq.gz | wc -l
ls invalid/*fastq.gz | wc -l
grep -w success ~/log/trim.sh/* | wc -l
grep -w failure ~/log/trim.sh/* | wc -l
To re-run any failed jobs, we simply re-submitted the original command. If the output file already exists, the code is not run and “success” is not echoed to the log file.
sickle.sh performs quality trimming of the 3’ end of the reads (the -x flag disables trimming of the 5’ end). We used the default quality thresholds and sickle version 1.33. Output is written to the directory sickle.
submit-array.sh sickle.sh 2g trim/*fastq.gz
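A sketch of the per-file step; the quality encoding (-t sanger) and the output naming are assumptions.

#!/bin/bash
# Sketch of sickle.sh for one UMI-trimmed fastq file.
FILE=$1
BASE=$(basename "${FILE%.fastq.gz}")
mkdir -p sickle
if [ ! -e sickle/"$BASE".sickle.fastq.gz ]
then
  # -x disables 5'-end trimming so the read start position used for UMI
  # counting is preserved; quality and length thresholds are left at defaults.
  sickle se -x -g -t sanger \
    -f "$FILE" \
    -o sickle/"$BASE".sickle.fastq.gz
fi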
To confirm that the jobs ran successfully:
ls sickle/*fastq.gz | wc -l
grep -w success ~/log/sickle.sh/* | wc -l
grep -w failure ~/log/sickle.sh/* | wc -l
map-subjunc.sh maps the reads to the combined genome described above. We use Subjunc to ensure precise mapping of the 5’ end of the read, which is necessary because the combination of the 5’ start position and the UMI sequence is used to convert reads to molecules. We used Subread version 1.5.0-p1 with the default thresholds, reporting only uniquely mapping reads (flag -u). Output BAM files are written to the directory bam.
submit-array.sh map-subjunc.sh 12g sickle/*fastq.gz
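A sketch of the per-file mapping step. The index name genome/combined is an assumption, and we assume BAM is the default output format for this version of Subread.

#!/bin/bash
# Sketch of map-subjunc.sh for one quality-trimmed fastq file.
FILE=$1
BASE=$(basename "${FILE%.fastq.gz}")
mkdir -p bam
if [ ! -e bam/"$BASE".bam ]
then
  # -u keeps only uniquely mapping reads; all other settings are defaults.
  subjunc -i genome/combined -r "$FILE" -u -o bam/"$BASE".bam
fi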
ls bam/*bam | wc -l
grep -w success ~/log/map-subjunc.sh/* | wc -l
grep -w failure ~/log/map-subjunc.sh/* | wc -l
process-bam.sh uses samtools version 0.1.18-dev (r982:313) to sort and index each BAM file. The output is written to the directory bam-processed.
submit-array.sh process-bam.sh 8g bam/*bam
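A sketch of the per-file step, using the old samtools 0.1.x sort syntax (an output prefix rather than -o); the file naming is illustrative.

#!/bin/bash
# Sketch of process-bam.sh for one BAM file: coordinate-sort, then index.
FILE=$1
BASE=$(basename "${FILE%.bam}")
mkdir -p bam-processed
if [ ! -e bam-processed/"$BASE".sorted.bam ]
then
  # samtools 0.1.x sort takes an output prefix and appends .bam itself.
  samtools sort "$FILE" bam-processed/"$BASE".sorted
  samtools index bam-processed/"$BASE".sorted.bam
fi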
ls bam-processed/*bam | wc -l
grep -w success ~/log/process-bam.sh/* | wc -l
grep -w failure ~/log/process-bam.sh/* | wc -l
Check for the presence of intermediate files output during sorting.
ls bam-processed/*sorted*0*bam
Molecule counts are obtained by counting the number of unique UMIs observed at each start position. In order for this to be accurate for a given single cell, we need to combine all the sequencing data for a given sample. We merge the sorted BAM files using samtools merge and save the output BAM files in the directory bam-combined.
# From head node
mkdir -p bam-combined
mkdir -p ~/log/combine
for IND in 19098 19101 19239
do
  for REP in {1..3}
  do
    for ROW in {A..H}
    do
      for COL in {1..12}
      do
        ID=`printf "%s.%s.%s%02d\n" $IND $REP $ROW $COL`
        echo $ID
        echo "samtools merge bam-combined/$ID.trim.sickle.sorted.combined.bam bam-processed/$ID*trim.sickle.sorted.bam" | qsub -l h_vmem=4g -N $ROW$COL.$IND.$REP.combine -cwd -o ~/log/combine -j y -V
      done
    done
  done
done
Confirm that it worked. We only expect 864 samples (3 individuals * 3 replicates * 96 wells). The bulk samples are not combined because they violate the assumptions of a UMI protocol: they contain too many unique molecules for the 1,024 possible UMIs (4^5 with a 5 bp UMI) to properly tag them all.
ls bam-combined/*bam | wc -l
# There should be no output from samtools merge
cat ~/log/combine/* | head
Then index each merged sample:
# from head node
mkdir -p ~/log/index
for IND in 19098 19101 19239
do
  for REP in {1..3}
  do
    for ROW in {A..H}
    do
      for COL in {1..12}
      do
        ID=`printf "%s.%s.%s%02d\n" $IND $REP $ROW $COL`
        echo $ID
        echo "samtools index bam-combined/$ID.trim.sickle.sorted.combined.bam" | qsub -l h_vmem=2g -N $ROW$COL.$IND.$REP.index -cwd -o ~/log/index -j y -V -l 'hostname=!bigmem02'
      done
    done
  done
done
Confirm that it worked.
ls bam-combined/*bai | wc -l
# There should be no output from samtools index
cat ~/log/index/* | head
rmdup-umi.sh counts the number of UMIs at each start position using UMI-tools (version e0ade5d). To account for errors introduced during PCR amplification and sequencing, UMIs are merged using the method “directional adjacency” (flags --method="directional-adjacency" and --edit-distance-threshold=1), which is described in this blog post. The output is written to the directory bam-rmdup-umi.
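A sketch of the per-file step. The dedup_umi.py entry point and its -I/-S options are assumptions about that era of UMI-tools; the method flags are the ones quoted above.

#!/bin/bash
# Sketch of rmdup-umi.sh for one sorted, indexed BAM file. The entry point and
# I/O flags are assumptions; --method and --edit-distance-threshold are as above.
FILE=$1
BASE=$(basename "${FILE%.bam}")
mkdir -p bam-rmdup-umi
if [ ! -e bam-rmdup-umi/"$BASE".rmdup.bam ]
then
  # Collapse reads sharing a start position and a UMI (allowing edit distance 1,
  # merged with the directional-adjacency network method) into single molecules.
  python dedup_umi.py --method="directional-adjacency" \
    --edit-distance-threshold=1 \
    -I "$FILE" -S bam-rmdup-umi/"$BASE".rmdup.bam
fi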
Two different sets of BAM files are processed in this way. The BAM files in bam-processed contain the mapped reads for a given sample for a given sequencing lane. We exclude the bulk samples because we do not want their molecule counts (grep -v bulk). These lane-level molecule counts will be used to assess the technical effects of sequencing on different flow cells and lanes (anticipated to be minimal). The molecule counts from the combined single cell samples in bam-combined are used in downstream analysis.
submit-array.sh rmdup-umi.sh 2g `ls bam-processed/*bam | grep -v bulk` bam-combined/*bam
Expect 3,456 BAM files (864 combined samples + 3 individuals * 3 reps * 96 wells * 3 lanes).
ls bam-rmdup-umi/*bam | wc -l
grep -w success ~/log/rmdup-umi.sh/* | wc -l
grep -w failure ~/log/rmdup-umi.sh/* | wc -l
count-reads-per-gene.sh uses featureCounts version 1.5.0-p1 to count the number of reads or molecules per gene. We perform strand-specific counting (flag -s 1) because the UMI protocol preserves sequence strand information. The per-lane reads for both the bulk and single cell samples are in bam-processed. The per-lane and combined molecules for the single cells only are in bam-rmdup-umi. The output is written to the directory counts.
submit-array.sh count-reads-per-gene.sh 2g bam-processed/*bam bam-rmdup-umi/*bam
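A sketch of the per-file counting step. The annotation file genome/exons.saf is an assumption; the real script documents the annotation it uses.

#!/bin/bash
# Sketch of count-reads-per-gene.sh for one BAM file (reads or molecules).
FILE=$1
BASE=$(basename "${FILE%.bam}")
mkdir -p counts
if [ ! -e counts/"$BASE".genecounts.txt ]
then
  # -s 1 counts only reads on the annotated strand, which is appropriate
  # because the UMI protocol preserves strand information.
  featureCounts -a genome/exons.saf -F SAF -s 1 \
    -o counts/"$BASE".genecounts.txt "$FILE"
fi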
Expect 6,120 files:
ls counts/*genecounts.txt | wc -l
grep -w success ~/log/count-reads-per-gene.sh/* | wc -l
grep -w failure ~/log/count-reads-per-gene.sh/* | wc -l
gather-total-counts.py gathers the total number of reads at each stage of the processing pipeline. The output is saved in total-counts.txt.
gather-total-counts.py > total-counts.txt
gather-summary-counts.py gathers the classification of reads assigned by featureCounts. It only gathers from the single cell samples, because this information is used for quality control filtering. The output is saved in summary-counts.txt.
gather-summary-counts.py > summary-counts.txt
gather-gene-counts.py gathers the featureCounts results and creates count matrices. The output is saved in the directory counts-matrix.
mkdir -p counts-matrix
gather-gene-counts.py counts-matrix/ counts/*genecounts.txt
It creates six files: