Last updated: 2017-04-11
Code version: b26dce3b41a0b3d6e330e880422006017b6a5f48
All processing scripts were run from the data directory.
cd /mnt/gluster/data/internal_supp/singleCellSeq
The raw fastq files were write-protected to avoid accidental deletion.
chmod uga=r fastq/*fastq.gz
Most steps are saved in shell scripts. This code should be able to run in any Unix-like environment (after installing the necessary software, of course). To process the data efficiently on our HPC, we wrote a wrapper script, submit-array.sh, to combine the many parallel jobs into one array job submitted to the Sun Grid Engine (SGE) scheduler. Not only does this make it easier for the scheduler (and you) to manage thousands of tasks, the wrapper also skips jobs whose output file already exists. This is convenient when a small fraction of jobs fail and need to be re-run. However, if your HPC uses a scheduler other than SGE, you will have to decide how best to submit the jobs.
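To illustrate the idea, here is a minimal sketch of an SGE array-job wrapper. It is not the actual submit-array.sh; the file-list name, job name, and skip-if-done convention shown here are illustrative.

#!/bin/bash
# Sketch of the array-job idea behind submit-array.sh: one qsub call with -t
# covers all input files, and each task picks its own file via $SGE_TASK_ID.
# Usage: sketch-submit-array.sh <per-file script> <memory> <input files...>
SCRIPT=$1      # per-file processing script, e.g. run-fastqc.sh
MEMORY=$2      # per-task memory request, e.g. 2g
shift 2
printf "%s\n" "$@" > file-list.txt          # one input file per line
NUM_TASKS=$(wc -l < file-list.txt)
# Each per-file script checks for its own output and exits early if it already
# exists, so re-submitting the array only re-runs the failed tasks.
echo "bash $SCRIPT \$(sed -n \"\${SGE_TASK_ID}p\" file-list.txt)" | \
  qsub -S /bin/bash -t 1-"$NUM_TASKS" -l h_vmem="$MEMORY" -cwd -j y -V -N "$SCRIPT"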
create-genome.sh downloads the fasta files for human genome hg19 (chromosomes 1-22, X, Y, M) and the ERCC spike-ins, and indexes the combined genome with subread-buildindex. The output is saved in the directory genome.
submit-array.sh create-genome.sh 8g genome
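A rough sketch of what create-genome.sh does. The download URL pattern, the ERCC fasta file name, and the index name are assumptions; the script itself records the exact sources.

#!/bin/bash
# Sketch of create-genome.sh: fetch the hg19 chromosomes, then build a Subread
# index from the chromosomes plus the ERCC spike-in sequences.
mkdir -p genome
cd genome
for CHR in {1..22} X Y M
do
  wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr$CHR.fa.gz
done
gunzip chr*.fa.gz
# ERCC.fa is assumed to already be present in this directory (spike-in fasta).
subread-buildindex -o combined chr*.fa ERCC.fa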
run-fastqc.sh runs FastQC on each raw fastq file and also counts the number of reads. The output is saved in the directory fastqc.
submit-array.sh run-fastqc.sh 2g fastq/*fastq.gz
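A minimal sketch of the per-file work; the output file naming here is an assumption.

#!/bin/bash
# Sketch of run-fastqc.sh for one raw fastq file: run FastQC and record the
# number of reads. Skips the work if the FastQC zip already exists.
FILE=$1
BASE=$(basename "${FILE%.fastq.gz}")
mkdir -p fastqc
if [ ! -e fastqc/"$BASE"_fastqc.zip ]
then
  fastqc --outdir fastqc "$FILE"
  # A fastq record is 4 lines, so reads = lines / 4.
  zcat "$FILE" | wc -l | awk '{print $1 / 4}' > fastqc/"$BASE".count.txt
fi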
ls fastqc/*zip | wc -l
ls fastqc/*count.txt | wc -l
grep -w success ~/log/run-fastqc.sh/* | wc -l
grep -w failure ~/log/run-fastqc.sh/* | wc -l
trim.sh removes the 5 bp UMI at the 5’ end of the read using the program umitools trim. We used umitools v2.1.1. Output fastq files are written to the directory trim. Reads without a valid UMI are written to the directory invalid.
submit-array.sh trim.sh 2g fastq/*fastq.gz
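As an illustration of the transformation only (the pipeline itself uses umitools trim, not this snippet), the following moves the first 5 bases of each read into the read name; fastq/example.fastq.gz and the :UMI: naming are made up here.

# Illustration of UMI extraction: move the first 5 bases of each read (the UMI)
# into the read name and drop them from the sequence and quality strings.
zcat fastq/example.fastq.gz | awk '
  NR % 4 == 1 { name = $1 }                 # @read header
  NR % 4 == 2 { umi = substr($0, 1, 5)      # first 5 bases are the UMI
                seq = substr($0, 6) }
  NR % 4 == 3 { plus = $0 }
  NR % 4 == 0 { qual = substr($0, 6)        # drop the matching quality values
                print name ":UMI:" umi "\n" seq "\n" plus "\n" qual }
' | head -n 8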
To confirm that the jobs ran successfully:
ls trim/*fastq.gz | wc -l
ls invalid/*fastq.gz | wc -l
grep -w success ~/log/trim.sh/* | wc -l
grep -w failure ~/log/trim.sh/* | wc -l
To re-run any failed jobs, we simply re-submitted the original command. If the output file already exists, the code is not run and “success” is not echoed to the log file.
sickle.sh performs quality trimming of the 3’ end of the reads (the -x flag disables trimming of the 5’ end). We used the default quality thresholds and sickle version 1.33. Output is written to the directory sickle.
submit-array.sh sickle.sh 2g trim/*fastq.gz
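A sketch of the per-file step; the quality encoding (-t sanger) and the output naming are assumptions.

#!/bin/bash
# Sketch of sickle.sh for one UMI-trimmed fastq file.
FILE=$1
BASE=$(basename "${FILE%.fastq.gz}")
mkdir -p sickle
if [ ! -e sickle/"$BASE".sickle.fastq.gz ]
then
  # -x disables 5'-end trimming so the read start position used for UMI
  # counting is preserved; quality and length thresholds are left at defaults.
  sickle se -x -g -t sanger \
    -f "$FILE" \
    -o sickle/"$BASE".sickle.fastq.gz
fi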
To confirm that the jobs ran successfully:
ls sickle/*fastq.gz | wc -l
grep -w success ~/log/sickle.sh/* | wc -l
grep -w failure ~/log/sickle.sh/* | wc -l
map-subjunc.sh maps the reads to the combined genome described above. We use Subjunc to ensure precise mapping of the 5’ end of the read, which is necessary because the combination of the 5’ start position and the UMI sequence is used to convert reads to molecules. We used Subread version 1.5.0-p1 with the default thresholds, reporting only uniquely mapping reads (flag -u). Output BAM files are written to the directory bam.
submit-array.sh map-subjunc.sh 12g sickle/*fastq.gz
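A sketch of the per-file mapping step. The index name genome/combined is an assumption, and we assume BAM is the default output format for this version of Subread.

#!/bin/bash
# Sketch of map-subjunc.sh for one quality-trimmed fastq file.
FILE=$1
BASE=$(basename "${FILE%.fastq.gz}")
mkdir -p bam
if [ ! -e bam/"$BASE".bam ]
then
  # -u keeps only uniquely mapping reads; all other settings are defaults.
  subjunc -i genome/combined -r "$FILE" -u -o bam/"$BASE".bam
fi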
ls bam/*bam | wc -l
grep -w success ~/log/map-subjunc.sh/* | wc -l
grep -w failure ~/log/map-subjunc.sh/* | wc -l
process-bam.sh uses samtools version 0.1.18-dev (r982:313) to sort and index each BAM file. The output is written to the directory bam-processed.
submit-array.sh process-bam.sh 8g bam/*bam
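A sketch of the per-file step, using the old samtools 0.1.x sort syntax (an output prefix rather than -o); the file naming is illustrative.

#!/bin/bash
# Sketch of process-bam.sh for one BAM file: coordinate-sort, then index.
FILE=$1
BASE=$(basename "${FILE%.bam}")
mkdir -p bam-processed
if [ ! -e bam-processed/"$BASE".sorted.bam ]
then
  # samtools 0.1.x sort takes an output prefix and appends .bam itself.
  samtools sort "$FILE" bam-processed/"$BASE".sorted
  samtools index bam-processed/"$BASE".sorted.bam
fi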
ls bam-processed/*bam | wc -l
grep -w success ~/log/process-bam.sh/* | wc -l
grep -w failure ~/log/process-bam.sh/* | wc -l
Check for the presence of intermediate files output during sorting.
ls bam-processed/*sorted*0*bam
Molecule counts are obtained by counting the number of unique UMIs observed at each start position. In order for this to be accurate for a given single cell, we need to combine all the sequencing data for a given sample. We merge the sorted BAM files using samtools merge and save the output BAM files in the directory bam-combined.
# From head node
mkdir -p bam-combined
mkdir -p ~/log/combine
for IND in 19098 19101 19239
do
  for REP in {1..3}
  do
    for ROW in {A..H}
    do
      for COL in {1..12}
      do
        ID=`printf "%s.%s.%s%02d\n" $IND $REP $ROW $COL`
        echo $ID
        echo "samtools merge bam-combined/$ID.trim.sickle.sorted.combined.bam bam-processed/$ID*trim.sickle.sorted.bam" | qsub -l h_vmem=4g -N $ROW$COL.$IND.$REP.combine -cwd -o ~/log/combine -j y -V
      done
    done
  done
done
Confirm that it worked. We only expect 864 samples (3 individuals * 3 replicates * 96 wells). The bulk samples are not combined because they violate the assumptions of a UMI protocol: they contain too many unique molecules for the 1,024 possible UMIs (4^5 with a 5 bp UMI) to properly tag them all.
ls bam-combined/*bam | wc -l
# There should be no output from samtools merge
cat ~/log/combine/* | head
Then index each merged sample:
# from head node
mkdir -p ~/log/index
for IND in 19098 19101 19239
do
  for REP in {1..3}
  do
    for ROW in {A..H}
    do
      for COL in {1..12}
      do
        ID=`printf "%s.%s.%s%02d\n" $IND $REP $ROW $COL`
        echo $ID
        echo "samtools index bam-combined/$ID.trim.sickle.sorted.combined.bam" | qsub -l h_vmem=2g -N $ROW$COL.$IND.$REP.index -cwd -o ~/log/index -j y -V -l 'hostname=!bigmem02'
      done
    done
  done
done
Confirm that it worked.
ls bam-combined/*bai | wc -l
# There should be no output from samtools index
cat ~/log/index/* | head
rmdup-umi.sh counts the number of UMIs at each start position using UMI-tools (version e0ade5d). To account for errors introduced during PCR amplification and sequencing, UMIs are merged using the method “directional adjacency” (flags --method="directional-adjacency" and --edit-distance-threshold=1), which is described in this blog post. The output is written to the directory bam-rmdup-umi.
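A sketch of the per-file step. The dedup_umi.py entry point and its -I/-S options are assumptions about that era of UMI-tools; the method flags are the ones quoted above.

#!/bin/bash
# Sketch of rmdup-umi.sh for one sorted, indexed BAM file. The entry point and
# I/O flags are assumptions; --method and --edit-distance-threshold are as above.
FILE=$1
BASE=$(basename "${FILE%.bam}")
mkdir -p bam-rmdup-umi
if [ ! -e bam-rmdup-umi/"$BASE".rmdup.bam ]
then
  # Collapse reads sharing a start position and a UMI (allowing edit distance 1,
  # merged with the directional-adjacency network method) into single molecules.
  python dedup_umi.py --method="directional-adjacency" \
    --edit-distance-threshold=1 \
    -I "$FILE" -S bam-rmdup-umi/"$BASE".rmdup.bam
fi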
Two different sets of BAM files are processed in this way. The BAM files in bam-processed contain the mapped reads for a given sample for a given sequencing lane. We exclude the bulk samples because we do not want their molecule counts (grep -v bulk). These lane-level molecule counts will be used to assess the technical effects of sequencing on different flow cells and lanes (anticipated to be minimal). The molecule counts from the combined single cell samples in bam-combined are used in downstream analysis.
submit-array.sh rmdup-umi.sh 2g `ls bam-processed/*bam | grep -v bulk` bam-combined/*bam
Expect 3,456 BAM files (864 combined samples + 3 individuals * 3 reps * 96 wells * 3 lanes).
ls bam-rmdup-umi/*bam | wc -l
grep -w success ~/log/rmdup-umi.sh/* | wc -l
grep -w failure ~/log/rmdup-umi.sh/* | wc -l
count-reads-per-gene.sh uses featureCounts version 1.5.0-p1 to count the number of reads or molecules per gene. We perform strand-specific counting (flag -s 1) because the UMI protocol preserves sequence strand information. The per-lane reads for both the bulk and single cell samples are in bam-processed. The per-lane and combined molecules for the single cells only are in bam-rmdup-umi. The output is written to the directory counts.
submit-array.sh count-reads-per-gene.sh 2g bam-processed/*bam bam-rmdup-umi/*bam
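A sketch of the per-file counting step. The annotation file genome/exons.saf is an assumption; the real script documents the annotation it uses.

#!/bin/bash
# Sketch of count-reads-per-gene.sh for one BAM file (reads or molecules).
FILE=$1
BASE=$(basename "${FILE%.bam}")
mkdir -p counts
if [ ! -e counts/"$BASE".genecounts.txt ]
then
  # -s 1 counts only reads on the annotated strand, which is appropriate
  # because the UMI protocol preserves strand information.
  featureCounts -a genome/exons.saf -F SAF -s 1 \
    -o counts/"$BASE".genecounts.txt "$FILE"
fi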
Expect 6,120 files:
ls counts/*genecounts.txt | wc -l
grep -w success ~/log/count-reads-per-gene.sh/* | wc -l
grep -w failure ~/log/count-reads-per-gene.sh/* | wc -l
gather-total-counts.py gathers the total number of reads at each stage of the processing pipeline. The output is saved in total-counts.txt.
gather-total-counts.py > total-counts.txt
gather-summary-counts.py gathers the classification of reads assigned by featureCounts. It only gathers from the single cell samples, because this information is used for quality control filtering. The output is saved in summary-counts.txt.
gather-summary-counts.py > summary-counts.txt
gather-gene-counts.py gathers the featureCounts results and creates count matrices. The output is saved in the directory counts-matrix.
mkdir -p counts-matrix
gather-gene-counts.py counts-matrix/ counts/*genecounts.txt
It creates six files: