Last updated: 2016-02-18

Code version: 4552abe3b660e2153f2b995df56c8f48b825a67d

We expect that any genes with at least one read will have at least one molecule. Conversely, any gene with zero reads should have zero molecules. There are a small number of instances where these relationships do not hold. This anaysis tracks down this bug in the sequence processsing pipeline.

Input

Input annotation.

anno <- read.table("../data/annotation.txt", header = TRUE,
                   stringsAsFactors = FALSE)

Input read counts.

reads <- read.table("../data/reads.txt", header = TRUE,
                    stringsAsFactors = FALSE)
stopifnot(ncol(reads) == nrow(anno),
          colnames(reads) == anno$sample_id)

Input molecule counts.

molecules <- read.table("../data/molecules.txt", header = TRUE,
                    stringsAsFactors = FALSE)
stopifnot(ncol(molecules) == nrow(anno),
          colnames(molecules) == anno$sample_id)

Identifying problem genes

How often is this a problem? How many genes does it affect? For affected genes, how many samples are affected?

discordant_zeros <- (reads == 0) != (molecules == 0)
all_genes <- rowSums(discordant_zeros)
names(all_genes) <- rownames(reads)
problem_genes <- all_genes[all_genes > 0]
length(problem_genes)
[1] 1347
length(problem_genes) / length(all_genes)
[1] 0.07951594
summary(problem_genes)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   3.334   2.000 206.000 

This problem affects 1347 out of the 16940 total genes (7.95%). For these problem genes, the median number of affected samples is 1. However, there is a long tail, with the max number of sample affected being 206.

Next I identify a problem gene-sample combination so that I can track down what happened.

x <- names(problem_genes)[1]
plot(as.numeric(reads[x, ]), as.numeric(molecules[x, ]))

problem_sample <- colnames(reads)[discordant_zeros[x, ] == TRUE]
reads[x, problem_sample]
[1] 22
molecules[x, problem_sample]
[1] 0

For gene ENSG00000187583, the sample NA19098.r1.G11 was assigned 22 reads but 0 molecules. What happened?

Searching for where the reads are lost during the processing pipeline

Now I am going to search for the bug by inspecting the intermediate data files from the sequence processsing pipeline.

The following chunks are all Bash commands run from the data directory.

opts_chunk$set(engine = "bash")
opts_knit$set(root.dir = "/mnt/gluster/home/jdblischak/ssd")

Strange things are happening at the featureCounts step:

# reads per lane
grep ENSG00000187583 counts/19098.1.G11*trim.sickle.sorted.genecounts.txt | cut -f1,7
counts/19098.1.G11.GGCAGACT.L002.R1.C6WYKACXX.trim.sickle.sorted.genecounts.txt:ENSG00000187583 6
counts/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.genecounts.txt:ENSG00000187583 4
counts/19098.1.G11.GGCAGACT.L006.R1.C723YACXX.trim.sickle.sorted.genecounts.txt:ENSG00000187583 12
# molecules per lane
grep ENSG00000187583 counts/19098.1.G11*trim.sickle.sorted.rmdup.genecounts.txt | cut -f1,7
counts/19098.1.G11.GGCAGACT.L002.R1.C6WYKACXX.trim.sickle.sorted.rmdup.genecounts.txt:ENSG00000187583   1
counts/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.rmdup.genecounts.txt:ENSG00000187583   0
counts/19098.1.G11.GGCAGACT.L006.R1.C723YACXX.trim.sickle.sorted.rmdup.genecounts.txt:ENSG00000187583   1
# molecules per sample
grep ENSG00000187583 counts/19098.1.G11.trim.sickle.sorted.combined.rmdup.genecounts.txt | cut -f1,7
ENSG00000187583 0

So for two of the per lane reads files, they get reduced to one molecule each. This seems reasonable. But one gets reduced to zero! And the combined file also contains zero. What is happening to these reads? Searching the featureCounts assignments (-R flag).

First investigating the lane that was reduced from 6 reads to 1 molecule.

# reads per lane
grep ENSG00000187583 counts/19098.1.G11.GGCAGACT.L002.R1.C6WYKACXX.trim.sickle.sorted.bam.featureCounts
HWI-700819F:303:C6WYKACXX:2:1311:13724:70217:UMI_AGAAGGGG   Assigned    ENSG00000187583 *
HWI-700819F:303:C6WYKACXX:2:2114:18388:14175:UMI_AGAAGGGG   Assigned    ENSG00000187583 *
HWI-700819F:303:C6WYKACXX:2:2214:2454:36975:UMI_AGAAGGGG    Assigned    ENSG00000187583 *
HWI-700819F:303:C6WYKACXX:2:2214:13077:41687:UMI_AGAAGGGG   Assigned    ENSG00000187583 *
HWI-700819F:303:C6WYKACXX:2:2312:13148:100643:UMI_AGAAGGGG  Assigned    ENSG00000187583 *
HWI-700819F:303:C6WYKACXX:2:2314:4367:3532:UMI_AGAAGGGG Assigned    ENSG00000187583 *

As expected, there are 6 reads. Also they all clearly have the same UMI sequence, so it makes sense they are reduced to one molecule.

# molecules per lane
grep ENSG00000187583 counts/19098.1.G11.GGCAGACT.L002.R1.C6WYKACXX.trim.sickle.sorted.rmdup.bam.featureCounts
HWI-700819F:303:C6WYKACXX:2:2114:18388:14175:UMI_AGAAGGGG   Assigned    ENSG00000187583 *

Wonderful. Only one of the reads from that lane is assigned as a molecule.

Next the lane that was reduced from 12 reads to 1 molecule.

# reads per lane
grep ENSG00000187583 counts/19098.1.G11.GGCAGACT.L006.R1.C723YACXX.trim.sickle.sorted.bam.featureCounts
HWI-700819F:305:C723YACXX:6:1111:8035:48471:UMI_AGAAGGGG    Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:1113:6618:42326:UMI_AGAAGGGG    Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:1114:6879:12369:UMI_AGAAGGGG    Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:1208:12206:62466:UMI_AGAAGGGG   Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:1210:14240:43340:UMI_AGAAGGGG   Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:1311:4828:8532:UMI_AGAAGGGG Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:2110:20427:91561:UMI_AGAAGGGG   Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:2111:18585:18781:UMI_AGAAGGGG   Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:2112:12505:22635:UMI_AGAAGGGG   Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:2215:1185:41439:UMI_AGAAGGGG    Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:2303:9085:61288:UMI_AGAAGGGG    Assigned    ENSG00000187583 *
HWI-700819F:305:C723YACXX:6:2311:11357:44914:UMI_AGAAGGGG   Assigned    ENSG00000187583 *

Similarly, all reads in this lane have the same UMI sequence. Furthermore, it is the same UMI as in the other lane, so these are likely are originating from the same original molecule.

# molecules per lane
grep ENSG00000187583 counts/19098.1.G11.GGCAGACT.L006.R1.C723YACXX.trim.sickle.sorted.rmdup.bam.featureCounts
HWI-700819F:305:C723YACXX:6:2215:1185:41439:UMI_AGAAGGGG    Assigned    ENSG00000187583 *

Now for the problem lane. There were 4 reads, but zero molecules.

# reads per lane
grep ENSG00000187583 counts/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.bam.featureCounts
HWI-700819F:304:C6WURACXX:3:1110:14216:42424:UMI_AGAAGGGG   Assigned    ENSG00000187583 *
HWI-700819F:304:C6WURACXX:3:1204:19248:93197:UMI_AGAAGGGG   Assigned    ENSG00000187583 *
HWI-700819F:304:C6WURACXX:3:1210:7316:85169:UMI_AGAAGGGG    Assigned    ENSG00000187583 *
HWI-700819F:304:C6WURACXX:3:1306:14259:83216:UMI_AGAAGGGG   Assigned    ENSG00000187583 *

Once again the same UMI sequence!

# molecules per lane
grep ENSG00000187583 counts/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.rmdup.bam.featureCounts
# I have to add the following to change the exit status. When grep does not find
# a match, its exit status is 1. Since this is an error exit status, knitr
# assumes something has failed and stops execution.
exit 0

But no molecules! What could have possibly gone wrong only for this lane!

So where were these reads lost? Here are the problem reads.

grep ENSG00000187583 counts/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.bam.featureCounts | cut -f1
HWI-700819F:304:C6WURACXX:3:1110:14216:42424:UMI_AGAAGGGG
HWI-700819F:304:C6WURACXX:3:1204:19248:93197:UMI_AGAAGGGG
HWI-700819F:304:C6WURACXX:3:1210:7316:85169:UMI_AGAAGGGG
HWI-700819F:304:C6WURACXX:3:1306:14259:83216:UMI_AGAAGGGG

Was one of these 4 sequences passed to featureCounts, but not assigned for some reason?

grep HWI-700819F:304:C6WURACXX:3:1110:14216:42424:UMI_AGAAGGGG counts/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.rmdup.bam.featureCounts
grep HWI-700819F:304:C6WURACXX:3:1204:19248:93197:UMI_AGAAGGGG counts/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.rmdup.bam.featureCounts
grep HWI-700819F:304:C6WURACXX:3:1210:7316:85169:UMI_AGAAGGGG counts/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.rmdup.bam.featureCounts
grep HWI-700819F:304:C6WURACXX:3:1306:14259:83216:UMI_AGAAGGGG counts/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.rmdup.bam.featureCounts
exit 0

No, that is not the reason. I expected to see 1 of the 4 reads, but featureCounts does not report having seen any of these reads.

Are any of these 4 reads in the molecules bam file, i.e. “rmdup”? These are the output files from the step to remove duplicate reads (i.e. convert reads to molecules).

samtools view bam-rmdup-umi/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.rmdup.bam | grep HWI-700819F:304:C6WURACXX:3:1110:14216:42424:UMI_AGAAGGGG
samtools view bam-rmdup-umi/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.rmdup.bam | grep HWI-700819F:304:C6WURACXX:3:1204:19248:93197:UMI_AGAAGGGG
samtools view bam-rmdup-umi/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.rmdup.bam | grep HWI-700819F:304:C6WURACXX:3:1210:7316:85169:UMI_AGAAGGGG
samtools view bam-rmdup-umi/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.rmdup.bam | grep HWI-700819F:304:C6WURACXX:3:1306:14259:83216:UMI_AGAAGGGG
exit 0

So it is clear that featureCounts is not the problem. The input molecule bam file did not contain them.

Next possibility is that it was lost in the reads to molecule conversion step using UMI-tools dedup_umi.py. The input files for this step are in bam-processed.

samtools view bam-processed/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.bam | grep HWI-700819F:304:C6WURACXX:3:1110:14216:42424:UMI_AGAAGGGG
HWI-700819F:304:C6WURACXX:3:1110:14216:42424:UMI_AGAAGGGG   0   chr1    911014  59  92M *   0   0   CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCAGAGCTCCAGCCGGAG    FBBFFIFFFFF<FFFBFFFIFIFIFIF'BFFFBFFIBFFFFBFFFFFFFBFBBBFBBFBBBBBBBBBBBBBBBBBB<BBBBBBBB<BB<<7<    HI:i:1  NH:i:1  NM:i:0
samtools view bam-processed/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.bam | grep HWI-700819F:304:C6WURACXX:3:1204:19248:93197:UMI_AGAAGGGG
HWI-700819F:304:C6WURACXX:3:1204:19248:93197:UMI_AGAAGGGG   0   chr1    911014  59  92M *   0   0   CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCAGAGCTCCAGCCGGAG    FFFFFIIIIIIIIIFIFFFIIFIIIIIIFFIIIIIIIIIIIBFFIFFFFFFFFBBBFBBFBFFFFB<BBBFBBBBFBFBBBFFFF<BBFFFF    HI:i:1  NH:i:1  NM:i:0
samtools view bam-processed/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.bam | grep HWI-700819F:304:C6WURACXX:3:1210:7316:85169:UMI_AGAAGGGG
HWI-700819F:304:C6WURACXX:3:1210:7316:85169:UMI_AGAAGGGG    0   chr1    911014  59  81M11S  *   0   0   CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCAGAGCCTGTCTCTTAT    FFFFFI<BFFFFFFIIFIIFIIIIFIIIIIIIIIFIIIIIIIIFIFFFFBBFFFFBFFFFFFFFFFBBFBFFFFFFFFFFFFFFFFBFFFBB    HI:i:1  NH:i:1  NM:i:0
samtools view bam-processed/19098.1.G11.GGCAGACT.L003.R1.C6WURACXX.trim.sickle.sorted.bam | grep HWI-700819F:304:C6WURACXX:3:1306:14259:83216:UMI_AGAAGGGG
HWI-700819F:304:C6WURACXX:3:1306:14259:83216:UMI_AGAAGGGG   0   chr1    911014  58  92M *   0   0   CAGCCCACCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCAGAGCTCCAGCCGGAG    FFFFFIF0BFFFFFIIIIIIIIIIFIBFIIIIIIIIIIIIIFFFFFFFFFFFFBB<BBBFFFFFFF<<BBBBBFFFFBFBFFFFF<BB777<    HI:i:1  NH:i:1  NM:i:1

They are erased during the remove duplicate step!

Now I want to confirm that all these reads from the 3 lanes are in the combined file. There should be 22 (6 + 4 + 12).

samtools view bam-combined/19098.1.G11.trim.sickle.sorted.combined.bam chr1:911014-911014 -c
40

That is more than I expected. How many have the exact same UMI sequence?

samtools view bam-combined/19098.1.G11.trim.sickle.sorted.combined.bam chr1:911014-911014 | grep UMI_AGAAGGGG | wc -l
40

Strange.

samtools view bam-combined/19098.1.G11.trim.sickle.sorted.combined.bam chr1:911014-911014 | grep UMI_AGAAGGGG | cut -f1 | sort | uniq | cut -d":" -f3 | uniq -c
      9 C6WYKACXX
     13 C6WURACXX
     18 C723YACXX

So there are actually more than 22 reads that have the UMI AGAAGGGG and start at chr1:911014, however, only 22 of the 40 get assigned to gene ENSG00000187583.

Exploring locus on UCSC genome browser

Could this be a length issue? Since some reads are longer than others due to the quality trimming, perhaps the longer reads intersect an exon from another gene. There is variation in read length.

samtools view bam-combined/19098.1.G11.trim.sickle.sorted.combined.bam chr1:911014-911014 | cut -f6 | sort | uniq -c
      1 36M56S
      1 40M35S
      1 40M52S
      3 42M50S
      1 43M
      1 47M45S
      2 51M41S
      1 57M35S
      1 58M16S
      2 62M30S
      1 64M28S
      1 75M17S
      1 75M4S
      1 76M16S
      2 81M11S
      1 84M
     19 92M

How many genes are in this locus?

I searched one of the 92M sequences with BLAT, and it does overlap two different Ensembl genes (red). The bottom three Ensembl transcripts (ex) are for the protein_coding gene C1orf170, or ENSG00000187642, and is on the - strand. The transcript on the top is the protein_coding gene PLEKHN1, or ENSG00000187583, and it is on the + strand.

samtools view bam-combined/19098.1.G11.trim.sickle.sorted.combined.bam chr1:911014-911014 | grep 92M | cut -f10 | head -n 1 
CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCAGAGCTCCAGCCGGAG

chr1:911014-92M

Next I searched for the smallest fragment, which is 43 bp long. 43M means 43 Match (post).

samtools view bam-combined/19098.1.G11.trim.sickle.sorted.combined.bam chr1:911014-911014 | grep 43M | cut -f10 | head -n 1 
CAGCCCAGCCCACCTGTGAGGAGGTTGCCCATCCCCTCTTTGA

chr1:911014-43M

It is in the intron of ENSG00000187642 (C1orf170), but it overlaps ENSG00000187583 (PLEKHN1). Maybe it is a strand issue.

These reads are all on the + strand (directions).

# + strand = -F 1x10
samtools view bam-combined/19098.1.G11.trim.sickle.sorted.combined.bam chr1:911014-911014 -F 0x10 | wc -l
40
# - strand = -f 0x10
samtools view bam-combined/19098.1.G11.trim.sickle.sorted.combined.bam chr1:911014-911014 -f 0x10 | wc -l
0

Because the reads map to the + strand, they can’t be assigned to ENSG00000187642 (C1orf170). Similarly RefSeq has an extra gene annotation PERM1, but it is also on the - strand.

What is going on with these reads? Maybe a soft clipped read will be informative. Using the 36M56S sequence.

samtools view bam-combined/19098.1.G11.trim.sickle.sorted.combined.bam chr1:911014-911014 | grep 36M56S | cut -f10 | head -n 1 
CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCCTGTCTCTTATACACATCTGACGCGGCAGACTTCGTATGCCGTCTTCTGCTTGAAA

chr1:911014-36M56S

Again, this overlaps nothing. It is also strange that the last 56 bp do not map to this locus at all!

Looking at all the sequences that with this UMI and start position.

samtools view bam-combined/19098.1.G11.trim.sickle.sorted.combined.bam chr1:911014-911014 | cut -f10 | sort | uniq -c
      1 CAGCCCACCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCAGAGCTCCAGCCGGAG
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCAGCCCCTCTTTGCTGTCTCTTATACACATCTGACGCGGCAGACTTCGTATGCCGTCTTCTGC
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCCTGTCTCTTATACACATCTGACGCGGCAGACTTCGTATGCCGTCTTCTGCTTGAAA
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCCTGTCTCTTATACACATCTGACGCGGCAGACTTCGTATGCCGTCTTCTGCTTGA
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTCTGTCTCTTATACACATCTGACGCGGCAGACTTCG
      2 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCCTGTCTCTTATACACATCTGACGCGGCAGACTTCGTATGCC
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCCTGTCTCTTATACACA
      2 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCAGAGCCTGTCTCTTAT
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCAGAGCTCC
     18 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCAGAGCTCCAGCCGGAG
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCCTGTCTCTTATACACA
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGTCTCTT
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGTCTCTTATACACATCTGAC
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCTGTCTCTTATACACATCTGACGCGGCAG
      2 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTGTCTCTTATACACATCTGACGCGGCAGACTTCG
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCTGTCTCTTATACACATCTGACGCGGCAGACTTCGT
      1 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCTGTCTCTTATACACATCTGACGCGGCAGACTTCGTATGCCGTCTT
      2 CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGCTGTCTCTTATACACATCTGACGCGGCAGACTTCGTATGCCGTCTTCTGC
      1 CAGCCCAGCCCACCTGTGAGGAGGTTGCCCATCCCCTCTTTGA

What a mess! So here is my current interpretation, which is all educated guesses. This molecule arises from some unannotated transcript. This sort of thing happens since most of the genome is transcribed at some low level. The varying in length is due to the fragmentation during the sonication step. The soft clipped reads are due to sequencing of the adapter. Overall there should just be one molecule here, and it shouldn’t map to any gene. featureCounts assigns the longer fragments to ENSG00000187583 (PLEKHN1) because their 3’ end overlaps it. But this makes no sense since the read arose from nowhere. I can fix this in featureCounts using the following option to only map the 5’ end of the sequence.

# Have to use 2>&1 redirection trick because featureCounts only sends its help
# to stderr
featureCounts 2>&1 | grep read2pos -A 2
  --read2pos <5:3>    Reduce reads to their 5' most base or 3' most base. Read
                      counting is then performed based on the single base the 
                      read is reduced to.

Checking if sequence length is causing the problem

So my current guess is that of those 40 sequences, the one that is kept is one of the shorter ones. Since these do not map to anything, the molecules report zero while the reads report sequences.

But no! There is a 92M sequence in the bam-rmdup-umi file!!!!

samtools view bam-rmdup-umi/19098.1.G11.trim.sickle.sorted.combined.rmdup.bam | grep UMI_AGAAGGGG | grep 911014
HWI-700819F:303:C6WYKACXX:2:1311:13724:70217:UMI_AGAAGGGG   0   chr1    911014  59  92M *   0   0   CAGCCCAGCCCACCTGTGAGGAGGCTGCCCATCCCCTCTTTGAGGCCACCCTGTGTCCTCTCCCCTGCCTCTGCCCAGAGCTCCAGCCGGAG    FFFFFIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIFFFFFFFFFFFFFBFBFFFFFBBBFFFFFFFFFFFFFFFFFFBFBBB    HI:i:1  NH:i:1  NM:i:0

Why wasn’t it assigned to ENSG00000187583 (PLEKHN1) as expected?

grep HWI-700819F:303:C6WYKACXX:2:1311:13724:70217:UMI_AGAAGGGG counts/19098.1.G11.trim.sickle.sorted.combined.rmdup.bam.featureCounts
exit 0

For some reason, featureCounts has no record of having processed this read. And this isn’t the only read that is not accounted for.

samtools view -c bam-rmdup-umi/19098.1.G11.trim.sickle.sorted.combined.rmdup.bam
140494
wc -l counts/19098.1.G11.trim.sickle.sorted.combined.rmdup.bam.featureCounts
138771 counts/19098.1.G11.trim.sickle.sorted.combined.rmdup.bam.featureCounts

This is super confusing. For the problem lane, the 4 reads were lost in the remove duplicate step. But for the combined file, the problem is at the featureCounts step.

I really don’t know what is happening here, and I want this molecule removed from the analysis anyways. I’m going to re-run the gene counting with featureCounts uisng the flag --read2pos 5 to remove this problem gene and any others, and then assess the extent of the problem.

Session information

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] knitr_1.10.5

loaded via a namespace (and not attached):
 [1] httr_0.6.1       magrittr_1.5     formatR_1.2      htmltools_0.2.6 
 [5] tools_3.2.0      RCurl_1.95-4.6   yaml_2.1.13      codetools_0.2-11
 [9] rmarkdown_0.6.1  stringi_0.4-1    digest_0.6.8     stringr_1.0.0   
[13] bitops_1.0-6     evaluate_0.7