Last updated: 2016-03-15
Code version: 1db9308a307e8e4185c5d233188ba21e5996e560
ngsplot calculates coverage of genomic regions. It provides the option to calcalate coverage over the exons for each gene, which is ideal for RNA-seq data. Furthermore, it has the option to filter the reads based on which strand they map to. See here for an explanation of all the program arguments.
library("biomaRt")
I initially could not run ngsplot because of the ERCC chromosomes in my BAM files. ngsplot requires all the chromosomes names to either start with “chr” or not start with “chr”. They do this to accomodate reads mapped to UCSC or Ensembl named chromosomes, but I found this causes problems with mitochondrial reads. Because the ngsplot annotation uses the UCSC chromosome names like in my BAM files, I can overide this check, which I did by having the function chrTag
always return TRUE
(see lib/coverage.r).
The above worked when focusing on endogenous genes. For the ERCC, I provided my own custom BED file with the ERCC regions. Custom BED files get passed through the function chromFormat, which adds “chr” to the beginning of each chromosome name. One option would have been to have chrTag
always return FALSE
, but then the code would have to be edited by hand before running either the endogenous or ERCC gene analysis. Thus I simply disabled chromFormat so that it does not add “chr” to the ERCC chromosome names.
For the ngsplot analyses, I use the BAM files of single cell molecules created in the coverage analysis where I used the genomation library.
Instead of using all genes, you can provide ngsplot with a list of Ensembl genes to subset (flag -E
).
First I create a list of all genes that passed the expression filter, separate files for the endogenous and ERCC genes.
Input filtered molecule counts.
molecules_filter <- read.table("../data/molecules-filter.txt", header = TRUE,
stringsAsFactors = FALSE)
# Endogenous
molecules_filter_endo <- molecules_filter[grep("ENSG", rownames(molecules_filter)), ]
cat(rownames(molecules_filter_endo),
file = "../data/endogenous-filter.txt", sep = "\n")
Next I create four gene lists corresponding to the quartiles of their mean expression level.
Endogenous genes
mean_expr <- rowMeans(molecules_filter_endo)
expr_q <- quantile(mean_expr)
genes_by_quartile <- cut(mean_expr, breaks = expr_q, include.lowest = TRUE)
# Purposely converting the factor to the underlying numeric representation
genes_by_quartile <- as.numeric(genes_by_quartile)
names(genes_by_quartile) <- names(mean_expr)
for (i in 1:4) {
genes <- names(genes_by_quartile)[genes_by_quartile == i]
q_fname <- paste0("../data/endogenous-filter-q", i, ".txt")
cat(genes, file = q_fname, sep = "\n")
}
Next I create four gene lists corresponding to the quartiles of their maximum transcript length.
ensembl <- useMart(host = "grch37.ensembl.org",
biomart = "ENSEMBL_MART_ENSEMBL",
dataset = "hsapiens_gene_ensembl")
len_transcripts <- getBM(attributes = c("ensembl_gene_id", "transcript_length"),
filters = "ensembl_gene_id",
values = rownames(molecules_filter_endo),
mart = ensembl)
len_genes <- tapply(len_transcripts$transcript_length,
len_transcripts$ensembl_gene_id, max)
stopifnot(rownames(molecules_filter_endo) %in% names(len_genes),
nrow(molecules_filter_endo) == length(len_genes))
len_q <- quantile(len_genes)
genes_by_len <- cut(len_genes, breaks = len_q, include.lowest = TRUE)
# Purposely converting the factor to the underlying numeric representation
genes_by_len <- as.numeric(genes_by_len)
names(genes_by_len) <- names(len_genes)
for (i in 1:4) {
genes <- names(genes_by_len)[genes_by_len == i]
q_fname <- paste0("../data/endogenous-filter-len-q", i, ".txt")
cat(genes, file = q_fname, sep = "\n")
}
To run multiple BAM files or gene lists at the same time, you can create a configuration file to pass to ngsplot instead of a single BAM file (flag -C
).
The first configuration file contains the BAM files for one example single cell per individual. The gene set is all filtered genes.
cat ../data/ngsplot-molecules.txt
# base command: ngs.plot.r -G hg19 -C ../data/ngsplot-molecules.txt -L 1000 -FL 100
# For different genomic features: -R tss/genebody/tes
# For different strands: -SS both/same/opposite
# For rnaseq mode for genebody: -F rnaseq
# For more info: https://github.com/shenlab-sinai/ngsplot/wiki/ProgramArguments101
../data/19098.1.A01.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter.txt "NA19098.r1.A01"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter.txt "NA19101.r1.A02"
../data/19239.1.A01.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter.txt "NA19239.r1.A01"
The second runs the BAM file for 19101 for different subsets of expression level.
cat ../data/ngsplot-expression.txt
# Run:
# ngs.plot.r -G hg19 -C ../data/ngsplot-expression.txt -L 1000 -FL 100 \
# -R genebody -F rnaseq
# For different strands: -SS both/same/opposite
# For more info: https://github.com/shenlab-sinai/ngsplot/wiki/ProgramArguments101
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter-q1.txt "q1"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter-q2.txt "q2"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter-q3.txt "q3"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter-q4.txt "q4"
The third runs the BAM file for 19101 for different subsets of gene length.
cat ../data/ngsplot-length.txt
# Run:
# ngs.plot.r -G hg19 -C ../data/ngsplot-length.txt -L 1000 -FL 100 \
# -R genebody -F rnaseq
# For different strands: -SS both/same/opposite
# For more info: https://github.com/shenlab-sinai/ngsplot/wiki/ProgramArguments101
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter-len-q1.txt "q1"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter-len-q2.txt "q2"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter-len-q3.txt "q3"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/endogenous-filter-len-q4.txt "q4"
I wrote a Makefile, make-ngsplot, that runs the three configuration files for different features and strand specificity.
For the ERCC genes, I do the same as above. The main difference is that I need to create a BED file because the ERCC genes are not in the database.
Import the SAF file of exons used with featureCounts. Filter to only include ERCC genes and convert to BED format.
exons <- read.delim("../data/exons.saf", stringsAsFactors = FALSE)
molecules_filter_ercc <- molecules_filter[grep("ERCC", rownames(molecules_filter)), ]
exons_ercc <- exons[exons$GeneID %in% rownames(molecules_filter_ercc), ]
ercc_bed <- data.frame(chr = exons_ercc$Chr,
start = exons_ercc$Start - 1,
end = exons_ercc$End,
name = exons_ercc$GeneID,
score = 0,
strand = exons_ercc$Strand,
stringsAsFactors = FALSE)
Filter the BED file to only include ERCC that passed the expression filter.
ercc_bed_filter <- ercc_bed[ercc_bed$name %in% rownames(molecules_filter_ercc), ]
Save a file with all ERCC that passed the filter.
write.table(ercc_bed_filter, file = "../data/ercc-filter.bed",
quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
Split by expression level.
stopifnot(ercc_bed_filter$name == rownames(molecules_filter_ercc))
mean_expr <- rowMeans(molecules_filter_ercc)
expr_q <- quantile(mean_expr)
genes_by_quartile <- cut(mean_expr, breaks = expr_q, include.lowest = TRUE)
# Purposely converting the factor to the underlying numeric representation
genes_by_quartile <- as.numeric(genes_by_quartile)
names(genes_by_quartile) <- names(mean_expr)
for (i in 1:4) {
genes <- names(genes_by_quartile)[genes_by_quartile == i]
q_fname <- paste0("../data/ercc-filter-q", i, ".bed")
write.table(ercc_bed_filter[ercc_bed_filter$name %in% genes, ], file = q_fname,
quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}
Split by length. Because the coordinates are 0-based and the end position is exlusive, the length for the ERCC is just the end position.
len_genes <- ercc_bed_filter$end
names(len_genes) <- ercc_bed_filter$name
stopifnot(rownames(molecules_filter_ercc) %in% names(len_genes),
nrow(molecules_filter_ercc) == length(len_genes))
len_q <- quantile(len_genes)
genes_by_len <- cut(len_genes, breaks = len_q, include.lowest = TRUE)
# Purposely converting the factor to the underlying numeric representation
genes_by_len <- as.numeric(genes_by_len)
names(genes_by_len) <- names(len_genes)
for (i in 1:4) {
genes <- names(genes_by_len)[genes_by_len == i]
q_fname <- paste0("../data/ercc-filter-len-q", i, ".bed")
write.table(ercc_bed_filter[ercc_bed_filter$name %in% genes, ], file = q_fname,
quote = FALSE, sep = "\t", row.names = FALSE, col.names = FALSE)
}
Made configuration files.
cat ../data/ngsplot-molecules-ercc.txt
# base command: ngs.plot.r -G hg19 -L 100 -FL 100 -R bed -C ../data/ngsplot-molecules-ercc.txt
# For different strands: -SS both/same/opposite
# For more info: https://github.com/shenlab-sinai/ngsplot/wiki/ProgramArguments101
../data/19098.1.A01.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter.bed "NA19098.r1.A01"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter.bed "NA19101.r1.A02"
../data/19239.1.A01.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter.bed "NA19239.r1.A01"
The second runs the BAM file for 19101 for different subsets of expression level.
cat ../data/ngsplot-expression-ercc.txt
# Run:
# ngs.plot.r -G hg19 -L 100 -FL 100 -R bed -C ../data/ngsplot-expression-ercc.txt
# For different strands: -SS both/same/opposite
# For more info: https://github.com/shenlab-sinai/ngsplot/wiki/ProgramArguments101
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter-q1.bed "q1"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter-q2.bed "q2"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter-q3.bed "q3"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter-q4.bed "q4"
The third runs the BAM file for 19101 for different subsets of gene length.
cat ../data/ngsplot-length-ercc.txt
# Run:
# ngs.plot.r -G hg19 -L 100 -FL 100 -R bed -C ../data/ngsplot-length-ercc.txt
# For different strands: -SS both/same/opposite
# For more info: https://github.com/shenlab-sinai/ngsplot/wiki/ProgramArguments101
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter-len-q1.bed "q1"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter-len-q2.bed "q2"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter-len-q3.bed "q3"
../data/19101.1.A02.trim.sickle.sorted.combined.rmdup.sorted.bam ../data/ercc-filter-len-q4.bed "q4"
And made a similar Makefile: make-ngsplot-ercc.
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] biomaRt_2.24.0 cowplot_0.3.1 ggplot2_1.0.1 tidyr_0.2.0
[5] knitr_1.10.5 rmarkdown_0.6.1
loaded via a namespace (and not attached):
[1] Rcpp_0.12.0 AnnotationDbi_1.30.1 magrittr_1.5
[4] IRanges_2.2.4 BiocGenerics_0.14.0 MASS_7.3-40
[7] munsell_0.4.2 colorspace_1.2-6 GenomeInfoDb_1.4.0
[10] stringr_1.0.0 httr_0.6.1 plyr_1.8.3
[13] tools_3.2.0 parallel_3.2.0 grid_3.2.0
[16] Biobase_2.28.0 gtable_0.1.2 DBI_0.3.1
[19] htmltools_0.2.6 yaml_2.1.13 digest_0.6.8
[22] reshape2_1.4.1 formatR_1.2 S4Vectors_0.6.0
[25] bitops_1.0-6 RCurl_1.95-4.6 RSQLite_1.0.0
[28] evaluate_0.7 labeling_0.3 stringi_0.4-1
[31] scales_0.2.4 stats4_3.2.0 XML_3.98-1.2
[34] proto_0.3-10