Gene filtering and sample filtering

Summary
Import data
Over-expressed genes
Filter out lowly-expressed genes
Combine filters
Make filtered data
Session information

Last updated: 2018-04-10

Code version: 77d10d4

Summary

I performed gene filtering based on the criterion set forth in our previous paper.

Remove outlier genes: molecule counts > 4,096 in any sample (x is the theoretical maximum of UMI count for 6-bp UMI)

Results There’s one, and turns out this over-expressed gene is one of the mitochrondrial genes.

Output: gene annotation saved in ../output/gene-filtering.Rmd/over-expressed-genes-info.csv

\(~\)

Remove lowly expressed genes: Lowly-expressed genes := gene mean < 2 CPM.

Results: * Of 20,421 genes, 7,628 genes are classifed as lowly-expressed. Of these, 34 are ERCC control genes and 7,594 are endogenoeus genes.

Output: gene annotation saved in ../output/gene-filtering.Rmd/lowly-expressed-genes-info.csv

Include 1025 samples classifed as high quality samples (of all the 1536 samples).
Finally, create a filtered dataset (eset_filtered) that includes 1025 samples and 12792 genes and save in ../output/gene-filtering.Rmd/eset-gene-filtering.rds.

\(~\)

Import data

Combine eSet objects.

library(knitr)
library(Biobase)
library(dplyr)
#library(gdata)
library(heatmap3)
library(testit)
library(cowplot)
library(biomaRt)
library(knitr)
library(data.table)

source("../code/utility.R")
eset <- readRDS("../data/eset-raw.rds")

Filter out low-quality single cell samples.

pdata_filter <- pData(eset)[pData(eset)$filter_all == TRUE,]
count_filter <- exprs(eset[,pData(eset)$filter_all == TRUE])
dim(count_filter)

[1] 20421  1025

\(~\)

Over-expressed genes

There’s one, and turns out this over-expressed gene.

which_over_expressed <- which(apply(count_filter, 1, function(x) any(x>(4^6)) ))
over_expressed_genes <- rownames(count_filter)[which_over_expressed]
over_expressed_genes

[1] "ENSG00000198886"

Get over-expressed gene info via biomaRt.

over_expressed_genes_info <- getBM(
  attributes = c("ensembl_gene_id", "chromosome_name",
                 "external_gene_name", "transcript_count",
                 "description"),
  filters = "ensembl_gene_id",
  values = over_expressed_genes,
  mart = ensembl)

fwrite(over_expressed_genes_info, 
       file = "../output/gene-filtering.Rmd/over-expressed-genes-info.csv")

\(~\)

Filter out lowly-expressed genes

Of 20,421 genes, 7,864 genes are classifed as lowly-expressed. Of these, 34 are ERCC control genes and 7,830 are endogenoeus genes.

Compute CPM

cpm <- t(t(count_filter)/colSums(count_filter))*(10^6)

Lowly-expressed genes := gene mean < 2 CPM

which_lowly_expressed <- which(rowMeans(cpm) < 2)
length(which_lowly_expressed)

[1] 7628

which_lowly_expressed_genes <- rownames(cpm)[which_lowly_expressed]

length(grep("ERCC", which_lowly_expressed_genes))

[1] 34

length(grep("ENSG", which_lowly_expressed_genes))

[1] 7594

Get gene info via biomaRt.

lowly_expressed_genes_info <- getBM(
  attributes = c("ensembl_gene_id", "chromosome_name",
                 "external_gene_name", "transcript_count",
                 "description"),
  filters = "ensembl_gene_id",
  values = which_lowly_expressed_genes[grep("ENSG", which_lowly_expressed_genes)],
  mart = ensembl)

fwrite(lowly_expressed_genes_info, 
       file = "../output/gene-filtering.Rmd/lowly-expressed-genes-info.csv")

\(~\)

Combine filters

Including 16,460 genes.

gene_filter <- unique(c(which_lowly_expressed, which_over_expressed))

genes_to_include <- setdiff(1:nrow(count_filter), gene_filter)
length(genes_to_include)

[1] 12792

\(~\)

Make filtered data

eset_filtered <- eset[genes_to_include, pData(eset)$filter_all==TRUE]
eset_filtered

ExpressionSet (storageMode: lockedEnvironment)
assayData: 12792 features, 1025 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: 20170905-A01 20170905-A02 ... 20170924-H12 (1025
    total)
  varLabels: experiment well ... filter_all (43 total)
  varMetadata: labelDescription
featureData
  featureNames: EGFP ENSG00000000003 ... mCherry (12792 total)
  fvarLabels: chr start ... source (6 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

saveRDS(eset_filtered, 
        file = "../output/gene-filtering.Rmd/eset-gene-filtering.rds")

Session information

R version 3.4.1 (2017-06-30)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Scientific Linux 7.2 (Nitrogen)

Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] data.table_1.10.4-3 biomaRt_2.34.2      cowplot_0.9.2      
 [4] ggplot2_2.2.1       testit_0.7          heatmap3_1.1.1     
 [7] dplyr_0.7.4         Biobase_2.38.0      BiocGenerics_0.24.0
[10] knitr_1.20         

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16         pillar_1.2.1         compiler_3.4.1      
 [4] git2r_0.21.0         plyr_1.8.4           bindr_0.1.1         
 [7] prettyunits_1.0.2    progress_1.1.2       bitops_1.0-6        
[10] tools_3.4.1          bit_1.1-12           digest_0.6.15       
[13] memoise_1.1.0        RSQLite_2.0          evaluate_0.10.1     
[16] tibble_1.4.2         gtable_0.2.0         pkgconfig_2.0.1     
[19] rlang_0.2.0          DBI_0.8              yaml_2.1.18         
[22] bindrcpp_0.2         httr_1.3.1           stringr_1.3.0       
[25] IRanges_2.12.0       S4Vectors_0.16.0     bit64_0.9-7         
[28] stats4_3.4.1         rprojroot_1.3-2      grid_3.4.1          
[31] glue_1.2.0           R6_2.2.2             AnnotationDbi_1.40.0
[34] XML_3.98-1.10        rmarkdown_1.9        blob_1.1.0          
[37] fastcluster_1.1.24   magrittr_1.5         backports_1.1.2     
[40] scales_0.5.0         htmltools_0.3.6      assertthat_0.2.0    
[43] colorspace_1.3-2     stringi_1.1.7        RCurl_1.95-4.10     
[46] lazyeval_0.2.1       munsell_0.4.3

This R Markdown site was created with workflowr

Gene filtering and sample filtering

Joyce Hsiao

Summary

Import data

Over-expressed genes

Filter out lowly-expressed genes

Combine filters

Make filtered data

Session information