Last updated: 2016-07-08
Code version: f3271c626c586079652907b55fa357b6f569fd26
Non-detection rate descriptives: We found that the median of non-detected cells rate is 26% across the three individuals and ranges between 21 to 30% between the three individuals.
Non-detection rate and Mean, Var, CV: Expectedly, gene abundance and variance were both associated with percent of non-detected cells; even genes with as low as 10 percent of non-detected cells were restricted in the dynamic range of abudance levels. Gene coefficient of variation was not correlated with percent of non-detected genes (Spearman’s correlation = .04) across the three individuals. Further analysis indicated that when including only cells with less than 50 percent of undetected cells, the gene CV decreases as a concave function of abundance, a commonly observed pattern in bulk RNA-seq data and in scRNA-seq when undetected cells are included in the analysis.
Compare non-detection rates: 1,958 genes were found to have significant differences in non-detection rates between the three individuals at false discovery rate less than .01.
Individual difference in non-detection rate and gene abundance: We found no correlation between p-values of tests comparing gene mean abundance and p-values of tests comparing non-detection rate, for all three pairiwse between-individual comparisons. The results also held when we investigated the correlation between fold change and p-value of tests comparing non-detection rate between indivdiuals. Furthermore, we evaluated genes with absolute log2 fold change greater than 1, 1.5, and 2 in the pairwise comparison tests, most genes with our nominated fold change difference were found to not have a significant difference in the proportion of non-detected cells.
*Note: I don’t trust the differential expression analysis results here. Over 95% of the genes in the overall F-test were to be significant at FDR < .01; same results were found for two of the pairwise comparisons (NA19098-NA19101, NA19098-NA19239). When looking at the fold change of gene expression, I found only a very small portion of the genes with log2 fold change more than 2; NA19098-NA19101: 5 genes, NA19098-NA19239: 4 genes, and NA19101-NA19101: 2 genes.
Note: An interesting observation when computing fold change suggests that on average gene abundance levels of NA19239 > NA19101> NA19098
library("dplyr")
library("limma")
library("edgeR")
library("ggplot2")
library("scales")
library("grid")
theme_set(theme_bw(base_size = 12))
source("functions.R")
library("Humanzee")
library("cowplot")
library("MASS")
library("matrixStats")
source("../code/cv-functions.r")
source("../code/plotting-functions.R")
library("mygene")
library("aod")
We import molecule counts before standardizing and transformation and also log2-transformed counts after batch-correction. Biological variation analysis of the individuals is performed on the batch-corrected and log2-transformed counts.
# Import filtered annotations
anno_filter <- read.table("../data/annotation-filter.txt",
header = TRUE,
stringsAsFactors = FALSE)
# Import filtered molecule counts
molecules_filter <- read.table("../data/molecules-filter.txt",
header = TRUE, stringsAsFactors = FALSE)
stopifnot(NROW(anno_filter) == NCOL(molecules_filter))
# Import final processed molecule counts of endogeneous genes
molecules_final <- read.table("../data/molecules-final.txt",
header = TRUE, stringsAsFactors = FALSE)
stopifnot(NROW(anno_filter) == NCOL(molecules_final))
# Import gene symbols
gene_symbols <- read.table(file = "../data/gene-info.txt", sep = "\t",
header = TRUE, stringsAsFactors = FALSE, quote = "")
# Import cell-cycle gene list
cell_cycle_genes <- read.table("../data/cellcyclegenes.txt",
header = TRUE, sep = "\t",
stringsAsFactors = FALSE)
# Import pluripotency gene list
pluripotency_genes <- read.table("../data/pluripotency-genes.txt",
header = TRUE, sep = "\t",
stringsAsFactors = FALSE)$To
Load gene CVs computed over all cells (link1), CV computed only over the expressed cells (link2), and differential CV results when including only the expressed cells (link3).
load("../data/cv-all-cells.rda")
load("../data/cv-expressed-cells.rda")
load("../data/permuted-pval-expressed-set1.rda")
Compute a matrix of 0’s and 1’s indicating non-detected and detected cells, respectively.
molecules_expressed <- molecules_filter
molecules_expressed[which(molecules_filter > 0 , arr.ind = TRUE)] <- 1
molecules_expressed <- as.matrix((molecules_expressed))
Take the gene subset included in the final data.
genes_included <- Reduce(intersect,
list(rownames(molecules_final),
rownames(expressed_cv$NA19098),
names(perm_pval_set1)) )
molecules_filter_subset <- molecules_filter[
which(rownames(molecules_filter) %in% genes_included), ]
molecules_final_subset <- molecules_final[
which(rownames(molecules_final) %in% genes_included), ]
molecules_expressed_subset <- molecules_expressed[
which(rownames(molecules_expressed) %in% genes_included), ]
molecules_final_expressed_subset <- molecules_final_subset*molecules_expressed_subset
molecules_final_expressed_subset[which(molecules_expressed_subset == 0, arr.ind = TRUE)] <- NA
permuted_pval_subset <- perm_pval_set1[which(names(perm_pval_set1) %in% genes_included)]
names(permuted_pval_subset) <- names(perm_pval_set1)[which(names(perm_pval_set1) %in% genes_included)]
expressed_cv_subset <- lapply(expressed_cv, function(x)
x[which(rownames(x) %in% genes_included), ] )
names(expressed_cv_subset) <- names(expressed_cv)
expressed_dm_subset <- expressed_dm[which(rownames(expressed_dm) %in% genes_included), , ]
dim(molecules_final_subset)
[1] 13043 564
dim(molecules_expressed_subset)
[1] 13043 564
all.equal(rownames(expressed_cv_subset$NA19098),
rownames(molecules_final_expressed_subset) )
[1] TRUE
all.equal(names(permuted_pval_subset),
rownames(molecules_expressed_subset) )
[1] TRUE
Compute drop-out rates.
drop_out <- lapply(unique(anno_filter$individual), function(ind) {
temp_df <- molecules_filter_subset[,anno_filter$individual == ind]
zero_df <- array(0, dim = dim(temp_df))
rownames(zero_df) <- rownames(temp_df)
zero_df[which(temp_df == 0, arr.ind = TRUE)] <- 1
zero_count <- rowMeans(zero_df)
return(zero_count)
})
names(drop_out) <- unique(anno_filter$individual)
drop_out$all <- rowMeans(as.matrix(molecules_filter_subset) == 0)
summary(drop_out$NA19098)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.02113 0.21130 0.32060 0.59860 0.99300
summary(drop_out$NA19101)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.04975 0.29350 0.36500 0.66170 0.99500
summary(drop_out$NA19239)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.02715 0.25340 0.35060 0.65610 0.99550
summary(drop_out$all)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00000 0.03723 0.26060 0.34820 0.64540 0.93090
Histogram of drop-out percents.
par(mfrow = c(2,2))
for (i in 1:4) {
h1 <- hist(drop_out[[i]], breaks = seq(0,1, by = .1),
plot = FALSE)
h1$density <- h1$counts/sum(h1$counts)*100
plot(h1, ylab = "Percent",
xlab = "Proportion of non-detected cells",
main = names(drop_out)[i],
freq=FALSE)
}
Proportion of genes with 50 percent or more undetected cells.
sapply(drop_out, function(x) mean(x>=.5))
NA19098 NA19101 NA19239 all
0.3091313 0.3592732 0.3378057 0.3384957
NA19098, expressed cells
par(mfrow = c(2,2))
plot(y = log10(expressed_cv_subset$NA19098$expr_mean),
x = drop_out$NA19098,
xlab = "Percent of drop-outs",
ylab = "log10 mean expression",
pch = 16, cex =.3)
lines(lowess( expressed_cv_subset$NA19098$expr_mean ~ drop_out$NA19098),
col = "red")
plot(y = log10(expressed_cv_subset$NA19098$expr_var),
x = drop_out$NA19098,
xlab = "Percent of drop-outs",
ylab = "log10 variance of expression",
pch = 16, cex =.3)
lines(lowess( log10(expressed_cv_subset$NA19098$expr_var) ~ drop_out$NA19098),
col = "red")
plot(y = expressed_cv_subset$NA19098$expr_cv,
x = drop_out$NA19098,
xlab = "Percent of drop-outs",
ylab = "CV",
pch = 16, cex =.3)
lines(lowess( expressed_cv_subset$NA19098$expr_cv ~ drop_out$NA19098),
col = "red")
plot(y = expressed_dm_subset$NA19098,
x = drop_out$NA19098,
xlab = "Percent of drop-outs",
ylab = "Corrected CV",
pch = 16, cex =.3)
lines(lowess( expressed_dm_subset$NA19098 ~ drop_out$NA19098),
col = "red")
title(main = "Expressed cells, NA19098", outer = TRUE, line = -1)
All individuals, expressed cells
par(mfrow = c(2,2))
# abundance
plot(y = log10(expressed_cv_subset$all$expr_mean),
x = drop_out$all,
xlab = "Percent of drop-outs",
ylab = "log10 mean expression",
pch = 16, cex =.3)
lines(lowess( log10(expressed_cv_subset$all$expr_mean) ~ drop_out$all),
col = "red")
# variance
plot(y = log10(expressed_cv_subset$all$expr_var),
x = drop_out$all,
xlab = "Percent of drop-outs",
ylab = "log10 variance of expression",
pch = 16, cex =.3)
lines(lowess( log10(expressed_cv_subset$all$expr_var) ~ drop_out$all),
col = "red")
# CV
plot(y = expressed_cv_subset$all$expr_cv,
x = drop_out$all,
xlab = "Percent of drop-outs",
ylab = "CV",
pch = 16, cex =.3)
lines(lowess( expressed_cv_subset$all$expr_cv ~ drop_out$all),
col = "red")
title(main = "All 3 individuals", outer = TRUE, line = -1)
CV and mean plus drop-out rates.
par(mfrow = c(2,2))
plot(y = expressed_cv_subset$NA19098$expr_cv,
x = log10(expressed_cv_subset$NA19098$expr_mean),
cex = .9, col = alpha("grey40", .8), lwd = .6,
xlab = "log10 gene mean abundance",
ylab = "CV",
main = "Expressed cells, NA19098",
xlim = c(1,5), ylim = c(0,2.5))
points(y = expressed_cv_subset$NA19098$expr_cv,
x = log10(expressed_cv_subset$NA19098$expr_mean),
col = rev(RColorBrewer::brewer.pal(10, "RdYlBu"))[
cut(drop_out$NA19098, breaks = seq(0, 1, by = .1),
include.lowest = TRUE)],
cex = .6, pch = 16)
plot(y = expressed_cv_subset$all$expr_cv,
x = log10(expressed_cv_subset$all$expr_mean),
cex = .9, col = alpha("grey40", .8), lwd = .6,
xlab = "log10 gene mean abundance",
ylab = "CV",
main = "Expressed cells, All 3 individuals",
xlim = c(1,5), ylim = c(0,2.5))
points(y = expressed_cv_subset$all$expr_cv,
x = log10(expressed_cv_subset$all$expr_mean),
col = rev(RColorBrewer::brewer.pal(10, "RdYlBu"))[
cut(drop_out$all, breaks = seq(0, 1, by = .1),
include.lowest = TRUE)],
cex = .6, pch = 16)
plot(y = expressed_dm_subset$NA19098,
x = log10(expressed_cv_subset$NA19098$expr_mean),
cex = .9, col = alpha("grey40", .8), lwd = .6,
xlab = "log10 gene mean abundance",
ylab = "Adjusted CV",
main = "Expressed cells, NA19098",
xlim = c(1,5))
points(y = expressed_dm_subset$NA19098,
x = log10(expressed_cv_subset$NA19098$expr_mean),
col = rev(RColorBrewer::brewer.pal(10, "RdYlBu"))[
cut(drop_out$NA19098, breaks = seq(0, 1, by = .1),
include.lowest = TRUE)],
cex = .6, pch = 16)
plot(1:10, pch = 15, cex = 2, axes = F, xlab = "", ylab = "",
col = rev(RColorBrewer::brewer.pal(10, "RdYlBu")), xlim = c(0, 13))
text(labels = levels(cut(drop_out$NA19098, breaks = seq(0, 1, by = .1),
include.lowest = TRUE)),
y = 1:10, x = 1:10 + 2, cex = .7)
title(main = "drop-out rate")
Fisher’s exact test for between individual comparisons. For some genes, we observed no undetected cells, a scenario that calls for Fisher’s exact test instead of Chi-square test or logistic regression.
file_name <- "../data/sig-test-zeros.rda"
if (file.exists(file_name)) {
load(file_name)
} else {
ind <- factor(anno_filter$individual,
levels = unique(anno_filter$individual),
labels = unique(anno_filter$individual) )
fit_zero <- do.call(rbind,
lapply(1:NROW(molecules_expressed_subset), function(ii) {
# lapply(1:100, function(ii) {
tab <- table(molecules_expressed_subset[ii, ], ind)
if (dim(tab)[1] == 1) {
if (as.numeric(rownames(tab)) == 1) {
tab <- rbind(rep(0,3), tab)
rownames(tab) <- c(0,1)
} else {
tab <- rbind(rep(1,3), tab)
rownames(tab) <- c(0,1)
}
}
pval <- lapply(rev(1:dim(tab)[2]),
function(i) {
fisher.test(tab[,-i])$p.value
})
names(pval) <- sapply(rev(1:dim(tab)[2]),
function(i) {
paste(colnames(tab[,-i])[1],
colnames(tab[,-i])[2],
sep = "-") })
df <- data.frame(
do.call(cbind, pval),
row.names = rownames(molecules_expressed_subset)[ii] )
colnames(df) <- names(pval)
return(df)
}) )
fit_zero_qval <- do.call(cbind,
lapply(1:3, function(i) {
stats::p.adjust(fit_zero[,i], method = "fdr")
}))
colnames(fit_zero_qval) <- colnames(fit_zero)
save(fit_zero, fit_zero_qval, file = file_name)
}
par(mfrow = c(2,2))
for (i in 1:3) {
hist(fit_zero[,i],
main = colnames(fit_zero)[i],
xlab = "p-value")
}
title(main = "p-value histogram", outer = TRUE, line = -1)
# genes with almost 1 p-value
table(rowSums(fit_zero > .999))
0 1 2 3
8865 2188 282 1708
par(mfrow = c(2,2))
for (i in 1:3) {
plot(ecdf(fit_zero[,i]),
main = colnames(fit_zero)[i],
xlab = "p-value",
axes = F)
axis(1); axis(2)
}
title(main = "p-value empirical distribution function", outer = TRUE, line = -1)
par(mfrow = c(2,2))
for (i in 1:3) {
plot(ecdf(fit_zero_qval[,i]),
main = colnames(fit_zero_qval)[i],
xlab = "q-value",
axes = F)
axis(1); axis(2)
}
title(main = "q-value empirical distribution function", outer = TRUE, line = -1)
table(rowSums(fit_zero_qval < .01))
0 1 2 3
9877 1746 1347 73
library(VennDiagram)
library(gridExtra)
genes <- rownames(molecules_expressed_subset)
overlap_list <- lapply(1:3, function(i)
genes[ which( fit_zero_qval[ ,i] < .01 ) ] )
names(overlap_list) <- colnames(fit_zero_qval)
grid.arrange(gTree(children = venn.diagram(overlap_list,filename = NULL,
category.names = names(overlap_list),
name = "q-val < .01")) )
log2fc <- data.frame(
NA19098_NA19101 = rowMeans(molecules_final_expressed_subset[,anno_filter$individual == "NA19098"], na.rm = TRUE) - rowMeans(molecules_final_expressed_subset[,anno_filter$individual == "NA19101"], na.rm = TRUE),
NA19098_NA19239 = rowMeans(molecules_final_expressed_subset[,anno_filter$individual == "NA19098"], na.rm = TRUE) - rowMeans(molecules_final_expressed_subset[,anno_filter$individual == "NA19239"], na.rm = TRUE),
NA19101_NA19239 = rowMeans(molecules_final_expressed_subset[,anno_filter$individual == "NA19101"], na.rm = TRUE) - rowMeans(molecules_final_expressed_subset[,anno_filter$individual == "NA19239"], na.rm = TRUE) )
par(mfrow = c(2,2))
plot(rank(fit_zero$`NA19098-NA19101`),
rank(log2fc$NA19098_NA19101),
pch = 16, cex = .3,
xlab = "Differential non-detection rate p-value (rank)",
ylab = "log2 fold change (rank)",
main = "NA19098-NA19101")
plot(rank(fit_zero$`NA19098-NA19239`),
rank(log2fc$NA19098_NA19239),
pch = 16, cex = .3,
xlab = "Differential non-detection rate p-value (rank)",
ylab = "log2 fold change (rank)",
main = "NA19098-NA19239")
plot(rank(fit_zero$`NA19101-NA19239`),
rank(log2fc$NA19101_NA19239),
pch = 16, cex = .3,
xlab = "Differential non-detection rate p-value (rank)",
ylab = "log2 fold change (rank)",
main = "NA19101-NA19239")
print("abs. log2 FC > 1")
[1] "abs. log2 FC > 1"
sapply(log2fc, function(x) sum(abs(x) > 1))
NA19098_NA19101 NA19098_NA19239 NA19101_NA19239
178 164 32
library(VennDiagram)
library(gridExtra)
genes <- rownames(molecules_final_subset)
overlap_list <- lapply(1:3, function(i)
list(diffNonDetection = genes[which(fit_zero_qval[,i] < .01)],
diffAbundance = genes[ which( abs(log2fc[[i]]) > 1) ]) )
names(overlap_list) <- colnames(fit_zero_qval)
grid.arrange(gTree(children = venn.diagram(overlap_list[[1]],filename = NULL,
category.names = names(overlap_list[[1]]))),
gTree(children = venn.diagram(overlap_list[[2]],filename = NULL,
category.names = names(overlap_list[[1]]))),
gTree(children = venn.diagram(overlap_list[[3]],filename = NULL,
category.names = names(overlap_list[[3]]))),
ncol = 2)
print("abs. log2 FC > 1.5")
[1] "abs. log2 FC > 1.5"
sapply(log2fc, function(x) sum(abs(x) > 1.5))
NA19098_NA19101 NA19098_NA19239 NA19101_NA19239
12 12 10
genes <- rownames(molecules_final_subset)
overlap_list <- lapply(1:3, function(i)
list(diffNonDetection = genes[which(fit_zero_qval[,i] < .01)],
diffAbundance = genes[ which( abs(log2fc[[i]]) > 1.5) ]) )
names(overlap_list) <- colnames(fit_zero_qval)
grid.arrange(gTree(children = venn.diagram(overlap_list[[1]],filename = NULL,
category.names = names(overlap_list[[1]]))),
gTree(children = venn.diagram(overlap_list[[2]],filename = NULL,
category.names = names(overlap_list[[1]]))),
gTree(children = venn.diagram(overlap_list[[3]],filename = NULL,
category.names = names(overlap_list[[3]]))),
ncol = 2)
print("abs. log2 FC > 2")
[1] "abs. log2 FC > 2"
sapply(log2fc, function(x) sum(abs(x) > 2))
NA19098_NA19101 NA19098_NA19239 NA19101_NA19239
5 6 2
genes <- rownames(molecules_final_subset)
overlap_list <- lapply(1:3, function(i)
list(diffNonDetection = genes[which(fit_zero_qval[,i] < .01)],
diffAbundance = genes[ which( abs(log2fc[[i]]) > 2) ]) )
names(overlap_list) <- colnames(fit_zero_qval)
grid.arrange(gTree(children = venn.diagram(overlap_list[[1]],filename = NULL,
category.names = names(overlap_list[[1]]))),
gTree(children = venn.diagram(overlap_list[[2]],filename = NULL,
category.names = names(overlap_list[[1]]))),
gTree(children = venn.diagram(overlap_list[[3]],filename = NULL,
category.names = names(overlap_list[[3]]))),
ncol = 2)
NA19098 vs. NA19101: diff non-detection,log2FC > 1.5
# genes with the largest difference in non-detection rates
gene_elect <- intersect(
rownames(fit_zero)[which(fit_zero_qval[,1] < .01)],
rownames(log2fc)[ which(abs(log2fc$NA19098_NA19101) > 1.5) ] )
par(mfrow = c(3,2))
for (i in 1:6) {
plot_density_overlay(
molecules = molecules_final_expressed_subset,
annotation = anno_filter,
which_gene = gene_elect[i],
labels = "",
xlims = c(0,15), ylims = NULL,
gene_symbols = gene_symbols)
}
par(mfrow = c(2,2))
plot(x = rank(matrixStats::rowProds(as.matrix(fit_zero), na.rm =TRUE)),
y = rank(permuted_pval_subset),
pch = 16, cex = .3,
ylab ="Diff. CV p-value (rank)",
xlab = "Diff non-detection rate p-values product (rank)",
main = "All 3 individuals")
for (i in 1:3) {
plot(x = rank(fit_zero[,i]),
y = rank(permuted_pval_subset),
pch = 16, cex = .3,
ylab ="Diff. CV p-value (rank)",
xlab = "Diff non-detection rate p-values (rank)",
main = colnames(fit_zero)[i] )
}
Overlap of genes with at least one significant pairwise difference in proportion of detected genes, and genes with significant differential CV.
# subset expression data of genes that were included
# in the significance testing
rownames(fit_zero_qval) <- rownames(fit_zero)
library(VennDiagram)
library(gridExtra)
overlap_list_diff <- list(
diffCV = names(permuted_pval_subset)[which(permuted_pval_subset == 0)],
diffNonDetection = rownames(fit_zero_qval)[
which(rowSums(fit_zero_qval < .01, na.rm = TRUE) > 1)
])
grid.arrange(gTree(children = venn.diagram(overlap_list_diff,filename = NULL,
category.names = names(overlap_list_diff),
name = "Diff. CV & non-detected rate")))
# genes with the largest difference in non-detection rates
gene_elect <- intersect(
rownames(fit_zero)[which(rowSums(fit_zero_qval < .01, na.rm = TRUE) == 1)],
rownames(log2fc)[ which(permuted_pval_subset < .01) ] )
par(mfrow = c(3,2))
for (i in 1:6) {
plot_density_overlay(
molecules = molecules_final_expressed_subset,
annotation = anno_filter,
which_gene = gene_elect[i],
labels = "",
xlims = c(0,15), ylims = NULL,
gene_symbols = gene_symbols)
}
Overlap of genes with significant difference in proportion of undetected genes between all three pairs and overlap of genes with significant differential CV genes.
# subset expression data of genes that were included
# in the significance testing
rownames(fit_zero_qval) <- rownames(fit_zero)
library(VennDiagram)
library(gridExtra)
overlap_list_diff <- list(
diffCV = names(permuted_pval_subset)[which(permuted_pval_subset == 0)],
diffNonDetection = rownames(fit_zero_qval)[
which(rowSums(fit_zero_qval < .01, na.rm = TRUE) == 3)
])
grid.arrange(gTree(children = venn.diagram(overlap_list_diff,filename = NULL,
category.names = names(overlap_list_diff),
name = "Diff. CV & non-detected rate")))
# genes with the largest difference in non-detection rates
gene_elect <- intersect(
rownames(fit_zero)[which(rowSums(fit_zero_qval < .01, na.rm = TRUE) == 3)],
rownames(log2fc)[ which(permuted_pval_subset < .01) ] )
par(mfrow = c(3,2))
for (i in 1:6) {
plot_density_overlay(
molecules = molecules_final_expressed_subset,
annotation = anno_filter,
which_gene = gene_elect[i],
labels = "",
xlims = c(0,15), ylims = NULL,
gene_symbols = gene_symbols)
}
# subset expression data of genes that were included
# in the significance testing
rownames(fit_zero_qval) <- rownames(fit_zero)
library(VennDiagram)
library(gridExtra)
overlap_list_diff <- list(
difflog2FC = rownames(log2fc)[which(abs(log2fc$NA19098_NA19101) > 1)],
diffCV = names(permuted_pval_subset)[which(permuted_pval_subset == 0)],
diffNonDetection = rownames(fit_zero_qval)[
which(fit_zero_qval[,1] < .01) ] )
grid.arrange(gTree(children = venn.diagram(overlap_list_diff,filename = NULL,
category.names = names(overlap_list_diff))))
pluri <- rownames(fit_zero)[which(rownames(fit_zero) %in% pluripotency_genes)]
symbol <- gene_symbols$external_gene_name[match(pluri, gene_symbols$ensembl_gene_id)]
zero_prop <- do.call(cbind,
lapply(unique(anno_filter$individual), function(ind) {
rowMeans(molecules_expressed_subset[, anno_filter$individual == ind] == 0,
na.rm = TRUE)
}) )
colnames(zero_prop) <- unique(anno_filter$individual)
names(zero_prop) <- unique(anno_filter$individual)
summary_table <- data.frame(
gene_symbo = symbol,
diffCVpval = permuted_pval_subset[which(names(permuted_pval_subset) %in% pluripotency_genes)] < .01,
fit_zero_qval[which(rownames(fit_zero) %in% pluripotency_genes), ] < .01,
round(log2fc[which(rownames(log2fc) %in% pluripotency_genes), ], digits = 2),
round(zero_prop[which(rownames(log2fc) %in% pluripotency_genes), ], digits = 2),
stringsAsFactors = FALSE)
# The two significant CV genes
#DNMT3B
table(molecules_expressed_subset[which( rownames(molecules_expressed_subset) %in% "ENSG00000088305"), ])
1
564
summary(unlist(molecules_final_expressed_subset[which( rownames(molecules_final_expressed_subset) %in% "ENSG00000088305"), ]) )
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.80 12.03 12.35 12.36 12.71 13.72
# N66A1
table(molecules_expressed_subset[which( rownames(molecules_expressed_subset) %in% "ENSG00000148200"), ])
1
564
summary(unlist(molecules_final_expressed_subset[which( rownames(molecules_final_expressed_subset) %in% "ENSG00000148200"), ]) )
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.23 11.21 11.50 11.50 11.79 12.91
# NANOG
table(molecules_expressed_subset[which( rownames(molecules_expressed_subset) %in% "ENSG00000111704"), ])
0 1
125 439
summary(unlist(molecules_final_expressed_subset[which( rownames(molecules_final_expressed_subset) %in% "ENSG00000111704"), ]) )
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
5.977 7.184 7.678 7.784 8.368 10.550 125
#ZFP42
table(molecules_expressed_subset[which( rownames(molecules_expressed_subset) %in% "ENSG00000179059"), ])
0 1
1 563
summary(unlist(molecules_final_expressed_subset[which( rownames(molecules_final_expressed_subset) %in% "ENSG00000179059"), ]) )
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
6.556 9.722 10.380 10.220 10.810 11.770 1
kable(summary_table)
gene_symbo | diffCVpval | NA19098.NA19101 | NA19098.NA19239 | NA19101.NA19239 | NA19098_NA19101 | NA19098_NA19239 | NA19101_NA19239 | NA19098 | NA19101 | NA19239 | |
---|---|---|---|---|---|---|---|---|---|---|---|
ENSG00000187140 | FOXD3 | FALSE | FALSE | TRUE | FALSE | -0.49 | -0.10 | 0.39 | 0.30 | 0.46 | 0.53 |
ENSG00000156574 | NODAL | FALSE | TRUE | FALSE | FALSE | -0.15 | -0.49 | -0.34 | 0.82 | 0.96 | 0.93 |
ENSG00000171794 | UTF1 | FALSE | FALSE | TRUE | TRUE | -0.68 | -0.51 | 0.17 | 0.89 | 0.89 | 0.98 |
ENSG00000184344 | GDF3 | FALSE | FALSE | TRUE | TRUE | -0.37 | -0.10 | 0.26 | 0.18 | 0.29 | 0.45 |
ENSG00000111704 | NANOG | FALSE | FALSE | FALSE | TRUE | -0.84 | -0.25 | 0.58 | 0.18 | 0.13 | 0.33 |
ENSG00000102554 | KLF5 | TRUE | FALSE | TRUE | FALSE | -0.91 | -0.41 | 0.50 | 0.99 | 0.96 | 0.87 |
ENSG00000088305 | DNMT3B | TRUE | FALSE | FALSE | FALSE | -0.51 | 0.31 | 0.81 | 0.00 | 0.00 | 0.00 |
ENSG00000101115 | SALL4 | TRUE | FALSE | FALSE | FALSE | -0.33 | -0.09 | 0.24 | 0.00 | 0.00 | 0.00 |
ENSG00000163530 | DPPA2 | TRUE | TRUE | TRUE | FALSE | -0.60 | -0.41 | 0.19 | 0.67 | 0.84 | 0.89 |
ENSG00000121570 | DPPA4 | FALSE | FALSE | FALSE | FALSE | -0.13 | -0.11 | 0.03 | 0.00 | 0.00 | 0.00 |
ENSG00000181449 | SOX2 | FALSE | FALSE | FALSE | FALSE | -0.44 | -0.32 | 0.12 | 0.00 | 0.00 | 0.00 |
ENSG00000179059 | ZFP42 | FALSE | FALSE | FALSE | FALSE | -1.37 | -1.34 | 0.03 | 0.01 | 0.00 | 0.00 |
ENSG00000164362 | TERT | FALSE | FALSE | FALSE | FALSE | -0.42 | -0.18 | 0.24 | 0.62 | 0.54 | 0.58 |
ENSG00000204531 | POU5F1 | FALSE | FALSE | FALSE | FALSE | -0.60 | -0.29 | 0.31 | 0.00 | 0.00 | 0.00 |
ENSG00000124762 | CDKN1A | FALSE | FALSE | TRUE | TRUE | -0.38 | -0.32 | 0.06 | 0.59 | 0.60 | 0.91 |
ENSG00000203909 | DPPA5 | FALSE | FALSE | TRUE | TRUE | -0.54 | -0.14 | 0.39 | 0.66 | 0.65 | 0.99 |
ENSG00000128567 | PODXL | FALSE | FALSE | FALSE | FALSE | -0.46 | 0.10 | 0.56 | 0.00 | 0.00 | 0.00 |
ENSG00000136997 | MYC | FALSE | FALSE | FALSE | TRUE | -1.07 | -0.03 | 1.04 | 0.04 | 0.03 | 0.11 |
ENSG00000136826 | KLF4 | TRUE | FALSE | FALSE | FALSE | -0.70 | -0.78 | -0.09 | 0.96 | 0.93 | 0.86 |
ENSG00000148200 | NR6A1 | FALSE | FALSE | FALSE | FALSE | -0.35 | -0.25 | 0.10 | 0.00 | 0.00 | 0.00 |
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel grid stats graphics grDevices utils
[8] datasets methods base
other attached packages:
[1] broman_0.59-5 gridExtra_2.0.0 VennDiagram_1.6.9
[4] aod_1.3 mygene_1.2.3 GenomicFeatures_1.20.1
[7] AnnotationDbi_1.30.1 Biobase_2.28.0 GenomicRanges_1.20.5
[10] GenomeInfoDb_1.4.0 IRanges_2.2.4 S4Vectors_0.6.0
[13] BiocGenerics_0.14.0 matrixStats_0.14.0 MASS_7.3-40
[16] cowplot_0.3.1 Humanzee_0.1.0 scales_0.4.0
[19] ggplot2_1.0.1 edgeR_3.10.2 limma_3.24.9
[22] dplyr_0.4.2 knitr_1.10.5
loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 lattice_0.20-31
[3] Rsamtools_1.20.4 Biostrings_2.36.1
[5] assertthat_0.1 digest_0.6.8
[7] chron_2.3-45 R6_2.1.1
[9] plyr_1.8.3 futile.options_1.0.0
[11] acepack_1.3-3.3 RSQLite_1.0.0
[13] evaluate_0.7 highr_0.5
[15] sqldf_0.4-10 httr_0.6.1
[17] zlibbioc_1.14.0 rpart_4.1-9
[19] gsubfn_0.6-6 rmarkdown_0.6.1
[21] proto_0.3-10 splines_3.2.0
[23] BiocParallel_1.2.2 stringr_1.0.0
[25] foreign_0.8-63 RCurl_1.95-4.6
[27] biomaRt_2.24.0 munsell_0.4.3
[29] rtracklayer_1.28.4 htmltools_0.2.6
[31] nnet_7.3-9 Hmisc_3.16-0
[33] XML_3.98-1.2 GenomicAlignments_1.4.1
[35] bitops_1.0-6 jsonlite_0.9.16
[37] gtable_0.1.2 DBI_0.3.1
[39] magrittr_1.5 formatR_1.2
[41] stringi_1.0-1 XVector_0.8.0
[43] reshape2_1.4.1 latticeExtra_0.6-26
[45] futile.logger_1.4.1 Formula_1.2-1
[47] lambda.r_1.1.7 RColorBrewer_1.1-2
[49] tools_3.2.0 survival_2.38-1
[51] yaml_2.1.13 colorspace_1.2-6
[53] cluster_2.0.1