Last updated: 2018-08-15
Code version: ede57c0
The initial setup only has to be performed once for each computer that you use.
To ensure all contributors are using the same computational environment, we use conda to manage software dependencies (made possible by the bioconda and conda-forge projects). Please complete the following steps to replicate the computing environment. Note that this is only guaranteed to work on a Linux-64 based architecture, but in theory should be able to work on macOS as well. All commands shown below are intended to be run in a Bash shell from the root of the project directory.
Install Git and register for an account on GitHub
bash Downloads/Miniconda3-latest-MacOSX-x86_64.sh
ENTER
to start installationq
when finished reading the license and then type yes
ENTER
to accept the default of ~/miniconda3
)yes
to add Miniconda to your PATH
variablesource ~/.bash_profile
on macOS or source ~/.bashrc
on Linux to update PATH
which conda
to confirm that conda
was installedClone this repository (or your personal fork)
git clone https://github.com/jdblischak/singlecell-qtl.git
cd singlecell-qtl
Create the conda environment "scqtl" using environment.yaml
conda env create --file environment.yaml
To use the conda environment, you must first activate it by running source activate scqtl
. This will override your default settings for R, Python, and various other software packages. When you are done working on this project, you can either logout of the current session or deactivate the environment.
# Activate conda environment scqtl
source activate scqtl
# Deactivate conda environment
source deactivate
Initialize git-lfs and download the latest version of large data files
git lfs install
git lfs pull
Warning: If you are using RStudio, you need to ensure that it recognizes your conda environment. If you launch RStudio by clicking on an icon, it doesn't use the current environment you have configured in your shell. On a Linux-based system, the solution is to launch RStudio directly from the shell with rstudio
. On macOS, running open -a rstudio .
from the shell causes RStudio to recognize most of the environment variables, but strangely it does not set the correct library path to the conda R packages. Suggestions for how to fix this are welcome.
Once you have performed the initial setup on your computer, the regular setup you have to perform each time you want to work on the project is much quicker. Navigate to the singlecell-qtl directory, activate the conda environment, and pull the latest changes from the GitHub repository.
cd singlecell-qtl
source activate scqtl
git pull
If there are updates to environment.yaml
, you can update the "scqtl" conda environment. You may want to do this occasionally just to be sure you have all the software you need.
conda env update --file environment.yaml
The main steps of the bioinformatics pipeline are:
umi_tools extract
)umi_tools dedup
; output referred to as molecules)The pipeline is implemented with Snakemake. The Snakefile controls this process, ensuring that output files are updated when upstream files are changed. Options for the pipeline are set in config.yaml, options for submission to Slurm are set in cluster.json, the Bash script for executing Snakemake is in submit-snakemake.sh, and the Slurm sbatch file for submitting the pipeline is snakemake.sbatch.
The processed data for each C1 chip is stored in its own RDS file, a binary format for storing R objects. The RDS file contains an ExpressionSet object. The ExpressionSet class stores the RNA-seq counts along with metadata on the samples, features (e.g. genes), and experimental details.
To import the data, use the R function readRDS()
. Because the ExpressionSet class is defined in the Bioconductor package Biobase, it will be loaded the first time you access the loaded ExpressionSet object. To avoid the long startup message that the Biobase package prints, here it is loaded prior to importing the object.
suppressPackageStartupMessages(library("Biobase"))
eset <- readRDS("../data/eset/03162017.rds")
Executing the object in the R console provides an overall summary of what it contains.
eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54616 features, 96 samples
element names: exprs
protocolData: none
phenoData
sampleNames: 03162017-A01 03162017-A02 ... 03162017-H12 (96
total)
varLabels: experiment well ... valid_id (40 total)
varMetadata: labelDescription
featureData
featureNames: ENSG00000000003 ENSG00000000005 ... WBGene00235374
(54616 total)
fvarLabels: chr start ... source (6 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:
But each element of the object can be extracted individually, for example:
# Dimensions
dim(eset)
Features Samples
54616 96
# Sample names
head(sampleNames(eset))
[1] "03162017-A01" "03162017-A02" "03162017-A03" "03162017-A04"
[5] "03162017-A05" "03162017-A06"
# Expression data
exprs(eset)[1:5, 1:5]
03162017-A01 03162017-A02 03162017-A03 03162017-A04
ENSG00000000003 0 3 3 3
ENSG00000000005 0 0 0 0
ENSG00000000419 0 3 1 3
ENSG00000000457 1 0 1 0
ENSG00000000460 0 3 0 0
03162017-A05
ENSG00000000003 2
ENSG00000000005 0
ENSG00000000419 2
ENSG00000000457 3
ENSG00000000460 0
# Sample-level variables ("phenotype data")
pData(eset)[1:5, 1:5]
experiment well batch cell_number concentration
03162017-A01 03162017 A01 b1 1 0.6790047
03162017-A02 03162017 A02 b1 1 1.6430792
03162017-A03 03162017 A03 b1 1 0.4034764
03162017-A04 03162017 A04 b1 0 0.9992873
03162017-A05 03162017 A05 b1 1 0.3265886
# Descriptions of the sample-level variables
varMetadata(eset)[1:5, , drop = FALSE]
labelDescription
experiment ID of C1 chip (i.e. processing date in MMDDYYYY)
well Well of C1 chip (96 total, rows A-H, cols 1-12)
batch Batch the C1 chip was processed in (b1, b2, ...)
cell_number The number of cells observed in the well via microscopy
concentration The cDNA concentration of the well prior to library prep
# Gene-level variables ("feature data")
head(fData(eset))
chr start end name strand source
ENSG00000000003 hsX 99883667 99891803 TSPAN6 - H. sapiens
ENSG00000000005 hsX 99839799 99854882 TNMD + H. sapiens
ENSG00000000419 hs20 49551404 49575092 DPM1 - H. sapiens
ENSG00000000457 hs1 169818772 169863408 SCYL3 - H. sapiens
ENSG00000000460 hs1 169764181 169823221 C1orf112 + H. sapiens
ENSG00000000938 hs1 27938575 27961788 FGR - H. sapiens
# Descriptions of the gene-level metrics
fvarMetadata(eset)
labelDescription
chr Chromosome
start Most 5' start position (GRCh37/hg19; 1-based; inclusive)
end Most 3' end position (GRCh37/hg19; 1-based; inclusive)
name Gene name
strand Strand (+ = positive/forward; - = negative/reverse)
source Source of RNA
# Experimental details
experimentData(eset)
Experiment data
Experimenter name: Abhishek K. Sarkar, Po-Yuan Tung, John D. Blischak, Yang I. Li, Matthew Stephens, and Yoav Gilad
Laboratory: Gilad (University of Chicago)
Contact information: https://github.com/jdblischak/singlecell-qtl/issues
Title: Discovery and characterization of variance QTLs in human induced pluripotent stem cells
URL: https://jdblischak.github.io/singlecell-qtl
PMIDs:
Abstract: A 195 word abstract is available. Use 'abstract' method.
Information is available on: preprocessing
# Abstract
cat(abstract(eset))
Quantification of gene expression levels at the single cell level has
revealed that gene expression can vary substantially even across a
population of homogeneous cells. However, it is currently unclear what
genomic features control variation in gene expression levels, and whether
common genetic variants may impact gene expression variation. Here, we
take a genome-wide approach to identify expression variance quantitative
trait loci (vQTLs). To this end, we generated single cell RNA-seq
(scRNA-seq) data from induced pluripotent stem cells (iPSCs) derived from
53 Yoruba individuals. We collected data for a median of 95 cells per
individual and a total of 5,447 single cells, and identified 241 mean
expression QTLs (eQTLs) at 10% FDR, of which 82% replicate in bulk RNA-seq
data from the same individuals. We further identified 14 vQTLs at 10% FDR,
but demonstrate that these can also be explained as effects on mean
expression. Our study suggests that dispersion QTLs (dQTLs), which could
alter the variance of expression independently of the mean, have
systematically smaller effect sizes than eQTLs. We estimate that at least
300 cells per individual and 400 individuals would be required to have
modest power to detect the strongest dQTLs in iPSCs. These results will
guide the design of future studies on understanding the genetic control of
gene expression variance.
# Proprocessing steps
steps <- preproc(eset)
for (i in seq_along(steps)) {
cat(steps[[i]], sep = "\n")
}
We used `umi_tools extract` (UMI-tools 0.5.3) to extract the 6 base
pair unique molecular identifier (UMI) from the 5’ end of each read.
We used `subjunc` (Subread 1.5.3) to map reads to the human genome
(Ensembl GRCh37.75; chromosomes 1-22, X, Y, MT) and ERCC spike-ins
(http://tools.invitrogen.com/downloads/ERCC92.fa). Some samples included
spike-in RNA from Drosophila melanogaster or Caenorhabditis elegans, so
we also included the Ensembl genomes BDGP5.75 and WBcel235.75.
We used `featureCounts` (Subread 1.5.3) to count the number of reads
for all protein-coding genes (Ensembl GRCh37 release 75) and the ERCC
spike-in genes. We performed strand-specific counting (flag -s 1)
because the UMI protocol preserves sequence strand information.
We used `umi_tools dedup` (UMI-tools 0.5.3) to deduplicate reads with
the same UMI and start position to molecules.
We used `verifyBamID` (1.1.3) to identify the individual of origin for
each sample based on the overlap of the RNA-seq reads with the known
genotypes.
A nice feature of ExpressionSets is that any subsetting applied to either the sample-level or gene-level variables is automatically propagated to the expression matrix.
eset_quality <- eset[fData(eset)$source == "H. sapiens",
eset$cell_number == 1 & eset$tra1.60]
dim(eset_quality)
Features Samples
20151 85
Multiple ExpressionSet objects can also be combined. The following code imports and combines the data for all 10 C1 chips from batch 1.
chips <- c("03162017.rds", "03172017.rds", "03232017.rds",
"03302017.rds", "03312017.rds", "04052017.rds",
"04072017.rds", "04132017.rds", "04142017.rds",
"04202017.rds")
fname <- Map(function(x) paste0("../data/eset/", x), chips)
batch1 <- Reduce(combine, Map(readRDS, fname))
dim(batch1)
Features Samples
54616 960
To read in the entire data set, use the combined ExpressionSet object (it's much faster to load just one serialized data file).
eset_total <- readRDS("../data/eset.rds")
dim(eset_total)
Features Samples
54616 7584
To learn more about the data that can be stored in an ExpressionSet object and methods to access the data, please see the help pages for ExpressionSet and its parent class eSet.
?ExpressionSet
?eSet
For convenience, the combined data are available in single files.
The annotation for all the single cells is in data/scqtl-annotation.txt. The description of the annotation columns is in data/scqtl-annotation-description.txt.
anno <- read.delim("../data/scqtl-annotation.txt", stringsAsFactors = FALSE)
# Add back leading zero for experiment ID to get MMDDYYYY
anno$experiment <- sprintf("%08d", anno$experiment)
anno_descrip <- read.delim("../data/scqtl-annotation-description.txt",
stringsAsFactors = FALSE)
head(anno_descrip, 3)
variable description
1 experiment ID of C1 chip (i.e. processing date in MMDDYYYY)
2 well Well of C1 chip (96 total, rows A-H, cols 1-12)
3 batch Batch the C1 chip was processed in (b1, b2, ...)
The combined counts are in data/scqtl-counts.txt.gz. Note that because this file is large, it is stored using Git LFS. Thus you must follow the setup instructions above to download this file to your local computer. I recommend importing it with data.table::fread
.
library("data.table")
counts <- data.frame(fread("zcat ../data/scqtl-counts.txt.gz"), row.names = 1)
Read 49.6% of 20151 rows
Read 99.3% of 20151 rows
Read 20151 rows and 7585 (of 7585) columns from 0.292 GB file in 00:00:10
dim(counts)
[1] 20151 7584
counts[1:5, 1:2]
NA18517.02192018.A01 NA18913.02192018.A02
ENSG00000000003 2 1
ENSG00000000005 0 0
ENSG00000000419 1 0
ENSG00000000457 0 1
ENSG00000000460 0 0
The data analysis is organized using the R package workflowr. Here are the basics:
# Create a new analysis file using the workflowr
wflow_open("analysis/new.Rmd")
# Build the website locally
wflow_build()
# Build and commit the website files
wflow_publish("analysis/new.Rmd", "Add a new analysis")
Ideally wflow_publish()
will be run using the conda environment scqtl so that the final results are always generated using the same versions of the software. If you are using RStudio, you may need to open R from the command line to execute wflow_publish()
in the conda environment.
More information on workflowr can be found in the online documentation.
Since we use Git LFS, we can version large binary files with Git without worrying about the size of the Git repo. Before you do anything with large files, make sure you have activated the conda environment to ensure that you have Git LFS installed.
Activate the conda environment:
source activate scqtl
Check the currently tracked file extensions:
git lfs track
If the file extension is not yet being tracked, you first need to track it. For example, if you wanted to use Git LFS with files that end in the extension .xyz
, you would run the following:
git lfs track "*.xyz"
git add .gitattributes
Once the file extension is being tracked, add and commit the file like any other file (you need to specify the -f
flag for "force" because the directory data/
is ignored by default to prevent accidentally committing a large file):
git add -f data/filename.xyz
git commit
Before pushing to GitHub, confirm that the large file was properly committed by running git log --stat
. The file that was committed should only be a few lines long because it is a plain text file with a pointer to the version of the large file on GitHub's servers.
There may be situations where you cannot access the R packages installed in the conda environment (e.g. running RStudio on macOS, or RStudio Server). The code below installs the necessary R packages from CRAN, Bioconductor, and GitHub.
Install R packages from CRAN:
install.packages("cowplot", "data.table", "devtools", "dplyr", "ggplot2",
"readr", "rmarkdown", "stringr", "testit", "tidyr")
Install R packages from Bioconductor:
source("https://bioconductor.org/biocLite.R")
biocLite(c("Biobase", "biomaRt", "edgeR"))
Install R packages from GitHub:
devtools::install_github("jdblischak/workflowr")
Manually typing git pull origin master
and other long Git commands is a pain. To enable Bash auto-completion (i.e. tab-completion) for Git on macOS, follow these steps:
Download the Bash completion script:
curl -L https://raw.github.com/git/git/master/contrib/completion/git-completion.bash > ~/git-completion.bash
Add the following line to your .bash_profile:
source ~/git-completion.bash
Re-run your .bash_profile:
source ~/.bash_profile
The Initial setup instructions install git-lfs using conda. Therefore you need to have the conda environment active (source activate scqtl
) whenever you run a Git command, which ensures that the large files are always handled properly by Git. Since it can be easy to forget to first activate the conda environment, a safety measure would be to install git-lfs on your local machine (i.e. outside of the conda environment). This will ensure that git-lfs is always available even when you forget to activate the conda environment.
There are multiple methods to install git-lfs. Below are instructions to download a pre-built binary.
Download the software:
cd ~/Downloads
# 64-bit macOS:
curl -OL https://github.com/git-lfs/git-lfs/releases/download/v2.3.0/git-lfs-darwin-amd64-2.3.0.tar.gz
# 64-bit Linux:
curl -OL https://github.com/git-lfs/git-lfs/releases/download/v2.3.0/git-lfs-linux-amd64-2.3.0.tar.gz
Extract the software and copy the executable files into ~/bin
(note: you can put it anywhere you like, but this is a conventional place to put executable files):
mkdir -p ~/bin
tar xzf git-lfs-*.tar.gz
cd git-lfs-2.3.0/
cp git-lfs install.sh ~/bin
Add ~/bin
to your PATH
by adding the following line to your ~/.bash_profile
(macOS) or .bashrc
(Linux):
export PATH="~/bin:$PATH"
Note that ~/bin
is already included in PATH
by default on at least some Linux distributions.
Update PATH
:
# macOS
source ~/.bash_profile
# Linux
source ~/.bashrc
Just to be safe, run the git-lfs install script (this modifies the file ~/.gitconfig
). This is unnecessary if you have already gone through the Initial setup instructions on this machine, but it doesn't hurt to run it again:
git lfs install
To confirm that you have git-lfs installed locally, run which git-lfs
. This should point to ~/bin/git-lfs
and not your conda environment. Also run cat ~/.gitconfig
. It should contain the following (it will also likely have additional lines):
[filter "lfs"]
clean = git-lfs clean -- %f
smudge = git-lfs smudge -- %f
process = git-lfs filter-process
required = true
Updating the entire "scqtl" conda environment is time-consuming because it contains so many packages. This is prohibitive, especially if you only want to test out a new package. Thus instead of adding the package to environment.yaml
and updating the entire environment, you can install just that new package directly from the Terminal:
source activate scqtl
conda install pkgname
If you know the channel the package is in, you can specify it with -c
, for example -c bioconda
. However, this can get tedious if the package has multiple dependencies across multiple channels, which requires specifying each channel with -c
. To automatically search the conda channels in the exact same order as specified in environment.yaml
, run the code below to create the file ~/.condarc
:
echo "\
channels:
- jdblischak
- bioconda
- conda-forge
- defaults
- anaconda
default_channels:
- https://repo.anaconda.com/pkgs/free
- https://repo.anaconda.com/pkgs/pro" > ~/.condarc
And now you can run conda install
without the -c
flag.
RStudio has had some major updates with its 1.0 release. If you are using a version less than 1.0 (check by clicking Help
-> About RStudio
), you should strongly consider upgrading. The most noticeable change for our workflow is that it has built-in support for R Markdown websites. Thus if you click the Knit
button for an R Markdown file in the workflowr project, it will recognize the settings in the file analysis/_site.yml
and save the output HTML in docs/
.
One annoying new feature you will notice is that the chunks of R Markdown documents appear inline in the document instead of being executed in the R console. To restore the previous setting, go to Tools
-> Global Options...
-> R Markdown
and unselect "Show output inline for all R Markdown documents".
On macOS and Linux systems you can run RStudio from a terminal and specify which working directory to startup within. Additionally, on Linux systems if you run RStudio from a terminal and specify no command line argument then RStudio will startup using the current working directory of the terminal.
For example, on macOS you could use the following commands to open RStudio (respectively) in the ~/projects/foo
directory or the current working directory:
$ open -a RStudio ~/projects/foo
$ open -a RStudio .
On Linux you would use the following commands (note that no .
is necessary in the second invocation):
$ rstudio ~/projects/foo
$ rstudio
We use git-lfs to version control large files. To push updated versions of these large files requires authentication of the https server where they are stored, and this must be done for each file separately! To reduce the number of times you have to manually enter your password, enable caching of your GitHub password:
git config --global credential.helper cache
sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)
Matrix products: default
BLAS: /project2/gilad/jdblischak/miniconda3/envs/scqtl/lib/R/lib/libRblas.so
LAPACK: /project2/gilad/jdblischak/miniconda3/envs/scqtl/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] data.table_1.10.4 Biobase_2.38.0 BiocGenerics_0.24.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 digest_0.6.12 rprojroot_1.2 backports_1.0.5
[5] git2r_0.19.0 magrittr_1.5 evaluate_0.10.1 stringi_1.1.2
[9] rmarkdown_1.8 tools_3.4.1 stringr_1.2.0 yaml_2.1.14
[13] compiler_3.4.1 htmltools_0.3.6 knitr_1.20
This R Markdown site was created with workflowr