Contributing

Last updated: 2018-08-15

Code version: ede57c0

Initial setup

The initial setup only has to be performed once for each computer that you use.

To ensure all contributors are using the same computational environment, we use conda to manage software dependencies (made possible by the bioconda and conda-forge projects). Please complete the following steps to replicate the computing environment. Note that this is only guaranteed to work on a Linux-64 based architecture, but in theory should be able to work on macOS as well. All commands shown below are intended to be run in a Bash shell from the root of the project directory.

Install Git and register for an account on GitHub
Download and install Miniconda3
- Run bash Downloads/Miniconda3-latest-MacOSX-x86_64.sh
- Press ENTER to start installation
- Press q when finished reading the license and then type yes
- Type the location for installation (press ENTER to accept the default of ~/miniconda3)
- Type yes to add Miniconda to your PATH variable
- Run source ~/.bash_profile on macOS or source ~/.bashrc on Linux to update PATH
- Run which conda to confirm that conda was installed

Clone this repository (or your personal fork)

git clone https://github.com/jdblischak/singlecell-qtl.git
cd singlecell-qtl

Create the conda environment "scqtl" using environment.yaml
```
conda env create --file environment.yaml
```
To use the conda environment, you must first activate it by running source activate scqtl. This will override your default settings for R, Python, and various other software packages. When you are done working on this project, you can either logout of the current session or deactivate the environment.
```
# Activate conda environment scqtl
source activate scqtl
# Deactivate conda environment
source deactivate
```
Initialize git-lfs and download the latest version of large data files
```
git lfs install
git lfs pull
```

Warning: If you are using RStudio, you need to ensure that it recognizes your conda environment. If you launch RStudio by clicking on an icon, it doesn't use the current environment you have configured in your shell. On a Linux-based system, the solution is to launch RStudio directly from the shell with rstudio. On macOS, running open -a rstudio . from the shell causes RStudio to recognize most of the environment variables, but strangely it does not set the correct library path to the conda R packages. Suggestions for how to fix this are welcome.

Regular setup

Once you have performed the initial setup on your computer, the regular setup you have to perform each time you want to work on the project is much quicker. Navigate to the singlecell-qtl directory, activate the conda environment, and pull the latest changes from the GitHub repository.

cd singlecell-qtl
source activate scqtl
git pull

If there are updates to environment.yaml, you can update the "scqtl" conda environment. You may want to do this occasionally just to be sure you have all the software you need.

conda env update --file environment.yaml

Bioinformatics pipeline

The main steps of the bioinformatics pipeline are:

Reads that do not have a valid UMI are removed
UMIs are extracted with UMI-tools (umi_tools extract)
Reads are mapped to the genome with Subjunc
Reads are deduplicated based on their UMI with UMI-tools (umi_tools dedup; output referred to as molecules)
Molecules are assigned to genes with featureCounts
The identity of the single cells is determined with verifyBamID

The pipeline is implemented with Snakemake. The Snakefile controls this process, ensuring that output files are updated when upstream files are changed. Options for the pipeline are set in config.yaml, options for submission to Slurm are set in cluster.json, the Bash script for executing Snakemake is in submit-snakemake.sh, and the Slurm sbatch file for submitting the pipeline is snakemake.sbatch.

Using ExpressionSet objects

The processed data for each C1 chip is stored in its own RDS file, a binary format for storing R objects. The RDS file contains an ExpressionSet object. The ExpressionSet class stores the RNA-seq counts along with metadata on the samples, features (e.g. genes), and experimental details.

To import the data, use the R function readRDS(). Because the ExpressionSet class is defined in the Bioconductor package Biobase, it will be loaded the first time you access the loaded ExpressionSet object. To avoid the long startup message that the Biobase package prints, here it is loaded prior to importing the object.

suppressPackageStartupMessages(library("Biobase"))
eset <- readRDS("../data/eset/03162017.rds")

Executing the object in the R console provides an overall summary of what it contains.

eset

ExpressionSet (storageMode: lockedEnvironment)
assayData: 54616 features, 96 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: 03162017-A01 03162017-A02 ... 03162017-H12 (96
    total)
  varLabels: experiment well ... valid_id (40 total)
  varMetadata: labelDescription
featureData
  featureNames: ENSG00000000003 ENSG00000000005 ... WBGene00235374
    (54616 total)
  fvarLabels: chr start ... source (6 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
Annotation:

But each element of the object can be extracted individually, for example:

# Dimensions
dim(eset)

Features  Samples 
   54616       96

# Sample names
head(sampleNames(eset))

[1] "03162017-A01" "03162017-A02" "03162017-A03" "03162017-A04"
[5] "03162017-A05" "03162017-A06"

# Expression data
exprs(eset)[1:5, 1:5]

                03162017-A01 03162017-A02 03162017-A03 03162017-A04
ENSG00000000003            0            3            3            3
ENSG00000000005            0            0            0            0
ENSG00000000419            0            3            1            3
ENSG00000000457            1            0            1            0
ENSG00000000460            0            3            0            0
                03162017-A05
ENSG00000000003            2
ENSG00000000005            0
ENSG00000000419            2
ENSG00000000457            3
ENSG00000000460            0

# Sample-level variables ("phenotype data")
pData(eset)[1:5, 1:5]

             experiment well batch cell_number concentration
03162017-A01   03162017  A01    b1           1     0.6790047
03162017-A02   03162017  A02    b1           1     1.6430792
03162017-A03   03162017  A03    b1           1     0.4034764
03162017-A04   03162017  A04    b1           0     0.9992873
03162017-A05   03162017  A05    b1           1     0.3265886

# Descriptions of the sample-level variables
varMetadata(eset)[1:5, , drop = FALSE]

                                                      labelDescription
experiment            ID of C1 chip (i.e. processing date in MMDDYYYY)
well                   Well of C1 chip (96 total, rows A-H, cols 1-12)
batch                 Batch the C1 chip was processed in (b1, b2, ...)
cell_number    The number of cells observed in the well via microscopy
concentration The cDNA concentration of the well prior to library prep

# Gene-level variables ("feature data")
head(fData(eset))

                 chr     start       end     name strand     source
ENSG00000000003  hsX  99883667  99891803   TSPAN6      - H. sapiens
ENSG00000000005  hsX  99839799  99854882     TNMD      + H. sapiens
ENSG00000000419 hs20  49551404  49575092     DPM1      - H. sapiens
ENSG00000000457  hs1 169818772 169863408    SCYL3      - H. sapiens
ENSG00000000460  hs1 169764181 169823221 C1orf112      + H. sapiens
ENSG00000000938  hs1  27938575  27961788      FGR      - H. sapiens

# Descriptions of the gene-level metrics
fvarMetadata(eset)

                                               labelDescription
chr                                                  Chromosome
start  Most 5' start position (GRCh37/hg19; 1-based; inclusive)
end      Most 3' end position (GRCh37/hg19; 1-based; inclusive)
name                                                  Gene name
strand      Strand (+ = positive/forward; - = negative/reverse)
source                                            Source of RNA

# Experimental details
experimentData(eset)

Experiment data
  Experimenter name: Abhishek K. Sarkar, Po-Yuan Tung, John D. Blischak, Yang I. Li, Matthew Stephens, and Yoav Gilad 
  Laboratory: Gilad (University of Chicago) 
  Contact information: https://github.com/jdblischak/singlecell-qtl/issues 
  Title: Discovery and characterization of variance QTLs in human induced pluripotent stem cells 
  URL: https://jdblischak.github.io/singlecell-qtl 
  PMIDs:  

  Abstract: A 195 word abstract is available. Use 'abstract' method.
  Information is available on: preprocessing

# Abstract
cat(abstract(eset))

Quantification of gene expression levels at the single cell level has
revealed that gene expression can vary substantially even across a
population of homogeneous cells. However, it is currently unclear what
genomic features control variation in gene expression levels, and whether
common genetic variants may impact gene expression variation. Here, we
take a genome-wide approach to identify expression variance quantitative
trait loci (vQTLs). To this end, we generated single cell RNA-seq
(scRNA-seq) data from induced pluripotent stem cells (iPSCs) derived from
53 Yoruba individuals. We collected data for a median of 95 cells per
individual and a total of 5,447 single cells, and identified 241 mean
expression QTLs (eQTLs) at 10% FDR, of which 82% replicate in bulk RNA-seq
data from the same individuals. We further identified 14 vQTLs at 10% FDR,
but demonstrate that these can also be explained as effects on mean
expression. Our study suggests that dispersion QTLs (dQTLs), which could
alter the variance of expression independently of the mean, have
systematically smaller effect sizes than eQTLs. We estimate that at least
300 cells per individual and 400 individuals would be required to have
modest power to detect the strongest dQTLs in iPSCs. These results will
guide the design of future studies on understanding the genetic control of
gene expression variance.

# Proprocessing steps
steps <- preproc(eset)
for (i in seq_along(steps)) {
  cat(steps[[i]], sep = "\n")
}

We used `umi_tools extract` (UMI-tools 0.5.3) to extract the 6 base
pair unique molecular identifier (UMI) from the 5’ end of each read.
We used `subjunc` (Subread 1.5.3) to map reads to the human genome
(Ensembl GRCh37.75; chromosomes 1-22, X, Y, MT) and ERCC spike-ins
(http://tools.invitrogen.com/downloads/ERCC92.fa). Some samples included
spike-in RNA from Drosophila melanogaster or Caenorhabditis elegans, so
we also included the Ensembl genomes BDGP5.75 and WBcel235.75.
We used `featureCounts` (Subread 1.5.3) to count the number of reads
for all protein-coding genes (Ensembl GRCh37 release 75) and the ERCC
spike-in genes. We performed strand-specific counting (flag -s 1)
because the UMI protocol preserves sequence strand information.
We used `umi_tools dedup` (UMI-tools 0.5.3) to deduplicate reads with
the same UMI and start position to molecules.
We used `verifyBamID` (1.1.3) to identify the individual of origin for
each sample based on the overlap of the RNA-seq reads with the known
genotypes.

A nice feature of ExpressionSets is that any subsetting applied to either the sample-level or gene-level variables is automatically propagated to the expression matrix.

eset_quality <- eset[fData(eset)$source == "H. sapiens",
                     eset$cell_number == 1 & eset$tra1.60]
dim(eset_quality)

Features  Samples 
   20151       85

Multiple ExpressionSet objects can also be combined. The following code imports and combines the data for all 10 C1 chips from batch 1.

chips <- c("03162017.rds", "03172017.rds", "03232017.rds",
           "03302017.rds", "03312017.rds", "04052017.rds",
           "04072017.rds", "04132017.rds", "04142017.rds",
           "04202017.rds")
fname <- Map(function(x) paste0("../data/eset/", x), chips)
batch1 <- Reduce(combine, Map(readRDS, fname))
dim(batch1)

Features  Samples 
   54616      960

To read in the entire data set, use the combined ExpressionSet object (it's much faster to load just one serialized data file).

eset_total <- readRDS("../data/eset.rds")
dim(eset_total)

Features  Samples 
   54616     7584

To learn more about the data that can be stored in an ExpressionSet object and methods to access the data, please see the help pages for ExpressionSet and its parent class eSet.

?ExpressionSet
?eSet

Using combined data files

For convenience, the combined data are available in single files.

The annotation for all the single cells is in data/scqtl-annotation.txt. The description of the annotation columns is in data/scqtl-annotation-description.txt.

anno <- read.delim("../data/scqtl-annotation.txt", stringsAsFactors = FALSE)
# Add back leading zero for experiment ID to get MMDDYYYY
anno$experiment <- sprintf("%08d", anno$experiment)
anno_descrip <- read.delim("../data/scqtl-annotation-description.txt",
                           stringsAsFactors = FALSE)
head(anno_descrip, 3)

    variable                                      description
1 experiment ID of C1 chip (i.e. processing date in MMDDYYYY)
2       well  Well of C1 chip (96 total, rows A-H, cols 1-12)
3      batch Batch the C1 chip was processed in (b1, b2, ...)

The combined counts are in data/scqtl-counts.txt.gz. Note that because this file is large, it is stored using Git LFS. Thus you must follow the setup instructions above to download this file to your local computer. I recommend importing it with data.table::fread.

library("data.table")
counts <- data.frame(fread("zcat ../data/scqtl-counts.txt.gz"), row.names = 1)


Read 49.6% of 20151 rows
Read 99.3% of 20151 rows
Read 20151 rows and 7585 (of 7585) columns from 0.292 GB file in 00:00:10

dim(counts)

[1] 20151  7584

counts[1:5, 1:2]

                NA18517.02192018.A01 NA18913.02192018.A02
ENSG00000000003                    2                    1
ENSG00000000005                    0                    0
ENSG00000000419                    1                    0
ENSG00000000457                    0                    1
ENSG00000000460                    0                    0

Data analysis workflow

The data analysis is organized using the R package workflowr. Here are the basics:

# Create a new analysis file using the workflowr
wflow_open("analysis/new.Rmd")
# Build the website locally
wflow_build()
# Build and commit the website files
wflow_publish("analysis/new.Rmd", "Add a new analysis")

Ideally wflow_publish() will be run using the conda environment scqtl so that the final results are always generated using the same versions of the software. If you are using RStudio, you may need to open R from the command line to execute wflow_publish() in the conda environment.

More information on workflowr can be found in the online documentation.

Adding large files to the Git repo

Since we use Git LFS, we can version large binary files with Git without worrying about the size of the Git repo. Before you do anything with large files, make sure you have activated the conda environment to ensure that you have Git LFS installed.

Activate the conda environment:
```
source activate scqtl
```
Check the currently tracked file extensions:
```
git lfs track
```
If the file extension is not yet being tracked, you first need to track it. For example, if you wanted to use Git LFS with files that end in the extension .xyz, you would run the following:
```
git lfs track "*.xyz"
git add .gitattributes
```
Once the file extension is being tracked, add and commit the file like any other file (you need to specify the -f flag for "force" because the directory data/ is ignored by default to prevent accidentally committing a large file):
```
git add -f data/filename.xyz
git commit
```
Before pushing to GitHub, confirm that the large file was properly committed by running git log --stat. The file that was committed should only be a few lines long because it is a plain text file with a pointer to the version of the large file on GitHub's servers.

Installation of R packages outside of conda environment

There may be situations where you cannot access the R packages installed in the conda environment (e.g. running RStudio on macOS, or RStudio Server). The code below installs the necessary R packages from CRAN, Bioconductor, and GitHub.

Install R packages from CRAN:

install.packages("cowplot", "data.table", "devtools", "dplyr", "ggplot2",
                 "readr", "rmarkdown", "stringr", "testit", "tidyr")

Install R packages from Bioconductor:

source("https://bioconductor.org/biocLite.R")
biocLite(c("Biobase", "biomaRt", "edgeR"))

Install R packages from GitHub:

devtools::install_github("jdblischak/workflowr")

Tips and tricks

Enable Bash auto-completion for Git on macOS

Manually typing git pull origin master and other long Git commands is a pain. To enable Bash auto-completion (i.e. tab-completion) for Git on macOS, follow these steps:

Download the Bash completion script:

curl -L https://raw.github.com/git/git/master/contrib/completion/git-completion.bash > ~/git-completion.bash

Add the following line to your .bash_profile:
```
source ~/git-completion.bash
```
Re-run your .bash_profile:
```
source ~/.bash_profile
```

Install git-lfs locally

The Initial setup instructions install git-lfs using conda. Therefore you need to have the conda environment active (source activate scqtl) whenever you run a Git command, which ensures that the large files are always handled properly by Git. Since it can be easy to forget to first activate the conda environment, a safety measure would be to install git-lfs on your local machine (i.e. outside of the conda environment). This will ensure that git-lfs is always available even when you forget to activate the conda environment.

There are multiple methods to install git-lfs. Below are instructions to download a pre-built binary.

Download the software:

cd ~/Downloads
# 64-bit macOS:
curl -OL https://github.com/git-lfs/git-lfs/releases/download/v2.3.0/git-lfs-darwin-amd64-2.3.0.tar.gz
# 64-bit Linux:
curl -OL https://github.com/git-lfs/git-lfs/releases/download/v2.3.0/git-lfs-linux-amd64-2.3.0.tar.gz

Extract the software and copy the executable files into ~/bin (note: you can put it anywhere you like, but this is a conventional place to put executable files):
```
mkdir -p ~/bin
tar xzf git-lfs-*.tar.gz
cd git-lfs-2.3.0/
cp git-lfs install.sh ~/bin
```
Add ~/bin to your PATH by adding the following line to your ~/.bash_profile (macOS) or .bashrc (Linux):
```
export PATH="~/bin:$PATH"
```
Note that ~/bin is already included in PATH by default on at least some Linux distributions.

Update PATH:

# macOS
source ~/.bash_profile
# Linux
source ~/.bashrc

Just to be safe, run the git-lfs install script (this modifies the file ~/.gitconfig). This is unnecessary if you have already gone through the Initial setup instructions on this machine, but it doesn't hurt to run it again:
```
git lfs install
```
To confirm that you have git-lfs installed locally, run which git-lfs. This should point to ~/bin/git-lfs and not your conda environment. Also run cat ~/.gitconfig. It should contain the following (it will also likely have additional lines):
```
[filter "lfs"]
        clean = git-lfs clean -- %f
        smudge = git-lfs smudge -- %f
        process = git-lfs filter-process
        required = true
```

Configure conda channels

Updating the entire "scqtl" conda environment is time-consuming because it contains so many packages. This is prohibitive, especially if you only want to test out a new package. Thus instead of adding the package to environment.yaml and updating the entire environment, you can install just that new package directly from the Terminal:

source activate scqtl
conda install pkgname

If you know the channel the package is in, you can specify it with -c, for example -c bioconda. However, this can get tedious if the package has multiple dependencies across multiple channels, which requires specifying each channel with -c. To automatically search the conda channels in the exact same order as specified in environment.yaml, run the code below to create the file ~/.condarc:

echo "\
channels:
  - jdblischak
  - bioconda
  - conda-forge
  - defaults
  - anaconda
default_channels:
  - https://repo.anaconda.com/pkgs/free
  - https://repo.anaconda.com/pkgs/pro" > ~/.condarc

And now you can run conda install without the -c flag.

Upgrade RStudio

RStudio has had some major updates with its 1.0 release. If you are using a version less than 1.0 (check by clicking Help -> About RStudio), you should strongly consider upgrading. The most noticeable change for our workflow is that it has built-in support for R Markdown websites. Thus if you click the Knit button for an R Markdown file in the workflowr project, it will recognize the settings in the file analysis/_site.yml and save the output HTML in docs/.

One annoying new feature you will notice is that the chunks of R Markdown documents appear inline in the document instead of being executed in the R console. To restore the previous setting, go to Tools -> Global Options... -> R Markdown and unselect "Show output inline for all R Markdown documents".

Open RStudio from the Terminal

On macOS and Linux systems you can run RStudio from a terminal and specify which working directory to startup within. Additionally, on Linux systems if you run RStudio from a terminal and specify no command line argument then RStudio will startup using the current working directory of the terminal.

For example, on macOS you could use the following commands to open RStudio (respectively) in the ~/projects/foo directory or the current working directory:

$ open -a RStudio ~/projects/foo
$ open -a RStudio .

On Linux you would use the following commands (note that no . is necessary in the second invocation):

$ rstudio ~/projects/foo
$ rstudio

FAQ

Why does Git keep asking for my password repeatedly?

We use git-lfs to version control large files. To push updated versions of these large files requires authentication of the https server where they are stored, and this must be done for each file separately! To reduce the number of times you have to manually enter your password, enable caching of your GitHub password:

git config --global credential.helper cache

Session information

sessionInfo()

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.4 (Nitrogen)

Matrix products: default
BLAS: /project2/gilad/jdblischak/miniconda3/envs/scqtl/lib/R/lib/libRblas.so
LAPACK: /project2/gilad/jdblischak/miniconda3/envs/scqtl/lib/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] data.table_1.10.4   Biobase_2.38.0      BiocGenerics_0.24.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.13    digest_0.6.12   rprojroot_1.2   backports_1.0.5
 [5] git2r_0.19.0    magrittr_1.5    evaluate_0.10.1 stringi_1.1.2  
 [9] rmarkdown_1.8   tools_3.4.1     stringr_1.2.0   yaml_2.1.14    
[13] compiler_3.4.1  htmltools_0.3.6 knitr_1.20

This R Markdown site was created with workflowr