Diminishing returns of sequencing depth

Introduction

We previously observed that the number of genes/molecules detected does not saturate with increasing sequencing depth.

Here, we quantify the potential gain in additional sequencing of the samples in terms of number of molecules and number of genes detected.

Analysis

annotations.shape
(5012, 40)

plt.clf()
plt.scatter(annotations['umi'], annotations['mol_hs'], color='k', s=2)
plt.xticks(np.linspace(0, 1, 5) * 1e7, np.linspace(0, 1, 5) * 10)
plt.xlabel('Millions of reads with UMI')
_ = plt.ylabel('Number of molecules detected')

mol-vs-depth.png

plt.clf()
plt.scatter(annotations['umi'], annotations['detect_hs'], color='k', s=2)
plt.xticks(np.linspace(0, 1, 5) * 1e7, np.linspace(0, 1, 5) * 10)
plt.xlabel('Millions of reads with UMI')
_ = plt.ylabel('Number of genes detected')

detect-vs-depth.png

To see whether the trend is different per individual, fit a linear regression per individual:

def _lm(x):
  m = lm.LinearRegression(fit_intercept=True).fit(x['umi'].values.reshape(-1, 1), x['mol_hs'].values)
  rss = np.square(x['mol_hs'].values - m.predict(x['umi'].values.reshape(-1, 1))).sum()
  sigma2 = rss / x.shape[0]
  var = x['umi'].var()
  se = sigma2 / var
  return pd.Series({'beta': m.coef_[0], 'se': se, 'intercept': m.intercept_})

beta = annotations.groupby('chip_id').apply(_lm).sort_index()

Plot the distribution of regression coefficients:

plt.clf()
plt.hist(beta['beta'].values, color='k', bins=20)
plt.xlabel('Estimated number of molecules detected per additional read')
_ = plt.ylabel('Number of individuals')

mol-vs-depth-beta.png

plt.clf()
plt.gcf().set_size_inches(12, 4)
plt.errorbar(x=np.arange(beta.shape[0]), y=beta['beta'].values, yerr=beta['se'].values, ls='', marker='o', ms=1)
plt.xticks(np.arange(beta.shape[0]), beta.index, rotation=90)
plt.xlabel('Individual')
_ = plt.ylabel('Estimated effect size')

mol-vs-depth-beta-errors.png

Author: Abhishek Sarkar

Created: 2018-05-22 Tue 13:57

Validate