Examples of single cell measurement error
Introduction
Our power analysis found several counter-intuitive results:
- The noise ratio (measurement error variance divided by residual variance) is typically larger for the mean than for the dispersion
- The measurement error variance is sometimes larger than the sample variance, but we still find a QTL
- The estimated prior on effect sizes on mean expression has more density close to zero than the estimated prior on effect sizes on dispersion
Here, we demonstrate these results reflect real features of the data on several examples.
Read the data
Read the summary statistics \(\hat\beta, SE(\hat\beta)\).
mean_stats = pd.read_table('/scratch/midway2/aksarkar/singlecell/power/analytic/mean.txt.gz') disp_stats = pd.read_table('/scratch/midway2/aksarkar/singlecell/power/analytic/disp.txt.gz')
Read the sampling variances \(\hat\sigma^2_{mi}\).
log_mean_sampling_var = pd.read_table('/scratch/midway2/aksarkar/singlecell/power/log_mean-se.txt.gz', index_col=0) disp_sampling_var = pd.read_table('/scratch/midway2/aksarkar/singlecell/power/disp-se.txt.gz', index_col=0)
Read the fitted ash
models.
with open('fold-change-ash-results.pkl', 'rb') as f: ash_results = pickle.load(f)
CCT8
CCT8 is one of the cases where we find an eQTL, even though the estimated measurement error variance is larger than the estimated residual variance.
Extract the data pre-computed for the interactive browser.
with sqlite3.connect('/project2/mstephens/aksarkar/projects/singlecell-qtl/browser/browser.db') as conn: params = pd.read_sql( sql="""select mean_qtl_geno.ind, mean_qtl_geno.value as genotype, log_mu, log_phi, logodds, log_mean, log_mean_se, log_phi_se, log_mean + log_mean_se as log_mean_upper, log_mean - log_mean_se as log_mean_lower, log_phi + log_phi_se as log_phi_upper, log_phi - log_phi_se as log_phi_lower from mean_qtl_geno, params where mean_qtl_geno.gene == ? and mean_qtl_geno.gene == params.gene and mean_qtl_geno.ind == params.ind;""", params=('ENSG00000156261',), con=conn)
Estimate the residual variance for the mean, corrected for measurement error.
\[ \hat\sigma^2_r = \frac{1}{n - 1}\sum_i (\tilde{y}_i - \bar{y})^2 - \frac{1}{n}\sum_i \hat\sigma^2_{mi} \]
where \(\tilde{y}_i\) is the observed phenotype of individual \(i\), \(i = 1, \ldots, n\), \(\bar{y} = \frac{1}{n} \sum_i \tilde{y}_i\), and \(\hat\sigma^2_{mi}\) is the estimated sampling variance associated with \(\tilde{y}_i\).
params['log_mean'].var() - log_mean_sampling_var.loc['ENSG00000156261'].mean()
-0.00148092262834108
Do the same for the dispersion.
params['log_phi'].var() - disp_sampling_var.loc['ENSG00000156261'].mean()
0.36559941312947225
For this gene, get the SNP with best Wald statistic \(\frac{{\hat\beta}^2}{V[\hat\beta]}\).
best_eqtl = mean_stats.groupby('gene').apply(lambda x: x.loc[np.square(x['beta'] / x['se']).idxmax()]).loc['ENSG00000156261'] best_eqtl
Unnamed: 0 rs2070611.chr21.30445774 beta 0.028947 gene ENSG00000156261 se 0.00691489 Name: ENSG00000156261, dtype: object
We previously estimated the ash
prior genome-wide by taking 10 random SNPs
for every gene.
Given the EB prior, recover the posterior distribution of the effect size for the best SNP identified above.
shrunk_eqtl = ashr.ash(best_eqtl['beta'], best_eqtl['se'], fixg=True, g=ash_results['mean'].rx2('fitted_g')) pd.Series({'pm': float(np.array(ashr.get_pm(shrunk_eqtl))), 'lfsr': float(np.array(ashr.get_lfsr(shrunk_eqtl)))})
lfsr 0.008324 pm 0.021778 dtype: float64
On the log scale, the posterior effect size on the natural log of the mean is roughly 0.02, despite the prior putting most density much closer to 0.
This is a tiny effect size, but we need to consider it relative to the variation in the phenotype. Define \(\beta = \lambda \sigma_r\), and estimate \(\lambda\) assuming \(V[y] = \sigma_r^2\).
best_eqtl['beta'] / params['log_mean'].std()
Do the same for the best dQTL.
best_dqtl = disp_stats.groupby('gene').apply(lambda x: x.loc[np.square(x['beta'] / x['se']).idxmax()]).loc['ENSG00000156261']
shrunk_dqtl = ashr.ash(best_dqtl['beta'], best_dqtl['se'], fixg=True, g=ash_results['disp'].rx2('fitted_g')) pd.Series({'beta': best_dqtl['beta'], 'se': best_dqtl['se'], 'pm': float(np.array(ashr.get_pm(shrunk_dqtl))), 'lfsr': float(np.array(ashr.get_lfsr(shrunk_dqtl)))})
beta -0.176113 lfsr 0.750243 pm -0.014714 se 0.082764 dtype: float64
Unlike the eQTL effect size, the dQTL effect size is shrunk essentially to zero, and has high lfsr.
Plot the phenotype by genotype class, along with the estimated regression line and prediction confidence intervals.
plt.clf() fig, ax = plt.subplots(1, 2) fig.set_size_inches(6, 3) x = params['genotype'] + np.random.normal(scale=0.05, size=params.shape[0]) ax[0].errorbar(x=x, y=params['log_mean'], yerr=params['log_mean_se'], marker='o', c='k', ms=3, lw=1, ls='') ax[0].set_ylabel('Deconvolved ln mean') grid = np.linspace(0, 2, 100) yhat = params['log_mean'].mean() + (grid - grid.mean()) * best_eqtl['beta'] # https://rpubs.com/aaronsc32/regression-confidence-prediction-intervals pred_se = best_eqtl['se'] * (1 + 1 / grid.shape[0] + np.square(grid - grid.mean()) / np.square(grid - grid.mean()).sum()) ax[0].plot(grid, yhat, c='r', lw=2, zorder=99) ax[0].fill_between(grid, yhat - 1.96 * pred_se, yhat + 1.96 * pred_se, color='r', alpha=0.25) ax[1].errorbar(x=x, y=params['log_phi'], yerr=params['log_phi_se'], marker='o', c='k', ms=3, lw=1, ls='') yhat = params['log_phi'].mean() + (grid - grid.mean()) * best_dqtl['beta'] pred_se = best_dqtl['se'] * (1 + 1 / grid.shape[0] + np.square(grid - grid.mean()) / np.square(grid - grid.mean()).sum()) ax[1].plot(grid, yhat, c='r', lw=2, zorder=99) ax[1].fill_between(grid, yhat - 1.96 * pred_se, yhat + 1.96 * pred_se, color='r', alpha=0.25) ax[1].set_ylabel('$\ln(\phi)$') for a in ax: a.set_xlabel('Dosage') fig.tight_layout()
Overall, the example suggests:
- The major difference between eQTL mapping and dQTL mapping is within genotype class variance. The within genotype class variance reflects PVE (namely, variance unexplained by genotype)
- Even if the measurement error variance is comparable to the effect size or even the phenotypic variance, we can still find a QTL if the PVE is high enough (within genotype class variance is low enough)
Plot synthetic example
def rnbinom(log_mu, log_phi, n_samples): n = np.exp(-log_phi) p = 1 / (1 + np.exp(log_mu + log_phi)) return st.nbinom(n=n, p=p).rvs(n_samples)
plt.clf() plt.gcf().set_size_inches(3, 3) for log_phi in (-5, -2): x = rnbinom(np.log(1.5e5) - 8, log_phi, 5000) f = st.gaussian_kde(x) grid = np.linspace(0, 100, 1000) plt.plot(grid, f(grid), label=log_phi) plt.legend(frameon=False, title='$\ln(\phi)$') plt.xlabel('Count') plt.ylabel('Density')
Text(0,0.5,'Frequency')