Mean/dispersion estimation

Introduction

We take a modular approach to call QTLs:

  1. Estimate a mean and a dispersion for each individual
  2. Treat the mean/dispersion as continuous phenotypes and perform QTL mapping

Here, we solve (1).

  1. We implement GPU-based ML estimation of a zero-inflated negative binomial model
  2. We show in simulation that the estimates are unbiased
  3. We compare ZINB estimates of mean expression against sample-based estimates
  4. We implement a test for goodness of fit for the ZINB model

Model specification

Let \(r_{ijk}\) denote the number of molecules for individual \(i\), cell \(j\), gene \(k\). Let \(R_{ij}\) denote a size factor for each cell.

\[ r_{ijk} \sim \pi_{ik} \delta_0(\cdot) + (1 - \pi_{ik})\text{Poisson}(\cdot; R_{ij} \mu_{ik} u_{ijk}) \]

\[ u_{ijk} \sim \text{Gamma}(\cdot; \phi_{ik}^{-1}, \phi_{ik}^{-1}) \]

Here, \(\mu_{ik}\) is proportional to relative expression (Pachter 2011), and \(\phi_{ik}\) is the variance of expression noise.

Considering just the Poisson component of the mixture, marginalizing out \(u\) yields the negative binomial log likelihood (including the mixture weight):

\[ l(\cdot) = \ln(1 - \pi_{ik}) + r_{ijk} \ln\left(\frac{R_{ij}\mu_{ik}\phi_{ik}}{1 + R_{ij}\mu_{ik}\phi_{ik}}\right) - \phi_{ik}^{-1} \ln(1 + R_{ij}\mu_{ik}\phi_{ik}) + \ln \Gamma(r_{ijk} + \phi_{ik}^{-1}) - \ln \Gamma(r_{ijk} + 1) - \ln \Gamma(\phi_{ik}^{-1}) \]
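For reference, the negative binomial form follows from the standard Poisson–Gamma integral (indices suppressed):

\[ \int_0^\infty \text{Poisson}(r; R\mu u)\, \text{Gamma}(u; \phi^{-1}, \phi^{-1})\, du = \frac{\Gamma(r + \phi^{-1})}{\Gamma(r + 1)\, \Gamma(\phi^{-1})} \left(\frac{R\mu\phi}{1 + R\mu\phi}\right)^{r} \left(\frac{1}{1 + R\mu\phi}\right)^{\phi^{-1}} \]

Taking logs and adding the mixture weight \(\ln(1 - \pi_{ik})\) gives \(l(\cdot)\) above.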

Then, marginalizing over the mixture yields the log likelihood:

\[ \ln p(r_{ijk} \mid \cdot) = \ln(\pi_{ik} + \exp(l(\cdot)))\ \text{if}\ r_{ijk} = 0 \]

\[ \ln p(r_{ijk} \mid \cdot) = l(\cdot)\ \text{otherwise} \]

We have enough observations per mean/dispersion parameter that simply minimizing the negative log likelihood should give reasonable estimates.

This model is equivalent to a model where we assume that the underlying rate is a point-Gamma mixture:

\[ r_{ijk} \mid \lambda_{ijk} \sim \mathrm{Poisson}(\cdot; R_{ij}\lambda_{ijk}) \]

\[ \lambda_{ijk} \sim \pi_{ik} \delta_0(\cdot) + (1 - \pi_{ik}) \text{Gamma}(\lambda_{ijk}; \phi_{ik}^{-1}, \phi_{ik}^{-1}\mu_{ik}^{-1}) \]

The Gamma component of this mixture corresponds to \(\mu_{ik}u_{ijk}\) in the model above. Considering just the Gamma component, marginalizing out \(\lambda\) yields the log likelihood:

\[ \tilde{l}(\cdot) = \ln(1 - \pi_{ik}) + r_{ijk} \ln\left(\frac{R_{ij}}{R_{ij} + \phi_{ik}^{-1}\mu_{ik}^{-1}} \right) + \phi_{ik}^{-1} \ln\left(\frac{\phi_{ik}^{-1}\mu_{ik}^{-1}}{R_{ij} + \phi_{ik}^{-1}\mu_{ik}^{-1}}\right) + \ln\Gamma(r_{ijk} + \phi_{ik}^{-1}) - \ln\Gamma(r_{ijk} + 1) - \ln\Gamma(\phi_{ik}^{-1}) \]

It is clear that \(l = \tilde{l}\) (multiply the numerator and denominator of each ratio inside the logarithms by \(\mu_{ik}\phi_{ik}\)), and therefore the marginal likelihoods (over the mixture components) are also equal.

Tensorflow implementation

We use TensorFlow to optimize all of the parameters jointly, using a one-hot encoding to map parameters to data points. This makes inference amenable to running on the GPU.

The method is implemented in the Python package scqtl.
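As a rough sketch (not the scqtl implementation) of how the one-hot encoding maps per-individual parameters to cells: the indicator matrix has one row per cell and one column per individual, so a single matrix product broadcasts each individual's parameters to all of its cells.

import numpy as np

# Toy shapes: cells x individuals for the indicator, individuals x genes for
# the parameters, cells x 1 for the size factors
cells_per_ind, num_inds, num_genes = 3, 2, 4
onehot = np.zeros((cells_per_ind * num_inds, num_inds))
onehot[np.arange(onehot.shape[0]), np.arange(onehot.shape[0]) // cells_per_ind] = 1
log_mu = np.random.normal(loc=-10, size=(num_inds, num_genes))
size = 1e5 * np.ones((onehot.shape[0], 1))
# Per-cell, per-gene Poisson rate R_{ij} mu_{ik}; one dense matrix product per
# update is what makes the problem GPU-friendly
rate = size * onehot.dot(np.exp(log_mu))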

Simulation

<<zinb-imports>>
<<tf-imports>>

def evaluate(num_samples, num_mols, num_trials=10):
  # This will be reset inside the simulation to generate counts, but we need to
  # fix it to get one design matrix for all the simulated genes
  # def simulate(num_samples, size=None, log_mu=None, log_phi=None, logodds=None, seed=None, design=None, fold=None):
  design = np.zeros((num_samples * num_trials, 1))
  # Important: generate all of the samples for each trial in one shot, and use
  # one-hot encoding to get separate estimates
  args = [(num_samples * num_trials, num_mols, log_mu, log_phi, logodds, None, None, None)
          for log_mu in np.linspace(-12, -6, 7)
          for log_phi in np.linspace(-4, 0, 5)
          for logodds in np.linspace(-3, 3, 7)]
  umi = np.concatenate([scqtl.simulation.simulate(*a)[0][:,:1] for a in args], axis=1)
  onehot = np.zeros((num_samples * num_trials, num_trials))
  onehot[np.arange(onehot.shape[0]), np.arange(onehot.shape[0]) // num_samples] = 1

  init = scqtl.tf.fit(
    umi=umi.astype(np.float32),
    onehot=onehot.astype(np.float32),
    design=design.astype(np.float32),
    size_factor=(num_mols * np.ones((num_samples * num_trials, 1))).astype(np.float32),
    learning_rate=1e-3,
    max_epochs=20000,
    verbose=True,
  )
  log_mu, log_phi, logodds, nb_llik, zinb_llik = scqtl.tf.fit(
    umi=umi.astype(np.float32),
    onehot=onehot.astype(np.float32),
    design=design.astype(np.float32),
    size_factor=(num_mols * np.ones((num_samples * num_trials, 1))).astype(np.float32),
    learning_rate=1e-3,
    max_epochs=20000,
    warm_start=init[:3],
    verbose=True)
  result = pd.DataFrame(
    [('rmsprop', a[0] // num_trials, int(a[1]), int(a[2]), int(a[3]), int(a[4]), a[-1], trial)
     for a in args
     for trial in range(num_trials)],
    columns=['method', 'num_samples', 'num_mols', 'log_mu', 'log_phi', 'logodds', 'fold', 'trial'])
  # Important: the results need to be transposed before flattening
  result['log_mu_hat'] = log_mu.ravel(order='F')
  result['log_phi_hat'] = log_phi.ravel(order='F')
  result['logodds_hat'] = logodds.ravel(order='F')
  result['mean'] = result['num_mols'] * np.exp(result['log_mu_hat'])
  result['var'] = result['mean'] + np.square(result['mean']) * np.exp(result['log_phi_hat'])
  log_cpm = np.log(np.ma.masked_values(umi.reshape(num_trials, -1, umi.shape[-1]), 0)) - np.log(num_mols) + 6 * np.log(10)
  result['mean_log_cpm'] = log_cpm.mean(axis=1).ravel(order='F')
  result['var_log_cpm'] = log_cpm.var(axis=1).ravel(order='F')

  diagnostic = []
  for i in range(umi.shape[1]):
    for j in range(onehot.shape[1]):
      idx = onehot[:,j].astype(bool)
      diagnostic.append(scqtl.diagnostic.diagnostic_test(
        umi[idx,i].reshape(-1, 1),
        log_mu[j,i],
        log_phi[j,i],
        -logodds[j,i],
        num_mols,
        np.ones((num_samples, 1))))
  diagnostic = np.array(diagnostic)
  result['ks_d'] = diagnostic[:,0]
  result['ks_p'] = diagnostic[:,1]

  return result

res = evaluate(num_samples=95, num_mols=114026, num_trials=10)
res.to_csv('tf-simulation.txt.gz', compression='gzip', sep='\t')
sbatch --partition=gpu2 --gres=gpu:1 --mem=16G --time=1:00:00 --job-name=tf-sim --output=tf-sim.out
#!/bin/bash
source activate scqtl
python /project2/mstephens/aksarkar/projects/singlecell-qtl/code/tf-sim.py
Submitted batch job 55767347

Read the results.

result = pd.read_table('/scratch/midway2/aksarkar/singlecell/density-estimation/tf-simulation.txt.gz', index_col=0)

Get the latent mean and variance.

result['latent_mean'] = np.exp(result['log_mu'] - np.log1p(np.exp(result['logodds'])))
result['latent_mean_hat'] = np.exp(result['log_mu_hat'] - np.log1p(np.exp(result['logodds_hat'])))
result['latent_var'] = np.exp(2 * result['log_mu'] + result['log_phi'] - np.log1p(np.exp(result['logodds']))) + np.exp(-np.log1p(np.exp(result['logodds'])) - np.log1p(np.exp(-result['logodds'])) + 2 * result['log_mu'])
result['latent_var_hat'] = np.exp(2 * result['log_mu_hat'] + result['log_phi_hat'] - np.log1p(np.exp(result['logodds_hat']))) + np.exp(-np.log1p(np.exp(result['logodds_hat'])) - np.log1p(np.exp(-result['logodds_hat'])) + 2 * result['log_mu_hat'])
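For reference, these quantities follow from the point-Gamma representation above. Writing \(\pi_{ik} = \mathrm{sigmoid}(\mathrm{logodds}_{ik})\):

\[ E[\lambda_{ijk}] = (1 - \pi_{ik}) \mu_{ik} \]

\[ V[\lambda_{ijk}] = (1 - \pi_{ik}) \mu_{ik}^2 \phi_{ik} + \pi_{ik} (1 - \pi_{ik}) \mu_{ik}^2 \]

which is what latent_mean and latent_var compute on the log scale.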

Plot the accuracy of the estimated parameters and derived quantities, fixing the experiment size.

exp_pass = np.logical_and(result['num_samples'] == 95, result['num_mols'] == 114026)
mu_pass = result['log_mu'] > -10
pi_pass = result['logodds'] < 0

plt.clf()
fig, ax = plt.subplots(2, 3)
fig.set_size_inches(8, 5)

subset = result.loc[np.logical_and.reduce(np.vstack([exp_pass, pi_pass]))]
ax[0, 0].scatter(subset['log_mu'], subset['log_mu_hat'], s=2, c='k')
ax[0, 0].plot(ax[0, 0].get_xlim(), ax[0, 0].get_xlim(), c='r', ls=':', lw=1)
ax[0, 0].set_xlabel('True $\ln(\mu)$')
ax[0, 0].set_ylabel('Estimated $\ln(\mu)$')

ax[1, 0].set_xscale('log')
ax[1, 0].set_yscale('log')
ax[1, 0].scatter(subset['latent_mean'], subset['latent_mean_hat'], s=2, c='k')
ax[1, 0].plot(ax[1, 0].get_xlim(), ax[1, 0].get_xlim(), c='r', ls=':', lw=1)
ax[1, 0].set_xlabel('True latent mean')
ax[1, 0].set_ylabel('Estimated latent mean')

subset = result.loc[np.logical_and.reduce(np.vstack([exp_pass, mu_pass, pi_pass]))]
ax[0, 1].scatter(subset['log_phi'], subset['log_phi_hat'], s=2, c='k')
ax[0, 1].plot(ax[0, 1].get_xlim(), ax[0, 1].get_xlim(), c='r', ls=':', lw=1)
ax[0, 1].set_xlabel('True $\ln(\phi)$')
ax[0, 1].set_ylabel('Estimated $\ln(\phi)$')

ax[1, 1].set_xscale('log')
ax[1, 1].set_yscale('log')
ax[1, 1].scatter(subset['latent_var'], subset['latent_var_hat'], s=2, c='k')
ax[1, 1].plot(ax[1, 1].get_xlim(), ax[1, 1].get_xlim(), c='r', ls=':', lw=1)
ax[1, 1].set_xlabel('True latent variance')
ax[1, 1].set_ylabel('Estimated latent variance')

subset = result.loc[exp_pass]
ax[0, 2].scatter(subset['logodds'], subset['logodds_hat'], s=2, c='k')
ax[0, 2].plot(ax[0, 2].get_xlim(), ax[0, 2].get_xlim(), c='r', ls=':', lw=1)
ax[0, 2].set_xlabel('True $\mathrm{logit}(\pi)$')
ax[0, 2].set_ylabel('Estimated $\mathrm{logit}(\pi)$')

ax[1, 2].set_axis_off()
fig.tight_layout()

sim.png

Plot the accuracy of estimated latent mean and variance as a function of number of molecules, holding the number of cells fixed at the median value in the real data.

plt.clf()
fig, ax = plt.subplots(3, 2)
fig.set_size_inches(6, 9)

for i, (k, g) in enumerate(result.loc[result['num_samples'] == 95].groupby('num_mols')):
  mu_pass = g['log_mu'] > -10
  pi_pass = g['logodds'] < 0
  subset = g.loc[pi_pass]
  ax[i, 0].semilogx()
  ax[i, 0].semilogy()
  ax[i, 0].scatter(subset['latent_mean'], subset['latent_mean_hat'], s=2, c='k')
  ax[i, 0].plot(ax[i, 0].get_xlim(), ax[i, 0].get_xlim(), c='r', ls=':', lw=1)
  ax[i, 0].set_xlabel('True expression mean')
  ax[i, 0].set_ylabel('Estimated expression mean')
  ax[i, 0].set_title('Molecules $= {}$'.format(k))

  subset = g.loc[functools.reduce(np.logical_and, [mu_pass, pi_pass])]
  ax[i, 1].semilogx()
  ax[i, 1].semilogy()
  ax[i, 1].scatter(subset['latent_var'], subset['latent_var_hat'], s=2, c='k')
  ax[i, 1].plot(ax[i, 1].get_xlim(), ax[i, 1].get_xlim(), c='r', ls=':', lw=1)
  ax[i, 1].set_xlabel('True expression variance')
  ax[i, 1].set_ylabel('Estimated expression variance')
  ax[i, 1].set_title('Molecules $= {}$'.format(k))

fig.tight_layout()

simulation-latent-accuracy-vs-mols.png

Plot the accuracy of estimated latent mean and variance as a function of number of cells, holding the number of molecules fixed at the median value in the real data.

plt.clf()
fig, ax = plt.subplots(3, 2)
fig.set_size_inches(6, 9)

for i, (k, g) in enumerate(result.loc[result['num_mols'] == 114026].groupby('num_samples')):
  mu_pass = g['log_mu'] > -10
  pi_pass = g['logodds'] < 0
  subset = g.loc[pi_pass]
  ax[i, 0].semilogx()
  ax[i, 0].semilogy()
  ax[i, 0].scatter(subset['latent_mean'], subset['latent_mean_hat'], s=2, c='k')
  ax[i, 0].plot(ax[i, 0].get_xlim(), ax[i, 0].get_xlim(), c='r', ls=':', lw=1)
  ax[i, 0].set_xlabel('True expression mean')
  ax[i, 0].set_ylabel('Estimated expression mean')
  ax[i, 0].set_title('$n = {}$'.format(k))

  subset = g.loc[functools.reduce(np.logical_and, [mu_pass, pi_pass])]
  ax[i, 1].semilogx()
  ax[i, 1].semilogy()
  ax[i, 1].scatter(subset['latent_var'], subset['latent_var_hat'], s=2, c='k')
  ax[i, 1].plot(ax[i, 1].get_xlim(), ax[i, 1].get_xlim(), c='r', ls=':', lw=1)
  ax[i, 1].set_xlabel('True expression variance')
  ax[i, 1].set_ylabel('Estimated expression variance')
  ax[i, 1].set_title('$n = {}$'.format(k))

fig.tight_layout()

simulation-latent-accuracy-vs-cells.png

Compare against CPU implementation

We also implemented second-order optimization of the ZINB log likelihood on the CPU. Compare the simulation results of the GPU implementation against the CPU implementation to verify that they agree.

Write the results to the database backing the interactive browser.

tf_res = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/tf-simulation.txt.gz', index_col=0)
np_res = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/np-simulation.txt.gz', index_col=0)
np_res['method'] = 'lbfgs'
with sqlite3.connect('/project2/mstephens/aksarkar/projects/singlecell-qtl/browser/browser.db') as conn:
  pd.concat([tf_res, np_res], axis=0, join='inner').to_sql(name='simulation', con=conn, index=False, if_exists='replace')
  conn.execute('create index ix_simulation on simulation(log_mu, log_phi, logodds);')

Fit ZINB2

Read the data.

keep_samples = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/quality-single-cells.txt', index_col=0, header=None)
keep_genes = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/genes-pass-filter.txt', index_col=0, header=None)
annotations = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-annotation.txt')
annotations = annotations.loc[keep_samples.values.ravel()]
header = sorted(set(annotations['chip_id']))
umi = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-counts.txt.gz', index_col=0).loc[keep_genes.values.ravel(),keep_samples.values.ravel()]
index = umi.index

Prepare the design matrix of covariates.

onehot = recode(annotations, 'chip_id')

designs = []

# Null covariate model
designs.append(np.zeros((onehot.shape[0], 1)))

chip = recode(annotations, 'experiment')
chip -= chip.mean(axis=0)
designs.append(chip)

# These explain most PVE of circular pseudotime (Joyce Hsiao, personal
# communication)
cell_cycle_genes = [
  'ENSG00000094804', # CDC6
  'ENSG00000170312', # CDK1
  'ENSG00000175063', # UBE2C
  'ENSG00000131747', # TOP2A
  'ENSG00000197061', # HIST1H4C
  ]
cell_cycle = (umi.loc[cell_cycle_genes].values / annotations['mol_hs'].values).reshape(-1, len(cell_cycle_genes))
cell_cycle -= cell_cycle.mean(axis=0)
cell_cycle /= cell_cycle.std(axis=0)
designs.append(cell_cycle)

designs.append(np.concatenate([chip, cell_cycle], axis=1))
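The recode helper is tangled in from <<recode-impl>> and not shown here; a minimal sketch consistent with how it is used (one indicator column per sorted level of the annotation, matching how header is constructed above) is:

def recode(annotations, key):
  """Return a one-hot indicator matrix (cells x levels of annotations[key])

  Sketch of the assumed behavior of <<recode-impl>>: columns follow
  sorted(set(annotations[key]))
  """
  levels = sorted(set(annotations[key]))
  onehot = np.zeros((annotations.shape[0], len(levels)))
  for j, level in enumerate(levels):
    onehot[(annotations[key] == level).values, j] = 1
  return onehot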

Estimate the parameters of the zero-inflated model, assuming a separate dropout probability for each individual and gene.

<<zinb-imports>>
<<tf-imports>>
import argparse
<<recode-impl>>

parser = argparse.ArgumentParser()
parser.add_argument('--design', help='Design matrix of confounders', choices=list(range(2)), type=int)
args = parser.parse_args()
<<read-data-qc-impl>>
<<prepare-covars>>

umi = umi.T.astype(np.float32)
onehot = onehot.astype(np.float32)
design = designs[args.design].astype(np.float32)
size_factor = annotations['mol_hs'].astype(np.float32).values.reshape(-1, 1)
init = scqtl.tf.fit(
  umi=umi,
  onehot=onehot,
  design=design,
  size_factor=size_factor,
  learning_rate=1e-3,
  max_epochs=30000,
  verbose=True,
)
log_mu, log_phi, logodds, nb_nll, zinb_nll, beta = scqtl.tf.fit(
  umi=umi,
  onehot=onehot,
  design=design,
  size_factor=size_factor,
  learning_rate=1e-3,
  warm_start=init[:3],
  max_epochs=30000,
  return_beta=True,
  verbose=True,
)
pd.DataFrame(log_mu.T, index=index, columns=header).to_csv('zi2-log-mu.txt.gz', sep=' ', compression='gzip')
pd.DataFrame(log_phi.T, index=index, columns=header).to_csv('zi2-log-phi.txt.gz', sep=' ', compression='gzip')
pd.DataFrame(logodds.T, index=index, columns=header).to_csv('zi2-logodds.txt.gz', sep=' ', compression='gzip')
pd.DataFrame(beta.T, index=index).to_csv('beta.txt.gz', sep=' ', compression='gzip')
sbatch --partition=gpu --gres=gpu:1 --mem=16G --time=7:00:00 -a 1 --job-name=tf-zinb --output=zinb2.out
#!/bin/bash
source activate scqtl
mkdir -p design$SLURM_ARRAY_TASK_ID
pushd design$SLURM_ARRAY_TASK_ID
python /project2/mstephens/aksarkar/projects/singlecell-qtl/code/tf-zinb.py --design $SLURM_ARRAY_TASK_ID
Submitted batch job 55727034

sbatch --partition=broadwl --time=10
#!/bin/bash
cat >.rsync-filter <<EOF
+ */
+ *.txt.gz
- *
EOF
rsync -FFau --delete . /project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/
Submitted batch job 55767365

numpy/scipy implementation

Optimize the negative log-likelihood.

def log(x):
  """Numerically safe log"""
  return np.log(x + 1e-8)

def sigmoid(x):
  """Numerically safe sigmoid"""
  lim = np.log(np.finfo(np.float64).resolution)
  return np.clip(sp.expit(x), lim, -lim)

def nb(theta, x, size, onehot, design):
  """Return the per-data point log likelihood

  x ~ Poisson(size .* design' * theta[2 * m:k] * exp(onehot * theta[:m]) * u)
  u ~ Gamma(exp(onehot * theta[m:2 * m]), exp(onehot * theta[m:2 * m]))

  theta - (2 * m + k, 1)
  x - (n, 1)
  size - (n, 1)
  onehot - (n, m)
  design - (n, k)

  """
  n, m = onehot.shape
  assert x.shape == (n,)
  assert size.shape == (n,)
  assert design.shape[0] == n
  assert theta.shape == (2 * m + design.shape[1],)
  mean = size * np.exp(onehot.dot(theta[:m]) + design.dot(theta[2 * m:]))
  assert mean.shape == (n,)
  inv_disp = onehot.dot(np.exp(theta[m:2 * m]))
  assert inv_disp.shape == (n,)
  return (x * log(mean / inv_disp) -
          x * log(1 + mean / inv_disp) -
          inv_disp * log(1 + mean / inv_disp) +
          sp.gammaln(x + inv_disp) -
          sp.gammaln(inv_disp) -
          sp.gammaln(x + 1))

def _nb(theta, x, size, onehot, design=None):
  """Return the mean negative log likelihood of x"""
  return -nb(theta, x, size, onehot, design).mean()

def zinb(theta, x, size, onehot, design=None):
  """Return the mean negative log likelihood of x"""
  n, m = onehot.shape
  logodds, theta = theta[:m], theta[m:]
  case_non_zero = -np.log1p(np.exp(onehot.dot(logodds))) + nb(theta, x, size, onehot, design)
  case_zero = np.logaddexp(onehot.dot(logodds - np.log1p(np.exp(logodds))), case_non_zero)
  return -np.where(x < 1, case_zero, case_non_zero).mean()

def _fit_gene(chunk, onehot, design=None):
  n, m = onehot.shape
  assert chunk.shape[0] == n
  # We need to take care here to initialize mu=-inf for all zero observations
  x0 = np.log((onehot * chunk[:,:1]).sum(axis=0) / onehot.sum(axis=0)) - np.log(np.ma.masked_values(onehot, 0) * chunk[:,1:]).mean(axis=0).compressed()
  x0 = np.hstack((x0, np.zeros(m)))
  if design is not None:
    assert design.shape[0] == n
    design -= design.mean(axis=0)
    x0 = np.hstack((x0, np.zeros(design.shape[1])))
  res0 = so.minimize(_nb, x0=x0, args=(chunk[:,0], chunk[:,1], onehot, design))
  res = so.minimize(zinb, x0=list(np.zeros(m)) + list(res0.x), args=(chunk[:,0], chunk[:,1], onehot, design))
  if res0.fun < res.fun:
    # This isn't a likelihood ratio test. Numerically, our implementation of
    # ZINB can't represent pi = 0, so we need to use a separate implementation
    # for it
    log_mu = res0.x[:m]
    neg_log_phi = res0.x[m:2 * m]
    logit_pi = np.zeros(m)
    logit_pi.fill(-np.inf)
  else:
    logit_pi = res.x[:m]
    log_mu = res.x[m:2 * m]
    neg_log_phi = res.x[2 * m:3 * m]
  mean_by_sample = chunk[:,1] * onehot.dot(np.exp(log_mu))
  var_by_sample = mean_by_sample + np.square(mean_by_sample) * onehot.dot(np.exp(-neg_log_phi))
  mean_by_ind = np.ma.masked_equal(onehot * mean_by_sample.reshape(-1, 1), 0).mean(axis=0).filled(0)
  var_by_ind = np.ma.masked_equal(onehot * (np.square(mean_by_sample - onehot.dot(mean_by_ind)) + var_by_sample).reshape(-1, 1), 0).mean(axis=0).filled(0)
  return [log_mu, -neg_log_phi, logit_pi, mean_by_ind, var_by_ind]

def fit_gene(chunk, bootstraps=100):
  orig = _fit_gene(chunk)
  B = []
  for _ in range(bootstraps):
    B.append(_fit_gene(chunk[np.random.choice(chunk.shape[0], chunk.shape[0], replace=True)]))
  se = np.array(B)[:,:2].std(axis=0)
  return orig + list(se.ravel())

Computing analytic SE runs into numerical problems.

def _pois(theta, x, size):
  mean = np.exp(theta)
  mean *= size
  return (x * log(mean) - mean - sp.gammaln(x + 1)).mean()

def _pois_jac(theta, x, size):
  mean = np.exp(theta)
  return mean * (x / mean - size).mean()

def _nb_jac(theta, x, size):
  mean, inv_disp = np.exp(theta)
  T = (1 + size * mean / inv_disp)
  return mean * (x / mean - size / inv_disp * (x + inv_disp) / T).mean()

def check_gradients(x, f, df, args=(), num_trials=100):
  x = np.array(x)
  y = f(x, *args)
  analytic_diff = df(x, *args)
  error = []
  for i in range(num_trials):
    eps = np.random.normal(scale=1e-4, size=x.shape)
    num_diff = (f(x + eps, *args) - y) / eps
    error.append(abs(num_diff - analytic_diff))
  return np.array(error)
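A hypothetical usage, checking the analytic Poisson gradient against forward differences at the true parameter (a scalar size factor keeps the broadcasting in _pois simple):

counts = np.random.poisson(lam=10, size=100).astype(float)
# theta = ln(mean rate); the returned array holds |numerical - analytic|
# differences over random finite-difference perturbations
err = check_gradients(np.log(10.), _pois, _pois_jac, args=(counts, 1.), num_trials=10)
err.mean()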

Simulation

Check the parameter estimation on simulated data.

Assuming simulated confounders \(x\) are isotropic Gaussian, we can derive the scale of \(\beta\) to achieve a specified fold-change in relative abundance:

\[ x \sim N(0, 1) \]

Letting \(\tau\) denote precision (the second parameter of each normal below is read as a precision):

\[ \beta \sim N(0, \tau) \]

\[ x\beta \sim N(0, 1 + \tau) \]

\[ \mathbb{E}[\exp(x\beta)] = y = \exp\left(\frac{1}{2 (1 + \tau)}\right) \]

\[ \tau = \frac{1 - 2 \ln y}{2 \ln y} \]

def simulate(num_samples, size=None, log_mu=None, log_phi=None, logodds=None, seed=None, design=None, fold=None):
  if seed is None:
    seed = 0
  np.random.seed(seed)
  if log_mu is None:
    log_mu = np.random.uniform(low=-12, high=-8)
  if log_phi is None:
    log_phi = np.random.uniform(low=-6, high=0)
  if size is None:
    size = 1e5
  if logodds is None:
    prob = np.random.uniform()
  else:
    prob = sp.expit(logodds)
  if design is None:
    design = np.random.normal(size=(num_samples, 1))
  else:
    assert design.shape[0] == num_samples
  if fold is None or np.isclose(fold, 1):
    beta = np.array([[0]])
  else:
    assert fold > 1
    beta = np.random.normal(size=(design.shape[1], 1), scale=2 * np.log(fold) / (1 - 2 * np.log(fold)))

  n = np.exp(-log_phi)
  p = 1 / (1 + size * np.exp(log_mu + design.dot(beta) + log_phi)).ravel()
  x = np.where(np.random.uniform(size=num_samples) < prob,
               0,
               np.random.negative_binomial(n=n, p=p, size=num_samples))
  return np.vstack((x, size * np.ones(num_samples))).T, design
def batch_design_matrix(num_samples, num_batches):
  """Return a matrix of binary indicators representing batch assignment"""
  design = np.zeros((num_samples, num_batches))
  design[np.arange(num_samples), np.random.choice(num_batches, size=num_samples)] = 1
  return design

def evaluate(num_samples, num_mols, log_mu, log_phi, logodds, fold, trial):
  x, design = simulate(num_samples=num_samples, size=num_mols,
                       log_mu=log_mu, log_phi=log_phi,
                       logodds=logodds, design=None, fold=fold, seed=trial)
  onehot = np.ones((num_samples, 1))
  keys = ['num_samples', 'num_mols', 'log_mu', 'log_phi', 'logodds', 'trial',
          'fold', 'log_mu_hat', 'log_phi_hat', 'logodds_hat', 'mean', 'var']
  result = [num_samples, num_mols, log_mu, log_phi, logodds, trial, fold] + [param[0] for param in _fit_gene(x, onehot, design)]
  result = {k: v for k, v in zip(keys, result)}
  eps = .5 / num_mols
  log_cpm = (np.log(np.ma.masked_values(x[:,0], 0) + eps) -
             np.log(x[:,1] + 2 * eps) +
             6 * np.log(10)).compressed()
  result['mean_log_cpm'] = log_cpm.mean()
  result['var_log_cpm'] = log_cpm.var()
  d, p = diagnostic_test(x[:,0],
                         np.atleast_1d(result['log_mu_hat']),
                         np.atleast_1d(result['log_phi_hat']),
                         np.atleast_1d(-result['logodds_hat']),
                         num_mols,
                         onehot)
  result['ks_d'] = d
  result['ks_p'] = p
  return result

Check that the implementation actually works.

x1, design1 = simulate(num_samples=1000, size=1e5, log_mu=-8, log_phi=-6, logodds=-3, seed=0, design=batch_design_matrix(1000, 2), fold=1.1)
x2, design2 = simulate(num_samples=1000, size=1e5, log_mu=-9, log_phi=-6, logodds=-3, seed=0, design=batch_design_matrix(1000, 2), fold=1.1)
x = np.vstack((x1, x2))
design = np.vstack((design1, design2))
onehot = np.zeros((2000, 2))
onehot[:1000,0] = 1
onehot[1000:,1] = 1
so.minimize(_nb, np.zeros(6), (x[:,0], x[:,1], onehot, design - design.mean(axis=0)))
fun: 3.7693708861600173
hess_inv: array([[ 2.95492914e-01, -2.26820271e-02, -6.88886930e-02,
3.66568258e-02, -1.35008739e-02,  1.33990791e-02],
[-2.26820271e-02,  2.76704513e-01,  4.55879587e-03,
6.39326308e-02, -2.03557128e-02,  2.03675939e-02],
[-6.88886930e-02,  4.55879587e-03,  8.40491928e+00,
3.47009715e-01,  1.14991770e-02, -1.17004131e-02],
[ 3.66568258e-02,  6.39326308e-02,  3.47009715e-01,
1.93344347e+01, -8.18164095e-03,  6.19982324e-03],
[-1.35008739e-02, -2.03557128e-02,  1.14991770e-02,
-8.18164095e-03,  6.45950264e-01,  3.54059802e-01],
[ 1.33990791e-02,  2.03675939e-02, -1.17004131e-02,
6.19982324e-03,  3.54059802e-01,  6.45930363e-01]])
jac: array([ 6.82473183e-06,  8.34465027e-06,  1.07288361e-06,  8.34465027e-07,
3.75509262e-06, -4.08291817e-06])
message: 'Optimization terminated successfully.'
nfev: 280
nit: 30
njev: 35
status: 0
success: True
x: array([-7.79891405, -8.79063513,  2.05714587,  2.56621654,  0.16492975,
-0.16494031])
so.minimize(zinb, np.zeros(8), (x[:,0], x[:,1], onehot, design - design.mean(axis=0)))
fun: 3.1120455141161147
hess_inv: array([[ 3.91509877e+01, -6.82255452e+00,  3.56651927e-03,
-3.60544664e-03,  1.89769061e-03, -3.67396633e-05,
-5.05335375e-03,  2.07017842e-03],
[-6.82255452e+00,  3.63677578e+01, -2.89665006e-03,
1.89766982e-03, -1.15515147e-03,  1.19995032e-04,
4.03751660e-03, -1.28881738e-03],
[ 3.56651927e-03, -2.89665006e-03,  1.00060484e-05,
-4.75805864e-06, -1.52866758e-06, -9.78212511e-07,
-7.93108063e-09, -9.33841063e-06],
[-3.60544664e-03,  1.89766982e-03, -4.75805864e-06,
9.43027846e-06, -1.92946942e-08, -1.76906831e-07,
-3.04179309e-07,  5.35291550e-06],
[ 1.89769061e-03, -1.15515147e-03, -1.52866758e-06,
-1.92946942e-08,  2.96476652e-06,  1.16116940e-06,
-4.91977746e-06,  2.37359794e-06],
[-3.67396633e-05,  1.19995032e-04, -9.78212511e-07,
-1.76906831e-07,  1.16116940e-06,  1.48628846e-06,
-1.36236523e-06,  7.97280863e-07],
[-5.05335375e-03,  4.03751660e-03, -7.93108063e-09,
-3.04179309e-07, -4.91977746e-06, -1.36236523e-06,
1.60495775e-05, -3.42044027e-06],
[ 2.07017842e-03, -1.28881738e-03, -9.33841063e-06,
5.35291550e-06,  2.37359794e-06,  7.97280863e-07,
-3.42044027e-06,  1.96937195e-05]])
jac: array([-2.98023224e-07,  7.15255737e-07,  9.99987125e-04, -3.08364630e-04,
5.21874428e-03,  7.71874189e-03,  1.99797750e-03,  9.57250595e-05])
message: 'Desired error not necessarily achieved due to precision loss.'
nfev: 1942
nit: 80
njev: 193
status: 2
success: False
x: array([-3.07858135, -3.07854889, -7.75387174, -8.74631249, 12.91019024,
13.26839552,  0.04538458, -0.28914969])
_fit_gene(x, onehot)
[array([-7.74001987, -8.73200364]),
array([-3.50778747, -3.59795896]),
array([-3.07854886, -3.07859407]),
array([43.50629319, 16.13388663]),
array([100.22044275,  23.26084595])]
_fit_gene(x, onehot, design)
[array([-7.74986202, -8.7418084 ]),
array([-6.82644387, -6.09474886]),
array([-3.0785687 , -3.07861418]),
array([43.08019804, 15.97647082]),
array([45.09331255, 16.55197158])]

Check what happens on all-zero data.

x = np.concatenate((np.zeros((1000, 1)), 1e5 * np.ones((1000, 1))), axis=1)
onehot = np.ones((1000, 1))
np.log((onehot * x[:,:1]).sum(axis=0) / onehot.sum(axis=0)) - np.log(np.ma.masked_values(onehot, 0) * x[:,1:]).mean(axis=0).compressed()
array([-inf])

so.minimize(_nb, x0=(-np.inf, 0), args=(x[:,0], x[:,1], onehot))
fun: 9.999999889225288e-09
hess_inv: array([[1, 0],
[0, 1]])
jac: array([0.00000000e+00, 1.00000003e-08])
message: 'Optimization terminated successfully.'
nfev: 4
nit: 0
njev: 1
status: 0
success: True
x: array([-inf,   0.])
_fit_gene(x, onehot)
[array([-inf]), array([-0.]), array([0.]), array([0.]), array([0.])]

_fit_gene(x, onehot, design)
[array([-7.75254245, -8.74529525]),
array([-7.38635243, -6.45950176]),
array([-3.07849171, -3.07866706]),
array([42.9648789 , 15.92086032]),
array([44.10874472, 16.3176927 ])]

Check the end-to-end evaluation.

evaluate(num_samples=500, num_mols=1e5, log_mu=-8, log_phi=-6, logodds=-3, fold=None, trial=1)
{'fold': None,
'ks_d': 0.02216464461580947,
'ks_p': 9.23381354739375e-22,
'log_mu': -8,
'log_mu_hat': -7.995410361780751,
'log_phi': -6,
'log_phi_hat': -6.032810431842001,
'logodds': -3,
'logodds_hat': -2.9873654919335273,
'mean': 33.700581863533316,
'mean_log_cpm': 5.803550886989237,
'num_mols': 100000.0,
'num_samples': 500,
'trial': 1,
'var': 36.42490436676706,
'var_log_cpm': 0.03387162521847725}
%timeit evaluate(num_samples=5000, num_mols=1e5, log_mu=-8, log_phi=-6, logodds=-3, fold=1.1, trial=0)

1.41 s ± 449 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Investigate what happens as the number of confounders increases.

design = np.random.normal(size=(300, 20))
x, _ = simulate(num_samples=300, size=1e5, log_mu=-8, log_phi=-6, logodds=-3, seed=0, design=design, fold=1.1)
_fit_gene(x, design)
[-8.00649125343069,
-6.410804905890429,
-2.99891159295469,
33.32921072894424,
35.15509338485383]

Run the simulation on 28 CPUs.

<<zinb-imports>>
import multiprocessing as mp
import sqlite3
<<np-zinb-impl>>
<<sim-impl>>
<<np-eval-impl>>
<<zinb-diagnostic>>
args = [(num_samples, num_mols, log_mu, log_phi, logodds, fold, trial)
        for num_samples in (95,)
        for num_mols in (114026,)
        for log_mu in np.linspace(-12, -6, 7)
        for log_phi in np.linspace(-6, 0, 7)
        for logodds in np.linspace(-3, 3, 7)
        for fold in np.linspace(1, 1.25, 6)
        for trial in range(10)]
with mp.Pool() as pool:
  result = pd.DataFrame.from_dict(pool.starmap(evaluate, args))
result.to_csv('np-simulation.txt.gz', compression='gzip', sep='\t')
sbatch --partition=mstephens --mem=8G --job-name sim -n1 -c16 --time=1:00:00 --out np-sim.out
#!/bin/bash
source activate scqtl
python /project2/mstephens/aksarkar/projects/singlecell-qtl/code/sim.py
Submitted batch job 55019435

Parameter distributions

The simulation reveals that the method has undesirable behavior when the proportion of zeros is too large and the mean is too small.

Read the estimated parameters.

log_mu = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-mu.txt.gz', comment='g', index_col=0, header=None, sep=' ')
log_phi = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-phi.txt.gz', comment='g', index_col=0, header=None, sep=' ')
logodds = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-logodds.txt.gz', comment='g', sep=' ', header=None, index_col=0)

Look at the joint distribution.

J = (log_mu.agg(np.mean, axis=1).to_frame()
     .merge(log_phi.agg(np.mean, axis=1).to_frame(), left_index=True, right_index=True)
     .rename(columns={'0_x': 'log_mu', '0_y': 'log_phi'})
     .merge(logodds.agg(np.mean, axis=1).to_frame(), left_index=True, right_index=True)
     .rename(columns={0: 'logodds'}))
J.head()
log_mu   log_phi    logodds
0
ENSG00000000003  -9.548541 -2.684409 -16.055537
ENSG00000000419  -9.940811 -3.439683 -14.967152
ENSG00000000457 -12.600461 -2.165913  -7.902089
ENSG00000000460 -11.135371 -3.090537 -10.376486
ENSG00000001036 -11.127939 -2.941718  -9.684571
plt.clf()
fig, ax = plt.subplots(2, 2)
fig.set_size_inches(6, 6)

ax[0, 0].scatter(J['log_mu'], J['log_phi'], c='k', s=2, alpha=0.25)
ax[0, 0].set_xlabel('$\ln(\mu)$')
ax[0, 0].set_ylabel('$\ln(\phi)$')

ax[1, 0].scatter(J['log_mu'], J['logodds'], c='k', s=2, alpha=0.25)
ax[1, 0].set_xlabel('$\ln(\mu)$')
ax[1, 0].set_ylabel('$\mathrm{logit}(\pi)$')

ax[0, 1].scatter(J['logodds'], J['log_phi'], c='k', s=2, alpha=0.25)
ax[0, 1].set_xlabel('$\mathrm{logit}(\pi)$')
ax[0, 1].set_ylabel('$\ln(\phi)$')

ax[1, 1].axis('off')

fig.tight_layout()

joint-distribution.png

Effect of confounding

Estimate the proportion of variance explained by confounders as the average reduction in heterogeneity (residual variance).

log_phi0 = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design0/zi2-log-phi.txt.gz', index_col=0, sep=' ')
log_phi1 = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-phi.txt.gz', index_col=0, sep=' ')
1 - np.exp(log_phi1 - log_phi0).mean().mean()
-1.90349049788247

Estimate how much the mean changes due to confounding.

log_mu0 = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design0/zi2-log-mu.txt.gz', index_col=0, sep=' ')
log_mu1 = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-mu.txt.gz', index_col=0, sep=' ')
np.exp(log_mu1 - log_mu0).describe().loc['mean'].describe()
count    54.000000
mean      1.008531
std       0.171844
min       0.894419
25%       0.964850
50%       0.987312
75%       1.008456
max       2.213495
Name: mean, dtype: float64

Comparison with sample moment-based estimators

Mean expression

Compute pseudobulk relative abundance. Important: keep the \(-\infty\) values (genes with zero pooled counts) rather than dropping them.

pooled_log_mu = np.log(umi.groupby(annotations['chip_id'].values, axis=1).agg(np.sum)) - np.log(annotations.groupby('chip_id')['mol_hs'].agg(np.sum))
pooled_log_rho = pooled_log_mu - sp.logsumexp(pooled_log_mu, axis=0)
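Explicitly, writing \(\hat\mu_{ik}\) for the pooled estimate for individual \(i\) and gene \(k\), the normalization (computed stably with logsumexp over genes) is:

\[ \ln \hat\rho_{ik} = \ln \hat\mu_{ik} - \ln \sum_{k'} \hat\mu_{ik'} \]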

To first order (by a Taylor expansion of the logarithm),

\[ \mathbb{E}[\ln r] \approx \ln \mathbb{E}[r] \]

so the average log CPM over cells approximates the log of the mean CPM.

# Follow edgeR
libsize = annotations['mol_hs'].values
eps = .5 * libsize / libsize.mean()
log_cpm = (np.log(umi + eps) - np.log(libsize + 2 * eps) + 6 * np.log(10)) / np.log(2)
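Written out, the prior-count log CPM computed above is

\[ \log_2 \mathrm{CPM}_{ijk} = \log_2\left(\frac{r_{ijk} + \epsilon_{ij}}{R_{ij} + 2\epsilon_{ij}} \times 10^6\right), \qquad \epsilon_{ij} = \frac{R_{ij}}{2\bar{R}} \]

where \(\bar{R}\) is the mean library size over cells.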

CPM is proportional to relative abundance, so normalize.

cpm_log_mu = log_cpm.groupby(annotations['chip_id'].values, axis=1).agg(np.mean)
cpm_log_rho = cpm_log_mu - sp.logsumexp(cpm_log_mu, axis=0)
nonzero_cpm_log_mu = log_cpm.mask(umi == 0).groupby(annotations['chip_id'].values, axis=1).agg(np.mean).dropna()
nonzero_cpm_log_rho = nonzero_cpm_log_mu - sp.logsumexp(nonzero_cpm_log_mu, axis=0)
zinb_log_mu = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-mu.txt.gz', sep=' ', index_col=0)
zinb_logodds = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-logodds.txt.gz', sep=' ', index_col=0)
# Important: log(sigmoid(x)) = -softplus(-x)
zinb_log_mu -= np.log1p(np.exp(zinb_logodds))
zinb_log_rho = zinb_log_mu - sp.logsumexp(zinb_log_mu, axis=0)

Construct a DataFrame for convenience.

log_rho = pd.DataFrame({'Pooled': pooled_log_rho['NA18507'],
                        'ZINB': zinb_log_rho['NA18507'],
                        'Nonzero CPM': nonzero_cpm_log_rho['NA18507'],
                        'CPM': cpm_log_rho['NA18507']})[['ZINB', 'Pooled', 'Nonzero CPM', 'CPM']]
plt.clf()
N = log_rho.shape[1]
fig, ax = plt.subplots(N, N)
fig.set_size_inches(8, 8)
for y in range(N):
  ax[y, 0].set_ylabel('{}'.format(log_rho.columns[y]))
  for x in range(N):
    if y < x:
      ax[y, x].set_axis_off()
    else:
      ax[y, x].scatter(log_rho.iloc[:, x], log_rho.iloc[:, y], c='k', s=2, alpha=0.1)
      ax[y, x].plot(ax[y, x].get_xlim(), ax[y, x].get_xlim(), c='r', ls=':', lw=1)
      ax[y, x].text(.05, .95, '$r$ = {:.2g}'.format(st.mstats.spearmanr(log_rho.iloc[:, x], log_rho.iloc[:, y]).correlation), transform=ax[y, x].transAxes, verticalalignment='top')
for x in range(N):
  ax[-1, x].set_xlabel('{}'.format(log_rho.columns[x]))
fig.tight_layout()

cpm-estimator-mu.png

Expression noise

Read the data.

keep_samples = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/quality-single-cells.txt', index_col=0, header=None)
annotations = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-annotation.txt')
annotations = annotations.loc[keep_samples.values.ravel()]
umi = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-counts.txt.gz', index_col=0)
zeros_pass = umi.loc[:,keep_samples.values.ravel()].agg(np.sum, axis=1) > 0
umi = umi.loc[zeros_pass,keep_samples.values.ravel()]

Look at NA18507 (the individual with the most cells), and also over all individuals.

keep_ind = (annotations['chip_id'] == 'NA18507').values.ravel()

Look at the Fano factor vs. the mean, following Munsky et al. 2013.
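For a negative binomial count with mean \(m\) and dispersion \(\phi\), the variance is \(m + \phi m^2\), so the Fano factor is

\[ \frac{V}{m} = 1 + \phi m \]

The reference lines drawn by plot_fano_vs_mean below correspond to different values of \(\phi\).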

mean = umi.agg(np.mean, axis=1)
var = umi.agg(np.var, axis=1)
ind_mean = umi.loc[:,keep_ind].agg(np.mean, axis=1)
ind_var = umi.loc[:,keep_ind].agg(np.var, axis=1)
def plot_fano_vs_mean(mean, var, ax, title, method, ref=True):
  ax.semilogx()
  ax.semilogy()
  ax.scatter(mean, var / mean, c='k', s=2, zorder=0, alpha=0.25)
  lim = [.9 * mean.min(), 1.1 * mean.max()]
  grid = np.geomspace(lim[0], lim[1], 200)
  for phi in np.linspace(.2, 1, 5):
    if ref:
      ax.plot(grid, 1 + phi * np.array(grid), lw=1, c=colorcet.cm['bmy'](phi), zorder=1, label='{:.2g}'.format(phi))
  ax.set_xlim(lim)
  ax.set_ylim(1, 300)
  ax.set_xlabel('{} mean'.format(method))
  ax.set_ylabel('{} Fano factor'.format(method))
  ax.set_title(title)
plt.clf()
fig, ax = plt.subplots(1, 2)
fig.set_size_inches(8, 4)
plot_fano_vs_mean(ind_mean[ind_mean > 0], ind_var[ind_mean > 0], ax[0], 'NA18507', 'Sample')
plot_fano_vs_mean(mean, var, ax[1], 'Over 53 individuals', 'Sample')
ax[1].legend(title='Overdispersion', frameon=False, fancybox=False, loc='center left', bbox_to_anchor=(1, 0.5))
fig.tight_layout()

count-fano-vs-count-mean.png

ZINB estimates of expression noise

Read the estimated parameters.

log_mu = pd.read_table("/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-mu.txt.gz", index_col=0, sep=' ')
log_phi = pd.read_table("/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-phi.txt.gz", index_col=0, sep=' ')
logodds = pd.read_table("/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-logodds.txt.gz", index_col=0, sep=' ')

Read the annotations.

keep_samples = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/quality-single-cells.txt', index_col=0, header=None)
annotations = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-annotation.txt')
annotations = annotations.loc[keep_samples.values.ravel()]
onehot = recode(annotations, 'chip_id')

Fixing individual \(i\), cell \(j\), gene \(k\), we have:

\[ E[r_{ijk}] = (1 - \pi_{ik}) R_{ij} \mu_{ik} \]

\[ V[r_{ijk}] = (1 - \pi_{ik})\left(R_{ij} \mu_{ik} + (R_{ij} \mu_{ik})^2 \phi_{ik}\right) + \pi_{ik} (1 - \pi_{ik}) (R_{ij} \mu_{ik})^2 \]

mean_by_sample = (annotations['mol_hs'].values.reshape(-1, 1) * onehot).dot(np.exp(log_mu - np.log1p(np.exp(logodds))).T)
# Nonzero component
var_by_sample = mean_by_sample + np.square(mean_by_sample) * np.exp(onehot.dot(log_phi.values.T))
var_by_sample *= onehot.dot(sp.expit(-logodds.T))
var_by_sample += onehot.dot(sp.expit(logodds.T) * np.exp(log_mu.T)) * mean_by_sample

The index of dispersion for observed data \(r_{ijk}\) at gene \(k\) is:

\[ D_k = \frac{V[r_{ijk}]}{E[r_{ijk}]} \]

where expectations (variances) are taken over individuals \(i\) and cells \(j\).

Let \(g_{ijk}\) denote the zero-inflated negative binomial density as defined above. Then, pooling over all \(N\) cells (individuals \(i\) and cells \(j\)) at gene \(k\), we have:

\[ r_{\cdot\cdot k} \sim \frac{1}{N} \sum_{ij} g_{ijk}(\cdot) \]

The mixture density has expectation:

\[ \mu_k = \frac{1}{N} \sum E[r_{ijk}] \]

and variance (Frühwirth-Schnatter 2006):

\[ \sigma^2_k = \frac{1}{N} \sum \left[ (E[r_{ijk}] - \mu_k)^2 + V[r_{ijk}] \right] \]

mean_by_ind = onehot.T.dot(mean_by_sample) / onehot.T.sum(axis=1, keepdims=True)
var_by_ind = onehot.T.dot(np.square(mean_by_sample - onehot.dot(mean_by_ind)) + var_by_sample) / onehot.T.sum(axis=1, keepdims=True)
overall_mean = mean_by_sample.mean(axis=0)
overall_var = (np.square(mean_by_sample - overall_mean.reshape(1, -1)) + var_by_sample).mean(axis=0)

Find outliers.

gene_info = (pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-genes.txt.gz')
             .set_index('gene')
             .query('source == "H. sapiens"')
             .query('chr != "hsX"')
             .query('chr != "hsY"')
             .query('chr != "hsMT"'))
T = pd.DataFrame({'mean': overall_mean, 'var': overall_var})
T.index = log_mu.index
J = T.merge(gene_info, left_index=True, right_index=True).sort_values('mean', ascending=False)
J[J['var'] / J['mean'] > 150][['mean', 'var', 'name']]
mean          var      name
gene
ENSG00000110713  37.779102  9160.566673     NUP98
ENSG00000053438   2.454245  1119.916885      NNAT
ENSG00000173401   0.692929   327.169561  GLIPR1L1
ENSG00000213380   0.630846   323.477680      COG8
ENSG00000120690   0.559411   324.580801      ELF1
ENSG00000137502   0.500265   760.557951     RAB30
ENSG00000127080   0.415653   467.467892      IPPK
ENSG00000000457   0.403754   777.369613     SCYL3
ENSG00000185818   0.394969   465.957390     NAT8L
ENSG00000145016   0.391726   772.221129  KIAA0226
ENSG00000099617   0.388581   751.477815     EFNA2
ENSG00000112357   0.353050   746.942494      PEX7

Plot Fano factor vs. mean.

plt.clf()
fig, ax = plt.subplots(1, 2)
fig.set_size_inches(8, 4)
k = sorted(set(annotations['chip_id'])).index('NA18507')
plot_fano_vs_mean(mean_by_ind[k], var_by_ind[k], ax[0], 'NA18507', 'ZINB')
plot_fano_vs_mean(overall_mean, overall_var, ax[1], 'Over 53 individuals', 'ZINB')
ax[1].legend(title='Overdispersion', frameon=False, fancybox=False, loc='center left', bbox_to_anchor=(1, 0.5))

fig.tight_layout()

zinb-fano-vs-zinb-mean.png

Compare estimates against each other.

S, T = pd.Series(overall_var / overall_mean, index=log_mu.index).align(var / mean, join='inner')
plt.clf()
plt.gcf().set_size_inches(4, 4)
plt.semilogx()
plt.semilogy()
lim = [.9 * S.min(), 1.1 * S.max()]
plt.scatter(S, T, c='k', s=2, alpha=0.25)
plt.text(.05, .95, 'r = {:.2g}'.format(st.spearmanr(S, T).correlation), verticalalignment='top', transform=plt.gca().transAxes)
plt.plot(lim, lim, c='r', ls=':', lw=1)
plt.xlim(lim)
plt.ylim(lim)
plt.xlabel('ZINB Fano factor')
plt.ylabel('Sample Fano factor')
Text(0,0.5,'Sample Fano factor')

count-fano-vs-zinb-fano.png

Examples

log_mu = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-mu.txt.gz', index_col=0, sep=' ')
log_phi = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-log-phi.txt.gz', index_col=0, sep=' ')
logodds = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design1/zi2-logodds.txt.gz', sep=' ', index_col=0)
umi = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-counts.txt.gz', index_col=0)
annotations = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-annotation.txt')
keep_samples = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/quality-single-cells.txt', index_col=0, header=None)
keep_genes = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/genes-pass-filter.txt', index_col=0, header=None)
umi = umi.loc[keep_genes.values.ravel(),keep_samples.values.ravel()]
annotations = annotations.loc[keep_samples.values.ravel()]
%config InlineBackend.figure_formats = set(['svg'])
plt.clf()
fig, ax = plt.subplots(1, 2, sharey=True)
fig.set_size_inches(4, 2)
gene = "ENSG00000243709"
for a, k in zip(ax, ('NA18507', 'NA19204')):
  x = umi.loc[gene, (annotations['chip_id'] == k).values]
  y = annotations[annotations['chip_id'] == k]
  grid = np.arange(x.max())
  a.hist(x, color='.75', bins=grid)
  n = np.exp(-log_phi.loc[gene, k])
  p = 1 / (1 + np.outer(y['mol_hs'], np.exp(log_mu.loc[gene, k] + log_phi.loc[gene, k])))
  G = st.nbinom(n=n.ravel(), p=p.ravel()).pmf
  pmf = np.array([G(x).mean() for x in grid])
  exp_count = x.shape[0] * pmf * sp.expit(-logodds.loc[gene, k])
  a.plot(0.5 + grid, exp_count, c='b', lw=1)
  a.arrow(0.5, 0, 0, x.shape[0] * sp.expit(logodds.loc[gene, k]), width=.01, head_width=.5, head_length=2, color='r')
fig.text(0.5, 0, 'UMI counts', ha='center')
fig.text(0, 0.5, 'Number of cells', va='center', rotation=90)
fig.tight_layout()


Evaluating goodness of fit

To test whether the model actually fits the data, we can use a simple fact: if

\[ r_{ijk} \sim f_{ijk}(\cdot) \]

then

\[ F_{ijk}(r_{ijk}) \sim \mathrm{Uniform}(0, 1) \]

and we can use e.g. the Kolmogorov-Smirnov test to test for departures from uniformity.

We need to account for the fact that the support of \(f\) is discrete, and therefore \(F\) is discontinuous. This can be done by randomizing (Dunn and Smyth 1996, Feng et al. 2017):

\[ a_{ijk} = \lim_{x\rightarrow r_{ijk}^-} F_{ijk}(x) \]

\[ b_{ijk} = F_{ijk}(r_{ijk}) \]

\[ u_{ijk} \sim \mathrm{Uniform}(a_{ijk}, b_{ijk}) \]

Implement the diagnostic test.

def rpp(x, log_mu, log_phi, logodds, size, onehot, n_samples=100):
  n = onehot.dot(np.exp(-log_phi))
  pi0 = onehot.dot(sp.expit(-logodds))
  p = 1 / (1 + (size * onehot.dot(np.exp(log_mu + log_phi))))

  cdf = st.nbinom(n=n, p=p).cdf(x - 1)
  # Important: this excludes the right endpoint, so we need to special case x =
  # 0
  cdf = np.where(x > 0, pi0 + (1 - pi0) * cdf, cdf)
  pmf = st.nbinom(n=n, p=p).pmf(x)
  pmf *= (1 - pi0)
  pmf[x == 0] += pi0[x == 0]
  rpp = cdf + np.random.uniform(size=(n_samples, x.shape[0])) * pmf
  return rpp

def diagnostic_test(x, log_mu, log_phi, logodds, size, onehot, n_samples=100):
  vals = rpp(x, log_mu, log_phi, logodds, size, onehot, n_samples)
  return st.kstest(vals.ravel(), 'uniform')

Simulation

Look at randomized residuals for some simulated data. Use the CPU implementation of ZINB for convenience.

X, design = simulate(num_samples=100, size=1e5, log_mu=-12, log_phi=-6, logodds=-3, seed=1, design=None, fold=None)
res = _fit_gene(X, np.ones((X.shape[0], 1)), design)

Plot the simulated data and estimated density.

n = np.exp(-res[1])
pi0 = sp.expit(res[2])
p = 1 / (1 + (1e5 * np.exp(res[0] + res[1])))

grid = np.arange(X[:,0].max())
pmf = st.nbinom(n=n, p=p).pmf(grid)
pmf *= (1 - pi0)
pmf[0] += pi0

plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.hist(X[:,0], color='k', bins=grid)
plt.plot(.5 + grid, X.shape[0] * pmf, c='r', lw=1)
plt.xlabel('UMI count')
_ = plt.ylabel('Number of cells')

simulated-zinb.png

Plot the QQ plot using randomized quantiles.

Z = rpp(X[:,0], res[0], res[1], -res[2], 1e5, np.ones((X.shape[0], 1)), n_samples=50)
plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.plot(sorted(Z.ravel()), np.linspace(0, 1, np.prod(Z.shape)), c='k')
plt.text(.05, .92, 'ks = {:.2g}'.format(st.kstest(Z.ravel(), 'uniform')[1]), transform=plt.gca().transAxes)
plt.text(.05, .82, r'$\ln\phi = -3$', transform=plt.gca().transAxes)
plt.text(.05, .72, r'$\ln\hat\phi = {:.2g}$'.format(res[1][0]), transform=plt.gca().transAxes)
plt.xlabel('Theoretical quantiles')
plt.ylabel('Observed quantiles')
plt.plot([0, 1], [0, 1], c='r', ls='--')
[<matplotlib.lines.Line2D at 0x7f8b117b7710>]

simulated-rqrs.png

Plot the histogram of randomized quantiles.

Z = rpp(X[:,0], res[0], res[1], -res[2], 1e5, np.ones((X.shape[0], 1)), n_samples=50)
plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.hist(Z.ravel(), density=True, bins=25, color='.7')
plt.axhline(y=1, c='k', ls=':')
plt.xlabel('Randomized quantile')
plt.ylabel('Density')
Text(0,0.5,'Density')

simulated-rqrs-hist.png

See how the randomized quantiles work as \(\pi_0\) changes.

def simulate_nb_diagnostic(pi0=None, seed=0):
  np.random.seed(seed)
  N = 1000
  x = st.nbinom(n=4, p=.5).rvs(N)
  if pi0 is not None:
    y = np.random.uniform(size=N) < pi0
    x[y] = 0
  cdf = st.nbinom(n=4, p=.5).cdf(x - 1)
  pmf = st.nbinom(n=4, p=.5).pmf(x)
  if pi0 is not None:
    cdf = np.where(x > 0, pi0 + (1 - pi0) * cdf, cdf)
    pmf *= (1 - pi0)
    pmf[x == 0] += pi0
  rpp = cdf + np.random.uniform(size=N) * pmf
  _, p1 = st.kstest(rpp, 'uniform')
  nrpp = st.norm().ppf(rpp)
  _, p2 = st.shapiro(nrpp)
  return pi0, seed, p1, p2

pd.DataFrame([simulate_nb_diagnostic(pi0=pi0, seed=trial) for pi0 in (None, 0.01, 0.1) for trial in range(10)], columns=['pi0', 'trial', 'ks', 'sw'])
pi0  trial        ks        sw
0    NaN      0  0.030844  0.145921
1    NaN      1  0.922857  0.579844
2    NaN      2  0.096322  0.995171
3    NaN      3  0.425145  0.748691
4    NaN      4  0.839172  0.158713
5    NaN      5  0.193963  0.305555
6    NaN      6  0.645417  0.229020
7    NaN      7  0.442194  0.935033
8    NaN      8  0.462132  0.280841
9    NaN      9  0.883011  0.262984
10  0.01      0  0.065004  0.665126
11  0.01      1  0.627508  0.902265
12  0.01      2  0.061377  0.783419
13  0.01      3  0.492274  0.544046
14  0.01      4  0.322213  0.120402
15  0.01      5  0.170058  0.108539
16  0.01      6  0.595968  0.637531
17  0.01      7  0.430416  0.812700
18  0.01      8  0.599287  0.258203
19  0.01      9  0.932962  0.191678
20  0.10      0  0.009072  0.263614
21  0.10      1  0.905894  0.899727
22  0.10      2  0.013516  0.702477
23  0.10      3  0.884857  0.659797
24  0.10      4  0.858585  0.516515
25  0.10      5  0.377873  0.372610
26  0.10      6  0.785309  0.800714
27  0.10      7  0.109332  0.307531
28  0.10      8  0.661486  0.359344
29  0.10      9  0.984754  0.085754

Look at the histogram of p-values for the complete range of simulation parameters.

tf_res = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/tf-simulation.txt.gz', index_col=0)
plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.hist(tf_res['ks_p'], color='.7', bins=10, density=True)
plt.axhline(y=1, c='k', ls='--')
plt.xlabel('KS test $p$-value')
plt.ylabel('Density')
Text(0,0.5,'Density')

sim-ks.png

Using the data twice

Investigate what the KS test does when the reference distribution is estimated from the same data that are then tested for departures from it.

First simulate the simplest case:

\[ x_i \sim N(0, 1) \]

Estimate the MLE \(\hat\mu, \hat{\sigma^2}\), then test for departure from the maximum likelihood solution.

def trial(seed, N=1000):
  np.random.seed(seed)
  x = np.random.normal(size=N)
  mu = x.mean()
  sigma = x.std()
  q = st.norm(loc=mu, scale=sigma).cdf(x)
  return st.kstest(q, 'uniform')

Plot the histogram of \(p\)-values, varying the sample size \(n\).

plt.clf()
fig, ax = plt.subplots(1, 3, sharey=True)
fig.set_size_inches(8, 3)
for n, a in zip([100, 1000, 5000], ax):
  S = pd.DataFrame([trial(seed, N=n) for seed in range(100)])
  a.hist(S['pvalue'], color='.7', density=True, bins=10)
  a.axhline(y=1, c='k', ls='--')
  a.set_xlabel('KS test $p$-value')
  a.set_title('n={}'.format(n))
ax[0].set_ylabel('Density')
fig.tight_layout()

ks-test-double-use.png

In the continuous case, the diagnostic reflects the fact that the MLE is the best description of the observed data, and therefore \(p\)-values are concentrated around 1.

Next, simulate a simple discrete case:

\[ x_i \sim \mathrm{Poisson}(10) \]

def pois_trial(seed, N=1000, n_samples=1):
  np.random.seed(seed)
  x = np.random.poisson(lam=10, size=N)
  lamhat = x.mean()
  fhat = st.poisson(mu=lamhat)
  u = np.random.uniform(size=(n_samples, N))
  q = fhat.cdf(x - 1) + u * fhat.pmf(x)
  return st.kstest(q.ravel(), 'uniform')

Plot the analogous histogram, again varying only the sample size \(n\).

plt.clf()
fig, ax = plt.subplots(1, 3, sharey=True)
fig.set_size_inches(8, 3)
for n, a in zip([100, 1000, 5000], ax):
  S = pd.DataFrame([pois_trial(seed, N=n) for seed in range(100)])
  a.hist(S['pvalue'], color='.7', density=True, bins=10)
  a.axhline(y=1, c='k', ls='--')
  a.set_xlabel('KS test $p$-value')
  a.set_title('n={}'.format(n))
ax[0].set_ylabel('Density')
fig.tight_layout()

ks-test-double-use-pois.png

Now fix the sample size \(n=100\) and vary the number of randomized quantiles drawn per observation \(m\).

plt.clf()
fig, ax = plt.subplots(1, 3)
fig.set_size_inches(8, 3)
for m, a in zip([1, 2, 5], ax):
  S = pd.DataFrame([pois_trial(seed, N=100, n_samples=m) for seed in range(100)])
  a.hist(S['pvalue'], color='.7', density=True, bins=10)
  a.axhline(y=1, c='k', ls='--')
  a.set_xlabel('KS test $p$-value')
  a.set_title('m={}'.format(m))
ax[0].set_ylabel('Density')
fig.tight_layout()

ks-test-double-use-pois-m.png

The results suggest that the diagnostic is only valid for \(m = 1\), because for \(m > 1\) the randomized quantiles are no longer iid. For the case where \(m = 1\), the skew in the \(p\)-values suggests the test is conservative.

Real data

Run the diagnostic on the estimated parameters.

<<zinb-imports>>
import multiprocessing as mp
import scqtl
import functools
<<recode-impl>>

def shard(counts, keep_samples, log_mu):
  for chunk in pd.read_table(counts, index_col=0, chunksize=100):
    chunk = chunk.loc[:,keep_samples.values.ravel()].filter(items=log_mu.index, axis='index')
    if not chunk.empty:
      yield chunk

def process_chunk(chunk, log_mu, log_phi, logodds, size, onehot, n_samples):
  result = []
  for k, x in chunk.iterrows():
    for j in range(onehot.shape[1]):
      idx = onehot[:,j].astype(bool)
      col = log_mu.columns[j]
      d, p = scqtl.diagnostic.diagnostic_test(
        x.loc[idx].values.reshape(-1, 1),
        log_mu.loc[k, col],
        log_phi.loc[k,col],
        -logodds.loc[k,col],
        size[idx].values.reshape(-1, 1),
        np.ones((int(idx.sum()), 1)),
        n_samples=n_samples)
      result.append((k, col, d, p))
  return pd.DataFrame(result)

with mp.Pool() as pool:
  log_mu = pd.read_table("/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design0/zi2-log-mu.txt.gz", index_col=0, sep=' ')
  log_phi = pd.read_table("/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design0/zi2-log-phi.txt.gz", index_col=0, sep=' ')
  logodds = pd.read_table("/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design0/zi2-logodds.txt.gz", index_col=0, sep=' ')

  keep_samples = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/quality-single-cells.txt', index_col=0, header=None)
  annotations = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-annotation.txt')
  annotations = annotations.loc[keep_samples.values.ravel()]

  onehot = recode(annotations, 'chip_id')

  process_chunk_ = functools.partial(
    process_chunk,
    log_mu=log_mu,
    log_phi=log_phi,
    logodds=logodds,
    size=annotations['mol_hs'],
    onehot=onehot,
    n_samples=1)

  result = pd.concat(pool.map(
    process_chunk_,
    shard('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/scqtl-counts.txt.gz', keep_samples, log_mu),
  ))
  result.set_index(0).to_csv('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design0/diagnostic.txt.gz', compression='gzip', sep='\t')
sbatch --partition=broadwl --job-name="tf-diagnostic" --mem=8G -n1 -c28 --exclusive --time=10:00
#!/bin/bash
source activate scqtl
python /project2/mstephens/aksarkar/projects/singlecell-qtl/code/tf-diagnostic.py
Submitted batch job 55767376

Look at the histogram of KS test p-values across all genes and individuals.

diagnostic_results = pd.read_table('/project2/mstephens/aksarkar/projects/singlecell-qtl/data/density-estimation/design0/diagnostic.txt.gz', index_col=0)
plt.clf()
plt.gcf().set_size_inches(3, 3)
plt.hist(diagnostic_results['3'], density=True, color='.7', bins=10)
plt.axhline(y=1, c='k', ls='--')
plt.xlabel('KS test $p$-value')
_ = plt.ylabel('Density')

design-0-ks.png

Find gene–individual pairs for which we can reject the null (after Bonferroni correction).

diagnostic_results[diagnostic_results['3'] < .05 / diagnostic_results.shape[0]].shape[0]
60

Author: Abhishek Sarkar

Created: 2018-12-17 Mon 11:34
