references/theoretical-foundations.md

Theoretical Foundations of scvi-tools

This document explains the mathematical and statistical principles underlying scvi-tools.

Core Concepts

Variational Inference

What is it? Variational inference is a technique for approximating complex probability distributions. In single-cell analysis, we want to understand the posterior distribution p(z|x) - the probability of latent variables z given observed data x.

Why use it?
- Exact inference is computationally intractable for complex models
- Scales to large datasets (millions of cells)
- Provides uncertainty quantification
- Enables Bayesian reasoning about cell states

How does it work?
1. Define a simpler approximate distribution q(z|x) with learnable parameters
2. Minimize the KL divergence between q(z|x) and the true posterior p(z|x)
3. This is equivalent to maximizing the Evidence Lower Bound (ELBO)

ELBO Objective:

ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
       ↑                    ↑
  Reconstruction          Regularization
  • Reconstruction term: Model should generate data similar to observed
  • Regularization term: Latent representation should match prior
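
A minimal PyTorch sketch of estimating these two terms for a Gaussian q(z|x); decoder_log_likelihood is a placeholder for whatever likelihood the decoder defines (NB, ZINB, ...):

import torch
from torch.distributions import Normal, kl_divergence

def elbo(x, qz_mean, qz_std, decoder_log_likelihood):
    # q(z|x) = N(qz_mean, qz_std^2), prior p(z) = N(0, I)
    qz = Normal(qz_mean, qz_std)
    pz = Normal(torch.zeros_like(qz_mean), torch.ones_like(qz_std))

    z = qz.rsample()                                      # reparameterized sample
    reconstruction = decoder_log_likelihood(x, z)         # E_q[log p(x|z)], summed over genes
    regularization = kl_divergence(qz, pz).sum(dim=-1)    # KL(q(z|x) || p(z)), summed over latent dims
    return (reconstruction - regularization).mean()       # average over cells in the mini-batch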

Variational Autoencoders (VAEs)

Architecture:

x (observed data)
    ↓
[Encoder Neural Network]
    ↓
z (latent representation)
    ↓
[Decoder Neural Network]
    ↓
x̂ (reconstructed data)

Encoder: Maps cells (x) to latent space (z)
- Learns q(z|x), the approximate posterior
- Parameterized by a neural network with learnable weights
- Outputs the mean and variance of the latent distribution

Decoder: Maps latent space (z) back to gene space
- Learns p(x|z), the likelihood
- Generates gene expression from the latent representation
- Models count distributions (Negative Binomial, Zero-Inflated NB)

Reparameterization Trick:
- Allows backpropagation through stochastic sampling
- Sample z = μ + σ ⊙ ε, where ε ~ N(0,1)
- Enables end-to-end training with gradient descent
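
A tiny PyTorch illustration of the trick: the noise ε is sampled outside the computation graph, so gradients flow back into the encoder outputs μ and log σ² (here stand-in tensors):

import torch

mu = torch.randn(128, 10, requires_grad=True)        # stand-in for encoder means (128 cells, 10 dims)
log_var = torch.randn(128, 10, requires_grad=True)   # stand-in for encoder log-variances

eps = torch.randn_like(mu)                 # ε sampled with no learnable parameters involved
z = mu + torch.exp(0.5 * log_var) * eps    # z = μ + σ ⊙ ε

loss = z.pow(2).sum()                      # any downstream loss
loss.backward()                            # gradients reach mu and log_var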

Amortized Inference

Concept: Share encoder parameters across all cells.

Traditional inference: Learn separate latent variables for each cell
- n_cells × n_latent parameters
- Doesn't scale to large datasets

Amortized inference: Learn a single encoder for all cells
- Fixed number of parameters regardless of cell count
- Enables fast inference on new cells
- Transfers learned patterns across the dataset

Benefits:
- Scalable to millions of cells
- Fast inference on query data
- Leverages shared structure across cells
- Enables few-shot learning

Statistical Modeling

Count Data Distributions

Single-cell data are counts (integer-valued), requiring appropriate distributions.

Negative Binomial (NB)

x ~ NB(μ, θ)
  • μ (mean): Expected expression level
  • θ (dispersion): Controls variance
  • Variance: Var(x) = μ + μ²/θ

When to use: Gene expression without zero-inflation
- More flexible than Poisson (allows overdispersion)
- Models technical and biological variation
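
A short sketch of the (μ, θ) parameterization expressed through torch's NegativeBinomial, which uses total_count/probs; the conversion below recovers the stated mean and variance (numbers are arbitrary):

import torch
from torch.distributions import NegativeBinomial

mu, theta = torch.tensor(5.0), torch.tensor(2.0)
nb = NegativeBinomial(total_count=theta, probs=mu / (mu + theta))

print(nb.mean)      # 5.0  = μ
print(nb.variance)  # 17.5 = μ + μ²/θ (overdispersed relative to Poisson)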

Zero-Inflated Negative Binomial (ZINB)

x ~ π·δ₀ + (1-π)·NB(μ, θ)
  • π (dropout rate): Probability of technical zero
  • δ₀: Point mass at zero
  • NB(μ, θ): Expression when not dropped out

When to use: Sparse scRNA-seq data
- Models technical dropout separately from biological zeros
- Better fit for highly sparse data (e.g., 10x data)
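
A sketch of sampling from this mixture by hand, only to illustrate the two-component structure (scvi-tools provides its own ZINB distribution class for actual modeling):

import torch
from torch.distributions import Bernoulli, NegativeBinomial

mu, theta, pi = torch.tensor(5.0), torch.tensor(2.0), torch.tensor(0.3)

nb = NegativeBinomial(total_count=theta, probs=mu / (mu + theta))
dropout = Bernoulli(probs=pi)

n = 10_000
samples = torch.where(
    dropout.sample((n,)).bool(),   # with probability π: technical zero
    torch.zeros(n),
    nb.sample((n,)),               # otherwise: NB-distributed expression
)
print((samples == 0).float().mean())   # zero fraction exceeds π, since NB also produces zeros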

Poisson

x ~ Poisson(μ)
  • Simplest count distribution
  • Mean equals variance: Var(x) = μ

When to use: Less common; ATAC-seq fragment counts
- More restrictive than NB
- Faster computation

Batch Correction Framework

Problem: Technical variation confounds biological signal
- Different sequencing runs, protocols, labs
- Must remove technical effects while preserving biology

scvi-tools approach:
1. Encode batch as categorical variable s
2. Include s in the generative model
3. Latent space z is batch-invariant
4. Decoder conditions on s for batch-specific effects

Mathematical formulation:

Encoder: q(z|x, s)  - batch-aware encoding
Latent: z           - batch-corrected representation
Decoder: p(x|z, s)  - batch-specific decoding

Key insight: Batch info flows through the decoder, not the latent space
- z captures biological variation
- s explains technical variation
- Biology and batch effects remain separable
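
In scvi-tools, this framework is used by registering the batch column when setting up the data; a minimal sketch, assuming adata is an AnnData object whose .obs contains a "batch" column (the column name is a placeholder):

import scvi

scvi.model.SCVI.setup_anndata(adata, batch_key="batch")   # s = categorical batch covariate

model = scvi.model.SCVI(adata)
model.train()

latent = model.get_latent_representation()   # z: batch-corrected representation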

Deep Generative Modeling

Generative model: Learns p(x), the data distribution

Process:
1. Sample latent variable: z ~ p(z) = N(0, I)
2. Generate expression: x ~ p(x|z)
3. Joint distribution: p(x, z) = p(x|z)p(z)

Benefits:
- Generate synthetic cells
- Impute missing values
- Quantify uncertainty
- Perform counterfactual predictions

Inference network: Inverts the generative process
- Given x, infer z
- q(z|x) approximates the true posterior p(z|x)

Model Architecture Details

scVI Architecture

Input: Gene expression counts x ∈ ℕ^G (G genes)

Encoder:

h = ReLU(W₁·x + b₁)
μ_z = W₂·h + b₂
log σ²_z = W₃·h + b₃
z ~ N(μ_z, σ²_z)

Latent space: z ∈ ℝ^d (typically d=10-30)

Decoder:

h = ReLU(W₄·z + b₄)
μ = softmax(W₅·h + b₅) · library_size
θ = exp(W₆·h + b₆)
π = sigmoid(W₇·h + b₇)  # for ZINB
x ~ ZINB(μ, θ, π)

Training objective (the ELBO; in practice the negative ELBO is minimized as the loss):

L = E_q[log p(x|z)] - KL(q(z|x) || N(0,I))
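
A condensed PyTorch sketch of the computations above, with a single hidden layer and ZINB parameter heads; it mirrors the equations rather than the actual scvi-tools implementation, which adds covariate conditioning, dropout, and normalization layers:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SCVISketch(nn.Module):
    def __init__(self, n_genes, n_hidden=128, n_latent=10):
        super().__init__()
        # encoder: x -> q(z|x)
        self.enc_hidden = nn.Linear(n_genes, n_hidden)
        self.enc_mean = nn.Linear(n_hidden, n_latent)
        self.enc_log_var = nn.Linear(n_hidden, n_latent)
        # decoder: z -> (mu, theta, pi)
        self.dec_hidden = nn.Linear(n_latent, n_hidden)
        self.dec_scale = nn.Linear(n_hidden, n_genes)      # softmax head -> per-gene proportions
        self.dec_log_theta = nn.Linear(n_hidden, n_genes)  # dispersion head
        self.dec_dropout = nn.Linear(n_hidden, n_genes)    # ZINB dropout head

    def forward(self, x):
        library_size = x.sum(dim=-1, keepdim=True)
        h = F.relu(self.enc_hidden(x))
        mu_z, log_var_z = self.enc_mean(h), self.enc_log_var(h)
        z = mu_z + torch.exp(0.5 * log_var_z) * torch.randn_like(mu_z)  # reparameterization

        h = F.relu(self.dec_hidden(z))
        mu = F.softmax(self.dec_scale(h), dim=-1) * library_size        # expected counts
        theta = torch.exp(self.dec_log_theta(h))                        # dispersion
        pi = torch.sigmoid(self.dec_dropout(h))                         # dropout probability
        return (mu, theta, pi), (mu_z, log_var_z)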

Handling Covariates

Categorical covariates (batch, donor, etc.):
- One-hot encoded: s ∈ {0,1}^K
- Concatenate with latent: [z, s]
- Or use conditional layers

Continuous covariates (library size, percent_mito):
- Standardize to zero mean, unit variance
- Include in encoder and/or decoder

Covariate injection strategies:
- Concatenation: [z, s] fed to decoder
- Deep injection: s added at multiple layers
- Conditional batch norm: Batch-specific normalization
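
A minimal illustration of the concatenation strategy (shapes and the number of batches are arbitrary):

import torch
import torch.nn.functional as F

z = torch.randn(128, 10)                          # latent representation for 128 cells
s = torch.randint(0, 3, (128,))                   # batch labels, K = 3 batches
s_one_hot = F.one_hot(s, num_classes=3).float()   # s ∈ {0,1}^K

decoder_input = torch.cat([z, s_one_hot], dim=-1) # [z, s] fed to the decoder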

Advanced Theoretical Concepts

Transfer Learning (scArches)

Concept: Use pretrained model as initialization for new data

Process:
1. Train reference model on a large dataset
2. Freeze encoder parameters
3. Fine-tune decoder on query data
4. Or fine-tune all weights with a lower learning rate

Why it works:
- Encoder learns general cellular representations
- Decoder adapts to query-specific characteristics
- Prevents catastrophic forgetting

Applications:
- Query-to-reference mapping
- Few-shot learning for rare cell types
- Rapid analysis of new datasets
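
A sketch of this workflow with the scvi-tools API; adata_ref and adata_query are assumed to be preprocessed AnnData objects, and the architecture options usually recommended for scArches surgery are omitted for brevity:

import scvi

scvi.model.SCVI.setup_anndata(adata_ref, batch_key="batch")
ref_model = scvi.model.SCVI(adata_ref)
ref_model.train()

# scArches-style surgery: initialize from the reference, then fine-tune on the query
query_model = scvi.model.SCVI.load_query_data(adata_query, ref_model)
query_model.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})

query_latent = query_model.get_latent_representation()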

Multi-Resolution Modeling (MrVI)

Idea: Separate shared and sample-specific variation

Latent space decomposition:

z = z_shared + z_sample
  • z_shared: Common across samples
  • z_sample: Sample-specific effects

Hierarchical structure:

Sample level: ρ_s ~ N(0, I)
Cell level: z_i ~ N(ρ_{s(i)}, σ²)

Benefits:
- Disentangle biological sources of variation
- Compare samples at different resolutions
- Identify sample-specific cell states

Counterfactual Prediction

Goal: Predict outcome under different conditions

Example: "What would this cell look like if from different batch?"

Method: 1. Encode cell to latent: z = Encoder(x, s_original) 2. Decode with new condition: x_new = Decoder(z, s_new) 3. x_new is counterfactual prediction

Applications: - Batch effect assessment - Predicting treatment response - In silico perturbation studies
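
With a trained scvi-tools model, the transform_batch argument of get_normalized_expression corresponds to this decode-with-a-new-condition step; a sketch assuming model and adata from earlier and a placeholder batch label "B":

# expression the model predicts for the same cells had they come from batch "B"
counterfactual = model.get_normalized_expression(
    adata,
    transform_batch="B",       # decode with s_new = "B" instead of each cell's own batch
    library_size="latent",
)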

Posterior Predictive Distribution

Definition: Distribution of new data given observed data

p(x_new | x_observed) = ∫ p(x_new|z) q(z|x_observed) dz

Estimation: Sample z from q(z|x), generate x_new from p(x_new|z)

Uses:
- Uncertainty quantification
- Robust predictions
- Outlier detection
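
scvi-tools models expose a posterior predictive sampler; a sketch assuming a trained model (check the method name and return type against your installed version):

# Monte Carlo draws x_new ~ p(x_new | x_observed) for the first 100 cells
samples = model.posterior_predictive_sample(adata, indices=list(range(100)), n_samples=25)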

Differential Expression Framework

Bayesian Approach

Traditional methods: Compare point estimates
- Wilcoxon, t-test, etc.
- Ignore uncertainty
- Require pseudocounts

scvi-tools approach: Compare distributions
- Sample from posterior: μ_A ~ p(μ|x_A), μ_B ~ p(μ|x_B)
- Compute log fold-change: LFC = log(μ_B) - log(μ_A)
- Posterior distribution of LFC quantifies uncertainty
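
In practice this comparison is exposed through differential_expression; a sketch assuming a trained model and a placeholder cell_type column with groups "B cells" and "T cells":

de = model.differential_expression(
    groupby="cell_type",
    group1="B cells",
    group2="T cells",
)
# the result includes posterior LFC summaries and Bayes factors,
# e.g. columns such as lfc_mean, lfc_median, bayes_factor, proba_de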

Bayes Factor

Definition: Ratio of posterior odds to prior odds

BF = P(H₁|data) / P(H₀|data)
     ─────────────────────────
     P(H₁) / P(H₀)

Interpretation:
- BF > 3: Moderate evidence for H₁
- BF > 10: Strong evidence
- BF > 100: Decisive evidence

In scvi-tools: Used to rank genes by evidence for DE

False Discovery Proportion (FDP)

Goal: Control expected false discovery rate

Procedure:
1. For each gene, compute posterior probability of DE
2. Rank genes by evidence (Bayes factor)
3. Select top k genes such that E[FDP] ≤ α

Advantage over p-values:
- Fully Bayesian
- Natural for posterior inference
- No arbitrary thresholds
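
A small NumPy sketch of this selection rule, using placeholder posterior probabilities (in practice, e.g. the proba_de column returned by differential_expression):

import numpy as np

proba_de = np.array([0.99, 0.97, 0.90, 0.60, 0.20, 0.05])   # P(gene is DE | data), illustrative

order = np.argsort(-proba_de)                                # rank genes by evidence
expected_fdp = np.cumsum(1 - proba_de[order]) / np.arange(1, len(order) + 1)

alpha = 0.05
k = int(np.sum(expected_fdp <= alpha))   # largest gene set with E[FDP] <= alpha
selected_genes = order[:k]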

Implementation Details

Optimization

Optimizer: Adam (adaptive learning rates)
- Default lr = 0.001
- Momentum parameters: β₁=0.9, β₂=0.999

Training loop:
1. Sample mini-batch of cells
2. Compute ELBO loss
3. Backpropagate gradients
4. Update parameters with Adam
5. Repeat until convergence

Convergence criteria:
- ELBO plateaus on validation set
- Early stopping prevents overfitting
- Typically 200-500 epochs
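
A sketch of how these choices surface in the scvi-tools training call; the values mirror the defaults described above and are illustrative:

model.train(
    max_epochs=400,
    batch_size=128,                    # mini-batch of cells per gradient step
    early_stopping=True,               # stop once the validation ELBO plateaus
    plan_kwargs={"lr": 1e-3},          # Adam learning rate (KL warmup is also set via plan_kwargs)
)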

Regularization

KL annealing: Gradually increase the KL weight
- Prevents posterior collapse
- Starts at 0, increases to 1 over epochs

Dropout: Random neuron dropping during training
- Default: 0.1 dropout rate
- Prevents overfitting
- Improves generalization

Weight decay: L2 regularization on weights
- Prevents large weights
- Improves stability

Scalability

Mini-batch training:
- Process a subset of cells per iteration
- Batch size: 64-256 cells
- Enables scaling to millions of cells

Stochastic optimization:
- Estimates the ELBO on mini-batches
- Unbiased gradient estimates
- Converges to a (local) optimum in practice

GPU acceleration:
- Neural networks naturally parallelize
- Order-of-magnitude speedup
- Essential for large datasets

Connections to Other Methods

vs. PCA

  • PCA: Linear, deterministic
  • scVI: Nonlinear, probabilistic
  • Advantage: scVI captures complex structure, handles counts

vs. t-SNE/UMAP

  • t-SNE/UMAP: Visualization-focused
  • scVI: Full generative model
  • Advantage: scVI enables downstream tasks (DE, imputation)

vs. Seurat Integration

  • Seurat: Anchor-based alignment
  • scVI: Probabilistic modeling
  • Advantage: scVI provides uncertainty, works for multiple batches

vs. Harmony

  • Harmony: PCA + batch correction
  • scVI: VAE-based
  • Advantage: scVI handles counts natively, more flexible

Mathematical Notation

Common symbols:
- x: Observed gene expression (counts)
- z: Latent representation
- θ: Model parameters
- q(z|x): Approximate posterior (encoder)
- p(x|z): Likelihood (decoder)
- p(z): Prior on latent variables
- μ, σ²: Mean and variance
- π: Dropout probability (ZINB)
- θ (in NB): Dispersion parameter
- s: Batch/covariate indicator

Further Reading

Key Papers:
1. Lopez et al. (2018): "Deep generative modeling for single-cell transcriptomics"
2. Xu et al. (2021): "Probabilistic harmonization and annotation of single-cell transcriptomics"
3. Boyeau et al. (2019): "Deep generative models for detecting differential expression in single cells"

Concepts to explore:
- Variational inference in machine learning
- Bayesian deep learning
- Information theory (KL divergence, mutual information)
- Generative models (GANs, normalizing flows, diffusion models)
- Probabilistic programming (e.g., Pyro, built on PyTorch)

Mathematical background:
- Probability theory and statistics
- Linear algebra and calculus
- Optimization theory
- Information theory

← Back to scvi-tools