Single-Cell RNA-seq Models
This document covers core models for analyzing single-cell RNA sequencing data in scvi-tools.
scVI (Single-Cell Variational Inference)
Purpose: Unsupervised analysis, dimensionality reduction, and batch correction for scRNA-seq data.
Key Features:
- Deep generative model based on variational autoencoders (VAE)
- Learns low-dimensional latent representations that capture biological variation
- Automatically corrects for batch effects and technical covariates
- Enables normalized gene expression estimation
- Supports differential expression analysis
When to Use:
- Initial exploration and dimensionality reduction of scRNA-seq datasets
- Integrating multiple batches or studies
- Generating batch-corrected expression matrices
- Performing probabilistic differential expression analysis
Basic Usage:
import scvi
# Setup data
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch"
)
# Train model
model = scvi.model.SCVI(adata, n_latent=30)
model.train()
# Extract results
latent = model.get_latent_representation()
normalized = model.get_normalized_expression()
Key Parameters:
- n_latent: Dimensionality of latent space (default: 10)
- n_layers: Number of hidden layers (default: 1)
- n_hidden: Number of nodes per hidden layer (default: 128)
- dropout_rate: Dropout rate for neural networks (default: 0.1)
- dispersion: How the dispersion parameter is shared ("gene", "gene-batch", "gene-label", or "gene-cell"; default: "gene")
- gene_likelihood: Distribution used to model the counts ("zinb", "nb", or "poisson")
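To make the gene_likelihood options more concrete, the following is a minimal sketch of the negative binomial ("nb") and zero-inflated negative binomial ("zinb") log-likelihoods for a single count. It is not the scvi-tools implementation; the function names and the parameters mu (mean), theta (inverse dispersion), and pi (zero-inflation probability, assumed to lie in [0, 1)) are chosen here for illustration only.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log NB probability of count x, with mean mu and inverse dispersion theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log ZINB probability: with probability pi the count is a structural zero,
    otherwise it is drawn from the NB distribution (assumes 0 <= pi < 1)."""
    if x == 0:
        # A zero can come from either the zero-inflation or the NB component
        return math.log(pi + (1.0 - pi) * math.exp(nb_log_pmf(0, mu, theta)))
    # A positive count can only come from the NB component
    return math.log1p(-pi) + nb_log_pmf(x, mu, theta)


# Zero-inflation raises the probability of observing a zero
p_zero_nb = math.exp(nb_log_pmf(0, mu=5.0, theta=2.0))
p_zero_zinb = math.exp(zinb_log_pmf(0, mu=5.0, theta=2.0, pi=0.3))
```

With pi set to 0 the ZINB reduces to the plain NB, which is one way to think about the difference between the "zinb" and "nb" settings.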
Outputs:
- get_latent_representation(): Batch-corrected low-dimensional embeddings
- get_normalized_expression(): Denoised, normalized expression values
- differential_expression(): Probabilistic DE testing between groups
- get_feature_correlation_matrix(): Gene-gene correlation estimates
scANVI (Single-Cell ANnotation using Variational Inference)
Purpose: Semi-supervised cell type annotation and integration using labeled and unlabeled cells.
Key Features:
- Extends scVI with cell type labels
- Leverages partially labeled datasets for annotation transfer
- Performs simultaneous batch correction and cell type prediction
- Enables query-to-reference mapping
When to Use:
- Annotating new datasets using reference labels
- Transfer learning from well-annotated to unlabeled datasets
- Joint analysis of labeled and unlabeled cells
- Building cell type classifiers with uncertainty quantification
Basic Usage:
# Option 1: Train from scratch
scvi.model.SCANVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    labels_key="cell_type",
    unlabeled_category="Unknown"
)
model = scvi.model.SCANVI(adata)
model.train()
# Option 2: Initialize from a pretrained scVI model
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
scvi_model = scvi.model.SCVI(adata)
scvi_model.train()
scanvi_model = scvi.model.SCANVI.from_scvi_model(
    scvi_model,
    labels_key="cell_type",  # Needed because the scVI model was set up without labels
    unlabeled_category="Unknown"
)
scanvi_model.train()
# Predict cell types
predictions = scanvi_model.predict()
Key Parameters:
- labels_key: Column in adata.obs containing cell type labels
- unlabeled_category: Label for cells without annotations
- All scVI parameters are also available
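The labels column has to mark cells without annotations using the unlabeled_category value. The sketch below shows one way to prepare such a column when reference and query cells are combined; the helper name mask_query_labels and the is_query flags are hypothetical and not part of scvi-tools.

```python
def mask_query_labels(labels, is_query, unlabeled_category="Unknown"):
    """Return a copy of labels in which query cells carry the unlabeled category."""
    if len(labels) != len(is_query):
        raise ValueError("labels and is_query must have the same length")
    return [unlabeled_category if q else lab for lab, q in zip(labels, is_query)]


# Toy example: two annotated reference cells followed by two query cells
labels = ["T cell", "B cell", "T cell", "NK cell"]
is_query = [False, False, True, True]
masked = mask_query_labels(labels, is_query)
```

The resulting list could then be stored in the adata.obs column named by labels_key before calling setup_anndata.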
Outputs:
- predict(): Cell type predictions for all cells
- predict(soft=True): Per-cell prediction probabilities for each cell type
- get_latent_representation(): Cell type-aware latent space
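Prediction probabilities can be summarized into a per-cell confidence score. The following sketch assumes the probabilities for one cell are available as a plain mapping from cell type to probability; the function name and the use of normalized entropy as an uncertainty measure are illustrative choices, not part of the scANVI API.

```python
import math


def prediction_confidence(probs):
    """Return (top label, its probability, normalized entropy in [0, 1]).

    probs maps cell type -> probability for a single cell. An entropy of 0
    means a fully confident call; 1 means a uniform (uninformative) one.
    """
    total = sum(probs.values())
    norm = {k: v / total for k, v in probs.items()}
    label = max(norm, key=norm.get)
    entropy = -sum(p * math.log(p) for p in norm.values() if p > 0)
    max_entropy = math.log(len(norm)) if len(norm) > 1 else 1.0
    return label, norm[label], entropy / max_entropy


# Toy probabilities for one cell (made-up values)
label, top_prob, uncertainty = prediction_confidence(
    {"T cell": 0.7, "B cell": 0.2, "NK cell": 0.1}
)
```

Cells with high normalized entropy are candidates for manual review or for being left unannotated.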
AUTOZI
Purpose: Automatic identification and modeling of zero-inflated genes in scRNA-seq data.
Key Features:
- Distinguishes biological zeros from technical dropout
- Learns which genes exhibit zero-inflation
- Provides gene-specific zero-inflation probabilities
- Improves downstream analysis by accounting for dropout
When to Use:
- Detecting which genes are affected by technical dropout
- Improving imputation and normalization for sparse datasets
- Understanding the extent of zero-inflation in your data
Basic Usage:
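from scipy.stats import beta
scvi.model.AUTOZI.setup_anndata(adata, layer="counts")
model = scvi.model.AUTOZI(adata)
model.train()
# Per-gene parameters of the Beta posterior over the probability of the NB (non-zero-inflated) component
outputs = model.get_alphas_betas()
alpha_posterior = outputs["alpha_posterior"]
beta_posterior = outputs["beta_posterior"]
# Posterior probability that each gene is zero-inflated (that probability falls below 0.5)
zi_probs = beta.cdf(0.5, alpha_posterior, beta_posterior)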
VeloVI
Purpose: RNA velocity analysis using variational inference.
Key Features:
- Joint modeling of spliced and unspliced RNA counts
- Probabilistic estimation of RNA velocity
- Accounts for technical noise
- Provides uncertainty quantification for velocity estimates
When to Use:
- Inferring cellular dynamics and differentiation trajectories
- Analyzing spliced/unspliced count data
- RNA velocity analysis with uncertainty estimates
Basic Usage:
import scvelo as scv
# Prepare velocity data
scv.pp.filter_and_normalize(adata)
scv.pp.moments(adata)
# Train VeloVI (provided under scvi.external)
scvi.external.VELOVI.setup_anndata(adata, spliced_layer="Ms", unspliced_layer="Mu")
model = scvi.external.VELOVI(adata)
model.train()
# Get velocity estimates
latent_time = model.get_latent_time()
velocities = model.get_velocity()
contrastiveVI
Purpose: Isolating perturbation-specific variation from background biological variation.
Key Features:
- Separates shared variation (common across conditions) from target-specific variation
- Useful for perturbation studies (drug treatments, genetic perturbations)
- Identifies condition-specific gene programs
- Enables discovery of treatment-specific effects
When to Use:
- Analyzing perturbation experiments (drug screens, CRISPR, etc.)
- Identifying genes responding specifically to treatments
- Separating treatment effects from background variation
- Comparing control vs. perturbed conditions
Basic Usage:
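import numpy as np
scvi.external.ContrastiveVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch"
)
model = scvi.external.ContrastiveVI(
    adata,
    n_background_latent=10,  # Shared (background) variation
    n_salient_latent=5  # Target-specific (salient) variation
)
# Split cells into background (control) and target (perturbed) sets;
# the "condition" column and the "control" value depend on your dataset
background_indices = np.where(adata.obs["condition"] == "control")[0]
target_indices = np.where(adata.obs["condition"] != "control")[0]
model.train(
    background_indices=background_indices,
    target_indices=target_indices
)
# Extract representations
shared = model.get_latent_representation(representation_kind="background")
target_specific = model.get_latent_representation(representation_kind="salient")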
CellAssign
Purpose: Probabilistic cell type annotation from known marker genes.
Key Features:
- Uses prior knowledge of marker genes for cell types
- Probabilistic assignment of cells to types
- Handles marker gene overlap and ambiguity
- Provides soft assignments with uncertainty
When to Use:
- Annotating cells using known marker genes
- Leveraging existing biological knowledge for classification
- Cases where marker gene lists are available but reference datasets are not
Basic Usage:
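import numpy as np
import pandas as pd
# Create binary marker gene matrix (genes x cell types)
marker_gene_mat = pd.DataFrame({
    "CD4 T cells": [1, 1, 0, 0],  # CD3D, CD4, CD8A, CD19
    "CD8 T cells": [1, 0, 1, 0],
    "B cells": [0, 0, 0, 1]
}, index=["CD3D", "CD4", "CD8A", "CD19"])
# Per-cell size factors, computed from all genes before subsetting
lib_size = np.asarray(adata.layers["counts"].sum(axis=1)).ravel()
adata.obs["size_factor"] = lib_size / lib_size.mean()
# CellAssign is trained on the marker genes only
bdata = adata[:, marker_gene_mat.index].copy()
scvi.external.CellAssign.setup_anndata(bdata, size_factor_key="size_factor", layer="counts")
model = scvi.external.CellAssign(bdata, marker_gene_mat)
model.train()
predictions = model.predict()  # Per-cell probabilities for each cell type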
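Marker lists are often kept as a mapping from cell type to genes rather than as a matrix. The sketch below converts such a mapping into a binary table in which rows are genes and columns are cell types, matching the DataFrame constructed above. The helper name build_marker_matrix is hypothetical and uses only plain Python.

```python
def build_marker_matrix(markers):
    """Convert {cell type: [marker genes]} into (genes, cell_types, rows).

    rows[i][j] is 1 if genes[i] is a marker for cell_types[j], else 0.
    Genes are listed in order of first appearance.
    """
    cell_types = list(markers)
    genes = []
    for gene_list in markers.values():
        for gene in gene_list:
            if gene not in genes:
                genes.append(gene)
    rows = [[1 if gene in markers[ct] else 0 for ct in cell_types] for gene in genes]
    return genes, cell_types, rows


markers = {
    "CD4 T cells": ["CD3D", "CD4"],
    "CD8 T cells": ["CD3D", "CD8A"],
    "B cells": ["CD19"],
}
genes, cell_types, rows = build_marker_matrix(markers)
```

Wrapping the result as pd.DataFrame(rows, index=genes, columns=cell_types) gives a table with the same layout as marker_gene_mat.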
Solo (Doublet Detection)
Purpose: Identifying doublets (droplets that captured two or more cells) in scRNA-seq data.
Key Features:
- Semi-supervised doublet detection using scVI embeddings
- Simulates artificial doublets for training
- Provides doublet probability scores
- Can be applied to any scVI model
When to Use:
- Quality control of scRNA-seq datasets
- Removing doublets before downstream analysis
- Assessing doublet rates in your data
Basic Usage:
# Train scVI model first
scvi.model.SCVI.setup_anndata(adata, layer="counts")
scvi_model = scvi.model.SCVI(adata)
scvi_model.train()
# Train Solo for doublet detection
solo_model = scvi.external.SOLO.from_scvi_model(scvi_model)
solo_model.train()
# Predict doublets
predictions = solo_model.predict()
doublet_scores = predictions["doublet"]
adata.obs["doublet_score"] = doublet_scores
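Once doublet scores are available, cells are typically called by thresholding. The sketch below assumes the scores are probabilities in [0, 1]; depending on the scvi-tools version, Solo may return logits instead, in which case they would need to be converted first. The function name and the 0.5 threshold are illustrative, not a recommendation from the library.

```python
def call_doublets(scores, threshold=0.5):
    """Label each cell 'doublet' if its score exceeds threshold, else 'singlet'."""
    return ["doublet" if s > threshold else "singlet" for s in scores]


# Toy scores for four cells (made-up values)
scores = [0.02, 0.91, 0.40, 0.55]
calls = call_doublets(scores)
doublet_rate = calls.count("doublet") / len(calls)
```

Comparing doublet_rate against the rate expected for the experiment is a quick sanity check on the chosen threshold.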
Amortized LDA (Topic Modeling)
Purpose: Topic modeling for gene expression using Latent Dirichlet Allocation.
Key Features:
- Discovers gene expression programs (topics)
- Amortized variational inference for scalability
- Each cell is a mixture of topics
- Each topic is a distribution over genes
When to Use:
- Discovering gene programs or expression modules
- Understanding compositional structure of expression
- Alternative dimensionality reduction approach
- Interpretable decomposition of expression patterns
Basic Usage:
scvi.model.AmortizedLDA.setup_anndata(adata, layer="counts")
model = scvi.model.AmortizedLDA(adata, n_topics=10)
model.train()
# Get topic compositions per cell
topic_proportions = model.get_latent_representation()
# Get gene loadings per topic (genes x topics)
topic_gene_loadings = model.get_feature_by_topic()
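Topics are usually interpreted by looking at their highest-weighted genes. The sketch below assumes the gene loadings have been converted into a plain mapping from topic to per-gene weights; the function name and the toy weights are made up for illustration.

```python
def top_genes_per_topic(weights, n_top=2):
    """Return {topic: [genes]} with the n_top highest-weighted genes per topic.

    weights maps topic -> {gene: weight}.
    """
    return {
        topic: [
            gene
            for gene, _ in sorted(
                gene_weights.items(), key=lambda kv: kv[1], reverse=True
            )[:n_top]
        ]
        for topic, gene_weights in weights.items()
    }


# Toy loadings for two topics over four genes (made-up values)
weights = {
    "topic_0": {"CD3D": 0.50, "CD19": 0.05, "CD4": 0.30, "MS4A1": 0.15},
    "topic_1": {"CD3D": 0.05, "CD19": 0.45, "CD4": 0.10, "MS4A1": 0.40},
}
top = top_genes_per_topic(weights)
```

In this toy example the first topic is dominated by T cell genes and the second by B cell genes, which is the kind of reading the per-topic gene distribution is meant to support.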
Model Selection Guidelines
Choose scVI when:
- Starting with unsupervised analysis
- Need batch correction and integration
- Want normalized expression and DE analysis
Choose scANVI when:
- Have some labeled cells for training
- Need cell type annotation
- Want to transfer labels from reference to query
Choose AUTOZI when:
- Concerned about technical dropout
- Need to identify zero-inflated genes
- Working with very sparse datasets
Choose VeloVI when:
- Have spliced/unspliced count data
- Interested in cellular dynamics
- Need RNA velocity with uncertainty estimates
Choose contrastiveVI when:
- Analyzing perturbation experiments
- Need to separate treatment effects
- Want to identify condition-specific programs
Choose CellAssign when:
- Have marker gene lists available
- Want probabilistic marker-based annotation
- No reference dataset available
Choose Solo when:
- Need doublet detection
- Already using scVI for analysis
- Want probabilistic doublet scores