ATAC-seq and Chromatin Accessibility Models
This document covers models for analyzing single-cell ATAC-seq and chromatin accessibility data in scvi-tools.
PeakVI
Purpose: Analysis and integration of single-cell ATAC-seq data using peak counts.
Key Features: - Variational autoencoder specifically designed for scATAC-seq peak data - Learns low-dimensional representations of chromatin accessibility - Performs batch correction across samples - Enables differential accessibility testing - Integrates multiple ATAC-seq datasets
When to Use: - Analyzing scATAC-seq peak count matrices - Integrating multiple ATAC-seq experiments - Batch correction of chromatin accessibility data - Dimensionality reduction for ATAC-seq - Differential accessibility analysis between cell types or conditions
Data Requirements: - Peak count matrix (cells × peaks) - Binary or count data for peak accessibility - Batch/sample annotations (optional, for batch correction)
Basic Usage:
import scvi
# Prepare data (peaks should be in adata.X)
# Optional: filter peaks
sc.pp.filter_genes(adata, min_cells=3)
# Setup data
scvi.model.PEAKVI.setup_anndata(
adata,
batch_key="batch"
)
# Train model
model = scvi.model.PEAKVI(adata)
model.train()
# Get latent representation (batch-corrected)
latent = model.get_latent_representation()
adata.obsm["X_PeakVI"] = latent
# Differential accessibility
da_results = model.differential_accessibility(
groupby="cell_type",
group1="TypeA",
group2="TypeB"
)
Key Parameters:
- n_latent: Dimensionality of latent space (default: 10)
- n_hidden: Number of nodes per hidden layer (default: 128)
- n_layers: Number of hidden layers (default: 1)
- region_factors: Whether to learn region-specific factors (default: True)
- latent_distribution: Distribution for latent space ("normal" or "ln")
Outputs:
- get_latent_representation(): Low-dimensional embeddings for cells
- get_accessibility_estimates(): Normalized accessibility values
- differential_accessibility(): Statistical testing for differential peaks
- get_region_factors(): Peak-specific scaling factors
Best Practices:
1. Filter out low-quality peaks (present in very few cells)
2. Include batch information if integrating multiple samples
3. Use latent representations for clustering and UMAP visualization
4. Consider using region_factors=True for datasets with high technical variation
5. Store latent embeddings in adata.obsm for downstream analysis with scanpy
PoissonVI
Purpose: Quantitative analysis of scATAC-seq fragment counts (more detailed than peak counts).
Key Features: - Models fragment counts directly (not just peak presence/absence) - Poisson distribution for count data - Captures quantitative differences in accessibility - Enables fine-grained analysis of chromatin state
When to Use: - Analyzing fragment-level ATAC-seq data - Need quantitative accessibility measurements - Higher resolution analysis than binary peak calls - Investigating gradual changes in chromatin accessibility
Data Requirements: - Fragment count matrix (cells × genomic regions) - Count data (not binary)
Basic Usage:
scvi.model.POISSONVI.setup_anndata(
adata,
batch_key="batch"
)
model = scvi.model.POISSONVI(adata)
model.train()
# Get results
latent = model.get_latent_representation()
accessibility = model.get_accessibility_estimates()
Key Differences from PeakVI: - PeakVI: Best for standard peak count matrices, faster - PoissonVI: Best for quantitative fragment counts, more detailed
When to Choose PoissonVI over PeakVI: - Working with fragment counts rather than called peaks - Need to capture quantitative differences - Have high-quality, high-coverage data - Interested in subtle accessibility changes
scBasset
Purpose: Deep learning approach to scATAC-seq analysis with interpretability and motif analysis.
Key Features: - Convolutional neural network (CNN) architecture for sequence-based analysis - Models raw DNA sequences, not just peak counts - Enables motif discovery and transcription factor (TF) binding prediction - Provides interpretable feature importance - Performs batch correction
When to Use: - Want to incorporate DNA sequence information - Interested in TF motif analysis - Need interpretable models (which sequences drive accessibility) - Analyzing regulatory elements and TF binding sites - Predicting accessibility from sequence alone
Data Requirements: - Peak sequences (extracted from genome) - Peak accessibility matrix - Genome reference (for sequence extraction)
Basic Usage:
# scBasset requires sequence information
# First, extract sequences for peaks
from scbasset import utils
sequences = utils.fetch_sequences(adata, genome="hg38")
# Setup and train
scvi.model.SCBASSET.setup_anndata(
adata,
batch_key="batch"
)
model = scvi.model.SCBASSET(adata, sequences=sequences)
model.train()
# Get latent representation
latent = model.get_latent_representation()
# Interpret model: which sequences/motifs are important
importance_scores = model.get_feature_importance()
Key Parameters:
- n_latent: Latent space dimensionality
- conv_layers: Number of convolutional layers
- n_filters: Number of filters per conv layer
- filter_size: Size of convolutional filters
Advanced Features: - In silico mutagenesis: Predict how sequence changes affect accessibility - Motif enrichment: Identify enriched TF motifs in accessible regions - Batch correction: Similar to other scvi-tools models - Transfer learning: Fine-tune on new datasets
Interpretability Tools:
# Get importance scores for sequences
importance = model.get_sequence_importance(region_indices=[0, 1, 2])
# Predict accessibility for new sequences
predictions = model.predict_accessibility(new_sequences)
Model Selection for ATAC-seq
PeakVI
Choose when: - Standard scATAC-seq analysis workflow - Have peak count matrices (most common format) - Need fast, efficient batch correction - Want straightforward differential accessibility - Prioritize computational efficiency
Advantages: - Fast training and inference - Proven track record for scATAC-seq - Easy integration with scanpy workflow - Robust batch correction
PoissonVI
Choose when: - Have fragment-level count data - Need quantitative accessibility measures - Interested in subtle differences - Have high-coverage, high-quality data
Advantages: - More detailed quantitative information - Better for gradient changes - Appropriate statistical model for counts
scBasset
Choose when: - Want to incorporate DNA sequence - Need biological interpretation (motifs, TFs) - Interested in regulatory mechanisms - Have computational resources for CNN training - Want predictive power for new sequences
Advantages: - Sequence-based, biologically interpretable - Motif and TF analysis built-in - Predictive modeling capabilities - In silico perturbation experiments
Workflow Example: Complete ATAC-seq Analysis
import scvi
import scanpy as sc
# 1. Load and preprocess ATAC-seq data
adata = sc.read_h5ad("atac_data.h5ad")
# 2. Filter low-quality peaks
sc.pp.filter_genes(adata, min_cells=10)
# 3. Setup and train PeakVI
scvi.model.PEAKVI.setup_anndata(
adata,
batch_key="sample"
)
model = scvi.model.PEAKVI(adata, n_latent=20)
model.train(max_epochs=400)
# 4. Extract latent representation
latent = model.get_latent_representation()
adata.obsm["X_PeakVI"] = latent
# 5. Downstream analysis
sc.pp.neighbors(adata, use_rep="X_PeakVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="clusters")
# 6. Differential accessibility
da_results = model.differential_accessibility(
groupby="clusters",
group1="0",
group2="1"
)
# 7. Save model
model.save("peakvi_model")
Integration with Gene Expression (RNA+ATAC)
For paired multimodal data (RNA+ATAC from same cells), use MultiVI instead:
# For 10x Multiome or similar paired data
scvi.model.MULTIVI.setup_anndata(
adata,
batch_key="sample",
modality_key="modality" # "RNA" or "ATAC"
)
model = scvi.model.MULTIVI(adata)
model.train()
# Get joint latent space
latent = model.get_latent_representation()
See models-multimodal.md for more details on multimodal integration.
Best Practices for ATAC-seq Analysis
- Quality Control:
- Filter cells with very low or very high peak counts
- Remove peaks present in very few cells
-
Filter mitochondrial and sex chromosome peaks if needed
-
Batch Correction:
- Always include
batch_keyif integrating multiple samples -
Consider technical covariates (sequencing depth, TSS enrichment)
-
Feature Selection:
- Unlike RNA-seq, all peaks are often used
-
Consider filtering very rare peaks for efficiency
-
Latent Dimensions:
- Start with
n_latent=10-30depending on dataset complexity -
Larger values for more heterogeneous datasets
-
Downstream Analysis:
- Use latent representations for clustering and visualization
- Link peaks to genes for regulatory analysis
-
Perform motif enrichment on cluster-specific peaks
-
Computational Considerations:
- ATAC-seq matrices are often very large (many peaks)
- Consider downsampling peaks for initial exploration
- Use GPU acceleration for large datasets