Protein Modeling
Overview
TorchDrug provides extensive support for protein-related tasks including sequence analysis, structure prediction, property prediction, and protein-protein interactions. Proteins are represented as graphs where nodes are amino acid residues and edges represent spatial or sequential relationships.
Available Datasets
Protein Function Prediction
Enzyme Function:
- EnzymeCommission (17,562 proteins): EC number classification (EC numbers are four-level codes grouped under seven top-level enzyme classes)
- BetaLactamase (5,864 sequences): Enzyme activity prediction
Protein Characteristics:
- Fluorescence (54,025 sequences): GFP fluorescence intensity
- Stability (53,614 sequences): Thermostability prediction
- Solubility (62,478 sequences): Protein solubility classification
- BinaryLocalization (22,168 proteins): Subcellular localization (membrane-bound vs. soluble)
- SubcellularLocalization (8,943 proteins): 10-class localization prediction
Gene Ontology:
- GeneOntology (46,796 proteins): GO term prediction across biological process, molecular function, and cellular component
Protein Structure Prediction
- Fold (16,712 proteins): Structural fold classification (1,195 classes)
- SecondaryStructure (8,678 proteins): 3-state or 8-state secondary structure prediction
- ContactPrediction via ProteinNet: Residue-residue contact maps
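The 3-state and 8-state label sets mentioned for SecondaryStructure are related by a reduction of DSSP classes. The sketch below shows one common convention (helices H/G/I to H, strands E/B to E, everything else to coil). The mapping table and the function name `reduce_to_three_state` are illustrative assumptions, not part of TorchDrug, and some benchmarks use a different reduction.

```python
# One common DSSP 8-state -> 3-state reduction (an assumption; conventions vary).
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # helix classes
    "E": "E", "B": "E",             # strand classes
    "T": "C", "S": "C", "-": "C",   # turn, bend, and unassigned become coil
}


def reduce_to_three_state(dssp8: str) -> str:
    """Map an 8-state DSSP string to a 3-state (H/E/C) string."""
    # Unknown symbols fall back to coil.
    return "".join(DSSP8_TO_3.get(symbol, "C") for symbol in dssp8)


three_state = reduce_to_three_state("-HHHGGTEEBS-")
```

The output has the same length as the input, so residue-level labels stay aligned with the sequence.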
Protein Interaction
Protein-Protein Interactions:
- HumanPPI (1,412 proteins, 6,584 interactions): Human protein interaction network
- YeastPPI (2,018 proteins, 6,451 interactions): Yeast protein interaction network
- PPIAffinity (2,156 protein pairs): Binding affinity measurements
Protein-Ligand Binding:
- BindingDB (~1.5M entries): Comprehensive binding affinity database
- PDBBind (20,000+ complexes): 3D structure-based binding data
  - Refined set: High-quality crystal structures
  - Core set: Diverse benchmark set
Large-Scale Protein Databases
- AlphaFoldDB: Access to 200M+ predicted protein structures
- ProteinNet: Standardized dataset for structure prediction
Task Types
NodePropertyPrediction
Predict properties at the residue (node) level, such as secondary structure or contact maps.
Use Cases:
- Secondary structure prediction (helix, sheet, coil)
- Residue-level disorder prediction
- Post-translational modification sites
- Binding site prediction
PropertyPrediction
Predict protein-level properties like function, stability, or localization.
Use Cases:
- Enzyme function classification
- Subcellular localization
- Protein stability prediction
- Gene ontology term prediction
InteractionPrediction
Predict interactions between protein pairs or protein-ligand pairs.
Key Features:
- Handles both sequence and structure inputs
- Supports symmetric (PPI) and asymmetric (protein-ligand) interactions
- Multiple negative sampling strategies
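As an illustration of negative sampling for symmetric interactions, the sketch below draws random protein pairs that are absent from the known positives. It is a minimal, library-independent example: the function name `sample_negative_pairs` and its parameters are hypothetical, and uniform random sampling is only one of several possible strategies.

```python
import random


def sample_negative_pairs(positives, proteins, num_negatives, seed=0):
    """Sample unordered protein pairs that are not known interactions."""
    rng = random.Random(seed)
    # PPI is symmetric, so (a, b) and (b, a) count as the same pair.
    known = {frozenset(pair) for pair in positives}
    max_pairs = len(proteins) * (len(proteins) - 1) // 2
    if num_negatives > max_pairs - len(known):
        raise ValueError("not enough non-interacting pairs to sample from")
    negatives = set()
    while len(negatives) < num_negatives:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


positives = [("A", "B"), ("B", "C")]
negatives = sample_negative_pairs(positives, ["A", "B", "C", "D"], num_negatives=3)
```

Unobserved pairs are not guaranteed to be true non-interactions, so sampled negatives are best treated as noisy labels.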
ContactPrediction
Specialized task for predicting spatial proximity between residues in folded structures.
Applications:
- Structure prediction from sequence
- Protein folding pathway analysis
- Validation of predicted structures
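To make the prediction target concrete, the sketch below derives a binary contact map from 3D coordinates, which is the kind of label a contact prediction model is trained against. The 8 Å cutoff matches the threshold this document cites for contact edges; the `min_separation` filter and the function name `contact_map` are assumptions for illustration, not TorchDrug API.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=6):
    """Return an n x n 0/1 matrix marking residue pairs closer than `threshold` (in Å).

    Pairs fewer than `min_separation` positions apart in sequence are skipped,
    because neighbours along the chain are trivially close.
    """
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = 1  # keep the map symmetric
    return contacts


# Toy chain: ten residues spaced 1 Å apart along the x axis.
toy_coords = [(float(i), 0.0, 0.0) for i in range(10)]
toy_contacts = contact_map(toy_coords)
```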
Protein Representation Models
Sequence-Based Models
ESM (Evolutionary Scale Modeling):
- Transformer model pre-trained on 250M sequences
- State-of-the-art for sequence-only tasks
- Available in multiple sizes (ESM-1b, ESM-2)
- Captures evolutionary and structural information
ProteinBERT:
- BERT-style masked language model
- Pre-trained on UniProt sequences
- Good for transfer learning
ProteinLSTM:
- Bidirectional LSTM for sequence encoding
- Lightweight and fast
- Good baseline for sequence tasks
ProteinCNN / ProteinResNet:
- Convolutional architectures
- Capture local sequence patterns
- Faster than transformer models
Structure-Based Models
GearNet (Geometry-Aware Relational Graph Network):
- Incorporates 3D geometric information
- Edge types based on sequential, radius, and K-nearest neighbors
- State-of-the-art for structure-based tasks
- Supports both backbone and full-atom representations
GCN/GAT/GIN on Protein Graphs:
- Standard GNN architectures adapted for proteins
- Flexible edge definitions (sequence, spatial, contact)
SchNet:
- Continuous-filter convolutions
- Handles 3D coordinates directly
- Good for structure prediction and protein-ligand binding
Feature-Based Models
Statistic Features:
- Amino acid composition
- Sequence length statistics
- Motif counts
Physicochemical Features:
- Hydrophobicity scales
- Charge properties
- Secondary structure propensity
- Molecular weight, pI
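A few of these features can be computed directly from the sequence. The sketch below shows amino acid composition, mean hydropathy on the Kyte-Doolittle scale, and a crude net charge. The function names are hypothetical, the charge estimate ignores histidine and pH, and the code assumes sequences containing only the 20 standard residues.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Fraction of each of the 20 standard amino acids, in AMINO_ACIDS order."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def mean_hydropathy(seq):
    """Average Kyte-Doolittle hydropathy over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def net_charge(seq):
    """Crude charge estimate: +1 per K/R, -1 per D/E (ignores His and pH)."""
    return sum(1 for aa in seq if aa in "KR") - sum(1 for aa in seq if aa in "DE")


features = composition("MKTAYIAK") + [mean_hydropathy("MKTAYIAK"), net_charge("MKTAYIAK")]
```

Concatenating these values gives a fixed-length vector (22 numbers here) regardless of sequence length, which is what makes such features usable with simple models.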
Protein Graph Construction
Edge Types
Sequential Edges:
- Connect adjacent residues in sequence
- Capture primary structure
Spatial Edges:
- K-nearest neighbors in 3D space
- Radius cutoff (e.g., Cα atoms within 10 Å)
- Capture tertiary structure
Contact Edges:
- Based on heavy atom distances
- Typically < 8 Å threshold
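The edge types above can be sketched in plain Python over a list of Cα coordinates. This is a conceptual illustration only: the function `build_edges`, its parameters, and the edge-type labels are hypothetical, and TorchDrug's own graph-construction layers are not used here.

```python
import math


def build_edges(coords, k=2, radius=10.0):
    """Return a set of directed, typed edges (source, target, edge_type)."""
    n = len(coords)
    edges = set()
    # Sequential edges: neighbours along the chain, in both directions.
    for i in range(n - 1):
        edges.add((i, i + 1, "sequential"))
        edges.add((i + 1, i, "sequential"))
    for i in range(n):
        # Distances from residue i to every other residue, nearest first.
        dists = sorted((math.dist(coords[i], coords[j]), j) for j in range(n) if j != i)
        # K-nearest-neighbour edges: the k closest residues in 3D space.
        for _, j in dists[:k]:
            edges.add((i, j, "knn"))
        # Radius edges: every residue within the distance cutoff.
        for d, j in dists:
            if d < radius:
                edges.add((i, j, "radius"))
    return edges


toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
toy_edges = build_edges(toy_coords, k=1, radius=5.0)
```

Keeping the edge type on each edge is what lets a relational model treat sequential and spatial neighbours differently. The pairwise loop is quadratic in sequence length, so a spatial index would be preferable for long chains.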
Node Features
Residue Identity:
- One-hot encoding of 20 amino acids
- Learned embeddings
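A one-hot residue encoding can be sketched as follows. The alphabet order and the choice to encode unknown residues as an all-zero row are assumptions made for this example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Encode each residue as a length-20 indicator vector."""
    rows = []
    for residue in seq:
        row = [0] * len(AMINO_ACIDS)
        if residue in AA_INDEX:  # unknown residues (e.g. "X") stay all-zero
            row[AA_INDEX[residue]] = 1
        rows.append(row)
    return rows


encoded = one_hot("ACX")
```

Learned embeddings replace each of these sparse rows with a dense vector looked up by the same index.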
Position Information:
- 3D coordinates (Cα, N, C, O)
- Backbone angles (phi, psi, omega)
- Relative spatial position encodings
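The backbone angles phi, psi, and omega are dihedral angles defined by four consecutive backbone atoms. The sketch below computes a signed dihedral from four 3D points using the standard atan2 formulation; the helper names are hypothetical and the atom coordinates would come from a parsed structure.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees around the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Atom quadruples for residue i:
#   phi:   C(i-1), N(i),  CA(i),  C(i)
#   psi:   N(i),   CA(i), C(i),   N(i+1)
#   omega: CA(i),  C(i),  N(i+1), CA(i+1)
right_angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Because angles wrap around at ±180°, models usually receive them as sine and cosine pairs rather than raw degrees.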
Physicochemical Properties:
- Hydrophobicity
- Charge
- Size
- Secondary structure
Training Workflows
Pre-training Strategies
Self-Supervised Pre-training:
- Masked residue prediction (like BERT)
- Distance prediction between residues
- Angle prediction
- Dihedral angle prediction (e.g., phi, psi, omega)
- Contact map prediction
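Masked residue prediction hides a fraction of residues and asks the model to recover them. The sketch below shows only the data-preparation step. The 15% mask rate follows the BERT convention, and the `<mask>` token and function name `mask_residues` are assumptions for illustration.

```python
import random

MASK_TOKEN = "<mask>"


def mask_residues(seq, mask_rate=0.15, seed=0):
    """Replace a random subset of residues with MASK_TOKEN.

    Returns the masked token list and a dict {position: original residue}
    that serves as the prediction target.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    rng = random.Random(seed)
    num_masked = max(1, round(len(seq) * mask_rate))
    positions = rng.sample(range(len(seq)), num_masked)
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return tokens, targets


sequence = "MKTAYIAKQRQISFVKSHFSRQ"
masked_tokens, masked_targets = mask_residues(sequence)
```

The loss is then computed only at the masked positions, comparing the model's predicted residue with the stored target.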
Pre-trained Model Usage:
from torchdrug import models, tasks
# Load pre-trained ESM-1b; `path` is the directory where the weights are cached
model = models.ESM(path="~/esm-model-weights/", model="ESM-1b")
# Fine-tune on a downstream regression task (here: stability)
task = tasks.PropertyPrediction(
    model, task=["stability"],
    criterion="mse", metric=["mae", "rmse"]
)
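For reference, the two metrics named in the task configuration are defined as follows. This is a plain-Python sketch of the formulas, not TorchDrug's implementation.

```python
import math


def mae(predictions, targets):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)


def rmse(predictions, targets):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions))


example_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```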
Multi-Task Learning
Train on multiple related tasks simultaneously:
- Joint prediction of function, localization, and stability
- Improves generalization and data efficiency
- Shares representations across tasks
Best Practices
For Sequence-Only Tasks:
1. Start with pre-trained ESM or ProteinBERT
2. Fine-tune with a small learning rate (1e-5 to 1e-4)
3. Use frozen embeddings for small datasets
4. Apply dropout for regularization
For Structure-Based Tasks:
1. Use GearNet with multiple edge types
2. Include geometric features (angles, dihedrals)
3. Pre-train on large structure databases
4. Use data augmentation (rotations, crops)
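Rotation augmentation works because a rigid rotation changes the coordinates but leaves all inter-residue distances intact. The sketch below samples a random rotation from a normalized quaternion and applies it to a coordinate list. The function names are hypothetical, and the rotation is about the origin, so coordinates are typically centered first.

```python
import math
import random


def random_rotation(rng):
    """Sample a 3x3 rotation matrix from a random unit quaternion."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def rotate(coords, rotation):
    """Apply a rotation matrix to every 3D point."""
    return [
        tuple(sum(rotation[r][c] * point[c] for c in range(3)) for r in range(3))
        for point in coords
    ]


original = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.0, 1.0)]
augmented = rotate(original, random_rotation(random.Random(0)))
```

Models whose inputs are purely distances and angles are already rotation-invariant, so this augmentation matters mainly when raw coordinates are fed to the network.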
For Small Datasets:
1. Transfer learning from pre-trained models
2. Multi-task learning with related tasks
3. Data augmentation (sequence mutations, structure perturbations)
4. Strong regularization (dropout, weight decay)
Common Use Cases
Enzyme Engineering
- Predict enzyme activity from sequence
- Design mutations to improve stability
- Screen for desired catalytic properties
Antibody Design
- Predict binding affinity
- Optimize antibody sequences
- Predict immunogenicity
Drug Target Identification
- Predict protein function
- Identify druggable sites
- Analyze protein-ligand interactions
Protein Structure Prediction
- Predict secondary structure from sequence
- Generate contact maps for tertiary structure
- Refine AlphaFold predictions
Integration with Other Tools
AlphaFold Integration
Load AlphaFold-predicted structures:
from torchdrug import data
# Load AlphaFold structure
protein = data.Protein.from_pdb("alphafold_structure.pdb")
# Use in TorchDrug workflows
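AlphaFold PDB files store the per-residue confidence score (pLDDT) in the B-factor column, which can be useful for screening a structure before loading it. The sketch below reads Cα records from PDB text using the fixed-column ATOM format. It is a minimal stdlib parser for illustration, not a replacement for `data.Protein.from_pdb`, and the function name `read_ca_records` is hypothetical.

```python
def read_ca_records(pdb_text):
    """Extract (residue number, (x, y, z), B-factor) for each Cα ATOM record.

    For AlphaFold models the B-factor field holds the pLDDT score.
    """
    records = []
    for line in pdb_text.splitlines():
        # Columns follow the fixed-width PDB ATOM layout.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue_number = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            b_factor = float(line[60:66])
            records.append((residue_number, xyz, b_factor))
    return records


def mean_plddt(records):
    """Average pLDDT over the parsed Cα records."""
    return sum(record[2] for record in records) / len(records)
```

Typical usage would be `records = read_ca_records(open("alphafold_structure.pdb").read())`, followed by a check on `mean_plddt(records)` before passing the file to TorchDrug.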
ESMFold Integration
Use ESMFold for structure prediction, then analyze with TorchDrug models.
Rosetta/PyRosetta
Generate structures with Rosetta, then import them into TorchDrug for analysis.