references/protein_modeling.md

Protein Modeling

Overview

TorchDrug provides extensive support for protein-related tasks including sequence analysis, structure prediction, property prediction, and protein-protein interactions. Proteins are represented as graphs where nodes are amino acid residues and edges represent spatial or sequential relationships.
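
A protein can be constructed directly from its amino acid sequence (or a PDB file) and then treated as a residue-level graph. A minimal sketch, where the toy sequence is an arbitrary placeholder:

from torchdrug import data

# Build a residue-level graph from a raw sequence (placeholder sequence)
protein = data.Protein.from_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")

print(protein.num_residue)    # number of residue nodes
print(protein.residue_type)   # per-residue amino acid indices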

Available Datasets

Protein Function Prediction

Enzyme Function:
  • EnzymeCommission (17,562 proteins): EC number classification (7 levels)
  • BetaLactamase (5,864 sequences): Enzyme activity prediction

Protein Characteristics:
  • Fluorescence (54,025 sequences): GFP fluorescence intensity
  • Stability (53,614 sequences): Thermostability prediction
  • Solubility (62,478 sequences): Protein solubility classification
  • BinaryLocalization (22,168 proteins): Subcellular localization (membrane vs. soluble)
  • SubcellularLocalization (8,943 proteins): 10-class localization prediction

Gene Ontology:
  • GeneOntology (46,796 proteins): GO term prediction across biological process, molecular function, and cellular component
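
These datasets live under torchdrug.datasets and download themselves on first use. A minimal loading sketch (the cache path and truncation length are arbitrary choices):

from torchdrug import datasets, transforms

# Truncate very long chains and expose residue-level features as node features
transform = transforms.Compose([
    transforms.TruncateProtein(max_length=550, random=False),
    transforms.ProteinView(view="residue"),
])

dataset = datasets.EnzymeCommission("~/protein-datasets/", transform=transform)
train_set, valid_set, test_set = dataset.split()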

Protein Structure Prediction

  • Fold (16,712 proteins): Structural fold classification (1,195 classes)
  • SecondaryStructure (8,678 proteins): 3-state or 8-state secondary structure prediction
  • ContactPrediction via ProteinNet: Residue-residue contact maps

Protein Interaction

Protein-Protein Interactions:
  • HumanPPI (1,412 proteins, 6,584 interactions): Human protein interaction network
  • YeastPPI (2,018 proteins, 6,451 interactions): Yeast protein interaction network
  • PPIAffinity (2,156 protein pairs): Binding affinity measurements

Protein-Ligand Binding:
  • BindingDB (~1.5M entries): Comprehensive binding affinity database
  • PDBBind (20,000+ complexes): 3D structure-based binding data
      • Refined set: High-quality crystal structures
      • Core set: Diverse benchmark set

Large-Scale Protein Databases

  • AlphaFoldDB: Access to 200M+ predicted protein structures
  • ProteinNet: Standardized dataset for structure prediction
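
Loading a shard of AlphaFold DB follows the same pattern as the smaller datasets. A hedged sketch; the species_id and split_id arguments reflect one way this interface has been exposed and should be checked against the installed version:

from torchdrug import datasets

# One species shard of the AlphaFold DB predicted structures
dataset = datasets.AlphaFoldDB("~/protein-datasets/", species_id=0, split_id=0)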

Task Types

NodePropertyPrediction

Predict properties at the residue (node) level, such as secondary structure or contact maps.

Use Cases:
  • Secondary structure prediction (helix, sheet, coil)
  • Residue-level disorder prediction
  • Post-translational modification sites
  • Binding site prediction
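
A residue-level task wraps an encoder with a per-node classification head. A sketch for 3-state secondary structure, assuming a CNN encoder; the feature dimension, criterion, and metric names are assumptions to verify against the dataset in use:

from torchdrug import models, tasks

# Sequence encoder producing per-residue representations
model = models.ProteinCNN(input_dim=21, hidden_dims=[1024, 1024],
                          kernel_size=5, padding=2)

# Per-residue classifier for 3-state secondary structure
task = tasks.NodePropertyPrediction(
    model, criterion="ce",
    metric=["micro_acc", "macro_acc"],
    num_mlp_layer=2, num_class=3,
)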

PropertyPrediction

Predict protein-level properties like function, stability, or localization.

Use Cases:
  • Enzyme function classification
  • Subcellular localization
  • Protein stability prediction
  • Gene Ontology term prediction

InteractionPrediction

Predict interactions between protein pairs or protein-ligand pairs.

Key Features:
  • Handles both sequence and structure inputs
  • Supports symmetric (PPI) and asymmetric (protein-ligand) interactions
  • Multiple negative sampling strategies
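
For asymmetric protein-ligand prediction, the task takes one encoder per entity via model and model2. A sketch; the encoder sizes, the ligand input dimension, and the target column name are placeholders:

from torchdrug import models, tasks

# Protein encoder (residue features) and small-molecule encoder (atom features)
protein_model = models.ProteinCNN(input_dim=21, hidden_dims=[1024, 1024],
                                  kernel_size=5, padding=2)
ligand_model = models.GIN(input_dim=66,          # use dataset.node_feature_dim in practice
                          hidden_dims=[256, 256, 256, 256],
                          batch_norm=True, readout="mean")

task = tasks.InteractionPrediction(
    protein_model, model2=ligand_model,
    task=["affinity"],                           # hypothetical target column
    criterion="mse", metric=["mae", "rmse"],
    num_mlp_layer=2, normalization=False,
)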

ContactPrediction

Specialized task for predicting spatial proximity between residues in folded structures.

Applications:
  • Structure prediction from sequence
  • Protein folding pathway analysis
  • Validation of predicted structures
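
The contact task truncates long chains and scores residue pairs beyond a sequence-separation gap against a distance threshold. A sketch; the truncation length, threshold, gap, and metric names are assumptions:

from torchdrug import models, tasks

# Sequence encoder; its per-residue outputs feed a pairwise contact head
model = models.ProteinResNet(input_dim=21, hidden_dims=[512] * 8, kernel_size=3)

task = tasks.ContactPrediction(
    model, max_length=400, random_truncate=True,
    threshold=8.0, gap=6,
    criterion="bce", metric=["accuracy", "prec@L5"],
    num_mlp_layer=2,
)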

Protein Representation Models

Sequence-Based Models

ESM (Evolutionary Scale Modeling):
  • Pre-trained transformer model on 250M sequences
  • State-of-the-art for sequence-only tasks
  • Available in multiple sizes (ESM-1b, ESM-2)
  • Captures evolutionary and structural information

ProteinBERT:
  • BERT-style masked language model
  • Pre-trained on UniProt sequences
  • Good for transfer learning

ProteinLSTM:
  • Bidirectional LSTM for sequence encoding
  • Lightweight and fast
  • Good baseline for sequence tasks

ProteinCNN / ProteinResNet:
  • Convolutional architectures
  • Capture local sequence patterns
  • Faster than transformer models

Structure-Based Models

GearNet (Geometry-Aware Relational Graph Neural Network):
  • Incorporates 3D geometric information
  • Edge types based on sequential, radius, and K-nearest-neighbor relations
  • State-of-the-art for structure-based tasks
  • Supports both backbone and full-atom representations
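
A typical GearNet-Edge style instantiation looks like the sketch below; the layer sizes follow commonly published configurations and are illustrative rather than required values:

from torchdrug import models

gearnet = models.GearNet(
    input_dim=21,                                # residue-type features
    hidden_dims=[512, 512, 512, 512, 512, 512],
    num_relation=7,                              # one relation per edge type
    edge_input_dim=59,                           # enables edge message passing
    num_angle_bin=8,
    batch_norm=True, concat_hidden=True, short_cut=True,
    readout="sum",
)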

GCN/GAT/GIN on Protein Graphs:
  • Standard GNN architectures adapted for proteins
  • Flexible edge definitions (sequence, spatial, contact)

SchNet:
  • Continuous-filter convolutions
  • Handles 3D coordinates directly
  • Good for structure prediction and protein-ligand binding

Feature-Based Models

Statistic Features:
  • Amino acid composition
  • Sequence length statistics
  • Motif counts

Physicochemical Features:
  • Hydrophobicity scales
  • Charge properties
  • Secondary structure propensity
  • Molecular weight, pI

Protein Graph Construction

Edge Types

Sequential Edges:
  • Connect adjacent residues in sequence
  • Capture primary structure

Spatial Edges:
  • K-nearest neighbors in 3D space
  • Radius cutoff (e.g., Cα atoms within 10 Å)
  • Capture tertiary structure

Contact Edges:
  • Based on heavy-atom distances
  • Typically < 8 Å threshold
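
In TorchDrug these edge types are composed through a graph-construction model, which rewires a loaded structure before it is passed to a structure encoder. A minimal sketch following the pattern used with GearNet-style models:

from torchdrug import layers
from torchdrug.layers import geometry

graph_construction_model = layers.GraphConstruction(
    node_layers=[geometry.AlphaCarbonNode()],              # one node per residue (Cα)
    edge_layers=[
        geometry.SequentialEdge(max_distance=2),            # sequence neighbors
        geometry.SpatialEdge(radius=10.0, min_distance=5),  # radius edges
        geometry.KNNEdge(k=10, min_distance=5),             # K-nearest-neighbor edges
    ],
    edge_feature="gearnet",
)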

Node Features

Residue Identity:
  • One-hot encoding of 20 amino acids
  • Learned embeddings

Position Information:
  • 3D coordinates (Cα, N, C, O)
  • Backbone dihedral angles (phi, psi, omega)
  • Relative spatial position encodings

Physicochemical Properties:
  • Hydrophobicity
  • Charge
  • Size
  • Secondary structure
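
These attributes are accessible on a loaded Protein object. A small sketch; the sequence is an arbitrary placeholder, and coordinate attributes are only populated when the protein comes from a structure file:

from torchdrug import data

protein = data.Protein.from_sequence("MKTAYIAKQR")   # placeholder sequence

print(protein.residue_type)           # residue identities, one index per residue
print(protein.residue_feature.shape)  # default residue-level feature matrix
# For structures loaded with data.Protein.from_pdb, node_position holds 3D coordinates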

Training Workflows

Pre-training Strategies

Self-Supervised Pre-training:
  • Masked residue prediction (like BERT)
  • Distance prediction between residues
  • Angle prediction between consecutive residues
  • Dihedral angle prediction (phi, psi, omega)
  • Contact map prediction
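
Several of these objectives are available as ready-made pre-training tasks. A hedged sketch of BERT-style masked residue prediction; whether AttributeMasking pairs with a given encoder, and the mask rate shown, are assumptions to verify against the installed version:

from torchdrug import models, tasks

# Any protein encoder can be pre-trained; a small CNN is used here as a stand-in
model = models.ProteinCNN(input_dim=21, hidden_dims=[1024, 1024],
                          kernel_size=5, padding=2)

# Mask a fraction of residues and train the encoder to recover them
pretrain_task = tasks.AttributeMasking(model, mask_rate=0.15, num_mlp_layer=2)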

Pre-trained Model Usage:

from torchdrug import models, tasks

# Load pre-trained ESM-1b (weights are downloaded to / read from this local directory)
model = models.ESM(path="~/protein-model-weights/", model="ESM-1b")

# Fine-tune on a downstream regression task
task = tasks.PropertyPrediction(
    model, task=["stability"],
    criterion="mse", metric=["mae", "rmse"]
)

Multi-Task Learning

Train on multiple related tasks simultaneously:
  • Joint prediction of function, localization, and stability
  • Improves generalization and data efficiency
  • Shares representations across tasks
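
In the simplest case, several target columns from one dataset can be handed to a single PropertyPrediction head. A sketch; the encoder size and the target names below are hypothetical placeholders:

from torchdrug import models, tasks

# Shared encoder for all targets
model = models.ProteinLSTM(input_dim=21, hidden_dim=640, num_layers=3)

task = tasks.PropertyPrediction(
    model, task=["melting_temperature", "expression_level"],  # hypothetical targets
    criterion="mse", metric=["mae", "rmse"],
)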

Best Practices

For Sequence-Only Tasks:
  1. Start with pre-trained ESM or ProteinBERT
  2. Fine-tune with small learning rate (1e-5 to 1e-4), as in the sketch after this list
  3. Use frozen embeddings for small datasets
  4. Apply dropout for regularization
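
Point 2 typically translates into a standard Engine setup with a reduced learning rate. A sketch; the task and dataset splits are assumed to come from the earlier examples, and the hyper-parameters are illustrative:

import torch
from torchdrug import core

# task, train_set, valid_set, test_set defined as in the examples above
optimizer = torch.optim.Adam(task.parameters(), lr=1e-5)
solver = core.Engine(task, train_set, valid_set, test_set, optimizer,
                     gpus=[0], batch_size=32)   # set gpus=None for CPU
solver.train(num_epoch=10)
solver.evaluate("valid")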

For Structure-Based Tasks:
  1. Use GearNet with multiple edge types
  2. Include geometric features (angles, dihedrals)
  3. Pre-train on large structure databases
  4. Use data augmentation (rotations, crops)

For Small Datasets:
  1. Transfer learning from pre-trained models
  2. Multi-task learning with related tasks
  3. Data augmentation (sequence mutations, structure perturbations)
  4. Strong regularization (dropout, weight decay)

Common Use Cases

Enzyme Engineering

  • Predict enzyme activity from sequence
  • Design mutations to improve stability
  • Screen for desired catalytic properties

Antibody Design

  • Predict binding affinity
  • Optimize antibody sequences
  • Predict immunogenicity

Drug Target Identification

  • Predict protein function
  • Identify druggable sites
  • Analyze protein-ligand interactions

Protein Structure Prediction

  • Predict secondary structure from sequence
  • Generate contact maps for tertiary structure
  • Refine AlphaFold predictions

Integration with Other Tools

AlphaFold Integration

Load AlphaFold-predicted structures:

from torchdrug import data

# Load an AlphaFold-predicted structure from a local PDB file
protein = data.Protein.from_pdb("alphafold_structure.pdb")

# Use it in TorchDrug workflows like any other protein,
# e.g. batch several structures into a single packed graph
proteins = data.Protein.pack([protein])

ESMFold Integration

Use ESMFold for structure prediction, then analyze with TorchDrug models.

Rosetta/PyRosetta

Generate structures with Rosetta, import to TorchDrug for analysis.
