Torchdrug


name: torchdrug
description: "Graph-based drug discovery toolkit. Molecular property prediction (ADMET), protein modeling, knowledge graph reasoning, molecular generation, retrosynthesis, GNNs (GIN, GAT, SchNet), 40+ datasets, for PyTorch-based ML on molecules, proteins, and biomedical graphs."


TorchDrug

Overview

TorchDrug is a PyTorch-based machine learning toolbox for drug discovery and molecular science. It applies graph neural networks, pre-trained models, and task definitions to molecules, proteins, and biological knowledge graphs. Supported tasks include molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, and retrosynthesis planning, backed by 40+ curated datasets and 20+ model architectures.

When to Use This Skill

This skill should be used when working with:

Data Types:
- SMILES strings or molecular structures
- Protein sequences or 3D structures (PDB files)
- Chemical reactions and retrosynthesis
- Biomedical knowledge graphs
- Drug discovery datasets

Tasks:
- Predicting molecular properties (solubility, toxicity, activity)
- Protein function or structure prediction
- Drug-target binding prediction
- Generating new molecular structures
- Planning chemical synthesis routes
- Link prediction in biomedical knowledge bases
- Training graph neural networks on scientific data

Libraries and Integration:
- TorchDrug is the primary library
- Often used with RDKit for cheminformatics
- Compatible with PyTorch and PyTorch Lightning
- Integrates with AlphaFold and ESM for proteins

Getting Started

Installation

uv pip install torchdrug
# Or with optional dependencies
# (quoted so shells such as zsh do not expand the brackets)
uv pip install "torchdrug[full]"

Quick Example

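import torch
from torch.utils.data import random_split
from torchdrug import data, datasets, models, tasks

# Load molecular dataset
dataset = datasets.BBBP("~/molecule-datasets/")

# Random 80/10/10 split (prefer a scaffold split for realistic evaluation)
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train_set, valid_set, test_set = random_split(dataset, lengths)

# Define GNN model
model = models.GIN(
    input_dim=dataset.node_feature_dim,
    hidden_dims=[256, 256, 256],
    edge_input_dim=dataset.edge_feature_dim,
    batch_norm=True,
    readout="mean"
)

# Create property prediction task
task = tasks.PropertyPrediction(
    model,
    task=dataset.tasks,
    criterion="bce",
    metric=["auroc", "auprc"]
)
# Build the prediction head and label statistics
# (core.Engine does this automatically; a manual loop must call it)
task.preprocess(train_set, valid_set, test_set)

# Train with PyTorch; TorchDrug's DataLoader collates graphs into batches
optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
train_loader = data.DataLoader(train_set, batch_size=32, shuffle=True)

for epoch in range(100):
    for batch in train_loader:
        # The task's forward pass returns (loss, metrics)
        loss, metric = task(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()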

Core Capabilities

1. Molecular Property Prediction

Predict chemical, physical, and biological properties of molecules from structure.

Use Cases:
- Drug-likeness and ADMET properties
- Toxicity screening
- Quantum chemistry properties
- Binding affinity prediction

Key Components:
- 20+ molecular datasets (BBBP, HIV, Tox21, QM9, etc.)
- GNN models (GIN, GAT, SchNet)
- PropertyPrediction and MultipleBinaryClassification tasks

Reference: See references/molecular_property_prediction.md for:
- Complete dataset catalog
- Model selection guide
- Training workflows and best practices
- Feature engineering details

2. Protein Modeling

Work with protein sequences, structures, and properties.

Use Cases:
- Enzyme function prediction
- Protein stability and solubility
- Subcellular localization
- Protein-protein interactions
- Structure prediction

Key Components:
- 15+ protein datasets (EnzymeCommission, GeneOntology, PDBBind, etc.)
- Sequence models (ESM, ProteinBERT, ProteinLSTM)
- Structure models (GearNet, SchNet)
- Multiple task types for different prediction levels

Reference: See references/protein_modeling.md for:
- Protein-specific datasets
- Sequence vs structure models
- Pre-training strategies
- Integration with AlphaFold and ESM

3. Knowledge Graph Reasoning

Predict missing links and relationships in biological knowledge graphs.

Use Cases:
- Drug repurposing
- Disease mechanism discovery
- Gene-disease associations
- Multi-hop biomedical reasoning

Key Components:
- General KGs (FB15k, WN18) and biomedical (Hetionet)
- Embedding models (TransE, RotatE, ComplEx)
- KnowledgeGraphCompletion task

Reference: See references/knowledge_graphs.md for:
- Knowledge graph datasets (including Hetionet with 45k biomedical entities)
- Embedding model comparison
- Evaluation metrics and protocols
- Biomedical applications
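As a rough illustration of what a knowledge graph embedding model scores, the sketch below implements the TransE idea (a triple is plausible when head + relation lands close to tail) in plain Python. The entity names, the relation name, and the 2-dimensional vectors are made up for illustration, and this is not TorchDrug's implementation.

```python
import math

# Toy 2-d embeddings; all names and numbers are hypothetical.
entity_emb = {
    "aspirin": [0.0, 0.0],
    "headache": [1.0, 1.0],
    "diabetes": [-2.0, 3.0],
}
relation_emb = {"treats": [1.0, 1.0]}


def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail; higher is more plausible."""
    h, r, t = entity_emb[head], relation_emb[relation], entity_emb[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation):
    """Rank every other entity as a candidate tail, best first."""
    candidates = [e for e in entity_emb if e != head]
    return sorted(candidates, key=lambda e: transe_score(head, relation, e), reverse=True)
```

Link prediction then amounts to ranking candidate tails for a given head and relation; trained models learn the vectors instead of using hand-written ones.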

4. Molecular Generation

Generate novel molecular structures with desired properties.

Use Cases:
- De novo drug design
- Lead optimization
- Chemical space exploration
- Property-guided generation

Key Components:
- Autoregressive generation
- GCPN (policy-based generation)
- GraphAutoregressiveFlow
- Property optimization workflows

Reference: See references/molecular_generation.md for:
- Generation strategies (unconditional, conditional, scaffold-based)
- Multi-objective optimization
- Validation and filtering
- Integration with property prediction

5. Retrosynthesis

Predict synthetic routes from target molecules to starting materials.

Use Cases:
- Synthesis planning
- Route optimization
- Synthetic accessibility assessment
- Multi-step planning

Key Components:
- USPTO-50k reaction dataset
- CenterIdentification (reaction center prediction)
- SynthonCompletion (reactant prediction)
- End-to-end Retrosynthesis pipeline

Reference: See references/retrosynthesis.md for:
- Task decomposition (center ID → synthon completion)
- Multi-step synthesis planning
- Commercial availability checking
- Integration with other retrosynthesis tools

6. Graph Neural Network Models

Comprehensive catalog of GNN architectures for different data types and tasks.

Available Models:
- General GNNs: GCN, GAT, GIN, RGCN, MPNN
- 3D-aware: SchNet, GearNet
- Protein-specific: ESM, ProteinBERT, GearNet
- Knowledge graph: TransE, RotatE, ComplEx, SimplE
- Generative: GraphAutoregressiveFlow

Reference: See references/models_architectures.md for:
- Detailed model descriptions
- Model selection guide by task and dataset
- Architecture comparisons
- Implementation tips

7. Datasets

40+ curated datasets spanning chemistry, biology, and knowledge graphs.

Categories:
- Molecular properties (drug discovery, quantum chemistry)
- Protein properties (function, structure, interactions)
- Knowledge graphs (general and biomedical)
- Retrosynthesis reactions

Reference: See references/datasets.md for:
- Complete dataset catalog with sizes and tasks
- Dataset selection guide
- Loading and preprocessing
- Splitting strategies (random, scaffold)

Common Workflows

Workflow 1: Molecular Property Prediction

Scenario: Predict blood-brain barrier penetration for drug candidates.

Steps:
1. Load dataset: datasets.BBBP()
2. Choose model: GIN for molecular graphs
3. Define task: PropertyPrediction with binary classification
4. Train with scaffold split for realistic evaluation
5. Evaluate using AUROC and AUPRC

Navigation: references/molecular_property_prediction.md → Dataset selection → Model selection → Training
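The scaffold split in step 4 keeps structurally related molecules together, so the test set probes generalization to unseen chemotypes. The sketch below shows the idea in plain Python, assuming one scaffold string per molecule has already been computed (for example as a Murcko scaffold SMILES). The function name and the greedy assignment are illustrative assumptions, not TorchDrug's own routine.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Split sample indices so that no scaffold appears in more than one subset.

    `scaffolds` holds one precomputed scaffold string per molecule.
    """
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Place the largest scaffold groups first so rare scaffolds land in valid/test.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Hypothetical scaffold labels for ten molecules.
scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train_idx, valid_idx, test_idx = scaffold_split(scaffolds)
```

Because whole groups are assigned at once, the actual split sizes only approximate the requested fractions.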

Workflow 2: Protein Function Prediction

Scenario: Predict enzyme function from sequence.

Steps:
1. Load dataset: datasets.EnzymeCommission()
2. Choose model: ESM (pre-trained) or GearNet (with structure)
3. Define task: MultipleBinaryClassification, since a protein can carry several EC labels at once (use PropertyPrediction for single-label multi-class datasets)
4. Fine-tune pre-trained model or train from scratch
5. Evaluate using multi-label metrics such as F1-max and AUPRC, plus per-class metrics

Navigation: references/protein_modeling.md → Model selection (sequence vs structure) → Pre-training strategies

Workflow 3: Drug Repurposing via Knowledge Graphs

Scenario: Find new disease treatments in Hetionet.

Steps:
1. Load dataset: datasets.Hetionet()
2. Choose model: RotatE or ComplEx
3. Define task: KnowledgeGraphCompletion
4. Train with negative sampling
5. Query for "Compound-treats-Disease" predictions
6. Filter by plausibility and mechanism

Navigation: references/knowledge_graphs.md → Hetionet dataset → Model selection → Biomedical applications
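Negative sampling (step 4) pairs each true triple with corrupted ones so the model learns to score real facts higher. A minimal plain-Python sketch of tail corruption follows. It skips corruptions that are themselves known facts, the entity names are hypothetical, and the function is an illustration, not the sampler TorchDrug uses.

```python
import random


def sample_negative_tails(triple, entities, known_triples, num_negatives, rng):
    """Corrupt the tail of a true triple, skipping corruptions that are known facts."""
    head, relation, tail = triple
    candidates = [
        e for e in entities
        if e != tail and (head, relation, e) not in known_triples
    ]
    chosen = rng.sample(candidates, min(num_negatives, len(candidates)))
    return [(head, relation, e) for e in chosen]


# Hypothetical toy graph; the names are illustrative only.
entities = ["compound_a", "compound_b", "disease_x", "disease_y", "disease_z"]
known_triples = {
    ("compound_a", "treats", "disease_x"),
    ("compound_a", "treats", "disease_y"),
    ("compound_b", "treats", "disease_z"),
}
negatives = sample_negative_tails(
    ("compound_a", "treats", "disease_x"), entities, known_triples,
    num_negatives=2, rng=random.Random(0),
)
```

This sketch draws tails from all entities; a real pipeline might restrict candidates to entities of the expected type (here, diseases).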

Workflow 4: De Novo Molecule Generation

Scenario: Generate drug-like molecules optimized for target binding.

Steps:
1. Train property predictor on activity data
2. Choose generation approach: GCPN for RL-based optimization
3. Define reward function combining affinity, drug-likeness, synthesizability
4. Generate candidates with property constraints
5. Validate chemistry and filter by drug-likeness
6. Rank by multi-objective scoring

Navigation: references/molecular_generation.md → Conditional generation → Multi-objective optimization
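The reward and ranking steps above can be pictured with a small plain-Python sketch. It assumes each objective has already been scored and scaled to [0, 1]; the weights, threshold, candidate molecules, and scores are all made up for illustration.

```python
def combined_reward(scores, weights):
    """Weighted sum of per-objective scores, each assumed to be scaled to [0, 1]."""
    return sum(weights[name] * scores[name] for name in weights)


def rank_candidates(candidates, weights, min_drug_likeness=0.5):
    """Filter by a drug-likeness threshold, then sort by combined reward (best first)."""
    kept = [c for c in candidates if c["scores"]["drug_likeness"] >= min_drug_likeness]
    return sorted(kept, key=lambda c: combined_reward(c["scores"], weights), reverse=True)


# Hypothetical candidates with made-up scores.
weights = {"affinity": 0.5, "drug_likeness": 0.3, "synthesizability": 0.2}
candidates = [
    {"smiles": "CCO",
     "scores": {"affinity": 0.2, "drug_likeness": 0.6, "synthesizability": 0.9}},
    {"smiles": "c1ccccc1O",
     "scores": {"affinity": 0.8, "drug_likeness": 0.7, "synthesizability": 0.8}},
    {"smiles": "CCCCCCCCCCCC",
     "scores": {"affinity": 0.9, "drug_likeness": 0.1, "synthesizability": 0.9}},
]
ranked = rank_candidates(candidates, weights)
```

A weighted sum is only one way to combine objectives; it lets a strong score on one objective offset a weak score on another, which is why the hard drug-likeness filter is applied first.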

Workflow 5: Retrosynthesis Planning

Scenario: Plan synthesis route for target molecule.

Steps:
1. Load dataset: datasets.USPTO50k()
2. Train center identification model (RGCN)
3. Train synthon completion model (GIN)
4. Combine into end-to-end retrosynthesis pipeline
5. Apply recursively for multi-step planning
6. Check commercial availability of building blocks

Navigation: references/retrosynthesis.md → Task types → Multi-step planning
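Steps 5 and 6 describe a recursion: expand the target with a single-step model until every leaf is a purchasable building block. The plain-Python sketch below shows that control flow only. The `one_step` callable is a hypothetical stand-in for a trained retrosynthesis model, and the molecule identifiers are placeholders.

```python
def plan_route(target, one_step, purchasable, max_depth=5):
    """Recursively expand `target` until every leaf is purchasable.

    `one_step` maps a molecule to a list of reactants (a stand-in for a trained
    single-step retrosynthesis model). Returns a nested dict, or None on failure.
    """
    if target in purchasable:
        return {"molecule": target, "reactants": []}
    if max_depth == 0:
        return None
    reactants = one_step(target)
    if not reactants:
        return None
    subtrees = [plan_route(r, one_step, purchasable, max_depth - 1) for r in reactants]
    if any(tree is None for tree in subtrees):
        return None
    return {"molecule": target, "reactants": subtrees}


def leaves(tree):
    """Collect the purchasable starting materials of a route."""
    if not tree["reactants"]:
        return [tree["molecule"]]
    return [leaf for sub in tree["reactants"] for leaf in leaves(sub)]


# Hypothetical lookup table standing in for model predictions.
toy_predictions = {"T": ["I1", "B1"], "I1": ["B2", "B3"]}


def toy_one_step(molecule):
    return toy_predictions.get(molecule, [])


route = plan_route("T", toy_one_step, purchasable={"B1", "B2", "B3"})
```

This version follows only the top prediction at each step; a practical planner would usually explore several candidate disconnections and score complete routes.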

Integration Patterns

With RDKit

Convert between TorchDrug molecules and RDKit:

from torchdrug import data
from rdkit import Chem

# SMILES → TorchDrug molecule
smiles = "CCO"
mol = data.Molecule.from_smiles(smiles)

# TorchDrug → RDKit
rdkit_mol = mol.to_molecule()

# RDKit → TorchDrug
rdkit_mol = Chem.MolFromSmiles(smiles)
mol = data.Molecule.from_molecule(rdkit_mol)

With AlphaFold/ESM

Use predicted structures:

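from torchdrug import data, layers
from torchdrug.layers import geometry

# Load AlphaFold predicted structure
protein = data.Protein.from_pdb("AF-P12345-F1-model_v4.pdb")

# Build a residue-level graph with sequential and spatial (radius) edges
graph_construction = layers.GraphConstruction(
    node_layers=[geometry.AlphaCarbonNode()],
    edge_layers=[
        geometry.SequentialEdge(max_distance=2),
        geometry.SpatialEdge(radius=10.0, min_distance=5),
    ],
)
# Graph construction layers operate on packed (batched) proteins
graph = graph_construction(data.Protein.pack([protein]))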
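To make the two edge types concrete, the plain-Python sketch below builds sequential and radius edges directly from alpha-carbon coordinates. The coordinates are hypothetical, and the function is a simplified illustration of the concept, not the library's graph construction code.

```python
import math


def residue_edges(ca_coords, radius_cutoff=10.0):
    """Return directed (i, j, type) edges between residues from alpha-carbon coordinates.

    Simplification: neighbors along the chain get only a "sequential" edge,
    every other pair within the cutoff gets a "radius" edge.
    """
    edges = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if abs(i - j) == 1:
                edges.append((i, j, "sequential"))
            elif math.dist(ca_coords[i], ca_coords[j]) <= radius_cutoff:
                edges.append((i, j, "radius"))
    return edges


# Hypothetical coordinates in angstroms: a short chain with one distant residue.
coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (20.0, 20.0, 0.0),
]
edges = residue_edges(coords)
```

The double loop is quadratic in the number of residues, which is fine for a sketch but is why real implementations rely on neighbor-search routines.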

With PyTorch Lightning

Wrap tasks for Lightning training:

import torch
import pytorch_lightning as pl

class LightningTask(pl.LightningModule):
    def __init__(self, torchdrug_task):
        super().__init__()
        self.task = torchdrug_task

    def training_step(self, batch, batch_idx):
        # TorchDrug tasks return (loss, metrics); Lightning expects the loss
        loss, metric = self.task(batch)
        return loss

    def validation_step(self, batch, batch_idx):
        pred = self.task.predict(batch)
        target = self.task.target(batch)
        return {"pred": pred, "target": target}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

Technical Details

For deep dives into TorchDrug's architecture:

Core Concepts: See references/core_concepts.md for:
- Architecture philosophy (modular, configurable)
- Data structures (Graph, Molecule, Protein, PackedGraph)
- Model interface and forward function signature
- Task interface (predict, target, forward, evaluate)
- Training workflows and best practices
- Loss functions and metrics
- Common pitfalls and debugging
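Among the data structures listed, PackedGraph is the least self-explanatory: it stores a batch of graphs as one large graph whose node indices are shifted per graph. The plain-Python sketch below illustrates that idea with dictionaries. The field names are assumptions chosen for the example, not the library's attributes.

```python
def pack_graphs(graphs):
    """Merge graphs into one edge list with shifted node indices.

    Each input graph is a dict with `num_node` and `edges` (pairs of local indices).
    """
    edges, num_nodes, offsets = [], [], []
    offset = 0
    for graph in graphs:
        offsets.append(offset)
        num_nodes.append(graph["num_node"])
        edges.extend((src + offset, dst + offset) for src, dst in graph["edges"])
        offset += graph["num_node"]
    return {"num_nodes": num_nodes, "offsets": offsets, "edges": edges}


def unpack_graph(packed, index):
    """Recover the edges of one graph in its own local indices."""
    start = packed["offsets"][index]
    end = start + packed["num_nodes"][index]
    return [(s - start, d - start) for s, d in packed["edges"] if start <= s < end]


# Two hypothetical graphs: a 3-node path and a 2-node pair.
packed = pack_graphs([
    {"num_node": 3, "edges": [(0, 1), (1, 2)]},
    {"num_node": 2, "edges": [(0, 1)]},
])
```

Because no edge crosses between graphs, message passing over the packed graph gives the same result as processing each graph separately, while running as a single batched operation.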

Quick Reference Cheat Sheet

Choose Dataset:
- Molecular property → references/datasets.md → Molecular section
- Protein task → references/datasets.md → Protein section
- Knowledge graph → references/datasets.md → Knowledge graph section

Choose Model:
- Molecules → references/models_architectures.md → GNN section → GIN/GAT/SchNet
- Proteins (sequence) → references/models_architectures.md → Protein section → ESM
- Proteins (structure) → references/models_architectures.md → Protein section → GearNet
- Knowledge graph → references/models_architectures.md → KG section → RotatE/ComplEx

Common Tasks:
- Property prediction → references/molecular_property_prediction.md or references/protein_modeling.md
- Generation → references/molecular_generation.md
- Retrosynthesis → references/retrosynthesis.md
- KG reasoning → references/knowledge_graphs.md

Understand Architecture:
- Data structures → references/core_concepts.md → Data Structures
- Model design → references/core_concepts.md → Model Interface
- Task design → references/core_concepts.md → Task Interface

Troubleshooting Common Issues

Issue: Dimension mismatch errors
→ Check model.input_dim matches dataset.node_feature_dim
→ See references/core_concepts.md → Essential Attributes

Issue: Poor performance on molecular tasks
→ Use scaffold splitting, not random
→ Try GIN instead of GCN
→ See references/molecular_property_prediction.md → Best Practices

Issue: Protein model not learning
→ Use pre-trained ESM for sequence tasks
→ Check edge construction for structure models
→ See references/protein_modeling.md → Training Workflows

Issue: Memory errors with large graphs
→ Reduce batch size
→ Use gradient accumulation
→ See references/core_concepts.md → Memory Efficiency

Issue: Generated molecules are invalid
→ Add validity constraints
→ Post-process with RDKit validation
→ See references/molecular_generation.md → Validation and Filtering
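On the memory issue above: gradient accumulation works because the gradient of a mean loss over a large batch equals the size-weighted average of the gradients over its micro-batches. The plain-Python check below demonstrates this on a one-parameter least-squares model with made-up data. In a PyTorch loop the same effect comes from scaling each micro-batch loss, calling backward() on each, and calling optimizer.step() once per group.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Sum micro-batch gradients, weighting each by its share of the full batch."""
    total = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro = data[start:start + micro_batch_size]
        total += grad_mse(w, micro) * len(micro) / len(data)
    return total


# Hypothetical (x, y) pairs.
samples = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
full = grad_mse(0.5, samples)
accumulated = accumulated_grad(0.5, samples, micro_batch_size=2)
```

The two values agree up to floating-point rounding, so the optimizer sees the same update while only one micro-batch has to fit in memory at a time. Layers that depend on batch statistics, such as batch normalization, are the usual caveat.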

Resources

Official Documentation: https://torchdrug.ai/docs/
GitHub: https://github.com/DeepGraphLearning/torchdrug
Paper: TorchDrug: A Powerful and Flexible Machine Learning Platform for Drug Discovery

Summary

Navigate to the appropriate reference file based on your task:

  1. Molecular property prediction → molecular_property_prediction.md
  2. Protein modeling → protein_modeling.md
  3. Knowledge graphs → knowledge_graphs.md
  4. Molecular generation → molecular_generation.md
  5. Retrosynthesis → retrosynthesis.md
  6. Model selection → models_architectures.md
  7. Dataset selection → datasets.md
  8. Technical details → core_concepts.md

Each reference provides comprehensive coverage of its domain with examples, best practices, and common use cases.
