Datasets Reference
Overview
TorchDrug provides 40+ curated datasets across multiple domains: molecular property prediction, protein modeling, knowledge graph reasoning, and retrosynthesis. All datasets support lazy loading, automatic downloading, and customizable feature extraction.
Molecular Property Prediction Datasets
Drug Discovery Classification
| Dataset | Size | Task | Classes | Description |
|---|---|---|---|---|
| BACE | 1,513 | Binary | 2 | β-secretase inhibition for Alzheimer's |
| BBBP | 2,039 | Binary | 2 | Blood-brain barrier penetration |
| HIV | 41,127 | Binary | 2 | Inhibition of HIV replication |
| ClinTox | 1,478 | Multi-label | 2 | Clinical trial toxicity |
| SIDER | 1,427 | Multi-label | 27 | Side effects by system organ class |
| Tox21 | 7,831 | Multi-label | 12 | Toxicity across 12 targets |
| ToxCast | 8,576 | Multi-label | 617 | High-throughput toxicology |
| MUV | 93,087 | Multi-label | 17 | Unbiased validation for screening |
Key Features: - All use scaffold splits for realistic evaluation - Binary classification metrics: AUROC, AUPRC - Multi-label handles missing values
Use Cases: - Drug safety prediction - Virtual screening - ADMET property prediction
Drug Discovery Regression
| Dataset | Size | Property | Units | Description |
|---|---|---|---|---|
| ESOL | 1,128 | Solubility | log(mol/L) | Water solubility |
| FreeSolv | 642 | Hydration | kcal/mol | Hydration free energy |
| Lipophilicity | 4,200 | LogD | - | Octanol/water distribution |
| SAMPL | 643 | Solvation | kcal/mol | Solvation free energies |
Metrics: MAE, RMSE, R² Use Cases: ADME optimization, lead optimization
Quantum Chemistry
| Dataset | Size | Properties | Description |
|---|---|---|---|
| QM7 | 7,165 | 1 | Atomization energy |
| QM8 | 21,786 | 12 | Electronic spectra, excited states |
| QM9 | 133,885 | 12 | Geometric, energetic, electronic, thermodynamic |
| PCQM4M | 3.8M | 1 | Large-scale HOMO-LUMO gap |
Properties (QM9): - Dipole moment - Isotropic polarizability - HOMO/LUMO energies - Internal energy, enthalpy, free energy - Heat capacity - Electronic spatial extent
Use Cases: - Quantum property prediction - Method development benchmarking - Pre-training molecular models
Large Molecule Databases
| Dataset | Size | Description | Use Case |
|---|---|---|---|
| ZINC250k | 250,000 | Drug-like molecules | Generative model training |
| ZINC2M | 2,000,000 | Drug-like molecules | Large-scale pre-training |
| ChEMBL | Millions | Bioactive molecules | Property prediction, generation |
Protein Datasets
Function Prediction
| Dataset | Size | Task | Classes | Description |
|---|---|---|---|---|
| EnzymeCommission | 17,562 | Multi-class | 7 levels | EC number classification |
| GeneOntology | 46,796 | Multi-label | 489 | GO term prediction (BP/MF/CC) |
| BetaLactamase | 5,864 | Regression | - | Enzyme activity levels |
| Fluorescence | 54,025 | Regression | - | GFP fluorescence intensity |
| Stability | 53,614 | Regression | - | Thermostability (ΔΔG) |
Features: - Sequence and/or structure input - Evolutionary information available - Multiple train/test splits
Use Cases: - Protein engineering - Function annotation - Enzyme design
Localization and Solubility
| Dataset | Size | Task | Classes | Description |
|---|---|---|---|---|
| Solubility | 62,478 | Binary | 2 | Protein solubility |
| BinaryLocalization | 22,168 | Binary | 2 | Membrane vs soluble |
| SubcellularLocalization | 8,943 | Multi-class | 10 | Subcellular compartment |
Use Cases: - Protein expression optimization - Target identification - Cell biology
Structure Prediction
| Dataset | Size | Task | Description |
|---|---|---|---|
| Fold | 16,712 | Multi-class (1,195) | Structural fold recognition |
| SecondaryStructure | 8,678 | Sequence labeling | 3-state or 8-state prediction |
| ProteinNet | Varied | Contact prediction | Residue-residue contacts |
Use Cases: - Structure prediction pipelines - Fold recognition - Contact map generation
Protein Interactions
| Dataset | Size | Positives | Negatives | Description |
|---|---|---|---|---|
| HumanPPI | 1,412 proteins | 6,584 | - | Human protein interactions |
| YeastPPI | 2,018 proteins | 6,451 | - | Yeast protein interactions |
| PPIAffinity | 2,156 pairs | - | - | Binding affinity values |
Use Cases: - PPI prediction - Network biology - Drug target identification
Protein-Ligand Binding
| Dataset | Size | Type | Description |
|---|---|---|---|
| BindingDB | ~1.5M | Affinity | Comprehensive binding data |
| PDBBind | 20,000+ | 3D complexes | Structure-based binding |
| - Refined Set | 5,316 | High quality | Curated crystal structures |
| - Core Set | 285 | Benchmark | Diverse test set |
Use Cases: - Binding affinity prediction - Structure-based drug design - Scoring function development
Large Protein Databases
| Dataset | Size | Description |
|---|---|---|
| AlphaFoldDB | 200M+ | Predicted structures for most known proteins |
| UniProt | Integration | Sequence and annotation data |
Knowledge Graph Datasets
General Knowledge
| Dataset | Entities | Relations | Triples | Domain |
|---|---|---|---|---|
| FB15k | 14,951 | 1,345 | 592,213 | Freebase (general knowledge) |
| FB15k-237 | 14,541 | 237 | 310,116 | Filtered Freebase |
| WN18 | 40,943 | 18 | 151,442 | WordNet (lexical) |
| WN18RR | 40,943 | 11 | 93,003 | Filtered WordNet |
Relation Types (FB15k-237):
- /people/person/nationality
- /film/film/genre
- /location/location/contains
- /business/company/founders
- Many more...
Use Cases: - Link prediction - Relation extraction - Knowledge base completion
Biomedical Knowledge
| Dataset | Entities | Relations | Triples | Description |
|---|---|---|---|---|
| Hetionet | 45,158 | 24 | 2,250,197 | Integrates 29 biomedical databases |
Entity Types in Hetionet: - Genes (20,945) - Compounds (1,552) - Diseases (137) - Anatomy (400) - Pathways (1,822) - Pharmacologic classes - Side effects - Symptoms - Molecular functions - Biological processes - Cellular components
Relation Types: - Compound-binds-Gene - Gene-associates-Disease - Disease-presents-Symptom - Compound-treats-Disease - Compound-causes-Side effect - Gene-participates-Pathway - And 18 more...
Use Cases: - Drug repurposing - Disease mechanism discovery - Target identification - Multi-hop reasoning in biomedicine
Citation Network Datasets
| Dataset | Nodes | Edges | Classes | Description |
|---|---|---|---|---|
| Cora | 2,708 | 5,429 | 7 | Machine learning papers |
| CiteSeer | 3,327 | 4,732 | 6 | Computer science papers |
| PubMed | 19,717 | 44,338 | 3 | Biomedical papers |
Use Cases: - Node classification - GNN baseline comparisons - Method development
Retrosynthesis Datasets
| Dataset | Size | Description |
|---|---|---|
| USPTO-50k | 50,017 | Curated patent reactions, single-step |
Features: - Product → Reactants mapping - Atom mapping for reaction centers - Canonicalized SMILES - Balanced across reaction types
Splits: - Train: ~40,000 - Validation: ~5,000 - Test: ~5,000
Use Cases: - Retrosynthesis prediction - Reaction type classification - Synthetic route planning
Dataset Usage Patterns
Loading Datasets
from torchdrug import datasets
# Basic loading
dataset = datasets.BBBP("~/molecule-datasets/")
# With transforms
from torchdrug import transforms
transform = transforms.VirtualNode()
dataset = datasets.BBBP("~/molecule-datasets/", transform=transform)
# Protein dataset
dataset = datasets.EnzymeCommission("~/protein-datasets/")
# Knowledge graph
dataset = datasets.FB15k237("~/kg-datasets/")
Data Splitting
# Random split
train, valid, test = dataset.split([0.8, 0.1, 0.1])
# Scaffold split (for molecules)
from torchdrug import utils
train, valid, test = dataset.split(
utils.scaffold_split(dataset, [0.8, 0.1, 0.1])
)
# Predefined splits (some datasets)
train, valid, test = dataset.split()
Feature Extraction
Node Features (Molecules): - Atom type (one-hot or embedding) - Formal charge - Hybridization - Aromaticity - Number of hydrogens - Chirality
Edge Features (Molecules): - Bond type (single, double, triple, aromatic) - Stereochemistry - Conjugation - Ring membership
Node Features (Proteins): - Amino acid type (one-hot) - Physicochemical properties - Position in sequence - Secondary structure - Solvent accessibility
Edge Features (Proteins): - Edge type (sequential, spatial, contact) - Distance - Angles and dihedrals
Choosing Datasets
By Task
Molecular Property Prediction: - Start with BBBP or HIV (medium size, clear task) - Use QM9 for quantum properties - ESOL/FreeSolv for regression
Protein Function: - EnzymeCommission (well-defined classes) - GeneOntology (comprehensive annotations)
Drug Safety: - Tox21 (standard benchmark) - ClinTox (clinical relevance)
Structure-Based: - PDBBind (protein-ligand) - ProteinNet (structure prediction)
Knowledge Graph: - FB15k-237 (standard benchmark) - Hetionet (biomedical applications)
Generation: - ZINC250k (training) - QM9 (with properties)
Retrosynthesis: - USPTO-50k (only choice)
By Size and Resources
Small (<5k, for testing): - BACE, FreeSolv, ClinTox - Core set of PDBBind
Medium (5k-100k): - BBBP, HIV, ESOL, Tox21 - EnzymeCommission, Fold - FB15k-237, WN18RR
Large (>100k): - QM9, MUV, PCQM4M - GeneOntology, AlphaFoldDB - ZINC2M, BindingDB
By Domain
Drug Discovery: BBBP, HIV, Tox21, ESOL, ZINC Quantum Chemistry: QM7, QM8, QM9, PCQM4M Protein Engineering: Fluorescence, Stability, Solubility Structural Biology: Fold, PDBBind, ProteinNet, AlphaFoldDB Biomedical: Hetionet, GeneOntology, EnzymeCommission Retrosynthesis: USPTO-50k
Best Practices
- Start Small: Test on small datasets before scaling
- Scaffold Split: Use for realistic drug discovery evaluation
- Balanced Metrics: Use AUROC + AUPRC for imbalanced data
- Multiple Runs: Report mean ± std over multiple random seeds
- Data Leakage: Be careful with pre-trained models
- Domain Knowledge: Understand what you're predicting
- Validation: Always use held-out test set
- Preprocessing: Standardize features, handle missing values