references/datasets_reference.md

PyTorch Geometric Datasets Reference

This document provides a comprehensive catalog of all datasets available in torch_geometric.datasets.

Citation Networks

Planetoid

Usage: Node classification, semi-supervised learning Networks: Cora, CiteSeer, PubMed Description: Citation networks where nodes are papers and edges are citations - Cora: 2,708 nodes, 5,429 edges, 7 classes, 1,433 features - CiteSeer: 3,327 nodes, 4,732 edges, 6 classes, 3,703 features - PubMed: 19,717 nodes, 44,338 edges, 3 classes, 500 features

from torch_geometric.datasets import Planetoid
dataset = Planetoid(root='/tmp/Cora', name='Cora')

Coauthor

Usage: Node classification on collaboration networks Networks: CS, Physics Description: Co-authorship networks from Microsoft Academic Graph - CS: 18,333 nodes, 81,894 edges, 15 classes (computer science) - Physics: 34,493 nodes, 247,962 edges, 5 classes (physics)

from torch_geometric.datasets import Coauthor
dataset = Coauthor(root='/tmp/CS', name='CS')

Amazon

Usage: Node classification on product networks Networks: Computers, Photo Description: Amazon co-purchase networks where nodes are products - Computers: 13,752 nodes, 245,861 edges, 10 classes - Photo: 7,650 nodes, 119,081 edges, 8 classes

from torch_geometric.datasets import Amazon
dataset = Amazon(root='/tmp/Computers', name='Computers')

CitationFull

Usage: Citation network analysis Networks: Cora, Cora_ML, DBLP, PubMed Description: Full citation networks without sampling

from torch_geometric.datasets import CitationFull
dataset = CitationFull(root='/tmp/Cora', name='Cora')

Graph Classification

TUDataset

Usage: Graph classification, graph kernel benchmarks Description: Collection of 120+ graph classification datasets - MUTAG: 188 graphs, 2 classes (molecular compounds) - PROTEINS: 1,113 graphs, 2 classes (protein structures) - ENZYMES: 600 graphs, 6 classes (protein enzymes) - IMDB-BINARY: 1,000 graphs, 2 classes (social networks) - REDDIT-BINARY: 2,000 graphs, 2 classes (discussion threads) - COLLAB: 5,000 graphs, 3 classes (scientific collaborations) - NCI1: 4,110 graphs, 2 classes (chemical compounds) - DD: 1,178 graphs, 2 classes (protein structures)

from torch_geometric.datasets import TUDataset
dataset = TUDataset(root='/tmp/ENZYMES', name='ENZYMES')

MoleculeNet

Usage: Molecular property prediction Datasets: Over 10 molecular benchmark datasets Description: Comprehensive molecular machine learning benchmarks - ESOL: Aqueous solubility (regression) - FreeSolv: Hydration free energy (regression) - Lipophilicity: Octanol/water distribution (regression) - BACE: Binding results (classification) - BBBP: Blood-brain barrier penetration (classification) - HIV: HIV inhibition (classification) - Tox21: Toxicity prediction (multi-task classification) - ToxCast: Toxicology forecasting (multi-task classification) - SIDER: Side effects (multi-task classification) - ClinTox: Clinical trial toxicity (multi-task classification)

from torch_geometric.datasets import MoleculeNet
dataset = MoleculeNet(root='/tmp/ESOL', name='ESOL')

Molecular and Chemical Datasets

QM7b

Usage: Molecular property prediction (quantum mechanics) Description: 7,211 molecules with up to 7 heavy atoms - Properties: Atomization energies, electronic properties

from torch_geometric.datasets import QM7b
dataset = QM7b(root='/tmp/QM7b')

QM9

Usage: Molecular property prediction (quantum mechanics) Description: ~130,000 molecules with up to 9 heavy atoms (C, O, N, F) - Properties: 19 quantum chemical properties including HOMO, LUMO, gap, energy

from torch_geometric.datasets import QM9
dataset = QM9(root='/tmp/QM9')

ZINC

Usage: Molecular generation, property prediction Description: ~250,000 drug-like molecular graphs - Properties: Constrained solubility, molecular weight

from torch_geometric.datasets import ZINC
dataset = ZINC(root='/tmp/ZINC', subset=True)

AQSOL

Usage: Aqueous solubility prediction Description: ~10,000 molecules with solubility measurements

from torch_geometric.datasets import AQSOL
dataset = AQSOL(root='/tmp/AQSOL')

MD17

Usage: Molecular dynamics, force field learning Description: Molecular dynamics trajectories for small molecules - Molecules: Benzene, Uracil, Naphthalene, Aspirin, Salicylic acid, etc.

from torch_geometric.datasets import MD17
dataset = MD17(root='/tmp/MD17', name='benzene')

PCQM4Mv2

Usage: Large-scale molecular property prediction Description: 3.8M molecules from PubChem for quantum chemistry - Part of OGB Large-Scale Challenge

from torch_geometric.datasets import PCQM4Mv2
dataset = PCQM4Mv2(root='/tmp/PCQM4Mv2')

Social Networks

Reddit

Usage: Large-scale node classification Description: Reddit posts from September 2014 - 232,965 nodes, 11,606,919 edges, 41 classes - Features: TF-IDF of post content

from torch_geometric.datasets import Reddit
dataset = Reddit(root='/tmp/Reddit')

Reddit2

Usage: Large-scale node classification Description: Updated Reddit dataset with more posts

from torch_geometric.datasets import Reddit2
dataset = Reddit2(root='/tmp/Reddit2')

Twitch

Usage: Node classification, social network analysis Networks: DE, EN, ES, FR, PT, RU Description: Twitch user networks by language

from torch_geometric.datasets import Twitch
dataset = Twitch(root='/tmp/Twitch', name='DE')

Facebook

Usage: Social network analysis, node classification Description: Facebook page-page networks

from torch_geometric.datasets import FacebookPagePage
dataset = FacebookPagePage(root='/tmp/Facebook')

GitHub

Usage: Social network analysis Description: GitHub developer networks

from torch_geometric.datasets import GitHub
dataset = GitHub(root='/tmp/GitHub')

Knowledge Graphs

Entities

Usage: Link prediction, knowledge graph embeddings Datasets: AIFB, MUTAG, BGS, AM Description: RDF knowledge graphs with typed relations

from torch_geometric.datasets import Entities
dataset = Entities(root='/tmp/AIFB', name='AIFB')

WordNet18

Usage: Link prediction on semantic networks Description: Subset of WordNet with 18 relations - 40,943 entities, 151,442 triplets

from torch_geometric.datasets import WordNet18
dataset = WordNet18(root='/tmp/WordNet18')

WordNet18RR

Usage: Link prediction (no inverse relations) Description: Refined version without inverse relations

from torch_geometric.datasets import WordNet18RR
dataset = WordNet18RR(root='/tmp/WordNet18RR')

FB15k-237

Usage: Link prediction on Freebase Description: Subset of Freebase with 237 relations - 14,541 entities, 310,116 triplets

from torch_geometric.datasets import FB15k_237
dataset = FB15k_237(root='/tmp/FB15k')

Heterogeneous Graphs

OGB_MAG

Usage: Heterogeneous graph learning, node classification Description: Microsoft Academic Graph with multiple node/edge types - Node types: paper, author, institution, field of study - 1M+ nodes, 21M+ edges

from torch_geometric.datasets import OGB_MAG
dataset = OGB_MAG(root='/tmp/OGB_MAG')

MovieLens

Usage: Recommendation systems, link prediction Versions: 100K, 1M, 10M, 20M Description: User-movie rating networks - Node types: user, movie - Edge types: rates

from torch_geometric.datasets import MovieLens
dataset = MovieLens(root='/tmp/MovieLens', model_name='100k')

IMDB

Usage: Heterogeneous graph learning Description: IMDB movie network - Node types: movie, actor, director

from torch_geometric.datasets import IMDB
dataset = IMDB(root='/tmp/IMDB')

DBLP

Usage: Heterogeneous graph learning, node classification Description: DBLP bibliography network - Node types: author, paper, term, conference

from torch_geometric.datasets import DBLP
dataset = DBLP(root='/tmp/DBLP')

LastFM

Usage: Heterogeneous recommendation Description: LastFM music network - Node types: user, artist, tag

from torch_geometric.datasets import LastFM
dataset = LastFM(root='/tmp/LastFM')

Temporal Graphs

BitcoinOTC

Usage: Temporal link prediction, trust networks Description: Bitcoin OTC trust network over time

from torch_geometric.datasets import BitcoinOTC
dataset = BitcoinOTC(root='/tmp/BitcoinOTC')

ICEWS18

Usage: Temporal knowledge graph completion Description: Integrated Crisis Early Warning System events

from torch_geometric.datasets import ICEWS18
dataset = ICEWS18(root='/tmp/ICEWS18')

GDELT

Usage: Temporal event forecasting Description: Global Database of Events, Language, and Tone

from torch_geometric.datasets import GDELT
dataset = GDELT(root='/tmp/GDELT')

JODIEDataset

Usage: Dynamic graph learning Datasets: Reddit, Wikipedia, MOOC, LastFM Description: Temporal interaction networks

from torch_geometric.datasets import JODIEDataset
dataset = JODIEDataset(root='/tmp/JODIE', name='Reddit')

3D Meshes and Point Clouds

ShapeNet

Usage: 3D shape classification and segmentation Description: Large-scale 3D CAD model dataset - 16,881 models across 16 categories - Part-level segmentation labels

from torch_geometric.datasets import ShapeNet
dataset = ShapeNet(root='/tmp/ShapeNet', categories=['Airplane'])

ModelNet

Usage: 3D shape classification Versions: ModelNet10, ModelNet40 Description: CAD models for 3D object classification - ModelNet10: 4,899 models, 10 categories - ModelNet40: 12,311 models, 40 categories

from torch_geometric.datasets import ModelNet
dataset = ModelNet(root='/tmp/ModelNet', name='10')

FAUST

Usage: 3D shape matching, correspondence Description: Human body scans for shape analysis - 100 meshes of 10 people in 10 poses

from torch_geometric.datasets import FAUST
dataset = FAUST(root='/tmp/FAUST')

CoMA

Usage: 3D mesh deformation Description: Facial expression meshes - 20,466 3D face scans with expressions

from torch_geometric.datasets import CoMA
dataset = CoMA(root='/tmp/CoMA')

S3DIS

Usage: 3D semantic segmentation Description: Stanford Large-Scale 3D Indoor Spaces - 6 areas, 271 rooms, point cloud data

from torch_geometric.datasets import S3DIS
dataset = S3DIS(root='/tmp/S3DIS', test_area=6)

Image and Vision Datasets

MNISTSuperpixels

Usage: Graph-based image classification Description: MNIST images as superpixel graphs - 70,000 graphs (60k train, 10k test)

from torch_geometric.datasets import MNISTSuperpixels
dataset = MNISTSuperpixels(root='/tmp/MNIST')

Flickr

Usage: Image description, node classification Description: Flickr image network - 89,250 nodes, 899,756 edges

from torch_geometric.datasets import Flickr
dataset = Flickr(root='/tmp/Flickr')

PPI

Usage: Protein-protein interaction prediction Description: Multi-graph protein interaction networks - 24 graphs, 2,373 nodes total

from torch_geometric.datasets import PPI
dataset = PPI(root='/tmp/PPI', split='train')

Small Classic Graphs

KarateClub

Usage: Community detection, visualization Description: Zachary's karate club network - 34 nodes, 78 edges, 2 communities

from torch_geometric.datasets import KarateClub
dataset = KarateClub()

Open Graph Benchmark (OGB)

PyG integrates seamlessly with OGB datasets:

Node Property Prediction

  • ogbn-products: Amazon product network (2.4M nodes)
  • ogbn-proteins: Protein association network (132K nodes)
  • ogbn-arxiv: Citation network (169K nodes)
  • ogbn-papers100M: Large citation network (111M nodes)
  • ogbn-mag: Heterogeneous academic graph
  • ogbl-ppa: Protein association networks
  • ogbl-collab: Collaboration networks
  • ogbl-ddi: Drug-drug interaction network
  • ogbl-citation2: Citation network
  • ogbl-wikikg2: Wikidata knowledge graph

Graph Property Prediction

  • ogbg-molhiv: Molecular HIV activity prediction
  • ogbg-molpcba: Molecular bioassays (multi-task)
  • ogbg-ppa: Protein function prediction
  • ogbg-code2: Code abstract syntax trees
from torch_geometric.datasets import OGB_MAG, OGB_PPA
# or
from ogb.nodeproppred import PygNodePropPredDataset
dataset = PygNodePropPredDataset(name='ogbn-arxiv')

Synthetic Datasets

FakeDataset

Usage: Testing, debugging Description: Generates random graph data

from torch_geometric.datasets import FakeDataset
dataset = FakeDataset(num_graphs=100, avg_num_nodes=50)

StochasticBlockModelDataset

Usage: Community detection benchmarks Description: Graphs generated from stochastic block models

from torch_geometric.datasets import StochasticBlockModelDataset
dataset = StochasticBlockModelDataset(root='/tmp/SBM', num_graphs=1000)

ExplainerDataset

Usage: Testing explainability methods Description: Synthetic graphs with known explanation ground truth

from torch_geometric.datasets import ExplainerDataset
dataset = ExplainerDataset(num_graphs=1000)

Materials Science

QM8

Usage: Molecular property prediction Description: Electronic properties of small molecules

from torch_geometric.datasets import QM8
dataset = QM8(root='/tmp/QM8')

Biological Networks

PPI (Protein-Protein Interaction)

Already listed above under Image and Vision Datasets

STRING

Usage: Protein interaction networks Description: Known and predicted protein-protein interactions

# Available through external sources or custom loading

Usage Tips

  1. Start with small datasets: Use Cora, KarateClub, or ENZYMES for prototyping
  2. Citation networks: Planetoid datasets are perfect for node classification
  3. Graph classification: TUDataset provides diverse benchmarks
  4. Molecular: QM9, ZINC, MoleculeNet for chemistry applications
  5. Large-scale: Use Reddit, OGB datasets with NeighborLoader
  6. Heterogeneous: OGB_MAG, MovieLens, IMDB for multi-type graphs
  7. Temporal: JODIE, ICEWS for dynamic graph learning
  8. 3D: ShapeNet, ModelNet, S3DIS for geometric learning

Common Patterns

Loading with Transforms

from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures

dataset = Planetoid(root='/tmp/Cora', name='Cora',
                    transform=NormalizeFeatures())

Train/Val/Test Splits

# For datasets with pre-defined splits
data = dataset[0]
train_data = data[data.train_mask]
val_data = data[data.val_mask]
test_data = data[data.test_mask]

# For graph classification
from torch_geometric.loader import DataLoader
train_dataset = dataset[:int(len(dataset) * 0.8)]
test_dataset = dataset[int(len(dataset) * 0.8):]
train_loader = DataLoader(train_dataset, batch_size=32)

Custom Data Loading

from torch_geometric.data import Data, Dataset

class MyCustomDataset(Dataset):
    def __init__(self, root, transform=None):
        super().__init__(root, transform)
        # Your initialization

    def len(self):
        return len(self.data_list)

    def get(self, idx):
        # Load and return data object
        return self.data_list[idx]
← Back to torch_geometric