Molfeat API Reference
Core Modules
Molfeat is organized into several key modules that provide different aspects of molecular featurization:
- molfeat.store - Manages model loading, listing, and registration
- molfeat.calc - Provides calculators for single-molecule featurization
- molfeat.trans - Offers scikit-learn compatible transformers for batch processing
- molfeat.utils - Utility functions for data handling
- molfeat.viz - Visualization tools for molecular features
molfeat.calc - Calculators
Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit Chem.Mol objects or SMILES strings as input.
SerializableCalculator (Base Class)
Abstract base class for all calculators. When subclassing, implement the following (a minimal subclass sketch follows the state-management list below):
- __call__() - Required method for featurization
- __len__() - Optional, returns output length
- columns - Optional property, returns feature names
- batch_compute() - Optional, for efficient batch processing
State Management Methods:
- to_state_json() - Save calculator state as JSON
- to_state_yaml() - Save calculator state as YAML
- from_state_dict() - Load calculator from state dictionary
- to_state_dict() - Export calculator state as dictionary
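A minimal subclass sketch. The AtomCountCalculator class and its two features are illustrative, not part of molfeat:
from molfeat.calc import SerializableCalculator
import datamol as dm

class AtomCountCalculator(SerializableCalculator):
    """Toy calculator returning heavy-atom and ring counts."""

    def __call__(self, mol, **kwargs):
        mol = dm.to_mol(mol)  # accept SMILES or Chem.Mol
        return [mol.GetNumAtoms(), mol.GetRingInfo().NumRings()]

    def __len__(self):
        return 2  # output length

    @property
    def columns(self):
        return ["n_atoms", "n_rings"]  # feature names

calc = AtomCountCalculator()
print(calc("c1ccccc1O"))  # [7, 1]
state = calc.to_state_dict()  # export configuration
calc2 = AtomCountCalculator.from_state_dict(state)  # round-trip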
FPCalculator
Computes molecular fingerprints. Supports 15+ fingerprint methods.
Supported Fingerprint Types:
Structural Fingerprints:
- ecfp - Extended-connectivity fingerprints (circular)
- fcfp - Functional-class fingerprints
- rdkit - RDKit topological fingerprints
- maccs - MACCS keys (166 structural keys; RDKit emits a 167-bit vector)
- avalon - Avalon fingerprints
- pattern - Pattern fingerprints
- layered - Layered fingerprints
Atom-based Fingerprints:
- atompair - Atom pair fingerprints
- atompair-count - Counted atom pairs
- topological - Topological torsion fingerprints
- topological-count - Counted topological torsions
Specialized Fingerprints:
- map4 - MinHashed atom-pair fingerprint up to 4 bonds
- secfp - SMILES extended connectivity fingerprint
- erg - Extended reduced graphs
- estate - Electrotopological state indices
Parameters:
- method (str) - Fingerprint type name
- radius (int) - Radius for circular fingerprints (default: 3)
- fpSize (int) - Fingerprint size (default: 2048)
- includeChirality (bool) - Include chirality information
- counting (bool) - Use count vectors instead of binary
Usage:
from molfeat.calc import FPCalculator
# Create fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
# Compute fingerprint for single molecule
fp = calc("CCO") # Returns numpy array
# Get fingerprint length
length = len(calc) # 2048
# Get feature names
names = calc.columns
Common Fingerprint Dimensions:
- MACCS: 167 dimensions
- ECFP (default): 2048 dimensions
- MAP4 (default): 1024 dimensions
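A quick way to confirm these dimensions on your own installation:
from molfeat.calc import FPCalculator

for method in ["maccs", "ecfp", "map4"]:
    print(method, len(FPCalculator(method)))  # e.g. maccs -> 167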
Descriptor Calculators
RDKitDescriptors2D Computes 2D molecular descriptors using RDKit.
from molfeat.calc import RDKitDescriptors2D
calc = RDKitDescriptors2D()
descriptors = calc("CCO") # Returns 200+ descriptors
RDKitDescriptors3D Computes 3D molecular descriptors (requires conformer generation).
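A sketch of the required conformer-generation step, using datamol's dm.conformers.generate (descriptor values depend on the embedded conformer):
import datamol as dm
from molfeat.calc import RDKitDescriptors3D

mol = dm.to_mol("CCO")
mol = dm.conformers.generate(mol)  # 3D descriptors need at least one conformer
calc = RDKitDescriptors3D()
descriptors = calc(mol)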
MordredDescriptors Calculates over 1800 molecular descriptors using Mordred.
from molfeat.calc import MordredDescriptors
calc = MordredDescriptors()
descriptors = calc("CCO")
Pharmacophore Calculators
Pharmacophore2D RDKit's 2D pharmacophore fingerprint generation.
Pharmacophore3D Consensus pharmacophore fingerprints from multiple conformers.
CATSCalculator Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.
Parameters:
- mode - "2D" or "3D" distance calculations
- dist_bins - Distance bins for pair distributions
- scale - Scaling mode: "raw", "num", or "count"
from molfeat.calc import CATSCalculator
calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO") # Returns 21 descriptors by default
Shape Descriptors
USRDescriptors Ultrafast shape recognition descriptors (multiple variants).
ElectroShapeDescriptors Electrostatic shape descriptors combining shape, chirality, and electrostatics.
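Like the 3D descriptor calculators above, these operate on conformers. A sketch, assuming USRDescriptors is default-constructible (check the class docs for the variant parameters):
import datamol as dm
from molfeat.calc import USRDescriptors

mol = dm.conformers.generate(dm.to_mol("c1ccccc1O"))  # shape descriptors need a 3D conformer
calc = USRDescriptors()  # default USR variant assumed
shape = calc(mol)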
Graph-Based Calculators
ScaffoldKeyCalculator Computes 40+ scaffold-based molecular properties.
AtomCalculator Atom-level featurization for graph neural networks.
BondCalculator Bond-level featurization for graph neural networks.
Utility Function
get_calculator() Factory function to instantiate calculators by name.
from molfeat.calc import get_calculator
# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")
Raises ValueError for unsupported featurizers.
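For example:
from molfeat.calc import get_calculator

try:
    calc = get_calculator("not-a-real-featurizer")
except ValueError as err:
    print(f"Unsupported featurizer: {err}")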
molfeat.trans - Transformers
Transformers wrap calculators into complete featurization pipelines for batch processing.
MoleculeTransformer
Scikit-learn compatible transformer for batch molecular featurization.
Key Parameters:
- featurizer - Calculator or featurizer to use
- n_jobs (int) - Number of parallel jobs (-1 for all cores)
- dtype - Output data type (numpy float32/64, torch tensors)
- verbose (bool) - Enable verbose logging
- ignore_errors (bool) - Continue on failures (returns None for failed molecules)
Essential Methods:
- transform(mols) - Processes batches and returns representations
- _transform(mol) - Handles individual molecule featurization
- __call__(mols) - Convenience wrapper around transform()
- preprocess(mol) - Prepares input molecules (not automatically applied)
- to_state_yaml_file(path) - Save transformer configuration
- from_state_yaml_file(path) - Load transformer configuration
Usage:
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm
# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values
# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize batch
features = transformer(smiles) # Returns numpy array (100, 2048)
# Save configuration
transformer.to_state_yaml_file("ecfp_config.yml")
# Reload
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
Performance: Testing on 642 molecules showed a 3.4x speedup using 4 parallel jobs versus single-threaded processing.
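A rough sketch for reproducing this kind of comparison (timings vary by machine; the FreeSolv dataset contains 642 molecules):
import time
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

smiles = dm.data.freesolv().smiles.values
for n_jobs in (1, 4):
    transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=n_jobs)
    start = time.perf_counter()
    transformer(smiles)
    print(f"n_jobs={n_jobs}: {time.perf_counter() - start:.2f}s")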
FeatConcat
Concatenates multiple featurizers into unified representations.
from molfeat.trans import FeatConcat
from molfeat.calc import FPCalculator
# Combine multiple fingerprints
concat = FeatConcat([
FPCalculator("maccs"), # 167 dimensions
FPCalculator("ecfp") # 2048 dimensions
])
# Result: 2215-dimensional features (167 + 2048)
transformer = MoleculeTransformer(concat, n_jobs=-1)
features = transformer(smiles)
PretrainedMolTransformer
Subclass of MoleculeTransformer for pre-trained deep learning models.
Unique Features:
- _embed() - Batched inference for neural networks
- _convert() - Transforms SMILES/molecules into model-compatible formats
- SELFIES strings for language models
- DGL graphs for graph neural networks
- Integrated caching system for efficient storage
Usage:
from molfeat.trans.pretrained import PretrainedHFTransformer
# Load a pretrained HuggingFace model via the PretrainedHFTransformer subclass
transformer = PretrainedHFTransformer(kind="ChemBERTa-77M-MLM", notation="smiles")
# Generate embeddings
embeddings = transformer(smiles)
PrecomputedMolTransformer
Transformer for cached/precomputed features.
molfeat.store - Model Store
Manages featurizer discovery, loading, and registration.
ModelStore
Central hub for accessing available featurizers.
Key Methods:
- available_models - Property listing all available featurizers
- search(name=None, **kwargs) - Search for specific featurizers
- load(name, **kwargs) - Load a featurizer by name
- register(card, **kwargs) - Register a custom featurizer from its model card
Usage:
from molfeat.store.modelstore import ModelStore
# Initialize store
store = ModelStore()
# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")
# Search for specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
    model_card = results[0]
    # View usage information
    model_card.usage()
    # Load the model
    transformer = model_card.load()
# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")
ModelCard Attributes:
- name - Model identifier
- description - Model description
- version - Model version
- authors - Model authors
- tags - Categorization tags
- usage() - Display usage examples
- load(**kwargs) - Load the model
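Model cards also make it easy to filter the catalog. A small sketch; the "pretrained" tag value is illustrative, so inspect card.tags on your installation:
from molfeat.store.modelstore import ModelStore

store = ModelStore()
# Keep only cards carrying a given tag
pretrained = [card for card in store.available_models if "pretrained" in card.tags]
print([card.name for card in pretrained])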
Common Patterns
Error Handling
# Enable error tolerance
featurizer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    verbose=True,
    ignore_errors=True,
)
# Failed molecules return None
features = featurizer(smiles_with_errors)
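Since failures come back as None, you can pair inputs with outputs to drop them:
# Keep only molecules that featurized successfully
valid = [(smi, feat) for smi, feat in zip(smiles_with_errors, features) if feat is not None]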
Data Type Control
# NumPy float32 (default)
features = transformer(smiles, enforce_dtype=True)
# PyTorch tensors
import torch
transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)
Persistence and Reproducibility
# Save transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")
# Load from saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")
Preprocessing
# Manual preprocessing
mol = transformer.preprocess("CCO")
# Transform with preprocessing
features = transformer.transform(smiles_list)
Integration Examples
Scikit-learn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"))),
    ('classifier', RandomForestClassifier()),
])
# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
PyTorch Integration
import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.smiles = smiles
        self.labels = labels
        self.transformer = transformer

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        # Transformers operate on batches, so wrap the single SMILES in a list
        features = self.transformer([self.smiles[idx]])[0]
        return torch.tensor(features), torch.tensor(self.labels[idx])
# Create dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)
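Per the performance tips below, featurizing the whole list once up front is usually faster than per-item calls inside __getitem__. A minimal alternative using TensorDataset:
from torch.utils.data import TensorDataset

# Featurize the full batch once, then serve tensors directly
features = torch.as_tensor(transformer(smiles), dtype=torch.float32)
dataset = TensorDataset(features, torch.as_tensor(labels))
loader = DataLoader(dataset, batch_size=32)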
Performance Tips
- Parallelization: Use n_jobs=-1 to utilize all CPU cores
- Batch Processing: Process multiple molecules at once instead of looping over single molecules
- Caching: Leverage the built-in caching for pretrained models
- Data Types: Use float32 instead of float64 when precision allows
- Error Handling: Set ignore_errors=True for large datasets with potentially invalid molecules