references/reactions_data.md

Datamol Reactions and Data Modules Reference

Reactions Module (datamol.reactions)

The reactions module enables programmatic application of chemical transformations using SMARTS reaction patterns.

Applying Chemical Reactions

dm.reactions.apply_reaction(rxn, reactants, as_smiles=False, sanitize=True, single_product_group=True, rm_attach=True, product_index=0)

Apply a chemical reaction to reactant molecules.

- Parameters:
  - rxn: Reaction object (from a SMARTS pattern)
  - reactants: Tuple of reactant molecules, one per reactant template
  - as_smiles: Return SMILES strings (True) or molecule objects (False)
  - sanitize: Sanitize product molecules
  - single_product_group: Return a single product (True) or all product groups (False)
  - rm_attach: Remove attachment point markers
  - product_index: Which product to return from the reaction
- Returns: Product molecule(s) or SMILES
- Example:

```python
import datamol as dm
from rdkit.Chem import rdChemReactions

# Define reaction: alcohol + carboxylic acid → ester
rxn = rdChemReactions.ReactionFromSmarts(
    '[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])'
)

# Apply to reactants (order must match the reactant templates)
alcohol = dm.to_mol("CCO")
acid = dm.to_mol("CC(=O)O")
product = dm.reactions.apply_reaction(rxn, (alcohol, acid))
```

Creating Reactions

Reactions are typically created from SMARTS patterns using RDKit:

from rdkit.Chem import rdChemReactions

# Reaction pattern: [reactant1].[reactant2]>>[product]
rxn = rdChemReactions.ReactionFromSmarts(
    '[1*][*:1].[1*][*:2]>>[*:1][*:2]'
)
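
A reaction built this way expects one molecule per reactant template, so it can help to count the templates before applying it. The sketch below does this with plain string handling and is only an approximation. `split_top_level` and `count_templates` are illustrative helpers, not part of datamol or RDKit, and RDKit's own `rxn.GetNumReactantTemplates()` remains the authoritative count.

```python
def split_top_level(side):
    """Split one side of a reaction SMARTS on '.' outside brackets and parentheses."""
    parts, depth, current = [], 0, []
    for ch in side:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "." and depth == 0:
            # A top-level dot separates two templates
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p for p in parts if p]


def count_templates(reaction_smarts):
    """Return (n_reactant_templates, n_product_templates)."""
    reactants, sep, products = reaction_smarts.partition(">>")
    if not sep:
        raise ValueError("expected the form 'reactants>>products'")
    return len(split_top_level(reactants)), len(split_top_level(products))


n_reactants, n_products = count_templates("[1*][*:1].[1*][*:2]>>[*:1][*:2]")
print(n_reactants, n_products)
```

For the attachment-point pattern above this prints `2 1`, so the `reactants` tuple needs two molecules. The helper assumes the `reactants>>products` form with no agents between the arrows.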

Validation Functions

The module includes functions to:

- Check if a molecule is a reactant: Verify that the molecule matches a reactant pattern
- Validate a reaction: Check that the reaction is synthetically reasonable
- Process reaction files: Load reactions from files or databases

Common Reaction Patterns

Amide formation:

# Amine + carboxylic acid → amide
amide_rxn = rdChemReactions.ReactionFromSmarts(
    '[N:1].[C:2](=[O:3])[OH]>>[N:1][C:2](=[O:3])'
)

Suzuki coupling:

# Aryl halide + boronic acid → biaryl
suzuki_rxn = rdChemReactions.ReactionFromSmarts(
    '[c:1][Br].[c:2][B]([OH])[OH]>>[c:1][c:2]'
)

Functional group transformations:

# Alcohol + acyl chloride → ester
esterification = rdChemReactions.ReactionFromSmarts(
    '[C:1][OH:2].[C:3](=[O:4])[Cl]>>[C:1][O:2][C:3](=[O:4])'
)

Workflow Example

import datamol as dm
from rdkit.Chem import rdChemReactions

# 1. Define reaction
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'  # Acid → acid chloride
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)

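# 2. Apply to molecule library
# Example inputs (acetic acid, benzoic acid); replace with your own SMILES
acid_smiles_list = ["CC(=O)O", "OC(=O)c1ccccc1"]
acids = [dm.to_mol(smi) for smi in acid_smiles_list]
acid_chlorides = []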

for acid in acids:
    try:
        product = dm.reactions.apply_reaction(
            rxn,
            (acid,),  # Single reactant as tuple
            sanitize=True
        )
        acid_chlorides.append(product)
    except Exception as e:
        print(f"Reaction failed: {e}")

# 3. Validate products
valid_products = [p for p in acid_chlorides if p is not None]
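
The loop above prints each failure and moves on, which makes it hard to see afterwards which inputs failed. A minimal sketch of a helper that records failures alongside the products follows. It is written against any callable so it runs without datamol. `apply_to_library` and `fake_acid_to_chloride` are hypothetical names for illustration, and the fake function merely stands in for a call such as `dm.reactions.apply_reaction(rxn, (mol,))`.

```python
def apply_to_library(apply_one, inputs):
    """Apply `apply_one` to every input and return (products, failures).

    `failures` is a list of (input, error message) pairs, so failed
    inputs can be inspected instead of being lost in printed output.
    """
    products, failures = [], []
    for item in inputs:
        try:
            result = apply_one(item)
        except Exception as exc:
            failures.append((item, str(exc)))
            continue
        if result is None:
            failures.append((item, "no product returned"))
        else:
            products.append(result)
    return products, failures


# Hypothetical stand-in for the real reaction call, for illustration only:
# it rewrites SMILES strings ending in C(=O)O and rejects everything else.
def fake_acid_to_chloride(smiles):
    if not smiles.endswith("C(=O)O"):
        raise ValueError("no carboxylic acid group found")
    return smiles[:-1] + "Cl"


products, failures = apply_to_library(fake_acid_to_chloride, ["CC(=O)O", "CCO"])
print(products)
print(failures)
```

Keeping the failed inputs next to their error messages makes it straightforward to review or retry them later.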

Key Concepts

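  • SMARTS: SMiles ARbitrary Target Specification - a pattern language for matching substructures; a reaction SMARTS joins reactant and product patterns with >>
  • Atom Mapping: Map numbers such as the 1 in [C:1] track each atom from reactants to products
  • Attachment Points: [1*] is a dummy atom (any element, isotope label 1) that marks a generic connection point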
  • Reaction Validation: Not all SMARTS reactions are chemically reasonable
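
The atom-mapping idea can be illustrated without a chemistry toolkit. The sketch below pulls the map numbers out of each side of a reaction SMARTS with a regular expression and reports any that appear only in the products. `atom_maps` and `unmatched_maps` are illustrative helpers, not datamol or RDKit functions, and the regex assumes map numbers are always written as `:N]` at the end of a bracket atom.

```python
import re

# Matches the map number in bracket atoms such as [C:1] or [OH:2]
MAP_NUMBER = re.compile(r":(\d+)\]")


def atom_maps(reaction_smarts):
    """Return (reactant_maps, product_maps) as sets of integers."""
    reactants, sep, products = reaction_smarts.partition(">>")
    if not sep:
        raise ValueError("expected the form 'reactants>>products'")
    reactant_maps = {int(n) for n in MAP_NUMBER.findall(reactants)}
    product_maps = {int(n) for n in MAP_NUMBER.findall(products)}
    return reactant_maps, product_maps


def unmatched_maps(reaction_smarts):
    """Map numbers that appear in the products but not in the reactants."""
    reactant_maps, product_maps = atom_maps(reaction_smarts)
    return sorted(product_maps - reactant_maps)


smarts = "[C:1][OH:2].[C:3](=[O:4])[Cl]>>[C:1][O:2][C:3](=[O:4])"
print(atom_maps(smarts))
print(unmatched_maps(smarts))
```

An empty result from `unmatched_maps` only shows that the numbering is consistent; it says nothing about whether the reaction is chemically reasonable.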

Data Module (datamol.data)

The data module provides convenient access to curated molecular datasets for testing and learning.

Available Datasets

dm.data.cdk2(as_df=True, mol_column='mol')

RDKit CDK2 dataset - kinase inhibitor data.

- Parameters:
  - as_df: Return as DataFrame (True) or list of molecules (False)
  - mol_column: Name for molecule column
- Returns: Dataset with molecular structures and activity data
- Use case: Small dataset for algorithm testing
- Example:

```python
cdk2_df = dm.data.cdk2(as_df=True)
print(cdk2_df.shape)
print(cdk2_df.columns)
```

dm.data.freesolv()

FreeSolv dataset - experimental and calculated hydration free energies.

- Contents: 642 molecules with:
  - IUPAC names
  - SMILES strings
  - Experimental hydration free energy values
  - Calculated values
- Warning: "Only meant to be used as a toy dataset for pedagogic and testing purposes"
- Not suitable for: Benchmarking or production model training
- Example:

```python
freesolv_df = dm.data.freesolv()
# Columns: iupac, smiles, expt (kcal/mol), calc (kcal/mol)
```

dm.data.solubility(as_df=True, mol_column='mol')

RDKit solubility dataset with train/test splits.

- Contents: Aqueous solubility data with pre-defined splits
- Columns: Includes a 'split' column with 'train' or 'test' values
- Use case: Testing ML workflows with proper train/test separation
- Example:

```python
import numpy as np

sol_df = dm.data.solubility(as_df=True)

# Split into train/test
train_df = sol_df[sol_df['split'] == 'train']
test_df = sol_df[sol_df['split'] == 'test']

# Use for model development
# dm.to_fp featurizes one molecule at a time, so stack the per-molecule arrays
X_train = np.stack([dm.to_fp(mol) for mol in train_df['mol']])
# Target column assumed to be 'SOL'; confirm with sol_df.columns
y_train = train_df['SOL']
```

Usage Guidelines

For testing and tutorials:

# Quick dataset for testing code
df = dm.data.cdk2()
mols = df['mol'].tolist()

# Test descriptor calculation
descriptors_df = dm.descriptors.batch_compute_many_descriptors(mols)

# Test clustering (returns the cluster indices and the clustered molecules)
cluster_indices, cluster_mols = dm.cluster_mols(mols, cutoff=0.3)

For learning workflows:

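# Complete ML pipeline example
import datamol as dm
import numpy as np
from sklearn.ensemble import RandomForestRegressor

sol_df = dm.data.solubility()

# Preprocessing
train = sol_df[sol_df['split'] == 'train']
test = sol_df[sol_df['split'] == 'test']

# Featurization: dm.to_fp handles one molecule at a time, so stack the results
X_train = np.stack([dm.to_fp(mol) for mol in train['mol']])
X_test = np.stack([dm.to_fp(mol) for mol in test['mol']])

# Model training (example)
# Target column assumed to be 'SOL'; confirm with sol_df.columns
model = RandomForestRegressor()
model.fit(X_train, train['SOL'])
predictions = model.predict(X_test)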
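
The pipeline above stops at `predictions` without scoring them. A minimal sketch of a root-mean-square error check in plain Python follows. The numbers are made up for illustration; in the real pipeline the inputs would be the test-set target values and `predictions`. `rmse` is a helper written for this example, and scikit-learn's own metrics are an equally reasonable choice.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences of numbers."""
    y_true, y_pred = list(y_true), list(y_pred)
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only, not taken from the solubility dataset
example_true = [0.0, 1.0, 2.0]
example_pred = [0.0, 1.0, 4.0]
print(rmse(example_true, example_pred))
```

Because the bundled datasets are toy data, a score computed this way is useful for checking that the code runs end to end, not for judging model quality.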

Important Notes

  • Toy Datasets: Designed for pedagogical purposes, not production use
  • Small Size: Limited number of compounds suitable for quick tests
  • Pre-processed: Data already cleaned and formatted
  • Citations: Check dataset documentation for proper attribution if publishing

Best Practices

  1. Use for development only: Don't draw scientific conclusions from toy datasets
  2. Validate on real data: Always test production code on actual project data
  3. Proper attribution: Cite original data sources if using in publications
  4. Understand limitations: Know the scope and quality of each dataset