
Chemistry and Molecular File Formats Reference

This reference covers file formats commonly used in computational chemistry, cheminformatics, molecular modeling, and related fields.

Structure File Formats

.pdb - Protein Data Bank

Description: Standard format for 3D structures of biological macromolecules
Typical Data: Atomic coordinates, residue information, secondary structure, crystal structure data
Use Cases: Protein structure analysis, molecular visualization, docking studies
Python Libraries:
- Biopython: Bio.PDB
- MDAnalysis: MDAnalysis.Universe('file.pdb')
- PyMOL: pymol.cmd.load('file.pdb')
- ProDy: prody.parsePDB('file.pdb')
EDA Approach:
- Structure validation (bond lengths, angles, clashes)
- Secondary structure analysis
- B-factor distribution
- Missing residues/atoms detection
- Ramachandran plots for validation
- Surface area and volume calculations
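A minimal sketch of the B-factor and composition checks using Biopython; the file name protein.pdb is a placeholder.

```python
# Hypothetical example: parse a PDB file with Biopython and summarize B-factors.
from Bio.PDB import PDBParser
import numpy as np

parser = PDBParser(QUIET=True)                      # suppress parser warnings
structure = parser.get_structure("prot", "protein.pdb")

bfactors = np.array([atom.get_bfactor() for atom in structure.get_atoms()])
print("atoms:", bfactors.size)
print(f"B-factor mean/std: {bfactors.mean():.2f} / {bfactors.std():.2f}")

# Per-chain residue counts as a quick composition check (model 0 only)
for chain in structure[0]:
    print(chain.id, sum(1 for _ in chain.get_residues()), "residues")
```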

.cif - Crystallographic Information File

Description: Structured data format for crystallographic information
Typical Data: Unit cell parameters, atomic coordinates, symmetry operations, experimental data
Use Cases: Crystal structure determination, structural biology, materials science
Python Libraries:
- gemmi: gemmi.cif.read_file('file.cif')
- PyCifRW: CifFile.ReadCif('file.cif')
- Biopython: Bio.PDB.MMCIFParser()
EDA Approach:
- Data completeness check
- Resolution and quality metrics
- Unit cell parameter analysis
- Symmetry group validation
- Atomic displacement parameters
- R-factors and validation metrics
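A minimal sketch of the unit-cell and quality-metric checks with gemmi; structure.cif is a placeholder and the tag names assume mmCIF-style data.

```python
# Hypothetical example: read a CIF/mmCIF file with gemmi and pull selected items.
import gemmi

doc = gemmi.cif.read_file("structure.cif")
block = doc.sole_block()                            # assumes a single data block
print("data block:", block.name)

for tag in ("_cell.length_a", "_cell.length_b", "_cell.length_c",
            "_symmetry.space_group_name_H-M", "_refine.ls_R_factor_R_work"):
    print(tag, "=", block.find_value(tag))          # None if the tag is absent
```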

.mol - MDL Molfile

Description: Chemical structure file format by MDL/Accelrys
Typical Data: 2D/3D coordinates, atom types, bond orders, charges
Use Cases: Chemical database storage, cheminformatics, drug design
Python Libraries:
- RDKit: Chem.MolFromMolFile('file.mol')
- Open Babel: pybel.readfile('mol', 'file.mol')
- ChemoPy: For descriptor calculation
EDA Approach:
- Molecular property calculation (MW, logP, TPSA)
- Functional group analysis
- Ring system detection
- Stereochemistry validation
- 2D/3D coordinate consistency
- Valence and charge validation
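A minimal sketch of the property calculations with RDKit; ligand.mol is a placeholder path.

```python
# Hypothetical example: load an MDL molfile and compute common descriptors.
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

mol = Chem.MolFromMolFile("ligand.mol")             # returns None on parse failure
if mol is None:
    raise ValueError("RDKit could not parse the molfile")

print("MW:   ", Descriptors.MolWt(mol))
print("logP: ", Descriptors.MolLogP(mol))
print("TPSA: ", Descriptors.TPSA(mol))
print("rings:", rdMolDescriptors.CalcNumRings(mol))
```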

.mol2 - Tripos Mol2

Description: Complete 3D molecular structure format with atom typing
Typical Data: Coordinates, SYBYL atom types, bond types, charges, substructures
Use Cases: Molecular docking, QSAR studies, drug discovery
Python Libraries:
- RDKit: Chem.MolFromMol2File('file.mol2')
- Open Babel: pybel.readfile('mol2', 'file.mol2')
- MDAnalysis: Can parse mol2 topology
EDA Approach:
- Atom type distribution
- Partial charge analysis
- Bond type statistics
- Substructure identification
- Conformational analysis
- Energy minimization status check

.sdf - Structure Data File

Description: Multi-structure file format with associated data
Typical Data: Multiple molecular structures with properties/annotations
Use Cases: Chemical databases, virtual screening, compound libraries
Python Libraries:
- RDKit: Chem.SDMolSupplier('file.sdf')
- Open Babel: pybel.readfile('sdf', 'file.sdf')
- PandasTools (RDKit): For DataFrame integration
EDA Approach:
- Dataset size and diversity metrics
- Property distribution analysis (MW, logP, etc.)
- Structural diversity (Tanimoto similarity)
- Missing data assessment
- Outlier detection in properties
- Scaffold analysis
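A minimal sketch of the property-distribution and missing-data checks using RDKit's PandasTools; library.sdf is a placeholder path.

```python
# Hypothetical example: load an SDF into a DataFrame and profile basic properties.
from rdkit.Chem import Descriptors, PandasTools

df = PandasTools.LoadSDF("library.sdf", molColName="ROMol")  # one row per record
df["MW"] = df["ROMol"].apply(Descriptors.MolWt)
df["LogP"] = df["ROMol"].apply(Descriptors.MolLogP)

print(len(df), "molecules")
print(df[["MW", "LogP"]].describe())                # property distributions
print(df.isna().sum())                              # missing-data assessment
```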

.xyz - XYZ Coordinates

Description: Simple Cartesian coordinate format
Typical Data: Atom types and 3D coordinates
Use Cases: Quantum chemistry, geometry optimization, molecular dynamics
Python Libraries:
- ASE: ase.io.read('file.xyz')
- Open Babel: pybel.readfile('xyz', 'file.xyz')
- cclib: For parsing QM outputs with xyz
EDA Approach:
- Geometry analysis (bond lengths, angles, dihedrals)
- Center of mass calculation
- Moment of inertia
- Molecular size metrics
- Coordinate validation
- Symmetry detection
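A minimal sketch of the geometry checks with ASE; molecule.xyz is a placeholder path.

```python
# Hypothetical example: read an XYZ file with ASE and report simple geometry metrics.
from ase.io import read
import numpy as np

atoms = read("molecule.xyz")
print("formula:            ", atoms.get_chemical_formula())
print("center of mass:     ", atoms.get_center_of_mass())
print("moments of inertia: ", atoms.get_moments_of_inertia())

# All pairwise distances; very short distances flag overlapping atoms
dists = atoms.get_all_distances()
pairs = dists[np.triu_indices(len(atoms), k=1)]
print("min/max interatomic distance:", pairs.min(), pairs.max())
```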

.smi / .smiles - SMILES String

Description: Line notation for chemical structures
Typical Data: Text representation of molecular structure
Use Cases: Chemical databases, literature mining, data exchange
Python Libraries:
- RDKit: Chem.MolFromSmiles(smiles)
- Open Babel: Can parse SMILES
- DeepChem: For ML on SMILES
EDA Approach:
- SMILES syntax validation
- Descriptor calculation from SMILES
- Fingerprint generation
- Substructure searching
- Tautomer enumeration
- Stereoisomer handling
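A minimal sketch of SMILES validation and fingerprint-based similarity with RDKit; the SMILES strings are illustrative.

```python
# Hypothetical example: validate SMILES and compute a pairwise Tanimoto similarity.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles = ["CCO", "c1ccccc1O", "not_a_smiles"]       # illustrative inputs
mols = []
for smi in smiles:
    mol = Chem.MolFromSmiles(smi)                   # returns None for invalid syntax
    if mol is None:
        print("invalid SMILES:", smi)
    else:
        mols.append(mol)

# Morgan (ECFP-like) fingerprints for similarity/diversity analysis
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
print("Tanimoto(ethanol, phenol):", DataStructs.TanimotoSimilarity(fps[0], fps[1]))
```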

.pdbqt - AutoDock PDBQT

Description: Modified PDB format for AutoDock docking
Typical Data: Coordinates, partial charges, atom types for docking
Use Cases: Molecular docking, virtual screening
Python Libraries:
- Meeko: For PDBQT preparation
- Open Babel: Can read PDBQT
- ProDy: Limited PDBQT support
EDA Approach:
- Charge distribution analysis
- Rotatable bond identification
- Atom type validation
- Coordinate quality check
- Hydrogen placement validation
- Torsion definition analysis

.mae - Maestro Format

Description: Schrödinger's proprietary molecular structure format
Typical Data: Structures, properties, annotations from Schrödinger suite
Use Cases: Drug discovery, molecular modeling with Schrödinger tools
Python Libraries:
- schrodinger.structure: Requires Schrödinger installation
- Custom parsers for basic reading
EDA Approach:
- Property extraction and analysis
- Structure quality metrics
- Conformer analysis
- Docking score distributions
- Ligand efficiency metrics

.gro - GROMACS Coordinate File

Description: Molecular structure file for GROMACS MD simulations
Typical Data: Atom positions, velocities, box vectors
Use Cases: Molecular dynamics simulations, GROMACS workflows
Python Libraries:
- MDAnalysis: Universe('file.gro')
- MDTraj: mdtraj.load_gro('file.gro')
- GromacsWrapper: For GROMACS integration
EDA Approach:
- System composition analysis
- Box dimension validation
- Atom position distribution
- Velocity distribution (if present)
- Density calculation
- Solvation analysis
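A minimal sketch of the composition and box checks with MDAnalysis; system.gro is a placeholder and the SOL residue name is an assumption about the water model naming.

```python
# Hypothetical example: load a .gro file with MDAnalysis and summarize the system.
from collections import Counter
import MDAnalysis as mda

u = mda.Universe("system.gro")
print("atoms:   ", u.atoms.n_atoms)
print("residues:", u.residues.n_residues)
print("box:     ", u.dimensions)                    # lengths (Angstrom) and angles (deg)

print(Counter(u.residues.resnames))                 # composition by residue name
water = u.select_atoms("resname SOL")               # GROMACS water residue name, if present
print("water atoms:", water.n_atoms)
```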

Computational Chemistry Output Formats

.log - Gaussian Log File

Description: Output from Gaussian quantum chemistry calculations
Typical Data: Energies, geometries, frequencies, orbitals, populations
Use Cases: QM calculations, geometry optimization, frequency analysis
Python Libraries:
- cclib: cclib.io.ccread('file.log')
- GaussianRunPack: For Gaussian workflows
- Custom parsers with regex
EDA Approach:
- Convergence analysis
- Energy profile extraction
- Vibrational frequency analysis
- Orbital energy levels
- Population analysis (Mulliken, NBO)
- Thermochemistry data extraction
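A minimal sketch of the convergence and frequency checks with cclib; opt_freq.log is a placeholder and the attributes shown are only populated when the corresponding job steps were run.

```python
# Hypothetical example: parse a Gaussian log with cclib and inspect key results.
import cclib

data = cclib.io.ccread("opt_freq.log")

if hasattr(data, "scfenergies"):
    print("final SCF energy (eV):", data.scfenergies[-1])   # cclib converts energies to eV
if hasattr(data, "optdone"):
    print("geometry optimization converged:", data.optdone)
if hasattr(data, "vibfreqs"):
    n_imag = sum(1 for f in data.vibfreqs if f < 0)
    print("imaginary frequencies:", n_imag)                 # >0 suggests a saddle point
if hasattr(data, "moenergies"):
    print("number of MO energy sets:", len(data.moenergies))
```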

.out - Quantum Chemistry Output

Description: Generic output file from various QM packages
Typical Data: Calculation results, energies, properties
Use Cases: QM calculations across different software
Python Libraries:
- cclib: Universal parser for QM outputs
- ASE: Can read some output formats
EDA Approach:
- Software-specific parsing
- Convergence criteria check
- Energy and gradient trends
- Basis set and method validation
- Computational cost analysis

.wfn / .wfx - Wavefunction Files

Description: Wavefunction data for quantum chemical analysis
Typical Data: Molecular orbitals, basis sets, density matrices
Use Cases: Electron density analysis, QTAIM analysis
Python Libraries:
- Multiwfn: Interface via Python
- Horton: For wavefunction analysis
- Custom parsers for specific formats
EDA Approach:
- Orbital population analysis
- Electron density distribution
- Critical point analysis (QTAIM)
- Molecular orbital visualization
- Bonding analysis

.fchk - Gaussian Formatted Checkpoint

Description: Formatted checkpoint file from Gaussian
Typical Data: Complete wavefunction data, results, geometry
Use Cases: Post-processing Gaussian calculations
Python Libraries:
- cclib: Can parse fchk files
- GaussView Python API (if available)
- Custom parsers
EDA Approach:
- Wavefunction quality assessment
- Property extraction
- Basis set information
- Gradient and Hessian analysis
- Natural orbital analysis

.cube - Gaussian Cube File

Description: Volumetric data on a 3D grid
Typical Data: Electron density, molecular orbitals, ESP on a grid
Use Cases: Visualization of volumetric properties
Python Libraries:
- cclib: cclib.io.ccread('file.cube')
- ASE: ase.io.read('file.cube')
- pyquante: For cube file manipulation
EDA Approach:
- Grid dimension and spacing analysis
- Value distribution statistics
- Isosurface value determination
- Integration over volume
- Comparison between different cubes

Molecular Dynamics Formats

.dcd - Binary Trajectory

Description: Binary trajectory format (CHARMM, NAMD)
Typical Data: Time series of atomic coordinates
Use Cases: MD trajectory analysis
Python Libraries:
- MDAnalysis: Universe(topology, 'traj.dcd')
- MDTraj: mdtraj.load_dcd('traj.dcd', top='topology.pdb')
- PyTraj (Amber): Limited support
EDA Approach:
- RMSD/RMSF analysis
- Trajectory length and frame count
- Coordinate range and drift
- Periodic boundary handling
- File integrity check
- Time step validation
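A minimal sketch of the RMSD and frame-count checks with MDAnalysis; system.psf and traj.dcd are placeholder paths.

```python
# Hypothetical example: load a DCD with its topology and compute backbone RMSD
# relative to the first frame.
import MDAnalysis as mda
from MDAnalysis.analysis import rms

u = mda.Universe("system.psf", "traj.dcd")
print("frames:", u.trajectory.n_frames, " dt (ps):", u.trajectory.dt)

rmsd = rms.RMSD(u, select="backbone", ref_frame=0)
rmsd.run()
results = rmsd.results.rmsd                         # columns: frame, time, RMSD (Angstrom)
print("final backbone RMSD:", results[-1, 2])
```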

.xtc - Compressed Trajectory

Description: GROMACS compressed trajectory format
Typical Data: Compressed coordinates from MD simulations
Use Cases: Space-efficient MD trajectory storage
Python Libraries:
- MDAnalysis: Universe(topology, 'traj.xtc')
- MDTraj: mdtraj.load_xtc('traj.xtc', top='topology.pdb')
EDA Approach:
- Compression ratio assessment
- Precision loss evaluation
- RMSD over time
- Structural stability metrics
- Sampling frequency analysis

.trr - GROMACS Trajectory

Description: Full precision GROMACS trajectory
Typical Data: Coordinates, velocities, forces from MD
Use Cases: High-precision MD analysis
Python Libraries:
- MDAnalysis: Full support
- MDTraj: Can read trr files
- GromacsWrapper
EDA Approach:
- Full system dynamics analysis
- Energy conservation check (with velocities)
- Force analysis
- Temperature and pressure validation
- System equilibration assessment

.nc / .netcdf - Amber NetCDF Trajectory

Description: Network Common Data Form trajectory
Typical Data: MD coordinates, velocities, forces
Use Cases: Amber MD simulations, large trajectory storage
Python Libraries:
- MDAnalysis: NetCDF support
- PyTraj: Native Amber analysis
- netCDF4: Low-level access
EDA Approach:
- Metadata extraction
- Trajectory statistics
- Time series analysis
- Replica exchange analysis
- Multi-dimensional data extraction

.top - GROMACS Topology

Description: Molecular topology for GROMACS
Typical Data: Atom types, bonds, angles, force field parameters
Use Cases: MD simulation setup and analysis
Python Libraries:
- ParmEd: parmed.load_file('system.top')
- MDAnalysis: Can parse topology
- Custom parsers for specific fields
EDA Approach:
- Force field parameter validation
- System composition
- Bond/angle/dihedral distribution
- Charge neutrality check
- Molecule type enumeration

.psf - Protein Structure File (CHARMM)

Description: Topology file for CHARMM/NAMD
Typical Data: Atom connectivity, types, charges
Use Cases: CHARMM/NAMD MD simulations
Python Libraries:
- MDAnalysis: Native PSF support
- ParmEd: Can read PSF files
EDA Approach:
- Topology validation
- Connectivity analysis
- Charge distribution
- Atom type statistics
- Segment analysis

.prmtop - Amber Parameter/Topology

Description: Amber topology and parameter file
Typical Data: System topology, force field parameters
Use Cases: Amber MD simulations
Python Libraries:
- ParmEd: parmed.load_file('system.prmtop')
- PyTraj: Native Amber support
EDA Approach:
- Force field completeness
- Parameter validation
- System size and composition
- Periodic box information
- Atom mask creation for analysis
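A minimal sketch of the composition, box, and charge-neutrality checks with ParmEd; system.prmtop is a placeholder path.

```python
# Hypothetical example: inspect an Amber topology with ParmEd.
from collections import Counter
import parmed as pmd

parm = pmd.load_file("system.prmtop")
print("atoms:   ", len(parm.atoms))
print("residues:", len(parm.residues))
print("box:     ", parm.box)                        # None for non-periodic systems

net_charge = sum(a.charge for a in parm.atoms)
print("net charge:", round(net_charge, 3))          # charge-neutrality check
print(Counter(res.name for res in parm.residues))   # system composition
```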

.inpcrd / .rst7 - Amber Coordinates

Description: Amber coordinate/restart file
Typical Data: Atomic coordinates, velocities, box info
Use Cases: Starting coordinates for Amber MD
Python Libraries:
- ParmEd: Works with prmtop
- PyTraj: Amber coordinate reading
EDA Approach:
- Coordinate validity
- System initialization check
- Box vector validation
- Velocity distribution (if restart)
- Energy minimization status

Spectroscopy and Analytical Data

.jcamp / .jdx - JCAMP-DX

Description: Data exchange format from the Joint Committee on Atomic and Molecular Physical Data (JCAMP-DX)
Typical Data: Spectroscopic data (IR, NMR, MS, UV-Vis)
Use Cases: Spectroscopy data exchange and archiving
Python Libraries:
- jcamp: jcamp.jcamp_reader('file.jdx')
- nmrglue: For NMR JCAMP files
- Custom parsers for specific subtypes
EDA Approach:
- Peak detection and analysis
- Baseline correction assessment
- Signal-to-noise calculation
- Spectral range validation
- Integration analysis
- Comparison with reference spectra

.mzML - Mass Spectrometry Markup Language

Description: Standard XML format for mass spectrometry data
Typical Data: MS/MS spectra, chromatograms, metadata
Use Cases: Proteomics, metabolomics, mass spectrometry workflows
Python Libraries:
- pymzml: pymzml.run.Reader('file.mzML')
- pyteomics: pyteomics.mzml.read('file.mzML')
- MSFileReader wrappers
EDA Approach:
- Scan count and types
- MS level distribution
- Retention time range
- m/z range and resolution
- Peak intensity distribution
- Data completeness
- Quality control metrics
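A minimal sketch of the scan-count and retention-time checks with pyteomics; run.mzML is a placeholder, and the dictionary keys mirror the controlled-vocabulary terms written into the file, so they may vary between instruments.

```python
# Hypothetical example: iterate an mzML file and count scans per MS level.
from collections import Counter
from pyteomics import mzml

levels = Counter()
times = []
with mzml.read("run.mzML") as reader:
    for spectrum in reader:
        levels[spectrum.get("ms level")] += 1
        scan = spectrum["scanList"]["scan"][0]
        if "scan start time" in scan:
            times.append(float(scan["scan start time"]))

print("scans per MS level:", dict(levels))
if times:
    print("retention time range:", min(times), "-", max(times))
```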

.mzXML - Mass Spectrometry XML

Description: Open XML format for MS data
Typical Data: Mass spectra, retention times, peak lists
Use Cases: Legacy MS data, metabolomics
Python Libraries:
- pymzml: Can read mzXML
- pyteomics.mzxml
- lxml for direct XML parsing
EDA Approach:
- Similar to mzML
- Version compatibility check
- Conversion quality assessment
- Peak picking validation

.raw - Vendor Raw Data

Description: Proprietary instrument data files (Thermo, Bruker, etc.)
Typical Data: Raw instrument signals, unprocessed data
Use Cases: Direct instrument data access
Python Libraries:
- pymsfilereader: For Thermo RAW files
- ThermoRawFileParser: CLI wrapper
- Vendor-specific APIs (Thermo, Bruker Compass)
EDA Approach:
- Instrument method extraction
- Raw signal quality
- Calibration status
- Scan function analysis
- Chromatographic quality metrics

.d - Agilent Data Directory

Description: Agilent's data folder structure
Typical Data: LC-MS, GC-MS data and metadata
Use Cases: Agilent instrument data processing
Python Libraries:
- agilent-reader: Community tools
- Chemstation Python integration
- Custom directory parsing
EDA Approach:
- Directory structure validation
- Method parameter extraction
- Signal file integrity
- Calibration curve analysis
- Sequence information extraction

.fid - NMR Free Induction Decay

Description: Raw NMR time-domain data
Typical Data: Time-domain NMR signal
Use Cases: NMR processing and analysis
Python Libraries:
- nmrglue: nmrglue.bruker.read('expdir') for Bruker FID data
- nmrstarlib: For NMR-STAR files
EDA Approach:
- Signal decay analysis
- Noise level assessment
- Acquisition parameter validation
- Apodization function selection
- Zero-filling optimization
- Phasing parameter estimation
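A minimal sketch of the decay/noise checks and a basic zero-fill plus FFT with nmrglue, assuming a Bruker experiment directory; expno_dir is a placeholder.

```python
# Hypothetical example: read a Bruker FID with nmrglue and inspect the signal.
import numpy as np
import nmrglue as ng

dic, data = ng.bruker.read("expno_dir")             # acquisition parameters + complex FID
data = ng.bruker.remove_digital_filter(dic, data)   # correct the Bruker group delay

print("points:", data.shape, "dtype:", data.dtype)
tail = data[int(0.95 * data.size):]                 # last 5% of the decay as a noise proxy
print("mean |signal| in tail:", np.abs(tail).mean())

spectrum = ng.proc_base.fft(ng.proc_base.zf_size(data, 2 * data.size))  # zero-fill + FFT
print("spectrum points:", spectrum.size)
```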

.ft - NMR Frequency-Domain Data

Description: Processed NMR spectrum
Typical Data: Frequency-domain NMR data
Use Cases: NMR analysis and interpretation
Python Libraries:
- nmrglue: Comprehensive NMR support
- pyNMR: For processing
EDA Approach:
- Peak picking and integration
- Chemical shift calibration
- Multiplicity analysis
- Coupling constant extraction
- Spectral quality metrics
- Reference compound identification

.spc - Spectroscopy File

Description: Thermo Galactic spectroscopy format
Typical Data: IR, Raman, UV-Vis spectra
Use Cases: Spectroscopic data from various instruments
Python Libraries:
- spc: spc.File('file.spc')
- Custom parsers for the binary format
EDA Approach:
- Spectral resolution
- Wavelength/wavenumber range
- Baseline characterization
- Peak identification
- Derivative spectra calculation

Chemical Database Formats

.inchi - International Chemical Identifier

Description: Text identifier for chemical substances
Typical Data: Layered chemical structure representation
Use Cases: Chemical database keys, structure searching
Python Libraries:
- RDKit: Chem.MolFromInchi(inchi)
- Open Babel: InChI conversion
EDA Approach:
- InChI validation
- Layer analysis
- Stereochemistry verification
- InChI key generation
- Structure round-trip validation
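A minimal sketch of the round-trip validation and InChIKey generation with RDKit; the ethanol InChI is just an example input.

```python
# Hypothetical example: round-trip a structure through InChI and derive the InChIKey.
from rdkit import Chem

inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"         # ethanol, as an example
mol = Chem.MolFromInchi(inchi)                      # None if the InChI is invalid
if mol is None:
    raise ValueError("invalid InChI")

print("canonical SMILES:", Chem.MolToSmiles(mol))
print("InChIKey:        ", Chem.MolToInchiKey(mol))
print("round-trip matches:", Chem.MolToInchi(mol) == inchi)
```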

.cdx / .cdxml - ChemDraw Exchange

Description: ChemDraw drawing file format
Typical Data: 2D chemical structures with annotations
Use Cases: Chemical drawing, publication figures
Python Libraries:
- RDKit: Can import some CDXML
- Open Babel: Limited support
- ChemDraw Python API (commercial)
EDA Approach:
- Structure extraction
- Annotation preservation
- Style consistency
- 2D coordinate validation

.cml - Chemical Markup Language

Description: XML-based chemical structure format
Typical Data: Chemical structures, reactions, properties
Use Cases: Semantic chemical data representation
Python Libraries:
- RDKit: CML support
- Open Babel: Good CML support
- lxml: For XML parsing
EDA Approach:
- XML schema validation
- Namespace handling
- Property extraction
- Reaction scheme analysis
- Metadata completeness

.rxn - MDL Reaction File

Description: Chemical reaction structure file
Typical Data: Reactants, products, reaction arrows
Use Cases: Reaction databases, synthesis planning
Python Libraries:
- RDKit: AllChem.ReactionFromRxnFile('file.rxn')
- Open Babel: Reaction support
EDA Approach:
- Reaction balancing validation
- Atom mapping analysis
- Reagent identification
- Stereochemistry changes
- Reaction classification

.rdf - Reaction Data File

Description: Multi-reaction file format
Typical Data: Multiple reactions with data
Use Cases: Reaction databases
Python Libraries:
- RDKit: RDF reading capabilities
- Custom parsers
EDA Approach:
- Reaction yield statistics
- Condition analysis
- Success rate patterns
- Reagent frequency analysis

Computational Output and Data

.hdf5 / .h5 - Hierarchical Data Format

Description: Container for scientific data arrays
Typical Data: Large arrays, metadata, hierarchical organization
Use Cases: Large dataset storage, computational results
Python Libraries:
- h5py: h5py.File('file.h5', 'r')
- PyTables: Advanced HDF5 interface
- pandas: Can read HDF5
EDA Approach:
- Dataset structure exploration
- Array shape and dtype analysis
- Metadata extraction
- Memory-efficient data sampling
- Chunk optimization analysis
- Compression ratio assessment
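A minimal sketch of the structure-exploration step with h5py; results.h5 is a placeholder path.

```python
# Hypothetical example: walk an HDF5 file and report dataset shapes, dtypes,
# and compression settings without loading the arrays.
import h5py

def describe(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape} dtype={obj.dtype} compression={obj.compression}")
    else:                                           # h5py.Group
        print(f"{name}/ (group, {len(obj)} members)")

with h5py.File("results.h5", "r") as f:
    print("root attributes:", dict(f.attrs))
    f.visititems(describe)                          # recursive traversal
```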

.pkl / .pickle - Python Pickle

Description: Serialized Python objects
Typical Data: Any Python object (molecules, dataframes, models)
Use Cases: Intermediate data storage, model persistence
Python Libraries:
- pickle: Built-in serialization
- joblib: Enhanced pickling for large arrays
- dill: Extended pickle support
EDA Approach:
- Object type inspection
- Size and complexity analysis
- Version compatibility check
- Security validation (only unpickle data from trusted sources)
- Deserialization testing

.npy / .npz - NumPy Arrays

Description: NumPy array binary format
Typical Data: Numerical arrays (coordinates, features, matrices)
Use Cases: Fast numerical data I/O
Python Libraries:
- numpy: np.load('file.npy')
- Direct memory mapping for large files
EDA Approach:
- Array shape and dimensions
- Data type and precision
- Statistical summary (mean, std, range)
- Missing value detection
- Outlier identification
- Memory footprint analysis
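A minimal sketch of memory-mapped loading and summary statistics; features.npy is a placeholder path.

```python
# Hypothetical example: memory-map a .npy file and summarize it without loading
# the whole array into RAM.
import numpy as np

arr = np.load("features.npy", mmap_mode="r")
print("shape:", arr.shape, "dtype:", arr.dtype)
print("size (MB):", arr.nbytes / 1e6)

sample = np.array(arr.ravel()[:100_000])            # small sample for quick statistics
print("min/max:  ", sample.min(), sample.max())
print("mean/std: ", sample.mean(), sample.std())
if np.issubdtype(sample.dtype, np.floating):
    print("NaNs in sample:", int(np.isnan(sample).sum()))
```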

.mat - MATLAB Data File

Description: MATLAB workspace data
Typical Data: Arrays, structures from MATLAB
Use Cases: MATLAB-Python data exchange
Python Libraries:
- scipy.io: scipy.io.loadmat('file.mat')
- h5py: For v7.3 MAT files
EDA Approach:
- Variable extraction and types
- Array dimension analysis
- Structure field exploration
- MATLAB version compatibility
- Data type conversion validation

.csv - Comma-Separated Values

Description: Tabular data in text format
Typical Data: Chemical properties, experimental data, descriptors
Use Cases: Data exchange, analysis, machine learning
Python Libraries:
- pandas: pd.read_csv('file.csv')
- csv: Built-in module
- polars: Fast CSV reading
EDA Approach:
- Data type inference
- Missing value patterns
- Statistical summaries
- Correlation analysis
- Distribution visualization
- Outlier detection
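A minimal sketch of a first EDA pass with pandas; descriptors.csv and its columns are placeholders.

```python
# Hypothetical example: quick EDA pass over a CSV of compound descriptors.
import pandas as pd

df = pd.read_csv("descriptors.csv")
print(df.shape)
print(df.dtypes)                                    # inferred data types
print(df.isna().mean().sort_values(ascending=False).head())   # missing-value fractions

numeric = df.select_dtypes("number")
print(numeric.describe())                           # statistical summaries
print(numeric.corr().round(2))                      # correlation structure
```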

.json - JavaScript Object Notation

Description: Structured text data format
Typical Data: Chemical properties, metadata, API responses
Use Cases: Data interchange, configuration, web APIs
Python Libraries:
- json: Built-in JSON support
- pandas: pd.read_json()
- ujson: Faster JSON parsing
EDA Approach:
- Schema validation
- Nesting depth analysis
- Key-value distribution
- Data type consistency
- Array length statistics

.parquet - Apache Parquet

Description: Columnar storage format
Typical Data: Large tabular datasets stored column-wise with compression
Use Cases: Big data, efficient columnar analytics
Python Libraries:
- pandas: pd.read_parquet('file.parquet')
- pyarrow: Direct parquet access
- fastparquet: Alternative implementation
EDA Approach:
- Column statistics from metadata
- Partition analysis
- Compression efficiency
- Row group structure
- Fast sampling for large files
- Schema evolution tracking
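A minimal sketch of reading column statistics from file metadata with pyarrow; dataset.parquet is a placeholder, and per-column statistics are only present if the writer stored them.

```python
# Hypothetical example: inspect Parquet metadata without loading the full table.
import pyarrow.parquet as pq

pf = pq.ParquetFile("dataset.parquet")
meta = pf.metadata
print("rows:", meta.num_rows, " row groups:", meta.num_row_groups,
      " columns:", meta.num_columns)
print(pf.schema_arrow)                              # column names and types

# Per-column min/max statistics from the first row group
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    if col.statistics is not None:
        print(col.path_in_schema, col.statistics.min, col.statistics.max)
```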
