Chemistry and Molecular File Formats Reference
This reference covers file formats commonly used in computational chemistry, cheminformatics, molecular modeling, and related fields.
Structure File Formats
.pdb - Protein Data Bank
Description: Standard format for 3D structures of biological macromolecules
Typical Data: Atomic coordinates, residue information, secondary structure, crystal structure data
Use Cases: Protein structure analysis, molecular visualization, docking studies
Python Libraries:
- Biopython: Bio.PDB
- MDAnalysis: MDAnalysis.Universe('file.pdb')
- PyMOL: pymol.cmd.load('file.pdb')
- ProDy: prody.parsePDB('file.pdb')
EDA Approach:
- Structure validation (bond lengths, angles, clashes)
- Secondary structure analysis
- B-factor distribution
- Missing residues/atoms detection
- Ramachandran plots for validation
- Surface area and volume calculations
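A minimal sketch of a few of these checks with Biopython; 'protein.pdb' is a placeholder path:
```python
# B-factor distribution and per-chain residue counts with Bio.PDB
from Bio.PDB import PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("model", "protein.pdb")

bfactors = [atom.get_bfactor() for atom in structure.get_atoms()]
print("atoms:", len(bfactors))
print("mean B-factor:", sum(bfactors) / len(bfactors))

# Residue counts per chain, first model only; hetero/water residues are skipped
for chain in structure[0]:
    n_res = sum(1 for res in chain if res.id[0] == " ")
    print(f"chain {chain.id}: {n_res} standard residues")
```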
.cif - Crystallographic Information File
Description: Structured data format for crystallographic information
Typical Data: Unit cell parameters, atomic coordinates, symmetry operations, experimental data
Use Cases: Crystal structure determination, structural biology, materials science
Python Libraries:
- gemmi: gemmi.cif.read_file('file.cif')
- PyCifRW: CifFile.ReadCif('file.cif')
- Biopython: Bio.PDB.MMCIFParser()
EDA Approach:
- Data completeness check
- Resolution and quality metrics
- Unit cell parameter analysis
- Symmetry group validation
- Atomic displacement parameters
- R-factors and validation metrics
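A minimal sketch with gemmi; 'structure.cif' is a placeholder, and the tags shown follow mmCIF naming, which not every CIF dialect uses:
```python
# Pull unit-cell and quality tags from a CIF/mmCIF block with gemmi
import gemmi

doc = gemmi.cif.read_file("structure.cif")
block = doc.sole_block()  # assumes a single data block

for tag in ("_cell.length_a", "_cell.length_b", "_cell.length_c",
            "_reflns.d_resolution_high", "_refine.ls_R_factor_R_work"):
    print(tag, "->", block.find_value(tag))  # None if the tag is missing
```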
.mol - MDL Molfile
Description: Chemical structure file format originated by MDL (later Symyx/Accelrys, now BIOVIA)
Typical Data: 2D/3D coordinates, atom types, bond orders, charges
Use Cases: Chemical database storage, cheminformatics, drug design
Python Libraries:
- RDKit: Chem.MolFromMolFile('file.mol')
- Open Babel: pybel.readfile('mol', 'file.mol')
- ChemoPy: For descriptor calculation
EDA Approach:
- Molecular property calculation (MW, logP, TPSA)
- Functional group analysis
- Ring system detection
- Stereochemistry validation
- 2D/3D coordinate consistency
- Valence and charge validation
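A minimal sketch of the property calculations with RDKit; 'ligand.mol' is a placeholder path:
```python
# Basic property calculation and sanity check on a molfile with RDKit
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

mol = Chem.MolFromMolFile("ligand.mol")  # returns None if parsing/sanitization fails
if mol is None:
    raise ValueError("molfile could not be parsed by RDKit")

print("MW:   ", Descriptors.MolWt(mol))
print("logP: ", Descriptors.MolLogP(mol))
print("TPSA: ", Descriptors.TPSA(mol))
print("rings:", rdMolDescriptors.CalcNumRings(mol))
```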
.mol2 - Tripos Mol2
Description: Complete 3D molecular structure format with atom typing
Typical Data: Coordinates, SYBYL atom types, bond types, charges, substructures
Use Cases: Molecular docking, QSAR studies, drug discovery
Python Libraries:
- RDKit: Chem.MolFromMol2File('file.mol2')
- Open Babel: pybel.readfile('mol2', 'file.mol2')
- MDAnalysis: Can parse mol2 topology
EDA Approach:
- Atom type distribution
- Partial charge analysis
- Bond type statistics
- Substructure identification
- Conformational analysis
- Energy minimization status check
.sdf - Structure Data File
Description: Multi-structure file format with associated data
Typical Data: Multiple molecular structures with properties/annotations
Use Cases: Chemical databases, virtual screening, compound libraries
Python Libraries:
- RDKit: Chem.SDMolSupplier('file.sdf')
- Open Babel: pybel.readfile('sdf', 'file.sdf')
- PandasTools (RDKit): For DataFrame integration
EDA Approach:
- Dataset size and diversity metrics
- Property distribution analysis (MW, logP, etc.)
- Structural diversity (Tanimoto similarity)
- Missing data assessment
- Outlier detection in properties
- Scaffold analysis
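A minimal sketch of a property-distribution pass with RDKit's PandasTools; 'library.sdf' is a placeholder path:
```python
# Load an SD file into a DataFrame and profile molecular weight
from rdkit.Chem import Descriptors, PandasTools

df = PandasTools.LoadSDF("library.sdf", molColName="ROMol")
print("records loaded:", len(df))
print("SD property columns:", [c for c in df.columns if c != "ROMol"])

df["MW"] = df["ROMol"].apply(Descriptors.MolWt)
print(df["MW"].describe())
```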
.xyz - XYZ Coordinates
Description: Simple Cartesian coordinate format
Typical Data: Atom types and 3D coordinates
Use Cases: Quantum chemistry, geometry optimization, molecular dynamics
Python Libraries:
- ASE: ase.io.read('file.xyz')
- Open Babel: pybel.readfile('xyz', 'file.xyz')
- cclib: For parsing QM outputs with xyz
EDA Approach:
- Geometry analysis (bond lengths, angles, dihedrals)
- Center of mass calculation
- Moment of inertia
- Molecular size metrics
- Coordinate validation
- Symmetry detection
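A minimal geometry-summary sketch with ASE; 'molecule.xyz' is a placeholder path:
```python
# Geometry summary for an XYZ file with ASE
from ase.io import read

atoms = read("molecule.xyz")
print("formula:", atoms.get_chemical_formula())
print("center of mass:", atoms.get_center_of_mass())

# Maximum interatomic distance as a crude size metric
distances = atoms.get_all_distances()
print("max interatomic distance (Angstrom):", distances.max())
```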
.smi / .smiles - SMILES String
Description: Line notation for chemical structures
Typical Data: Text representation of molecular structure
Use Cases: Chemical databases, literature mining, data exchange
Python Libraries:
- RDKit: Chem.MolFromSmiles(smiles)
- Open Babel: Can parse SMILES
- DeepChem: For ML on SMILES
EDA Approach:
- SMILES syntax validation
- Descriptor calculation from SMILES
- Fingerprint generation
- Substructure searching
- Tautomer enumeration
- Stereoisomer handling
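A minimal sketch of validation and fingerprint similarity with RDKit; the SMILES strings are illustrative:
```python
# Validate SMILES and compare two molecules by Morgan fingerprint similarity
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

smiles_list = ["CCO", "c1ccccc1O", "not_a_smiles"]
mols = {s: Chem.MolFromSmiles(s) for s in smiles_list}  # None marks invalid SMILES
print("invalid:", [s for s, m in mols.items() if m is None])

valid = [m for m in mols.values() if m is not None]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=2048) for m in valid]
print("Tanimoto(ethanol, phenol):", DataStructs.TanimotoSimilarity(fps[0], fps[1]))
```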
.pdbqt - AutoDock PDBQT
Description: Modified PDB format for AutoDock docking
Typical Data: Coordinates, partial charges, atom types for docking
Use Cases: Molecular docking, virtual screening
Python Libraries:
- Meeko: For PDBQT preparation
- Open Babel: Can read PDBQT
- ProDy: Limited PDBQT support
EDA Approach:
- Charge distribution analysis
- Rotatable bond identification
- Atom type validation
- Coordinate quality check
- Hydrogen placement validation
- Torsion definition analysis
.mae - Maestro Format
Description: Schrödinger's proprietary molecular structure format
Typical Data: Structures, properties, annotations from Schrödinger suite
Use Cases: Drug discovery, molecular modeling with Schrödinger tools
Python Libraries:
- schrodinger.structure: Requires Schrödinger installation
- Custom parsers for basic reading
EDA Approach:
- Property extraction and analysis
- Structure quality metrics
- Conformer analysis
- Docking score distributions
- Ligand efficiency metrics
.gro - GROMACS Coordinate File
Description: Molecular structure file for GROMACS MD simulations
Typical Data: Atom positions, velocities, box vectors
Use Cases: Molecular dynamics simulations, GROMACS workflows
Python Libraries:
- MDAnalysis: Universe('file.gro')
- MDTraj: mdtraj.load_gro('file.gro')
- GromacsWrapper: For GROMACS integration
EDA Approach:
- System composition analysis
- Box dimension validation
- Atom position distribution
- Velocity distribution (if present)
- Density calculation
- Solvation analysis
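A minimal composition and box check with MDAnalysis; 'system.gro' is a placeholder, and 'resname SOL' assumes GROMACS water naming:
```python
# System composition and box check for a .gro file with MDAnalysis
import MDAnalysis as mda

u = mda.Universe("system.gro")
print("atoms:", u.atoms.n_atoms)
print("residue names:", sorted(set(u.residues.resnames)))
print("box [lx, ly, lz, alpha, beta, gamma]:", u.dimensions)

water = u.select_atoms("resname SOL")  # GROMACS water naming; adjust if different
print("water atoms:", water.n_atoms)
```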
Computational Chemistry Output Formats
.log - Gaussian Log File
Description: Output from Gaussian quantum chemistry calculations
Typical Data: Energies, geometries, frequencies, orbitals, populations
Use Cases: QM calculations, geometry optimization, frequency analysis
Python Libraries:
- cclib: cclib.io.ccread('file.log')
- GaussianRunPack: For Gaussian workflows
- Custom parsers with regex
EDA Approach:
- Convergence analysis
- Energy profile extraction
- Vibrational frequency analysis
- Orbital energy levels
- Population analysis (Mulliken, NBO)
- Thermochemistry data extraction
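A minimal extraction sketch with cclib; 'calc.log' is a placeholder, and each attribute is present only if the job produced it:
```python
# Extract energies and check for imaginary frequencies with cclib
import cclib

data = cclib.io.ccread("calc.log")
print("parsed attributes:", sorted(data.getattributes()))

if hasattr(data, "scfenergies"):
    print("final SCF energy (eV):", data.scfenergies[-1])
if hasattr(data, "vibfreqs"):
    imaginary = [f for f in data.vibfreqs if f < 0]
    print("imaginary frequencies (cm^-1):", imaginary or "none")
```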
.out - Quantum Chemistry Output
Description: Generic output file from various QM packages
Typical Data: Calculation results, energies, properties
Use Cases: QM calculations across different software
Python Libraries:
- cclib: Universal parser for QM outputs
- ASE: Can read some output formats
EDA Approach:
- Software-specific parsing
- Convergence criteria check
- Energy and gradient trends
- Basis set and method validation
- Computational cost analysis
.wfn / .wfx - Wavefunction Files
Description: Wavefunction data for quantum chemical analysis
Typical Data: Molecular orbitals, basis sets, density matrices
Use Cases: Electron density analysis, QTAIM analysis
Python Libraries:
- Multiwfn: external program, commonly driven from Python scripts
- Horton: For wavefunction analysis
- Custom parsers for specific formats
EDA Approach:
- Orbital population analysis
- Electron density distribution
- Critical point analysis (QTAIM)
- Molecular orbital visualization
- Bonding analysis
.fchk - Gaussian Formatted Checkpoint
Description: Formatted checkpoint file from Gaussian
Typical Data: Complete wavefunction data, results, geometry
Use Cases: Post-processing Gaussian calculations
Python Libraries:
- cclib: Can parse fchk files
- GaussView Python API (if available)
- Custom parsers
EDA Approach:
- Wavefunction quality assessment
- Property extraction
- Basis set information
- Gradient and Hessian analysis
- Natural orbital analysis
.cube - Gaussian Cube File
Description: Volumetric data on a 3D grid
Typical Data: Electron density, molecular orbitals, ESP on grid
Use Cases: Visualization of volumetric properties
Python Libraries:
- cclib: cclib.io.ccread('file.cube')
- ase.io: ase.io.read('file.cube')
- pyquante: For cube file manipulation
EDA Approach:
- Grid dimension and spacing analysis
- Value distribution statistics
- Isosurface value determination
- Integration over volume
- Comparison between different cubes
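A minimal grid-statistics sketch with ASE; 'density.cube' is a placeholder path:
```python
# Grid statistics for a cube file with ASE
import numpy as np
from ase.io.cube import read_cube_data

data, atoms = read_cube_data("density.cube")  # volumetric array plus Atoms object
print("grid shape:", data.shape)
print("value range:", data.min(), "to", data.max())
print("mean |value|:", np.abs(data).mean())
```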
Molecular Dynamics Formats
.dcd - Binary Trajectory
Description: Binary trajectory format (CHARMM, NAMD)
Typical Data: Time series of atomic coordinates
Use Cases: MD trajectory analysis
Python Libraries:
- MDAnalysis: Universe(topology, 'traj.dcd')
- MDTraj: mdtraj.load_dcd('traj.dcd', top='topology.pdb')
- PyTraj (Amber): Limited support
EDA Approach:
- RMSD/RMSF analysis
- Trajectory length and frame count
- Coordinate range and drift
- Periodic boundary handling
- File integrity check
- Time step validation
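A minimal RMSD sketch with MDAnalysis; the file names are placeholders (DCD files carry no topology of their own), and the results attribute assumes MDAnalysis 2.x:
```python
# Frame count and backbone RMSD for a DCD trajectory with MDAnalysis
import MDAnalysis as mda
from MDAnalysis.analysis import rms

u = mda.Universe("topology.psf", "traj.dcd")
print("frames:", len(u.trajectory), " dt (ps):", u.trajectory.dt)

rmsd = rms.RMSD(u, select="backbone").run()
# columns of results.rmsd are [frame, time, rmsd]
print("final backbone RMSD (Angstrom):", rmsd.results.rmsd[-1, 2])
```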
.xtc - Compressed Trajectory
Description: GROMACS compressed trajectory format
Typical Data: Compressed coordinates from MD simulations
Use Cases: Space-efficient MD trajectory storage
Python Libraries:
- MDAnalysis: Universe(topology, 'traj.xtc')
- MDTraj: mdtraj.load_xtc('traj.xtc', top='topology.pdb')
EDA Approach:
- Compression ratio assessment
- Precision loss evaluation
- RMSD over time
- Structural stability metrics
- Sampling frequency analysis
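A minimal drift check with MDTraj; the file names are placeholders, and a separate topology file is required:
```python
# C-alpha RMSD drift over an XTC trajectory with MDTraj
import mdtraj as md

traj = md.load_xtc("traj.xtc", top="topology.pdb")
print("frames:", traj.n_frames, " atoms:", traj.n_atoms)

ca = traj.topology.select("name CA")
rmsd = md.rmsd(traj, traj, frame=0, atom_indices=ca)  # nm, relative to the first frame
print("max C-alpha RMSD (nm):", rmsd.max())
```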
.trr - GROMACS Trajectory
Description: Full precision GROMACS trajectory
Typical Data: Coordinates, velocities, forces from MD
Use Cases: High-precision MD analysis
Python Libraries:
- MDAnalysis: Full support
- MDTraj: Can read trr files
- GromacsWrapper
EDA Approach:
- Full system dynamics analysis
- Energy conservation check (with velocities)
- Force analysis
- Temperature and pressure validation
- System equilibration assessment
.nc / .netcdf - Amber NetCDF Trajectory
Description: Network Common Data Form trajectory
Typical Data: MD coordinates, velocities, forces
Use Cases: Amber MD simulations, large trajectory storage
Python Libraries:
- MDAnalysis: NetCDF support
- PyTraj: Native Amber analysis
- netCDF4: Low-level access
EDA Approach:
- Metadata extraction
- Trajectory statistics
- Time series analysis
- Replica exchange analysis
- Multi-dimensional data extraction
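A minimal metadata inspection at the netCDF4 level; 'traj.nc' is a placeholder, and the variable names follow the AMBER NetCDF convention:
```python
# Inspect an Amber NetCDF trajectory at the netCDF4 level
from netCDF4 import Dataset

with Dataset("traj.nc") as nc:
    print("conventions:", getattr(nc, "Conventions", "n/a"))
    print("variables:", list(nc.variables))
    if "coordinates" in nc.variables:  # AMBER convention: (frame, atom, spatial)
        print("coordinates shape:", nc.variables["coordinates"].shape)
```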
.top - GROMACS Topology
Description: Molecular topology for GROMACS
Typical Data: Atom types, bonds, angles, force field parameters
Use Cases: MD simulation setup and analysis
Python Libraries:
- ParmEd: parmed.load_file('system.top')
- MDAnalysis: Can parse topology
- Custom parsers for specific fields
EDA Approach:
- Force field parameter validation
- System composition
- Bond/angle/dihedral distribution
- Charge neutrality check
- Molecule type enumeration
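A minimal composition and charge-neutrality check with ParmEd; 'system.top' is a placeholder, and any #included force-field files must be resolvable for the load to succeed:
```python
# Composition and charge-neutrality check on a GROMACS topology with ParmEd
import parmed as pmd

top = pmd.load_file("system.top")
print("atoms:", len(top.atoms), " residues:", len(top.residues))
print("bond terms:", len(top.bonds), " dihedral terms:", len(top.dihedrals))

net_charge = sum(atom.charge for atom in top.atoms)
print("net charge:", round(net_charge, 4))
```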
.psf - Protein Structure File (CHARMM)
Description: Topology file for CHARMM/NAMD
Typical Data: Atom connectivity, types, charges
Use Cases: CHARMM/NAMD MD simulations
Python Libraries:
- MDAnalysis: Native PSF support
- ParmEd: Can read PSF files
EDA Approach:
- Topology validation
- Connectivity analysis
- Charge distribution
- Atom type statistics
- Segment analysis
.prmtop - Amber Parameter/Topology
Description: Amber topology and parameter file
Typical Data: System topology, force field parameters
Use Cases: Amber MD simulations
Python Libraries:
- ParmEd: parmed.load_file('system.prmtop')
- PyTraj: Native Amber support
EDA Approach:
- Force field completeness
- Parameter validation
- System size and composition
- Periodic box information
- Atom mask creation for analysis
.inpcrd / .rst7 - Amber Coordinates
Description: Amber coordinate/restart file
Typical Data: Atomic coordinates, velocities, box info
Use Cases: Starting coordinates for Amber MD
Python Libraries:
- ParmEd: Works with prmtop
- PyTraj: Amber coordinate reading
EDA Approach:
- Coordinate validity
- System initialization check
- Box vector validation
- Velocity distribution (if restart)
- Energy minimization status
Spectroscopy and Analytical Data
.jcamp / .jdx - JCAMP-DX
Description: Data exchange format defined by the Joint Committee on Atomic and Molecular Physical Data (JCAMP)
Typical Data: Spectroscopic data (IR, NMR, MS, UV-Vis)
Use Cases: Spectroscopy data exchange and archiving
Python Libraries:
- jcamp: jcamp.jcamp_readfile('file.jdx') (JCAMP_reader in older releases)
- nmrglue: For NMR JCAMP files
- Custom parsers for specific subtypes
EDA Approach:
- Peak detection and analysis
- Baseline correction assessment
- Signal-to-noise calculation
- Spectral range validation
- Integration analysis
- Comparison with reference spectra
.mzML - Mass Spectrometry Markup Language
Description: Standard XML format for mass spectrometry data
Typical Data: MS/MS spectra, chromatograms, metadata
Use Cases: Proteomics, metabolomics, mass spectrometry workflows
Python Libraries:
- pymzml: pymzml.run.Reader('file.mzML')
- pyteomics: pyteomics.mzml.read('file.mzML')
- MSFileReader wrappers
EDA Approach:
- Scan count and types
- MS level distribution
- Retention time range
- m/z range and resolution
- Peak intensity distribution
- Data completeness
- Quality control metrics
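A minimal scan-level survey with pyteomics; 'run.mzML' is a placeholder path:
```python
# Scan-level survey of an mzML file with pyteomics
from collections import Counter
from pyteomics import mzml

ms_levels = Counter()
peaks_per_scan = []
with mzml.read("run.mzML") as reader:
    for spectrum in reader:
        ms_levels[spectrum.get("ms level")] += 1
        peaks_per_scan.append(len(spectrum["m/z array"]))

print("scans per MS level:", dict(ms_levels))
print("median peaks per scan:", sorted(peaks_per_scan)[len(peaks_per_scan) // 2])
```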
.mzXML - Mass Spectrometry XML
Description: Open XML format for MS data
Typical Data: Mass spectra, retention times, peak lists
Use Cases: Legacy MS data, metabolomics
Python Libraries:
- pymzml: Can read mzXML
- pyteomics.mzxml
- lxml for direct XML parsing
EDA Approach:
- Similar to mzML
- Version compatibility check
- Conversion quality assessment
- Peak picking validation
.raw - Vendor Raw Data
Description: Proprietary instrument data files (Thermo, Bruker, etc.)
Typical Data: Raw instrument signals, unprocessed data
Use Cases: Direct instrument data access
Python Libraries:
- pymsfilereader: For Thermo RAW files
- ThermoRawFileParser: standalone converter, callable from Python
- Vendor-specific APIs (Thermo, Bruker Compass)
EDA Approach:
- Instrument method extraction
- Raw signal quality
- Calibration status
- Scan function analysis
- Chromatographic quality metrics
.d - Agilent Data Directory
Description: Agilent's data folder structure
Typical Data: LC-MS, GC-MS data and metadata
Use Cases: Agilent instrument data processing
Python Libraries:
- agilent-reader: Community tools
- Chemstation Python integration
- Custom directory parsing
EDA Approach:
- Directory structure validation
- Method parameter extraction
- Signal file integrity
- Calibration curve analysis
- Sequence information extraction
.fid - NMR Free Induction Decay
Description: Raw NMR time-domain data
Typical Data: Time-domain NMR signal
Use Cases: NMR processing and analysis
Python Libraries:
- nmrglue: nmrglue.bruker.read('expt_dir') reads a Bruker fid directory
- nmrstarlib: For NMR-STAR files
EDA Approach:
- Signal decay analysis
- Noise level assessment
- Acquisition parameter validation
- Apodization function selection
- Zero-filling optimization
- Phasing parameter estimation
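A minimal read-and-transform sketch with nmrglue for Bruker data; 'expt/1' stands in for an experiment directory containing fid and acqus files:
```python
# Read a Bruker fid and produce a quick spectrum with nmrglue
import nmrglue as ng

dic, data = ng.bruker.read("expt/1")  # directory containing fid + acqus
print("fid points:", data.shape, " dtype:", data.dtype)

data = ng.bruker.remove_digital_filter(dic, data)  # correct Bruker group delay
spectrum = ng.proc_base.fft(ng.proc_base.zf_size(data, 32768))  # zero-fill, then FT
print("spectrum points:", spectrum.shape)
```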
.ft - NMR Frequency-Domain Data
Description: Processed NMR spectrum
Typical Data: Frequency-domain NMR data
Use Cases: NMR analysis and interpretation
Python Libraries:
- nmrglue: Comprehensive NMR support
- pyNMR: For processing
EDA Approach:
- Peak picking and integration
- Chemical shift calibration
- Multiplicity analysis
- Coupling constant extraction
- Spectral quality metrics
- Reference compound identification
.spc - Spectroscopy File
Description: Thermo Galactic spectroscopy format
Typical Data: IR, Raman, UV-Vis spectra
Use Cases: Spectroscopic data from various instruments
Python Libraries:
- spc: spc.File('file.spc')
- Custom parsers for binary format
EDA Approach:
- Spectral resolution
- Wavelength/wavenumber range
- Baseline characterization
- Peak identification
- Derivative spectra calculation
Chemical Database Formats
.inchi - International Chemical Identifier
Description: Text identifier for chemical substances
Typical Data: Layered chemical structure representation
Use Cases: Chemical database keys, structure searching
Python Libraries:
- RDKit: Chem.MolFromInchi(inchi)
- Open Babel: InChI conversion
EDA Approach:
- InChI validation
- Layer analysis
- Stereochemistry verification
- InChI key generation
- Structure round-trip validation
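A minimal round-trip and InChIKey sketch with RDKit; the ethanol InChI is illustrative:
```python
# InChI round-trip validation and InChIKey generation with RDKit
from rdkit import Chem

inchi = "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"
mol = Chem.MolFromInchi(inchi)  # None if the InChI is invalid
assert mol is not None

print("InChIKey:", Chem.MolToInchiKey(mol))
print("round-trip OK:", Chem.MolToInchi(mol) == inchi)
```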
.cdx / .cdxml - ChemDraw Exchange
Description: ChemDraw drawing file format
Typical Data: 2D chemical structures with annotations
Use Cases: Chemical drawing, publication figures
Python Libraries:
- RDKit: Can import some CDXML
- Open Babel: Limited support
- ChemDraw Python API (commercial)
EDA Approach:
- Structure extraction
- Annotation preservation
- Style consistency
- 2D coordinate validation
.cml - Chemical Markup Language
Description: XML-based chemical structure format
Typical Data: Chemical structures, reactions, properties
Use Cases: Semantic chemical data representation
Python Libraries:
- RDKit: limited CML support (writing only in recent releases)
- Open Babel: Good CML support
- lxml: For XML parsing
EDA Approach:
- XML schema validation
- Namespace handling
- Property extraction
- Reaction scheme analysis
- Metadata completeness
.rxn - MDL Reaction File
Description: Chemical reaction structure file
Typical Data: Reactants, products, reaction arrows
Use Cases: Reaction databases, synthesis planning
Python Libraries:
- RDKit: AllChem.ReactionFromRxnFile('file.rxn')
- Open Babel: Reaction support
EDA Approach:
- Reaction balancing validation
- Atom mapping analysis
- Reagent identification
- Stereochemistry changes
- Reaction classification
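A minimal sketch with RDKit; 'reaction.rxn' is a placeholder, and the heavy-atom comparison is only a rough balance check:
```python
# Load an MDL reaction file and do a rough heavy-atom balance check with RDKit
from rdkit.Chem import AllChem

rxn = AllChem.ReactionFromRxnFile("reaction.rxn")
print("reactant templates:", rxn.GetNumReactantTemplates(),
      " product templates:", rxn.GetNumProductTemplates())

def heavy_atoms(mols):
    return sum(m.GetNumHeavyAtoms() for m in mols)

print("heavy atoms:", heavy_atoms(rxn.GetReactants()), "->", heavy_atoms(rxn.GetProducts()))
```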
.rdf - Reaction Data File
Description: Multi-reaction file format
Typical Data: Multiple reactions with data
Use Cases: Reaction databases
Python Libraries:
- RDKit: no dedicated RDF reader; split into $RXN records and use AllChem.ReactionFromRxnBlock
- Custom parsers
EDA Approach:
- Reaction yield statistics
- Condition analysis
- Success rate patterns
- Reagent frequency analysis
Computational Output and Data
.hdf5 / .h5 - Hierarchical Data Format
Description: Container for scientific data arrays
Typical Data: Large arrays, metadata, hierarchical organization
Use Cases: Large dataset storage, computational results
Python Libraries:
- h5py: h5py.File('file.h5', 'r')
- pytables: Advanced HDF5 interface
- pandas: Can read HDF5
EDA Approach:
- Dataset structure exploration
- Array shape and dtype analysis
- Metadata extraction
- Memory-efficient data sampling
- Chunk optimization analysis
- Compression ratio assessment
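A minimal structure-exploration sketch with h5py; 'results.h5' is a placeholder path:
```python
# Walk an HDF5 file and report dataset shapes, dtypes, and compression with h5py
import h5py

def describe(name, obj):
    if isinstance(obj, h5py.Dataset):
        print(f"{name}: shape={obj.shape} dtype={obj.dtype} compression={obj.compression}")

with h5py.File("results.h5", "r") as f:
    print("root keys:", list(f.keys()))
    f.visititems(describe)
```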
.pkl / .pickle - Python Pickle
Description: Serialized Python objects
Typical Data: Any Python object (molecules, dataframes, models)
Use Cases: Intermediate data storage, model persistence
Python Libraries:
- pickle: Built-in serialization
- joblib: Enhanced pickling for large arrays
- dill: Extended pickle support
EDA Approach:
- Object type inspection
- Size and complexity analysis
- Version compatibility check
- Security validation (trusted source)
- Deserialization testing
.npy / .npz - NumPy Arrays
Description: NumPy array binary format
Typical Data: Numerical arrays (coordinates, features, matrices)
Use Cases: Fast numerical data I/O
Python Libraries:
- numpy: np.load('file.npy')
- numpy: np.load('file.npy', mmap_mode='r') for memory-mapped access to large files
EDA Approach:
- Array shape and dimensions
- Data type and precision
- Statistical summary (mean, std, range)
- Missing value detection
- Outlier identification
- Memory footprint analysis
.mat - MATLAB Data File
Description: MATLAB workspace data
Typical Data: Arrays, structures from MATLAB
Use Cases: MATLAB-Python data exchange
Python Libraries:
- scipy.io: scipy.io.loadmat('file.mat')
- h5py: For v7.3 MAT files
EDA Approach:
- Variable extraction and types
- Array dimension analysis
- Structure field exploration
- MATLAB version compatibility
- Data type conversion validation
.csv - Comma-Separated Values
Description: Tabular data in text format
Typical Data: Chemical properties, experimental data, descriptors
Use Cases: Data exchange, analysis, machine learning
Python Libraries:
- pandas: pd.read_csv('file.csv')
- csv: Built-in module
- polars: Fast CSV reading
EDA Approach:
- Data types inference
- Missing value patterns
- Statistical summaries
- Correlation analysis
- Distribution visualization
- Outlier detection
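A minimal first-pass EDA sketch with pandas; 'descriptors.csv' is a placeholder path:
```python
# First-pass EDA on a tabular property file with pandas
import pandas as pd

df = pd.read_csv("descriptors.csv")
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head())  # missing-value fractions
print(df.describe())

numeric = df.select_dtypes("number")
print(numeric.corr().round(2))  # pairwise correlations between numeric columns
```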
.json - JavaScript Object Notation
Description: Structured text data format
Typical Data: Chemical properties, metadata, API responses
Use Cases: Data interchange, configuration, web APIs
Python Libraries:
- json: Built-in JSON support
- pandas: pd.read_json()
- ujson: Faster JSON parsing
EDA Approach:
- Schema validation
- Nesting depth analysis
- Key-value distribution
- Data type consistency
- Array length statistics
.parquet - Apache Parquet
Description: Columnar storage format
Typical Data: Large tabular datasets stored column-wise
Use Cases: Big data, efficient columnar analytics
Python Libraries:
- pandas: pd.read_parquet('file.parquet')
- pyarrow: Direct parquet access
- fastparquet: Alternative implementation
EDA Approach:
- Column statistics from metadata
- Partition analysis
- Compression efficiency
- Row group structure
- Fast sampling for large files
- Schema evolution tracking
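A minimal metadata-only inspection with pyarrow; 'dataset.parquet' is a placeholder, and column statistics appear only if the writer stored them:
```python
# Read Parquet metadata without loading the data, using pyarrow
import pyarrow.parquet as pq

pf = pq.ParquetFile("dataset.parquet")
meta = pf.metadata
print("rows:", meta.num_rows, " row groups:", meta.num_row_groups)
print(pf.schema_arrow)

row_group = meta.row_group(0)
for i in range(row_group.num_columns):
    col = row_group.column(i)
    print(col.path_in_schema, col.statistics)  # statistics may be None
```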