Proteomics and Metabolomics File Formats Reference
This reference covers file formats specific to proteomics, metabolomics, lipidomics, and related omics workflows.
Mass Spectrometry-Based Proteomics
.mzML - Mass Spectrometry Markup Language
Description: HUPO-PSI standard XML format for mass spectrometry data
Typical Data: MS1 and MS2 spectra, retention times, intensities
Use Cases: Proteomics, metabolomics pipelines
Python Libraries:
- pymzml: pymzml.run.Reader('file.mzML')
- pyteomics.mzml: pyteomics.mzml.read('file.mzML')
- pyopenms: OpenMS Python bindings
EDA Approach:
- Scan count and MS level distribution
- Total ion chromatogram (TIC) analysis
- Base peak chromatogram (BPC)
- m/z coverage and resolution
- Retention time range
- Precursor selection patterns
- Data completeness
- Quality control metrics (lock mass, standards)
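The scan-count, MS-level, and TIC checks above can be sketched as follows. This is a minimal sketch that assumes spectra shaped like the dictionaries yielded by pyteomics.mzml.read (with 'ms level' and 'intensity array' entries). The function name summarize_spectra and the demo data are hypothetical, and the TIC is recomputed from the intensity array rather than read from file metadata.

```python
from collections import Counter


def summarize_spectra(spectra):
    """Count spectra per MS level and compute a TIC value per MS1 scan.

    `spectra` is any iterable of dict-like objects with an 'ms level'
    entry and an 'intensity array' entry (assumed layout: the one
    produced by pyteomics.mzml.read).
    """
    level_counts = Counter()
    ms1_tic = []
    for spectrum in spectra:
        level = spectrum.get("ms level")
        level_counts[level] += 1
        if level == 1:
            # TIC for a scan = sum of all intensities in that scan
            ms1_tic.append(float(sum(spectrum["intensity array"])))
    return level_counts, ms1_tic


# Synthetic stand-in for parsed spectra; with a real file one would pass
# pyteomics.mzml.read('file.mzML') instead.
demo = [
    {"ms level": 1, "intensity array": [100.0, 50.0]},
    {"ms level": 2, "intensity array": [10.0]},
    {"ms level": 1, "intensity array": [200.0, 25.0, 25.0]},
]
counts, tic = summarize_spectra(demo)
print(dict(counts), tic)
```

Because the function consumes an iterable, large files can be streamed without holding every spectrum in memory.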
.mzXML - Legacy MS XML Format
Description: Older XML-based MS format
Typical Data: Mass spectra with metadata
Use Cases: Legacy proteomics data
Python Libraries:
- pyteomics.mzxml
- pyopenms: MzXMLFile reader (pymzml targets mzML only)
EDA Approach:
- Similar to mzML
- Format version compatibility
- Conversion quality validation
- Metadata preservation check
.mzIdentML - Peptide Identification Format
Description: PSI standard for peptide identifications
Typical Data: Peptide-spectrum matches, proteins, scores
Use Cases: Search engine results, proteomics workflows
Python Libraries:
- pyteomics.mzid
- pyopenms: MzIdentML support
EDA Approach:
- PSM count and score distribution
- FDR calculation and filtering
- Modification analysis
- Missed cleavage statistics
- Protein inference results
- Search parameters validation
- Decoy hit analysis
- Rank-1 vs lower ranks
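The FDR and decoy-hit items above can be illustrated with a plain target-decoy estimate. This is a sketch under stated assumptions, not the method of any particular search engine: scores are assumed to be "higher is better", FDR is estimated as decoys divided by targets at each threshold, and the function name target_decoy_qvalues and the demo PSMs are hypothetical.

```python
def target_decoy_qvalues(psms):
    """Return (score, is_decoy, q_value) tuples sorted by descending score.

    `psms` is an iterable of (score, is_decoy) pairs where a higher score
    is assumed to mean a better match.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Simple estimate: decoys / targets at this score threshold
        fdrs.append(decoys / max(targets, 1))
    # q-value = smallest FDR at which this PSM would still be accepted
    qvalues = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()
    return [(s, d, q) for (s, d), q in zip(ranked, qvalues)]


# Made-up PSMs: (score, is_decoy)
demo_psms = [(9.1, False), (8.4, False), (7.9, True), (7.5, False), (6.0, True)]
results = target_decoy_qvalues(demo_psms)
# Keep target PSMs whose q-value passes an illustrative threshold
accepted = [r for r in results if not r[1] and r[2] <= 0.34]
```

In practice the threshold is usually 0.01 or 0.05; the loose value here only suits the five-row demo.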
.pepXML - Trans-Proteomic Pipeline Peptide XML
Description: TPP format for peptide identifications
Typical Data: Search results with statistical validation
Use Cases: Proteomics database search output
Python Libraries:
- pyteomics.pepxml
EDA Approach:
- Search engine comparison
- Score distributions (XCorr, expect value, etc.)
- Charge state analysis
- Modification frequencies
- PeptideProphet probabilities
- Protein coverage
- Spectral counting
.protXML - Protein Inference Results
Description: TPP protein-level identifications
Typical Data: Protein groups, probabilities, peptides
Use Cases: Protein-level analysis
Python Libraries:
- pyteomics.protxml
EDA Approach:
- Protein group statistics
- Parsimonious protein sets
- ProteinProphet probabilities
- Coverage and peptide count per protein
- Unique vs shared peptides
- Protein molecular weight distribution
- GO term enrichment preparation
.pride.xml - PRIDE XML Format
Description: Proteomics Identifications Database format
Typical Data: Complete proteomics experiment data
Use Cases: Public data deposition (legacy)
Python Libraries:
- lxml or xml.etree.ElementTree: generic XML parsing (pyteomics has no PRIDE XML module)
- Custom XML parsers
EDA Approach:
- Experiment metadata extraction
- Identification completeness
- Cross-linking to spectra
- Protocol information
- Instrument details
.tsv / .csv (Proteomics)
Description: Tab or comma-separated proteomics results
Typical Data: Peptide or protein quantification tables
Use Cases: MaxQuant, Proteome Discoverer, Skyline output
Python Libraries:
- pandas: pd.read_csv() or pd.read_table()
EDA Approach:
- Identification counts
- Quantitative value distributions
- Missing value patterns
- Intensity-based analysis
- Label-free quantification assessment
- Isobaric tag ratio analysis
- Coefficient of variation
- Batch effects
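Missing-value patterns and the coefficient of variation from the list above can be computed directly with pandas. The sketch below assumes a wide table with one row per protein and one intensity column per run; the column names and values are made up and do not correspond to any specific MaxQuant, Proteome Discoverer, or Skyline export.

```python
import io

import pandas as pd

# Hypothetical table: one row per protein, one intensity column per run.
demo_tsv = (
    "protein\tctrl_1\tctrl_2\tctrl_3\n"
    "P1\t100\t110\t90\n"
    "P2\t200\t\t220\n"
    "P3\t\t\t50\n"
)
table = pd.read_csv(io.StringIO(demo_tsv), sep="\t", index_col="protein")

# Fraction of missing values per protein (row) and per run (column)
missing_per_protein = table.isna().mean(axis=1)
missing_per_run = table.isna().mean(axis=0)

# Coefficient of variation (%) across replicates; NaNs are skipped, and
# proteins with a single observed value get NaN because the sample
# standard deviation is undefined for one point.
cv_percent = table.std(axis=1, ddof=1) / table.mean(axis=1) * 100
```

Some tools write 0 rather than an empty cell for undetected values; in that case the zeros would need to be converted to NaN before these statistics are meaningful.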
.msf - Thermo MSF Database
Description: Proteome Discoverer results database
Typical Data: SQLite database with search results
Use Cases: Thermo Proteome Discoverer workflows
Python Libraries:
- sqlite3: Database access
- Custom MSF parsers
EDA Approach:
- Database schema exploration
- Peptide and protein tables
- Score thresholds
- Quantification data
- Processing node information
- Confidence levels
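Schema exploration, the first item above, needs nothing beyond the standard sqlite3 module. The sketch below lists every table with its columns and row count. The helper name describe_sqlite and the demo table are hypothetical; actual MSF table names vary by Proteome Discoverer version and are not assumed here.

```python
import sqlite3


def describe_sqlite(connection):
    """Return {table_name: (column_names, row_count)} for every table."""
    summary = {}
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (name,) in tables:
        # Table names come from the database itself, so quote them defensively
        quoted = '"' + name.replace('"', '""') + '"'
        columns = [row[1] for row in connection.execute(f"PRAGMA table_info({quoted})")]
        (row_count,) = connection.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()
        summary[name] = (columns, row_count)
    return summary


# In-memory stand-in with a made-up table; a real file would be opened
# with sqlite3.connect('file.msf').
demo_db = sqlite3.connect(":memory:")
demo_db.execute("CREATE TABLE DemoPeptides (PeptideID INTEGER, Sequence TEXT)")
demo_db.executemany(
    "INSERT INTO DemoPeptides VALUES (?, ?)", [(1, "PEPTIDEK"), (2, "SAMPLER")]
)
schema = describe_sqlite(demo_db)
```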
.pdResult - Proteome Discoverer Result
Description: Proteome Discoverer study results
Typical Data: Comprehensive search and quantification
Use Cases: PD study exports
Python Libraries:
- Vendor tools for conversion
- Export to TSV for Python analysis
EDA Approach:
- Study design validation
- Result filtering criteria
- Quantitative comparison groups
- Imputation strategies
.pep.xml - Peptide Summary
Description: Compact peptide identification format
Typical Data: Peptide sequences, modifications, scores
Use Cases: Downstream analysis input
Python Libraries:
- pyteomics.pepxml: .pep.xml is the conventional pepXML file extension
EDA Approach:
- Unique peptide counting
- PTM site localization
- Retention time predictability
- Charge state preferences
Quantitative Proteomics
.sky - Skyline Document
Description: Skyline targeted proteomics document
Typical Data: Transition lists, chromatograms, results
Use Cases: Targeted proteomics (SRM/MRM/PRM)
Python Libraries:
- No official Python package; the .sky file is XML and can be read with xml.etree.ElementTree or lxml
- Export to CSV for analysis
EDA Approach:
- Transition selection validation
- Chromatographic peak quality
- Interference detection
- Retention time consistency
- Calibration curve assessment
- Replicate correlation
- LOD/LOQ determination
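Calibration curve assessment and LOD/LOQ determination can be sketched with an ordinary least-squares fit. The example assumes one common convention, LOD = 3.3σ/S and LOQ = 10σ/S, where S is the slope and σ is the residual standard deviation of the fit; other definitions (for example blank-based) are equally valid. The function names and the concentration and peak-area values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def lod_loq(concentrations, responses):
    """Estimate (LOD, LOQ) as 3.3*sigma/slope and 10*sigma/slope."""
    slope, intercept = fit_line(concentrations, responses)
    residuals = [
        yi - (slope * xi + intercept)
        for xi, yi in zip(concentrations, responses)
    ]
    # Residual standard deviation with n - 2 degrees of freedom
    sigma = (sum(r * r for r in residuals) / (len(residuals) - 2)) ** 0.5
    return 3.3 * sigma / slope, 10 * sigma / slope


# Made-up calibration points: concentration vs. integrated peak area
demo_conc = [1, 2, 4, 8]
demo_area = [105, 195, 410, 790]
lod, loq = lod_loq(demo_conc, demo_area)
```

The inputs would normally come from a Skyline report exported to CSV, with one row per calibration level and replicate.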
.sky.zip - Zipped Skyline Document
Description: Skyline document with external files
Typical Data: Complete Skyline analysis
Use Cases: Sharing Skyline projects
Python Libraries:
- zipfile: Extract for processing
EDA Approach:
- Document structure
- External file references
- Result export and analysis
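Document structure can be inspected without extracting the archive. The sketch below lists members and their uncompressed sizes using the standard zipfile module. The helper name list_archive and the member names in the demo archive are made up for illustration.

```python
import io
import zipfile


def list_archive(zip_source):
    """Return (filename, uncompressed_size) for each archive member."""
    with zipfile.ZipFile(zip_source) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]


# Build a tiny in-memory archive with made-up member names and contents
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo.sky", "<placeholder/>")
    archive.writestr("demo.skyd", b"\x00" * 16)
buffer.seek(0)

members = list_archive(buffer)
sky_documents = [name for name, _ in members if name.endswith(".sky")]
```

With a real project the path to the .sky.zip file can be passed to list_archive in place of the buffer.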
.wiff - SCIEX WIFF Format
Description: SCIEX instrument data with quantitation
Typical Data: LC-MS/MS with MRM transitions
Use Cases: SCIEX QTRAP, TripleTOF data
Python Libraries:
- Vendor tools (limited Python access)
- Conversion to mzML
EDA Approach:
- MRM transition performance
- Dwell time optimization
- Cycle time analysis
- Peak integration quality
.raw (Thermo)
Description: Thermo raw instrument file
Typical Data: Full MS data from Orbitrap, Q Exactive
Use Cases: Label-free and TMT quantification
Python Libraries:
- pymsfilereader: Python wrapper around Thermo MSFileReader (Windows only)
- ThermoRawFileParser: Cross-platform CLI
EDA Approach:
- MS1 and MS2 acquisition rates
- AGC target and fill times
- Resolution settings
- Isolation window validation
- SPS ion selection (TMT)
- Contamination assessment
.d (Agilent)
Description: Agilent data directory
Typical Data: LC-MS and GC-MS data
Use Cases: Agilent instrument workflows
Python Libraries:
- Community parsers
- Export to mzML
EDA Approach:
- Method consistency
- Calibration status
- Sequence run information
- Retention time stability
Metabolomics and Lipidomics
.mzML (Metabolomics)
Description: Standard MS format for metabolomics
Typical Data: Full scan MS, targeted MS/MS
Use Cases: Untargeted and targeted metabolomics
Python Libraries:
- Same as proteomics mzML tools
EDA Approach:
- Feature detection quality
- Mass accuracy assessment
- Retention time alignment
- Blank subtraction
- QC sample consistency
- Isotope pattern validation
- Adduct formation analysis
- In-source fragmentation check
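Mass accuracy assessment and adduct analysis both reduce to simple arithmetic on m/z values. The sketch below computes the expected m/z of a singly protonated ion and the signed error in ppm. The function names are hypothetical, the observed value is invented, and only the [M+H]+ adduct is handled.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton (rounded)


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def protonated_mz(neutral_monoisotopic_mass):
    """m/z of the singly charged [M+H]+ ion."""
    return neutral_monoisotopic_mass + PROTON_MASS


# Glucose, C6H12O6, monoisotopic mass about 180.06339 Da
theoretical = protonated_mz(180.06339)
# 181.0710 is an invented observed value for illustration
error = ppm_error(181.0710, theoretical)
print(f"[M+H]+ = {theoretical:.4f}, error = {error:.2f} ppm")
```

Other adducts follow the same pattern with a different mass shift and, for multiply charged ions, a division by the charge.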
.cdf / .netCDF - ANDI-MS
Description: Analytical Data Interchange for MS
Typical Data: GC-MS, LC-MS chromatography data
Use Cases: Metabolomics, GC-MS workflows
Python Libraries:
- netCDF4: Low-level access
- pyopenms: no netCDF reader in current releases; convert to mzML first
- rpy2: call the R package xcms from Python
EDA Approach:
- TIC and extracted ion chromatograms
- Peak detection across samples
- Retention index calculation
- Mass spectral matching
- Library search preparation
.msp - Mass Spectral Format (NIST)
Description: NIST spectral library format
Typical Data: Reference mass spectra
Use Cases: Metabolite identification, library matching
Python Libraries:
- matchms: Spectral matching
- Custom MSP parsers
EDA Approach:
- Library coverage
- Metadata completeness (InChI, SMILES)
- Spectral quality metrics
- Collision energy standardization
- Precursor type annotation
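A custom MSP parser, as mentioned under Python Libraries, can be quite small. The sketch below assumes the common layout of "Key: value" header lines, a "Num Peaks" line, one "m/z intensity" pair per line, and blank lines between records. Real libraries vary (for example several peaks per line separated by semicolons), so this is a starting point rather than a complete reader. The function name parse_msp and the demo records are made up.

```python
def parse_msp(text):
    """Parse MSP-style text into a list of {'metadata': {...}, 'peaks': [...]}.

    Assumes 'Key: value' header lines, then one 'm/z intensity' pair per
    line, with blank lines between records.
    """
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current record
            if current is not None:
                records.append(current)
                current = None
            continue
        if current is None:
            current = {"metadata": {}, "peaks": []}
        if ":" in line and not line[0].isdigit():
            # Header line: store with a lower-cased key
            key, value = line.split(":", 1)
            current["metadata"][key.strip().lower()] = value.strip()
        else:
            # Peak line: first two whitespace-separated fields
            fields = line.split()
            current["peaks"].append((float(fields[0]), float(fields[1])))
    if current is not None:
        records.append(current)
    return records


demo_msp = """Name: Demo compound A
PrecursorMZ: 181.07
Num Peaks: 2
85.03 120
163.06 999

Name: Demo compound B
Num Peaks: 1
91.05 500
"""
library = parse_msp(demo_msp)
# Metadata completeness: which records lack an InChIKey entry?
missing_inchikey = [
    r["metadata"].get("name") for r in library if "inchikey" not in r["metadata"]
]
```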
.mgf (Metabolomics)
Description: Mascot Generic Format for MS/MS
Typical Data: MS/MS spectra for metabolite ID
Use Cases: Spectral library searching
Python Libraries:
- matchms: Metabolomics spectral analysis
- pyteomics.mgf
EDA Approach:
- Spectrum quality filtering
- Precursor isolation purity
- Fragment m/z accuracy
- Neutral loss patterns
- MS/MS completeness
.nmrML - NMR Markup Language
Description: Standard XML format for NMR metabolomics
Typical Data: 1D/2D NMR spectra with metadata
Use Cases: NMR-based metabolomics
Python Libraries:
- nmrml2isa: Format conversion
- Custom XML parsers
EDA Approach:
- Spectral quality metrics
- Binning consistency
- Reference compound validation
- pH and temperature effects
- Metabolite identification confidence
.json (Metabolomics)
Description: JSON format for metabolomics results
Typical Data: Feature tables, annotations, metadata
Use Cases: GNPS, MetaboAnalyst, web tools
Python Libraries:
- json: Standard library
- pandas: JSON normalization
EDA Approach:
- Feature annotation coverage
- GNPS clustering results
- Molecular networking statistics
- Adduct and in-source fragment linkage
- Putative identification confidence
.txt (Metabolomics Tables)
Description: Tab-delimited feature tables
Typical Data: m/z, RT, intensities across samples
Use Cases: MZmine, XCMS, MS-DIAL output
Python Libraries:
- pandas: Text file reading
EDA Approach:
- Feature count and quality
- Missing value imputation
- Data normalization assessment
- Batch correction validation
- PCA and clustering for QC
- Fold change calculations
- Statistical test preparation
.featureXML - OpenMS Feature Format
Description: OpenMS detected features
Typical Data: LC-MS features with quality scores
Use Cases: OpenMS workflows
Python Libraries:
- pyopenms: FeatureXML support
EDA Approach:
- Feature detection parameters
- Quality metrics per feature
- Isotope pattern fitting
- Charge state assignment
- FWHM and asymmetry
.consensusXML - OpenMS Consensus Features
Description: Linked features across samples
Typical Data: Aligned features with group info
Use Cases: Multi-sample LC-MS analysis
Python Libraries:
- pyopenms: ConsensusXML reading
EDA Approach:
- Feature correspondence quality
- Retention time alignment
- Missing value patterns
- Intensity normalization needs
- Batch-wise feature agreement
.idXML - OpenMS Identification Format
Description: Peptide/metabolite identifications
Typical Data: MS/MS identifications with scores
Use Cases: OpenMS ID workflows
Python Libraries:
- pyopenms: IdXML support
EDA Approach:
- Identification rate
- Score distribution
- Spectral match quality
- False discovery assessment
- Annotation transfer validation
Lipidomics-Specific Formats
.lcb - LipidCreator Batch
Description: LipidCreator transition list
Typical Data: Lipid transitions for targeted MS
Use Cases: Targeted lipidomics
Python Libraries:
- Export to CSV for processing
EDA Approach:
- Transition coverage per lipid class
- Retention time prediction
- Collision energy optimization
- Class-specific fragmentation patterns
.mzTab - Proteomics/Metabolomics Tabular Format
Description: PSI tabular summary format
Typical Data: Protein/peptide/metabolite quantification
Use Cases: Publication and data sharing
Python Libraries:
- pyteomics.mztab
- pandas for TSV-like structure
EDA Approach:
- Data completeness
- Metadata section validation
- Quantification method
- Identification confidence
- Software and parameters
- Quality metrics summary
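The "TSV-like structure" of mzTab means each line starts with a short code naming its section (MTD for metadata, PRH/PRT for the protein header and rows, and so on). The sketch below groups lines by that code using only the standard library. The function name split_mztab_sections is hypothetical, and the demo text is heavily truncated, so it is not a valid mzTab file.

```python
from collections import defaultdict


def split_mztab_sections(text):
    """Group mzTab lines by their leading line-type code (MTD, PRH, PRT, ...)."""
    sections = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        code, *fields = line.split("\t")
        sections[code].append(fields)
    return sections


# Truncated, made-up content used only to exercise the function
demo_mztab = "\n".join([
    "MTD\tmzTab-version\t1.0.0",
    "MTD\tdescription\tDemo file",
    "PRH\taccession\tdescription",
    "PRT\tP12345\tDemo protein",
    "PRT\tP67890\tAnother demo protein",
])
sections = split_mztab_sections(demo_mztab)

# Metadata lines are key/value pairs; table rows pair up with their header
metadata = {key: value for key, value in sections["MTD"]}
proteins = [dict(zip(sections["PRH"][0], row)) for row in sections["PRT"]]
```

The resulting row dictionaries can be handed to pandas.DataFrame for the completeness checks listed above.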
.csv (LipidSearch, LipidMatch)
Description: Lipid identification results
Typical Data: Lipid annotations, grades, intensities
Use Cases: Lipidomics software output
Python Libraries:
- pandas: CSV reading
EDA Approach:
- Lipid class distribution
- Identification grade/confidence
- Fatty acid composition analysis
- Double bond and chain length patterns
- Intensity correlations
- Normalization to internal standards
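Lipid class distribution and chain length or double bond patterns depend on parsing the lipid name. The sketch below assumes sum-composition shorthand such as "PC 34:1" (class, total carbons, total double bonds). Exports from LipidSearch or LipidMatch may use other notations that would need a different pattern. The names LIPID_PATTERN and parse_sum_composition are hypothetical.

```python
import re
from collections import Counter

# Matches sum-composition names such as "PC 34:1" or "TG 52:3"
LIPID_PATTERN = re.compile(
    r"^(?P<lipid_class>[A-Za-z]+)\s+(?P<carbons>\d+):(?P<double_bonds>\d+)$"
)


def parse_sum_composition(name):
    """Return (lipid_class, total_carbons, double_bonds) or None if unparsed."""
    match = LIPID_PATTERN.match(name.strip())
    if match is None:
        return None
    return (
        match["lipid_class"],
        int(match["carbons"]),
        int(match["double_bonds"]),
    )


demo_names = ["PC 34:1", "PC 36:2", "TG 52:3", "unknown feature"]
parsed = [parse_sum_composition(n) for n in demo_names]
# Lipid class distribution over the names that could be parsed
class_counts = Counter(p[0] for p in parsed if p is not None)
```

Returning None for unparsed names keeps them visible, so the share of unrecognized annotations can be reported as part of the EDA.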
.sdf (Metabolomics)
Description: Structure data file for metabolites
Typical Data: Chemical structures with properties
Use Cases: Metabolite database creation
Python Libraries:
- RDKit: Chem.SDMolSupplier('file.sdf')
EDA Approach:
- Structure validation
- Property calculation (logP, MW, TPSA)
- Molecular formula consistency
- Tautomer enumeration
- Retention time prediction features
.mol (Metabolomics)
Description: Single molecule structure files
Typical Data: Metabolite chemical structure
Use Cases: Structure-based searches
Python Libraries:
- RDKit: Chem.MolFromMolFile('file.mol')
EDA Approach:
- Structure correctness
- Stereochemistry validation
- Charge state
- Implicit hydrogen handling
Data Processing and Analysis
.h5 / .hdf5 (Omics)
Description: HDF5 for large omics datasets
Typical Data: Feature matrices, spectra, metadata
Use Cases: Large-scale studies, cloud computing
Python Libraries:
- h5py: HDF5 access
- anndata: .h5ad files (HDF5-based), for example single-cell proteomics
EDA Approach:
- Dataset organization
- Chunking and compression
- Metadata structure
- Efficient data access patterns
- Sample and feature annotations
.Rdata / .rds - R Objects
Description: Serialized R analysis objects
Typical Data: Processed omics results from R packages
Use Cases: xcms, CAMERA, MSnbase workflows
Python Libraries:
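- pyreadr: pyreadr.read_r('file.Rdata') (data frames and similar tabular objects only)
- rpy2: R-Python integration, needed for S4 objects such as those produced by xcms or MSnbase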
EDA Approach:
- Object structure exploration
- Data extraction
- Method parameter review
- Conversion to Python-native formats
.mzTab-M - Metabolomics mzTab
Description: mzTab specific to metabolomics
Typical Data: Small molecule quantification
Use Cases: Metabolomics data sharing
Python Libraries:
- pyteomics.mztab: Can parse mzTab-M
EDA Approach:
- Small molecule evidence
- Feature quantification
- Database references (HMDB, KEGG, etc.)
- Adduct and charge annotation
- MS level information
.parquet (Omics)
Description: Columnar storage for large tables
Typical Data: Feature matrices, metadata
Use Cases: Efficient big data omics
Python Libraries:
- pandas: pd.read_parquet()
- pyarrow: Direct parquet access
EDA Approach:
- Compression efficiency
- Column-wise statistics
- Partition structure
- Schema validation
- Fast filtering and aggregation
.pkl (Omics Models)
Description: Pickled Python objects
Typical Data: ML models, processed data
Use Cases: Workflow intermediate storage
Python Libraries:
- pickle: Standard serialization
- joblib: Enhanced pickling
EDA Approach:
- Object type and structure
- Model parameters
- Feature importance (if ML model)
- Data shapes and types
- Deserialization validation
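Object type and structure checks can be sketched as below. Note that pickle.load can execute arbitrary code while loading, so this is only appropriate for files from a trusted source. The helper name inspect_pickle and the demo object are hypothetical.

```python
import io
import pickle


def inspect_pickle(file_obj):
    """Load a pickle from a trusted source and summarize what it contains."""
    # pickle.load can execute arbitrary code: only use it on trusted files
    obj = pickle.load(file_obj)
    summary = {"type": type(obj).__name__}
    if isinstance(obj, dict):
        summary["keys"] = sorted(map(str, obj))
    elif hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    return obj, summary


# Round-trip a made-up object through an in-memory buffer
demo_bytes = pickle.dumps({"feature_names": ["f1", "f2"], "weights": [0.2, 0.8]})
obj, summary = inspect_pickle(io.BytesIO(demo_bytes))
```

For a file on disk, the same function can be called with open('file.pkl', 'rb').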
.zarr (Omics)
Description: Chunked, compressed array storage
Typical Data: Multi-dimensional omics data
Use Cases: Cloud-optimized analysis
Python Libraries:
- zarr: Array storage
EDA Approach:
- Chunk optimization
- Compression codecs
- Multi-scale data
- Parallel access patterns
- Metadata annotations