General Scientific Data Formats Reference
This reference covers general-purpose scientific data formats used across multiple disciplines.
Numerical and Array Data
.npy - NumPy Array
Description: Binary NumPy array format
Typical Data: N-dimensional arrays of any data type
Use Cases: Fast I/O for numerical data, intermediate results
Python Libraries:
- numpy: np.load('file.npy'), np.save('file.npy', arr)
- numpy: Memory-mapped access with np.load('file.npy', mmap_mode='r')
EDA Approach:
- Array shape and dimensionality
- Data type and precision
- Statistical summary (mean, std, min, max, percentiles)
- Missing or invalid values (NaN, inf)
- Memory footprint
- Value distribution and histogram
- Sparsity analysis
- Correlation structure (if 2D)
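Several of the checks above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a real-valued numeric array; the function name summarize_array and the choice of percentiles are hypothetical, not part of any library.

```python
import numpy as np

def summarize_array(arr):
    """Return a small dict of EDA facts for a real-valued numeric array."""
    arr = np.asarray(arr)
    summary = {
        "shape": arr.shape,
        "ndim": arr.ndim,
        "dtype": str(arr.dtype),
        "nbytes": arr.nbytes,  # in-memory footprint, not the file size
    }
    if np.issubdtype(arr.dtype, np.floating):
        summary["n_nan"] = int(np.isnan(arr).sum())
        summary["n_inf"] = int(np.isinf(arr).sum())
        finite = arr[np.isfinite(arr)]  # statistics ignore NaN and inf
    else:
        summary["n_nan"] = 0
        summary["n_inf"] = 0
        finite = arr.ravel()
    if finite.size:
        summary["mean"] = float(finite.mean())
        summary["std"] = float(finite.std())
        summary["min"] = float(finite.min())
        summary["max"] = float(finite.max())
        summary["percentiles_5_50_95"] = [
            float(p) for p in np.percentile(finite, [5, 50, 95])
        ]
    # Sparsity: fraction of entries that are exactly zero.
    summary["zero_fraction"] = float((arr == 0).mean()) if arr.size else 0.0
    return summary

# Usage with a hypothetical file. mmap_mode="r" maps the file instead of
# loading it eagerly; the statistics still read every element.
# arr = np.load("file.npy", mmap_mode="r")
# print(summarize_array(arr))
```

For arrays too large to scan in full, the same function could be applied to a random or strided sample of the memory-mapped array.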
.npz - NumPy Archive
Description: Zip archive holding multiple NumPy arrays in one file, optionally compressed
Typical Data: Collections of related arrays
Use Cases: Saving multiple arrays together, compressed storage
Python Libraries:
- numpy: np.load('file.npz') returns a dict-like NpzFile; arrays load lazily on key access
- numpy: np.savez() (uncompressed) or np.savez_compressed() (zip-deflated)
EDA Approach:
- List of contained arrays
- Individual array analysis
- Relationships between arrays
- Total file size and compression ratio
- Naming conventions
- Data consistency checks
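Listing the contained arrays is usually the first step. The sketch below builds a small archive in memory and inventories it; the helper name list_npz_arrays and the array names are made up for illustration.

```python
import io
import numpy as np

def list_npz_arrays(path_or_file):
    """Map each array name in an .npz archive to (shape, dtype string)."""
    inventory = {}
    # allow_pickle=False (the default) refuses object arrays, which is the
    # safer choice for files of unknown origin.
    with np.load(path_or_file, allow_pickle=False) as archive:
        for name in archive.files:
            array = archive[name]  # each array is read only when accessed
            inventory[name] = (array.shape, str(array.dtype))
    return inventory

# Demo on an in-memory archive so the example does not touch the disk.
buffer = io.BytesIO()
np.savez_compressed(buffer, signal=np.zeros((100, 3)), labels=np.arange(100))
buffer.seek(0)
inventory = list_npz_arrays(buffer)
```

Comparing the leading dimensions in the inventory (here both arrays have 100 entries) is a quick consistency check between related arrays.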
.csv - Comma-Separated Values
Description: Plain text tabular data
Typical Data: Experimental measurements, results tables
Use Cases: Universal data exchange, spreadsheet export
Python Libraries:
- pandas: pd.read_csv('file.csv')
- csv: Built-in module
- polars: High-performance CSV reading
- numpy: np.loadtxt() or np.genfromtxt()
EDA Approach:
- Row and column counts
- Data type inference
- Missing value patterns and frequency
- Column statistics (numeric: mean, std; categorical: frequencies)
- Outlier detection
- Correlation matrix
- Duplicate row detection
- Header and index validation
- Encoding issues detection
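A few of these checks need nothing beyond the built-in csv module. The following is a rough sketch, assuming the file has a header row; the function name profile_csv and the list of missing-value tokens are illustrative choices.

```python
import csv
import io
from collections import Counter

def profile_csv(text_stream, missing_tokens=("", "NA", "NaN", "null")):
    """Count rows, columns, missing cells per column, and duplicate rows."""
    reader = csv.reader(text_stream)
    header = next(reader)  # assumes the first row holds column names
    missing = Counter()
    seen = Counter()
    ragged_rows = 0
    n_rows = 0
    for row in reader:
        n_rows += 1
        if len(row) != len(header):
            ragged_rows += 1  # row width disagrees with the header
        for name, cell in zip(header, row):
            if cell.strip() in missing_tokens:
                missing[name] += 1
        seen[tuple(row)] += 1
    # Every repeat beyond the first occurrence counts as a duplicate.
    duplicates = sum(count - 1 for count in seen.values() if count > 1)
    return {
        "n_rows": n_rows,
        "n_columns": len(header),
        "missing_per_column": dict(missing),
        "duplicate_rows": duplicates,
        "ragged_rows": ragged_rows,
    }

sample = io.StringIO("id,temp,site\n1,20.5,A\n2,,B\n2,,B\n3,NA,\n")
report = profile_csv(sample)
```

For a file on disk, opening it with open(path, newline="", encoding="utf-8") and passing the handle works the same way; a UnicodeDecodeError at that point is itself a useful encoding check.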
.tsv / .tab - Tab-Separated Values
Description: Tab-delimited tabular data
Typical Data: Similar to CSV but tab-separated
Use Cases: Bioinformatics, text processing output
Python Libraries:
- pandas: pd.read_csv('file.tsv', sep='\t')
EDA Approach:
- Same as CSV format
- Tab vs space validation
- Quote handling
.xlsx / .xls - Excel Spreadsheets
Description: Microsoft Excel formats (.xlsx: zipped XML; .xls: legacy binary)
Typical Data: Tabular data with formatting, formulas
Use Cases: Lab notebooks, data entry, reports
Python Libraries:
- pandas: pd.read_excel('file.xlsx')
- openpyxl: Full Excel file manipulation
- xlrd: Reading .xls (legacy)
EDA Approach:
- Sheet enumeration and names
- Per-sheet data analysis
- Formulas vs cached values (openpyxl reads formulas as text; data_only=True returns the last cached results)
- Merged cells handling
- Hidden rows/columns
- Data validation rules
- Named ranges
- Formatting-only cells detection
.json - JavaScript Object Notation
Description: Hierarchical text data format
Typical Data: Nested data structures, metadata
Use Cases: API responses, configuration, results
Python Libraries:
- json: Built-in module
- pandas: pd.read_json()
- ujson: Faster JSON parsing
EDA Approach:
- Schema inference
- Nesting depth
- Key-value distribution
- Array lengths
- Data type consistency
- Missing keys
- Duplicate detection
- Size and complexity metrics
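Nesting depth and missing keys can be measured with short recursive helpers. This is a sketch on a made-up document; json_depth and key_usage are hypothetical helper names.

```python
import json
from collections import Counter

def json_depth(node):
    """Nesting depth: scalars are 0, each dict or list level adds 1."""
    if isinstance(node, dict):
        return 1 + max((json_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((json_depth(v) for v in node), default=0)
    return 0

def key_usage(records):
    """Count how many dict records in a list contain each top-level key."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts

doc = json.loads(
    '{"runs": [{"id": 1, "temp": 20.5}, {"id": 2}], "meta": {"lab": "A"}}'
)
depth = json_depth(doc)
usage = key_usage(doc["runs"])
# Records that lack a key which other records carry.
missing_temp = len(doc["runs"]) - usage["temp"]
```

Keys whose count is lower than the number of records point to optional or inconsistently populated fields.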
.xml - Extensible Markup Language
Description: Hierarchical markup format
Typical Data: Structured data with metadata
Use Cases: Standards-based data exchange, APIs
Python Libraries:
- lxml: lxml.etree.parse()
- xml.etree.ElementTree: Built-in XML
- xmltodict: Convert XML to dict
EDA Approach:
- Schema/DTD validation
- Element hierarchy and depth
- Namespace handling
- Attribute vs element content
- CDATA sections
- Text content extraction
- Sibling and child counts
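Element hierarchy, depth, and attribute usage can be profiled with the built-in ElementTree parser. A minimal sketch on an invented snippet follows; xml_profile is a hypothetical helper. Note that namespaced tags would show up as "{uri}tag" in the counts.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def xml_profile(root):
    """Return (max depth, tag counts, attribute count) for an element tree."""
    tag_counts = Counter()
    n_attributes = 0
    max_depth = 0
    stack = [(root, 1)]  # iterative walk avoids recursion limits on deep trees
    while stack:
        element, depth = stack.pop()
        tag_counts[element.tag] += 1
        n_attributes += len(element.attrib)
        max_depth = max(max_depth, depth)
        stack.extend((child, depth + 1) for child in element)
    return max_depth, tag_counts, n_attributes

root = ET.fromstring(
    '<experiment id="e1"><sample name="s1"><value unit="K">293</value></sample>'
    '<sample name="s2"/></experiment>'
)
depth, tags, n_attrs = xml_profile(root)
```

For a file on disk, ET.parse(path).getroot() would supply the root element instead of ET.fromstring.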
.yaml / .yml - YAML
Description: Human-readable data serialization
Typical Data: Configuration, metadata, parameters
Use Cases: Experiment configurations, pipelines
Python Libraries:
- PyYAML (import yaml): yaml.safe_load(); yaml.load() requires an explicit Loader and should not be used on untrusted input
- ruamel.yaml: YAML 1.2 support
EDA Approach:
- Configuration structure
- Data type handling
- List and dict depth
- Anchor and alias usage
- Multi-document files
- Comments preservation
- Validation against schema
.toml - TOML Configuration
Description: Configuration file format
Typical Data: Settings, parameters
Use Cases: Python package configuration, settings
Python Libraries:
- tomllib: Built-in TOML reading (Python 3.11+); tomli: backport for older versions
- toml: Reading and writing (third-party)
EDA Approach:
- Section structure
- Key-value pairs
- Data type inference
- Nested table validation
- Required vs optional fields
.ini - INI Configuration
Description: Simple configuration format
Typical Data: Application settings
Use Cases: Legacy configurations, simple settings
Python Libraries:
- configparser: Built-in INI parser
EDA Approach:
- Section enumeration
- Key-value extraction
- Type conversion
- Comment handling
- Case sensitivity
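The points on type conversion, comments, and case sensitivity are easiest to see on a small example. The configuration text below is invented for illustration.

```python
import configparser

INI_TEXT = """
[Acquisition]
SampleRate = 1000
Enabled = yes
; comment lines are dropped by the parser

[output]
path = results/run1
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# Section names keep their case; option names are lowercased by default.
sections = parser.sections()
keys = list(parser["Acquisition"].keys())

# Values are stored as strings; typed getters do the conversion.
rate = parser.getint("Acquisition", "SampleRate")
enabled = parser.getboolean("Acquisition", "Enabled")
```

If option names must stay case-sensitive, setting parser.optionxform = str before reading disables the lowercasing.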
Binary and Compressed Data
.hdf5 / .h5 - Hierarchical Data Format 5
Description: Container for large scientific datasets
Typical Data: Multi-dimensional arrays, metadata, groups
Use Cases: Large datasets, multi-modal data, parallel I/O
Python Libraries:
- h5py: h5py.File('file.h5', 'r')
- PyTables (import tables): Advanced HDF5 interface
- pandas: HDF5 storage via HDFStore
EDA Approach:
- Group and dataset hierarchy
- Dataset shapes and dtypes
- Attributes and metadata
- Compression and chunking strategy
- Memory-efficient sampling
- Dataset relationships
- File size and efficiency
- Access patterns optimization
.zarr - Chunked Array Storage
Description: Cloud-optimized chunked arrays
Typical Data: Large N-dimensional arrays
Use Cases: Cloud storage, parallel computing, streaming
Python Libraries:
- zarr: zarr.open('file.zarr')
- xarray: Zarr backend support
EDA Approach:
- Array metadata and dimensions
- Chunk size optimization
- Compression codec and ratio
- Synchronizer and store type
- Multi-scale hierarchies
- Parallel access performance
- Attribute metadata
.gz / .gzip - Gzip Compressed
Description: Compressed data files
Typical Data: Any compressed text or binary
Use Cases: Compression for storage/transfer
Python Libraries:
- gzip: Built-in gzip module
- pandas: Automatic gzip handling in read functions
EDA Approach:
- Compression ratio
- Original file type detection
- Decompression validation
- Header information
- Multi-member archives
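Header fields, compression ratio, and decompression validation can be checked together. The sketch below assumes the whole file fits in memory as bytes; gzip_profile is a hypothetical helper and the text-detection rule is a crude heuristic.

```python
import gzip
import struct

def gzip_profile(raw):
    """Inspect a complete gzip file held in memory as bytes."""
    if raw[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream (magic bytes missing)")
    # Fixed header after the magic: method (1 byte), flags (1), mtime (4).
    method, flags, mtime = struct.unpack("<BBI", raw[2:8])
    # gzip.decompress verifies CRC and length, and reads all members of a
    # multi-member file; it raises on corrupt input.
    data = gzip.decompress(raw)
    return {
        "method": method,  # 8 means deflate
        "mtime": mtime,
        "compressed_size": len(raw),
        "original_size": len(data),
        "ratio": len(data) / len(raw),
        "looks_like_text": b"\x00" not in data[:1024],
    }

# Two gzip members concatenated, as produced by `cat a.gz b.gz`.
payload = b"temperature,20.5\n" * 200
two_members = gzip.compress(payload) + gzip.compress(payload)
info = gzip_profile(two_members)
```

For large files, streaming through gzip.open() in chunks would avoid holding the decompressed content in memory.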
.bz2 - Bzip2 Compressed
Description: Bzip2 compression
Typical Data: Highly compressed files
Use Cases: Archival storage where smaller output than gzip justifies slower compression and decompression
Python Libraries:
- bz2: Built-in bz2 module
- pandas: Automatic bz2 handling in read functions
EDA Approach:
- Compression efficiency
- Decompression time
- Content validation
.zip - ZIP Archive
Description: Archive with multiple files
Typical Data: Collections of files
Use Cases: File distribution, archiving
Python Libraries:
- zipfile: Built-in ZIP support
- pandas: Can read zipped CSVs
EDA Approach:
- Archive member listing
- Compression method per file
- Total vs compressed size
- Directory structure
- File type distribution
- Extraction validation
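The member listing, per-file compression method, and integrity check map directly onto the zipfile module. A minimal sketch follows, using an archive built in memory; zip_inventory and the member names are hypothetical.

```python
import io
import zipfile

def zip_inventory(file_or_path):
    """List members with sizes and compression method; verify CRCs."""
    with zipfile.ZipFile(file_or_path) as archive:
        members = [
            {
                "name": info.filename,
                "is_dir": info.is_dir(),
                "size": info.file_size,
                "compressed": info.compress_size,
                "deflated": info.compress_type == zipfile.ZIP_DEFLATED,
            }
            for info in archive.infolist()
        ]
        # testzip() reads every member and returns the first bad name,
        # or None when all CRC checks pass.
        first_bad = archive.testzip()
    return members, first_bad

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("data/run1.csv", "t,v\n" + "0,1\n" * 500)
    archive.writestr("README.txt", "demo archive")
buffer.seek(0)
members, first_bad = zip_inventory(buffer)
```

Summing the "size" and "compressed" fields gives the total versus compressed size without extracting anything.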
.tar / .tar.gz - TAR Archive
Description: Unix tape archive
Typical Data: Multiple files and directories
Use Cases: Software distribution, backups
Python Libraries:
- tarfile: Built-in TAR support
EDA Approach:
- Member file listing
- Compression (if .tar.gz, .tar.bz2)
- Directory structure
- Permissions preservation
- Extraction testing
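Members, sizes, and stored permissions can be listed without extracting anything. The sketch below builds a small .tar.gz in memory; tar_inventory and the member name are made up for the example.

```python
import io
import tarfile

def member_kind(member):
    """Classify a TarInfo entry as 'dir', 'file', or 'other'."""
    if member.isdir():
        return "dir"
    if member.isfile():
        return "file"
    return "other"  # links, devices, and so on

def tar_inventory(fileobj):
    """List members of a possibly compressed tar archive without extracting."""
    # Mode "r:*" lets tarfile detect gzip, bzip2, or xz compression itself.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        return [
            {
                "name": m.name,
                "kind": member_kind(m),
                "size": m.size,
                "mode": oct(m.mode),  # stored permission bits
            }
            for m in archive.getmembers()
        ]

buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    payload = b"x,y\n1,2\n"
    info = tarfile.TarInfo(name="run/data.csv")
    info.size = len(payload)
    info.mode = 0o644
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)
listing = tar_inventory(buffer)
```

Inspecting names before extraction also helps catch absolute paths or ".." components in archives from untrusted sources.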
Time Series and Waveform Data
.wav - Waveform Audio
Description: Audio waveform data
Typical Data: Acoustic signals, audio recordings
Use Cases: Acoustic analysis, ultrasound, signal processing
Python Libraries:
- scipy.io.wavfile: scipy.io.wavfile.read()
- wave: Built-in module
- soundfile: Enhanced audio I/O
EDA Approach:
- Sample rate and duration
- Bit depth and channels
- Amplitude distribution
- Spectral analysis (FFT)
- Signal-to-noise ratio
- Clipping detection
- Frequency content
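The header-level checks and a simple clipping test can be done with the built-in wave module alone. This sketch assumes 16-bit PCM for the sample-level part; wav_profile is a hypothetical helper and the test tone is synthetic.

```python
import io
import math
import struct
import wave

def wav_profile(fileobj):
    """Read header fields and, for 16-bit PCM, peak level and clipped samples."""
    with wave.open(fileobj, "rb") as wav:
        rate = wav.getframerate()
        n_frames = wav.getnframes()
        channels = wav.getnchannels()
        width = wav.getsampwidth()  # bytes per sample
        frames = wav.readframes(n_frames)
    profile = {
        "sample_rate": rate,
        "channels": channels,
        "bit_depth": 8 * width,
        "duration_s": n_frames / rate,
    }
    if width == 2:  # 16-bit PCM samples are signed little-endian
        samples = struct.unpack(f"<{len(frames) // 2}h", frames)
        profile["peak"] = max((abs(s) for s in samples), default=0)
        # Samples pinned at full scale suggest clipping.
        profile["clipped"] = sum(1 for s in samples if s in (32767, -32768))
    return profile

# Synthetic one-second 440 Hz tone, mono, 8 kHz, well below full scale.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(8000)
    tone = [int(20000 * math.sin(2 * math.pi * 440 * n / 8000)) for n in range(8000)]
    wav.writeframes(struct.pack("<8000h", *tone))
buffer.seek(0)
profile = wav_profile(buffer)
```

Spectral analysis would typically continue from here with scipy.io.wavfile.read() and an FFT on the resulting NumPy array.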
.mat - MATLAB Data
Description: MATLAB workspace variables
Typical Data: Arrays, structures, cells
Use Cases: MATLAB-Python interoperability
Python Libraries:
- scipy.io: scipy.io.loadmat() for files up to v7.2 (v7.3 is not supported)
- h5py: For MATLAB v7.3 files (HDF5-based)
- mat73: Convenience loader for v7.3 files, built on h5py
EDA Approach:
- Variable names and types
- Array dimensions
- Structure field exploration
- Cell array handling
- Sparse matrix detection
- MATLAB version compatibility
- Metadata extraction
.edf - European Data Format
Description: Time series data (especially medical)
Typical Data: EEG, physiological signals
Use Cases: Medical signal storage
Python Libraries:
- pyedflib: EDF/EDF+ reading and writing
- mne: Neurophysiology data (supports EDF)
EDA Approach:
- Signal count and names
- Sampling frequencies
- Signal ranges and units
- Recording duration
- Annotation events
- Data quality (saturation, noise)
- Patient/study information
.csv (Time Series)
Description: CSV with timestamp column
Typical Data: Time-indexed measurements
Use Cases: Sensor data, monitoring, experiments
Python Libraries:
- pandas: pd.read_csv() with parse_dates
EDA Approach:
- Temporal range and resolution
- Sampling regularity
- Missing time points
- Trend and seasonality
- Stationarity tests
- Autocorrelation
- Anomaly detection
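Sampling regularity and missing time points can be estimated from the differences between consecutive timestamps. The sketch below uses only the standard library and assumes ISO-formatted timestamps in a column named "timestamp" and at least two rows; both the column name and the helper sampling_report are illustrative.

```python
import csv
import io
from collections import Counter
from datetime import datetime

def sampling_report(text_stream, time_column="timestamp"):
    """Find the dominant sampling interval and gaps larger than it."""
    times = sorted(
        datetime.fromisoformat(row[time_column])
        for row in csv.DictReader(text_stream)
    )
    deltas = [b - a for a, b in zip(times, times[1:])]
    # Treat the most frequent interval as the nominal sampling step.
    step, _ = Counter(deltas).most_common(1)[0]
    gaps = [(a, b) for a, b in zip(times, times[1:]) if b - a > step]
    # Number of samples that would fill each gap at the nominal step.
    missing = sum((b - a) // step - 1 for a, b in gaps)
    return {
        "start": times[0],
        "end": times[-1],
        "step": step,
        "gaps": gaps,
        "missing_points": missing,
    }

sample = io.StringIO(
    "timestamp,value\n"
    "2024-01-01 00:00:00,1.0\n"
    "2024-01-01 00:01:00,1.1\n"
    "2024-01-01 00:02:00,1.3\n"
    "2024-01-01 00:05:00,1.2\n"
    "2024-01-01 00:06:00,1.4\n"
)
report = sampling_report(sample)
```

With pandas, the equivalent starting point would be pd.read_csv(..., parse_dates=[...]) followed by a diff() on the time column.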
Geospatial and Environmental Data
.shp - Shapefile
Description: Geospatial vector data
Typical Data: Geographic features (points, lines, polygons)
Use Cases: GIS analysis, spatial data
Python Libraries:
- geopandas: gpd.read_file('file.shp')
- fiona: Lower-level shapefile access
- pyshp: Pure Python shapefile reader
EDA Approach:
- Geometry type and count
- Coordinate reference system
- Bounding box
- Attribute table analysis
- Geometry validity
- Spatial distribution
- Multi-part features
- Associated files (.shx, .dbf, .prj)
.geojson - GeoJSON
Description: JSON format for geographic data
Typical Data: Features with geometry and properties
Use Cases: Web mapping, spatial analysis
Python Libraries:
- geopandas: Native GeoJSON support
- json: Parse as JSON then process
EDA Approach:
- Feature count and types
- CRS specification
- Bounding box calculation
- Property schema
- Geometry complexity
- Nesting structure
.tif / .tiff (Geospatial)
Description: GeoTIFF with spatial reference
Typical Data: Satellite imagery, DEMs, rasters
Use Cases: Remote sensing, terrain analysis
Python Libraries:
- rasterio: rasterio.open('file.tif')
- gdal: Geospatial Data Abstraction Library (Python bindings import as osgeo.gdal)
- xarray with rioxarray: N-D geospatial arrays
EDA Approach:
- Raster dimensions and resolution
- Band count and descriptions
- Coordinate reference system
- Geotransform parameters
- NoData value handling
- Pixel value distribution
- Histogram analysis
- Overviews and pyramids
.nc / .netcdf - Network Common Data Form
Description: Self-describing array-based data
Typical Data: Climate, atmospheric, oceanographic data
Use Cases: Scientific datasets, model output
Python Libraries:
- netCDF4: netCDF4.Dataset('file.nc')
- xarray: xr.open_dataset('file.nc')
EDA Approach:
- Variable enumeration
- Dimension analysis
- Time series properties
- Spatial coverage
- Attribute metadata (CF conventions)
- Coordinate systems
- Chunking and compression
- Data quality flags
.grib / .grib2 - Gridded Binary
Description: Meteorological data format
Typical Data: Weather forecasts, climate data
Use Cases: Numerical weather prediction
Python Libraries:
- pygrib: GRIB file reading
- xarray with cfgrib: GRIB to xarray
EDA Approach:
- Message inventory
- Parameter and level types
- Spatial grid specification
- Temporal coverage
- Ensemble members
- Forecast vs analysis
- Data packing and precision
.hdf / .hdf4 - HDF4 Format
Description: Older HDF format
Typical Data: NASA Earth Science data
Use Cases: Satellite data (MODIS, etc.)
Python Libraries:
- pyhdf: HDF4 access
- gdal: Can read HDF4
EDA Approach:
- Scientific dataset listing
- Vdata and attributes
- Dimension scales
- Metadata extraction
- Quality flags
- Conversion to HDF5 or NetCDF
Specialized Scientific Formats
.fits - Flexible Image Transport System
Description: Astronomy data format
Typical Data: Images, tables, spectra from telescopes
Use Cases: Astronomical observations
Python Libraries:
- astropy.io.fits: fits.open('file.fits')
- fitsio: Alternative FITS library
EDA Approach:
- HDU (Header Data Unit) structure
- Image dimensions and WCS
- Header keyword analysis
- Table column descriptions
- Data type and scaling
- FITS convention compliance
- Checksum validation
.asdf - Advanced Scientific Data Format
Description: Astronomy data format combining a human-readable YAML metadata tree with binary array blocks
Typical Data: Complex hierarchical scientific data
Use Cases: James Webb Space Telescope data
Python Libraries:
- asdf: asdf.open('file.asdf')
EDA Approach:
- Tree structure exploration
- Schema validation
- Internal vs external arrays
- Compression methods
- YAML metadata
- Version compatibility
.root - ROOT Data Format
Description: CERN ROOT framework format
Typical Data: High-energy physics data
Use Cases: Particle physics experiments
Python Libraries:
- uproot: Pure Python ROOT reading
- ROOT: Official PyROOT bindings
EDA Approach:
- TTree structure
- Branch types and entries
- Histogram inventory
- Event loop statistics
- File compression
- Split level analysis
.txt - Plain Text Data
Description: Generic text-based data
Typical Data: Tab/space-delimited, custom formats
Use Cases: Simple data exchange, logs
Python Libraries:
- pandas: pd.read_csv() with custom delimiters
- numpy: np.loadtxt(), np.genfromtxt()
- Built-in file reading
EDA Approach:
- Format detection (delimiter, header)
- Data type inference
- Comment line handling
- Missing value codes
- Column alignment
- Encoding detection
.dat - Generic Data File
Description: Binary or text data
Typical Data: Instrument output, custom formats
Use Cases: Various scientific instruments
Python Libraries:
- Format-specific: requires knowledge of structure
- numpy: np.fromfile() for binary
- struct: Parse binary structures
EDA Approach:
- Binary vs text determination
- Header detection
- Record structure inference
- Endianness
- Data type patterns
- Validation with documentation
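Binary-versus-text determination and an endianness check can be sketched with the struct module. Both helpers below are heuristics with hypothetical names, and the float32 record layout is an assumption made for the example; a real instrument file needs its documented layout.

```python
import struct

def looks_binary(chunk):
    """Heuristic: NUL bytes or undecodable UTF-8 suggest binary content."""
    if b"\x00" in chunk:
        return True
    try:
        # A chunk cut in the middle of a multi-byte character can also fail
        # here, so sample on a boundary you trust or treat this as a hint.
        chunk.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

def decode_both_ways(chunk, count):
    """Interpret the first `count` float32 values as little- and big-endian."""
    little = struct.unpack(f"<{count}f", chunk[: 4 * count])
    big = struct.unpack(f">{count}f", chunk[: 4 * count])
    return little, big

# Hypothetical record: three float32 readings written little-endian.
record = struct.pack("<3f", 20.5, 21.0, 19.75)
little, big = decode_both_ways(record, 3)
```

Decoding with the wrong byte order typically yields implausible magnitudes (here the big-endian reading gives vanishingly small numbers), which is often enough to pick the right one. For whole files, np.fromfile() with dtype "<f4" or ">f4" applies the same idea.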
.log - Log Files
Description: Text logs from software/instruments
Typical Data: Timestamped events, messages
Use Cases: Troubleshooting, experiment tracking
Python Libraries:
- Built-in file reading
- pandas: Structured log parsing
- Regular expressions for parsing
EDA Approach:
- Log level distribution
- Timestamp parsing
- Error and warning frequency
- Event sequencing
- Pattern recognition
- Anomaly detection
- Session boundaries
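Log level distribution and timestamp parsing usually start from a regular expression. The pattern below assumes lines of the form "YYYY-MM-DD HH:MM:SS LEVEL message"; real logs vary, so both the pattern and the helper log_profile are illustrative.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout: "YYYY-MM-DD HH:MM:SS LEVEL message".
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(DEBUG|INFO|WARNING|ERROR|CRITICAL) (.*)$"
)

def log_profile(lines):
    """Count log levels, unparsed lines, and the covered time span."""
    levels = Counter()
    timestamps = []
    unparsed = 0
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            unparsed += 1  # e.g. continuation lines of a traceback
            continue
        timestamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
        levels[match.group(2)] += 1
    span = (max(timestamps) - min(timestamps)) if timestamps else None
    return {"levels": dict(levels), "unparsed": unparsed, "span": span}

sample = [
    "2024-03-01 09:00:00 INFO acquisition started",
    "2024-03-01 09:00:05 WARNING detector temperature high",
    "2024-03-01 09:00:09 ERROR readout timeout",
    "  retrying readout",
    "2024-03-01 09:00:30 INFO acquisition finished",
]
profile = log_profile(sample)
```

A high share of unparsed lines is a signal that the assumed layout does not match the log and the pattern needs adjusting.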