references/general_scientific_formats.md

General Scientific Data Formats Reference

This reference covers general-purpose scientific data formats used across multiple disciplines.

Numerical and Array Data

.npy - NumPy Array

Description: Binary NumPy array format
Typical Data: N-dimensional arrays of any data type
Use Cases: Fast I/O for numerical data, intermediate results
Python Libraries:
- numpy: np.load('file.npy'), np.save()
- Memory-mapped access: np.load('file.npy', mmap_mode='r')
EDA Approach:
- Array shape and dimensionality
- Data type and precision
- Statistical summary (mean, std, min, max, percentiles)
- Missing or invalid values (NaN, inf)
- Memory footprint
- Value distribution and histogram
- Sparsity analysis
- Correlation structure (if 2D)
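
A minimal sketch of the first few checks in the list above (shape, dtype, invalid values, summary statistics, sparsity). The helper name summarize_array and the synthetic demo array are illustrative assumptions, not part of NumPy.

```python
import os
import tempfile

import numpy as np


def summarize_array(arr):
    """Return basic EDA facts for a numeric NumPy array (hypothetical helper)."""
    summary = {
        "shape": arr.shape,
        "ndim": arr.ndim,
        "dtype": str(arr.dtype),
        "nbytes": arr.nbytes,  # memory footprint of the raw data
    }
    if np.issubdtype(arr.dtype, np.floating):
        summary["nan_count"] = int(np.isnan(arr).sum())
        summary["inf_count"] = int(np.isinf(arr).sum())
        values = arr[np.isfinite(arr)]  # statistics on finite values only
    else:
        values = arr.ravel()
    if values.size:
        summary["mean"] = float(values.mean())
        summary["std"] = float(values.std())
        summary["min"] = float(values.min())
        summary["max"] = float(values.max())
        summary["zero_fraction"] = float(np.mean(values == 0))  # crude sparsity
    return summary


# Demo on a small synthetic array written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.npy")
    np.save(path, np.array([[0.0, 1.0, np.nan], [2.0, np.inf, 0.0]]))
    # mmap_mode="r" maps the file instead of reading it all into memory.
    data = np.load(path, mmap_mode="r")
    npy_summary = summarize_array(data)
    del data  # release the memory map before the directory is removed
print(npy_summary)
```

For very large files, the same function can be applied to a slice of the memory-mapped array rather than the whole thing.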

.npz - Compressed NumPy Archive

Description: Multiple NumPy arrays in one file
Typical Data: Collections of related arrays
Use Cases: Saving multiple arrays together, compressed storage
Python Libraries:
- numpy: np.load('file.npz') returns dict-like object
- np.savez() or np.savez_compressed()
EDA Approach:
- List of contained arrays
- Individual array analysis
- Relationships between arrays
- Total file size and compression ratio
- Naming conventions
- Data consistency checks
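
As a rough illustration of listing the contained arrays and estimating the compression ratio, the sketch below builds an archive in memory and inspects it. The helper describe_npz and the array names signal and labels are made up for the example.

```python
import io

import numpy as np


def describe_npz(file):
    """Map each array name in an .npz archive to (shape, dtype, nbytes)."""
    info = {}
    with np.load(file) as archive:
        for name in archive.files:  # names of the stored arrays
            arr = archive[name]  # arrays are loaded lazily, one at a time
            info[name] = (arr.shape, str(arr.dtype), arr.nbytes)
    return info


# Build a small compressed archive in memory for the demo.
buffer = io.BytesIO()
np.savez_compressed(
    buffer,
    signal=np.zeros((100, 3), dtype=np.float64),
    labels=np.arange(100, dtype=np.int64),
)
compressed_size = buffer.getbuffer().nbytes
buffer.seek(0)

npz_info = describe_npz(buffer)
raw_size = sum(nbytes for _, _, nbytes in npz_info.values())
print(npz_info)
print("approximate compression ratio:", raw_size / compressed_size)
```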

.csv - Comma-Separated Values

Description: Plain text tabular data
Typical Data: Experimental measurements, results tables
Use Cases: Universal data exchange, spreadsheet export
Python Libraries:
- pandas: pd.read_csv('file.csv')
- csv: Built-in module
- polars: High-performance CSV reading
- numpy: np.loadtxt() or np.genfromtxt()
EDA Approach:
- Row and column counts
- Data type inference
- Missing value patterns and frequency
- Column statistics (numeric: mean, std; categorical: frequencies)
- Outlier detection
- Correlation matrix
- Duplicate row detection
- Header and index validation
- Encoding issues detection
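
A small sketch of three of the checks above (row and column counts, missing values per column, duplicate rows) using only the built-in csv module. The set of missing-value tokens and the helper name profile_csv are assumptions for illustration; real files may use other codes.

```python
import csv
import io
from collections import Counter

# Assumed missing-value codes; adjust to the conventions of the file at hand.
MISSING_TOKENS = {"", "NA", "NaN", "null"}


def profile_csv(text_stream):
    """Count rows, missing values per column, duplicates and ragged rows."""
    reader = csv.reader(text_stream)
    header = next(reader, [])
    missing = Counter()
    seen = set()
    rows = duplicates = ragged = 0
    for record in reader:
        rows += 1
        if len(record) != len(header):
            ragged += 1  # row width differs from the header width
        key = tuple(record)
        if key in seen:
            duplicates += 1
        seen.add(key)
        for column, value in zip(header, record):
            if value.strip() in MISSING_TOKENS:
                missing[column] += 1
    return {
        "rows": rows,
        "columns": header,
        "missing": dict(missing),
        "duplicate_rows": duplicates,
        "ragged_rows": ragged,
    }


sample = "id,temp,site\n1,20.5,A\n2,,B\n2,,B\n3,NA,\n"
csv_profile = profile_csv(io.StringIO(sample))
print(csv_profile)
```

For a file on disk, open it with newline="" and an explicit encoding before passing it to the function, as the csv module documentation recommends.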

.tsv / .tab - Tab-Separated Values

Description: Tab-delimited tabular data
Typical Data: Similar to CSV but tab-separated
Use Cases: Bioinformatics, text processing output
Python Libraries:
- pandas: pd.read_csv('file.tsv', sep='\t')
EDA Approach:
- Same as CSV format
- Tab vs space validation
- Quote handling

.xlsx / .xls - Excel Spreadsheets

Description: Microsoft Excel binary/XML formats
Typical Data: Tabular data with formatting, formulas
Use Cases: Lab notebooks, data entry, reports
Python Libraries:
- pandas: pd.read_excel('file.xlsx')
- openpyxl: Full Excel file manipulation
- xlrd: Reading .xls (legacy)
EDA Approach:
- Sheet enumeration and names
- Per-sheet data analysis
- Formula evaluation
- Merged cells handling
- Hidden rows/columns
- Data validation rules
- Named ranges
- Formatting-only cells detection

.json - JavaScript Object Notation

Description: Hierarchical text data format
Typical Data: Nested data structures, metadata
Use Cases: API responses, configuration, results
Python Libraries:
- json: Built-in module
- pandas: pd.read_json()
- ujson: Faster JSON parsing
EDA Approach:
- Schema inference
- Nesting depth
- Key-value distribution
- Array lengths
- Data type consistency
- Missing keys
- Duplicate detection
- Size and complexity metrics
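
Nesting depth and key frequency can be measured with a short recursive walk. The sketch below is one way to do it; the helper names json_depth and key_counts and the sample document are invented for the example.

```python
import json
from collections import Counter


def json_depth(node):
    """Nesting depth: scalars count as 0, each dict or list adds one level."""
    if isinstance(node, dict):
        return 1 + max((json_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((json_depth(v) for v in node), default=0)
    return 0


def key_counts(node, counts=None):
    """Count how often each object key appears anywhere in the document."""
    counts = Counter() if counts is None else counts
    if isinstance(node, dict):
        for key, value in node.items():
            counts[key] += 1
            key_counts(value, counts)
    elif isinstance(node, list):
        for item in node:
            key_counts(item, counts)
    return counts


doc = json.loads('{"run": {"id": 1, "samples": [{"id": 2, "t": 0.1}, {"id": 3}]}}')
depth = json_depth(doc)
counts = key_counts(doc)
print(depth, dict(counts))
```

Comparing a key's count with the number of records that should carry it (here "t" appears in one of two samples) is a quick way to spot missing keys.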

.xml - Extensible Markup Language

Description: Hierarchical markup format
Typical Data: Structured data with metadata
Use Cases: Standards-based data exchange, APIs
Python Libraries:
- lxml: lxml.etree.parse()
- xml.etree.ElementTree: Built-in XML
- xmltodict: Convert XML to dict
EDA Approach:
- Schema/DTD validation
- Element hierarchy and depth
- Namespace handling
- Attribute vs element content
- CDATA sections
- Text content extraction
- Sibling and child counts
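
A sketch of the hierarchy checks above using the built-in xml.etree.ElementTree module: tag frequencies, attribute usage, and maximum depth. The helper xml_profile and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def xml_profile(root):
    """Walk the tree iteratively and collect tag, attribute and depth stats."""
    tag_counts = Counter()
    attr_counts = Counter()
    max_depth = 0
    stack = [(root, 1)]  # (element, depth) pairs; the root has depth 1
    while stack:
        element, depth = stack.pop()
        tag_counts[element.tag] += 1
        attr_counts.update(element.attrib.keys())
        max_depth = max(max_depth, depth)
        stack.extend((child, depth + 1) for child in element)
    return {
        "tags": dict(tag_counts),
        "attributes": dict(attr_counts),
        "max_depth": max_depth,
    }


root = ET.fromstring(
    '<experiment id="e1">'
    '<sample id="s1"><value unit="K">300</value></sample>'
    '<sample id="s2"/>'
    "</experiment>"
)
xml_stats = xml_profile(root)
print(xml_stats)
```

For documents from untrusted sources, the Python documentation advises caution with the standard XML parsers; a hardened parser may be a better fit there.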

.yaml / .yml - YAML

Description: Human-readable data serialization
Typical Data: Configuration, metadata, parameters
Use Cases: Experiment configurations, pipelines
Python Libraries:
- yaml (PyYAML): yaml.safe_load() is preferred; yaml.load() needs an explicit Loader and should not be used on untrusted input
- ruamel.yaml: YAML 1.2 support
EDA Approach:
- Configuration structure
- Data type handling
- List and dict depth
- Anchor and alias usage
- Multi-document files
- Comments preservation
- Validation against schema

.toml - TOML Configuration

Description: Configuration file format
Typical Data: Settings, parameters
Use Cases: Python package configuration, settings
Python Libraries:
- tomli / tomllib: TOML reading (tomllib in Python 3.11+)
- toml: Reading and writing
EDA Approach:
- Section structure
- Key-value pairs
- Data type inference
- Nested table validation
- Required vs optional fields

.ini - INI Configuration

Description: Simple configuration format
Typical Data: Application settings
Use Cases: Legacy configurations, simple settings
Python Libraries:
- configparser: Built-in INI parser
EDA Approach:
- Section enumeration
- Key-value extraction
- Type conversion
- Comment handling
- Case sensitivity
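
The points above can be sketched with configparser as follows. The section and option names are invented for the example. The demo also shows the default case handling: section names keep their case, while option names are lowercased.

```python
import configparser

SAMPLE_INI = """\
[Instrument]
Name = spectrometer
Gain = 2.5

[acquisition]
averages = 16
enabled = yes
; comment lines starting with ';' or '#' are skipped
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_INI)

# Section enumeration and key-value extraction (values are always strings).
sections = parser.sections()
ini_values = {name: dict(parser[name]) for name in sections}

# Type conversion goes through the typed getters.
gain = parser.getfloat("Instrument", "Gain")
enabled = parser.getboolean("acquisition", "enabled")

print(sections)
print(ini_values)
print(gain, enabled)
```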

Binary and Compressed Data

.hdf5 / .h5 - Hierarchical Data Format 5

Description: Container for large scientific datasets
Typical Data: Multi-dimensional arrays, metadata, groups
Use Cases: Large datasets, multi-modal data, parallel I/O
Python Libraries:
- h5py: h5py.File('file.h5', 'r')
- pytables: Advanced HDF5 interface
- pandas: HDF5 storage via HDFStore
EDA Approach:
- Group and dataset hierarchy
- Dataset shapes and dtypes
- Attributes and metadata
- Compression and chunking strategy
- Memory-efficient sampling
- Dataset relationships
- File size and efficiency
- Access patterns optimization

.zarr - Chunked Array Storage

Description: Cloud-optimized chunked arrays
Typical Data: Large N-dimensional arrays
Use Cases: Cloud storage, parallel computing, streaming
Python Libraries:
- zarr: zarr.open('file.zarr')
- xarray: Zarr backend support
EDA Approach:
- Array metadata and dimensions
- Chunk size optimization
- Compression codec and ratio
- Synchronizer and store type
- Multi-scale hierarchies
- Parallel access performance
- Attribute metadata

.gz / .gzip - Gzip Compressed

Description: Compressed data files
Typical Data: Any compressed text or binary
Use Cases: Compression for storage/transfer
Python Libraries:
- gzip: Built-in gzip module
- pandas: Automatic gzip handling in read functions
EDA Approach:
- Compression ratio
- Original file type detection
- Decompression validation
- Header information
- Multi-member archives
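
A minimal sketch of these checks with the built-in gzip module. It reads everything into memory, so it only suits small files. The helper gzip_report is hypothetical, and the UTF-8 test is only a rough guess at whether the payload is text.

```python
import gzip


def gzip_report(raw_bytes):
    """Validate a gzip blob and report size, header and content hints."""
    if raw_bytes[:2] != b"\x1f\x8b":  # gzip magic number
        raise ValueError("not a gzip stream")
    # Bytes 4-7 of the header hold the modification time (little-endian).
    header_mtime = int.from_bytes(raw_bytes[4:8], "little")
    # decompress() raises on corrupt data and concatenates multi-member input.
    payload = gzip.decompress(raw_bytes)
    try:
        payload.decode("utf-8")
        looks_like_text = True
    except UnicodeDecodeError:
        looks_like_text = False
    return {
        "compressed_bytes": len(raw_bytes),
        "original_bytes": len(payload),
        "ratio": len(payload) / len(raw_bytes),
        "header_mtime": header_mtime,
        "looks_like_text": looks_like_text,
    }


# Two members concatenated, as produced by `cat a.gz b.gz > both.gz`.
blob = gzip.compress(b"a,b\n1,2\n" * 500, mtime=0) + gzip.compress(b"3,4\n", mtime=0)
gz_report = gzip_report(blob)
print(gz_report)
```

For large files, streaming through gzip.open() in chunks avoids holding the whole payload in memory.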

.bz2 - Bzip2 Compressed

Description: Bzip2 compression
Typical Data: Highly compressed files
Use Cases: Usually smaller output than gzip, at the cost of slower compression and decompression
Python Libraries:
- bz2: Built-in bz2 module
- Automatic handling in pandas
EDA Approach:
- Compression efficiency
- Decompression time
- Content validation

.zip - ZIP Archive

Description: Archive with multiple files
Typical Data: Collections of files
Use Cases: File distribution, archiving
Python Libraries:
- zipfile: Built-in ZIP support
- pandas: Can read zipped CSVs
EDA Approach:
- Archive member listing
- Compression method per file
- Total vs compressed size
- Directory structure
- File type distribution
- Extraction validation
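
Member listing, per-file compression method, and an integrity check can be sketched with zipfile without extracting anything to disk. The helper zip_inventory and the member names are made up for the demo.

```python
import io
import zipfile


def zip_inventory(file):
    """List archive members and run the built-in CRC check."""
    with zipfile.ZipFile(file) as archive:
        corrupt = archive.testzip()  # name of the first bad member, or None
        members = [
            {
                "name": info.filename,
                "size": info.file_size,
                "compressed": info.compress_size,
                "method": info.compress_type,
                "is_dir": info.is_dir(),
            }
            for info in archive.infolist()
        ]
    return members, corrupt


# Build a small archive in memory with one deflated and one stored member.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("data/run1.csv", "x,y\n" + "1,2\n" * 200)
    archive.writestr("README.txt", "demo", compress_type=zipfile.ZIP_STORED)
buffer.seek(0)

zip_members, zip_corrupt = zip_inventory(buffer)
for member in zip_members:
    print(member)
```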

.tar / .tar.gz - TAR Archive

Description: Unix tape archive
Typical Data: Multiple files and directories
Use Cases: Software distribution, backups
Python Libraries:
- tarfile: Built-in TAR support
EDA Approach:
- Member file listing
- Compression (if .tar.gz, .tar.bz2)
- Directory structure
- Permissions preservation
- Extraction testing
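
A sketch of member listing with tarfile. Mode "r:*" lets the module detect the compression itself, and listing members without extracting them avoids writing untrusted paths to disk. The helper tar_inventory and the member name are illustrative.

```python
import io
import tarfile


def tar_inventory(fileobj):
    """Return (name, kind, size, mode) for every member of a TAR archive."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        rows = []
        for member in archive.getmembers():
            if member.isdir():
                kind = "dir"
            elif member.isfile():
                kind = "file"
            else:
                kind = "other"  # links, devices, and so on
            rows.append((member.name, kind, member.size, oct(member.mode)))
    return rows


# Build a tiny .tar.gz in memory for the demo.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
    payload = b"hello"
    info = tarfile.TarInfo("results/run.txt")
    info.size = len(payload)
    info.mode = 0o644
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

tar_members = tar_inventory(buffer)
print(tar_members)
```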

Time Series and Waveform Data

.wav - Waveform Audio

Description: Audio waveform data
Typical Data: Acoustic signals, audio recordings
Use Cases: Acoustic analysis, ultrasound, signal processing
Python Libraries:
- scipy.io.wavfile: scipy.io.wavfile.read()
- wave: Built-in module
- soundfile: Enhanced audio I/O
EDA Approach:
- Sample rate and duration
- Bit depth and channels
- Amplitude distribution
- Spectral analysis (FFT)
- Signal-to-noise ratio
- Clipping detection
- Frequency content
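
The header-level checks (sample rate, duration, bit depth, channels) and a simple clipping count can be sketched with the built-in wave module. This version only decodes 16-bit PCM; the helper wav_profile and the synthetic 440 Hz tone are assumptions for the demo.

```python
import io
import math
import struct
import wave


def wav_profile(file):
    """Read WAV header fields and, for 16-bit PCM, peak and clipping counts."""
    with wave.open(file, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
        frames = wav.getnframes()
        raw = wav.readframes(frames)
    profile = {
        "channels": channels,
        "bit_depth": 8 * sample_width,
        "sample_rate": rate,
        "duration_s": frames / rate,
    }
    if sample_width == 2:  # 16-bit PCM is little-endian signed
        samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
        profile["peak"] = max((abs(s) for s in samples), default=0)
        # Samples sitting at the integer limits suggest clipping.
        profile["clipped_samples"] = sum(s in (-32768, 32767) for s in samples)
    return profile


# Synthesize one second of a 440 Hz tone at 8 kHz, mono, 16-bit.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(8000)
    tone = [int(16000 * math.sin(2 * math.pi * 440 * n / 8000)) for n in range(8000)]
    wav.writeframes(struct.pack("<8000h", *tone))
buffer.seek(0)

wav_info = wav_profile(buffer)
print(wav_info)
```

Spectral analysis and signal-to-noise estimates need an FFT, for which NumPy or SciPy would be the usual choice.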

.mat - MATLAB Data

Description: MATLAB workspace variables
Typical Data: Arrays, structures, cells
Use Cases: MATLAB-Python interoperability
Python Libraries:
- scipy.io: scipy.io.loadmat() (MAT-file versions up to v7.2)
- h5py: For MATLAB v7.3 files (HDF5-based)
- mat73: Loads v7.3 files into Python dicts (built on h5py)
EDA Approach:
- Variable names and types
- Array dimensions
- Structure field exploration
- Cell array handling
- Sparse matrix detection
- MATLAB version compatibility
- Metadata extraction

.edf - European Data Format

Description: Time series data (especially medical)
Typical Data: EEG, physiological signals
Use Cases: Medical signal storage
Python Libraries:
- pyedflib: EDF/EDF+ reading and writing
- mne: Neurophysiology data (supports EDF)
EDA Approach:
- Signal count and names
- Sampling frequencies
- Signal ranges and units
- Recording duration
- Annotation events
- Data quality (saturation, noise)
- Patient/study information

.csv (Time Series)

Description: CSV with timestamp column
Typical Data: Time-indexed measurements
Use Cases: Sensor data, monitoring, experiments
Python Libraries:
- pandas: pd.read_csv() with parse_dates
EDA Approach:
- Temporal range and resolution
- Sampling regularity
- Missing time points
- Trend and seasonality
- Stationarity tests
- Autocorrelation
- Anomaly detection
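
To illustrate the temporal range, sampling regularity, and missing-time-point checks, the sketch below uses only the standard library rather than pandas, so it runs without extra dependencies. It assumes ISO 8601 timestamps in a column named timestamp; both the column name and the helper sampling_report are hypothetical.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def sampling_report(text_stream, time_column="timestamp"):
    """Estimate the nominal sampling step and list gaps larger than it."""
    reader = csv.DictReader(text_stream)
    times = sorted(datetime.fromisoformat(row[time_column]) for row in reader)
    pairs = list(zip(times, times[1:]))
    if not pairs:
        return None  # fewer than two samples: no step to estimate
    # The most frequent interval is taken as the nominal sampling step.
    nominal, _ = Counter(b - a for a, b in pairs).most_common(1)[0]
    gaps = [(a, b) for a, b in pairs if b - a > nominal]
    return {
        "start": times[0],
        "end": times[-1],
        "nominal_step_s": nominal.total_seconds(),
        "gaps": gaps,
    }


sample = (
    "timestamp,value\n"
    "2024-01-01T00:00:00,1.0\n"
    "2024-01-01T00:01:00,1.1\n"
    "2024-01-01T00:02:00,1.2\n"
    "2024-01-01T00:05:00,1.5\n"
    "2024-01-01T00:06:00,1.6\n"
)
ts_report = sampling_report(io.StringIO(sample))
print(ts_report)
```

With pandas, the same idea can be expressed by parsing the column with parse_dates and differencing the resulting index.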

Geospatial and Environmental Data

.shp - Shapefile

Description: Geospatial vector data
Typical Data: Geographic features (points, lines, polygons)
Use Cases: GIS analysis, spatial data
Python Libraries:
- geopandas: gpd.read_file('file.shp')
- fiona: Lower-level shapefile access
- pyshp: Pure Python shapefile reader
EDA Approach:
- Geometry type and count
- Coordinate reference system
- Bounding box
- Attribute table analysis
- Geometry validity
- Spatial distribution
- Multi-part features
- Associated files (.shx, .dbf, .prj)

.geojson - GeoJSON

Description: JSON format for geographic data
Typical Data: Features with geometry and properties
Use Cases: Web mapping, spatial analysis
Python Libraries:
- geopandas: Native GeoJSON support
- json: Parse as JSON then process
EDA Approach:
- Feature count and types
- CRS specification
- Bounding box calculation
- Property schema
- Geometry complexity
- Nesting structure

.tif / .tiff (Geospatial)

Description: GeoTIFF with spatial reference
Typical Data: Satellite imagery, DEMs, rasters
Use Cases: Remote sensing, terrain analysis
Python Libraries:
- rasterio: rasterio.open('file.tif')
- gdal: Geospatial Data Abstraction Library
- xarray with rioxarray: N-D geospatial arrays
EDA Approach:
- Raster dimensions and resolution
- Band count and descriptions
- Coordinate reference system
- Geotransform parameters
- NoData value handling
- Pixel value distribution
- Histogram analysis
- Overviews and pyramids

.nc / .netcdf - Network Common Data Form

Description: Self-describing array-based data
Typical Data: Climate, atmospheric, oceanographic data
Use Cases: Scientific datasets, model output
Python Libraries:
- netCDF4: netCDF4.Dataset('file.nc')
- xarray: xr.open_dataset('file.nc')
EDA Approach:
- Variable enumeration
- Dimension analysis
- Time series properties
- Spatial coverage
- Attribute metadata (CF conventions)
- Coordinate systems
- Chunking and compression
- Data quality flags

.grib / .grib2 - Gridded Binary

Description: Meteorological data format
Typical Data: Weather forecasts, climate data
Use Cases: Numerical weather prediction
Python Libraries:
- pygrib: GRIB file reading
- xarray with cfgrib: GRIB to xarray
EDA Approach:
- Message inventory
- Parameter and level types
- Spatial grid specification
- Temporal coverage
- Ensemble members
- Forecast vs analysis
- Data packing and precision

.hdf4 - HDF4 Format

Description: Older HDF format
Typical Data: NASA Earth Science data
Use Cases: Satellite data (MODIS, etc.)
Python Libraries:
- pyhdf: HDF4 access
- gdal: Can read HDF4
EDA Approach:
- Scientific dataset listing
- Vdata and attributes
- Dimension scales
- Metadata extraction
- Quality flags
- Conversion to HDF5 or NetCDF

Specialized Scientific Formats

.fits - Flexible Image Transport System

Description: Astronomy data format
Typical Data: Images, tables, spectra from telescopes
Use Cases: Astronomical observations
Python Libraries:
- astropy.io.fits: fits.open('file.fits')
- fitsio: Alternative FITS library
EDA Approach:
- HDU (Header Data Unit) structure
- Image dimensions and WCS
- Header keyword analysis
- Table column descriptions
- Data type and scaling
- FITS convention compliance
- Checksum validation

.asdf - Advanced Scientific Data Format

Description: Next-generation data format for astronomy
Typical Data: Complex hierarchical scientific data
Use Cases: James Webb Space Telescope data
Python Libraries:
- asdf: asdf.open('file.asdf')
EDA Approach:
- Tree structure exploration
- Schema validation
- Internal vs external arrays
- Compression methods
- YAML metadata
- Version compatibility

.root - ROOT Data Format

Description: CERN ROOT framework format
Typical Data: High-energy physics data
Use Cases: Particle physics experiments
Python Libraries:
- uproot: Pure Python ROOT reading
- ROOT: Official PyROOT bindings
EDA Approach:
- TTree structure
- Branch types and entries
- Histogram inventory
- Event loop statistics
- File compression
- Split level analysis

.txt - Plain Text Data

Description: Generic text-based data
Typical Data: Tab/space-delimited, custom formats
Use Cases: Simple data exchange, logs
Python Libraries:
- pandas: pd.read_csv() with custom delimiters
- numpy: np.loadtxt(), np.genfromtxt()
- Built-in file reading
EDA Approach:
- Format detection (delimiter, header)
- Data type inference
- Comment line handling
- Missing value codes
- Column alignment
- Encoding detection

.dat - Generic Data File

Description: Binary or text data
Typical Data: Instrument output, custom formats
Use Cases: Various scientific instruments
Python Libraries:
- Format-specific: requires knowledge of structure
- numpy: np.fromfile() for binary
- struct: Parse binary structures
EDA Approach:
- Binary vs text determination
- Header detection
- Record structure inference
- Endianness
- Data type patterns
- Validation with documentation
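
Because .dat has no fixed layout, any parser depends on documentation for the specific instrument. The sketch below assumes a purely hypothetical layout (a 4-byte magic string and a 32-bit record count, followed by float64/int16 records) to show a text-vs-binary heuristic and an endianness check based on whether the implied file size matches.

```python
import struct

HEADER = "4sI"  # hypothetical header: magic bytes + unsigned record count
RECORD = "dh"   # hypothetical record: float64 timestamp + int16 reading


def looks_like_text(blob, sample_size=1024):
    """Heuristic: no NUL bytes and valid UTF-8 in the leading sample."""
    sample = blob[:sample_size]
    if b"\x00" in sample:
        return False
    try:
        sample.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def guess_byte_order(blob):
    """Try both byte orders; keep the one whose record count fits the size."""
    for prefix in ("<", ">"):  # little-endian, then big-endian
        _magic, count = struct.unpack_from(prefix + HEADER, blob, 0)
        expected = struct.calcsize(prefix + HEADER) + count * struct.calcsize(prefix + RECORD)
        if expected == len(blob):
            return prefix, count
    return None, None


# Build a big-endian demo blob with three records.
rows = [(0.0, 10), (0.5, -20), (1.0, 30)]
blob = struct.pack(">" + HEADER, b"DEMO", len(rows))
blob += b"".join(struct.pack(">" + RECORD, t, v) for t, v in rows)

order, count = guess_byte_order(blob)
offset = struct.calcsize(order + HEADER)
records = list(struct.iter_unpack(order + RECORD, blob[offset:]))
print(looks_like_text(blob), order, count, records)
```

The size-consistency test is only one possible plausibility check; checking that decoded values fall in a physically sensible range is another.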

.log - Log Files

Description: Text logs from software/instruments
Typical Data: Timestamped events, messages
Use Cases: Troubleshooting, experiment tracking
Python Libraries:
- Built-in file reading
- pandas: Structured log parsing
- Regular expressions for parsing
EDA Approach:
- Log level distribution
- Timestamp parsing
- Error and warning frequency
- Event sequencing
- Pattern recognition
- Anomaly detection
- Session boundaries
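
A sketch of log level distribution and timestamp parsing with a regular expression. Log layouts vary widely; the pattern below assumes lines of the form "YYYY-MM-DD HH:MM:SS LEVEL message" and is an example, not a general parser. Lines that do not match (such as traceback continuations) are counted separately.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout: "2024-01-01 10:00:00 LEVEL message".
LINE_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+) (?P<msg>.*)$"
)


def profile_log(lines):
    """Count log levels, unparsed lines and the covered time span."""
    levels = Counter()
    timestamps = []
    unparsed = 0
    for line in lines:
        match = LINE_PATTERN.match(line.rstrip("\n"))
        if match is None:
            unparsed += 1
            continue
        levels[match["level"]] += 1
        timestamps.append(datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"))
    span_seconds = None
    if timestamps:
        span_seconds = (max(timestamps) - min(timestamps)).total_seconds()
    return {"levels": dict(levels), "unparsed": unparsed, "span_seconds": span_seconds}


sample_lines = [
    "2024-01-01 10:00:00 INFO start\n",
    "2024-01-01 10:00:05 WARNING drift detected\n",
    "Traceback (most recent call last):\n",
    "2024-01-01 10:01:00 ERROR acquisition failed\n",
]
log_stats = profile_log(sample_lines)
print(log_stats)
```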
