references/bioinformatics_genomics_formats.md

Bioinformatics and Genomics File Formats Reference

This reference covers file formats used in genomics, transcriptomics, sequence analysis, and related bioinformatics applications.

Sequence Data Formats

.fasta / .fa / .fna - FASTA Format

Description: Text-based format for nucleotide or protein sequences
Typical Data: DNA, RNA, or protein sequences with headers
Use Cases: Sequence storage, BLAST searches, alignments

Python Libraries:
- Biopython: SeqIO.parse('file.fasta', 'fasta')
- pyfaidx: Fast indexed FASTA access
- screed: Fast sequence parsing

EDA Approach:
- Sequence count and length distribution
- GC content analysis
- N content (ambiguous bases)
- Sequence ID parsing
- Duplicate detection
- Quality metrics for assemblies (N50, L50)
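As a rough illustration of the first few FASTA checks above, the following sketch computes sequence lengths, GC content, N content, and N50 using only the standard library. The helper names (`read_fasta`, `gc_fraction`, `n50`) and the toy records are invented for this example; for real files, the Biopython or pyfaidx readers listed above are likely the more robust choice.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def gc_fraction(seq):
    """GC fraction over unambiguous bases (A, C, G, T); 0.0 if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0

def n50(lengths):
    """Largest length L such that sequences of length >= L hold at least half of all bases."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy input (hypothetical data): three records, the first wrapped over two lines.
toy_fasta = io.StringIO(">seq1 toy\nACGTNNGC\nGG\n>seq2\nATAT\n>seq3\nGGCC\n")
records = list(read_fasta(toy_fasta))
lengths = [len(seq) for _, seq in records]
n_counts = [seq.upper().count("N") for _, seq in records]
gc_values = [gc_fraction(seq) for _, seq in records]
assembly_n50 = n50(lengths)
```

Excluding ambiguous bases from the GC denominator is one possible convention; some tools divide by the full sequence length instead.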

.fastq / .fq - FASTQ Format

Description: Sequence data with base quality scores
Typical Data: Raw sequencing reads with Phred quality scores
Use Cases: NGS data, quality control, read mapping

Python Libraries:
- Biopython: SeqIO.parse('file.fastq', 'fastq')
- pysam: Fast FASTQ/BAM operations
- HTSeq: Sequencing data analysis

EDA Approach:
- Read count and length distribution
- Quality score distribution (per-base, per-read)
- GC content and bias
- Duplicate rate estimation
- Adapter contamination detection
- k-mer frequency analysis
- Encoding format validation (Phred33/64)
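The quality-score and encoding checks above can be sketched without any third-party library. This example assumes simple four-line FASTQ records (no wrapped sequences); `read_fastq`, `phred_scores`, and `guess_offset` are hypothetical helper names, and the offset guess is only a heuristic that falls back to Phred+64 when no low-ASCII character is seen.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality_string) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:], seq, qual

def phred_scores(qual, offset=33):
    """Decode a quality string into integer Phred scores."""
    return [ord(char) - offset for char in qual]

def guess_offset(quality_strings):
    """Heuristic: characters below ASCII 59 (';') can only occur in Phred+33 data."""
    lowest = min(min(map(ord, qual)) for qual in quality_strings)
    return 33 if lowest < 59 else 64

# Toy reads (hypothetical data).
toy_fastq = "@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\n!5?I\n"
reads = list(read_fastq(io.StringIO(toy_fastq)))
offset = guess_offset(qual for _, _, qual in reads)
mean_quals = [sum(phred_scores(qual, offset)) / len(qual) for _, _, qual in reads]
```

A file containing only high-quality Phred+33 reads would be misclassified by this heuristic, so sampling many reads before deciding is advisable.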

.sam - Sequence Alignment/Map

Description: Tab-delimited text format for alignments
Typical Data: Aligned sequencing reads with mapping quality
Use Cases: Read alignment storage, variant calling

Python Libraries:
- pysam: pysam.AlignmentFile('file.sam', 'r')
- HTSeq: HTSeq.SAM_Reader('file.sam')

EDA Approach:
- Mapping rate and quality distribution
- Coverage analysis
- Insert size distribution (paired-end)
- Alignment flags distribution
- CIGAR string patterns
- Mismatch and indel rates
- Duplicate and supplementary alignment counts
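To make the flag and mapping-rate checks above concrete, here is a small standard-library sketch that decodes the SAM FLAG bit field and tallies records. The bit values follow the SAM specification; the function names and the toy alignments are made up for illustration, and pysam would normally be used for real data.

```python
import io
from collections import Counter

# FLAG bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "read1", 0x80: "read2",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate", 0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def sam_summary(handle):
    """Count records per flag name and collect MAPQ values of mapped reads."""
    counts, mapq = Counter(), []
    for line in handle:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        counts["records"] += 1
        counts.update(decode_flag(flag))
        if not flag & 0x4:
            mapq.append(int(fields[4]))
    return counts, mapq

# Toy SAM text (hypothetical data): one proper pair, one unmapped read, one duplicate.
toy_sam = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t1024\tchr1\t300\t30\t4M\t*\t0\t0\tACGT\tIIII",
]) + "\n"
counts, mapq = sam_summary(io.StringIO(toy_sam))
mapping_rate = 1 - counts["unmapped"] / counts["records"]
```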

.bam - Binary Alignment/Map

Description: Compressed binary version of SAM
Typical Data: Aligned reads in compressed format
Use Cases: Efficient storage and processing of alignments

Python Libraries:
- pysam: Full BAM support with indexing
- bamnostic: Pure Python BAM reader

EDA Approach (same as SAM, plus):
- Compression ratio analysis
- Index file (.bai) validation
- Chromosome-wise statistics
- Strand bias detection
- Read group analysis

.cram - CRAM Format

Description: Highly compressed alignment format
Typical Data: Reference-compressed aligned reads
Use Cases: Long-term storage, space-efficient archives

Python Libraries:
- pysam: CRAM support (requires reference)
- Reference genome must be accessible

EDA Approach:
- Compression efficiency vs BAM
- Reference dependency validation
- Lossy vs lossless compression assessment
- Decompression performance
- Similar alignment metrics as BAM

.bed - Browser Extensible Data

Description: Tab-delimited format for genomic features
Typical Data: Genomic intervals (chr, start, end) with annotations
Use Cases: Peak calling, variant annotation, genome browsing

Python Libraries:
- pybedtools: pybedtools.BedTool('file.bed')
- pyranges: pyranges.read_bed('file.bed')
- pandas: Simple BED reading

EDA Approach:
- Feature count and size distribution
- Chromosome distribution
- Strand bias
- Score distribution (if present)
- Overlap and proximity analysis
- Coverage statistics
- Gap analysis between features
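The size and coverage checks above depend on BED's 0-based, half-open coordinates, so a feature's width is simply end minus start. The sketch below, with hypothetical helper names and toy intervals, computes feature widths and the number of bases covered after merging overlaps.

```python
import io
from collections import defaultdict

def read_bed(handle):
    """Yield (chrom, start, end) from the first three BED columns."""
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, *_ = line.rstrip("\n").split("\t")
        yield chrom, int(start), int(end)

def covered_bases(intervals):
    """Total bases covered by at least one interval (overlaps counted once)."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))
    total = 0
    for spans in by_chrom.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:          # overlapping or adjacent: extend
                cur_end = max(cur_end, end)
            else:                         # gap: close the current run
                total += cur_end - cur_start
                cur_start, cur_end = start, end
        total += cur_end - cur_start
    return total

# Toy BED text (hypothetical data).
toy_bed = "chr1\t0\t100\tf1\nchr1\t50\t150\tf2\nchr1\t200\t250\tf3\nchr2\t10\t20\tf4\n"
intervals = list(read_bed(io.StringIO(toy_bed)))
widths = [end - start for _, start, end in intervals]
covered = covered_bases(intervals)
```

The difference between `sum(widths)` and `covered` gives a quick measure of how much the features overlap.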

.bedGraph - BED with Graph Data

Description: BED-like four-column format that assigns a signal value to each interval
Typical Data: Continuous-valued genomic data (coverage, signals)
Use Cases: Coverage tracks, ChIP-seq signals, methylation

Python Libraries:
- pyBigWig: Can convert to bigWig
- pybedtools: BedGraph operations

EDA Approach:
- Signal distribution statistics
- Genome coverage percentage
- Signal dynamics (peaks, valleys)
- Chromosome-wise signal patterns
- Quantile analysis
- Zero-coverage regions
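Because each bedGraph line covers an interval rather than a single base, signal statistics are usually weighted by interval width. A minimal sketch of that idea follows; `bedgraph_stats` and the toy track are assumptions made for this example.

```python
import io

def bedgraph_stats(handle):
    """Return (bases_described, zero_signal_bases, width_weighted_mean_signal)."""
    total_bases = zero_bases = 0
    weighted_sum = 0.0
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        _chrom, start, end, value = line.split()[:4]
        width = int(end) - int(start)      # 0-based, half-open like BED
        signal = float(value)
        total_bases += width
        weighted_sum += signal * width
        if signal == 0:
            zero_bases += width
    mean = weighted_sum / total_bases if total_bases else 0.0
    return total_bases, zero_bases, mean

# Toy bedGraph text (hypothetical data).
toy_bedgraph = "chr1\t0\t10\t0\nchr1\t10\t20\t2.5\nchr1\t20\t40\t1.0\n"
total_bases, zero_bases, mean_signal = bedgraph_stats(io.StringIO(toy_bedgraph))
```

Regions absent from the file are not counted here; comparing `total_bases` with the genome size would be needed to estimate genome coverage percentage.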

.bigWig / .bw - Binary BigWig

Description: Indexed binary format for genome-wide signal data
Typical Data: Continuous genomic signals (compressed and indexed)
Use Cases: Efficient genome browser tracks, large-scale data

Python Libraries:
- pyBigWig: pyBigWig.open('file.bw')
- pybbi: BigWig/BigBed interface

EDA Approach:
- Signal statistics extraction
- Zoom level analysis
- Regional signal extraction
- Efficient genome-wide summaries
- Compression efficiency
- Index structure analysis

.bigBed / .bb - Binary BigBed

Description: Indexed binary BED format
Typical Data: Genomic features (compressed and indexed)
Use Cases: Large feature sets, genome browsers

Python Libraries:
- pybbi: BigBed reading
- pybigtools: Modern BigBed interface

EDA Approach:
- Feature density analysis
- Efficient interval queries
- Zoom level validation
- Index performance metrics
- Feature size statistics

.gff / .gff3 - General Feature Format

Description: Tab-delimited format for genomic annotations
Typical Data: Gene models, transcripts, exons, regulatory elements
Use Cases: Genome annotation, gene prediction

Python Libraries:
- BCBio.GFF (bcbio-gff package): GFF parser that produces Biopython SeqRecord objects
- gffutils: gffutils.create_db('file.gff3')
- pyranges: GFF support

EDA Approach:
- Feature type distribution (gene, exon, CDS, etc.)
- Gene structure validation
- Strand balance
- Hierarchical relationship validation
- Phase validation for CDS
- Attribute completeness
- Gene model statistics (introns, exons per gene)
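Several of the GFF3 checks above (feature type distribution, parent-child validation, CDS phase) come down to parsing column 9, where attributes are `key=value` pairs separated by semicolons and special characters are percent-encoded. The following is a minimal sketch under those assumptions; the helper names and toy gene model are invented for illustration.

```python
import io
from collections import Counter
from urllib.parse import unquote

def parse_attributes(column9):
    """Parse GFF3 column 9 into {key: [values]}, decoding percent-escapes."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        attrs[key] = [unquote(v) for v in value.split(",")]
    return attrs

def read_gff3(handle):
    """Yield one dict per feature line; stop at an embedded ##FASTA section."""
    for line in handle:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}")
        yield {"seqid": cols[0], "type": cols[2], "start": int(cols[3]),
               "end": int(cols[4]), "strand": cols[6], "phase": cols[7],
               "attributes": parse_attributes(cols[8])}

# Toy annotation (hypothetical data): one gene, one transcript, two exons, one CDS.
toy_gff3 = "\n".join([
    "##gff-version 3",
    "chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=ABC",
    "chr1\t.\tmRNA\t1000\t2000\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\t.\texon\t1000\t1200\t.\t+\t.\tID=ex1;Parent=tx1",
    "chr1\t.\texon\t1500\t2000\t.\t+\t.\tID=ex2;Parent=tx1",
    "chr1\t.\tCDS\t1100\t1200\t.\t+\t0\tID=cds1;Parent=tx1",
]) + "\n"
features = list(read_gff3(io.StringIO(toy_gff3)))
type_counts = Counter(f["type"] for f in features)
ids = {i for f in features for i in f["attributes"].get("ID", [])}
orphan_parents = [p for f in features
                  for p in f["attributes"].get("Parent", []) if p not in ids]
bad_phase = [f for f in features
             if f["type"] == "CDS" and f["phase"] not in {"0", "1", "2"}]
# GFF coordinates are 1-based and inclusive, so length is end - start + 1.
exon_lengths = [f["end"] - f["start"] + 1 for f in features if f["type"] == "exon"]
```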

.gtf - Gene Transfer Format

Description: GFF2-based format for gene annotations
Typical Data: Gene and transcript annotations
Use Cases: RNA-seq analysis, gene quantification

Python Libraries:
- pyranges: pyranges.read_gtf('file.gtf')
- gffutils: GTF database creation
- HTSeq: GTF reading for counts

EDA Approach:
- Transcript isoform analysis
- Gene structure completeness
- Exon number distribution
- Transcript length distribution
- TSS and TES analysis
- Biotype distribution
- Overlapping gene detection

.vcf - Variant Call Format

Description: Text format for genetic variants
Typical Data: SNPs, indels, structural variants with annotations
Use Cases: Variant calling, population genetics, GWAS

Python Libraries:
- pysam: pysam.VariantFile('file.vcf')
- cyvcf2: Fast VCF parsing
- PyVCF: Older but comprehensive

EDA Approach:
- Variant count by type (SNP, indel, SV)
- Quality score distribution
- Allele frequency spectrum
- Transition/transversion ratio
- Heterozygosity rates
- Missing genotype analysis
- Hardy-Weinberg equilibrium
- Annotation completeness (if annotated)
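Two of the VCF checks above, variant counts by type and the transition/transversion ratio, can be sketched from the REF and ALT columns alone. The classification rules below are a simplification chosen for this example (for instance, every symbolic allele is grouped into one bucket), and the function names and toy variants are hypothetical.

```python
import io
from collections import Counter

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def read_alleles(handle):
    """Yield (chrom, pos, ref, alt), one tuple per ALT allele."""
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            yield chrom, int(pos), ref.upper(), alt.upper()

def classify(ref, alt):
    """Coarse variant type from allele strings (a simplification)."""
    if alt in {".", "*"}:
        return "other"
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic_or_breakend"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    return "MNP" if len(ref) == len(alt) else "indel"

# Toy VCF body (hypothetical data); the last site is multi-allelic.
toy_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT\t60\tPASS\t.",
    "chr1\t300\t.\tA\tC\t40\tPASS\t.",
    "chr1\t400\t.\tAT\tA\t30\tPASS\t.",
    "chr1\t500\t.\tG\tA,T\t70\tPASS\t.",
]) + "\n"
alleles = list(read_alleles(io.StringIO(toy_vcf)))
type_counts = Counter(classify(ref, alt) for _, _, ref, alt in alleles)
snps = [(ref, alt) for _, _, ref, alt in alleles if classify(ref, alt) == "SNP"]
n_ts = sum(pair in TRANSITIONS for pair in snps)
n_tv = len(snps) - n_ts
ts_tv = n_ts / n_tv if n_tv else float("nan")
```

Splitting multi-allelic sites into one entry per ALT allele, as done here, is one common convention; counting sites instead of alleles gives different totals.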

.bcf - Binary VCF

Description: Compressed binary variant format
Typical Data: Same as VCF but binary
Use Cases: Efficient variant storage and processing

Python Libraries:
- pysam: Full BCF support
- cyvcf2: Optimized BCF reading

EDA Approach (same as VCF, plus):
- Compression efficiency
- Indexing validation
- Read performance metrics

.gvcf - Genomic VCF

Description: VCF with reference confidence blocks
Typical Data: All positions (variant and non-variant)
Use Cases: Joint genotyping workflows, GATK

Python Libraries:
- pysam: GVCF support
- Standard VCF parsers

EDA Approach:
- Reference block analysis
- Coverage uniformity
- Variant density
- Genotype quality across genome
- Reference confidence distribution

RNA-Seq and Expression Data

.counts - Gene Count Matrix

Description: Tab-delimited gene expression counts
Typical Data: Gene IDs with read counts per sample
Use Cases: RNA-seq quantification, differential expression

Python Libraries:
- pandas: pd.read_csv('file.counts', sep='\t')
- scanpy (for single-cell): sc.read_csv()

EDA Approach:
- Library size distribution
- Detection rate (genes per sample)
- Zero-inflation analysis
- Count distribution (log scale)
- Outlier sample detection
- Correlation between replicates
- PCA for sample relationships
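The first three count-matrix checks above (library size, detection rate, zero inflation) are simple column-wise tallies. The sketch below uses only the `csv` module and assumes a layout with gene IDs in the first column and one column per sample; that layout, the function name, and the toy matrix are assumptions for this example, and pandas would be the more natural tool for real matrices.

```python
import csv
import io

def count_matrix_qc(handle):
    """Per-sample library size and detected-gene count, plus overall zero fraction."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]            # header: gene_id, sample1, sample2, ...
    library_size = dict.fromkeys(samples, 0)
    detected = dict.fromkeys(samples, 0)
    n_values = n_zeros = 0
    for row in reader:
        if not row:
            continue
        for sample, value in zip(samples, row[1:]):
            count = int(value)
            library_size[sample] += count
            detected[sample] += count > 0   # a gene is "detected" with >= 1 read
            n_zeros += count == 0
            n_values += 1
    zero_fraction = n_zeros / n_values if n_values else 0.0
    return library_size, detected, zero_fraction

# Toy count matrix (hypothetical data).
toy_counts = "gene_id\ts1\ts2\ts3\ng1\t10\t0\t5\ng2\t0\t0\t0\ng3\t90\t20\t45\n"
library_size, detected, zero_fraction = count_matrix_qc(io.StringIO(toy_counts))
```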

.tpm / .fpkm - Normalized Expression

Description: Normalized gene expression values
Typical Data: TPM (transcripts per million) or FPKM values
Use Cases: Cross-sample comparison, visualization

Python Libraries:
- pandas: Standard CSV reading
- anndata: For integrated analysis

EDA Approach:
- Expression distribution
- Highly expressed gene identification
- Sample clustering
- Batch effect detection
- Coefficient of variation analysis
- Dynamic range assessment
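For readers checking how such values relate to raw counts, the standard definitions can be written in a few lines: TPM divides each count by feature length and rescales so the sample sums to one million, while FPKM divides by length in kilobases and by total fragments in millions. This is a sketch with hypothetical function names and toy numbers, not the procedure of any particular quantification tool (which may use effective lengths, for example).

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalised rates rescaled to sum to 1e6."""
    rates = [count / length for count, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [rate / scale * 1e6 for rate in rates]

def fpkm(counts, lengths_bp):
    """Fragments per kilobase per million mapped fragments."""
    total = sum(counts)
    return [count * 1e9 / (length * total) for count, length in zip(counts, lengths_bp)]

# Toy sample (hypothetical data): genes 1 and 2 have equal counts but different lengths.
counts = [100, 100, 300]
lengths_bp = [1000, 4000, 2000]
tpm_values = tpm(counts, lengths_bp)
fpkm_values = fpkm(counts, lengths_bp)
```

A quick sanity check for a file claiming to hold TPM is that each sample column sums to roughly one million; FPKM columns have no such fixed total.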

.mtx - Matrix Market Format

Description: Sparse matrix format (common in single-cell)
Typical Data: Sparse count matrices (cells × genes)
Use Cases: Single-cell RNA-seq, large sparse matrices

Python Libraries:
- scipy.io: scipy.io.mmread('file.mtx')
- scanpy: sc.read_mtx('file.mtx')

EDA Approach:
- Sparsity analysis
- Cell and gene filtering thresholds
- Doublet detection metrics
- Mitochondrial fraction
- UMI count distribution
- Gene detection per cell
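The sparsity check above follows directly from the Matrix Market coordinate layout: a banner line, optional `%` comments, a size line (`rows cols nonzeros`), then one 1-based `row col value` triple per stored entry. The sketch below reads that layout with the standard library; it assumes numeric (not `pattern`) entries, the names are hypothetical, and whether rows are cells or genes depends on the tool that wrote the file.

```python
import io

def read_mtx_coordinate(handle):
    """Return (n_rows, n_cols, entries) with 0-based (row, col, value) entries."""
    banner = handle.readline()
    if not banner.startswith("%%MatrixMarket matrix coordinate"):
        raise ValueError("not a Matrix Market coordinate file")
    line = handle.readline()
    while line.startswith("%"):           # skip comment lines
        line = handle.readline()
    n_rows, n_cols, nnz = map(int, line.split())
    entries = []
    for line in handle:
        if line.strip():
            i, j, value = line.split()
            entries.append((int(i) - 1, int(j) - 1, float(value)))
    if len(entries) != nnz:
        raise ValueError(f"size line declares {nnz} entries, found {len(entries)}")
    return n_rows, n_cols, entries

# Toy matrix (hypothetical data): 3 rows, 4 columns, 5 stored entries.
toy_mtx = (
    "%%MatrixMarket matrix coordinate integer general\n"
    "% toy example\n"
    "3 4 5\n"
    "1 1 2\n1 3 1\n2 2 7\n3 1 1\n3 4 4\n"
)
n_rows, n_cols, entries = read_mtx_coordinate(io.StringIO(toy_mtx))
density = len(entries) / (n_rows * n_cols)
row_sums = [0.0] * n_rows
col_sums = [0.0] * n_cols
for i, j, value in entries:
    row_sums[i] += value
    col_sums[j] += value
```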

.h5ad - AnnData Format

Description: HDF5-based annotated data matrix
Typical Data: Expression matrix with metadata (cells, genes)
Use Cases: Single-cell RNA-seq analysis with Scanpy

Python Libraries:
- scanpy: sc.read_h5ad('file.h5ad')
- anndata: Direct AnnData manipulation

EDA Approach:
- Cell and gene counts
- Metadata completeness
- Layer availability (raw, normalized)
- Embedding presence (PCA, UMAP)
- QC metrics distribution
- Batch information
- Cell type annotation coverage

.loom - Loom Format

Description: HDF5-based format for omics data
Typical Data: Expression matrices with metadata
Use Cases: Single-cell data, RNA velocity analysis

Python Libraries:
- loompy: loompy.connect('file.loom')
- scanpy: Can import loom files

EDA Approach:
- Layer analysis (spliced, unspliced)
- Row and column attribute exploration
- Graph connectivity analysis
- Cluster assignments
- Velocity-specific metrics

.rds - R Data Serialization

Description: R object storage (often Seurat objects)
Typical Data: R analysis results, especially single-cell
Use Cases: R-Python data exchange

Python Libraries:
- pyreadr: pyreadr.read_r('file.rds') (reads data frames; it does not load Seurat or other complex R objects)
- rpy2: For full R integration
- Conversion tools to AnnData

EDA Approach:
- Object type identification
- Data structure exploration
- Metadata extraction
- Conversion validation

Alignment and Assembly Formats

.maf - Multiple Alignment Format

Description: Text format for multiple sequence alignments
Typical Data: Genome-wide or local multiple alignments
Use Cases: Comparative genomics, conservation analysis

Python Libraries:
- Biopython: AlignIO.parse('file.maf', 'maf')
- bx-python: MAF-specific tools

EDA Approach:
- Alignment block statistics
- Species coverage
- Gap analysis
- Conservation scoring
- Alignment quality metrics
- Block length distribution

.axt - Pairwise Alignment Format

Description: Pairwise alignment format (UCSC)
Typical Data: Pairwise genomic alignments
Use Cases: Genome comparison, synteny analysis

Python Libraries:
- Custom parsers (simple format)
- bx-python: AXT support

EDA Approach:
- Alignment score distribution
- Identity percentage
- Syntenic block identification
- Gap size analysis
- Coverage statistics

.chain - Chain Alignment Format

Description: Genome coordinate mapping chains
Typical Data: Coordinate transformations between genome builds
Use Cases: Liftover, coordinate conversion

Python Libraries:
- pyliftover: Chain file usage
- Custom parsers for chain format

EDA Approach:
- Chain score distribution
- Coverage of source genome
- Gap analysis
- Inversion detection
- Mapping quality assessment
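Since custom parsers are suggested above, here is a sketch of one for the UCSC chain layout: a header line (`chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id`) followed by block lines of the form `size dt dq`, where the last block of each chain carries only `size`. The function and key names and the toy chain are hypothetical.

```python
import io

def read_chains(handle):
    """Yield one dict per chain with header fields and (size, dt, dq) blocks."""
    chain = None
    for line in handle:
        fields = line.split()
        if not fields or line.startswith("#"):
            continue
        if fields[0] == "chain":
            if chain is not None:
                yield chain
            chain = {"score": int(fields[1]),
                     "t_name": fields[2], "t_start": int(fields[5]), "t_end": int(fields[6]),
                     "q_name": fields[7], "q_strand": fields[9],
                     "q_start": int(fields[10]), "q_end": int(fields[11]),
                     "blocks": []}
        else:
            # The final block has no gap columns; pad them with zeros.
            size, dt, dq = (list(map(int, fields)) + [0, 0])[:3]
            chain["blocks"].append((size, dt, dq))
    if chain is not None:
        yield chain

def chain_summary(chain):
    """Aligned bases, gap totals, and a header/block consistency flag."""
    aligned = sum(size for size, _, _ in chain["blocks"])
    t_gap = sum(dt for _, dt, _ in chain["blocks"])
    q_gap = sum(dq for _, _, dq in chain["blocks"])
    consistent = (chain["t_end"] - chain["t_start"] == aligned + t_gap
                  and chain["q_end"] - chain["q_start"] == aligned + q_gap)
    return {"aligned": aligned, "t_gap": t_gap, "q_gap": q_gap,
            "inverted": chain["q_strand"] == "-", "consistent": consistent}

# Toy chain (hypothetical data).
toy_chain = "chain 1000 chr1 5000 + 100 400 chr1 5100 + 150 450 1\n100 20 30\n80 10 0\n90\n\n"
chains = list(read_chains(io.StringIO(toy_chain)))
summary = chain_summary(chains[0])
```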

.psl - Pattern Space Layout

Description: BLAT alignment output format
Typical Data: Alignment results from BLAT
Use Cases: Transcript mapping, similarity searches

Python Libraries:
- Custom parsers (tab-delimited)
- Biopython: SearchIO.parse('file.psl', 'blat-psl')

EDA Approach:
- Match percentage distribution
- Gap statistics
- Query coverage
- Multiple mapping analysis
- Alignment quality metrics

Genome Assembly and Annotation

.agp - Assembly Golden Path

Description: Assembly structure description
Typical Data: Scaffold composition, gap information
Use Cases: Genome assembly representation

Python Libraries:
- Custom parsers (simple tab-delimited)
- Assembly analysis tools

EDA Approach:
- Scaffold statistics (N50, L50)
- Gap type and size distribution
- Component length analysis
- Assembly contiguity metrics
- Unplaced contig analysis

.scaffolds / .contigs - Assembly Sequences

Description: Assembled sequences (usually FASTA)
Typical Data: Assembled genomic sequences
Use Cases: Genome assembly output

Python Libraries:
- Same as FASTA format
- Assembly-specific tools (QUAST)

EDA Approach:
- Assembly statistics (N50, N90, etc.)
- Length distribution
- Coverage analysis
- Gap (N) content
- Duplication assessment
- BUSCO completeness (requires a suitable lineage dataset)

.2bit - Compressed Genome Format

Description: UCSC compact genome format
Typical Data: Reference genomes (highly compressed)
Use Cases: Efficient genome storage and access

Python Libraries:
- py2bit: py2bit.open('file.2bit')
- twobitreader: Alternative reader

EDA Approach:
- Compression efficiency
- Random access performance
- Sequence extraction validation
- Masked region analysis
- N content and distribution

.sizes - Chromosome Sizes

Description: Simple format with chromosome lengths
Typical Data: Tab-delimited chromosome names and sizes
Use Cases: Genome browsers, coordinate validation

Python Libraries:
- Simple file reading with pandas
- Built into many genomic tools

EDA Approach:
- Genome size calculation
- Chromosome count
- Size distribution
- Karyotype validation
- Completeness check against reference

Phylogenetics and Evolution

.nwk / .newick - Newick Tree Format

Description: Parenthetical tree representation
Typical Data: Phylogenetic trees with branch lengths
Use Cases: Evolutionary analysis, tree visualization

Python Libraries:
- Biopython: Phylo.read('file.nwk', 'newick')
- ete3: ete3.Tree('file.nwk')
- dendropy: Phylogenetic computing

EDA Approach:
- Tree structure analysis (tips, internal nodes)
- Branch length distribution
- Tree balance metrics
- Ultrametricity check
- Bootstrap support analysis
- Topology validation
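To show what the structure and ultrametricity checks above involve, the sketch below parses a simple Newick string into nested dictionaries and compares root-to-tip distances. It is a deliberately minimal parser that assumes no quoted labels and no bracketed comments; the function names and the toy tree are hypothetical, and the libraries listed above handle the full format.

```python
def parse_newick(text):
    """Parse a simple Newick string into {'name', 'length', 'children'} dicts."""
    text = text.strip().rstrip(";")
    pos = 0

    def peek():
        return text[pos:pos + 1]

    def node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1
            children.append(node())
            while peek() == ",":
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name, length = text[start:pos], None
        if peek() == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = node()
    if pos != len(text):
        raise ValueError(f"unexpected text at position {pos}")
    return root

def walk(node):
    """Yield every node in the tree, parents before children."""
    yield node
    for child in node["children"]:
        yield from walk(child)

def tip_distances(node, depth=0.0):
    """Map each tip name to its distance from the root."""
    depth += node["length"] or 0.0
    if not node["children"]:
        return {node["name"]: depth}
    distances = {}
    for child in node["children"]:
        distances.update(tip_distances(child, depth))
    return distances

# Toy tree (hypothetical data); the internal label "90" is read as a support value.
tree = parse_newick("((A:0.1,B:0.2)90:0.3,C:0.4);")
tips = [n["name"] for n in walk(tree) if not n["children"]]
n_internal = sum(1 for n in walk(tree) if n["children"])
supports = [float(n["name"]) for n in walk(tree) if n["children"] and n["name"]]
distances = tip_distances(tree)
is_ultrametric = max(distances.values()) - min(distances.values()) < 1e-9
```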

.nexus - Nexus Format

Description: Rich format for phylogenetic data
Typical Data: Alignments, trees, character matrices
Use Cases: Phylogenetic software interchange

Python Libraries:
- Biopython: Nexus support
- dendropy: Comprehensive Nexus handling

EDA Approach:
- Data block analysis
- Character type distribution
- Tree block validation
- Taxa consistency
- Command block parsing
- Format compliance checking

.phylip - PHYLIP Format

Description: Sequence alignment format (strict/relaxed)
Typical Data: Multiple sequence alignments
Use Cases: Phylogenetic analysis input

Python Libraries:
- Biopython: AlignIO.read('file.phy', 'phylip')
- dendropy: PHYLIP support

EDA Approach:
- Alignment dimensions
- Sequence length uniformity
- Gap position analysis
- Informative site calculation
- Format variant detection (strict vs relaxed)
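The dimension, gap, and informative-site checks above can be illustrated on a small alignment. This sketch assumes relaxed, sequential (non-interleaved) PHYLIP with whitespace between name and sequence, and it uses the usual parsimony definition of an informative site: at least two states that each occur in at least two sequences. The helper names, the choice of missing-data symbols, and the toy alignment are assumptions for this example.

```python
import io
from collections import Counter

def read_relaxed_phylip(handle):
    """Read a sequential relaxed-PHYLIP alignment into {name: sequence}."""
    n_taxa, n_sites = map(int, handle.readline().split())
    alignment = {}
    for line in handle:
        if line.strip():
            name, seq = line.split(None, 1)
            alignment[name] = "".join(seq.split()).upper()
    if len(alignment) != n_taxa:
        raise ValueError(f"header declares {n_taxa} taxa, found {len(alignment)}")
    if any(len(seq) != n_sites for seq in alignment.values()):
        raise ValueError("sequence length does not match the header")
    return alignment

def parsimony_informative_sites(alignment, missing="-?N"):
    """0-based columns with >= 2 states that each appear in >= 2 sequences."""
    sites = []
    for index, column in enumerate(zip(*alignment.values())):
        states = Counter(char for char in column if char not in missing)
        if sum(1 for n in states.values() if n >= 2) >= 2:
            sites.append(index)
    return sites

# Toy nucleotide alignment (hypothetical data).
toy_phylip = "4 6\ntaxA ACGTAC\ntaxB ACGTAA\ntaxC ATGTAA\ntaxD ATG-AC\n"
alignment = read_relaxed_phylip(io.StringIO(toy_phylip))
informative = parsimony_informative_sites(alignment)
gap_columns = [i for i, column in enumerate(zip(*alignment.values())) if "-" in column]
```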

.paml - PAML Output

Description: Output from PAML phylogenetic software
Typical Data: Evolutionary model results, dN/dS ratios
Use Cases: Molecular evolution analysis

Python Libraries:
- Custom parsers for specific PAML programs
- Biopython: Basic PAML parsing

EDA Approach:
- Model parameter extraction
- Likelihood values
- dN/dS ratio distribution
- Branch-specific results
- Convergence assessment

Protein and Structure Data

.embl - EMBL Format

Description: Rich sequence annotation format
Typical Data: Sequences with extensive annotations
Use Cases: Sequence databases, genome records

Python Libraries:
- Biopython: SeqIO.read('file.embl', 'embl')

EDA Approach:
- Feature annotation completeness
- Sequence length and type
- Reference information
- Cross-reference validation
- Feature overlap analysis

.genbank / .gb / .gbk - GenBank Format

Description: NCBI's sequence annotation format
Typical Data: Annotated sequences with features
Use Cases: Sequence databases, annotation transfer

Python Libraries:
- Biopython: SeqIO.parse('file.gb', 'genbank')

EDA Approach:
- Feature type distribution
- CDS analysis (start codons, stops)
- Translation validation
- Annotation completeness
- Source organism extraction
- Reference and publication info
- Locus tag consistency

.sff - Standard Flowgram Format

Description: 454/Roche sequencing data format
Typical Data: Raw pyrosequencing flowgrams
Use Cases: Legacy 454 sequencing data

Python Libraries:
- Biopython: SeqIO.parse('file.sff', 'sff')
- Platform-specific tools

EDA Approach:
- Read count and length
- Flowgram signal quality
- Key sequence detection
- Adapter trimming validation
- Quality score distribution

.hdf5 (Genomics Specific)

Description: HDF5 for genomics (10X, Hi-C, etc.)
Typical Data: High-throughput genomics data
Use Cases: 10X Genomics, spatial transcriptomics

Python Libraries:
- h5py: Low-level access
- scanpy: For 10X data
- cooler: For Hi-C data

EDA Approach:
- Dataset structure exploration
- Barcode statistics
- UMI counting
- Feature-barcode matrix analysis
- Spatial coordinates (if applicable)

.cool / .mcool - Cooler Format

Description: HDF5-based Hi-C contact matrices
Typical Data: Chromatin interaction matrices
Use Cases: 3D genome analysis, Hi-C data

Python Libraries:
- cooler: cooler.Cooler('file.cool')
- hicstraw: For .hic format

EDA Approach:
- Resolution analysis
- Contact matrix statistics
- Distance decay curves
- Compartment analysis
- TAD boundary detection
- Balance factor validation

.hic - Hi-C Binary Format

Description: Juicer binary Hi-C format
Typical Data: Multi-resolution Hi-C matrices
Use Cases: Hi-C analysis with Juicer tools

Python Libraries:
- hicstraw: hicstraw.HiCFile('file.hic')
- straw: C++ library with Python bindings

EDA Approach:
- Available resolutions
- Normalization methods
- Contact statistics
- Chromosomal interactions
- Quality metrics

.bw (ChIP-seq / ATAC-seq specific)

Description: BigWig files for epigenomics
Typical Data: Coverage or enrichment signals
Use Cases: ChIP-seq, ATAC-seq, DNase-seq

Python Libraries:
- pyBigWig: Standard bigWig access

EDA Approach:
- Peak enrichment patterns
- Background signal analysis
- Sample correlation
- Signal-to-noise ratio
- Library complexity metrics

.narrowPeak / .broadPeak - ENCODE Peak Formats

Description: BED-based formats for peaks
Typical Data: Peak calls with scores and p-values
Use Cases: ChIP-seq peak calling output

Python Libraries:
- pybedtools: BED-compatible
- Custom parsers for peak-specific fields

EDA Approach:
- Peak count and width distribution
- Signal value distribution
- Q-value and p-value analysis
- Peak summit analysis
- Overlap with known features
- Motif enrichment preparation
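A custom parser for the peak-specific fields mentioned above might look like the following sketch for narrowPeak, whose ten columns are chrom, start, end, name, score, strand, signalValue, pValue, qValue, and peak. In the ENCODE definition, pValue and qValue are stored as -log10 values and the last column is the summit offset from start, with -1 meaning no summit was called. The field names, helper name, and toy peaks here are hypothetical.

```python
import io

def read_narrowpeak(handle):
    """Yield one dict per narrowPeak line with typed fields."""
    for line in handle:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        cols = line.rstrip("\n").split("\t")
        yield {"chrom": cols[0], "start": int(cols[1]), "end": int(cols[2]),
               "name": cols[3], "score": int(cols[4]), "strand": cols[5],
               "signal": float(cols[6]),
               "neg_log10_p": float(cols[7]), "neg_log10_q": float(cols[8]),
               "summit_offset": int(cols[9])}

# Toy peaks (hypothetical data); the second has no called summit.
toy_peaks = (
    "chr1\t100\t300\tpeak1\t500\t.\t12.5\t8.0\t5.0\t80\n"
    "chr1\t1000\t1100\tpeak2\t200\t.\t3.0\t2.0\t1.0\t-1\n"
)
peaks = list(read_narrowpeak(io.StringIO(toy_peaks)))
widths = [p["end"] - p["start"] for p in peaks]
# Absolute summit position, or None when the offset is -1.
summits = [p["start"] + p["summit_offset"] if p["summit_offset"] >= 0 else None
           for p in peaks]
# -log10(q) >= 2 corresponds to q <= 0.01.
significant = [p["name"] for p in peaks if p["neg_log10_q"] >= 2]
```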

.wig - Wiggle Format

Description: Dense continuous genomic data
Typical Data: Coverage or signal tracks
Use Cases: Genome browser visualization

Python Libraries:
- pyBigWig: Can convert to bigWig
- Custom parsers for wiggle format

EDA Approach:
- Signal statistics
- Coverage metrics
- Format variant (fixedStep vs variableStep)
- Span parameter analysis
- Conversion efficiency to bigWig
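The fixedStep and variableStep variants mentioned above differ in where positions come from: fixedStep declares `start` and `step` once and lists only values, while variableStep gives a position on every data line. Both use 1-based positions and an optional `span`. The sketch below, with a hypothetical function name and toy track, converts either variant into 0-based, half-open intervals comparable to bedGraph.

```python
import io

def wig_to_intervals(handle):
    """Yield (chrom, start0, end0, value) from fixedStep or variableStep wiggle text."""
    mode = chrom = None
    position = step = span = None
    for line in handle:
        fields = line.split()
        if not fields or line.startswith("#") or fields[0] in ("track", "browser"):
            continue
        if fields[0] in ("fixedStep", "variableStep"):
            mode = fields[0]
            params = dict(field.split("=", 1) for field in fields[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            if mode == "fixedStep":
                position = int(params["start"])
                step = int(params["step"])
        elif mode == "fixedStep":
            yield chrom, position - 1, position - 1 + span, float(fields[0])
            position += step
        elif mode == "variableStep":
            start = int(fields[0])
            yield chrom, start - 1, start - 1 + span, float(fields[1])
        else:
            raise ValueError("data line found before any declaration line")

# Toy wiggle text (hypothetical data) mixing both variants.
toy_wig = (
    "fixedStep chrom=chr1 start=101 step=10 span=5\n"
    "1.0\n"
    "2.0\n"
    "variableStep chrom=chr2 span=3\n"
    "50 4.0\n"
)
intervals = list(wig_to_intervals(io.StringIO(toy_wig)))
```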

.ab1 - Sanger Sequencing Trace

Description: Binary chromatogram format
Typical Data: Sanger sequencing traces
Use Cases: Capillary sequencing validation

Python Libraries:
- Biopython: SeqIO.read('file.ab1', 'abi')
- tracy tools: For quality assessment

EDA Approach:
- Base calling quality
- Trace quality scores
- Mixed base detection
- Primer and vector detection
- Read length and quality region
- Heterozygosity detection

.scf - Standard Chromatogram Format

Description: Sanger sequencing chromatogram
Typical Data: Base calls and confidence values
Use Cases: Sequencing trace analysis

Python Libraries:
- Biopython: SCF format support

EDA Approach:
- Similar to AB1 format
- Quality score profiles
- Peak height ratios
- Signal-to-noise metrics

.idx - Index Files (Generic)

Description: Index files for various formats
Typical Data: Fast random access indices
Use Cases: Efficient data access (BAM, VCF, etc.)

Python Libraries:
- Format-specific libraries handle indices
- pysam: Auto-handles BAI, CSI indices

EDA Approach:
- Index completeness validation
- Binning strategy analysis
- Access performance metrics
- Index size vs data size ratio
