references/datasets.md

PyHealth Datasets and Data Structures

Core Data Structures

Event

Individual medical occurrences with attributes including:
  • code: Medical code (diagnosis, medication, procedure, lab test)
  • vocabulary: Coding system (ICD-9-CM, NDC, LOINC, etc.)
  • timestamp: Event occurrence time
  • value: Numeric value (for labs, vital signs)
  • unit: Measurement unit

Patient

Collection of events organized chronologically across visits. Each patient contains:
  • patient_id: Unique identifier
  • birth_datetime: Date of birth
  • gender: Patient gender
  • ethnicity: Patient ethnicity
  • visits: List of visit objects

Visit

Healthcare encounter containing:
  • visit_id: Unique identifier
  • encounter_time: Visit timestamp
  • discharge_time: Discharge timestamp
  • visit_type: Type of encounter (inpatient, outpatient, emergency)
  • events: List of events during this visit
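The Patient, Visit, and Event hierarchy described above can be pictured with plain dataclasses. This is a minimal sketch for illustration only: the class definitions, the `all_events` helper, and the sample record are assumptions and are not PyHealth's own classes.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Event:
    code: str                      # medical code, e.g. a diagnosis or lab code
    vocabulary: str                # coding system the code belongs to
    timestamp: datetime            # when the event occurred
    value: Optional[float] = None  # numeric result for labs and vital signs
    unit: Optional[str] = None     # measurement unit for `value`


@dataclass
class Visit:
    visit_id: str
    encounter_time: datetime
    discharge_time: Optional[datetime] = None
    visit_type: str = "inpatient"
    events: List[Event] = field(default_factory=list)


@dataclass
class Patient:
    patient_id: str
    birth_datetime: Optional[datetime] = None
    gender: Optional[str] = None
    ethnicity: Optional[str] = None
    visits: List[Visit] = field(default_factory=list)

    def all_events(self) -> List[Event]:
        """Hypothetical helper: flatten events across visits, oldest first."""
        events = [e for v in self.visits for e in v.events]
        return sorted(events, key=lambda e: e.timestamp)


# Made-up record: one visit with a lab result and a diagnosis.
visit = Visit(
    visit_id="v1",
    encounter_time=datetime(2020, 1, 1, 8, 0),
    events=[
        Event("2160-0", "LOINC", datetime(2020, 1, 1, 10, 0), value=1.2, unit="mg/dL"),
        Event("428.0", "ICD9CM", datetime(2020, 1, 1, 9, 0)),
    ],
)
patient = Patient(patient_id="p1", gender="F", visits=[visit])
ordered_codes = [e.code for e in patient.all_events()]
print(ordered_codes)  # ['428.0', '2160-0']
```

Keeping events nested under visits preserves the encounter context, while a flattening helper like the one above gives the chronological view when a task needs it.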

BaseDataset Class

Key Methods:
  • get_patient(patient_id): Retrieve single patient record
  • iter_patients(): Iterate through all patients
  • stats(): Get dataset statistics (patients, visits, events)
  • set_task(task_fn): Define prediction task
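To make the role of each method concrete, here is a small in-memory stand-in that mirrors the four method names listed above. It is a sketch under stated assumptions: `InMemoryDataset`, its dict-based patient layout, and the return types are hypothetical and do not reproduce how BaseDataset is implemented.

```python
from typing import Callable, Dict, Iterator, List


class InMemoryDataset:
    """Hypothetical stand-in mirroring the method names of BaseDataset."""

    def __init__(self, patients: Dict[str, dict]):
        # Maps patient_id -> {"patient_id": ..., "visits": [{"events": [...]}, ...]}
        self._patients = patients

    def get_patient(self, patient_id: str) -> dict:
        return self._patients[patient_id]

    def iter_patients(self) -> Iterator[dict]:
        # Yield one patient at a time instead of building a new list.
        yield from self._patients.values()

    def stats(self) -> Dict[str, int]:
        visits = [v for p in self.iter_patients() for v in p["visits"]]
        return {
            "patients": len(self._patients),
            "visits": len(visits),
            "events": sum(len(v["events"]) for v in visits),
        }

    def set_task(self, task_fn: Callable[[dict], List[dict]]) -> List[dict]:
        # A task function maps one patient to zero or more samples.
        samples: List[dict] = []
        for patient in self.iter_patients():
            samples.extend(task_fn(patient))
        return samples


toy = InMemoryDataset({
    "p1": {"patient_id": "p1", "visits": [{"events": ["428.0", "401.9"]}, {"events": ["250.00"]}]},
    "p2": {"patient_id": "p2", "visits": [{"events": ["486"]}]},
})
summary = toy.stats()
print(summary)  # {'patients': 2, 'visits': 3, 'events': 4}
```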

Available Datasets

Electronic Health Record (EHR) Datasets

MIMIC-III Dataset (MIMIC3Dataset)
  • Intensive care unit data from Beth Israel Deaconess Medical Center
  • 40,000+ critical care patients
  • Diagnoses, procedures, medications, lab results
  • Usage: from pyhealth.datasets import MIMIC3Dataset

MIMIC-IV Dataset (MIMIC4Dataset)
  • Updated version with 70,000+ patients
  • Improved data quality and coverage
  • Enhanced demographic and clinical detail
  • Usage: from pyhealth.datasets import MIMIC4Dataset

eICU Dataset (eICUDataset)
  • Multi-center critical care database
  • 200,000+ admissions from 200+ hospitals
  • Standardized ICU data across facilities
  • Usage: from pyhealth.datasets import eICUDataset

OMOP Dataset (OMOPDataset)
  • Observational Medical Outcomes Partnership format
  • Standardized common data model
  • Interoperability across healthcare systems
  • Usage: from pyhealth.datasets import OMOPDataset

EHRShot Dataset (EHRShotDataset)
  • Benchmark dataset for few-shot learning
  • Specialized for testing model generalization
  • Usage: from pyhealth.datasets import EHRShotDataset

Physiological Signal Datasets

Sleep EEG Datasets:
  • SleepEDFDataset: Sleep-EDF database for sleep staging
  • SHHSDataset: Sleep Heart Health Study data
  • ISRUCDataset: ISRUC-Sleep database

Temple University EEG Datasets:
  • TUEVDataset: Abnormal EEG events detection
  • TUABDataset: Abnormal/normal EEG classification
  • TUSZDataset: Seizure detection

All signal datasets support:
  • Multi-channel EEG signals
  • Standardized sampling rates
  • Expert annotations
  • Sleep stage or abnormality labels

Medical Imaging Datasets

COVID-19 CXR Dataset (COVID19CXRDataset)
  • Chest X-ray images for COVID-19 classification
  • Multi-class labels (COVID-19, pneumonia, normal)
  • Usage: from pyhealth.datasets import COVID19CXRDataset

Text-Based Datasets

Medical Transcriptions Dataset (MedicalTranscriptionsDataset)
  • Clinical notes and transcriptions
  • Medical specialty classification
  • Text-based prediction tasks
  • Usage: from pyhealth.datasets import MedicalTranscriptionsDataset

Cardiology Dataset (CardiologyDataset)
  • Cardiac patient records
  • Cardiovascular disease prediction
  • Usage: from pyhealth.datasets import CardiologyDataset

Preprocessed Datasets

MIMIC Extract Dataset (MIMICExtractDataset)
  • Pre-extracted MIMIC features
  • Ready-to-use benchmarking data
  • Reduced preprocessing requirements
  • Usage: from pyhealth.datasets import MIMICExtractDataset

SampleDataset Class

Converts raw datasets into task-specific formatted samples.

Purpose: Transform patient-level data into model-ready input/output pairs

Key Attributes:
  • input_schema: Defines input data structure
  • output_schema: Defines target labels/predictions
  • samples: List of processed samples

Usage Pattern:

# After setting task on BaseDataset
sample_dataset = dataset.set_task(task_fn)
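What a task function does can be sketched without the library: it receives one patient and returns a list of flat samples, each pairing model inputs with a label. Everything below is an assumption made for illustration, including the name `toy_task_fn`, the dict-based patient layout, the "next visit is an emergency" label, and the schema strings. None of it is taken from PyHealth's shipped tasks.

```python
from typing import Dict, List


def toy_task_fn(patient: Dict) -> List[Dict]:
    """Emit one sample per visit that is followed by another visit."""
    samples = []
    visits = patient["visits"]
    for current, following in zip(visits, visits[1:]):
        if not current["codes"]:
            continue  # skip visits with no usable input
        samples.append({
            "patient_id": patient["patient_id"],
            "visit_id": current["visit_id"],
            "conditions": current["codes"],         # model input
            "label": int(following["emergency"]),   # prediction target
        })
    return samples


# Illustrative schemas naming which sample keys are inputs and which are targets.
input_schema = {"conditions": "sequence"}
output_schema = {"label": "binary"}

# Made-up patient with three visits.
patient = {
    "patient_id": "p1",
    "visits": [
        {"visit_id": "v1", "codes": ["428.0"], "emergency": False},
        {"visit_id": "v2", "codes": ["401.9", "250.00"], "emergency": True},
        {"visit_id": "v3", "codes": ["486"], "emergency": False},
    ],
}
samples = toy_task_fn(patient)
print(len(samples), [s["label"] for s in samples])  # 2 [1, 0]
```

The last visit produces no sample because it has no following visit to supply a label, which is why a patient with three visits yields two samples here.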

Data Splitting Functions

Patient-Level Split (split_by_patient)
  • Ensures no patient appears in multiple splits
  • Prevents data leakage
  • Recommended for clinical prediction tasks

Visit-Level Split (split_by_visit)
  • Splits by individual visits
  • Allows same patient across splits (use cautiously)

Sample-Level Split (split_by_sample)
  • Random sample splitting
  • Most flexible but may cause leakage

Parameters:
  • dataset: SampleDataset to split
  • ratios: List or tuple of train/validation/test ratios that sum to 1 (e.g., [0.7, 0.1, 0.2])
  • seed: Random seed for reproducibility
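The reason a patient-level split prevents leakage is easiest to see in code: the patient IDs are shuffled and partitioned first, and samples then follow their patient. The function below is a plain-Python sketch of that idea, not PyHealth's implementation. The name `split_by_patient_sketch` and the assumption that each sample is a dict with a `patient_id` key are hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def split_by_patient_sketch(
    samples: List[Dict],
    ratios: Sequence[float],
    seed: Optional[int] = None,
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Split samples into train/val/test so that no patient spans two splits."""
    if len(ratios) != 3 or abs(sum(ratios) - 1.0) > 1e-6:
        raise ValueError("ratios must contain three values that sum to 1")

    # Partition patients, not samples.
    patient_ids = sorted({s["patient_id"] for s in samples})
    random.Random(seed).shuffle(patient_ids)

    n = len(patient_ids)
    # round() avoids losing a patient to floating-point error (0.7 + 0.1 < 0.8).
    train_end = round(n * ratios[0])
    val_end = round(n * (ratios[0] + ratios[1]))
    groups = (
        set(patient_ids[:train_end]),
        set(patient_ids[train_end:val_end]),
        set(patient_ids[val_end:]),
    )
    train, val, test = ([s for s in samples if s["patient_id"] in g] for g in groups)
    return train, val, test


# Ten made-up patients with three samples each.
toy_samples = [{"patient_id": f"p{i}", "visit": j} for i in range(10) for j in range(3)]
train, val, test = split_by_patient_sketch(toy_samples, [0.7, 0.1, 0.2], seed=42)
print(len(train), len(val), len(test))  # 21 3 6
```

A sample-level split of the same data could place two visits from one patient on both sides of the train/test boundary, which is the leakage the patient-level variant rules out.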

Common Workflow

from pyhealth.datasets import MIMIC4Dataset, split_by_patient
from pyhealth.tasks import mortality_prediction_mimic4_fn

# 1. Load dataset (other constructor arguments, such as which tables to load,
#    depend on the installed PyHealth version)
dataset = MIMIC4Dataset(root="/path/to/data")

# 2. Set prediction task (returns a SampleDataset)
sample_dataset = dataset.set_task(mortality_prediction_mimic4_fn)

# 3. Split by patient into train/validation/test (70/10/20)
train, val, test = split_by_patient(sample_dataset, [0.7, 0.1, 0.2])

# 4. Get statistics
print(dataset.stats())

Performance Notes

  • PyHealth is reported to be roughly 3x faster than a pandas-based pipeline for healthcare data processing (the speedup depends on workload)
  • Optimized for large-scale EHR datasets
  • Memory-efficient patient iteration
  • Vectorized operations for feature extraction
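As an illustration of what memory-efficient patient iteration can look like in general, the sketch below streams rows and yields one patient's records at a time, so only a single patient is held in memory. This is a generic Python pattern, not a description of PyHealth internals. The function name `iter_patient_rows`, the column names, and the requirement that rows arrive sorted by `patient_id` are all assumptions.

```python
import csv
import io
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, Iterator, List, Tuple


def iter_patient_rows(lines: Iterable[str]) -> Iterator[Tuple[str, List[Dict[str, str]]]]:
    """Yield (patient_id, rows) lazily; input must be sorted by patient_id."""
    reader = csv.DictReader(lines)
    for patient_id, rows in groupby(reader, key=itemgetter("patient_id")):
        # Only the rows of the current patient are materialized.
        yield patient_id, list(rows)


# Made-up CSV content standing in for an event table on disk.
raw = io.StringIO("patient_id,code\np1,428.0\np1,401.9\np2,486\n")
counts = {pid: len(rows) for pid, rows in iter_patient_rows(raw)}
print(counts)  # {'p1': 2, 'p2': 1}
```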