Python tools for working with synthetic healthcare datasets.

SynthLab provides Python interfaces for working with major synthetic healthcare datasets, including:

  • Synthea: A synthetic patient population simulator that generates realistic synthetic patient records
  • UK Biobank Synthetic Dataset: A large-scale synthetic dataset designed for system testing with UK Biobank-compatible data

Features

Synthea Support

  • Synthea Runner: Easy-to-use Python interface for running Synthea simulations
  • OMOP Conversion: Convert Synthea CSV output to OMOP CDM format
  • AWS Dataset Download: Download pre-generated Synthea OMOP datasets from AWS
  • Configuration Management: Flexible configuration system with validation

Synthea Coherent Data Set (Multimodal) - NEW!

  • EHR + Imaging + Genomics: The only publicly available synthetic dataset combining all three modalities
  • FHIR Records: Complete patient records with demographics, conditions, medications, encounters
  • MRI DICOM: Synthetic brain imaging linked to patients
  • Familial Genomes: VCF files with genetic variants for patients and family members
  • Clinical Notes: SOAP-style clinical documentation
  • Physiological Data: Time-series vital signs data

UK Biobank Synthetic Dataset Support

  • Dataset Download: Download tabular, medical, genetic, and bulk data files
  • Automatic Caching: Files are automatically cached in ~/.cache/synthlab/ukbiobank_synthetic/
  • MD5 Verification: Automatic checksum verification for downloaded files
  • Data Loading: Load data into Polars DataFrames for efficient analysis
  • Category Management: Organized downloads by category (tabular, medical, genetic, bulk)
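
The checksum step listed above can be illustrated with the standard library alone. This is a minimal sketch of MD5 verification in general, not SynthLab's own implementation; the helper names `md5_of_file` and `verify_md5` are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large downloads never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published checksum (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks matters here because some of the dataset files are large; hashing the whole file in one call would require loading it into memory first.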

Genomics Data Support (NEW)

  • HAPNEST Integration: Download and load HAPNEST synthetic genomics data (1M+ individuals, 6.8M variants)
  • Synthetic Genotype Generation: Generate simple synthetic genotypes for testing
  • PLINK Format Support: Work with standard genomics file formats (.pgen, .pvar, .psam)

Medical Imaging Catalog (NEW)

  • Dataset Discovery: Catalog of 15+ publicly available medical imaging datasets
  • Multi-modal Coverage: CT, MRI, X-ray, and histopathology datasets
  • Access Information: Clear documentation of open vs. registration-required datasets
  • Download Utilities: Helpers for downloading select open-access datasets

MEDS (Medical Event Data Standard) conversion (NEW)

Plug synthetic EHR data into ML-native foundation models. synthlab.meds provides a thin, well-tested wrapper over the community meds_etl package, so OMOP CSVs (produced by synthlab.synthea.convert_synthea_to_omop) become MEDS Parquet shards ready for models such as SMB-v1 or MOTOR:

from synthlab.meds import MedsConvertConfig, convert_omop_to_meds, load_meds_events

convert_omop_to_meds(MedsConvertConfig(
    omop_dir="~/.cache/synthlab/synthea/omop_100",
    meds_dir="~/.cache/synthlab/meds/synthea_100",
))
df = load_meds_events("~/.cache/synthlab/meds/synthea_100")

Install the optional extra: pip install synthlab[meds].

Olink NPX simulator (NEW)

Simulate case/control Olink proteomics data with missingness driven by the limit of detection (LOD) and configurable group effects. This is the first greenfield open-source simulator targeted at Olink's NPX / PEA readout: existing tools like MSstatsSampleSize target LC-MS/MS, and OlinkAnalyze ships demo data but no simulator. Priors reflect UKB-PPP (Sun et al. 2023) and the OlinkAnalyze npx_data1 / npx_data2 baseline distributions:

from synthlab import OlinkSimConfig, default_explore_3072_panel, simulate_olink_npx

cfg = OlinkSimConfig(
    n_samples=500,
    panel=default_explore_3072_panel(),
    group_effects={"CRP": {"case": 1.8}, "IL6": {"case": 1.2}},
    group_assignments=["case"] * 250 + ["control"] * 250,
    seed=42,
)
df = simulate_olink_npx(cfg)

See synthlab/olink.py for the full API (OlinkPanelConfig, OlinkSimConfig, simulate_olink_npx, default_explore_3072_panel, write_olink_parquet, load_olink_parquet).
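
To make "LOD-driven missingness" and "group effects" concrete, here is a minimal standard-library sketch for a single protein. It is not how simulate_olink_npx is implemented. The function name `simulate_protein_npx`, the Gaussian model, and the baseline mean, standard deviation, and LOD values are all illustrative assumptions.

```python
import random
from typing import Dict, List, Optional


def simulate_protein_npx(
    groups: List[str],
    baseline_mean: float,
    sd: float,
    lod: float,
    effects: Dict[str, float],
    seed: int = 0,
) -> List[Optional[float]]:
    """Draw one NPX value per sample; values below the LOD become None."""
    rng = random.Random(seed)
    values: List[Optional[float]] = []
    for group in groups:
        # Shift the group's mean by its effect size (0.0 if the group has none).
        npx = rng.gauss(baseline_mean + effects.get(group, 0.0), sd)
        # Readings under the limit of detection are treated as missing.
        values.append(npx if npx >= lod else None)
    return values


# Made-up parameters, chosen only so that the control group sits near the LOD.
groups = ["case"] * 250 + ["control"] * 250
crp = simulate_protein_npx(
    groups, baseline_mean=5.0, sd=1.0, lod=4.0, effects={"case": 1.8}, seed=42
)
missing_case = sum(v is None for v in crp[:250])
missing_control = sum(v is None for v in crp[250:])
print(f"missing: case={missing_case}, control={missing_control}")
```

Because the case group's mean is shifted upward, fewer of its values fall below the LOD, so missingness differs by group. That dependence of missingness on the underlying signal is what a naive "drop values at random" simulator would not reproduce.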

Disease-conditional effect catalog (NEW)

Rather than hand-picking effect sizes, plug in a curated, source-cited catalog of per-disease protein NPX shifts mined from published plasma proteomics literature. Every row in synthlab/data/olink_disease_effects.csv cites a real DOI, so a downstream user always knows where a given effect-size estimate came from. See docs/olink_disease_catalog.md for the schema and a "how to add a new disease" checklist.

from synthlab import (
    OlinkSimConfig, default_explore_3072_panel, simulate_olink_npx,
    load_disease_effect_catalog,
)

catalog = load_disease_effect_catalog()           # bundled with package
print(catalog.diseases())                         # ('Alzheimer', 'BRCA_hereditary', 'CAD', 'CKD', 'Cancer_broad', 'IBD', 'T2D')
effects = catalog.effects_for(["T2D", "CAD"])     # {protein: {disease: delta_npx}}
cfg = OlinkSimConfig(
    n_samples=900,
    panel=default_explore_3072_panel(),
    group_effects=effects,
    group_assignments=["T2D"]*300 + ["CAD"]*300 + ["baseline"]*300,
    seed=42,
)
df = simulate_olink_npx(cfg)
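
The nested `{protein: {disease: delta_npx}}` shape returned by `effects_for` can be pictured as a pivot over catalog rows. The sketch below assumes column names `disease`, `protein`, `delta_npx`, and `doi`; the real schema is documented in docs/olink_disease_catalog.md and may differ. The rows are made-up placeholders, not values from the shipped CSV.

```python
import csv
import io
from typing import Dict, Iterable


def effects_from_rows(
    rows: Iterable[dict], diseases: Iterable[str]
) -> Dict[str, Dict[str, float]]:
    """Pivot catalog rows into {protein: {disease: delta_npx}}."""
    wanted = set(diseases)
    effects: Dict[str, Dict[str, float]] = {}
    for row in rows:
        if row["disease"] in wanted:
            effects.setdefault(row["protein"], {})[row["disease"]] = float(
                row["delta_npx"]
            )
    return effects


# Placeholder rows for illustration only (hypothetical proteins, values, DOIs).
sample_csv = io.StringIO(
    "disease,protein,delta_npx,doi\n"
    "T2D,PROT_A,0.5,10.0000/placeholder-1\n"
    "CAD,PROT_A,0.25,10.0000/placeholder-2\n"
    "CAD,PROT_B,-0.25,10.0000/placeholder-2\n"
    "IBD,PROT_C,1.0,10.0000/placeholder-3\n"
)
effects = effects_from_rows(csv.DictReader(sample_csv), ["T2D", "CAD"])
print(effects)
```

Keying by protein first means one protein can carry a different shift for each disease group, which matches how `group_effects` is used in the hand-written example earlier in this README.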

For the disease-group row ranges, see the shipped CSV, synthlab/data/olink_disease_effects.csv.

Installation

pip install synthlab

For AWS dataset download functionality:

pip install synthlab[aws]

For imaging notebooks (DICOM):

pip install synthlab[imaging]

For genomics helpers:

pip install synthlab[genomics]

Install everything:

pip install synthlab[all]

Optional GPU performance (FlashAttention / FlashAttention 2):

# Requires a supported NVIDIA GPU + CUDA toolchain
# FlashAttention 2 is provided by the same flash-attn package
# See https://github.com/Dao-AILab/flash-attention for compatibility details
pip install flash-attn --no-build-isolation

Requirements

  • Python 3.8+
  • Java 11 or newer (required by Synthea)
  • Polars (for efficient data loading)
  • Requests (for dataset downloads)
  • Matplotlib (plots)
  • Optional (imaging notebooks): pydicom

Quick Start

Synthea: Generate Synthetic Patient Data

from synthlab import SyntheaRunner, SyntheaConfig

# Create a runner (downloads Synthea JAR automatically)
runner = SyntheaRunner()

# Configure a simulation
config = SyntheaConfig(
    population_size=100,
    state="Massachusetts",
    seed=12345,
    output_dir="output/synthea_data"
)

# Run the simulation
result = runner.run(config)

if result['returncode'] == 0:
    print(f"Generated data in: {result['output_dir']}")

Synthea: Convert to OMOP CDM

from synthlab import convert_synthea_to_omop

# Convert Synthea CSV to OMOP CDM format
output_files = convert_synthea_to_omop(
    synthea_csv_dir="output/synthea_data",
    output_dir="output/omop",
    cdm_version="5.4",
    output_format="parquet"
)

print(f"Generated {len(output_files)} OMOP tables")
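
For a feel of what such a conversion involves, the sketch below maps one row of Synthea's patients.csv to a few fields of the OMOP PERSON table. It is a simplified illustration, not the logic inside convert_synthea_to_omop: the function name `patient_to_person` is hypothetical, only a subset of PERSON columns is filled, and the input row is made up.

```python
from datetime import date

# OMOP standard gender concepts: 8507 = MALE, 8532 = FEMALE.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}


def patient_to_person(row: dict, person_id: int) -> dict:
    """Map a Synthea patients.csv row to a partial OMOP PERSON record."""
    birth = date.fromisoformat(row["BIRTHDATE"])
    return {
        "person_id": person_id,
        # 0 is the OMOP convention for "no matching concept".
        "gender_concept_id": GENDER_CONCEPTS.get(row["GENDER"], 0),
        "year_of_birth": birth.year,
        "month_of_birth": birth.month,
        "day_of_birth": birth.day,
        # Keep the original Synthea UUID and gender code for traceability.
        "person_source_value": row["Id"],
        "gender_source_value": row["GENDER"],
    }


# Made-up input row for illustration.
person = patient_to_person(
    {"Id": "example-uuid", "BIRTHDATE": "1980-04-15", "GENDER": "F"}, person_id=1
)
print(person)
```

A full conversion also has to map conditions, drugs, and measurements to standard vocabularies, which is where most of the work lies.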

Synthea: Download Pre-generated Datasets

from synthlab import list_synthea_datasets, download_dataset

# List available datasets
datasets = list_synthea_datasets()

# Download a dataset
download_dataset("synthea1k", output_dir="data/synthea1k")

UK Biobank Synthetic Dataset: Download and Load Data

from synthlab import download_category, load_tabular_data, get_cache_dir

# Download tabular data (saved to ~/.cache/synthlab/ukbiobank_synthetic/tabular/)
download_category("tabular", verify_md5=True)

# Load the data into Polars DataFrames
tabular_data = load_tabular_data(sample_rows=1000)  # Load first 1000 rows for demo

# Access individual files
death_data = tabular_data["dates_death"]
integer_data = tabular_data["integer_no_arrays"]

print(f"Loaded {len(tabular_data)} tabular files")
print(f"Cache directory: {get_cache_dir()}")

Genomics: Download HAPNEST Synthetic Data

from synthlab import download_hapnest_small, load_hapnest_variants, load_hapnest_samples

# Download small HAPNEST test dataset (600 individuals)
data_dir = download_hapnest_small()

# Load variant and sample information
variants = load_hapnest_variants(data_dir)
samples = load_hapnest_samples(data_dir)

print(f"Variants: {len(variants)}")
print(f"Samples: {len(samples)}")

Genomics: Generate Synthetic Genotypes

from synthlab import generate_synthetic_genotypes

# Generate random synthetic genotype data for testing
data_dir = generate_synthetic_genotypes(
    n_samples=1000,
    n_variants=10000,
    seed=42
)
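
What "simple synthetic genotypes" could mean is sketched below: each variant gets a random allele frequency, and each sample's genotype is the number of alternate alleles out of two independent draws. This is an assumption about the general technique, not a description of what generate_synthetic_genotypes does internally; `random_genotypes` and the frequency range are hypothetical.

```python
import random
from typing import List


def random_genotypes(n_samples: int, n_variants: int, seed: int = 0) -> List[List[int]]:
    """Return an n_samples x n_variants matrix of alternate-allele counts (0, 1, or 2)."""
    rng = random.Random(seed)
    # One alternate-allele frequency per variant, drawn uniformly (illustrative range).
    freqs = [rng.uniform(0.05, 0.5) for _ in range(n_variants)]
    matrix = []
    for _ in range(n_samples):
        # Two independent allele draws per variant; True + True == 2 in Python.
        matrix.append([(rng.random() < p) + (rng.random() < p) for p in freqs])
    return matrix


genotypes = random_genotypes(n_samples=5, n_variants=8, seed=42)
for row in genotypes:
    print(row)
```

Drawing the two alleles independently yields genotype frequencies in Hardy-Weinberg proportions, but it produces no linkage disequilibrium or population structure, which is why data of this kind suits pipeline testing rather than method evaluation.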

Coherent Data Set: Multimodal EHR + Imaging + Genomics

from synthlab import (
    download_coherent_dataset,
    load_fhir_patients,
    print_coherent_info,
)

# See what's available
print_coherent_info()

# Download specific components (FHIR records + genomics)
download_coherent_dataset(components=['fhir', 'genomics'])

# Or download everything (several GB)
# download_coherent_dataset()

# Load FHIR patient records
patients = load_fhir_patients(max_patients=10)
print(f"Loaded {len(patients)} patient bundles")

Medical Imaging: Explore Available Datasets

from synthlab import print_dataset_catalog, list_imaging_datasets, get_dataset_info

# Print catalog of all datasets
print_dataset_catalog()

# Filter by modality
histology = list_imaging_datasets(modality="Histopathology")

# Get info about specific dataset
mhist_info = get_dataset_info("mhist")
print(f"MHIST: {mhist_info.n_images} images, {mhist_info.size_gb} GB")

UK Biobank Synthetic Dataset: Download Specific Files

from synthlab import download_file

# Download a single file
download_file("dates_death.tsv", category="tabular")

# Files are automatically saved to ~/.cache/synthlab/ukbiobank_synthetic/tabular/
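
To check where a file should land, the cache layout described above can be reconstructed with pathlib. This is a sketch based only on the paths quoted in this README: the helper `expected_cache_path` is hypothetical, and it assumes the other categories follow the same `<cache>/<category>/<filename>` pattern as `tabular`. In real code, prefer get_cache_dir().

```python
from pathlib import Path
from typing import Optional

CATEGORIES = ("tabular", "medical", "genetic", "bulk")


def expected_cache_path(
    filename: str, category: str, home: Optional[Path] = None
) -> Path:
    """Build the path a downloaded file is expected to have in the cache."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category {category!r}; expected one of {CATEGORIES}")
    base = (home or Path.home()) / ".cache" / "synthlab" / "ukbiobank_synthetic"
    return base / category / filename


path = expected_cache_path("dates_death.tsv", "tabular")
print(path, "(exists)" if path.exists() else "(not downloaded yet)")
```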

Documentation

SyntheaRunner

The main class for running Synthea simulations.

runner = SyntheaRunner(
    jar_path=None,             # Path to existing JAR (auto-downloads if None)
    jar_url=SYNTHEA_JAR_URL,   # URL to download JAR from
    cache_dir=None,            # Cache directory (defaults to OS cache)
    java_executable="java",    # Java executable path
)

SyntheaConfig

Configuration class for Synthea simulations.

config = SyntheaConfig(
    population_size=100,     # Number of patients
    seed=12345,              # Random seed
    state="Massachusetts",   # US state
    city="Boston",           # Optional city
    min_age=0,               # Minimum age
    max_age=100,             # Maximum age
    gender="M",              # "M", "F", or None
    output_dir="output",     # Output directory
)
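
These fields correspond closely to options of the Synthea command-line tool. The sketch below shows one plausible mapping onto Synthea's documented flags (`-p`, `-s`, `-g`, `-a`, `--exporter.baseDirectory`, followed by the state and optional city). Whether SyntheaRunner builds its command exactly this way is an assumption, and `build_synthea_command` is a hypothetical helper.

```python
from typing import List, Optional


def build_synthea_command(
    jar_path: str,
    population_size: int,
    seed: Optional[int] = None,
    state: Optional[str] = None,
    city: Optional[str] = None,
    min_age: Optional[int] = None,
    max_age: Optional[int] = None,
    gender: Optional[str] = None,
    output_dir: Optional[str] = None,
    java_executable: str = "java",
) -> List[str]:
    """Assemble an argument list suitable for subprocess.run (no shell needed)."""
    cmd = [java_executable, "-jar", jar_path, "-p", str(population_size)]
    if seed is not None:
        cmd += ["-s", str(seed)]
    if gender is not None:
        cmd += ["-g", gender]
    if min_age is not None and max_age is not None:
        cmd += ["-a", f"{min_age}-{max_age}"]
    if output_dir is not None:
        cmd.append(f"--exporter.baseDirectory={output_dir}")
    # The state (and optional city) are positional and come last.
    if state is not None:
        cmd.append(state)
        if city is not None:
            cmd.append(city)
    return cmd


print(" ".join(build_synthea_command(
    "synthea-with-dependencies.jar", 100, seed=12345, state="Massachusetts"
)))
```

Passing the command as a list rather than a single string avoids shell quoting problems with multi-word city names such as "San Francisco".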

UK Biobank Synthetic Dataset Functions

from synthlab import (
    list_available_files,
    download_file,
    download_category,
    load_tabular_data,
    load_medical_records,
    load_genetic_dictionary,
    get_cache_dir,
)

# List available files
files = list_available_files(category="tabular")

# Download entire category
download_category("tabular")  # Downloads to ~/.cache/synthlab/ukbiobank_synthetic/tabular/

# Download single file
download_file("dates_death.tsv", category="tabular")

# Load data (uses cache directory by default)
data = load_tabular_data(sample_rows=1000)
medical = load_medical_records(sample_rows=10000)
genetic_dict = load_genetic_dictionary()

# Get cache directory
cache_dir = get_cache_dir()  # Returns ~/.cache/synthlab/ukbiobank_synthetic/

Convenience Methods

# Synthea quick test run
runner.run_quick(population_size=10, state="Massachusetts")

# Synthea custom location
runner.run_custom_location(state="California", city="San Francisco", population_size=100)

# Synthea age-specific population
runner.run_age_specific(min_age=25, max_age=65, population_size=100)

Dataset Information

Synthea

Synthea generates synthetic patient records with:

  • Demographics
  • Medical history
  • Medications
  • Lab results
  • Procedures
  • Encounters

Reference: Synthea GitHub

UK Biobank Synthetic Dataset

The UK Biobank Synthetic Dataset contains:

  1. Tabular Records (23 TSV files): Main phenotype data (~600K participants × ~27K columns)
    • Survey responses, measurements, clinical data
    • Files: dates_death.tsv, integer_no_arrays.tsv, real_fields1.tsv, etc.
  2. Medical Records (6 text files): GP clinical records (~400M rows)
    • Diagnosis codes (Read 2/3), visit data, clinical events
  3. Genetic Records: SNP genotype data (~600K participants × 840K SNPs)
    • Dictionary file + 26 chromosome files (compressed)
  4. Bulk Files (37 zip archives): ~6M files for system testing

Important: This is synthetic data and may not be internally consistent (e.g., events after death, prostate cancer in females).
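
If an analysis depends on temporal consistency, such cases are easy to screen for before use. The sketch below flags events dated after a participant's recorded death. The record layout (participant IDs mapped to dates, events as `(id, date)` pairs) and the function name `events_after_death` are assumptions for illustration; the sample records are made up.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple


def events_after_death(
    death_dates: Dict[str, date], events: Iterable[Tuple[str, date]]
) -> List[Tuple[str, date]]:
    """Return the events that occur strictly after the participant's death date."""
    return [
        (pid, when)
        for pid, when in events
        if pid in death_dates and when > death_dates[pid]
    ]


# Made-up records for illustration.
deaths = {"p1": date(2015, 6, 1)}
events = [
    ("p1", date(2014, 1, 10)),  # before death: fine
    ("p1", date(2016, 3, 5)),   # after death: inconsistent
    ("p2", date(2018, 7, 7)),   # no recorded death: fine
]
print(events_after_death(deaths, events))
```

For system testing these inconsistencies are harmless, but any statistical result derived from the synthetic data should not be interpreted clinically.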

Reference: UK Biobank Synthetic Dataset

Examples

See the examples/ and notebooks/ directories for detailed examples:

  • examples/basic_usage.py - Basic Synthea usage
  • examples/ukbiobank_synthetic_example.py - UK Biobank Synthetic Dataset examples
  • notebooks/Synthea.ipynb - Comprehensive Synthea tutorial
  • notebooks/UKBiobank_Synthetic.ipynb - UK Biobank Synthetic Dataset tutorial
  • notebooks/Coherent_Dataset.ipynb - Multimodal EHR + Imaging + Genomics tutorial

License

MIT License
