
Refactor io module #2334

@ClaudiaComito

Description

Currently, the heat.io module handles both single files (e.g., a single .h5 or .nc file) and sharded directory-based datasets.

The routing logic for sharded datasets is implemented directly in each format-specific function, so at least some of it is probably redundant. Moreover, the functions are named inconsistently (e.g., load_npy_from_path, load_csv_from_folder, load_multiple_hdf5, load_zarr_groups).

We should refactor the io module to cleanly separate sharded/chunked ingestion from single-file parallel ingestion. Mid-term, we might want to incorporate at least some of @hmoudaahmed's hpc-data-loader capabilities into heat.io as well.

Task

Refactor heat.io so that the MPI directory-routing algorithm (determining which rank gets which files) is separated from the format-reading logic (HDF5, CSV, zarr, and whatever we support next). This will make it easier to plug in new formats such as memmap, Parquet, ... without rewriting the sharding logic every time.
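To make the proposed separation concrete, here is a minimal sketch, not a design decision. All names (`files_for_rank`, `READERS`, `read_csv_rows`, `load_sharded`) are hypothetical and do not exist in heat.io. The contiguous block split is just one possible routing policy, and `rank`/`size` are passed as plain integers so the sketch runs without MPI. In practice they would come from the communicator, e.g. mpi4py's `MPI.COMM_WORLD.Get_rank()` and `Get_size()`.

```python
import csv
from pathlib import Path
from typing import Callable, Dict, List


def files_for_rank(files: List[Path], rank: int, size: int) -> List[Path]:
    """Format-agnostic routing (hypothetical helper).

    Sorts the files so every rank sees the same order, then hands each
    rank a contiguous block; the first `extra` ranks get one more file.
    """
    files = sorted(files)
    base, extra = divmod(len(files), size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return files[start:stop]


def read_csv_rows(path: Path) -> list:
    """Format-specific reader (hypothetical): knows nothing about ranks."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


# Hypothetical registry: supporting a new format means adding one reader
# here, without touching the routing logic above.
READERS: Dict[str, Callable[[Path], list]] = {".csv": read_csv_rows}


def load_sharded(directory, suffix: str, rank: int, size: int) -> list:
    """Combine routing and reading: load only this rank's share of the shards."""
    reader = READERS[suffix]
    files = [p for p in Path(directory).iterdir() if p.suffix == suffix]
    local_rows = []
    for path in files_for_rank(files, rank, size):
        local_rows.extend(reader(path))
    return local_rows
```

With this split, an HDF5 or zarr reader would only need to be a function taking a path and registered under its suffix. How the per-rank pieces are assembled into a distributed array is deliberately left out of the sketch.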

Open for Discussion

  • What kind of architecture should we target? Should we split io.py into format-specific files, or separate single-file loader functions from sharded loading?

TBC...
