Currently, the heat.io module handles both single files (e.g., a single .h5 or .nc file) and sharded directory-based datasets.
The routing logic for sharded datasets is addressed directly in the format-specific function, probably at least some of it is redundant. Moreover, functions are named inconsistently (e.g., load_npy_from_path, load_csv_from_folder, load_multiple_hdf5, load_zarr_groups).
We should refactor the io module to cleanly separate the concept of sharded/chunked vs. single-file parallel ingestion. Mid-term, we might want to incorporate at least some of @hmoudaahmed 's hpc-data-loader capabilities into heat.io as well.
Task
Refactor heat.io so that the MPI directory-routing algorithm (determining which rank gets which files) is separated from the format-reading logic (HDF5, CSV, zarr, and whatever we support next). This will make it easier to plug in new formats like memmap, Parquet, ... without rewriting the sharding logic every time.
Open for Discussion
- What kind of architecture should we target? should we separate
io.py into format-specific files? or separate single-file loader functions from sharded loading?
TBC...
Currently, the
heat.iomodule handles both single files (e.g., a single.h5or.ncfile) and sharded directory-based datasets.The routing logic for sharded datasets is addressed directly in the format-specific function, probably at least some of it is redundant. Moreover, functions are named inconsistently (e.g.,
load_npy_from_path,load_csv_from_folder,load_multiple_hdf5,load_zarr_groups).We should refactor the
iomodule to cleanly separate the concept of sharded/chunked vs. single-file parallel ingestion. Mid-term, we might want to incorporate at least some of @hmoudaahmed 's hpc-data-loader capabilities intoheat.ioas well.Task
Refactor
heat.ioso that the MPI directory-routing algorithm (determining which rank gets which files) is separated from the format-reading logic (HDF5, CSV, zarr, and whatever we support next). This will make it easier to plug in new formats like memmap, Parquet, ... without rewriting the sharding logic every time.Open for Discussion
io.pyinto format-specific files? or separate single-file loader functions from sharded loading?TBC...