Refactor `io` module

Currently, the `heat.io` module handles both single files (e.g., a single `.h5` or `.nc` file) and sharded directory-based datasets. 

The routing logic for sharded datasets is implemented directly in each format-specific function, and at least some of it is probably redundant. Moreover, functions are named inconsistently (e.g., `load_npy_from_path`, `load_csv_from_folder`, `load_multiple_hdf5`, `load_zarr_groups`).

We should refactor the `io` module to cleanly separate sharded/chunked ingestion from single-file parallel ingestion. Mid-term, we might want to incorporate at least some of @hmoudaahmed's [hpc-data-loader](https://github.com/helmholtz-analytics/hpc_data_loader) capabilities into `heat.io` as well.


### Task

Refactor `heat.io` so that the MPI directory-routing algorithm (determining which rank gets which files) is separated from the format-reading logic (HDF5, CSV, zarr, and whatever we support next). This will make it easier to plug in new formats like memmap, Parquet, ... without rewriting the sharding logic every time.
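To make the proposed separation concrete, here is a minimal sketch, not a design decision. All names (`READERS`, `register_reader`, `files_for_rank`, `load_sharded`) are hypothetical and do not exist in `heat.io`; `rank` and `size` are passed in as plain integers where the real code would presumably take them from the MPI communicator, and the block-wise file distribution is just one possible routing policy.

```python
from pathlib import Path
from typing import Callable, Dict, List, Sequence

# Hypothetical registry: file suffix -> function that reads ONE file.
# Format readers know nothing about ranks or directories.
READERS: Dict[str, Callable[[Path], object]] = {}


def register_reader(suffix: str):
    """Decorator that registers a single-file reader for a suffix."""
    def decorator(func):
        READERS[suffix] = func
        return func
    return decorator


def files_for_rank(files: Sequence[Path], rank: int, size: int) -> List[Path]:
    """Format-agnostic routing: contiguous blocks of the sorted file list.

    The first `len(files) % size` ranks receive one extra file.
    """
    files = sorted(files)
    base, extra = divmod(len(files), size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return files[start:stop]


def load_sharded(directory, rank: int, size: int) -> list:
    """Read this rank's share of a directory with the registered readers."""
    directory = Path(directory)
    candidates = [p for p in directory.iterdir() if p.suffix in READERS]
    return [READERS[p.suffix](p) for p in files_for_rank(candidates, rank, size)]


# Supporting a new format is then a single-file reader plus one decorator.
# Plain text stands in here for HDF5, CSV, zarr, ...
@register_reader(".txt")
def _read_text(path: Path) -> str:
    return path.read_text()
```

With this split, the sharding logic lives only in `files_for_rank`, and a new format would not need to touch it.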

### Open for Discussion

- What kind of architecture should we target? Should we split `io.py` into format-specific files, or separate single-file loader functions from sharded loading?

TBC...

