Proposal for Streaming HuggingFace Datasets to Optimize Workflow #41

@vishesh9131

I hope this message finds you well. I would like to discuss the possibility of adjusting the current codebase to enable streaming of datasets directly from HuggingFace, eliminating the need for downloading them. This enhancement can significantly streamline the workflow, reduce storage requirements, and improve efficiency, especially for users working with limited local storage or in environments where data download speeds are a bottleneck.

Dataset streaming can be implemented with HuggingFace's datasets library, which supports on-the-fly data access. The change would integrate this functionality into the existing data handling pipeline while keeping it compatible with the current behavior, so that existing users get a seamless transition.

The high-level steps include:

  1. Updating the data loading functions to call HuggingFace's load_dataset with streaming enabled (streaming=True).
  2. Ensuring all downstream processes can handle data in a streamed format without requiring local storage.
  3. Conducting thorough testing to verify the integrity and performance of the streamed data pipeline.
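
As a rough illustration of steps 1 and 2, here is a minimal sketch rather than a drop-in patch for this repository. The function names open_stream and batched are hypothetical and do not exist in the current codebase. The only library call assumed is load_dataset(..., streaming=True) from the datasets package, which returns an iterable dataset instead of downloading the full split.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def open_stream(dataset_name: str, split: str = "train") -> Iterable[Dict[str, Any]]:
    """Open a dataset lazily (hypothetical helper, not part of this repo)."""
    # Imported inside the function so that batched() below stays usable
    # even where the datasets package is not installed.
    from datasets import load_dataset

    # streaming=True yields records on demand instead of downloading the split.
    return load_dataset(dataset_name, split=split, streaming=True)


def batched(
    examples: Iterable[Dict[str, Any]], batch_size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Group a stream of examples into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(examples)
    while True:
        # islice pulls only batch_size records, so the stream is never
        # materialized in memory or on disk as a whole.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Downstream code would then iterate over batched(open_stream(name), batch_size) instead of indexing into a fully downloaded dataset. Any step that currently relies on len() or random access would need to be adapted, which is the main compatibility point to cover in testing.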

If you are interested, I can raise a pull request with the proposed changes for your review. This would allow us to collaboratively refine and integrate this feature into the project.

Looking forward to your thoughts on this.
Best regards,

Vishesh Yadav
