This repository was archived by the owner on Mar 31, 2025. It is now read-only.

Update dependency datasets to v3.5.0#652

Closed
renovate[bot] wants to merge 1 commit into main from renovate/datasets-3.x-lockfile

@renovate renovate bot commented Jan 5, 2025

This PR contains the following updates:

Package     Change
datasets    3.1.0 -> 3.5.0

Release Notes

huggingface/datasets (datasets)

v3.5.0

Compare Source

Datasets Features
>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...
What's Changed
New Contributors

Full Changelog: huggingface/datasets@3.4.1...3.5.0

v3.4.1

Compare Source

Bug Fixes

Full Changelog: huggingface/datasets@3.4.0...3.4.1

v3.4.0

Compare Source

Dataset Features

  • Faster folder based builder + parquet support + allow repeated media + use torchvideo by @​lhoestq in https://github.com/huggingface/datasets/pull/7424

    • /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is no longer maintained and isn't available for recent Python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version
    from datasets import load_dataset, Video
    
    dataset = load_dataset("path/to/video/folder", split="train")
    dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
    • faster streaming for image/audio/video folder from Hugging Face
    • support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
  • Add IterableDataset.decode with multithreading by @​lhoestq in https://github.com/huggingface/datasets/pull/7450

    • even faster streaming for image/audio/video folder from Hugging Face if you enable multithreading to decode image/audio/video data:
    dataset = dataset.decode(num_threads=num_threads)
  • Add with_split to DatasetDict.map by @​jp1924 in https://github.com/huggingface/datasets/pull/7368
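
A rough way to picture the multithreaded decoding mentioned above: decoding media is mostly I/O and C-level work, so a small thread pool can keep several items in flight while the consumer reads results in order. The sketch below uses only the standard library and is not the datasets implementation; `decode_in_threads` and `fake_decode` are hypothetical names chosen for illustration.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def decode_in_threads(items, decode_fn, num_threads=4):
    """Yield decode_fn(item) in input order, keeping up to num_threads decodes in flight."""
    it = iter(items)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Prime a bounded window of pending work so the stream stays lazy.
        pending = deque(pool.submit(decode_fn, item) for item in islice(it, num_threads))
        while pending:
            oldest = pending.popleft()
            # Refill the window before blocking on the oldest result.
            for item in islice(it, 1):
                pending.append(pool.submit(decode_fn, item))
            yield oldest.result()


def fake_decode(raw: bytes) -> str:
    # Hypothetical stand-in for an image/audio/video decoder.
    return raw.decode("utf-8").upper()


decoded = list(decode_in_threads([b"a", b"bc", b"def"], fake_decode, num_threads=2))
```

The bounded window is a design choice for streaming: it avoids submitting the whole dataset to the pool at once, which would defeat lazy iteration.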

General improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.3.2...3.4.0

v3.3.2

Compare Source

Bug fixes

Other general improvements

New Contributors

Full Changelog: huggingface/datasets@3.3.1...3.3.2

v3.3.1

Compare Source

Bug fixes

Full Changelog: huggingface/datasets@3.3.0...3.3.1

v3.3.0

Compare Source

Dataset Features

  • Support async functions in map() by @​lhoestq in https://github.com/huggingface/datasets/pull/7384

    • Especially useful to download content like images or call inference APIs
    prompt = "Answer the following question: {question}. You should think step by step."
    async def ask_llm(example):
        return await query_model(prompt.format(question=example["question"]))
    ds = ds.map(ask_llm)
  • Add repeat method to datasets by @​alex-hh in https://github.com/huggingface/datasets/pull/7198

    ds = ds.repeat(10)
  • Support faster processing using pandas or polars functions in IterableDataset.map() by @​lhoestq in https://github.com/huggingface/datasets/pull/7370

    • Add support for "pandas" and "polars" formats in IterableDatasets
    • This enables optimized data processing using pandas or polars functions with zero-copy, e.g.
    ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
    ds = ds.with_format("polars")
    expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
    ds = ds.map(lambda df: df.with_columns(expr), batched=True)
  • Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @​alex-hh in https://github.com/huggingface/datasets/pull/7207

    • IterableDatasets with "numpy" format are now much faster
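
To see why async functions help for downloads or inference calls, the sketch below runs an async per-example function over a list of rows concurrently using only asyncio. It is an illustration of the idea, not how map() is implemented; `async_map` and `max_concurrency` are hypothetical, and `query_model` is a stand-in that sleeps instead of calling a real API.

```python
import asyncio

PROMPT = "Answer the following question: {question}. You should think step by step."


async def query_model(prompt: str) -> str:
    # Hypothetical stand-in for a remote inference call.
    await asyncio.sleep(0.01)
    return f"echo: {prompt}"


async def ask_llm(example: dict) -> dict:
    answer = await query_model(PROMPT.format(question=example["question"]))
    # Return a dict so the new column can be merged into the example.
    return {**example, "answer": answer}


async def async_map(fn, examples, max_concurrency=8):
    """Apply an async fn to every example, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(example):
        async with semaphore:
            return await fn(example)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(ex) for ex in examples))


rows = [{"question": "What is 2 + 2?"}, {"question": "Name a prime number"}]
results = asyncio.run(async_map(ask_llm, rows))
```

Because the calls overlap while waiting on I/O, total time is closer to the slowest call than to the sum of all calls.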

What's Changed

New Contributors

Full Changelog: huggingface/datasets@3.2.0...3.3.0

v3.2.0

Compare Source

Dataset Features

  • Faster parquet streaming + filters with predicate pushdown by @​lhoestq in https://github.com/huggingface/datasets/pull/7309
    • Up to +100% streaming speed
    • Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
      from datasets import load_dataset
      filters = [('date', '>=', '2023')]
      ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
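
The idea behind predicate pushdown can be sketched without any Parquet library: each row group carries min/max statistics for a column, and a reader can skip a whole group when those statistics prove that no row can match. The code below is a simplified, single-column illustration under that assumption; `RowGroup`, `may_match`, and `scan` are hypothetical names, not part of datasets or pyarrow.

```python
import operator
from dataclasses import dataclass

OPS = {">=": operator.ge, "<=": operator.le, "==": operator.eq}


@dataclass
class RowGroup:
    min_value: str  # column minimum, as stored in file metadata
    max_value: str  # column maximum, as stored in file metadata
    rows: list      # row payload, only "downloaded" if the group may match


def may_match(group: RowGroup, op: str, value: str) -> bool:
    """Return False only when the statistics rule out every row in the group."""
    if op == ">=":
        return group.max_value >= value
    if op == "<=":
        return group.min_value <= value
    if op == "==":
        return group.min_value <= value <= group.max_value
    raise ValueError(f"unsupported operator: {op}")


def scan(groups, column, op, value):
    """Return (matching rows, number of row groups actually read)."""
    matched, groups_read = [], 0
    for group in groups:
        if not may_match(group, op, value):
            continue  # skipped without reading its rows
        groups_read += 1
        matched.extend(row for row in group.rows if OPS[op](row[column], value))
    return matched, groups_read


groups = [
    RowGroup("2021-01-15", "2022-06-30", [{"date": "2021-01-15"}, {"date": "2022-06-30"}]),
    RowGroup("2022-11-01", "2023-03-01", [{"date": "2022-11-01"}, {"date": "2023-03-01"}]),
    RowGroup("2023-05-10", "2023-09-20", [{"date": "2023-05-10"}, {"date": "2023-09-20"}]),
]
matched, groups_read = scan(groups, "date", ">=", "2023")
```

With the filter from the example above, the first group is skipped entirely, and rows in the remaining groups are still checked one by one because the statistics only say a match is possible.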

Other improvements and bug fixes

New Contributors

Full Changelog: huggingface/datasets@3.1.0...3.2.0


Configuration

📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

@renovate renovate bot requested a review from rashley-iqt as a code owner on January 5, 2025 22:51
@renovate renovate bot force-pushed the renovate/datasets-3.x-lockfile branch from 86dba46 to 94a1ad5 on January 14, 2025 19:37
@renovate renovate bot changed the title from "Update dependency datasets to v3.2.0" to "Update dependency datasets to v3.3.0" on Feb 14, 2025
@renovate renovate bot force-pushed the renovate/datasets-3.x-lockfile branch from 94a1ad5 to e0cb3f4 on February 17, 2025 16:54
@renovate renovate bot changed the title from "Update dependency datasets to v3.3.0" to "Update dependency datasets to v3.3.1" on Feb 17, 2025
@renovate renovate bot force-pushed the renovate/datasets-3.x-lockfile branch from e0cb3f4 to 48fbf3c on February 20, 2025 17:58
@renovate renovate bot changed the title from "Update dependency datasets to v3.3.1" to "Update dependency datasets to v3.3.2" on Feb 20, 2025
@renovate renovate bot force-pushed the renovate/datasets-3.x-lockfile branch from 48fbf3c to b98bbdf on March 14, 2025 16:51
@renovate renovate bot changed the title from "Update dependency datasets to v3.3.2" to "Update dependency datasets to v3.4.0" on Mar 14, 2025
@renovate renovate bot force-pushed the renovate/datasets-3.x-lockfile branch from b98bbdf to c448d9b on March 17, 2025 19:13
@renovate renovate bot changed the title from "Update dependency datasets to v3.4.0" to "Update dependency datasets to v3.4.1" on Mar 17, 2025
@renovate renovate bot force-pushed the renovate/datasets-3.x-lockfile branch from c448d9b to 5092021 on March 27, 2025 18:09
@renovate renovate bot changed the title from "Update dependency datasets to v3.4.1" to "Update dependency datasets to v3.5.0" on Mar 27, 2025