
Commit 20b0acd

Add guide to upload Lance datasets to HF (#181)
1 parent 29651ca commit 20b0acd

1 file changed

Lines changed: 72 additions & 2 deletions

File tree

docs/integrations/ai/huggingface.mdx

@@ -1,6 +1,6 @@
---
title: "Hugging Face Hub"
- sidebarTitle: "Hugging Face"
+ sidebarTitle: "Hugging Face Hub"
description: "Use LanceDB directly on Lance datasets hosted on the Hugging Face Hub for multimodal search and retrieval."
---

@@ -241,7 +241,7 @@ fts_results = (
| Dog Running in Water | https://static.wixstatic.com/m… | 14.756516 |
| Dogs on the run by heidiannemo… | http://ih2.redbubble.net/image… | 14.756516 |

- ## Downloading the full dataset
+ ## Download the full dataset

<Warning>
You may hit Hugging Face rate limits when streaming large samples from `hf://`, despite using a Hugging Face token.
@@ -255,6 +255,76 @@ Here's how to download the entire dataset via the [Hugging Face CLI](https://hug
huggingface-cli download lance-format/laion-1m --repo-type dataset --local-dir ./laion-1m
```

## Upload your own datasets to Hugging Face in Lance format

This section shows how to upload your own Lance datasets to the Hugging Face Hub and share them with the community.

First, install the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) and export both `OPENAI_API_KEY` and `HF_TOKEN`.
Then, create a Lance dataset using LanceDB on your local machine and upload it to the Hub via a CLI command.

```bash
export OPENAI_API_KEY=...
export HF_TOKEN=hf_...
hf auth login --token "$HF_TOKEN"
```

A typical sequence of steps is given below.

### 1. Upload your local directory to the Hub

Upload the full local directory to a specified repository on the Hugging Face Hub. The command below uploads the contents of your local LanceDB directory at `/path/to/your_local_dir` to a new repository named `your_hf_org/repo_name` under your Hugging Face account.

```bash bash icon="code"
hf upload-large-folder /path/to/your_local_dir your_hf_org/repo_name \
  --repo-type dataset \
  --revision main
```

<Info>
The `upload-large-folder` command is designed for [uploading large datasets](https://huggingface.co/docs/huggingface_hub/en/guides/upload) (potentially terabytes in size) and will handle multipart uploads, retries, and resuming interrupted uploads.
</Info>

### 2. Inspect dataset versions

Because you can query your remote dataset directly from Hugging Face with `hf://` URIs in LanceDB, you can inspect dataset versions and updates on the Hub without downloading the data locally. This is useful for keeping track of changes to the dataset as you iterate on your data collection and curation process.

```python Python icon="python"
import lancedb

db = lancedb.connect("hf://datasets/your_hf_org/repo_name")
table = db.open_table("table_name")

versions = table.list_versions()
print(versions)
```

This will print the list of versions available for the dataset on the Hub, along with metadata such as creation date and description.

### 3. Add a dataset card

The Hub dataset card communicates the schema and usage of the dataset to other developers. It lives at the repo's root in a file named `README.md` on the Hub.
This project keeps the source card text in `HF_DATASET_CARD.md`, so you can make updates to the card there and upload it as `README.md` with the HF CLI.
This requires a regular `hf upload`, because it is a single-file upload to a specific target path (add a custom commit message if you wish):

```bash
hf upload lancedb/magical_kingdom HF_DATASET_CARD.md README.md \
  --repo-type dataset \
  --commit-message "Update dataset card"
```
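If you don't have a card file yet, a dataset card is plain Markdown with an optional YAML front-matter block for Hub metadata. A stdlib-only sketch that writes a starter `HF_DATASET_CARD.md` (every field value below is a placeholder to edit):

```python
from pathlib import Path

# Minimal dataset card: YAML front matter (Hub metadata) + Markdown body
card = """---
license: apache-2.0
tags:
  - lance
  - lancedb
---

# your_hf_org/repo_name

A Lance-format dataset queryable with LanceDB via `hf://` URIs.

## Schema

| column | type   | description      |
|--------|--------|------------------|
| text   | string | caption text     |
| vector | list   | embedding vector |
"""

Path("HF_DATASET_CARD.md").write_text(card, encoding="utf-8")
```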

### 4. Update the dataset

Over time, you may want to add new rows (append) or new columns (backfill) to your dataset as your needs evolve. Make the necessary updates to your local dataset using LanceDB, then upload the updated version back to the Hub with the same `hf upload-large-folder` command.

```bash bash icon="code"
hf upload-large-folder /path/to/your_local_dir your_hf_org/repo_name \
  --repo-type dataset \
  --revision main
```

The CLI uploads only the data that has changed since the last upload, avoiding wasted I/O while making it easy to keep your dataset up to date on the Hub.

That's it! Your dataset is now updated on the Hub with the new data and schema changes, and other users can query the latest version of the dataset directly from Hugging Face with `hf://` URIs in LanceDB.

## Explore more Lance datasets on Hugging Face

The LanceDB team is actively uploading useful and interesting datasets in Lance format to the Hugging Face Hub
