## Upload your own datasets to Hugging Face in Lance format

This section shows how to upload your own Lance datasets to the Hugging Face Hub and share them with the community.

First, install the [Hugging Face CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli) and export both `OPENAI_API_KEY` and `HF_TOKEN`.
Then create a Lance dataset with LanceDB on your local machine and upload it to the Hub with a CLI command.

```bash
export OPENAI_API_KEY=...
export HF_TOKEN=hf_...
hf auth login --token "$HF_TOKEN"
```
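
Before running the later steps from a script, it can help to confirm that both variables are actually set. The following is a minimal sketch using only the Python standard library; the helper name `missing_env_vars` is hypothetical and not part of any Hugging Face or LanceDB API.

```python
import os

# The two variables this guide asks you to export.
REQUIRED_VARS = ("OPENAI_API_KEY", "HF_TOKEN")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names in `names` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Illustration with an explicit mapping instead of the real environment.
example_env = {"HF_TOKEN": "hf_example", "OPENAI_API_KEY": ""}
print(missing_env_vars(env=example_env))  # prints ['OPENAI_API_KEY']
```

Calling `missing_env_vars()` with no arguments checks the real process environment.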

A typical sequence of steps is given below.

### 1. Upload your local directory to the Hub

Upload the full local directory to a repository on the Hugging Face Hub with `hf upload-large-folder`. The command takes the contents of your local LanceDB directory at `/path/to/your_local_dir` and uploads them to a new repository named `your_hf_org/repo_name` under your Hugging Face account.

<Info>
The `upload-large-folder` command is designed for [uploading large datasets](https://huggingface.co/docs/huggingface_hub/en/guides/upload) (potentially terabytes in size) and handles multipart uploads, retries, and resuming interrupted uploads.
</Info>
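
If you prefer to drive the upload from a Python script, the CLI call can be wrapped as in the sketch below. It assumes the `hf` executable is on your `PATH` and that `upload-large-folder` takes the repository ID, then the local path, plus a `--repo-type` option. The helper names `build_upload_command` and `upload_folder` are hypothetical, and the repository and directory are the placeholders used above.

```python
import shlex
import subprocess


def build_upload_command(repo_id, local_dir):
    """Build the `hf upload-large-folder` invocation for a dataset repo."""
    # Assumed argument order: repo ID first, then the local directory.
    return ["hf", "upload-large-folder", repo_id, local_dir, "--repo-type", "dataset"]


def upload_folder(repo_id, local_dir, dry_run=True):
    """Print the command (dry run) or execute it with the HF CLI."""
    cmd = build_upload_command(repo_id, local_dir)
    if dry_run:
        # Show the exact shell command without touching the network.
        print(shlex.join(cmd))
    else:
        subprocess.run(cmd, check=True)
    return cmd


# Dry run with the placeholder values from this guide.
upload_folder("your_hf_org/repo_name", "/path/to/your_local_dir")
```

Pass `dry_run=False` once the printed command looks right.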

### 2. Inspect dataset versions

Because LanceDB can query a remote dataset directly from Hugging Face through `hf://` URIs, you can inspect dataset versions and updates on the Hub without downloading the data locally. This helps you track changes to the dataset as you iterate on data collection and curation.

```python Python icon="python"
import lancedb

db = lancedb.connect("hf://datasets/your_hf_org/repo_name")
table = db.open_table("table_name")

versions = table.list_versions()
print(versions)
```
This prints the list of versions available for the dataset on the Hub, along with their metadata such as creation date and description.
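
The raw list can be hard to scan once a dataset has many versions. The sketch below summarizes it, assuming each entry is a dictionary with a `version` number and a `timestamp` datetime. That shape is an assumption about what `list_versions()` returns, and the sample data and helper names are hypothetical.

```python
from datetime import datetime


def latest_version(versions):
    """Return the entry with the highest version number, or None if empty."""
    return max(versions, key=lambda v: v["version"], default=None)


def summarize(versions):
    """Return one 'v<N>  <timestamp>' line per version, oldest first."""
    ordered = sorted(versions, key=lambda v: v["version"])
    return [f"v{v['version']}  {v['timestamp']:%Y-%m-%d %H:%M}" for v in ordered]


# Hypothetical sample in the assumed shape; replace with table.list_versions().
sample_versions = [
    {"version": 2, "timestamp": datetime(2025, 1, 12, 9, 30), "metadata": {}},
    {"version": 1, "timestamp": datetime(2025, 1, 10, 14, 5), "metadata": {}},
]

for line in summarize(sample_versions):
    print(line)
print("latest:", latest_version(sample_versions)["version"])
```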

### 3. Add a dataset card

The Hub dataset card communicates the schema and usage of the dataset to other developers. It lives at the repository root in a file named `README.md`.
This project keeps the source card text in `HF_DATASET_CARD.md`, so you can edit the card there and upload it as `README.md` with the HF CLI.

This requires a regular `hf upload` because it is a single-file upload to a specific target path (you can add a custom commit message if you wish).
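
As an illustration, the single-file upload could be scripted as in the sketch below. It assumes `hf upload` accepts the repository ID, the local file, and the target path in the repository, along with `--repo-type` and `--commit-message` options. The helper name `build_card_upload_command` and the commit message are hypothetical.

```python
import shlex


def build_card_upload_command(repo_id, card_path="HF_DATASET_CARD.md", commit_message=None):
    """Build an `hf upload` call that publishes the card file as README.md."""
    # Assumed positional order: repo ID, local file, path inside the repo.
    cmd = ["hf", "upload", repo_id, card_path, "README.md", "--repo-type", "dataset"]
    if commit_message:
        cmd += ["--commit-message", commit_message]
    return cmd


# Print the shell command for the placeholder repository used in this guide.
cmd = build_card_upload_command("your_hf_org/repo_name", commit_message="Update dataset card")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)`.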

Over time, you may want to add new rows (append) or new columns (backfill) to your dataset as your needs evolve. Make the updates to your local dataset with LanceDB, then upload the updated version to the Hub with the same `hf upload-large-folder` command.
The CLI uploads only the data that has changed since the last upload, which avoids wasted I/O and makes it easy to keep your dataset up to date on the Hub.

That's it! Your dataset is now updated on the Hub with the new data and schema changes, and other users can query the latest version of the dataset directly from Hugging Face with `hf://` URIs in LanceDB.

## Explore more Lance datasets on Hugging Face

The LanceDB team is actively uploading useful and interesting datasets in Lance format to the Hugging Face Hub