A demo project showcasing Databricks features for data engineering using open museum datasets.
- A Databricks Asset Bundle providing code in a GitOps-style manner
- A Databricks ETL pipeline specification
- Ingest & prep of CSV & JSON data files
- Handling of list data contained in fields, & multiple schemas across JSON files
- A DBT project for creating materialized views of Met data
- A Spark Declarative Pipeline for creating additional materialized views
- Databricks "expectations" constraints for monitoring data quality
- Several user-defined functions (UDFs) used for convenience
- Data mart providing Markdown-like text describing objects (created from data fields)
- Vector index to facilitate searching documents
- CDC & `MERGE INTO` statements to allow quicker vector index rebuilding
- Plenty of code for normalizing fields, handling missing values, graceful fallback of text fields, etc.
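The Markdown-like object descriptions and the graceful text fallbacks mentioned above can be sketched in plain Python. This is only an illustration of the idea, not the project's actual pipeline code, and the field names (`title`, `artist_display`, `object_date`, `medium`) are assumptions:

```python
def first_nonblank(*values, default="Unknown"):
    """Return the first value that is a non-empty string, else a default."""
    for v in values:
        if isinstance(v, str) and v.strip():
            return v.strip()
    return default

def describe_object(obj: dict) -> str:
    """Render one museum object as a short Markdown-like document,
    falling back gracefully when fields are missing or blank."""
    title = first_nonblank(obj.get("title"), default="Untitled")
    artist = first_nonblank(obj.get("artist_display"), obj.get("artist"),
                            default="Unknown artist")
    date = first_nonblank(obj.get("object_date"), default="date unknown")
    lines = [f"# {title}", f"**Artist:** {artist}", f"**Date:** {date}"]
    if obj.get("medium"):  # optional field: only emitted when present
        lines.append(f"**Medium:** {obj['medium']}")
    return "\n".join(lines)

print(describe_object({"title": "The Harvesters",
                       "artist": "Pieter Bruegel the Elder"}))
```

In the real pipeline this kind of logic would live in Spark UDFs; the fallback chain is the part that matters for building usable documents from sparse fields.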
The free edition of Databricks has some notable limitations, in terms of both missing features and limited compute. Since some tasks can take days to complete on the free edition, the dev environment has been disabled and some tasks are commented out.
Choose how you want to work on this project:
(a) Directly in your Databricks workspace, see https://docs.databricks.com/dev-tools/bundles/workspace.
(b) Locally with an IDE like Cursor or VS Code, see https://docs.databricks.com/dev-tools/vscode-ext.html.
(c) With command line tools, see https://docs.databricks.com/dev-tools/cli/databricks-cli.html
If you're developing with an IDE, dependencies for this project should be installed using uv:
- Make sure you have the uv package manager installed; it's an alternative to tools like pip: https://docs.astral.sh/uv/getting-started/installation/.
- Run `uv sync --dev` to install the project's dependencies.
- src/: Source code for this project.
- resources/: Resource configurations (jobs, pipelines, etc.)
Create the following files:
In `resources/secrets/workspace.yml`:

```yaml
workspace:
  host: "https://<your-workspace>.cloud.databricks.com/"
```

In `.databricks/bundle/{dev,prod}/variable-overrides.json`:

```json
{
  "workspace_host": "https://<your-workspace>.cloud.databricks.com/",
  "owner_account": "<your-account@example.com>",
  "sql_warehouse": "<sql-warehouse-id>"
}
```

The Databricks workspace and IDE extensions provide a graphical interface for working with this project. It's also possible to interact with it directly using the CLI:
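As a quick sanity check before deploying, the shape of `variable-overrides.json` can be validated locally. The required key names come from the file above; the helper itself is a hypothetical convenience, not part of the project:

```python
import json

# Key names taken from the variable-overrides.json example above.
REQUIRED_KEYS = {"workspace_host", "owner_account", "sql_warehouse"}

def check_overrides(text: str) -> list:
    """Return a list of problems found in a variable-overrides.json payload."""
    data = json.loads(text)
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - data.keys())]
    host = data.get("workspace_host", "")
    if host and not host.startswith("https://"):
        problems.append("workspace_host should be an https:// URL")
    return problems

sample = ('{"workspace_host": "https://example.cloud.databricks.com/", '
          '"owner_account": "me@example.com", "sql_warehouse": "abc123"}')
print(check_overrides(sample))  # []
```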
Authenticate to your Databricks workspace, if you have not done so already:
$ databricks configure
To deploy this project, type:
$ databricks bundle deploy
This deploys everything that's defined for this project.
First, create a managed storage volume named "uc" in the workspace catalog and default schema (i.e., workspace.default).
The free edition of Databricks uses the default serverless network policy, which blocks outbound connections. To load the datasets, download them to your computer and then upload the files to the "uc" volume using the Databricks CLI, for example: `databricks fs cp <local-file> dbfs:/Volumes/workspace/default/uc/`.
The Metropolitan Museum of Art CSV can be downloaded here: https://github.com/metmuseum/openaccess
The Art Institute of Chicago has instructions here: https://github.com/art-institute-of-chicago/api-data
These datasets take a long time to load using the free edition (as in, let them run overnight). There is commented-out code in the pipeline for ingesting them; it's left disabled so that playing around with the code doesn't mean waiting forever for runs to finish.
Note that the vector index also takes a long time to do its initial sync; for me, it showed as taking several days to complete, and I had to restart it after exceeding daily limits on the embedding API endpoint.
The files you'll need to run (in order) are:
1. src/met-ingest/ingest.sql
2. src/artic-ingest/extract.ipynb
3. src/artic-ingest/raw_schema.sql
4. src/artic-ingest/ingest_raw_json.py
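A recurring theme in the artic-ingest steps is coping with JSON files whose schemas differ and with list-valued fields (see the feature list at the top). A rough plain-Python illustration of that normalization, with hypothetical field names, not the project's actual Spark code:

```python
import json

# Hypothetical target schema: every record gets these keys; missing ones become None.
TARGET_FIELDS = ["id", "title", "artist_titles"]

def normalize_record(raw: dict) -> dict:
    """Coerce one JSON object onto a fixed schema, tolerating missing keys."""
    rec = {field: raw.get(field) for field in TARGET_FIELDS}
    # List-valued fields are flattened to a delimited string for tabular storage.
    if isinstance(rec["artist_titles"], list):
        rec["artist_titles"] = "; ".join(rec["artist_titles"])
    return rec

docs = [
    '{"id": 1, "title": "Water Lilies", "artist_titles": ["Claude Monet"]}',
    '{"id": 2, "title": "Untitled"}',  # older schema: no artist_titles key
]
rows = [normalize_record(json.loads(d)) for d in docs]
print(rows[1]["artist_titles"])  # None
```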
The datasets themselves are as provided by their respective museums, and modified by the code provided here.
Per the Met's license:
You must not use The Metropolitan Museum of Art’s trademarks or otherwise claim or imply that the Museum or any other third party endorses you or your use of the dataset.
Per the Art Institute of Chicago:
...content may have different licensing terms. Please be mindful of the info.license_text and info.license_links fields within each JSON data file.
(These fields appear in the API; however, they don't seem to be present in the files I downloaded.)
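If you want to check your own downloads for those fields, a small script suffices. The `info.license_text` / `info.license_links` names come from the quote above; the helpers are just an illustrative sketch:

```python
import json
from pathlib import Path

def info_has_license(data: dict) -> bool:
    """True if a parsed JSON document carries info.license_text or info.license_links."""
    info = data.get("info") or {}  # tolerate a missing or null "info" block
    return "license_text" in info or "license_links" in info

def scan(directory: str) -> dict:
    """Report, per *.json file in a directory, whether license fields are present."""
    return {
        p.name: info_has_license(json.loads(p.read_text(encoding="utf-8")))
        for p in sorted(Path(directory).glob("*.json"))
    }
```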
Note that for both museums, media files are licensed separately.