A demo project showcasing Databricks features for data engineering using open museum datasets.
- A Databricks Asset Bundle providing code in a GitOps-style manner
- A Databricks ETL pipeline specification
- Ingest & prep of CSV & JSON data files
- Handling of list data contained in fields, & multiple schemas across JSON files
- A DBT project for creating materialized views of Met data
- A Spark Declarative Pipeline for creating additional materialized views
- Databricks "expectations" constraints for monitoring data quality
- Several user-defined functions (UDFs) used for convenience
- Data mart providing Markdown-like text describing objects (created from data fields)
- Vector index to facilitate searching documents
- CDC & `MERGE INTO` statements to allow quicker vector index rebuilding
- Plenty of code for normalizing fields, handling missing values, graceful fallback of text fields, etc.
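The Markdown-like object descriptions and the graceful text fallbacks mentioned above can be sketched in plain Python. This is only an illustration of the idea, not the project's actual pipeline code, and the field names (`title`, `artist_display`, `object_date`, `medium`) are assumptions:

```python
def first_nonblank(*values, default="Unknown"):
    """Return the first value that is a non-empty string, else a default."""
    for v in values:
        if isinstance(v, str) and v.strip():
            return v.strip()
    return default

def describe_object(obj: dict) -> str:
    """Render one museum object as a short Markdown-like document,
    falling back gracefully when fields are missing or blank."""
    title = first_nonblank(obj.get("title"), default="Untitled")
    artist = first_nonblank(obj.get("artist_display"), obj.get("artist"),
                            default="Unknown artist")
    date = first_nonblank(obj.get("object_date"), default="date unknown")
    lines = [f"# {title}", f"**Artist:** {artist}", f"**Date:** {date}"]
    if obj.get("medium"):  # optional field: only emitted when present
        lines.append(f"**Medium:** {obj['medium']}")
    return "\n".join(lines)

print(describe_object({"title": "The Harvesters",
                       "artist": "Pieter Bruegel the Elder"}))
```

In the real pipeline this kind of logic would live in Spark UDFs; the fallback chain is the part that matters for building usable documents from sparse fields.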
The free edition of Databricks has some notable limitations, in terms of both missing features and limited compute. Since some tasks can take days to complete on the free edition, the dev environment has been disabled and some tasks are commented out.
Choose how you want to work on this project:
(a) Directly in your Databricks workspace, see https://docs.databricks.com/dev-tools/bundles/workspace.
(b) Locally with an IDE like Cursor or VS Code, see https://docs.databricks.com/dev-tools/vscode-ext.html.
(c) With command line tools, see https://docs.databricks.com/dev-tools/cli/databricks-cli.html
If you're developing with an IDE, dependencies for this project should be installed using uv:
- Make sure you have the uv package manager installed; it's an alternative to tools like pip: https://docs.astral.sh/uv/getting-started/installation/.
- Run `uv sync --dev` to install the project's dependencies.
- src/: Source code for this project.
- resources/: Resource configurations (jobs, pipelines, etc.)
Create the following files:
In `resources/secrets/workspace.yml`:

```yaml
workspace:
  host: "https://<your-workspace>.cloud.databricks.com/"
```

In `.databricks/bundle/{dev,prod}/variable-overrides.json`:

```json
{
  "workspace_host": "https://<your-workspace>.cloud.databricks.com/",
  "owner_account": "<your-account@example.com>",
  "sql_warehouse": "<sql-warehouse-id>"
}
```

The Databricks workspace and IDE extensions provide a graphical interface for working with this project. It's also possible to interact with it directly using the CLI:
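As a quick sanity check before deploying, the shape of `variable-overrides.json` can be validated locally. The required key names come from the file above; the helper itself is a hypothetical convenience, not part of the project:

```python
import json

# Key names taken from the variable-overrides.json example above.
REQUIRED_KEYS = {"workspace_host", "owner_account", "sql_warehouse"}

def check_overrides(text: str) -> list:
    """Return a list of problems found in a variable-overrides.json payload."""
    data = json.loads(text)
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - data.keys())]
    host = data.get("workspace_host", "")
    if host and not host.startswith("https://"):
        problems.append("workspace_host should be an https:// URL")
    return problems

sample = ('{"workspace_host": "https://example.cloud.databricks.com/", '
          '"owner_account": "me@example.com", "sql_warehouse": "abc123"}')
print(check_overrides(sample))  # []
```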
Authenticate to your Databricks workspace, if you have not done so already:
$ databricks configure
To deploy this project, type:
$ databricks bundle deploy
This deploys everything that's defined for this project.
First, create a managed storage volume named "uc" in the workspace catalog and default schema (i.e., workspace.default).
The free edition of Databricks uses the default serverless network policy, which blocks outbound connections. To load the datasets, download them to your computer and then upload the files to the "uc" volume using the Databricks CLI, for example: `databricks fs cp <local-file> dbfs:/Volumes/workspace/default/uc/`.
The Metropolitan Museum of Art CSV can be downloaded here: https://github.com/metmuseum/openaccess
The Art Institute of Chicago has instructions here: https://github.com/art-institute-of-chicago/api-data
These datasets take a long time to load using the free edition (as in, let them run overnight). There is commented-out code in the pipeline for ingesting them; it's left disabled so that playing around with the code doesn't mean waiting forever for runs to finish.
Note that the vector index also takes a long time to do its initial sync; for me, it showed as taking several days to complete, and I had to restart it after exceeding daily limits on the embedding API endpoint.
The files you'll need to run (in order) are:
1. src/met-ingest/ingest.sql
2. src/artic-ingest/extract.ipynb
3. src/artic-ingest/raw_schema.sql
4. src/artic-ingest/ingest_raw_json.py
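A recurring theme in the artic-ingest steps is coping with JSON files whose schemas differ and with list-valued fields (see the feature list at the top). A rough plain-Python illustration of that normalization, with hypothetical field names, not the project's actual Spark code:

```python
import json

# Hypothetical target schema: every record gets these keys; missing ones become None.
TARGET_FIELDS = ["id", "title", "artist_titles"]

def normalize_record(raw: dict) -> dict:
    """Coerce one JSON object onto a fixed schema, tolerating missing keys."""
    rec = {field: raw.get(field) for field in TARGET_FIELDS}
    # List-valued fields are flattened to a delimited string for tabular storage.
    if isinstance(rec["artist_titles"], list):
        rec["artist_titles"] = "; ".join(rec["artist_titles"])
    return rec

docs = [
    '{"id": 1, "title": "Water Lilies", "artist_titles": ["Claude Monet"]}',
    '{"id": 2, "title": "Untitled"}',  # older schema: no artist_titles key
]
rows = [normalize_record(json.loads(d)) for d in docs]
print(rows[1]["artist_titles"])  # None
```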
The datasets themselves are as provided by their respective museums, and modified by the code provided here.
Per the Met's license:
You must not use The Metropolitan Museum of Art’s trademarks or otherwise claim or imply that the Museum or any other third party endorses you or your use of the dataset.
Per the Art Institute of Chicago:
...content may have different licensing terms. Please be mindful of the info.license_text and info.license_links fields within each JSON data file.
(These fields appear in the API; however, they don't seem to be present in the files I downloaded.)
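If you want to check your own downloads for those fields, a small script suffices. The `info.license_text` / `info.license_links` names come from the quote above; the helpers are just an illustrative sketch:

```python
import json
from pathlib import Path

def info_has_license(data: dict) -> bool:
    """True if a parsed JSON document carries info.license_text or info.license_links."""
    info = data.get("info") or {}  # tolerate a missing or null "info" block
    return "license_text" in info or "license_links" in info

def scan(directory: str) -> dict:
    """Report, per *.json file in a directory, whether license fields are present."""
    return {
        p.name: info_has_license(json.loads(p.read_text(encoding="utf-8")))
        for p in sorted(Path(directory).glob("*.json"))
    }
```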
Note that for both museums, media files are licensed separately.