DuckDB has a concept of Hive Partitioning that we want to mimic for large datasets (e.g. player stats) to prevent massive queries from being run.
This will come in two parts:
Defining Dimensions
I propose that we use frontmatter in the SQL files to define dimensionality, so that each dimension is defined by a query whose results enumerate the values to partition on.
Single Dimension Query
`set.sql`

```sql
---
dimensions:
  season: SELECT someColumn as dimension FROM someTable
---
SELECT *
FROM someTable
WHERE someColumn = ${season}
```

This would result in a folder structure one level deep:
```
set/
  season=1/
    data.parquet
    data.csv
    data.json
  season=2/
    data.parquet
    data.csv
    data.json
```
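To make the mechanics concrete, a build step could parse the frontmatter, run the dimension query to enumerate values, and then write one partition folder per value. A minimal Python sketch of that flow — the function names are hypothetical, and the dimension values are stubbed in rather than actually queried through DuckDB:

```python
from pathlib import PurePosixPath


def parse_sql_with_frontmatter(text):
    """Split a set.sql file into its dimensions frontmatter and query body.

    Minimal hand-rolled parsing of the 'dimensions:' block; a real
    implementation would use a YAML library.
    """
    _, frontmatter, body = text.split("---", 2)
    dimensions = {}
    in_dims = False
    for line in frontmatter.splitlines():
        if line.strip() == "dimensions:":
            in_dims = True
        elif in_dims and ":" in line:
            name, query = line.split(":", 1)
            dimensions[name.strip()] = query.strip()
    return dimensions, body.strip()


def partition_paths(dataset, dimension, values, formats=("parquet", "csv", "json")):
    """One folder per dimension value, one data file per output format."""
    return [
        str(PurePosixPath(dataset) / f"{dimension}={value}" / f"data.{fmt}")
        for value in values
        for fmt in formats
    ]
```

With the `set.sql` example above and stubbed season values `[1, 2]`, `partition_paths("set", "season", [1, 2])` yields the `set/season=1/data.parquet` … `set/season=2/data.json` layout shown.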
Multi Dimension Query
`set2.sql`

```sql
---
dimensions:
  season: SELECT someColumn as dimension FROM someTable
  match: SELECT someColumn as dimension FROM someMatchTable
---
SELECT *
FROM someTable
WHERE someColumn = ${season}
  AND someOtherColumn = ${match}
```

This would result in a folder structure two levels deep:
```
set2/
  season=1/
    match=1/
      data.parquet
      data.csv
      data.json
    match=2/
      data.parquet
      data.csv
      data.json
  season=2/
    match=1/
      data.parquet
      data.csv
      data.json
    match=2/
      data.parquet
      data.csv
      data.json
```
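With multiple dimensions, the partition folders are the Cartesian product of the dimension values, nested in the order the dimensions are declared. A standard-library sketch (the helper name is hypothetical):

```python
from itertools import product
from pathlib import PurePosixPath


def hive_paths(dataset, dim_values, formats=("parquet", "csv", "json")):
    """dim_values maps dimension name -> list of values, in declaration order.

    Produces one nested folder per value combination, matching the
    two-level tree above.
    """
    names = list(dim_values)
    paths = []
    for combo in product(*(dim_values[name] for name in names)):
        folder = PurePosixPath(dataset)
        # Nest folders in declaration order, e.g. season=1/match=2/
        for name, value in zip(names, combo):
            folder = folder / f"{name}={value}"
        paths.extend(str(folder / f"data.{fmt}") for fmt in formats)
    return paths
```

Because the declaration order drives the nesting, reordering the `dimensions` block in the frontmatter would reorder the folder hierarchy.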
Proper output
SQL Snippets
Currently our dataset pages have a SQL snippet that creates a view, plus download buttons for the files;
the SQL snippet will need to be updated to load all of the partitioned files:
```sql
CREATE VIEW set AS (
  SELECT *
  FROM read_parquet([
    'https://example.com/data/set/season=1/data.parquet',
    'https://example.com/data/set/season=2/data.parquet',
    'https://example.com/data/set/season=3/data.parquet' -- etc...
  ], hive_partitioning = true)
);
```

Batch Download
We should also consider creating a `.tar.gz` file for each format for the entire set that reflects the file structure;
for example, the parquet download button would reference a `set.parquet.tar.gz` file with the contents:

```
season=1/
  data.parquet
season=2/
  data.parquet
```

This allows users to still download the entire dataset for local analysis.
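Building such an archive is straightforward with the standard library; the sketch below (hypothetical function name) bundles every file of one format while preserving the `season=<n>/` layout:

```python
import tarfile
from pathlib import Path


def build_batch_archive(dataset_dir, fmt, out_path):
    """Bundle every data.<fmt> file under dataset_dir into a .tar.gz,
    keeping the partition folder structure relative to the dataset root
    (a sketch of the proposed set.parquet.tar.gz download)."""
    dataset_dir = Path(dataset_dir)
    with tarfile.open(out_path, "w:gz") as tar:
        for file in sorted(dataset_dir.rglob(f"data.{fmt}")):
            # arcname strips the local prefix, e.g. season=1/data.parquet
            tar.add(file, arcname=str(file.relative_to(dataset_dir)))
```

One archive would be produced per format, so a user who only wants parquet does not pay for the CSV and JSON copies.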
Manifest
It may also be helpful to produce a manifest of URLs for a partitioned set, so that non-DuckDB programs can easily reference all of the files. The structure of the manifest is TBD, and it is not required as part of the first version.
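As a starting point for discussion only (the structure is explicitly TBD above), one possible manifest shape is a JSON document mapping each partition to its per-format URLs; every field name here is an assumption, not a committed design:

```python
import json


def build_manifest(dataset, base_url, partitions, formats=("parquet", "csv", "json")):
    """One possible manifest shape: a JSON document listing every
    partition's files by format, keyed by the hive-style partition path."""
    return json.dumps(
        {
            "dataset": dataset,
            "partitions": [
                {
                    "partition": part,
                    # e.g. https://example.com/data/set/season=1/data.parquet
                    "files": {
                        fmt: f"{base_url}/{dataset}/{part}/data.{fmt}"
                        for fmt in formats
                    },
                }
                for part in partitions
            ],
        },
        indent=2,
    )
```

A flat list of URLs would also work; the nested shape simply lets a consumer filter by partition or format without parsing the paths.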