
Support syncing DuckLake SQLite catalog to/from cloud storage #183

@MaxHalford

Context

DuckLake supports SQLite as a catalog database, which is great for single-user setups. However, when data is stored in cloud storage (R2, S3, GCS), the catalog is still a local file — meaning you need to keep track of it and it's not portable across machines.

Proposal

Add a LEA_DUCKLAKE_CATALOG_URL env var that points to a remote location (e.g., r2://my-bucket/catalog). When set, lea would:

  1. Download the SQLite catalog from the remote URL to a local temp file before running
  2. Attach using ducklake:sqlite:<local_temp_file>
  3. Run the session as normal
  4. Upload the updated catalog back to the remote URL

This gives single users a fully serverless DuckLake setup with no local state — both catalog and data live in the same bucket.
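
A rough sketch of what this could look like, assuming lea drives DuckDB from Python. The names run_with_remote_catalog, run_session, download, and upload are all hypothetical, and cloud credentials are assumed to be configured (e.g. via DuckDB's CREATE SECRET):

```python
import os
import tempfile

import duckdb


def run_with_remote_catalog(run_session, download, upload):
    """Sketch of the proposed flow: pull the catalog, run lea, push it back."""
    catalog_url = os.environ["LEA_DUCKLAKE_CATALOG_URL"]  # e.g. r2://my-bucket/catalog
    local_path = os.path.join(tempfile.mkdtemp(), "catalog.sqlite")
    con = duckdb.connect()

    # 1. Download the SQLite catalog from the remote URL to a local temp file
    download(con, catalog_url, local_path)

    # 2. Attach using the local temp file as the catalog
    con.execute(f"ATTACH 'ducklake:sqlite:{local_path}' AS lake")

    try:
        # 3. Run the session as normal
        run_session(con)
    finally:
        # 4. Upload the updated catalog back to the remote URL
        con.execute("DETACH lake")
        upload(con, catalog_url, local_path)
```

The download/upload helpers could use the parquet-blob trick from the proof of concept below; the download step would also need to tolerate a missing remote catalog on the first run.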

Proof of concept

Tested and confirmed working with R2:

  • Download via read_parquet() (catalog stored as a blob in a parquet file)
  • Upload via COPY ... TO ... (FORMAT PARQUET) with a temp table containing the blob
  • Full round-trip preserves the SQLite file perfectly
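
Roughly, the round-trip looked like this (the payload column name and bucket path are made up; DuckDB's httpfs extension and R2 credentials, e.g. CREATE SECRET (TYPE R2, ...), are assumed):

```python
import duckdb

con = duckdb.connect()
url = "r2://my-bucket/catalog.parquet"  # hypothetical location


def upload(con, url, local_path):
    # Wrap the SQLite file's bytes in a one-row temp table, then COPY it out
    with open(local_path, "rb") as f:
        blob = f.read()
    con.execute("CREATE OR REPLACE TEMP TABLE _catalog (payload BLOB)")
    con.execute("INSERT INTO _catalog VALUES (?)", [blob])
    con.execute(f"COPY _catalog TO '{url}' (FORMAT PARQUET)")


def download(con, url, local_path):
    # Read the blob straight back out of the parquet file
    (blob,) = con.execute(f"SELECT payload FROM read_parquet('{url}')").fetchone()
    with open(local_path, "wb") as f:
        f.write(blob)


# Round-trip: the SQLite catalog comes back byte-for-byte identical
upload(con, url, "catalog.sqlite")
download(con, url, "restored.sqlite")
assert open("catalog.sqlite", "rb").read() == open("restored.sqlite", "rb").read()
```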

Considerations

  • No concurrent access — SQLite is single-writer, so this is only suitable for single-user setups. For multi-user setups, a PostgreSQL catalog is the right choice.
  • Locking — could add advisory locking (e.g., a lock file in the bucket) to prevent accidental concurrent runs; see the sketch after this list.
  • First run — when no remote catalog exists yet, start fresh.
  • Upload on failure — should probably still upload even if some scripts failed, to preserve partial state (audit tables get cleaned up on next run anyway).
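
For the locking idea, a minimal sketch of a bucket lock file, assuming an S3-compatible endpoint that honors If-None-Match on PutObject (R2 does) and a recent boto3; the endpoint and key names are made up:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3", endpoint_url="https://<account-id>.r2.cloudflarestorage.com")


def acquire_lock(bucket, key="catalog.lock"):
    try:
        # Conditional put: only succeeds if the lock object doesn't exist yet
        s3.put_object(Bucket=bucket, Key=key, Body=b"", IfNoneMatch="*")
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "PreconditionFailed":
            return False  # another run holds the lock
        raise


def release_lock(bucket, key="catalog.lock"):
    s3.delete_object(Bucket=bucket, Key=key)
```

A stale-lock timeout would probably also be needed so a crashed run doesn't block the bucket indefinitely.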

🤖 Generated with Claude Code
