
Database backend abstraction: pluggable Dialect for PostgreSQL support #261

@webgress

Description

TL;DR: I want to add PostgreSQL as an opt-in database backend (SQLite stays the default). The approach: a thin Dialect interface that the existing Store delegates to for the ~50 database-specific call sites — FTS5 vs tsvector, datetime() vs NOW(), etc. The other ~90% of queries are standard SQL and only need Rebind(). I'd split it into two PRs: first extract SQLiteDialect (zero functional change), then add PostgreSQLDialect. Happy to do the work if you're open to it.

Motivation

msgvault is excellent as a local email archive. I'm building a semantic search
and classification layer on top of it (pgvector embeddings, LLM-based labeling,
cross-entity graph connecting emails/documents/messages). The natural home for
that layer is PostgreSQL — and having email metadata in SQLite while embeddings
and intelligence live in PostgreSQL means cross-database joins that can't be
optimized by either query planner.

I suspect others building on msgvault's MCP server or HTTP API will hit similar
friction: the moment you want to enrich the archive with computed data
(embeddings, classifications, entity resolution, analytics beyond what
DuckDB/Parquet covers), you need the email metadata co-located with that data.

Co-locating everything in PostgreSQL removes that friction and keeps those joins inside a single query planner.

SQLite should absolutely remain the default — it's the right choice for the
single-user, local-first use case. This proposal adds PostgreSQL as an
opt-in alternative, not a replacement.

Prior art in the codebase

The codebase already anticipates this:

  1. Rebind() method (store.go:286) with an explicit TODO:

    // When PostgreSQL support is added, this will convert ? to $1, $2, etc.

    24 call sites already use Rebind().

  2. Split schema files: schema.sql (portable DDL) is already separated
    from schema_sqlite.sql (FTS5 virtual table). The base schema is
    valid PostgreSQL with minor adjustments.

  3. Standard SQL for most queries: The vast majority of the 125+ queries
    use standard SQL. SQLite-specific constructs are concentrated in a
    well-defined set of locations.
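To make the Rebind() TODO concrete, here's a minimal sketch of the PostgreSQL-side rewrite — ? placeholders become $1, $2, ... The function name is hypothetical, and a production version would also need to skip ? inside quoted string literals:

```go
package main

import (
	"fmt"
	"strings"
)

// rebindPg rewrites each ? placeholder to $1, $2, ... — a sketch of
// what Rebind could do for a PostgreSQL dialect. (It does not yet
// handle ? inside quoted literals.)
func rebindPg(query string) string {
	var b strings.Builder
	n := 0
	for _, r := range query {
		if r == '?' {
			n++
			fmt.Fprintf(&b, "$%d", n)
		} else {
			b.WriteRune(r)
		}
	}
	return b.String()
}

func main() {
	fmt.Println(rebindPg("SELECT id FROM messages WHERE account_id = ? AND date > ?"))
	// SELECT id FROM messages WHERE account_id = $1 AND date > $2
}
```

Since the 24 call sites already route through Rebind(), swapping this in is the whole placeholder story for standard queries.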

Inventory of SQLite-specific constructs

Scanning through the internal/ tree, here's what I see:

| Construct | Count | PostgreSQL equivalent | Complexity |
| --- | --- | --- | --- |
| FTS5 (messages_fts, MATCH, rowid joins) | 51 refs, 6 files | tsvector + GIN index + ts_rank | High (but well-contained) |
| datetime('now') | 35 refs | NOW() / CURRENT_TIMESTAMP | Mechanical |
| GROUP_CONCAT(col, sep) | 9 refs | STRING_AGG(col, sep) | Mechanical |
| INSERT OR IGNORE | 6 refs | ON CONFLICT DO NOTHING | Mechanical |
| INSERT OR REPLACE | 3 refs | FTS-specific; obviated by generated tsvector column | Goes away |
| PRAGMA statements | 11 refs | Connection pool config / SET | Straightforward |
| isSQLiteError() | 11 refs | Driver-agnostic error type | Straightforward |
| rowid references | 13 refs | Standard PK (mostly FTS-related) | Goes away with tsvector |
| ? placeholders | All queries | $1, $2, ... via Rebind() | Already stubbed |

~90% of queries need no changes beyond Rebind(). The SQLite-specific
surface area is concentrated, not scattered.

Proposed approach: Dialect interface

Rather than two separate Store implementations (high duplication, hard to
maintain), I propose a thin Dialect interface that the single Store
delegates to for database-specific behavior:

// Dialect abstracts database-specific SQL generation and behavior.
type Dialect interface {
    // Query rewriting
    Rebind(query string) string           // ? -> $1,$2,... for PostgreSQL
    Now() string                          // datetime('now') vs NOW()
    GroupConcat(col, sep string) string    // GROUP_CONCAT vs STRING_AGG

    // Full-text search
    FTSInsert(messageID int64, subject, body, from, to, cc string) string
    FTSSearch(term string) (joinClause, whereClause string)
    FTSDelete(messageID int64) string
    SchemaFTS() string

    // Upsert patterns
    UpsertIgnore(table string, cols []string, conflictCols []string) string

    // Connection lifecycle
    InitConn(db *sql.DB) error            // PRAGMAs vs SET/pool config
    SchemaSQL() []string                  // DDL statements for init

    // Error handling
    IsConflictError(err error) bool
    IsDuplicateColumnError(err error) bool
}
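To illustrate the delegation pattern, here's a compilable sketch using a two-method slice of the interface. The query text and struct names are illustrative, not the actual Store code:

```go
package main

import "fmt"

// Minimal slice of the proposed interface — just the methods this
// sketch exercises.
type Dialect interface {
	Now() string
	GroupConcat(col, sep string) string
}

type SQLiteDialect struct{}

func (SQLiteDialect) Now() string { return "datetime('now')" }
func (SQLiteDialect) GroupConcat(col, sep string) string {
	return fmt.Sprintf("GROUP_CONCAT(%s, '%s')", col, sep)
}

type PostgreSQLDialect struct{}

func (PostgreSQLDialect) Now() string { return "NOW()" }
func (PostgreSQLDialect) GroupConcat(col, sep string) string {
	return fmt.Sprintf("STRING_AGG(%s, '%s')", col, sep)
}

// buildQuery shows how a Store method stays dialect-agnostic: the SQL
// skeleton is shared, only dialect-specific fragments are injected.
func buildQuery(d Dialect) string {
	return fmt.Sprintf("SELECT %s, %s FROM labels GROUP BY message_id",
		d.GroupConcat("name", ", "), d.Now())
}

func main() {
	fmt.Println(buildQuery(SQLiteDialect{}))
	fmt.Println(buildQuery(PostgreSQLDialect{}))
}
```

The point is that ~90% of Store methods would call the dialect only at these injection points, leaving the surrounding SQL untouched.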

Phase 1: Extract SQLiteDialect — move all current SQLite-specific
behavior behind the interface. Zero functional change. All existing tests
pass unmodified.

Phase 2: Implement PostgreSQLDialect — schema_pg.sql with tsvector
columns + GIN index, pgx driver, $N placeholders, STRING_AGG, etc.

Phase 3: Config-driven selection — config.toml gains a
database_url option:

[data]
# SQLite (default, unchanged):
# database_url = "sqlite:///home/user/.msgvault/msgvault.db"

# PostgreSQL (opt-in):
database_url = "postgres://user:pass@localhost:5432/msgvault"

Driver is inferred from the URL scheme. No behavioral change for existing
SQLite users.
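The scheme-based inference could be as simple as the following sketch. The driver names ("sqlite", "pgx") and the empty-URL defaulting rule are assumptions, not settled API:

```go
package main

import (
	"fmt"
	"strings"
)

// driverForURL maps a database_url scheme to a database/sql driver
// name. Sketch only: driver names and defaulting are assumptions.
func driverForURL(databaseURL string) (string, error) {
	switch {
	case databaseURL == "", strings.HasPrefix(databaseURL, "sqlite:"):
		return "sqlite", nil // default backend, unchanged
	case strings.HasPrefix(databaseURL, "postgres:"),
		strings.HasPrefix(databaseURL, "postgresql:"):
		return "pgx", nil
	default:
		return "", fmt.Errorf("unsupported database_url scheme: %q", databaseURL)
	}
}

func main() {
	d, _ := driverForURL("postgres://user:pass@localhost:5432/msgvault")
	fmt.Println(d) // pgx
}
```

An unset database_url falls through to SQLite, which is what guarantees zero behavioral change for existing installs.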

FTS5 -> tsvector migration detail

This is the interesting part, so here's the concrete mapping:

SQLite FTS5:

CREATE VIRTUAL TABLE messages_fts USING fts5(
    message_id UNINDEXED,
    subject, body, from_addr, to_addr, cc_addr,
    tokenize='unicode61 remove_diacritics 1'
);

-- Search:
SELECT m.* FROM messages_fts fts
JOIN messages m ON m.id = fts.rowid
WHERE messages_fts MATCH ?
ORDER BY rank;

-- Insert:
INSERT OR REPLACE INTO messages_fts(rowid, message_id, subject, body, ...)
VALUES (?, ?, ?, ?, ...);

PostgreSQL tsvector equivalent:

-- Column on messages table (or message_bodies):
ALTER TABLE messages ADD COLUMN search_fts TSVECTOR
    GENERATED ALWAYS AS (
        setweight(to_tsvector('simple', COALESCE(subject, '')), 'A') ||
        setweight(to_tsvector('simple', COALESCE(from_addr, '')), 'B') ||
        to_tsvector('simple', COALESCE(body_text, ''))
    ) STORED;

CREATE INDEX messages_search_fts_idx ON messages USING GIN (search_fts);

-- Search:
SELECT m.* FROM messages m
WHERE m.search_fts @@ plainto_tsquery('simple', $1)
ORDER BY ts_rank(m.search_fts, plainto_tsquery('simple', $1)) DESC;

-- Insert: automatic (generated column), no separate INSERT needed.

The PostgreSQL approach is actually simpler — no separate FTS table to maintain,
no INSERT OR REPLACE, no rowid joins. The GENERATED ALWAYS AS column
auto-updates when the source columns change.

What I'm offering

I'm prepared to implement this end-to-end and submit a PR. I'd like to
align on the approach before writing code:

  1. Is a Dialect interface the right abstraction level, or would you prefer
    a different pattern?
  2. Should schema_pg.sql live alongside schema_sqlite.sql in
    internal/store/, or in a separate package?
  3. Any preferences on the PostgreSQL driver? I'd default to pgx (most
    popular, pure Go, good performance).
  4. Should the DuckDB analytics layer remain SQLite-only (reading from
    Parquet), or should it also support reading from PostgreSQL? I'd suggest
    keeping DuckDB as-is initially — it reads from Parquet anyway, and
    build-cache can export from either backend.

Scope boundary

The two PRs would cover:

  • PR 1
    • Dialect interface + SQLiteDialect (extract, no functional change)
  • PR 2
    • PostgreSQLDialect implementation
    • schema_pg.sql
    • Config-driven backend selection
    • Tests passing against both backends (CI can use a PostgreSQL service
      container)

These PRs would NOT cover:

  • pgvector / semantic search (separate feature, enabled by this foundation)
  • Migration tooling (SQLite -> PostgreSQL data migration)
  • Changes to DuckDB/Parquet analytics layer
  • Changes to the TUI, MCP server, or HTTP API
