
feat(duckdb): cross-database federation via derived DuckDB connection#295

Open
kevinmessiaen wants to merge 47 commits into main from feat/duckdb-federation


@kevinmessiaen kevinmessiaen commented Jun 12, 2026


Summary

Adds cross-database federation to ktx. When a project declares 2+ attach-compatible databases (postgres, mysql, sqlite), ktx derives a virtual _ktx_federated connection backed by an embedded DuckDB that ATTACHes each member read-only and runs cross-catalog joins. From the semantic layer's view there is one connection; DuckDB fans out to the real databases underneath. Live data, no copy.

Answers the question that motivated this: books in postgres, reviews in sqlite, joined in one query.

End to end: a federated query goes from ktx.yaml to returned rows against real configs; raw federated SQL works through the ktx sql CLI, the ingest path, and the MCP sql_execution path; and declared cross-DB joins survive a re-scan.

Design (locked decisions)

  • Federation IS the connection. A pure deriveFederatedConnection(connections, projectDir) computes a descriptor from declared state. Never persisted — no ktx.yaml entry, no flag, no _ktx_federated/ directory. Recomputed every run.
  • Activates only when 2+ attach-compatible members exist (postgres/mysql/sqlite). 0 or 1 → nothing changes. Other drivers (snowflake/bigquery/clickhouse/sqlserver) stay standalone.
  • Embedded DuckDB, fresh in-memory instance per query (MCP-driven, sporadic — no warm pool to hold idle connections).
  • Member connectionId → DuckDB catalog alias (connectionId.schema.table), quoted so hyphenated ids attach correctly.
  • Read-only enforced at three layers: ATTACH ... READ_ONLY (physical), assertReadOnlySql (statement), and caller-side validateReadOnly with the duckdb dialect.
  • _ktx_ is a reserved connection-id namespace.
  • Member config is resolved through each connector's canonical resolver — federation and standalone scans interpret ktx.yaml identically (sqlite path:/url:/~/project-relative; postgres/mysql discrete-fields-or-URL, SSL, search_path).
  • One implementation of read-only SQL execution. Every entry point — ktx sql, ingest, and MCP sql_execution — routes through the shared executeProjectReadOnlySql, so the federated-vs-direct decision follows from the connection id, not from which caller invoked it. The CLI and the agent expose the identical set of choices.
  • Cross-DB joins are declared-only in v1. Automatic cross-DB discovery is a follow-up.
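
To make the activation rule concrete, here is a minimal sketch of a pure derivation under the decisions above. deriveFederatedConnection, FEDERATED_CONNECTION_ID, ATTACH_COMPATIBLE_DRIVERS, and FederatedMember are names from this PR; the type shapes and the undefined return for "no federation" are assumptions for illustration, not the code in federation.ts.

```typescript
// Hypothetical shapes: the real types in federation.ts may differ.
type Driver =
  | "postgres" | "mysql" | "sqlite"
  | "snowflake" | "bigquery" | "clickhouse" | "sqlserver";

interface ConnectionConfig {
  driver: Driver;
  [key: string]: unknown;
}

interface FederatedMember {
  connectionId: string;
  driver: Driver;
  config: ConnectionConfig;
  projectDir: string;
}

interface FederatedConnection {
  id: string;
  members: FederatedMember[];
}

const FEDERATED_CONNECTION_ID = "_ktx_federated";
const ATTACH_COMPATIBLE_DRIVERS: ReadonlySet<Driver> = new Set<Driver>([
  "postgres",
  "mysql",
  "sqlite",
]);

// Pure function of declared state: nothing is read from or written to disk,
// so the descriptor can be recomputed on every run.
function deriveFederatedConnection(
  connections: Record<string, ConnectionConfig>,
  projectDir: string,
): FederatedConnection | undefined {
  const members: FederatedMember[] = [];
  for (const connectionId of Object.keys(connections)) {
    const config = connections[connectionId];
    if (ATTACH_COMPATIBLE_DRIVERS.has(config.driver)) {
      members.push({ connectionId, driver: config.driver, config, projectDir });
    }
  }
  // 0 or 1 attach-compatible members: nothing changes.
  return members.length >= 2 ? { id: FEDERATED_CONNECTION_ID, members } : undefined;
}
```

With one postgres and one sqlite connection declared, the sketch yields a two-member descriptor; adding a snowflake connection does not change the member list.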

What changed

Core federation

  • context/connections/federation.ts — derivation + FEDERATED_CONNECTION_ID; FederatedMember carries the full member connection config + projectDir; federatedConnectionListing exposes the virtual connection (id, members, usage hint) for discovery surfaces.
  • connectors/duckdb/federated-attach.ts (new) — resolves each member's DuckDB ATTACH target by reusing the canonical connector resolvers (sqliteDatabasePathFromConfig, postgresPoolConfigFromConfig, mysqlConnectionPoolConfigFromConfig). sqlite path: resolves end-to-end; SSL (sslmode=require / ssl_mode=REQUIRED) and postgres search_path are preserved for discrete-field configs.
  • connectors/duckdb/federated-executor.ts — ATTACH read-only + execute, targets resolved via federated-attach. DuckDB returns integer columns as JS bigint; the executor coerces them to number once here so every consumer (CLI/MCP/ingest/SL) gets a JSON-safe result.
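
As an illustration of the attach step, the sketch below builds the DuckDB statements for a set of resolved targets: one INSTALL/LOAD pair per distinct driver type, then one read-only ATTACH per member with a quoted alias. buildAttachStatements is a name from this PR, but the AttachTarget shape, the quoting helpers, and the exact statement text are assumptions rather than the contents of federated-executor.ts.

```typescript
type AttachType = "postgres" | "mysql" | "sqlite";

// Hypothetical shape for an already-resolved member target.
interface AttachTarget {
  alias: string;   // member connectionId, used as the DuckDB catalog alias
  type: AttachType;
  target: string;  // sqlite file path, or a postgres/mysql connection string
}

// Double-quote identifiers so hyphenated ids such as pg-books are valid aliases.
function quoteIdentifier(id: string): string {
  return `"${id.replace(/"/g, '""')}"`;
}

// Single-quote string literals, doubling embedded quotes.
function quoteLiteral(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

function buildAttachStatements(targets: AttachTarget[]): string[] {
  const statements: string[] = [];
  const loaded = new Set<AttachType>();
  // Emit INSTALL/LOAD once per distinct driver type, not once per member.
  for (const { type } of targets) {
    if (loaded.has(type)) continue;
    loaded.add(type);
    statements.push(`INSTALL ${type};`, `LOAD ${type};`);
  }
  // Attach every member read-only under its quoted connection id.
  for (const t of targets) {
    statements.push(
      `ATTACH ${quoteLiteral(t.target)} AS ${quoteIdentifier(t.alias)} (TYPE ${t.type}, READ_ONLY);`,
    );
  }
  return statements;
}
```

A query against these catalogs then addresses tables as connectionId.schema.table, quoting the catalog part when the id contains a hyphen.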

Unified query execution

  • context/connections/project-sql-executor.ts (new) — single shared executeProjectReadOnlySql that owns the _ktx_federated routing decision. The ingest executor (ingest-query-executor.ts), the MCP sql_execution port (context/mcp/local-project-ports.ts), and the ktx sql CLI command (sql.ts) all delegate to it. MCP federated errors are classified via KtxQueryError consistently with non-federated SQL.
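
The routing decision can be pictured roughly as follows. This is a sketch only: executeProjectReadOnlySql is the real function name, but the dependency-injected signature, the QueryResult shape, and selectExecutionPath are hypothetical and exist here to keep the example self-contained.

```typescript
const FEDERATED_CONNECTION_ID = "_ktx_federated";

// Hypothetical result shape shared by both paths.
interface QueryResult {
  headers: string[];
  rows: unknown[][];
}

// Hypothetical dependencies: the real module wires these to the federated
// DuckDB executor and to the per-connection connector factory.
interface ExecutorDeps {
  runFederated: (sql: string) => Promise<QueryResult>;
  runDirect: (connectionId: string, sql: string) => Promise<QueryResult>;
}

// The decision depends only on the connection id, never on the caller.
function selectExecutionPath(connectionId: string): "federated" | "direct" {
  return connectionId === FEDERATED_CONNECTION_ID ? "federated" : "direct";
}

async function executeProjectReadOnlySql(
  deps: ExecutorDeps,
  connectionId: string,
  sql: string,
): Promise<QueryResult> {
  return selectExecutionPath(connectionId) === "federated"
    ? deps.runFederated(sql)
    : deps.runDirect(connectionId, sql);
}
```

Because ktx sql, ingest, and MCP sql_execution would all call the same function, none of them needs its own federated-vs-direct branch.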

ktx sql CLI parity

  • sql.ts — ktx sql -c _ktx_federated "<join>" now runs federated cross-database queries, matching MCP. The command's forked connection-lookup + single-scan-connector path is removed and replaced by a call to executeProjectReadOnlySql; the local duplicate dialect helper and the up-front config guard are deleted (the shared connector factory raises the same "not configured" error for unknown ids). Direct -c <member> queries are unchanged. KtxSqlQueryExecutionResult gained an optional headerTypes so --json output is preserved.

Federated-connection discoverability

  • connection.ts (CLI ktx connection) and context/mcp/local-project-ports.ts (MCP connection_list) both surface the _ktx_federated entry — id, member connection ids, and a short usage hint — via the one shared federatedConnectionListing builder, so an agent can discover that cross-database querying exists and how to address it. members/hint thread through LocalConnectionInfo, KtxConnectionSummary, and the MCP output schema as optional fields; DUCKDB is a list-only label and is not added to the driver→connection-type map. The connection_list tool description points agents at the federated id for cross-database joins.
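
A rough sketch of what one shared listing builder could return is shown below. federatedConnectionListing is the real builder name; its argument, the returned field layout, and the hint wording are assumptions made for this example.

```typescript
const FEDERATED_CONNECTION_ID = "_ktx_federated";

// Hypothetical listing shape: id, member ids, and a usage hint.
interface FederatedConnectionListing {
  id: string;
  driver: "DUCKDB"; // list-only label, not a real driver mapping
  members: string[];
  hint: string;
}

// Returns undefined when fewer than two attach-compatible members exist,
// so both the CLI and MCP omit the entry in that case.
function federatedConnectionListing(
  memberIds: string[],
): FederatedConnectionListing | undefined {
  if (memberIds.length < 2) return undefined;
  return {
    id: FEDERATED_CONNECTION_ID,
    driver: "DUCKDB",
    members: memberIds.slice(),
    // Assumed wording: the real hint text is not shown in this PR description.
    hint: `Query ${FEDERATED_CONNECTION_ID} to join across ${memberIds.join(", ")}; qualify tables as connectionId.schema.table.`,
  };
}
```

Building the entry in one place is what keeps ktx connection and MCP connection_list from drifting apart.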

Cross-DB join preservation through ingest

  • context/ingest/.../manifest.ts + context/scan/local-enrichment-artifacts.ts — declared cross-DB joins to federated siblings survive a re-scan. The sibling-target set is derived from scanned member state at the producer and honored wherever a cross-DB to: is evaluated.

Semantic layer

  • context/sl/local-sl.ts — read-time union of member dirs for _ktx_federated, with member-namespaced source names (pg_books.books) so two members owning a same-named table don't collide. Physical source.table is unchanged.
  • context/sl/local-query.ts — duckdb dialect + federated id resolution; federated executed-plan metadata reports duckdb.
  • context/sl/source-files.ts — reserve _ktx_ prefix.
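
The member-namespacing rule can be sketched as a read-time union like the one below. The SlSource shape and the federatedSources helper are hypothetical; only the naming scheme (member id prefix on the source name, physical table untouched) comes from this PR.

```typescript
// Hypothetical minimal source shape: a logical name plus its physical table.
interface SlSource {
  name: string;
  table: string;
}

// Union the members' sources, prefixing each logical name with its member id.
// The physical table reference is copied through unchanged.
function federatedSources(byMember: Record<string, SlSource[]>): SlSource[] {
  const result: SlSource[] = [];
  for (const memberId of Object.keys(byMember)) {
    for (const source of byMember[memberId]) {
      result.push({ ...source, name: `${memberId}.${source.name}` });
    }
  }
  return result;
}
```

Two members that each own a books table come out as pg_books.books and lite_reviews.books, so the union has no name collision.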

Setup / deps / docs

  • setup-databases.ts — informational federation notice (no prompt, no persisted state).
  • @duckdb/node-api dependency.
  • docs-site/.../concepts/cross-database-federation.mdx + nav — documents the config shape, fully-qualified table naming, member-namespaced federated source names, and querying _ktx_federated directly via ktx sql and the MCP sql_execution tool.

Test plan

  • pnpm --filter @kaelio/ktx run type-check — clean
  • Dead-code (Biome + Knip default + production) — clean on packages/cli
  • Federation tests pass, including live DuckDB integration tests: cross-catalog sqlite join with correct per-book averages; INSERT rejected by read-only; hyphenated catalog ids attach and join; sqlite path: resolved end-to-end; the shared executor's federated path against real DuckDB; the MCP sql_execution path running a real _ktx_federated join; a production-path test proving a manual cross-DB join survives re-scan.
  • ktx sql parity tests: member-direct execution, _ktx_federated routing to the shared executor (member connector not used), unknown-id error, --json headerTypes preserved; an end-to-end ktx sql -c _ktx_federated cross-file sqlite join through the real executor.
  • Discoverability tests: _ktx_federated appears in ktx connection and MCP connection_list with members + hint when 2+ attach-compatible members exist, and is absent otherwise.
  • bigint regression guards: a federated query selecting an integer column returns JSON-safe numbers and survives JSON.stringify on both the executor result and the MCP sql_execution path (previously the MCP federated path threw Do not know how to serialize a BigInt on any integer column).
  • Reviewer: confirm CI's full suite is green (the local broad suite has pre-existing GPG-signing failures in git-init test fixtures, unrelated to this feature — they fail at project init before the test body; federation tests using the file-store seed pattern are git-free).

Follow-ups (not blocking)

  • Automatic cross-DB relationship discovery (v1 is declared-only).
  • Quote user-authored three-part table:/to: references in generated SQL where a reserved identifier could appear.
  • Federated results carry no column types (headerTypes); the DuckDB executor produces none. Direct member queries still report types.
  • bigint values above 2^53 lose precision when coerced to number (consistent with the existing plain/pretty CLI output); a string-fallback for out-of-range integers is a possible follow-up.
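
For reference, the bigint coercion and the possible string fallback can be sketched as below. The helper names are hypothetical and this is not the code in executeFederatedQuery; the second function illustrates the proposed follow-up, not current behavior.

```typescript
// Current behavior as described in this PR: bigint cells become numbers,
// everything else passes through untouched.
function coerceCell(value: unknown): unknown {
  return typeof value === "bigint" ? Number(value) : value;
}

function toJsonSafeRows(rows: unknown[][]): unknown[][] {
  return rows.map((row) => row.map((cell) => coerceCell(cell)));
}

const MAX_SAFE = BigInt(Number.MAX_SAFE_INTEGER);
const MIN_SAFE = BigInt(Number.MIN_SAFE_INTEGER);

// Hypothetical follow-up: keep exact digits as a string when the value
// falls outside the range a JS number can represent exactly.
function coerceCellWithStringFallback(value: unknown): unknown {
  if (typeof value !== "bigint") return value;
  return value >= MIN_SAFE && value <= MAX_SAFE ? Number(value) : value.toString();
}
```

With plain coercion, a value of 2^53 + 1 rounds to 2^53, which is the precision loss noted above; the fallback variant would return the exact digits as a string instead, at the cost of a mixed number/string column.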

kevinmessiaen and others added 18 commits June 12, 2026 18:20
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…esolveStringReference

Collapse the 5 remaining private copies in bigquery, clickhouse, mysql,
snowflake, and sqlserver into the shared module. Fix a latent bug in the
shared module where `~/path` was incorrectly sliced (dropping only `~`,
leaving the leading `/` and making resolve() ignore homedir). Add a
tilde-expansion test that caught the bug and now covers that branch.
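
A minimal sketch of the tilde fix described in this commit, assuming a hypothetical helper name (expandUserPath) rather than the shared module's actual function:

```typescript
import { homedir } from "node:os";
import { resolve } from "node:path";

// Hypothetical helper illustrating the fix. Slicing only the "~" leaves
// "/path", which is absolute, so resolve(homedir(), "/path") ignores homedir.
// Dropping "~/" (two characters) keeps the remainder relative to homedir.
function expandUserPath(value: string, projectDir: string): string {
  if (value === "~") return homedir();
  if (value.startsWith("~/")) return resolve(homedir(), value.slice(2));
  // Anything else is treated as project-relative (absolute paths win in resolve).
  return resolve(projectDir, value);
}
```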

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e members

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ach url

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bypass assertSafeConnectionId for _ktx_federated in resolveLocalConnectionId
and loadComputableSources, and resolve the compute dialect to 'duckdb' when
connectionId is FEDERATED_CONNECTION_ID instead of falling through to the
default postgres lookup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… member

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Also marks attachTypeForDriver, buildAttachStatements, and
isReservedConnectionId @internal — all three are exported solely for
unit-test access with no production cross-file consumer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kevinmessiaen kevinmessiaen self-assigned this Jun 12, 2026

vercel Bot commented Jun 12, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
ktx-docs-site Ready Ready Preview, Comment Jun 14, 2026 6:36am


@kevinmessiaen kevinmessiaen changed the title feat(duckdb): cross-database federation via derived DuckDB connection" feat(duckdb): cross-database federation via derived DuckDB connection Jun 12, 2026
…loads

Collapse the parallel ATTACH_COMPATIBLE_DRIVERS set and ATTACH_TYPE_BY_DRIVER
map into one map in federation.ts whose keys are the membership rule. Replace
FederatedMember.config (read only via a type-erasing cast) with a typed url
field extracted at derive time. Emit INSTALL/LOAD once per distinct driver
type instead of once per member.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…p id validation

Wrap the federated DuckDB instance in its own try/finally so a failing
connect() or a throwing connection.closeSync() no longer leaks the native
instance. Route setup-sources connection-id validation through the canonical
assertSafeConnectionId so the reserved _ktx_ prefix guard applies there too.
Derive the federated dialect through sqlAnalysisDialectForDriver instead of a
hardcoded literal.
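
The nested try/finally shape this commit describes might look like the sketch below. NativeInstance, NativeConnection, and withFederatedInstance are hypothetical stand-ins so the example runs without the DuckDB binding; they are not the @duckdb/node-api types.

```typescript
// Hypothetical stand-ins for the native DuckDB handles.
interface NativeConnection {
  run(sql: string): Promise<unknown[][]>;
  closeSync(): void;
}

interface NativeInstance {
  connect(): Promise<NativeConnection>;
  closeSync(): void;
}

async function withFederatedInstance<T>(
  createInstance: () => Promise<NativeInstance>,
  body: (connection: NativeConnection) => Promise<T>,
): Promise<T> {
  const instance = await createInstance();
  try {
    // If connect() rejects, the outer finally still closes the instance.
    const connection = await instance.connect();
    try {
      return await body(connection);
    } finally {
      // If closeSync() throws here, the outer finally still runs.
      connection.closeSync();
    }
  } finally {
    instance.closeSync();
  }
}
```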

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kevinmessiaen and others added 4 commits June 13, 2026 09:04
…n FederatedMember

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nector resolvers

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s, supporting sqlite path:

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kevinmessiaen and others added 10 commits June 13, 2026 09:24
…es _ktx_federated

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…h real DuckDB

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…utor

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d executor

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ifest re-emit

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ts in test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…collisions

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…federated MCP errors

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ling reads

Dedup the federated driver ternary in local-query, derive the prefixed
source.name from the already-built name, drop the duplicated error in
federatedAttachTarget's exhaustive switch, inline the one-line
cleanupConnector wrapper, and parallelize federatedSiblingTargets' shard
reads (was sequential await-in-for on the scan hot path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kevinmessiaen and others added 3 commits June 13, 2026 18:44
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
kevinmessiaen and others added 5 commits June 13, 2026 18:55
…ated parity

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Surfaces the virtual federated connection in the output of
`ktx connection list` so agents and users can discover cross-database
querying when 2+ attach-compatible connections are configured.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drive runKtxSql with the real federated DuckDB executor against two on-disk
sqlite files, stubbing only SQL validation. The test surfaced that the JSON
output path could not serialize bigint values DuckDB returns for integer
columns; printJson now coerces bigint to JSON numbers, matching the
plain/pretty paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…xecutor

DuckDB returns integer columns as JS bigint, which JSON.stringify cannot
serialize. The CLI --json path worked around this with a replacer, but the
MCP sql_execution tool serializes via plain JSON.stringify and crashed on
any federated query selecting an integer column. Coerce bigint to Number
once in executeFederatedQuery so every consumer (CLI, MCP, ingest, SL)
gets a JSON-safe result, and remove the now-redundant CLI replacer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… path

- Replace the identity-valued ATTACH_TYPE_BY_DRIVER record with a
  ATTACH_COMPATIBLE_DRIVERS Set; the driver name doubles as the attach
  type, so the map encoded nothing beyond membership.
- Switch federatedAttachTarget directly on the driver with a default
  throw, dropping the unreachable post-switch throw and its comment.
- Route the MCP sql_execution standard-connection case through the
  shared executeProjectReadOnlySql instead of reimplementing the
  connector create/capability-check/execute/cleanup ceremony, so
  federated and standard connections share one execution path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The federation doc example URL and the federated-attach test fixtures use
literal placeholder credentials that trip detect-secrets. Mark them with
line-scoped pragma allowlist comments so a real secret added later is still
caught.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kevinmessiaen kevinmessiaen marked this pull request as ready for review June 14, 2026 07:28