[DataLoader] Add DataFusion SQLGlot dialect for SQL transpilation by robreeves · Pull Request #501 · linkedin/openhouse

robreeves · 2026-03-13T19:22:21Z

Summary

Add a custom SQLGlot dialect for DataFusion and a to_datafusion_sql function that transpiles SQL from any supported source dialect to DataFusion SQL.

This is the first step in decoupling the TableTransformer API from DataFusion internals. Instead of returning a DataFusion DataFrame (which leaks the execution engine to users), the TableTransformer will return a SQL string and its dialect. The data loader will then use SQLGlot to translate that SQL to DataFusion for execution. The translator added here will be used in #496.

We maintain the DataFusion dialect in-repo rather than contributing it upstream to SQLGlot because the SQLGlot maintainers don't have capacity to review more community dialects right now (source).

Context: #496 (comment)

Changes

DataFusion dialect (datafusion_sql.py): custom SQLGlot dialect with DataFusion-specific function mappings (e.g. SIZE → cardinality, ARRAY() → make_array, CURRENT_TIMESTAMP() → now()), type mappings (e.g. CHAR/TEXT → VARCHAR, BINARY → BYTEA), and identifier/normalization rules.
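To make the mappings above concrete, here is a minimal lookup-table sketch. It does not use SQLGlot's dialect classes, which work on a parsed syntax tree rather than on names; the dictionary and function names are hypothetical, and only the example mappings quoted in this description are included.

```python
# Illustrative only: these tables restate the example mappings from the PR
# description. The real dialect is a SQLGlot Dialect subclass and covers more.
FUNCTION_RENAMES = {
    "SIZE": "cardinality",
    "ARRAY": "make_array",
    "CURRENT_TIMESTAMP": "now",
}

TYPE_RENAMES = {
    "CHAR": "VARCHAR",
    "TEXT": "VARCHAR",
    "BINARY": "BYTEA",
}


def datafusion_function_name(name: str) -> str:
    """Return the DataFusion function name, or the input if no rename applies."""
    return FUNCTION_RENAMES.get(name.upper(), name)


def datafusion_type_name(name: str) -> str:
    """Return the DataFusion type name, or the input if no rename applies."""
    return TYPE_RENAMES.get(name.upper(), name)
```

Names without an entry pass through unchanged, which mirrors the idea that only DataFusion-specific differences need explicit rules.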

SQL translator (datafusion_sql.py): to_datafusion_sql(sql, source_dialect) accepts any supported source dialect (spark, postgres, mysql, etc.) and transpiles the SQL to DataFusion. When source_dialect is "datafusion", it returns the SQL unchanged. An unsupported source_dialect produces a clear error listing all supported options.
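The control flow described above could look roughly like the sketch below. It is not the code in this PR: the supported-dialect set is a small assumed subset, the exception type and messages are guesses, and the transpile step is injected as a parameter (a hypothetical third argument) so the sketch runs without SQLGlot installed. The injected callable is assumed to behave like sqlglot.transpile, which returns one output string per parsed statement.

```python
from typing import Callable, List

# Assumed subset for illustration; the PR validates against every dialect
# that SQLGlot supports, plus the in-repo "datafusion" dialect.
SUPPORTED_DIALECTS = frozenset({"datafusion", "mysql", "postgres", "spark"})

# Shape of the injected transpiler: SQL text plus read/write dialect names in,
# one SQL string per statement out (the same shape as sqlglot.transpile).
Transpile = Callable[..., List[str]]


def to_datafusion_sql(sql: str, source_dialect: str, transpile: Transpile) -> str:
    """Translate a single SQL statement from source_dialect to DataFusion SQL."""
    if source_dialect not in SUPPORTED_DIALECTS:
        supported = ", ".join(sorted(SUPPORTED_DIALECTS))
        raise ValueError(
            f"Unsupported dialect {source_dialect!r}. Supported dialects: {supported}"
        )

    # Nothing to translate when the input is already DataFusion SQL.
    if source_dialect == "datafusion":
        return sql

    statements = transpile(sql, read=source_dialect, write="datafusion")
    if len(statements) != 1:
        # Include the parsed statements in the message to help debugging.
        raise ValueError(
            f"Expected exactly one SQL statement, got {len(statements)}: {statements}"
        )
    return statements[0]
```

With SQLGlot installed and the custom dialect registered under the name "datafusion", sqlglot.transpile could be passed as the third argument. Injecting it here is only a convenience for the sketch; the function in this PR takes just sql and source_dialect.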

Dependency: added sqlglot>=29.0.0.

Testing Done

Added new tests for the changes made.

Parametrized transpilation tests cover spark, mysql, postgres, and the datafusion identity case. Edge-case tests cover unsupported dialects and multi-statement input errors. An end-to-end test executes transpiled SQL against DataFusion and validates the output data.

make check  # All checks passed (ruff, mypy)
make test   # 19 dialect tests pass

Additional Information

Large PR broken into smaller PRs, and PR plan linked in the description.

This is the first PR. Follow-up PRs will integrate the translator into the TableTransformer API and data loader pipeline.

Adds a comprehensive DataFusion dialect as a SQLGlot plugin, enabling transpilation from Spark SQL (and other dialects) to DataFusion SQL. Includes parser function mappings, generator type/function transforms, a SparkToDataFusionSQLTranslator helper, and 36 tests covering translation, identity round-trips, type mappings, and execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rename SparkToDataFusionSQLTranslator to DataFusionSQLTranslator with a required source_dialect parameter. Validates the dialect on construction and provides a clear error listing all supported dialects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Consolidate the translator function into datafusion_dialect.py and remove the separate sql_translator.py file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…usion_sql Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…ages Return SQL unchanged when source_dialect is already datafusion. Include parsed statements in the multi-statement error for debugging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>