[DataLoader] Add DataFusion SQLGlot dialect for SQL transpilation #501
Merged
robreeves merged 18 commits into linkedin:main on Mar 16, 2026
Adds a comprehensive DataFusion dialect as a SQLGlot plugin, enabling transpilation from Spark SQL (and other dialects) to DataFusion SQL. Includes parser function mappings, generator type/function transforms, a SparkToDataFusionSQLTranslator helper, and 36 tests covering translation, identity round-trips, type mappings, and execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rename SparkToDataFusionSQLTranslator to DataFusionSQLTranslator with a required source_dialect parameter. Validates the dialect on construction and provides a clear error listing all supported dialects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidate the translator function into datafusion_dialect.py and remove the separate sql_translator.py file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…usion_sql Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ages Return SQL unchanged when source_dialect is already datafusion. Include parsed statements in the multi-statement error for debugging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace _spark_to_df helper with to_datafusion_sql calls. Remove duplicate test cases between TestTranslator and other test classes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tion test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ized test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rized test Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
integrations/python/dataloader/src/openhouse/dataloader/datafusion_sql.py
Remove incorrect approx_median/approx_percentile_cont mappings that silently downgraded exact functions. DataFusion has exact median and percentile_cont which sqlglot handles by default. Add ApproxQuantile mapping so Spark PERCENTILE_APPROX transpiles to approx_percentile_cont. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Verifies that custom UDFs pass through transpilation unchanged from Spark dialect and execute correctly in DataFusion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ShreyeshArangath previously approved these changes on Mar 16, 2026
ShreyeshArangath approved these changes on Mar 16, 2026
Summary
Add a custom SQLGlot dialect for DataFusion and a to_datafusion_sql function that transpiles SQL from any supported source dialect to DataFusion SQL.
This is the first step in decoupling the TableTransformer API from DataFusion internals. Instead of returning a DataFusion DataFrame (leaking the execution engine to users), the TableTransformer will return a SQL string and its dialect. The data loader will then use SQLGlot to translate that SQL to DataFusion for execution. It will be used in #496.
We maintain the DataFusion dialect in-repo rather than contributing it upstream to SQLGlot because the SQLGlot maintainers don't have capacity to review more community dialects right now (source).
Context: #496 (comment)
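To make the intended decoupling concrete, here is a rough sketch of a transformer that hands back a SQL string plus its dialect, with the loader doing the translation. Every name in it (SqlWithDialect, TableTransformerSketch, MaskEmails, plan_query, and the translate callable) is hypothetical and chosen for illustration; the actual TableTransformer API is defined in the follow-up work, not here.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class SqlWithDialect:
    """Hypothetical return type: a query plus the dialect it is written in."""

    sql: str
    dialect: str


class TableTransformerSketch(Protocol):
    """Hypothetical shape of a transformer that no longer exposes DataFusion."""

    def transform(self, table_name: str) -> SqlWithDialect: ...


class MaskEmails:
    """Example transformer written by a user who only knows Spark SQL."""

    def transform(self, table_name: str) -> SqlWithDialect:
        return SqlWithDialect(
            sql=f"SELECT id, 'redacted' AS email FROM {table_name}",
            dialect="spark",
        )


def plan_query(
    transformer: TableTransformerSketch,
    table_name: str,
    translate: Callable[[str, str], str],
) -> str:
    """Loader-side step: translate the transformer's SQL for the engine.

    `translate` stands in for a function shaped like
    to_datafusion_sql(sql, source_dialect); it is injected so this sketch
    has no third-party dependencies.
    """
    result = transformer.transform(table_name)
    return translate(result.sql, result.dialect)
```

With this shape, swapping the execution engine would only touch the translate step and the executor, not user-written transformers.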
Changes
DataFusion dialect (datafusion_sql.py): custom SQLGlot dialect with DataFusion-specific function mappings (e.g. SIZE → cardinality, ARRAY() → make_array, CURRENT_TIMESTAMP() → now()), type mappings (e.g. CHAR/TEXT → VARCHAR, BINARY → BYTEA), and identifier/normalization rules.
SQL translator (datafusion_sql.py): to_datafusion_sql(sql, source_dialect) accepts any supported source dialect (spark, postgres, mysql, etc.) and transpiles to DataFusion. When source_dialect is "datafusion" it returns the SQL unchanged. Validates the dialect with a clear error listing all supported options.
Dependency: added sqlglot>=29.0.0.
Testing Done
Parametrized transpilation tests cover spark, mysql, postgres, and datafusion identity. Edge-case tests cover unsupported dialects and multi-statement errors. An E2E test executes the transpiled SQL against DataFusion and validates the output data.
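For readers who want the edge cases spelled out, the control flow described for the translator (validate the dialect, pass DataFusion SQL through unchanged, reject multi-statement input) could look roughly like the sketch below. It is not the code in datafusion_sql.py: the function name, the extra supported and transpile parameters, and the fake_transpile stand-in are all assumptions made so the example runs without SQLGlot installed.

```python
from typing import Callable, Iterable, List

TARGET_DIALECT = "datafusion"


def to_datafusion_sql_sketch(
    sql: str,
    source_dialect: str,
    supported: Iterable[str],
    transpile: Callable[[str, str], List[str]],
) -> str:
    """Illustrative control flow only; parameters beyond sql and
    source_dialect are assumptions added for this sketch."""
    known = sorted(set(supported) | {TARGET_DIALECT})
    if source_dialect not in known:
        # List every accepted dialect so the caller can fix the typo quickly.
        raise ValueError(
            f"Unsupported source dialect {source_dialect!r}. "
            f"Supported dialects: {', '.join(known)}"
        )
    if source_dialect == TARGET_DIALECT:
        # Already DataFusion SQL: return it untouched, without parsing.
        return sql
    statements = transpile(sql, source_dialect)
    if len(statements) != 1:
        # Include the parsed statements to help with debugging.
        raise ValueError(
            f"Expected exactly one SQL statement, "
            f"got {len(statements)}: {statements}"
        )
    return statements[0]


def fake_transpile(sql: str, source_dialect: str) -> List[str]:
    """Stand-in for the real transpile step: splits on ';' and returns the
    statements unchanged. It performs no dialect translation."""
    return [part.strip() for part in sql.split(";") if part.strip()]
```

Injecting the transpile step keeps the three branches testable in isolation; in the PR that role is played by SQLGlot with the in-repo DataFusion dialect.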
Additional Information
This is the first PR. Follow-up PRs will integrate the translator into the TableTransformer API and data loader pipeline.