[DataLoader] Predicate pushdown and column projection in scan optimizer#507
Draft
robreeves wants to merge 23 commits into linkedin:main from
Conversation
Replace the ValueError when both columns and a TableTransformer are provided with scan optimization that rewrites the transform SQL to only produce the requested columns and projects only the needed source columns from the Iceberg scan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend the scan optimizer to extract pushable predicates from both transform SQL (inner WHERE) and user filters (outer WHERE on passthrough columns), pushing them to Iceberg's row_filter. Rewrites the combined SQL to only produce needed output columns, reducing both I/O and DataFusion work. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hand-rolled passthrough analysis and inner/outer WHERE handling with sqlglot's built-in qualify, pushdown_predicates, and pushdown_projections optimizers. The scan optimizer now just finds the table scan in the optimized AST and extracts simple column-op-literal predicates from it. Extract query building (merging transform SQL + user columns + filters) into a separate _query_builder module so the scan optimizer has no knowledge of how the query was constructed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predicates extracted to Iceberg row_filter now stay in the SQL for DataFusion to evaluate as well. This removes the remaining/split logic from predicate extraction, since duplicate filtering is harmless and keeps the code simpler. Add test proving BETWEEN decomposition works end-to-end with DataFusion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
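The BETWEEN decomposition can be illustrated with a minimal, stdlib-only sketch. The `Predicate` type and `decompose_between` helper here are hypothetical stand-ins, not the project's actual Filter DSL:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """Hypothetical pushable column-op-literal predicate."""
    column: str
    op: str  # ">=", "<=", "=", ...
    value: object


def decompose_between(column: str, low, high) -> list[Predicate]:
    """Rewrite `column BETWEEN low AND high` as two comparisons.

    A range predicate becomes a conjunction of >= and <=, both of
    which are simple enough to push to the scan's row_filter. The
    original predicate stays in the SQL for DataFusion to evaluate
    as well -- duplicate filtering is harmless.
    """
    return [Predicate(column, ">=", low), Predicate(column, "<=", high)]


preds = decompose_between("ts", 10, 20)
# preds == [Predicate("ts", ">=", 10), Predicate("ts", "<=", 20)]
```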
Move sqlglot AST to Filter DSL conversion logic out of scan_optimizer into its own module with independent unit tests. scan_optimizer now only contains the optimization pipeline and AST traversal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assert exactly one table scan exists in _find_table_scan, raising ValueError with the input SQL otherwise. Refactor scan optimizer tests to only test the public optimize_scan method — move filter conversion round-trip test to test_filter_converter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scan optimization is pure SQL analysis — it doesn't need to execute queries. Remove DataFusion execution tests and keep tests focused on the ScanPlan output (source_columns, row_filter, sql). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove build_combined_query dependency from scan optimizer tests. Each test now passes a plain SQL string to optimize_scan and asserts the ScanPlan output directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make _find_table_scan raise ValueError instead of silently falling back when the query has zero or multiple table scans. Narrow the try/except in optimize_scan to only catch sqlglot parse/optimize errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Invalid SQL and invalid query structure now raise exceptions instead of silently returning a fallback ScanPlan. Remove unused logging. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scan optimizer tests now assert source_columns and row_filter only, not the SQL output. Tests for non-pushable predicates verify they are absent from row_filter rather than checking SQL string content. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test OR-of-ANDs, AND-of-ORs, and double-nested filter combinations to verify correct predicate extraction with parenthesized grouping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test that a non-convertible predicate (function call) inside a nested OR causes the entire OR to be skipped while sibling AND predicates are still extracted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove hardcoded _DIALECT constant from scan_optimizer. The dialect is now passed in by the caller — data_loader passes "datafusion", tests use a local constant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test that when a transformer references columns the user didn't request, the full pipeline correctly prunes and materializes only the requested columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test that when a transformer has a WHERE clause and references columns the user didn't request, the optimizer prunes unused columns from the SQL and extracts the WHERE predicate for Iceberg pushdown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The filter converter now only exposes convert() for pure AST→Filter conversion. The AND-flattening and non-convertible conjunct skipping logic moves to the scan optimizer's _extract_row_filter, which is the only caller that needs it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
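The AND-flattening and conjunct-skipping behavior described above can be sketched with a toy AST. The node types and the all-or-nothing conversion rule are illustrative assumptions, not the project's sqlglot-based implementation:

```python
from dataclasses import dataclass


@dataclass
class And:
    left: object
    right: object


@dataclass
class Cmp:
    """Convertible conjunct: a column-op-literal comparison."""
    column: str
    op: str
    value: object


@dataclass
class Call:
    """Non-convertible conjunct, e.g. a function call."""
    name: str


def flatten_and(node):
    """Yield the leaf conjuncts of a (possibly nested) AND tree."""
    if isinstance(node, And):
        yield from flatten_and(node.left)
        yield from flatten_and(node.right)
    else:
        yield node


def extract_row_filter(node):
    """Convert each conjunct independently, skipping any that cannot
    be expressed as a pushable predicate. Dropping a conjunct only
    weakens the pushed filter -- it never changes results, because
    the full predicate still runs in the SQL."""
    return [c for c in flatten_and(node) if isinstance(c, Cmp)]


tree = And(Cmp("a", "=", 1), And(Call("lower"), Cmp("b", ">", 2)))
pushed = extract_row_filter(tree)
# pushed == [Cmp("a", "=", 1), Cmp("b", ">", 2)] -- the Call is skipped
```

Skipping at the conjunct level (rather than rejecting the whole WHERE) is what lets sibling AND predicates survive even when one branch is non-convertible.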
Each Filter type now has a _to_datafusion_sql() method that renders itself as a DataFusion SQL expression. This replaces the match statement in _query_builder._filter_to_sql. The round-trip test now exercises every filter type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
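A minimal sketch of this per-type rendering, assuming a dataclass-style Filter DSL (the `Eq`/`In` classes and `_lit` helper are hypothetical, shown only to contrast with a central match statement):

```python
from dataclasses import dataclass


def _lit(v) -> str:
    # Quote strings; render numbers as-is. A real implementation
    # would also escape embedded quotes and handle NULL, dates, etc.
    return f"'{v}'" if isinstance(v, str) else str(v)


@dataclass(frozen=True)
class Eq:
    column: str
    value: object

    def _to_datafusion_sql(self) -> str:
        return f"{self.column} = {_lit(self.value)}"


@dataclass(frozen=True)
class In:
    column: str
    values: tuple

    def _to_datafusion_sql(self) -> str:
        rendered = ", ".join(_lit(v) for v in self.values)
        return f"{self.column} IN ({rendered})"


sql = Eq("region", "EMEA")._to_datafusion_sql()
# sql == "region = 'EMEA'"
```

Pushing rendering onto each type means adding a new filter type touches one class, rather than growing a match statement in a distant module.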
The query builder module was a single function. Move it to a private method on OpenHouseDataLoader and delete the module. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move AST→Filter conversion functions from _filter_converter module into scan_optimizer as private functions. Delete _filter_converter.py and test_filter_converter.py. All filter type coverage is now in the scan optimizer tests via test_comparison_types, test_filter_dsl_to_sql_round_trip, and test_non_convertible_predicates_not_pushed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge _build_transform_sql and _build_combined_query into a single _build_query method that handles the full chain: call transformer, transpile dialect, wrap with user columns/filters. Returns None when there is no transformer, simplifying the iterator. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
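The shape of that merged chain can be sketched as follows. This is a hedged, simplified sketch: the function name, signature, and the omitted dialect-transpile step are assumptions, not the actual OpenHouseDataLoader API:

```python
from typing import Optional, Sequence


def build_query(
    transform_sql: Optional[str],
    columns: Sequence[str],
    filter_sql: Optional[str],
) -> Optional[str]:
    """Wrap transform SQL with the user's projection and filters.

    Returns None when there is no transformer, so the iterator can
    read the table directly and apply columns/filters itself.
    """
    if transform_sql is None:
        return None
    projection = ", ".join(columns) if columns else "*"
    query = f"SELECT {projection} FROM ({transform_sql})"
    if filter_sql:
        query += f" WHERE {filter_sql}"
    return query


q = build_query("SELECT a, b, c FROM t", ["a", "b"], "a > 10")
# q == "SELECT a, b FROM (SELECT a, b, c FROM t) WHERE a > 10"
```

The combined query is a plain SQL string, which is exactly why the scan optimizer can stay ignorant of how it was constructed.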
Summary
When no table transformer is present, user-defined projections and filters are applied directly to the dataset read. This is less straightforward when a table transformer is provided, because it can return an arbitrarily complex SQL query. This PR combines the user inputs and the table transformer output into one query, then optimizes it to figure out which predicate pushdowns and projections to apply to the read. The optimized SQL is what executes on each split. This is required because DataFusion requires the input Arrow data to have a schema that matches the query (e.g. it can't be missing columns due to projection).
Changes
Testing Done
New unit tests
Additional Information