[DataLoader] Predicate pushdown and column projection in scan optimizer#507
Draft
robreeves wants to merge 23 commits into linkedin:main from
Conversation
Replace the ValueError when both columns and a TableTransformer are provided with scan optimization that rewrites the transform SQL to only produce the requested columns and projects only the needed source columns from the Iceberg scan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend the scan optimizer to extract pushable predicates from both transform SQL (inner WHERE) and user filters (outer WHERE on passthrough columns), pushing them to Iceberg's row_filter. Rewrites the combined SQL to only produce needed output columns, reducing both I/O and DataFusion work. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hand-rolled passthrough analysis and inner/outer WHERE handling with sqlglot's built-in qualify, pushdown_predicates, and pushdown_projections optimizers. The scan optimizer now just finds the table scan in the optimized AST and extracts simple column-op-literal predicates from it. Extract query building (merging transform SQL + user columns + filters) into a separate _query_builder module so the scan optimizer has no knowledge of how the query was constructed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predicates extracted to Iceberg row_filter now stay in the SQL for DataFusion to evaluate as well. This removes the remaining/split logic from predicate extraction, since duplicate filtering is harmless and keeps the code simpler. Add test proving BETWEEN decomposition works end-to-end with DataFusion. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
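The BETWEEN decomposition can be illustrated with a minimal, stdlib-only sketch. The `Predicate` type and `decompose_between` helper here are hypothetical stand-ins, not the project's actual Filter DSL:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """Hypothetical pushable column-op-literal predicate."""
    column: str
    op: str  # ">=", "<=", "=", ...
    value: object


def decompose_between(column: str, low, high) -> list[Predicate]:
    """Rewrite `column BETWEEN low AND high` as two comparisons.

    A range predicate becomes a conjunction of >= and <=, both of
    which are simple enough to push to the scan's row_filter. The
    original predicate stays in the SQL for DataFusion to evaluate
    as well -- duplicate filtering is harmless.
    """
    return [Predicate(column, ">=", low), Predicate(column, "<=", high)]


preds = decompose_between("ts", 10, 20)
# preds == [Predicate("ts", ">=", 10), Predicate("ts", "<=", 20)]
```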
Move sqlglot AST to Filter DSL conversion logic out of scan_optimizer into its own module with independent unit tests. scan_optimizer now only contains the optimization pipeline and AST traversal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assert exactly one table scan exists in _find_table_scan, raising ValueError with the input SQL otherwise. Refactor scan optimizer tests to only test the public optimize_scan method — move filter conversion round-trip test to test_filter_converter. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scan optimization is pure SQL analysis — it doesn't need to execute queries. Remove DataFusion execution tests and keep tests focused on the ScanPlan output (source_columns, row_filter, sql). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove build_combined_query dependency from scan optimizer tests. Each test now passes a plain SQL string to optimize_scan and asserts the ScanPlan output directly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make _find_table_scan raise ValueError instead of silently falling back when the query has zero or multiple table scans. Narrow the try/except in optimize_scan to only catch sqlglot parse/optimize errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Invalid SQL and invalid query structure now raise exceptions instead of silently returning a fallback ScanPlan. Remove unused logging. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scan optimizer tests now assert source_columns and row_filter only, not the SQL output. Tests for non-pushable predicates verify they are absent from row_filter rather than checking SQL string content. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test OR-of-ANDs, AND-of-ORs, and double-nested filter combinations to verify correct predicate extraction with parenthesized grouping. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test that a non-convertible predicate (function call) inside a nested OR causes the entire OR to be skipped while sibling AND predicates are still extracted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove hardcoded _DIALECT constant from scan_optimizer. The dialect is now passed in by the caller — data_loader passes "datafusion", tests use a local constant. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test that when a transformer references columns the user didn't request, the full pipeline correctly prunes and materializes only the requested columns. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test that when a transformer has a WHERE clause and references columns the user didn't request, the optimizer prunes unused columns from the SQL and extracts the WHERE predicate for Iceberg pushdown. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The filter converter now only exposes convert() for pure AST→Filter conversion. The AND-flattening and non-convertible conjunct skipping logic moves to the scan optimizer's _extract_row_filter, which is the only caller that needs it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
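The AND-flattening and conjunct-skipping behavior described above can be sketched with a toy AST. The node types and the all-or-nothing conversion rule are illustrative assumptions, not the project's sqlglot-based implementation:

```python
from dataclasses import dataclass


@dataclass
class And:
    left: object
    right: object


@dataclass
class Cmp:
    """Convertible conjunct: a column-op-literal comparison."""
    column: str
    op: str
    value: object


@dataclass
class Call:
    """Non-convertible conjunct, e.g. a function call."""
    name: str


def flatten_and(node):
    """Yield the leaf conjuncts of a (possibly nested) AND tree."""
    if isinstance(node, And):
        yield from flatten_and(node.left)
        yield from flatten_and(node.right)
    else:
        yield node


def extract_row_filter(node):
    """Convert each conjunct independently, skipping any that cannot
    be expressed as a pushable predicate. Dropping a conjunct only
    weakens the pushed filter -- it never changes results, because
    the full predicate still runs in the SQL."""
    return [c for c in flatten_and(node) if isinstance(c, Cmp)]


tree = And(Cmp("a", "=", 1), And(Call("lower"), Cmp("b", ">", 2)))
pushed = extract_row_filter(tree)
# pushed == [Cmp("a", "=", 1), Cmp("b", ">", 2)] -- the Call is skipped
```

Skipping at the conjunct level (rather than rejecting the whole WHERE) is what lets sibling AND predicates survive even when one branch is non-convertible.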
Each Filter type now has a _to_datafusion_sql() method that renders itself as a DataFusion SQL expression. This replaces the match statement in _query_builder._filter_to_sql. The round-trip test now exercises every filter type. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
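A minimal sketch of this per-type rendering, assuming a dataclass-style Filter DSL (the `Eq`/`In` classes and `_lit` helper are hypothetical, shown only to contrast with a central match statement):

```python
from dataclasses import dataclass


def _lit(v) -> str:
    # Quote strings; render numbers as-is. A real implementation
    # would also escape embedded quotes and handle NULL, dates, etc.
    return f"'{v}'" if isinstance(v, str) else str(v)


@dataclass(frozen=True)
class Eq:
    column: str
    value: object

    def _to_datafusion_sql(self) -> str:
        return f"{self.column} = {_lit(self.value)}"


@dataclass(frozen=True)
class In:
    column: str
    values: tuple

    def _to_datafusion_sql(self) -> str:
        rendered = ", ".join(_lit(v) for v in self.values)
        return f"{self.column} IN ({rendered})"


sql = Eq("region", "EMEA")._to_datafusion_sql()
# sql == "region = 'EMEA'"
```

Pushing rendering onto each type means adding a new filter type touches one class, rather than growing a match statement in a distant module.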
The query builder module was a single function. Move it to a private method on OpenHouseDataLoader and delete the module. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move AST→Filter conversion functions from _filter_converter module into scan_optimizer as private functions. Delete _filter_converter.py and test_filter_converter.py. All filter type coverage is now in the scan optimizer tests via test_comparison_types, test_filter_dsl_to_sql_round_trip, and test_non_convertible_predicates_not_pushed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge _build_transform_sql and _build_combined_query into a single _build_query method that handles the full chain: call transformer, transpile dialect, wrap with user columns/filters. Returns None when there is no transformer, simplifying the iterator. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
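The shape of that merged chain can be sketched as follows. This is a hedged, simplified sketch: the function name, signature, and the omitted dialect-transpile step are assumptions, not the actual OpenHouseDataLoader API:

```python
from typing import Optional, Sequence


def build_query(
    transform_sql: Optional[str],
    columns: Sequence[str],
    filter_sql: Optional[str],
) -> Optional[str]:
    """Wrap transform SQL with the user's projection and filters.

    Returns None when there is no transformer, so the iterator can
    read the table directly and apply columns/filters itself.
    """
    if transform_sql is None:
        return None
    projection = ", ".join(columns) if columns else "*"
    query = f"SELECT {projection} FROM ({transform_sql})"
    if filter_sql:
        query += f" WHERE {filter_sql}"
    return query


q = build_query("SELECT a, b, c FROM t", ["a", "b"], "a > 10")
# q == "SELECT a, b FROM (SELECT a, b, c FROM t) WHERE a > 10"
```

The combined query is a plain SQL string, which is exactly why the scan optimizer can stay ignorant of how it was constructed.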
Summary
When no table transformer is present, user-defined projections and filters are applied directly to the dataset read. This is less straightforward when a table transformer is provided, because it can return an arbitrarily complex SQL query. This PR combines the user inputs and the table transformer output into one query, then optimizes it to figure out which predicate pushdowns and projections to apply to the read. The optimized SQL is what executes on each split. This is required because DataFusion requires the input Arrow data to have a schema that matches the query (e.g. it can't be missing columns due to projection).
Changes
Testing Done
New unit tests
Additional Information