Skip to content

[DataLoader] Predicate pushdown and column projection in scan optimizer#507

Draft
robreeves wants to merge 23 commits intolinkedin:mainfrom
robreeves:projection
Draft

[DataLoader] Predicate pushdown and column projection in scan optimizer#507
robreeves wants to merge 23 commits intolinkedin:mainfrom
robreeves:projection

Conversation

@robreeves
Copy link
Collaborator

@robreeves robreeves commented Mar 20, 2026

Summary

When no table transformer is present, all user-defined projects and filters are directly applied to the dataset read. This is less clear when a table transformer is provided because it can return any arbitrarily complex SQL query. This PR combines the user inputs and table transformer into one query, optimizes it to figure put predicate pushdown and projections to apply to the read. The optimized SQL is passed to execute on the split. This is required because DataFusion requires the input arrow data to have a schema that matches the query (e.g. cant have columns missing due to projection).

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

New unit tests

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

robreeves and others added 23 commits March 19, 2026 14:49
Replace the ValueError when both columns and a TableTransformer are
provided with scan optimization that rewrites the transform SQL to only
produce the requested columns and projects only the needed source
columns from the Iceberg scan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend the scan optimizer to extract pushable predicates from both
transform SQL (inner WHERE) and user filters (outer WHERE on passthrough
columns), pushing them to Iceberg's row_filter. Rewrites the combined
SQL to only produce needed output columns, reducing both I/O and
DataFusion work.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace hand-rolled passthrough analysis and inner/outer WHERE handling
with sqlglot's built-in qualify, pushdown_predicates, and
pushdown_projections optimizers. The scan optimizer now just finds the
table scan in the optimized AST and extracts simple column-op-literal
predicates from it.

Extract query building (merging transform SQL + user columns + filters)
into a separate _query_builder module so the scan optimizer has no
knowledge of how the query was constructed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Predicates extracted to Iceberg row_filter now stay in the SQL for
DataFusion to evaluate as well. This removes the remaining/split logic
from predicate extraction, since duplicate filtering is harmless and
keeps the code simpler. Add test proving BETWEEN decomposition works
end-to-end with DataFusion.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move sqlglot AST to Filter DSL conversion logic out of scan_optimizer
into its own module with independent unit tests. scan_optimizer now
only contains the optimization pipeline and AST traversal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Assert exactly one table scan exists in _find_table_scan, raising
ValueError with the input SQL otherwise. Refactor scan optimizer tests
to only test the public optimize_scan method — move filter conversion
round-trip test to test_filter_converter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scan optimization is pure SQL analysis — it doesn't need to execute
queries. Remove DataFusion execution tests and keep tests focused on
the ScanPlan output (source_columns, row_filter, sql).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove build_combined_query dependency from scan optimizer tests. Each
test now passes a plain SQL string to optimize_scan and asserts the
ScanPlan output directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Make _find_table_scan raise ValueError instead of silently falling back
when the query has zero or multiple table scans. Narrow the try/except
in optimize_scan to only catch sqlglot parse/optimize errors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Invalid SQL and invalid query structure now raise exceptions instead of
silently returning a fallback ScanPlan. Remove unused logging.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Scan optimizer tests now assert source_columns and row_filter only,
not the SQL output. Tests for non-pushable predicates verify they are
absent from row_filter rather than checking SQL string content.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test OR-of-ANDs, AND-of-ORs, and double-nested filter combinations
to verify correct predicate extraction with parenthesized grouping.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test that a non-convertible predicate (function call) inside a nested
OR causes the entire OR to be skipped while sibling AND predicates are
still extracted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove hardcoded _DIALECT constant from scan_optimizer. The dialect is
now passed in by the caller — data_loader passes "datafusion", tests
use a local constant.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test that when a transformer references columns the user didn't request,
the full pipeline correctly prunes and materializes only the requested
columns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ansformer

Test that when a transformer has a WHERE clause and references columns
the user didn't request, the optimizer prunes unused columns from the
SQL and extracts the WHERE predicate for Iceberg pushdown.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The filter converter now only exposes convert() for pure AST→Filter
conversion. The AND-flattening and non-convertible conjunct skipping
logic moves to the scan optimizer's _extract_row_filter, which is the
only caller that needs it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…_to_sql

Each Filter type now has a _to_datafusion_sql() method that renders
itself as a DataFusion SQL expression. This replaces the match
statement in _query_builder._filter_to_sql. The round-trip test now
exercises every filter type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uery_builder

The query builder module was a single function. Move it to a private
method on OpenHouseDataLoader and delete the module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move AST→Filter conversion functions from _filter_converter module into
scan_optimizer as private functions. Delete _filter_converter.py and
test_filter_converter.py. All filter type coverage is now in the scan
optimizer tests via test_comparison_types, test_filter_dsl_to_sql_round_trip,
and test_non_convertible_predicates_not_pushed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Merge _build_transform_sql and _build_combined_query into a single
_build_query method that handles the full chain: call transformer,
transpile dialect, wrap with user columns/filters. Returns None when
there is no transformer, simplifying the iterator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant