[DataLoader] Add projection and filter pushdown through table transformers #505
Closed
robreeves wants to merge 2 commits into linkedin:main from
Throwaway scripts for exploring predicate pushdown and projection extraction using sqlglot's optimizer vs DataFusion's logical plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rmers

Use DataFusion's query optimizer to determine which columns and filters can be pushed down to the Iceberg scan when a table transformer is used. This enables column projections with transformers (previously unsupported) and pushes row filters through the transformer subquery.

The approach:
1. Build a combined SQL query wrapping the transformer with the user's projection and filters
2. Optimize via DataFusion to extract scan columns and pushable filters
3. Convert pushed-down filters from DataFusion Expr to PyIceberg BooleanExpression (best-effort, falls back to AlwaysTrue)
4. Pass the combined SQL to splits for full execution at read time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
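Step 1 of the approach can be pictured with a minimal sketch. The helper name `wrap_transformer_sql`, its signature, the subquery alias `t`, and the example table `events` are all hypothetical and are not taken from the PR's `_build_combined_sql`; the sketch only shows the general shape of wrapping a transformer query in an outer projection and filter.

```python
from typing import Optional, Sequence


def wrap_transformer_sql(
    transformer_sql: str,
    columns: Optional[Sequence[str]] = None,
    filter_sql: Optional[str] = None,
) -> str:
    """Wrap a transformer query with an outer projection and filter.

    Hypothetical helper for illustration only; the real module may build
    the combined SQL differently.
    """
    if columns:
        # Double-quote identifiers and escape embedded quotes.
        projection = ", ".join('"' + c.replace('"', '""') + '"' for c in columns)
    else:
        projection = "*"
    # The transformer becomes a subquery so an optimizer can push the outer
    # projection and filter through it.
    sql = f"SELECT {projection} FROM ({transformer_sql}) AS t"
    if filter_sql:
        sql += f" WHERE {filter_sql}"
    return sql


combined = wrap_transformer_sql(
    "SELECT id, upper(name) AS name_upper FROM events",
    columns=["id"],
    filter_sql="id > 10",
)
print(combined)
```

Because the outer query selects only `id`, an optimizer is free to drop `name_upper` (and therefore `name`) from the scan, which is the kind of pruning step 2 is meant to discover.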
Summary
Add projection and filter pushdown through table transformers using DataFusion's query optimizer. When a transformer is present, the data loader now determines which columns and filters can be pushed down to the Iceberg scan, reducing I/O. This removes the previous limitation where column projections with table transformers were unsupported.
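To illustrate what "can be pushed down" means here, the following is a hand-rolled model, not the PR's implementation (which delegates this decision to DataFusion's optimizer). It assumes a transformer's output columns are described by a hypothetical mapping from output name to source column, with `None` marking a computed column, and that filters are simple `(column, operator, literal)` tuples.

```python
from typing import Dict, List, Optional, Tuple

# (column, operator, literal); a simplified, hypothetical predicate shape.
Predicate = Tuple[str, str, object]


def split_filters(
    predicates: List[Predicate],
    passthrough: Dict[str, Optional[str]],
) -> Tuple[List[Predicate], List[Predicate]]:
    """Split predicates into those safe to push to the scan and the rest."""
    pushed: List[Predicate] = []
    residual: List[Predicate] = []
    for column, op, value in predicates:
        source = passthrough.get(column)
        if source is not None:
            # Passthrough column: rewrite to the source name and push down.
            pushed.append((source, op, value))
        else:
            # Computed (or unknown) column: must run after the transformer.
            residual.append((column, op, value))
    return pushed, residual


# Models the transformer: SELECT id, upper(name) AS name_upper FROM events
outputs = {"id": "id", "name_upper": None}
pushed, residual = split_filters(
    [("id", ">", 10), ("name_upper", "=", "ALICE")], outputs
)
print(pushed, residual)
```

A filter on `id` can be evaluated by the scan because the transformer passes that column through unchanged, while a filter on `name_upper` depends on a value that exists only after the transformer has run.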
Changes
New Features: Added `_pushdown.py` module that uses DataFusion's query optimizer to analyze which columns and filters can be pushed through a transformer subquery to the Iceberg scan. The combined SQL (user projection/filter wrapping the transformer) is built and optimized to extract the scan columns and the pushable filters. Pushed-down filters are converted from DataFusion `Expr` to PyIceberg `BooleanExpression` (best-effort, falls back to `AlwaysTrue`).
Performance Improvements: When a transformer is used, only the columns needed by the transformer are read from Iceberg (previously all columns were read). Row filters on passthrough columns are pushed to the scan for partition pruning and row-group filtering.
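The "best-effort, falls back to `AlwaysTrue`" behavior can be sketched as follows. This is a simplified model under stated assumptions: the classes below are plain stand-ins defined locally, not the PyIceberg or DataFusion types, and the `(column, operator, literal)` input shape and `SUPPORTED_OPS` set are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class AlwaysTrue:
    """Stand-in for a predicate that matches every row."""


@dataclass(frozen=True)
class Compare:
    """Stand-in for a simple column-vs-literal comparison."""
    column: str
    op: str
    value: object


@dataclass(frozen=True)
class And:
    """Stand-in for a conjunction of two expressions."""
    left: "Expression"
    right: "Expression"


Expression = Union[AlwaysTrue, Compare, And]

# Hypothetical set of operators the converter understands.
SUPPORTED_OPS = {"=", "!=", "<", "<=", ">", ">="}


def convert_conjuncts(conjuncts: List[Tuple[str, str, object]]) -> Expression:
    """Convert AND-ed predicates, skipping any that cannot be translated."""
    result: Expression = AlwaysTrue()
    for column, op, value in conjuncts:
        if op not in SUPPORTED_OPS:
            # Skipping a conjunct is the same as AND-ing in AlwaysTrue:
            # the scan returns a superset of the wanted rows.
            continue
        term = Compare(column, op, value)
        result = term if isinstance(result, AlwaysTrue) else And(result, term)
    return result


scan_filter = convert_conjuncts([("id", ">", 10), ("name", "LIKE", "a%")])
print(scan_filter)
```

Dropping an untranslatable conjunct only widens what the scan returns, so it stays correct as long as the full query is still applied afterwards, which matches the PR's description of passing the combined SQL to splits for full execution at read time.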
Tests: Added 35 tests covering `_filter_to_sql`, `_build_combined_sql`, and `analyze_pushdown` (projection pruning, filter pushdown on passthrough vs computed columns, mixed filters, inner transformer filters). Updated the existing `test_iter_with_transformer_and_columns_raises` to verify that columns + transformers now works correctly.
Testing Done
`make verify` passes (lint, format, mypy, and all 169 tests). The only failing test is a pre-existing ORC/macOS `sysctlbyname` system error unrelated to these changes.
Additional Information