
[DataLoader] Add projection and filter pushdown through table transformers #505

Closed
robreeves wants to merge 2 commits into linkedin:main from robreeves:dl_pushdown

Conversation

@robreeves
Collaborator

Summary

Add projection and filter pushdown through table transformers using DataFusion's query optimizer. When a transformer is present, the data loader now determines which columns and filters can be pushed down to the Iceberg scan, reducing I/O. This removes the previous limitation where column projections with table transformers were unsupported.

Changes

  • Client-facing API Changes
  • Internal API Changes
  • Bug Fixes
  • New Features
  • Performance Improvements
  • Code Style
  • Refactoring
  • Documentation
  • Tests

New Features: Added the _pushdown.py module, which uses DataFusion's query optimizer to analyze which columns and filters can be pushed through a transformer subquery to the Iceberg scan. It builds a combined SQL query (the user's projection and filter wrapped around the transformer) and optimizes it to extract:

  • Scan columns from the transformer SQL's TableScan projection
  • Pushable filters converted from DataFusion Expr to PyIceberg BooleanExpression (best-effort, falls back to AlwaysTrue)
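As a rough illustration of the "combined SQL" idea, the sketch below wraps a transformer query in a subquery and applies the user's projection and filter on top. The function name build_combined_sql, its signature, the alias t, and the sample table events are all assumptions made for this example; the PR's actual _build_combined_sql may differ.

```python
from typing import Optional, Sequence


def build_combined_sql(
    transformer_sql: str,
    columns: Optional[Sequence[str]] = None,
    filter_sql: Optional[str] = None,
) -> str:
    """Hypothetical helper: wrap the transformer as a subquery and
    apply the user's projection and filter around it."""

    def quote(name: str) -> str:
        # Double any embedded quotes so the identifier stays well-formed.
        return '"' + name.replace('"', '""') + '"'

    # Fall back to SELECT * when the caller requested no projection.
    select_list = ", ".join(quote(c) for c in columns) if columns else "*"
    sql = f"SELECT {select_list} FROM ({transformer_sql}) AS t"
    if filter_sql:
        sql += f" WHERE {filter_sql}"
    return sql


combined = build_combined_sql(
    "SELECT id, upper(name) AS name_upper FROM events",
    columns=["id"],
    filter_sql="id > 10",
)
print(combined)
```

Handing a query of this shape to an optimizer lets it decide what reaches the table scan, instead of the loader reasoning about the transformer SQL by hand.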

Performance Improvements: When a transformer is used, only the columns needed by the transformer are read from Iceberg (previously all columns were read). Row filters on passthrough columns are pushed to the scan for partition pruning and row-group filtering.
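The passthrough versus computed distinction can be sketched by hand as follows. This is only an illustration of the rule: in the PR the decision is delegated to DataFusion's optimizer, and the Predicate type, the split_pushable function, and the passthrough mapping are hypothetical names introduced here. The sketch assumes the filter is a conjunction (AND) of simple predicates.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class Predicate:
    column: str  # column name the predicate refers to
    op: str      # comparison operator, e.g. ">" or "="
    value: Any   # literal being compared against


def split_pushable(
    predicates: List[Predicate],
    passthrough: Dict[str, str],
) -> Tuple[List[Predicate], List[Predicate]]:
    """Split AND-ed predicates into those that can go to the scan and
    those that must be evaluated after the transformer runs.

    `passthrough` maps a transformer output column to the source column
    it forwards unchanged; computed columns are absent from the map.
    """
    pushable: List[Predicate] = []
    residual: List[Predicate] = []
    for pred in predicates:
        source = passthrough.get(pred.column)
        if source is not None:
            # Rewrite the predicate in terms of the source column.
            pushable.append(Predicate(source, pred.op, pred.value))
        else:
            residual.append(pred)
    return pushable, residual


# Transformer: SELECT id AS event_id, upper(name) AS name_upper FROM events
pushable, residual = split_pushable(
    [Predicate("event_id", ">", 10), Predicate("name_upper", "=", "A")],
    passthrough={"event_id": "id"},
)
print(pushable)
print(residual)
```

Only the predicate on event_id reaches the scan, rewritten against the source column id; the predicate on the computed column name_upper stays behind as a residual filter.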

Tests: Added 35 tests covering _filter_to_sql, _build_combined_sql, and analyze_pushdown (projection pruning, filter pushdown on passthrough vs. computed columns, mixed filters, inner transformer filters). Updated the existing test_iter_with_transformer_and_columns_raises to verify that combining columns with a transformer now works correctly.

Testing Done

  • Manually Tested on local docker setup. Please include commands ran, and their output.
  • Added new tests for the changes made.
  • Updated existing tests to reflect the changes made.
  • No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
  • Some other form of testing like staging or soak time in production. Please explain.

make verify passes (lint, format, mypy, and all 169 tests), apart from one pre-existing failure: an ORC/macOS sysctlbyname system error unrelated to these changes.

Additional Information

  • Breaking Changes
  • Deprecations
  • Large PR broken into smaller PRs, and PR plan linked in the description.

robreeves and others added 2 commits March 19, 2026 09:12
Throwaway scripts for exploring predicate pushdown and projection
extraction using sqlglot's optimizer vs DataFusion's logical plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rmers

Use DataFusion's query optimizer to determine which columns and filters
can be pushed down to the Iceberg scan when a table transformer is used.
This enables column projections with transformers (previously unsupported)
and pushes row filters through the transformer subquery.

The approach:
1. Build a combined SQL query wrapping the transformer with the user's
   projection and filters
2. Optimize via DataFusion to extract scan columns and pushable filters
3. Convert pushed-down filters from DataFusion Expr to PyIceberg
   BooleanExpression (best-effort, falls back to AlwaysTrue)
4. Pass the combined SQL to splits for full execution at read time
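The best-effort conversion in step 3 can be sketched as below. To stay self-contained, the example uses stand-in classes (AlwaysTrue, Compare, And) and a toy tuple-based expression format instead of real DataFusion Expr or PyIceberg BooleanExpression objects; all of these names and the to_scan_filter function are assumptions for illustration only. The point is the fallback rule: anything the converter does not recognise widens to "match every row", which presumably stays correct because step 4 still executes the full combined SQL at read time.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass(frozen=True)
class AlwaysTrue:
    """Stand-in scan filter that matches every row."""


@dataclass(frozen=True)
class Compare:
    column: str
    op: str
    value: Any


@dataclass(frozen=True)
class And:
    left: "ScanFilter"
    right: "ScanFilter"


ScanFilter = Union[AlwaysTrue, Compare, And]

SUPPORTED_OPS = {"=", "!=", "<", "<=", ">", ">="}


def to_scan_filter(expr: Any) -> ScanFilter:
    """Convert a toy (left, op, right) expression tree to a scan filter.

    Unsupported nodes become AlwaysTrue, so the scan may return extra
    rows but never drops rows the full query would keep.
    """
    if isinstance(expr, tuple) and len(expr) == 3:
        left, op, right = expr
        if op == "and":
            lhs, rhs = to_scan_filter(left), to_scan_filter(right)
            # Dropping one side of an AND only widens the filter.
            if isinstance(lhs, AlwaysTrue):
                return rhs
            if isinstance(rhs, AlwaysTrue):
                return lhs
            return And(lhs, rhs)
        if op in SUPPORTED_OPS and isinstance(left, str):
            return Compare(left, op, right)
    # Anything else (OR, LIKE, function calls, ...) is not pushed down.
    return AlwaysTrue()


mixed = (("id", ">", 10), "and", ("name", "like", "a%"))
print(to_scan_filter(mixed))
```

In this sketch an OR node is never partially converted: dropping one branch of an OR would narrow the filter, so the whole node falls back to AlwaysTrue.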

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@robreeves robreeves closed this Mar 19, 2026