[DataLoader] Add projection and filter pushdown through table transformers by robreeves · Pull Request #505 · linkedin/openhouse

robreeves · 2026-03-19T17:12:05Z

Summary

Add projection and filter pushdown through table transformers using DataFusion's query optimizer. When a transformer is present, the data loader now determines which columns and filters can be pushed down to the Iceberg scan, reducing I/O. This removes the previous limitation where column projections with table transformers were unsupported.

Changes

New Features: Added _pushdown.py module that uses DataFusion's query optimizer to analyze which columns and filters can be pushed through a transformer subquery to the Iceberg scan. The combined SQL (user projection/filter wrapping the transformer) is built and optimized to extract:

Scan columns from the transformer SQL's TableScan projection
Pushable filters converted from DataFusion Expr to PyIceberg BooleanExpression (best-effort, falls back to AlwaysTrue)

Performance Improvements: When a transformer is used, only the columns needed by the transformer are read from Iceberg (previously all columns were read). Row filters on passthrough columns are pushed to the scan for partition pruning and row-group filtering.
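The passthrough versus computed distinction can be pictured with a toy model. In the sketch below, a hypothetical `passthrough` mapping records which transformer output columns are unchanged source columns; filters on those are rewritten to the source name and marked pushable, while filters on computed columns stay behind. The PR delegates this analysis to DataFusion's optimizer, so this is only an illustration under the assumption of a simple projection-only transformer (no aggregation or limit).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Filter:
    """A simple comparison filter (hypothetical shape for this sketch)."""
    column: str
    op: str
    value: object


def split_filters(
    filters: List[Filter], passthrough: Dict[str, str]
) -> Tuple[List[Filter], List[Filter]]:
    """Split filters into (pushable to scan, residual above transformer).

    `passthrough` maps an output column name to the source column it
    passes through unchanged; computed columns are absent from the map.
    """
    pushable: List[Filter] = []
    residual: List[Filter] = []
    for f in filters:
        source = passthrough.get(f.column)
        if source is not None:
            # Rewrite the filter in terms of the source column name.
            pushable.append(Filter(source, f.op, f.value))
        else:
            residual.append(f)
    return pushable, residual


pushable, residual = split_filters(
    [Filter("day", "=", "2026-03-19"), Filter("total", ">", 5)],
    {"id": "id", "day": "event_date"},
)
print(pushable, residual)
```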

Tests: Added 35 tests covering _filter_to_sql, _build_combined_sql, and analyze_pushdown (projection pruning, filter pushdown on passthrough vs computed columns, mixed filters, inner transformer filters). Updated the existing test_iter_with_transformer_and_columns_raises to verify that combining columns with a transformer now works correctly.

Testing Done

Manually Tested on local docker setup. Please include commands ran, and their output.
Added new tests for the changes made.
Updated existing tests to reflect the changes made.
No tests added or updated. Please explain why. If unsure, please feel free to ask for help.
Some other form of testing like staging or soak time in production. Please explain.

make verify passes lint, format, and mypy, and 169 tests pass. The only failing test is a pre-existing ORC/macOS sysctlbyname system error unrelated to these changes.

Additional Information

Breaking Changes
Deprecations
Large PR broken into smaller PRs, and PR plan linked in the description.

Throwaway scripts for exploring predicate pushdown and projection extraction using sqlglot's optimizer vs DataFusion's logical plan.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…rmers

Use DataFusion's query optimizer to determine which columns and filters can be pushed down to the Iceberg scan when a table transformer is used. This enables column projections with transformers (previously unsupported) and pushes row filters through the transformer subquery.

The approach:
1. Build a combined SQL query wrapping the transformer with the user's projection and filters
2. Optimize via DataFusion to extract scan columns and pushable filters
3. Convert pushed-down filters from DataFusion Expr to PyIceberg BooleanExpression (best-effort, falls back to AlwaysTrue)
4. Pass the combined SQL to splits for full execution at read time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
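One way to see why a best-effort conversion with an always-true fallback is safe: the scan filter only needs to be no stricter than the real filter, because the full query is applied again at read time. The sketch below models this with plain Python predicates. The `(column, op, value)` filter shape and all function names are assumptions for illustration; it does not use DataFusion or PyIceberg.

```python
import operator
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Filter = Tuple[str, str, object]  # (column, op, value), hypothetical shape

# Comparison ops this sketch knows how to translate for the scan.
_SCAN_OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def to_scan_predicate(f: Filter) -> Callable[[Row], bool]:
    """Best-effort conversion; unsupported ops become always-true."""
    column, op, value = f
    fn = _SCAN_OPS.get(op)
    if fn is None:
        return lambda row: True  # stand-in for an AlwaysTrue scan filter
    return lambda row: fn(row[column], value)


def full_predicate(f: Filter) -> Callable[[Row], bool]:
    """Exact evaluation, standing in for running the combined SQL."""
    column, op, value = f
    if op in _SCAN_OPS:
        return to_scan_predicate(f)
    if op == "startswith":
        return lambda row: str(row[column]).startswith(str(value))
    raise ValueError(f"unsupported operator: {op}")


def read(rows: List[Row], filters: List[Filter]) -> List[Row]:
    scan = [to_scan_predicate(f) for f in filters]
    exact = [full_predicate(f) for f in filters]
    scanned = [r for r in rows if all(p(r) for p in scan)]   # may over-read
    return [r for r in scanned if all(p(r) for p in exact)]  # exact result


rows = [
    {"id": 1, "name": "alpha"},
    {"id": 2, "name": "beta"},
    {"id": 3, "name": "alps"},
]
print(read(rows, [("id", ">", 1), ("name", "startswith", "al")]))
```

Here the `startswith` filter cannot be translated for the scan, so the scan over-reads one row, and the exact pass removes it again. The result is the same as if every filter had been pushed; only the amount of data read differs.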

robreeves and others added 2 commits March 19, 2026 09:12

Add sqlglot and datafusion pushdown exploration scripts

48be4e7


robreeves closed this Mar 19, 2026

robreeves wants to merge 2 commits into linkedin:main from robreeves:dl_pushdown
