feat(plugin-iceberg): Bound MV refresh by tdcmeehan · Pull Request #27774 · prestodb/presto

tdcmeehan · 2026-05-12T01:49:20Z

Description

Adds a bounded incremental-refresh mode for Iceberg materialized views. A new max_snapshots_per_refresh MV property (with a matching session/config default) caps how far each base table can advance in a single REFRESH MATERIALIZED VIEW, using the _last_updated_sequence_number row-lineage column to limit the refresh scan. Remaining backlog is picked up by subsequent refreshes.

Motivation and Context

On busy tables an MV can fall far behind, and a single REFRESH may be too large to complete in one statement. This lets operators chunk catch-up into bounded steps. V2 tables, which lack row lineage, fall back to unbounded refresh.

Impact

Test Plan

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.
If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Iceberg Connector Changes
* Add ``max_snapshots_per_refresh`` materialized view property to bound how far each base table advances per ``REFRESH MATERIALIZED VIEW``. Defaults to ``0`` (unbounded). Requires Iceberg V3 row lineage; V2 tables fall back to unbounded refresh.
* Add ``iceberg.materialized-view-default-max-snapshots-per-refresh`` config property and matching session property to set the default bound.

Summary by Sourcery

Add bounded incremental refresh support for Iceberg materialized views, including configuration, planning, and metadata handling to cap the number of snapshots consumed per refresh and integrate with row lineage.

New Features:

Introduce a max_snapshots_per_refresh materialized view property for Iceberg to limit how far each base table can advance per refresh.
Add an Iceberg catalog/session configuration property to set the default maximum snapshots per refresh when the MV does not override it.

Enhancements:

Track per-base snapshot watermarks in Iceberg view properties and derive bounded target snapshots using row lineage where supported.
Extend materialized view status and planner logic to carry incremental-refresh predicates that are applied only during refresh, while keeping stitching reads unbounded.
Ensure statistics and partition-change detection ignore metadata-column predicates that Iceberg cannot bind, and gate lineage-based features on format version support.

Tests:

Add REST materialized view integration tests covering bounded refresh behavior, multi-base independence, compaction, rollback, V2 fallback, session defaults, overrides, partitioned MVs, and stitching semantics.
Add planner tests verifying that bounded refreshes push _last_updated_sequence_number predicates into table scans, while unbounded and V2 refreshes and stitched reads do not.
Add unit tests for Iceberg MV properties and resolution of max_snapshots_per_refresh, including validation rules and session/default interactions.
Update Iceberg config tests to cover the new materialized-view-default-max-snapshots-per-refresh setting.

sourcery-ai · 2026-05-12T01:50:04Z

Reviewer's Guide

Implements bounded incremental refresh for Iceberg materialized views using a new max_snapshots_per_refresh MV property plus matching config/session default, threads the bound through metadata/status computation and the planner so that refresh scans are capped per base table using Iceberg V3 row lineage, and adds extensive tests around bounded behavior, lineage, compaction, rollback, partitioned MVs, and configuration resolution.

Sequence diagram for bounded Iceberg MV refresh planning

sequenceDiagram
    actor User
    participant Session
    participant IcebergMetadata as IcebergAbstractMetadata
    participant Status as MaterializedViewStatus
    participant Planner as IncrementalRefreshRule
    participant Rewriter as IncrementalRefreshPredicateRewriter
    participant Scan as TableScanNode

    User->>Session: REFRESH MATERIALIZED VIEW
    Session->>IcebergMetadata: getMaterializedViewStatus(session, mvName)
    IcebergMetadata->>IcebergMetadata: resolveMaxSnapshotsPerRefresh(session, viewProperties)
    IcebergMetadata->>IcebergMetadata: chooseTargetSnapshot(baseTable, watermark, maxSnapshots)
    IcebergMetadata-->>Session: MaterializedViewStatus(partitionDisjuncts, incrementalRefreshPredicate)

    Session->>Planner: apply(RefreshMaterializedViewNode)
    Planner->>Status: getPartitionsFromBaseTables()
    Planner->>Planner: buildDeltaPlanForRefresh(...)
    Planner->>Planner: applyIncrementalRefreshPredicates(plan, incrementalRefreshPredicates, ...)
    Planner->>Rewriter: SimplePlanRewriter.rewriteWith(perBasePredicates, plan)
    Rewriter->>Scan: visitTableScan(TableScanNode)
    Rewriter-->>Planner: FilterNode(TableScanNode)
    Planner-->>Session: plan with bounded base table scans

File-Level Changes

Change	Details	Files
Introduce max_snapshots_per_refresh materialized view property, default resolution, and persistence in Iceberg metadata and configuration/session settings.	Add MAX_SNAPSHOTS_PER_REFRESH MV property with validation and accessor in IcebergMaterializedViewProperties and persist it as PRESTO_MATERIALIZED_VIEW_MAX_SNAPSHOTS_PER_REFRESH when creating views. Add IcebergConfig and IcebergSessionProperties plumbing for iceberg.materialized-view-default-max-snapshots-per-refresh, including default value, validation, and session accessor getMaterializedViewDefaultMaxSnapshotsPerRefresh. Implement resolveMaxSnapshotsPerRefresh in IcebergAbstractMetadata to pick the persisted MV property if present or fall back to the session default, and wire this into getMaterializedViewStatus and finishRefreshMaterializedView.	`presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergMaterializedViewProperties.java` `presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergAbstractMetadata.java` `presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergConfig.java` `presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSessionProperties.java` `presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergConfig.java` `presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergMaterializedViewProperties.java`
Add bounded refresh snapshot selection and watermark management using Iceberg V3 row lineage, with rollback/ancestry safety and V2 fallback.	Introduce MIN_FORMAT_VERSION_FOR_ROW_LINEAGE and use it to gate lineage-based behavior, including early returns for tables without lineage in validateTableForPresto and chooseTargetSnapshot. Implement chooseTargetSnapshot to compute a bounded target snapshot between the stored watermark and HEAD using ancestorIdsBetween, honoring max_snapshots_per_refresh, skipping when watermark is off HEAD’s ancestry, or when the table lacks lineage. Update getMaterializedViewStatus to compute per-base targetSnapshotId, detect rollbacks where watermarks are off ancestry and mark the MV NOT_MATERIALIZED, and build MaterializedDataPredicates with an incrementalRefreshPredicate over _last_updated_sequence_number when a bound applies. Update finishRefreshMaterializedView to calculate new base watermarks per table using chooseTargetSnapshot and the previous watermark, ensuring that bounds are respected and rollbacks lead to full catch-up to the new head snapshot.	`presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergAbstractMetadata.java` `presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java`
Extend MaterializedViewStatus and the planner to push incremental refresh predicates on _last_updated_sequence_number into base table scans only during refresh, without affecting stitching plans.	Extend MaterializedViewStatus.MaterializedDataPredicates to carry an incrementalRefreshPredicate TupleDomain in addition to partition disjuncts and key column names. In IncrementalRefreshRule, extract any non-trivial incrementalRefreshPredicate per base table and introduce applyIncrementalRefreshPredicates to wrap delta or full-refresh plans, or fallback full-refresh plans, with base-table filters derived from these predicates. Implement IncrementalRefreshPredicateRewriter, a SimplePlanRewriter that finds TableScanNodes by SchemaTableName, ensures all predicate columns are available (adding hidden columns to scan outputs as needed), builds a RowExpression from the TupleDomain, and wraps the scan in a FilterNode plus optional ProjectNode to restore original outputs. Ensure that stitching queries (e.g., reading the MV with USE_STITCHING) do not see or apply the incrementalRefreshPredicate, so stale-read stitching still reads all needed base rows.	`presto-spi/src/main/java/com/facebook/presto/spi/MaterializedViewStatus.java` `presto-main-base/src/main/java/com/facebook/presto/sql/planner/iterative/rule/materializedview/IncrementalRefreshRule.java`
Adjust Iceberg statistics and constraint handling so lineage predicates are honored for refresh pruning but not pushed into metadata paths that cannot handle metadata columns.	Update TableStatisticsMaker.getDataTableSummary to strip metadata-column filters by intersecting with non-metadata constraints via IcebergUtil.getNonMetadataColumnConstraints before converting to an Iceberg expression. Rely on the new incrementalRefreshPredicate plumbing so that _last_updated_sequence_number predicates are applied at the scan/filter level instead of in manifest-level or metadata-only filters that would otherwise reject metadata columns.	`presto-iceberg/src/main/java/com/facebook/presto/iceberg/TableStatisticsMaker.java` `presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java`
Add comprehensive tests for bounded MV refresh behavior, lineage, config resolution, and planning of sequence predicates for Iceberg REST and optimizer.	Extend TestIcebergRestMaterializedViews with a RESTCatalog, schema constant, helpers to read head snapshot, current snapshot, and base watermarks, and a suite of tests covering bounded single/multi-base refresh, compaction, V2 fallback, session defaults vs table overrides, partitioned MVs, stitching semantics, rollback handling, property persistence, invalid bounds, and hidden sequence column usage. Add TestIcebergMaterializedViewOptimizer tests asserting that bounded refresh pushes a _last_updated_sequence_number <= constant predicate into the layout and scan constraints, while unbounded refresh, V2 bounded refresh, and stitched queries do not carry sequence predicates on the base scan. Add TestIcebergMaterializedViewProperties to validate the new MV property’s decoding, validation, and resolveMaxSnapshotsPerRefresh behavior, including invalid values, session-default behavior, and precedence of persisted properties. Update TestIcebergConfig to cover the new materialized-view-default-max-snapshots-per-refresh config property in defaults and explicit mappings.	`presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergRestMaterializedViews.java` `presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergMaterializedViewOptimizer.java` `presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergMaterializedViewProperties.java` `presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergConfig.java`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

Advance each base's watermark to the N-th-oldest ancestor in (watermark, HEAD] and push a _last_updated_sequence_number bound into the refresh scan via existing V3 row-lineage pushdown.

sourcery-ai

Hey - I've found 2 issues, and left some high level feedback:

In IcebergAbstractMetadata.getMaterializedViewStatus, the rollback/branch guard uses isAncestorOf(baseIcebergTable, currentSnapshotId, recordedSnapshotId) but SnapshotUtil.isAncestorOf expects (ancestor, descendant); since recordedSnapshotId is older, this arg order appears reversed and will incorrectly force NOT_MATERIALIZED whenever the base advances, effectively disabling incremental refresh after the first change.
chooseTargetSnapshot already contains an ancestry guard (isAncestorOf(baseTable, head, watermark)), but getMaterializedViewStatus adds a separate ancestry check with different parameter ordering and semantics; consider centralizing this logic in chooseTargetSnapshot (or a shared helper) to avoid divergence and subtle bugs in rollback/branch handling.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- In `IcebergAbstractMetadata.getMaterializedViewStatus`, the rollback/branch guard uses `isAncestorOf(baseIcebergTable, currentSnapshotId, recordedSnapshotId)` but `SnapshotUtil.isAncestorOf` expects `(ancestor, descendant)`; since `recordedSnapshotId` is older, this arg order appears reversed and will incorrectly force `NOT_MATERIALIZED` whenever the base advances, effectively disabling incremental refresh after the first change.
- `chooseTargetSnapshot` already contains an ancestry guard (`isAncestorOf(baseTable, head, watermark)`), but `getMaterializedViewStatus` adds a separate ancestry check with different parameter ordering and semantics; consider centralizing this logic in `chooseTargetSnapshot` (or a shared helper) to avoid divergence and subtle bugs in rollback/branch handling.

## Individual Comments

### Comment 1
<location path="presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergRestMaterializedViews.java" line_range="596-604" />
<code_context>
+        }
+    }
+
+    private static void assertAdvanceLeq(long oldWatermark, long newWatermark, long head, int bound, String baseTable)
+    {
+        if (oldWatermark == newWatermark) {
+            fail("expected watermark to advance from " + oldWatermark + " on base " + baseTable);
+        }
+        if (newWatermark == head) {
+            return;
+        }
+        assertNotEquals(newWatermark, oldWatermark);
+    }
 }
</code_context>
<issue_to_address>
**issue (testing):** assertAdvanceLeq helper does not actually enforce the "<= bound" invariant

The helper’s behavior doesn’t match its name or call-site intent: it only checks that the watermark advanced (and possibly hit `head`), but never that it advanced by *no more than* `bound` snapshots. A change that advances by 10 when `bound=2` would still pass. Please tighten this by explicitly asserting the step count is `<= bound` (e.g., via `ancestorIdsBetween(head, oldWatermark, ...)` and checking `newWatermark` is within `bound` steps), or by exposing a small test hook that reveals the chosen target snapshot so the boundedness can be asserted directly.
</issue_to_address>

### Comment 2
<location path="presto-docs/src/main/sphinx/connector/iceberg.rst" line_range="551" />
<code_context>
-                                                      is hashed to a particular node when determining the which worker to
-                                                      assign a split to. Splits which read data from the same file within
-                                                      the same chunk will hash to the same node. A smaller chunk size will
-                                                      result in a higher probability splits being distributed evenly across
-                                                      the cluster, but reduce locality. 
-                                                      See :ref:`develop/connectors:Node Selection Strategy`.
</code_context>
<issue_to_address>
**issue (typo):** Insert "of" after "probability" for correct phrasing.

```suggestion
                                                      result in a higher probability of splits being distributed evenly across
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

sourcery-ai · 2026-05-14T15:45:10Z

+    private static void assertAdvanceLeq(long oldWatermark, long newWatermark, long head, int bound, String baseTable)
+    {
+        if (oldWatermark == newWatermark) {
+            fail("expected watermark to advance from " + oldWatermark + " on base " + baseTable);
+        }
+        if (newWatermark == head) {
+            return;
+        }
+        assertNotEquals(newWatermark, oldWatermark);


issue (testing): assertAdvanceLeq helper does not actually enforce the "<= bound" invariant

The helper’s behavior doesn’t match its name or call-site intent: it only checks that the watermark advanced (and possibly hit head), but never that it advanced by no more than bound snapshots. A change that advances by 10 when bound=2 would still pass. Please tighten this by explicitly asserting the step count is <= bound (e.g., via ancestorIdsBetween(head, oldWatermark, ...) and checking newWatermark is within bound steps), or by exposing a small test hook that reveals the chosen target snapshot so the boundedness can be asserted directly.

sourcery-ai · 2026-05-14T15:45:10Z

+                                                        is hashed to a particular node when determining the which worker to
+                                                        assign a split to. Splits which read data from the same file within
+                                                        the same chunk will hash to the same node. A smaller chunk size will
+                                                        result in a higher probability splits being distributed evenly across


issue (typo): Insert "of" after "probability" for correct phrasing.

Suggested change

result in a higher probability splits being distributed evenly across

result in a higher probability of splits being distributed evenly across

tdcmeehan · 2026-05-14T18:28:11Z

@steveburnett I refactored the docs a bit and added more user facing docs, let me know what you think!

steveburnett

LGTM! (docs)

prestodb-ci added the from:IBM PR from IBM label May 12, 2026

tdcmeehan force-pushed the iceseq_mvr branch from 42f1042 to 697a7a8 Compare May 14, 2026 13:46

feat(plugin-iceberg): Bound MV refresh via max_snapshots_per_refresh

5acc5c1

Advance each base's watermark to the N-th-oldest ancestor in (watermark, HEAD] and push a _last_updated_sequence_number bound into the refresh scan via existing V3 row-lineage pushdown.

tdcmeehan force-pushed the iceseq_mvr branch from 697a7a8 to 5acc5c1 Compare May 14, 2026 15:40

tdcmeehan changed the title ~~feat(plugin-iceberg): Bound MV refresh via max_snapshots_per_refresh~~ feat(plugin-iceberg): Bound MV refresh May 14, 2026

tdcmeehan marked this pull request as ready for review May 14, 2026 15:43

tdcmeehan requested review from a team, ZacBlanco, elharo, feilong-liu, hantangwangd, jaystarshot and steveburnett as code owners May 14, 2026 15:43

prestodb-ci requested review from a team, aaneja and wanglinsong and removed request for a team May 14, 2026 15:43

sourcery-ai Bot reviewed May 14, 2026

View reviewed changes

steveburnett previously approved these changes May 14, 2026

View reviewed changes

Address sourcery feedback

2521e39

tdcmeehan dismissed steveburnett’s stale review via 2521e39 May 14, 2026 16:45

tdcmeehan force-pushed the iceseq_mvr branch from 23826a1 to 2521e39 Compare May 14, 2026 16:45

Fix docs, and add more user documentation for bounded refresh

87b5523

tdcmeehan requested a review from agrawalreetika May 14, 2026 18:57

steveburnett requested changes May 14, 2026

View reviewed changes

Comment thread presto-docs/src/main/sphinx/connector/iceberg.rst Outdated

via -> with

d4d7cec

steveburnett previously approved these changes May 14, 2026

View reviewed changes

Emit a warning when table is v2

67e8d56

tdcmeehan dismissed steveburnett’s stale review via 67e8d56 May 15, 2026 13:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(plugin-iceberg): Bound MV refresh#27774

feat(plugin-iceberg): Bound MV refresh#27774
tdcmeehan wants to merge 5 commits into
prestodb:masterfrom
tdcmeehan:iceseq_mvr

tdcmeehan commented May 12, 2026 •

edited

Loading

Uh oh!

sourcery-ai Bot commented May 12, 2026 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

sourcery-ai Bot May 14, 2026

Uh oh!

sourcery-ai Bot May 14, 2026

Uh oh!

tdcmeehan commented May 14, 2026

Uh oh!

Uh oh!

steveburnett left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	result in a higher probability splits being distributed evenly across
	result in a higher probability of splits being distributed evenly across

Conversation

tdcmeehan commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

Impact

Test Plan

Contributor checklist

Release Notes

Summary by Sourcery

Uh oh!

sourcery-ai Bot commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

Sequence diagram for bounded Iceberg MV refresh planning

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

sourcery-ai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

tdcmeehan commented May 14, 2026

Uh oh!

Uh oh!

steveburnett left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

tdcmeehan commented May 12, 2026 •

edited

Loading

sourcery-ai Bot commented May 12, 2026 •

edited

Loading