Skip to content

fix: use KeyBasedFileGroupRecordBuffer for non-Parquet base files in MOR merge#18588

Open
yihua wants to merge 1 commit into
apache:masterfrom
yihua:fix-lance-mor-delete-merge
Open

fix: use KeyBasedFileGroupRecordBuffer for non-Parquet base files in MOR merge#18588
yihua wants to merge 1 commit into
apache:masterfrom
yihua:fix-lance-mor-delete-merge

Conversation

@yihua
Copy link
Copy Markdown
Contributor

@yihua yihua commented Apr 24, 2026

Describe the issue this Pull Request addresses

Closes #18558

Deletes are not merged correctly with Lance format on MOR table type.

Summary and Changelog

DefaultFileGroupRecordBufferLoader selects PositionBasedFileGroupRecordBuffer when useRecordPosition is true, regardless of the base file format. However, HoodieFileGroupReader sets shouldMergeUseRecordPosition = false for non-Parquet files via the isParquetBaseFile() gate. This mismatch causes the buffer to enter a fallback hybrid-strategy mode that fails to merge delete blocks for Lance base files.

Added the isParquetBaseFile() check to the buffer selection so non-Parquet formats (Lance, ORC) use KeyBasedFileGroupRecordBuffer directly, matching the merge strategy set by HoodieFileGroupReader.

Impact

Non-Parquet base file formats in MOR tables now correctly merge deletes. No impact on Parquet tables.

Risk Level

Low — narrows PositionBasedFileGroupRecordBuffer selection to Parquet only, which is the only format that supports row-index-based positional merging today.

Documentation Update

None.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label Apr 24, 2026
@yihua yihua force-pushed the fix-lance-mor-delete-merge branch from 3d0299e to 98206d6 Compare April 24, 2026 23:42
@yihua yihua marked this pull request as ready for review April 25, 2026 00:12
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the fix! This narrows PositionBasedFileGroupRecordBuffer selection to Parquet base files only, matching the existing isParquetBaseFile() gate that HoodieFileGroupReader uses for shouldMergeUseRecordPosition. The Lance MOR delete test is un-skipped accordingly. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@github-actions github-actions Bot added size:XS PR with lines of changes in <= 10 and removed size:S PR with lines of changes in (10, 100] labels Apr 25, 2026
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the fix! This narrows PositionBasedFileGroupRecordBuffer selection to Parquet base files, matching the existing shouldMergeUseRecordPosition gate in HoodieFileGroupReader so non-Parquet formats (Lance, ORC) consistently use KeyBasedFileGroupRecordBuffer. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@yihua yihua force-pushed the fix-lance-mor-delete-merge branch 3 times, most recently from a1bcc94 to 2a2516f Compare April 25, 2026 01:56
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the fix! This narrows PositionBasedFileGroupRecordBuffer selection to Parquet base files, aligning with the merge strategy HoodieFileGroupReader already enforces via shouldMergeUseRecordPosition, and unblocks the previously-skipped MOR DELETE/INSERT cases in TestLanceDataSource. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

Copy link
Copy Markdown
Member

@voonhous voonhous left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

@github-actions github-actions Bot added size:S PR with lines of changes in (10, 100] and removed size:XS PR with lines of changes in <= 10 labels Apr 25, 2026
@rahil-c
Copy link
Copy Markdown
Collaborator

rahil-c commented Apr 26, 2026

@yihua can you look at test failures on ci

@yihua yihua added this to the release-1.2.0 milestone May 8, 2026
@yihua yihua force-pushed the fix-lance-mor-delete-merge branch from 2a2516f to 9854f52 Compare May 15, 2026 22:18
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the fix! This PR removes the MOR-specific workarounds in the Lance test now that deletes merge correctly via KeyBasedFileGroupRecordBuffer for non-Parquet base files. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

…18558)

Remove the COW-only guard on the DELETE test in TestLanceDataSource
so that deletes are verified for both COW and MOR table types.

Closes apache#18558
@yihua yihua force-pushed the fix-lance-mor-delete-merge branch from 9854f52 to 2c413fd Compare May 15, 2026 22:24
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.11%. Comparing base (4035f70) to head (2c413fd).

Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18588       +/-   ##
=============================================
+ Coverage     45.00%   68.11%   +23.10%     
- Complexity     8571    29089    +20518     
=============================================
  Files          1202     2518     +1316     
  Lines         62816   141221    +78405     
  Branches       6811    17531    +10720     
=============================================
+ Hits          28271    96188    +67917     
- Misses        31428    37139     +5711     
- Partials       3117     7894     +4777     
Flag Coverage Δ
common-and-other-modules 44.40% <ø> (?)
hadoop-mr-java-client 45.01% <ø> (+<0.01%) ⬆️
spark-client-hadoop-common 48.33% <ø> (?)
spark-java-tests 48.62% <ø> (?)
spark-scala-tests 44.89% <ø> (?)
utilities 37.61% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.
see 2115 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the fix! This PR aligns the buffer selection with HoodieFileGroupReader's merge strategy gating so non-Parquet base files (Lance/ORC) use KeyBasedFileGroupRecordBuffer, and the test changes simply remove the now-unnecessary CoW-only workarounds. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Deletes are not merged correctly with Lance format on MOR table type

6 participants