[HUDI-8371][CHERRYPICK] Fix column stats index with MDT for a few scenarios by vamsikarnika · Pull Request #18314 · apache/hudi

vamsikarnika · 2026-03-12T11:54:40Z

Describe the issue this Pull Request addresses

Support bootstrapping of col stats for MOR table.
Fix clean operation with col stats. Even though stats are nullified, the records apparently were not deleted from the col stats partition.

Summary and Changelog

Support bootstrapping of col stats for MOR table.
Fix clean operation with col stats. Even though stats are nullified, the records apparently were not deleted from the col stats partition.

Impact

We can now enable col stats for a MOR table at any given state.
I ran into other issues along the way, which I had to fix to get the patch ready.

DirectoryInfo was not accounting for files fetched from MDT. When a new MDT partition is initialized, we poll MDT to fetch file info rather than doing FS-based listing. This path had a bug, which is fixed here.
When a clean from the data table is applied to MDT, we were nullifying the stats or marking them as deleted, but the record itself was not deleted from the col stats partition and lingered. This patch fixes that as well.

Tests covered:

Bootstrapping of both COW and MOR tables.
Covered both partitioned and non-partitioned tables.
Ensured that log files with delete blocks, partially failed log blocks, and rollback blocks are accounted for in tests.
Added tests to validate that clean removes the entry from col stats for both table types, for partitioned and non-partitioned tables.

Risk Level

low

Documentation Update

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

…geMetadata Convert HoodieRecord list to IndexedRecord before calling collectColumnRangeMetadata, matching the 3-arg signature in 0.14.x (master's version accepted HoodieRecord + Schema). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Replace Collector wildcard pattern with forEach+map in collectColumnRangeMetadata (HoodieTableMetadataUtil) and readRangeFromParquetMetadata (ParquetUtils) to fix Java 8 type inference failures
- Replace FileSlice.hasLogFiles() with getLogFiles().findAny().isPresent() since hasLogFiles() doesn't exist in 0.14.x

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Collect flatMap result to List before grouping to avoid raw type inference issue where Java 8 loses generic type parameter through the flatMap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
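The workaround described in this commit message can be illustrated with plain JDK types. The following is a minimal sketch, not the code from the PR: `ColumnRange` and `groupByColumn` are hypothetical stand-ins, and the only point shown is that assigning the `flatMap` output to an explicitly typed `List` gives the compiler a concrete element type before `groupingBy` runs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class GroupAfterCollectSketch {
  // Hypothetical stand-in for a per-file column range; not a Hudi class.
  static final class ColumnRange {
    final String column;
    final long min;
    final long max;

    ColumnRange(String column, long min, long max) {
      this.column = column;
      this.min = min;
      this.max = max;
    }
  }

  static Map<String, List<ColumnRange>> groupByColumn(List<List<ColumnRange>> perBlock) {
    // Step 1: materialize the flatMap output into an explicitly typed List,
    // so the element type is fixed before any further collectors are chained.
    List<ColumnRange> flat = perBlock.stream()
        .flatMap(List::stream)
        .collect(Collectors.toList());
    // Step 2: group the typed list by column name.
    return flat.stream().collect(Collectors.groupingBy(r -> r.column));
  }

  public static void main(String[] args) {
    Map<String, List<ColumnRange>> grouped = groupByColumn(Arrays.asList(
        Arrays.asList(new ColumnRange("a", 1, 5), new ColumnRange("b", 0, 2)),
        Arrays.asList(new ColumnRange("a", 3, 9))));
    System.out.println("ranges for column a: " + grouped.get("a").size());
  }
}
```

The cost of this pattern is one intermediate list, which is usually acceptable when the number of per-column ranges is small.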


nsivabalan · 2026-04-09T15:17:38Z


hudi-bot · 2026-04-10T14:19:02Z

CI report:

07602e0 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — A few minor naming and documentation inconsistencies: inconsistent parameter naming in overloaded constructors, a Javadoc tag typo, and one vague comment that could be more descriptive.

yihua · 2026-04-14T22:21:32Z

+     * When files are directly fetched from Metadata table we do not need to validate HoodiePartitions.
+     */
+    public DirectoryInfo(String relativePath, FileStatus[] fileStatus, String maxInstantTime, Set<String> pendingDataInstants,
+                         boolean validateHoodiePartitions) {


🤖 nit: the parameter name fileStatus (singular) differs from the overload above which uses fileStatuses (plural) — could you make these consistent?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

yihua · 2026-04-14T22:21:32Z

+  /**
+   * @param filePath
+   * @returns the filename from the given path. Path could be the absolute path or just partition path and file name.
+   */


🤖 nit: Javadoc tag should be @return (not @returns) to match standard conventions.


yihua · 2026-04-14T22:21:32Z

@@ -79,7 +79,7 @@ public <T> void processNextRecord(HoodieRecord<T> hoodieRecord) throws Exception

  @Override
  protected void processNextDeletedRecord(DeleteRecord deleteRecord) {


🤖 nit: the comment // no op is too vague — consider explaining why delete records should be ignored in this scan mode, e.g., // Ignore delete records; unmerged scan only reads inserts/updates.


yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The fixes for MOR col stats bootstrapping, DirectoryInfo path handling from MDT, and clean operation with col stats deletion are meaningful improvements. A few concerns to address: a latent bug where maxBufferSize = -1 can reach the log scanner via the write-stat path, an unsafe .get() on an Optional in the log file column stats reader, and a behavioral change to HoodieUnMergedLogRecordScanner that could mask issues for other callers. Details in the inline comments.

yihua · 2026-04-14T22:25:13Z

+  protected static List<HoodieColumnRangeMetadata<Comparable>> getLogFileColumnRangeMetadata(String filePath,
+                                                                                             HoodieTableMetaClient datasetMetaClient,
+                                                                                             List<String> columnsToIndex,
+                                                                                             Option<Schema> writerSchemaOpt,


🤖 The 5-arg getColumnStatsRecords overload passes maxBufferSize = -1 to readColumnRangeMetadataFrom. If a log file's HoodieDeltaWriteStat doesn't have column stats populated (i.e., getColumnStats() returns empty), translateWriteStatToColumnStats falls through to this path, and -1 gets passed to HoodieUnMergedLogRecordScanner.withBufferSize(). Could you either use a sensible default (like MAX_DFS_STREAM_BUFFER_SIZE.defaultValue()) instead of -1, or add a guard?
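The guard suggested in this comment could look roughly like the sketch below. It is an illustration only: `effectiveBufferSize` is a hypothetical helper, and `ASSUMED_DEFAULT_BUFFER_SIZE` is a placeholder value standing in for whatever default the project's configuration actually defines.

```java
final class BufferSizeGuardSketch {
  // Placeholder default (an assumption for this sketch), standing in for the
  // configured default that the review comment refers to.
  static final int ASSUMED_DEFAULT_BUFFER_SIZE = 16 * 1024 * 1024;

  // Returns the requested size when it is positive, otherwise the default.
  // This keeps a sentinel such as -1 from reaching the scanner builder.
  static int effectiveBufferSize(int requested) {
    return requested > 0 ? requested : ASSUMED_DEFAULT_BUFFER_SIZE;
  }

  public static void main(String[] args) {
    System.out.println(effectiveBufferSize(-1));
    System.out.println(effectiveBufferSize(4096));
  }
}
```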


yihua · 2026-04-14T22:25:13Z

+      HoodieUnMergedLogRecordScanner scanner = HoodieUnMergedLogRecordScanner.newBuilder()
+          .withFileSystem(datasetMetaClient.getFs())
+          .withBasePath(datasetMetaClient.getBasePath())
+          .withLogFilePaths(Collections.singletonList(filePath))


🤖 .lastInstant().get() is called without checking isPresent(). If the commits timeline is empty (e.g., during an unusual bootstrap edge case), this will throw NoSuchElementException. Could you add a presence check and return an empty list if no completed commits exist?
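A presence check of the kind requested here might be sketched as follows. The example uses `java.util.Optional` as a stand-in for the project's own option type, and `columnStatsFor` is a hypothetical method name; the returned strings are placeholders for real column range metadata.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

final class LastInstantGuardSketch {
  // Returns an empty list when no completed instant exists, instead of
  // calling get() on an empty Optional and throwing NoSuchElementException.
  static List<String> columnStatsFor(Optional<String> lastCompletedInstant) {
    if (!lastCompletedInstant.isPresent()) {
      return Collections.emptyList();
    }
    String latest = lastCompletedInstant.get();
    // Placeholder result; a real implementation would scan the log file here.
    return Collections.singletonList("stats-as-of-" + latest);
  }

  public static void main(String[] args) {
    System.out.println(columnStatsFor(Optional.empty()));
    System.out.println(columnStatsFor(Optional.of("20260312115440")));
  }
}
```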


yihua · 2026-04-14T22:25:13Z

+      scanner.scan();
+      if (records.isEmpty()) {
+        return Collections.emptyList();
+      }


🤖 All log file records are collected into an in-memory List<HoodieRecord>, then converted to List<IndexedRecord>. For large log files this could cause OOM. Have you considered using an iterator-based approach, or at least documenting the expected size constraints? This was also flagged as a concern in a previous review of a related PR.
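One way to picture the iterator-based approach mentioned in this comment is a running accumulator that never holds more than one value at a time. This is a simplified sketch under stated assumptions: it tracks a single `Long` column, and `Range` and `fold` are hypothetical names, not part of the PR.

```java
import java.util.Arrays;
import java.util.Iterator;

final class StreamingRangeSketch {
  // Running statistics for one column; updated one value at a time.
  static final class Range {
    Long min;
    Long max;
    long nullCount;
    long valueCount;

    void accept(Long v) {
      valueCount++;
      if (v == null) {
        nullCount++;
        return;
      }
      if (min == null || v < min) {
        min = v;
      }
      if (max == null || v > max) {
        max = v;
      }
    }
  }

  // Consumes the iterator without buffering the values in a list,
  // so memory use stays constant regardless of how many values arrive.
  static Range fold(Iterator<Long> values) {
    Range range = new Range();
    while (values.hasNext()) {
      range.accept(values.next());
    }
    return range;
  }

  public static void main(String[] args) {
    Range r = fold(Arrays.asList(3L, null, -1L, 7L).iterator());
    System.out.println(r.min + " " + r.max + " " + r.nullCount + " " + r.valueCount);
  }
}
```

In a scanner with a per-record callback, the same `accept` step could be invoked from the callback directly, which would avoid the intermediate record list altogether.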


yihua · 2026-04-14T22:25:13Z

  @Override
  protected void processNextDeletedRecord(DeleteRecord deleteRecord) {
-    throw new IllegalStateException("Not expected to see delete records in this log-scan mode. Check Job Config");
+    // no op


🤖 Changing this from throwing to no-op changes the contract for all callers of HoodieUnMergedLogRecordScanner, not just the column stats path. Previously, this exception served as a safety net to detect unexpected delete records. Could you instead handle delete records specifically in the column stats scanner callback, or add a builder option like withIgnoreDeleteRecords(true) to make this opt-in?
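The opt-in builder flag proposed in this comment could be modeled as below. Everything here is hypothetical: `Scanner`, `Builder`, and `withIgnoreDeleteRecords` are toy classes written for this sketch and do not exist in the PR as shown. The idea is that the default keeps the original fail-fast behavior and only callers that ask for it get the no-op.

```java
final class DeleteHandlingSketch {
  // Toy scanner; stands in for a log scanner with a delete-record callback.
  static final class Scanner {
    private final boolean ignoreDeleteRecords;
    private int skippedDeletes;

    private Scanner(boolean ignoreDeleteRecords) {
      this.ignoreDeleteRecords = ignoreDeleteRecords;
    }

    void processNextDeletedRecord(String recordKey) {
      if (!ignoreDeleteRecords) {
        // Default: keep the safety net for callers that never expect deletes.
        throw new IllegalStateException(
            "Not expected to see delete records in this log-scan mode. Check Job Config");
      }
      // Opt-in path: skip the delete record and keep a count for diagnostics.
      skippedDeletes++;
    }

    int skippedDeletes() {
      return skippedDeletes;
    }
  }

  static final class Builder {
    private boolean ignoreDeleteRecords = false;

    Builder withIgnoreDeleteRecords(boolean ignore) {
      this.ignoreDeleteRecords = ignore;
      return this;
    }

    Scanner build() {
      return new Scanner(ignoreDeleteRecords);
    }
  }

  public static void main(String[] args) {
    Scanner lenient = new Builder().withIgnoreDeleteRecords(true).build();
    lenient.processNextDeletedRecord("key-1");
    System.out.println("skipped deletes: " + lenient.skippedDeletes());
  }
}
```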


yihua · 2026-04-14T22:25:13Z

-          if (HoodieTimeline.compareTimestamps(dataFileCommitTime, HoodieTimeline.LESSER_THAN_OR_EQUALS, maxInstantTime)) {
+          // Limit the file listings to files which were created by successful commits before the maxInstant time.
+          if (!pendingDataInstants.contains(dataFileCommitTime) && HoodieTimeline.compareTimestamps(dataFileCommitTime, LESSER_THAN_OR_EQUALS, maxInstantTime)) {
            filenameToSizeMap.put(status.getPath().getName(), status.getLen());


🤖 When isHoodiePartition is determined before the loop and is true, the code skips adding subdirectories entirely (!isHoodiePartition && status.isDirectory()). Is it intentional to not recurse into subdirectories within a Hudi partition? Previously, directories were added to subDirectories regardless of isHoodiePartition status since the partition metafile check happened mid-loop.
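The filter condition in the diff above combines two checks: the file's commit must not be pending, and it must not be later than the max instant. A minimal sketch of that predicate follows. It assumes instant times are fixed-width timestamp strings that order correctly under plain string comparison, and `isVisible` is a hypothetical helper, not the method used in the PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CompletedCommitFilterSketch {
  // A file is listed only if its commit is not pending and is not later
  // than maxInstantTime. Assumes fixed-width timestamp strings, so that
  // lexicographic order matches chronological order.
  static boolean isVisible(String commitTime, String maxInstantTime, Set<String> pendingInstants) {
    return !pendingInstants.contains(commitTime)
        && commitTime.compareTo(maxInstantTime) <= 0;
  }

  public static void main(String[] args) {
    Set<String> pending = new HashSet<>(Arrays.asList("20260312120000"));
    System.out.println(isVisible("20260312110000", "20260312130000", pending));
    System.out.println(isVisible("20260312120000", "20260312130000", pending));
    System.out.println(isVisible("20260312140000", "20260312130000", pending));
  }
}
```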


vamsikarnika and others added 4 commits March 12, 2026 16:37

[HUDI-8371] Fixing column stats index with MDT for few scenarios

c97664f

Fix Java 8 type inference in ParquetUtils flatMap chain

03a878b

Collect flatMap result to List before grouping to avoid raw type inference issue where Java 8 loses generic type parameter through the flatMap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions Bot added the size:XL PR with lines of changes > 1000 label Mar 12, 2026

Fix conflicts

a7d9668

github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:XL PR with lines of changes > 1000 labels Mar 13, 2026

vamsikarnika added 6 commits March 13, 2026 23:42

Fix TestHoodieMetadataPayload tests

5419e5d

fix scala issues

099f023

Fix checkstyle

74b57ec

fix CI

d80a4b6

Fix tests in TestColumnStatsIndex

82913a2

Fix MDT Bootstrap tests

b27240d

github-actions Bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Mar 18, 2026

apache deleted a comment from hudi-bot Mar 20, 2026

vamsikarnika added 2 commits March 24, 2026 16:02

Merge branch 'release-0.14.2-prep' into mor_colstats_initializationfi…

e4306a7

…x-oss-cp

Fix checkstyle and tests

6abac32

nsivabalan approved these changes Apr 9, 2026


Merge remote-tracking branch 'apache/release-0.14.2-prep' into mor_co…

07602e0

…lstats_initializationfix-oss-cp

yihua reviewed Apr 14, 2026


yihua mentioned this pull request Apr 14, 2026

[OSS PR #18314] [HUDI-8371][CHERRYPICK] Fix column stats index with MDT for a few scenarios yihua/hudi#40

Open

nsivabalan merged commit 0c47d30 into apache:release-0.14.2-prep Apr 16, 2026
30 of 32 checks passed

nsivabalan merged 14 commits into apache:release-0.14.2-prep from vamsikarnika:mor_colstats_initializationfix-oss-cp

4 participants
