Skip to content

[HUDI-8371][CHERRYPICK] Fix column stats index with MDT for a few scenarios#18314

Merged
nsivabalan merged 14 commits into
apache:release-0.14.2-prepfrom
vamsikarnika:mor_colstats_initializationfix-oss-cp
Apr 16, 2026
Merged

[HUDI-8371][CHERRYPICK] Fix column stats index with MDT for a few scenarios#18314
nsivabalan merged 14 commits into
apache:release-0.14.2-prepfrom
vamsikarnika:mor_colstats_initializationfix-oss-cp

Conversation

@vamsikarnika
Copy link
Copy Markdown
Collaborator

@vamsikarnika vamsikarnika commented Mar 12, 2026

Describe the issue this Pull Request addresses

#18376

  • Support bootstrapping of col stats for MOR table.
  • Fix clean operation with col stats. Even though stats are nullified, the records apparently were not deleted from the col stats partition.

Summary and Changelog

  • Support bootstrapping of col stats for MOR table.
  • Fix clean operation with col stats. Even though stats are nullified, the records apparently were not deleted from the col stats partition.

Impact

We could enable col stats for MOR table at any given state.
Ran into other issues along the way which I had to fix to get the patch ready.

  • DirectoryInfo was not accounting for files fetched from MDT. When a new MDT partition is initialized, to fetch file info, we poll MDT rather than doing FS based listing. This had some a bug and had to fix it.
  • When clean from data table is applied to MDT, we were nullifying the stats or marking it as deleted, but the record as such is not deleted from col stats partition and was lingering. Fixed the same in this patch.

Tests covered:

  • bootstrapping of both COW and MOR table.
  • Covered both partitioned and non-partitioned table.
  • Ensure log files w/ delete block, partially failed log blocks and rollback blocks are accounted for in tests.
  • Added tests to validate clean does remove the entry from col stats for both table types and partition and non-partitioned table.

Risk Level

low

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

vamsikarnika and others added 4 commits March 12, 2026 16:37
…geMetadata

Convert HoodieRecord list to IndexedRecord before calling collectColumnRangeMetadata,
matching the 3-arg signature in 0.14.x (master's version accepted HoodieRecord + Schema).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace Collector wildcard pattern with forEach+map in collectColumnRangeMetadata
  (HoodieTableMetadataUtil) and readRangeFromParquetMetadata (ParquetUtils) to fix
  Java 8 type inference failures
- Replace FileSlice.hasLogFiles() with getLogFiles().findAny().isPresent() since
  hasLogFiles() doesn't exist in 0.14.x

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collect flatMap result to List before grouping to avoid raw type inference
issue where Java 8 loses generic type parameter through the flatMap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the size:XL PR with lines of changes > 1000 label Mar 12, 2026
@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:XL PR with lines of changes > 1000 labels Mar 13, 2026
@github-actions github-actions Bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Mar 18, 2026
@apache apache deleted a comment from hudi-bot Mar 20, 2026
@nsivabalan
Copy link
Copy Markdown
Contributor

image

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — A few minor naming and documentation inconsistencies: inconsistent parameter naming in overloaded constructors, a Javadoc tag typo, and one vague comment that could be more descriptive.

* When files are directly fetched from Metadata table we do not need to validate HoodiePartitions.
*/
public DirectoryInfo(String relativePath, FileStatus[] fileStatus, String maxInstantTime, Set<String> pendingDataInstants,
boolean validateHoodiePartitions) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the parameter name fileStatus (singular) differs from the overload above which uses fileStatuses (plural) — could you make these consistent?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

/**
* @param filePath
* @returns the filename from the given path. Path could be the absolute path or just partition path and file name.
*/
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: Javadoc tag should be @return (not @returns) to match standard conventions.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

@@ -79,7 +79,7 @@ public <T> void processNextRecord(HoodieRecord<T> hoodieRecord) throws Exception

@Override
protected void processNextDeletedRecord(DeleteRecord deleteRecord) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the comment // no op is too vague — consider explaining why delete records should be ignored in this scan mode, e.g., // Ignore delete records; unmerged scan only reads inserts/updates.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The fixes for MOR col stats bootstrapping, DirectoryInfo path handling from MDT, and clean operation with col stats deletion are meaningful improvements. A few concerns to address: a latent bug where maxBufferSize = -1 can reach the log scanner via the write-stat path, an unsafe .get() on an Optional in the log file column stats reader, and a behavioral change to HoodieUnMergedLogRecordScanner that could mask issues for other callers. Details in the inline comments.

protected static List<HoodieColumnRangeMetadata<Comparable>> getLogFileColumnRangeMetadata(String filePath,
HoodieTableMetaClient datasetMetaClient,
List<String> columnsToIndex,
Option<Schema> writerSchemaOpt,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The 5-arg getColumnStatsRecords overload passes maxBufferSize = -1 to readColumnRangeMetadataFrom. If a log file's HoodieDeltaWriteStat doesn't have column stats populated (i.e., getColumnStats() returns empty), translateWriteStatToColumnStats falls through to this path, and -1 gets passed to HoodieUnMergedLogRecordScanner.withBufferSize(). Could you either use a sensible default (like MAX_DFS_STREAM_BUFFER_SIZE.defaultValue()) instead of -1, or add a guard?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

HoodieUnMergedLogRecordScanner scanner = HoodieUnMergedLogRecordScanner.newBuilder()
.withFileSystem(datasetMetaClient.getFs())
.withBasePath(datasetMetaClient.getBasePath())
.withLogFilePaths(Collections.singletonList(filePath))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 .lastInstant().get() is called without checking isPresent(). If the commits timeline is empty (e.g., during an unusual bootstrap edge case), this will throw NoSuchElementException. Could you add a presence check and return an empty list if no completed commits exist?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

scanner.scan();
if (records.isEmpty()) {
return Collections.emptyList();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 All log file records are collected into an in-memory List<HoodieRecord>, then converted to List<IndexedRecord>. For large log files this could cause OOM. Have you considered using an iterator-based approach, or at least documenting the expected size constraints? This was also flagged as a concern in a previous review of a related PR.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

@Override
protected void processNextDeletedRecord(DeleteRecord deleteRecord) {
throw new IllegalStateException("Not expected to see delete records in this log-scan mode. Check Job Config");
// no op
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Changing this from throwing to no-op changes the contract for all callers of HoodieUnMergedLogRecordScanner, not just the column stats path. Previously, this exception served as a safety net to detect unexpected delete records. Could you instead handle delete records specifically in the column stats scanner callback, or add a builder option like withIgnoreDeleteRecords(true) to make this opt-in?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

if (HoodieTimeline.compareTimestamps(dataFileCommitTime, HoodieTimeline.LESSER_THAN_OR_EQUALS, maxInstantTime)) {
// Limit the file listings to files which were created by successful commits before the maxInstant time.
if (!pendingDataInstants.contains(dataFileCommitTime) && HoodieTimeline.compareTimestamps(dataFileCommitTime, LESSER_THAN_OR_EQUALS, maxInstantTime)) {
filenameToSizeMap.put(status.getPath().getName(), status.getLen());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 When isHoodiePartition is determined before the loop and is true, the code skips adding subdirectories entirely (!isHoodiePartition && status.isDirectory()). Is it intentional to not recurse into subdirectories within a Hudi partition? Previously, directories were added to subDirectories regardless of isHoodiePartition status since the partition metafile check happened mid-loop.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

@nsivabalan nsivabalan merged commit 0c47d30 into apache:release-0.14.2-prep Apr 16, 2026
30 of 32 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants