[HUDI-8371][CHERRYPICK] Fix column stats index with MDT for a few scenarios #18314
Conversation
…geMetadata Convert HoodieRecord list to IndexedRecord before calling collectColumnRangeMetadata, matching the 3-arg signature in 0.14.x (master's version accepted HoodieRecord + Schema). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Replace Collector wildcard pattern with forEach+map in collectColumnRangeMetadata (HoodieTableMetadataUtil) and readRangeFromParquetMetadata (ParquetUtils) to fix Java 8 type inference failures
- Replace FileSlice.hasLogFiles() with getLogFiles().findAny().isPresent() since hasLogFiles() doesn't exist in 0.14.x
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Collect flatMap result to List before grouping to avoid raw type inference issue where Java 8 loses generic type parameter through the flatMap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lstats_initializationfix-oss-cp
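The type-inference workaround described in these commits can be sketched with plain collections. This is a minimal illustration, not the Hudi code: `ColumnRange` and `groupByColumn` are hypothetical stand-ins for `HoodieColumnRangeMetadata` and the grouping logic. Materializing the `flatMap` output into a `List` before `groupingBy` pins the element type that some Java 8 compilers otherwise lose through the chained stream.

```java
import java.util.*;
import java.util.stream.*;

public class TypeInferenceWorkaround {
  // Hypothetical stand-in for HoodieColumnRangeMetadata: a (column, value) pair.
  static final class ColumnRange {
    final String column;
    final int value;
    ColumnRange(String column, int value) { this.column = column; this.value = value; }
  }

  // Collect the flatMap output into a List first, so the element type is fixed
  // before groupingBy; chaining the two directly can trip Java 8 type inference
  // when the stream elements carry wildcard generics.
  static Map<String, List<ColumnRange>> groupByColumn(List<List<ColumnRange>> perFileRanges) {
    List<ColumnRange> flattened = perFileRanges.stream()
        .flatMap(List::stream)
        .collect(Collectors.toList());
    return flattened.stream().collect(Collectors.groupingBy(r -> r.column));
  }
}
```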
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Style & Readability Review — A few minor naming and documentation inconsistencies: inconsistent parameter naming in overloaded constructors, a Javadoc tag typo, and one vague comment that could be more descriptive.
```java
 * When files are directly fetched from Metadata table we do not need to validate HoodiePartitions.
 */
public DirectoryInfo(String relativePath, FileStatus[] fileStatus, String maxInstantTime, Set<String> pendingDataInstants,
                     boolean validateHoodiePartitions) {
```
🤖 nit: the parameter name fileStatus (singular) differs from the overload above which uses fileStatuses (plural) — could you make these consistent?
```java
/**
 * @param filePath
 * @returns the filename from the given path. Path could be the absolute path or just partition path and file name.
 */
```
🤖 nit: Javadoc tag should be @return (not @returns) to match standard conventions.
```diff
@@ -79,7 +79,7 @@ public <T> void processNextRecord(HoodieRecord<T> hoodieRecord) throws Exception

   @Override
   protected void processNextDeletedRecord(DeleteRecord deleteRecord) {
```
🤖 nit: the comment // no op is too vague — consider explaining why delete records should be ignored in this scan mode, e.g., // Ignore delete records; unmerged scan only reads inserts/updates.
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for contributing! The fixes for MOR col stats bootstrapping, DirectoryInfo path handling from MDT, and clean operation with col stats deletion are meaningful improvements. A few concerns to address: a latent bug where maxBufferSize = -1 can reach the log scanner via the write-stat path, an unsafe .get() on an Optional in the log file column stats reader, and a behavioral change to HoodieUnMergedLogRecordScanner that could mask issues for other callers. Details in the inline comments.
```java
protected static List<HoodieColumnRangeMetadata<Comparable>> getLogFileColumnRangeMetadata(String filePath,
                                                                                           HoodieTableMetaClient datasetMetaClient,
                                                                                           List<String> columnsToIndex,
                                                                                           Option<Schema> writerSchemaOpt,
```
🤖 The 5-arg getColumnStatsRecords overload passes maxBufferSize = -1 to readColumnRangeMetadataFrom. If a log file's HoodieDeltaWriteStat doesn't have column stats populated (i.e., getColumnStats() returns empty), translateWriteStatToColumnStats falls through to this path, and -1 gets passed to HoodieUnMergedLogRecordScanner.withBufferSize(). Could you either use a sensible default (like MAX_DFS_STREAM_BUFFER_SIZE.defaultValue()) instead of -1, or add a guard?
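A guard along the lines the reviewer suggests could look like this minimal sketch. `resolveBufferSize` is a hypothetical helper, and `DEFAULT_BUFFER_SIZE` here is only an illustrative stand-in for `MAX_DFS_STREAM_BUFFER_SIZE.defaultValue()`:

```java
public class BufferSizeGuard {
  // Illustrative default, standing in for MAX_DFS_STREAM_BUFFER_SIZE.defaultValue().
  static final int DEFAULT_BUFFER_SIZE = 16 * 1024 * 1024;

  // Fall back to the default whenever the caller passes the -1 sentinel (or any
  // non-positive value), so the sentinel never reaches the log scanner builder.
  static int resolveBufferSize(int requested) {
    return requested > 0 ? requested : DEFAULT_BUFFER_SIZE;
  }
}
```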
```java
HoodieUnMergedLogRecordScanner scanner = HoodieUnMergedLogRecordScanner.newBuilder()
    .withFileSystem(datasetMetaClient.getFs())
    .withBasePath(datasetMetaClient.getBasePath())
    .withLogFilePaths(Collections.singletonList(filePath))
```
🤖 .lastInstant().get() is called without checking isPresent(). If the commits timeline is empty (e.g., during an unusual bootstrap edge case), this will throw NoSuchElementException. Could you add a presence check and return an empty list if no completed commits exist?
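A presence check of the kind suggested might look like the sketch below, using `java.util.Optional` as a stand-in for Hudi's `Option`, and a hypothetical `columnStatsUpTo` helper in place of the real log file column stats reader:

```java
import java.util.*;

public class LastInstantGuard {
  // "lastCompletedInstant" stands in for timeline.getCommitsTimeline()
  // .filterCompletedInstants().lastInstant(), which can be empty during bootstrap.
  static List<String> columnStatsUpTo(Optional<String> lastCompletedInstant) {
    if (!lastCompletedInstant.isPresent()) {
      return Collections.emptyList(); // no completed commits: nothing to read stats for
    }
    String maxInstantTime = lastCompletedInstant.get(); // safe only after the check
    return Collections.singletonList("stats@" + maxInstantTime);
  }
}
```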
```java
scanner.scan();
if (records.isEmpty()) {
  return Collections.emptyList();
}
```
🤖 All log file records are collected into an in-memory List<HoodieRecord>, then converted to List<IndexedRecord>. For large log files this could cause OOM. Have you considered using an iterator-based approach, or at least documenting the expected size constraints? This was also flagged as a concern in a previous review of a related PR.
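An iterator-style alternative could fold each record into running aggregates instead of buffering the whole file in a `List`. The sketch below uses plain integers and a hypothetical `RangeAgg` in place of real column range metadata, only to show the constant-memory shape of the approach:

```java
import java.util.*;

public class StreamingRangeCollector {
  // Hypothetical running aggregate standing in for per-column range metadata.
  static final class RangeAgg {
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
    void accept(int v) { min = Math.min(min, v); max = Math.max(max, v); }
  }

  // Consume records one at a time from an iterator, keeping O(1) state
  // regardless of how large the log file is.
  static RangeAgg collectRanges(Iterator<Integer> records) {
    RangeAgg agg = new RangeAgg();
    while (records.hasNext()) {
      agg.accept(records.next());
    }
    return agg;
  }
}
```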
```diff
   @Override
   protected void processNextDeletedRecord(DeleteRecord deleteRecord) {
-    throw new IllegalStateException("Not expected to see delete records in this log-scan mode. Check Job Config");
+    // no op
```
🤖 Changing this from throwing to no-op changes the contract for all callers of HoodieUnMergedLogRecordScanner, not just the column stats path. Previously, this exception served as a safety net to detect unexpected delete records. Could you instead handle delete records specifically in the column stats scanner callback, or add a builder option like withIgnoreDeleteRecords(true) to make this opt-in?
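The opt-in builder flag could be shaped like this self-contained sketch. `Scanner` and `Builder` are minimal stand-ins for `HoodieUnMergedLogRecordScanner` and its builder, and `withIgnoreDeleteRecords` is the hypothetical option the reviewer proposes; the point is that all existing callers keep the fail-fast contract unless they opt in:

```java
public class ScannerSketch {
  // Minimal stand-ins for HoodieUnMergedLogRecordScanner and its builder.
  static final class Scanner {
    private final boolean ignoreDeleteRecords;
    Scanner(boolean ignoreDeleteRecords) { this.ignoreDeleteRecords = ignoreDeleteRecords; }

    String processNextDeletedRecord(String deleteRecord) {
      if (ignoreDeleteRecords) {
        return "ignored"; // the column-stats path opts in explicitly
      }
      // Every other caller keeps the original fail-fast contract.
      throw new IllegalStateException(
          "Not expected to see delete records in this log-scan mode. Check Job Config");
    }
  }

  static final class Builder {
    private boolean ignoreDeleteRecords = false;
    Builder withIgnoreDeleteRecords(boolean ignore) { this.ignoreDeleteRecords = ignore; return this; }
    Scanner build() { return new Scanner(ignoreDeleteRecords); }
  }
}
```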
```diff
- if (HoodieTimeline.compareTimestamps(dataFileCommitTime, HoodieTimeline.LESSER_THAN_OR_EQUALS, maxInstantTime)) {
    // Limit the file listings to files which were created by successful commits before the maxInstant time.
+   if (!pendingDataInstants.contains(dataFileCommitTime) && HoodieTimeline.compareTimestamps(dataFileCommitTime, LESSER_THAN_OR_EQUALS, maxInstantTime)) {
      filenameToSizeMap.put(status.getPath().getName(), status.getLen());
```
🤖 When isHoodiePartition is determined before the loop and is true, the code skips adding subdirectories entirely (!isHoodiePartition && status.isDirectory()). Is it intentional to not recurse into subdirectories within a Hudi partition? Previously, directories were added to subDirectories regardless of isHoodiePartition status since the partition metafile check happened mid-loop.
0c47d30 into apache:release-0.14.2-prep
Describe the issue this Pull Request addresses
#18376
Summary and Changelog
Impact
This change allows enabling col stats for a MOR table at any given table state.
I ran into other issues along the way which I had to fix to get the patch ready.
Tests covered:
Risk Level
low
Documentation Update
Contributor's checklist