feat(spark): refresh parquet tools clustering strategy for current master by suryaprasanna · Pull Request #18409 · apache/hudi

suryaprasanna · 2026-03-28T07:20:36Z

Describe the issue this Pull Request addresses

This PR refreshes the parquet-tools based clustering strategy from the older parquet-tools branch so it can be proposed against current apache/master.

The original implementation had drifted from current Hudi internals and test APIs. This refresh keeps the existing simple rewrite hook shape while aligning the implementation with current clustering and storage behavior.

Summary and Changelog

Refresh the parquet-tools clustering strategy and its supporting tests for current master.

keep the ParquetToolsExecutionStrategy API simple with the existing file-to-file rewrite hook
generate a new output file id for clustering rewrites instead of reusing the source file id
migrate helper code to current StoragePath / HoodieStorage based APIs
replace brittle previous-commit extraction with FSUtils.getCommitTime(...)
update write-status generation to use current parquet/storage utilities
refresh the related tests to match current writer, meta client, and clustering strategy APIs

Impact

No public API change intended.

This keeps the existing parquet-tools rewrite extension point, but makes it compatible with current Hudi master and current clustering output semantics.

Risk Level

low
The change is localized to the parquet-tools rewrite path and related test scaffolding.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

suryaprasanna · 2026-03-29T20:31:28Z

@nsivabalan Executing any parquet-tools operation requires a special jar to be included in the runtime, so I am not adding the column-nullifying parquet-tools execution strategy. Let us just keep the test class itself.

nsivabalan

I feel we could name the classes better.

Summary Table

Class	Current Name	Recommended Name	Why
Strategy	ParquetToolsExecutionStrategy	SparkExternalFileClusteringExecutionStrategy	Describes what (external file processing) not how (parquet-tools)
Handle	HoodieFileWriteHandle	ExternalFileClusteringWriteHandle	Too generic → specific to clustering use case
Converter	ParquetFileMetaToWriteStatusConvertor	FileMetadataWriteStatusConverter	Shorter, fixes typo (Convertor→Converter), more general
Method	executeTools	transformFile	Clear contract: transform source to target

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The parquet-tools clustering strategy is a nice extension point. I flagged one likely bug — the empty-list guard in ParquetToolsExecutionStrategy.performClusteringForGroup uses > 1 instead of != 1, which would cause an IndexOutOfBoundsException on an empty clustering operations list. A couple of other items around error handling and API consistency are worth discussing in the inline comments.
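The difference between the two guards can be illustrated with a minimal, self-contained sketch. The `requireSingle` helper and class name below are hypothetical and are not part of Hudi; the sketch only shows why a `!= 1` check rejects an empty list where a `> 1` check would let it through to `get(0)`.

```java
import java.util.List;

final class SingleOperationGuardSketch {
    // Hypothetical helper, not Hudi API: returns the only element or fails fast.
    static <T> T requireSingle(List<T> operations) {
        // "!= 1" rejects both the empty list and lists with several entries.
        // A "> 1" check would accept an empty list, and get(0) would then throw
        // IndexOutOfBoundsException.
        if (operations.size() != 1) {
            throw new IllegalArgumentException(
                "Expected exactly one clustering operation per group, found " + operations.size());
        }
        return operations.get(0);
    }

    public static void main(String[] args) {
        System.out.println(requireSingle(List.of("operation-0")));
        try {
            requireSingle(List.of());
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```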

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — nice updates addressing all prior feedback. The != 1 guard fix, StoragePath migration in the abstract API, file-existence check in close(), and the try-catch with partial file cleanup in performClusteringForGroup all look correct. The renames to SparkExternalFileClusteringExecutionStrategy / ExternalFileClusteringWriteHandle / FileMetadataWriteStatusConverter are cleaner and more descriptive. All prior review comments (both mine and @nsivabalan's) have been addressed.

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: This pull request introduces a new external file clustering feature for Hudi. The changes add a metadata-driven approach to write status generation, a specialized write handle for external file clustering operations, and execution strategies (both abstract and test implementations) that orchestrate the clustering transformation workflow with pluggable file transformation logic.

Greptile Summary: This PR introduces a new external-file clustering execution strategy for Apache Spark, enabling Hudi to delegate the actual file rewrite to an arbitrary external transformation (e.g., a Parquet-tools command) rather than reading and re-writing records through the standard Hudi write path. The key additions are:

FileMetadataWriteStatusConverter – builds a WriteStatus / HoodieWriteStat by reading file metadata (row count, file size) directly from the output Parquet file rather than from individual record writes.
ExternalFileClusteringWriteHandle – a HoodieWriteHandle subclass that creates the output path and marker file upfront, then delegates actual writing to the caller before finalising the write status in close().
SparkExternalFileClusteringExecutionStrategy – abstract strategy that enforces one-operation-per-group, invokes the user-supplied transformFile(), and handles cleanup on failure.
Test helpers ClusteringIdentityTestExecutionStrategy and ExternalFileClusteringTestExecutionStrategy to exercise the new skeleton.
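The shape of the strategy described above (an abstract `transformFile` hook plus cleanup of partial output on failure) can be sketched without any Hudi dependencies. This is only an illustration under stated assumptions: it uses `java.nio.file` in place of `HoodieStorage` / `StoragePath`, throws `IllegalStateException` in place of `HoodieClusteringException`, and the class name `ExternalRewriteSketch` and the `rewrite` method are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

abstract class ExternalRewriteSketch {
    /** Subclass hook: produce target from source, e.g. by invoking an external tool. */
    protected abstract void transformFile(Path source, Path target) throws IOException;

    /** Runs the hook and returns the output size; removes partial output on failure. */
    final long rewrite(Path source, Path target) {
        try {
            transformFile(source, target);
            // Files.size throws NoSuchFileException if the hook produced no output.
            return Files.size(target);
        } catch (IOException | RuntimeException e) {
            try {
                Files.deleteIfExists(target); // do not leave a partial file behind
            } catch (IOException cleanupFailure) {
                e.addSuppressed(cleanupFailure);
            }
            throw new IllegalStateException("External rewrite failed for " + source, e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("external-rewrite");
        Path source = Files.write(dir.resolve("source.bin"), new byte[] {1, 2, 3});
        ExternalRewriteSketch identity = new ExternalRewriteSketch() {
            @Override
            protected void transformFile(Path src, Path dst) throws IOException {
                Files.copy(src, dst); // identity transformation, similar in spirit to a test strategy
            }
        };
        System.out.println(identity.rewrite(source, dir.resolve("target.bin")));
    }
}
```

Keeping the hook to a plain source-to-target signature means a subclass only has to know how to rewrite one file, while the base class owns the failure handling.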

Issues found:

ExternalFileClusteringWriteHandle.close() wraps IOException in HoodieInsertException — the wrong exception type for a clustering context; should use HoodieClusteringException or HoodieIOException.
FileMetadataWriteStatusConverter.generateHoodieWriteStat() uses String.valueOf(executionConfigs.get(PREV_COMMIT)), which silently produces the literal string "null" if the value is null, corrupting the prevCommit field on the write stat.
ExternalFileClusteringWriteHandle.close() never calls markClosed(), leaving the parent-class closed flag permanently false.
generateWriteStatus() instantiates FileMetadataWriteStatusConverter as a raw type — generic parameters should be forwarded.
TestFileMetadataWriteStatusConverter hard-codes the map keys "totalCreateTime" and "prevCommit" rather than referencing the exported constants, making the test fragile to future renames.
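The `markClosed()` item in the list above is an instance of a general close-once pattern. A minimal sketch follows, assuming a hypothetical `CloseOnceHandleSketch` class that stands in for a write handle; it does not reproduce the internals of Hudi's `HoodieWriteHandle`, which are not shown in this thread.

```java
import java.util.List;

final class CloseOnceHandleSketch {
    private boolean closed = false;
    private int finalizeCalls = 0;

    boolean isClosed() {
        return closed;
    }

    int finalizeCalls() {
        return finalizeCalls;
    }

    /** Finalises the handle once; later calls are no-ops that return an empty list. */
    List<String> close() {
        if (closed) {
            return List.of();
        }
        finalizeCalls++; // stand-in for building the write status from file metadata
        closed = true;   // plays the role of markClosed() in the parent class
        return List.of("write-status");
    }

    public static void main(String[] args) {
        CloseOnceHandleSketch handle = new CloseOnceHandleSketch();
        System.out.println(handle.close().size());
        System.out.println(handle.close().size());
        System.out.println(handle.isClosed());
    }
}
```

If the flag were never set, any caller that consults it would treat the handle as still open, and a second `close()` would repeat the finalisation work.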

Greptile Confidence Score: 3/5

The PR introduces a useful and well-structured feature, but two logic-level issues in production code (wrong exception type in close() and String.valueOf(null) → "null" string corruption) should be addressed before merging.
The overall architecture is sound and the test coverage is reasonable. However, the HoodieInsertException wrapping in a clustering context could cause exception-type-sensitive error handling to misbehave, and the String.valueOf(null) issue in generateHoodieWriteStat could silently corrupt write statistics. These are targeted, concrete fixes rather than deep rearchitecting, so the PR is close to ready.
ExternalFileClusteringWriteHandle.java and FileMetadataWriteStatusConverter.java need the two logic fixes before merge.
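The `String.valueOf(null)` concern can be reproduced in isolation. The sketch below uses the `"prevCommit"` key mentioned in the review; the class and method names are hypothetical, and returning `null` for a missing value is only one possible fix (failing fast with an exception is another), not necessarily what the PR adopted.

```java
import java.util.HashMap;
import java.util.Map;

final class PrevCommitNullSketch {
    static final String PREV_COMMIT = "prevCommit";

    // Mirrors the flagged pattern: a missing value becomes the four-character string "null".
    static String unsafePrevCommit(Map<String, Object> configs) {
        return String.valueOf(configs.get(PREV_COMMIT));
    }

    // One possible alternative (an assumption): keep a missing value as a real null.
    static String safePrevCommit(Map<String, Object> configs) {
        Object value = configs.get(PREV_COMMIT);
        return value == null ? null : value.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> configs = new HashMap<>();
        System.out.println("null".equals(unsafePrevCommit(configs)));
        System.out.println(safePrevCommit(configs) == null);
    }
}
```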

Sequence Diagram (CodeRabbit):

sequenceDiagram
    actor User
    participant Strategy as SparkExternalFileClusteringExecutionStrategy
    participant Handle as ExternalFileClusteringWriteHandle
    participant Converter as FileMetadataWriteStatusConverter
    participant Storage as HoodieStorage/ParquetUtils
    
    User->>Strategy: performClusteringForGroup()
    Strategy->>Handle: create ExternalFileClusteringWriteHandle
    Strategy->>Strategy: transformFile(oldPath, newPath)
    Note over Strategy: Abstract method<br/>implemented by subclass
    alt Transformation Success
        Strategy->>Handle: close()
        Handle->>Handle: build executionConfigs<br/>(prevCommit, timeElapsed)
        Handle->>Converter: convert(parquetFile, partition, configs)
        Converter->>Storage: ParquetUtils.getRowCount()
        Converter->>Storage: storage.getPathInfo().getLength()
        Converter->>Converter: generateHoodieWriteStat()
        Converter-->>Handle: WriteStatus
        Handle-->>Strategy: List<WriteStatus>
        Strategy-->>User: List<WriteStatus>
    else Transformation Failure
        Strategy->>Storage: delete output file
        Strategy-->>User: throw HoodieClusteringException
    end

Sequence Diagram (Greptile):

sequenceDiagram
    participant S as SparkExternalFileClustering<br/>ExecutionStrategy
    participant H as ExternalFileClusteringWriteHandle
    participant T as transformFile()<br/>(subclass impl)
    participant C as FileMetadataWrite<br/>StatusConverter
    participant FS as HoodieStorage

    S->>H: new(config, instantTime, table, partitionPath, fileId, oldFilePath)
    H->>FS: makeNewPath(partitionPath) → newFilePath
    H->>FS: createMarkerFile(partitionPath, newFilePath)

    S->>T: transformFile(oldFilePath, newFilePath)
    alt transformation fails
        T-->>S: Exception
        S->>FS: deleteFile(newFilePath)
        S-->>S: throw HoodieClusteringException
    end

    S->>H: close()
    H->>FS: exists(newFilePath)
    H->>C: "convert(newFilePath, partitionPath, {PREV_COMMIT, TIME_TAKEN})"
    C->>FS: getRowCount(parquetFilePath)
    C->>FS: getPathInfo(parquetFilePath).getLength()
    C-->>H: WriteStatus (with HoodieWriteStat)
    H-->>S: List[WriteStatus]

CodeRabbit: yihua#37 (review)
Greptile: yihua#37 (review)

codecov-commenter · 2026-04-15T23:16:29Z

Codecov Report

❌ Patch coverage is 40.00000% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.83%. Comparing base (1eb97b3) to head (6d5c8b4).
⚠️ Report is 52 commits behind head on master.

Files with missing lines	Patch %	Lines
.../SparkExternalFileClusteringExecutionStrategy.java	0.00%	24 Missing ⚠️
...che/hudi/io/ExternalFileClusteringWriteHandle.java	0.00%	23 Missing ⚠️
...di/execution/FileMetadataWriteStatusConverter.java	96.96%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18409      +/-   ##
============================================
+ Coverage     68.21%   68.83%   +0.61%     
- Complexity    27709    28235     +526     
============================================
  Files          2440     2463      +23     
  Lines        134249   135336    +1087     
  Branches      16179    16394     +215     
============================================
+ Hits          91578    93153    +1575     
+ Misses        35565    34806     -759     
- Partials       7106     7377     +271

Flag	Coverage Δ
common-and-other-modules	`44.59% <40.00%> (+0.26%)`	⬆️
hadoop-mr-java-client	`44.83% <0.00%> (-0.10%)`	⬇️
spark-client-hadoop-common	`48.40% <0.00%> (+0.08%)`	⬆️
spark-java-tests	`48.89% <0.00%> (+0.17%)`	⬆️
spark-scala-tests	`45.48% <0.00%> (+0.24%)`	⬆️
utilities	`38.22% <0.00%> (-0.16%)`	⬇️

Files with missing lines	Coverage Δ
...di/execution/FileMetadataWriteStatusConverter.java	`96.96% <96.96%> (ø)`
...che/hudi/io/ExternalFileClusteringWriteHandle.java	`0.00% <0.00%> (ø)`
.../SparkExternalFileClusteringExecutionStrategy.java	`0.00% <0.00%> (ø)`

... and 194 files with indirect coverage changes

hudi-bot · 2026-04-16T00:06:44Z

CI report:

6d5c8b4 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

`@hudi-bot run azure`: re-run the last Azure build

suryaprasanna added 2 commits March 27, 2026 23:13

feat(spark): add parquet tools clustering strategy

3a114c7

Refactor

0df6e53

github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Mar 28, 2026

nsivabalan reviewed Apr 3, 2026

yihua reviewed Apr 3, 2026

suryaprasanna commented Apr 13, 2026

Comment thread .../main/java/org/apache/hudi/client/clustering/run/strategy/ParquetToolsExecutionStrategy.java Outdated

Addressing feedback

e0c6343

yihua reviewed Apr 14, 2026

yihua mentioned this pull request Apr 14, 2026

[OSS PR #18409] feat(spark): refresh parquet tools clustering strategy for current master yihua/hudi#37

Open

yihua reviewed Apr 14, 2026

Fixing last few feedback

6d5c8b4

nsivabalan approved these changes Apr 15, 2026

nsivabalan merged commit a649188 into apache:master Apr 16, 2026
56 checks passed

5 participants