
[OSS PR #18409] feat(spark): refresh parquet tools clustering strategy for current master#37

Open
yihua wants to merge 3 commits into master from oss-18409

Conversation

@yihua
Owner

@yihua yihua commented Apr 14, 2026

Mirror of apache#18409 for automated bot review.

Original author: @suryaprasanna
Base branch: master

Summary by CodeRabbit

  • New Features

    • Added external file clustering support with enhanced file metadata tracking for more efficient data reorganization operations.
  • Tests

    • Added comprehensive test coverage for file metadata handling and clustering execution strategies to ensure reliability and correctness.

@coderabbitai

coderabbitai Bot commented Apr 14, 2026

📝 Walkthrough


This pull request introduces a new external file clustering feature for Hudi. The changes add a metadata-driven approach to write status generation, a specialized write handle for external file clustering operations, and execution strategies (both abstract and test implementations) that orchestrate the clustering transformation workflow with pluggable file transformation logic.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core Clustering Infrastructure: FileMetadataWriteStatusConverter.java, ExternalFileClusteringWriteHandle.java | Introduces utilities for converting parquet output files to WriteStatus objects. FileMetadataWriteStatusConverter extracts row counts and file sizes from parquet metadata via reflection-based loader pattern. ExternalFileClusteringWriteHandle manages the write lifecycle and integrates the converter to generate status from output files. |
| Abstract Execution Strategy: SparkExternalFileClusteringExecutionStrategy.java | Provides abstract execution strategy extending SingleSparkJobExecutionStrategy with overridden performClusteringForGroup. Orchestrates external file clustering by creating write handles, calling abstract transformFile method, handling failures with cleanup, and returning write status. Defines transformFile as abstract for subclass implementation. |
| Test Infrastructure: TestFileMetadataWriteStatusConverter.java, ClusteringIdentityTestExecutionStrategy.java, ExternalFileClusteringTestExecutionStrategy.java | Test classes covering converter functionality and execution strategies. TestFileMetadataWriteStatusConverter validates metadata extraction from parquet. ClusteringIdentityTestExecutionStrategy enforces single-operation grouping. ExternalFileClusteringTestExecutionStrategy implements transformFile via Hadoop FileSystem copy for testing. |

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Strategy as SparkExternalFileClusteringExecutionStrategy
    participant Handle as ExternalFileClusteringWriteHandle
    participant Converter as FileMetadataWriteStatusConverter
    participant Storage as HoodieStorage/ParquetUtils
    
    User->>Strategy: performClusteringForGroup()
    Strategy->>Handle: create ExternalFileClusteringWriteHandle
    Strategy->>Strategy: transformFile(oldPath, newPath)
    Note over Strategy: Abstract method<br/>implemented by subclass
    alt Transformation Success
        Strategy->>Handle: close()
        Handle->>Handle: build executionConfigs<br/>(prevCommit, timeElapsed)
        Handle->>Converter: convert(parquetFile, partition, configs)
        Converter->>Storage: ParquetUtils.getRowCount()
        Converter->>Storage: storage.getPathInfo().getLength()
        Converter->>Converter: generateHoodieWriteStat()
        Converter-->>Handle: WriteStatus
        Handle-->>Strategy: List<WriteStatus>
        Strategy-->>User: List<WriteStatus>
    else Transformation Failure
        Strategy->>Storage: delete output file
        Strategy-->>User: throw HoodieClusteringException
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A hop, skip, and clustering delight,
Files transform through the metadata night!
Write status converters, handlers so neat,
External clustering made bittersweet! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 36.84% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main feature addition: a new Parquet tools clustering strategy implementation for Spark with multiple supporting classes and test utilities. |


@greptile-apps

greptile-apps Bot commented Apr 14, 2026

Greptile Summary

This PR introduces a new external-file clustering execution strategy for Apache Spark, enabling Hudi to delegate the actual file rewrite to an arbitrary external transformation (e.g., a Parquet-tools command) rather than reading and re-writing records through the standard Hudi write path. The key additions are:

  • FileMetadataWriteStatusConverter – builds a WriteStatus / HoodieWriteStat by reading file metadata (row count, file size) directly from the output Parquet file rather than from individual record writes.
  • ExternalFileClusteringWriteHandle – a HoodieWriteHandle subclass that creates the output path and marker file upfront, then delegates actual writing to the caller before finalising the write status in close().
  • SparkExternalFileClusteringExecutionStrategy – abstract strategy that enforces one-operation-per-group, invokes the user-supplied transformFile(), and handles cleanup on failure.
  • Test helpers ClusteringIdentityTestExecutionStrategy and ExternalFileClusteringTestExecutionStrategy to exercise the new skeleton.
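The division of labor described in the list above can be sketched as a small template-method class. This is a minimal illustration only, not the PR's code: `ExternalFileStrategySketch`, `CopyStrategySketch`, and `clusterGroup` are hypothetical names, plain `java.nio.file` stands in for Hudi storage, and the output file size stands in for the write statistics.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Hypothetical stand-in for the abstract strategy; not a Hudi class. */
abstract class ExternalFileStrategySketch {

  /** Subclasses plug in the external rewrite, for example by invoking a tool. */
  protected abstract void transformFile(Path oldFile, Path newFile) throws IOException;

  /** Rewrites exactly one input file and reports the output size in bytes. */
  public final long clusterGroup(List<Path> inputFiles, Path newFile) {
    if (inputFiles.size() != 1) {
      throw new IllegalArgumentException(
          "Expected exactly one file per group, got " + inputFiles.size());
    }
    try {
      transformFile(inputFiles.get(0), newFile);
      // Statistics are read from the finished file, not from per-record writes.
      return Files.size(newFile);
    } catch (IOException e) {
      throw new UncheckedIOException("Failed to transform " + inputFiles.get(0), e);
    }
  }
}

/** Identity transform that copies the file unchanged, similar in spirit to the test strategy. */
final class CopyStrategySketch extends ExternalFileStrategySketch {
  @Override
  protected void transformFile(Path oldFile, Path newFile) throws IOException {
    Files.copy(oldFile, newFile, StandardCopyOption.REPLACE_EXISTING);
  }
}
```

Keeping `clusterGroup` final and `transformFile` abstract mirrors the split the summary describes: the base class owns validation and bookkeeping, and a subclass owns only the file rewrite.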

Issues found:

  • ExternalFileClusteringWriteHandle.close() wraps IOException in HoodieInsertException — the wrong exception type for a clustering context; should use HoodieClusteringException or HoodieIOException.
  • FileMetadataWriteStatusConverter.generateHoodieWriteStat() uses String.valueOf(executionConfigs.get(PREV_COMMIT)), which silently produces the literal string "null" if the value is null, corrupting the prevCommit field on the write stat.
  • ExternalFileClusteringWriteHandle.close() never calls markClosed(), leaving the parent-class closed flag permanently false.
  • generateWriteStatus() instantiates FileMetadataWriteStatusConverter as a raw type — generic parameters should be forwarded.
  • TestFileMetadataWriteStatusConverter hard-codes the map keys "totalCreateTime" and "prevCommit" rather than referencing the exported constants, making the test fragile to future renames.

Confidence Score: 3/5

  • The PR introduces a useful and well-structured feature, but two logic-level issues in production code (wrong exception type in close() and String.valueOf(null) → "null" string corruption) should be addressed before merging.
  • The overall architecture is sound and the test coverage is reasonable. However, the HoodieInsertException wrapping in a clustering context could cause exception-type-sensitive error handling to misbehave, and the String.valueOf(null) issue in generateHoodieWriteStat could silently corrupt write statistics. These are targeted, concrete fixes rather than deep rearchitecting, so the PR is close to ready.
  • ExternalFileClusteringWriteHandle.java and FileMetadataWriteStatusConverter.java need the two logic fixes before merge.

Important Files Changed

| Filename | Overview |
|---|---|
| hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/FileMetadataWriteStatusConverter.java | New utility that builds a WriteStatus from a finished Parquet file's metadata. Has a logic issue: String.valueOf(executionConfigs.get(PREV_COMMIT)) silently coerces null to the literal "null" string, corrupting the write stat's prevCommit field. |
| hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/ExternalFileClusteringWriteHandle.java | New write handle for file-level external clustering. Two issues: close() wraps IOException in the semantically wrong HoodieInsertException (should use HoodieClusteringException or HoodieIOException), and markClosed() is never invoked, leaving the parent's closed flag perpetually false. |
| hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkExternalFileClusteringExecutionStrategy.java | Abstract Spark strategy that dispatches one file at a time to transformFile(). Cleanup on failure is reasonable; the one-operation-per-group invariant is properly enforced. Uses a raw ExternalFileClusteringWriteHandle type. |
| hudi-client/hudi-client-common/src/test/java/org/apache/hudi/execution/TestFileMetadataWriteStatusConverter.java | Unit test for the converter. Core assertions are solid, but uses hardcoded string keys "totalCreateTime" and "prevCommit" instead of the public constants, reducing test robustness against future renames. |
| hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/ClusteringIdentityTestExecutionStrategy.java | Test-only identity strategy that re-routes records through SparkLazyInsertIterable with SingleFileHandleCreateFactory. Only the first batch is consumed, which is correct given single-file semantics. Unused LOG field is a minor nit. |
| hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/ExternalFileClusteringTestExecutionStrategy.java | Test implementation of the abstract strategy that performs an in-process Hadoop FileUtil.copy. Clean and straightforward; no issues. |

Sequence Diagram

sequenceDiagram
    participant S as SparkExternalFileClustering<br/>ExecutionStrategy
    participant H as ExternalFileClusteringWriteHandle
    participant T as transformFile()<br/>(subclass impl)
    participant C as FileMetadataWrite<br/>StatusConverter
    participant FS as HoodieStorage

    S->>H: new(config, instantTime, table, partitionPath, fileId, oldFilePath)
    H->>FS: makeNewPath(partitionPath) → newFilePath
    H->>FS: createMarkerFile(partitionPath, newFilePath)

    S->>T: transformFile(oldFilePath, newFilePath)
    alt transformation fails
        T-->>S: Exception
        S->>FS: deleteFile(newFilePath)
        S-->>S: throw HoodieClusteringException
    end

    S->>H: close()
    H->>FS: exists(newFilePath)
    H->>C: "convert(newFilePath, partitionPath, {PREV_COMMIT, TIME_TAKEN})"
    C->>FS: getRowCount(parquetFilePath)
    C->>FS: getPathInfo(parquetFilePath).getLength()
    C-->>H: WriteStatus (with HoodieWriteStat)
    H-->>S: List[WriteStatus]

Comments Outside Diff (1)

  1. hudi-client/hudi-client-common/src/test/java/org/apache/hudi/execution/TestFileMetadataWriteStatusConverter.java, lines 338-339

    P2 Test uses hardcoded string keys instead of the exported constants

    The map keys "totalCreateTime" and "prevCommit" are hardcoded here, but FileMetadataWriteStatusConverter exports TIME_TAKEN and PREV_COMMIT as public static final constants precisely so callers can use them. If the constant values are ever refactored, this test would silently break (keys wouldn't match, the converter would receive null/missing values).
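    A minimal sketch of the constant-based approach follows. `ConverterKeysSketch` is a hypothetical stand-in for the converter, and the pairing of each constant with its string value is an assumption inferred from the keys quoted in this comment.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the converter's exported keys; values are assumed. */
final class ConverterKeysSketch {
  public static final String PREV_COMMIT = "prevCommit";
  public static final String TIME_TAKEN = "totalCreateTime";

  private ConverterKeysSketch() {
  }

  /** Builds the config map through the constants so a key rename cannot drift from callers. */
  static Map<String, Object> executionConfigs(String prevCommit, long timeTakenMs) {
    Map<String, Object> configs = new HashMap<>();
    configs.put(PREV_COMMIT, prevCommit);
    configs.put(TIME_TAKEN, timeTakenMs);
    return configs;
  }
}
```

    A test that builds and reads the map through the same constants fails to compile on a rename rather than passing a map the converter no longer understands.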

Reviews (1): Last reviewed commit: "Addressing feedback"


    return Collections.singletonList(writeStatus);
  } catch (IOException e) {
    throw new HoodieInsertException("Failed to close the ExternalFileClusteringWriteHandle for path " + path, e);

P1 Wrong exception type for clustering context

The catch block wraps the IOException in HoodieInsertException, but this is a clustering operation — not an insert. If downstream code catches HoodieClusteringException specifically (e.g., for retry or rollback logic), this mismatch will silently bypass that handling.

Suggested change
-    throw new HoodieInsertException("Failed to close the ExternalFileClusteringWriteHandle for path " + path, e);
+    throw new HoodieIOException("Failed to close the ExternalFileClusteringWriteHandle for path " + path, e);

Alternatively, HoodieClusteringException would be the most semantically correct choice here.
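The effect of the mismatch can be illustrated with stand-in exception classes. The hierarchy below is hypothetical and only mimics the shape of the problem; the real Hudi exception classes and any downstream handlers are not modeled here.

```java
/** Hypothetical base exception; a stand-in, not the Hudi class. */
class BaseSketchException extends RuntimeException {
  BaseSketchException(String message) {
    super(message);
  }
}

/** Stand-in for an insert-specific failure. */
class InsertSketchException extends BaseSketchException {
  InsertSketchException(String message) {
    super(message);
  }
}

/** Stand-in for a clustering-specific failure. */
class ClusteringSketchException extends BaseSketchException {
  ClusteringSketchException(String message) {
    super(message);
  }
}

final class ClusteringRetrySketch {
  private ClusteringRetrySketch() {
  }

  /** Runs the action and reports which handler, if any, dealt with a failure. */
  static String runWithClusteringHandling(Runnable action) {
    try {
      action.run();
      return "ok";
    } catch (ClusteringSketchException e) {
      return "clustering-handler"; // type-specific recovery path
    } catch (BaseSketchException e) {
      return "generic-handler"; // an insert-typed failure lands here instead
    }
  }
}
```

A failure raised during clustering but typed as an insert error skips the clustering-specific branch, which is the bypass the comment warns about.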

Comment on lines +73 to +92
public List<WriteStatus> close() {
  try {
    if (!hoodieTable.getStorage().exists(path)) {
      throw new HoodieIOException("Output file does not exist, transformation may not have been invoked: " + path);
    }

    Map<String, Object> executionConfigs = new HashMap<>();
    executionConfigs.put(PREV_COMMIT, prevCommit);
    executionConfigs.put(TIME_TAKEN, timer.endTimer());

    this.writeStatus = generateWriteStatus(path.toString(), partitionPath, executionConfigs);
    LOG.info(String.format("ExternalFileClusteringWriteHandle for partitionPath %s fileID %s, took %d ms.",
        writeStatus.getStat().getPartitionPath(), writeStatus.getStat().getFileId(),
        writeStatus.getStat().getRuntimeStats().getTotalCreateTime()));

    return Collections.singletonList(writeStatus);
  } catch (IOException e) {
    throw new HoodieInsertException("Failed to close the ExternalFileClusteringWriteHandle for path " + path, e);
  }
}

P2 markClosed() never called

All other HoodieWriteHandle implementations (BaseCreateHandle, HoodieWriteMergeHandle, etc.) call markClosed() at the end of close() to maintain the closed state flag from the parent class. Omitting it here means isClosed() will always return false for this handle, which could allow re-entrancy or confuse any future code that guards on this state.

Suggested change
 public List<WriteStatus> close() {
   try {
     if (!hoodieTable.getStorage().exists(path)) {
       throw new HoodieIOException("Output file does not exist, transformation may not have been invoked: " + path);
     }
     Map<String, Object> executionConfigs = new HashMap<>();
     executionConfigs.put(PREV_COMMIT, prevCommit);
     executionConfigs.put(TIME_TAKEN, timer.endTimer());
     this.writeStatus = generateWriteStatus(path.toString(), partitionPath, executionConfigs);
     LOG.info(String.format("ExternalFileClusteringWriteHandle for partitionPath %s fileID %s, took %d ms.",
         writeStatus.getStat().getPartitionPath(), writeStatus.getStat().getFileId(),
         writeStatus.getStat().getRuntimeStats().getTotalCreateTime()));
     return Collections.singletonList(writeStatus);
   } catch (IOException e) {
     throw new HoodieInsertException("Failed to close the ExternalFileClusteringWriteHandle for path " + path, e);
-  }
+  } finally {
+    markClosed();
+  }
 }
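
The behavior of the `finally` variant can be sketched with a hypothetical stand-in handle. `ClosableHandleSketch` is not a Hudi class; its boolean field plays the role of the parent's closed flag and the supplier plays the role of the status generation that may fail.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a write handle; only the closed-flag handling is modeled. */
final class ClosableHandleSketch {
  private final Supplier<String> finalizer;
  private boolean closed = false;

  ClosableHandleSketch(Supplier<String> finalizer) {
    this.finalizer = finalizer;
  }

  boolean isClosed() {
    return closed;
  }

  /** Runs the finalizer and marks the handle closed whether or not it throws. */
  String close() {
    try {
      return finalizer.get();
    } finally {
      closed = true; // plays the role of markClosed()
    }
  }
}
```

One consequence worth weighing: with `finally`, the handle reports closed even when `close()` fails, which suits callers that never retry but may not suit callers that expect to call `close()` again after an error.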

 */
protected WriteStatus generateWriteStatus(
    String outputFile, String partitionPath, Map<String, Object> executionConfigs) throws IOException {
  return new FileMetadataWriteStatusConverter(hoodieTable, config).convert(outputFile, partitionPath, executionConfigs);

P2 Raw type instantiation

FileMetadataWriteStatusConverter is instantiated without its generic type parameters. This suppresses compile-time type-safety and generates an unchecked-cast warning. The generic parameters should be propagated from the enclosing class:

Suggested change
-  return new FileMetadataWriteStatusConverter(hoodieTable, config).convert(outputFile, partitionPath, executionConfigs);
+  return new FileMetadataWriteStatusConverter<>(hoodieTable, config).convert(outputFile, partitionPath, executionConfigs);
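
As a general illustration of what a raw type gives up, the sketch below uses a hypothetical `HolderSketch<T>` class. It is not the converter, and nothing here implies the converter would actually fail this way; it only shows that a raw instantiation moves a type error from compile time to run time.

```java
/** Hypothetical generic holder used only to illustrate raw-type behavior. */
final class HolderSketch<T> {
  private final T value;

  HolderSketch(T value) {
    this.value = value;
  }

  T get() {
    return value;
  }

  /** Returns true when the raw-type misuse surfaces as a ClassCastException at run time. */
  @SuppressWarnings({"rawtypes", "unchecked"})
  static boolean rawTypeFailsAtRuntime() {
    HolderSketch raw = new HolderSketch(42);  // raw type: the compiler stops checking T
    HolderSketch<String> typed = raw;         // unchecked conversion, accepted with a warning
    try {
      String s = typed.get();                 // the hidden cast to String fails here
      return false;
    } catch (ClassCastException e) {
      return true;
    }
  }

  /** With the diamond operator, passing 42 here would be rejected by the compiler. */
  static String typedRoundTrip() {
    HolderSketch<String> typed = new HolderSketch<>("ok");
    return typed.get();
  }
}
```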

stat.setFileId(writeStatus.getFileId());
stat.setPartitionPath(writeStatus.getPartitionPath());
stat.setPath(new StoragePath(writeConfig.getBasePath()), parquetFilePath);
stat.setPrevCommit(String.valueOf(executionConfigs.get(PREV_COMMIT)));

P1 String.valueOf(Object) silently turns a null value into the literal string "null"

If executionConfigs.get(PREV_COMMIT) is null for any reason (e.g., a caller that doesn't set the key, or future refactoring), String.valueOf(Object) returns the four-character string "null" rather than null or an empty string. This would corrupt the prevCommit field in the written HoodieWriteStat in a way that is very hard to diagnose.

Prefer an explicit null check:

Suggested change
-  stat.setPrevCommit(String.valueOf(executionConfigs.get(PREV_COMMIT)));
+  String prevCommit = (String) executionConfigs.get(PREV_COMMIT);
+  stat.setPrevCommit(prevCommit);
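
The difference can be checked in isolation with plain JDK calls. `PrevCommitLookupSketch` is a hypothetical helper and the key value `"prevCommit"` is assumed from the test discussion. The null-safe variant uses `Objects.toString(Object, String)`, a swapped-in alternative to the cast in the suggestion that also tolerates non-String values.

```java
import java.util.Map;
import java.util.Objects;

/** Hypothetical helper contrasting the lossy and null-safe lookups. */
final class PrevCommitLookupSketch {
  static final String PREV_COMMIT = "prevCommit"; // assumed key value

  private PrevCommitLookupSketch() {
  }

  /** Mirrors the flagged pattern: a missing key becomes the literal string "null". */
  static String lossy(Map<String, Object> configs) {
    return String.valueOf(configs.get(PREV_COMMIT));
  }

  /** Keeps a missing key as a real null so callers can detect it. */
  static String nullSafe(Map<String, Object> configs) {
    return Objects.toString(configs.get(PREV_COMMIT), null);
  }
}
```

With an empty map, `lossy` yields a four-character string while `nullSafe` yields a null reference, so only the second lets downstream code distinguish "no previous commit" from a commit named "null".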


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/FileMetadataWriteStatusConverter.java`:
- Around line 63-64: FileMetadataWriteStatusConverter currently calls
ReflectionUtils.loadClass with two arguments (index implicit flag and failure
fraction) when instantiating WriteStatus; update the call to pass the third
isMetadataTable boolean for consistency with other call sites (e.g.,
HoodieWriteHandle, HoodieAppendHandle) by including the isMetadataTable flag
from this context so
ReflectionUtils.loadClass(this.writeConfig.getWriteStatusClassName(),
!this.hoodieTable.getIndex().isImplicitWithStorage(),
this.writeConfig.getWriteStatusFailureFraction()) becomes the three-argument
form that includes isMetadataTable; ensure you reference
FileMetadataWriteStatusConverter and the writeConfig/getWriteStatusClassName,
hoodieTable.getIndex().isImplicitWithStorage(), and
writeConfig.getWriteStatusFailureFraction() symbols when making the change.

In
`@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkExternalFileClusteringExecutionStrategy.java`:
- Around line 57-63: The external clustering branch drops the
preserveHoodieMetadata flag causing callers requesting fresh Hudi metadata to be
ignored; update performClusteringForGroup and the external path so
preserveHoodieMetadata is either validated and rejected when false
(throwing/returning an explicit error) or passed through into transformFile(...)
and into SingleFileHandleCreateFactory so the external transform can honor it;
locate the external strategy code paths around performClusteringForGroup,
transformFile, and SingleFileHandleCreateFactory and either add plumbing to pass
the boolean through to transformFile(...) or add an explicit check that throws a
clear exception when preserveHoodieMetadata==false.
- Around line 78-90: The new output file isn't cleaned up if
writeHandler.close() throws; wrap the close() call so any exception from
writeHandler.close() also triggers deletion of the partial output and rethrows a
HoodieClusteringException. Specifically, keep the existing transformFile(...)
catch logic but extend it (or use a try/finally around transformFile +
writeHandler.close()) so that failures from writeHandler.close() result in
getHoodieTable().getStorage().deleteFile(writeHandler.getPath()) being called
(log/delete exceptions similarly to the existing catch) and then throw a
HoodieClusteringException that includes the original exception; ensure you
reference transformFile, writeHandler.close(),
getHoodieTable().getStorage().deleteFile(...) and HoodieClusteringException in
your change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: af1eb28b-fcf5-4d3b-9cf5-9c504952a2e6

📥 Commits

Reviewing files that changed from the base of the PR and between 35e2bbf and e0c6343.

📒 Files selected for processing (6)
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/execution/FileMetadataWriteStatusConverter.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/ExternalFileClusteringWriteHandle.java
  • hudi-client/hudi-client-common/src/test/java/org/apache/hudi/execution/TestFileMetadataWriteStatusConverter.java
  • hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/clustering/run/strategy/SparkExternalFileClusteringExecutionStrategy.java
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/ClusteringIdentityTestExecutionStrategy.java
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/client/ExternalFileClusteringTestExecutionStrategy.java

Comment on lines +63 to +64
WriteStatus writeStatus = (WriteStatus) ReflectionUtils.loadClass(this.writeConfig.getWriteStatusClassName(),
!this.hoodieTable.getIndex().isImplicitWithStorage(), this.writeConfig.getWriteStatusFailureFraction());

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
sed -n '1,180p' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/WriteStatus.java
rg -nC2 --type=java 'public\s+[A-Za-z0-9_]*WriteStatus\s*\(' hudi-client hudi-common
rg -nC2 --type=java 'getWriteStatusClassName\s*\(' hudi-client

Repository: yihua/hudi

Length of output: 19298


Pass the isMetadataTable flag for consistency.

Lines 63-64 instantiate WriteStatus with only 2 arguments, while other call sites in HoodieWriteHandle and HoodieAppendHandle pass 3 arguments including the isMetadataTable flag. All supported WriteStatus implementations (WriteStatus, BootstrapWriteStatus, FailOnFirstErrorWriteStatus, MetadataMergeWriteStatus) expose both constructors, so runtime failure won't occur, but the semantic mismatch should be addressed for consistency.
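
Why both argument counts succeed at run time can be sketched with plain reflection, assuming the loader resolves a constructor from the runtime types of the arguments it is given. `ReflectiveLoaderSketch` and `StatusSketch` are hypothetical stand-ins, not Hudi's `ReflectionUtils` or `WriteStatus`.

```java
import java.util.Arrays;

/** Hypothetical loader that picks a constructor from the runtime argument types. */
final class ReflectiveLoaderSketch {

  private ReflectiveLoaderSketch() {
  }

  /** Stand-in status class with both a two-argument and a three-argument constructor. */
  public static final class StatusSketch {
    final boolean trackSuccess;
    final double failureFraction;
    final boolean metadataTable;

    public StatusSketch(Boolean trackSuccess, Double failureFraction) {
      this(trackSuccess, failureFraction, Boolean.FALSE);
    }

    public StatusSketch(Boolean trackSuccess, Double failureFraction, Boolean metadataTable) {
      this.trackSuccess = trackSuccess;
      this.failureFraction = failureFraction;
      this.metadataTable = metadataTable;
    }
  }

  /** Instantiates clazz through the public constructor matching the boxed argument types. */
  static <T> T newInstance(Class<T> clazz, Object... args) {
    Class<?>[] argTypes = Arrays.stream(args).map(Object::getClass).toArray(Class<?>[]::new);
    try {
      return clazz.getConstructor(argTypes).newInstance(args);
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("No matching constructor on " + clazz.getName(), e);
    }
  }
}
```

Under this assumption the two-argument call is legal but silently takes whatever default the shorter constructor chooses for the third value, which is the consistency gap the comment points at.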


Comment on lines +57 to +63
protected List<WriteStatus> performClusteringForGroup(ReaderContextFactory<T> readerContextFactory,
                                                      ClusteringGroupInfo clusteringOps,
                                                      Map<String, String> strategyParams,
                                                      boolean preserveHoodieMetadata,
                                                      HoodieSchema schema,
                                                      TaskContextSupplier taskContextSupplier,
                                                      String instantTime) {

⚠️ Potential issue | 🟠 Major

Propagate or explicitly reject preserveHoodieMetadata=false.

The identity clustering path still threads this flag into SingleFileHandleCreateFactory, but the new external path drops it entirely. Right now a caller can request fresh Hudi metadata and silently get whatever the external transform happened to emit instead.

♻️ Suggested fix
   protected List<WriteStatus> performClusteringForGroup(ReaderContextFactory<T> readerContextFactory,
                                                         ClusteringGroupInfo clusteringOps,
                                                         Map<String, String> strategyParams,
                                                         boolean preserveHoodieMetadata,
                                                         HoodieSchema schema,
                                                         TaskContextSupplier taskContextSupplier,
                                                         String instantTime) {
+    if (!preserveHoodieMetadata) {
+      throw new HoodieClusteringException(
+          "External file clustering currently requires preserveHoodieMetadata=true: " + getClass().getName());
+    }
+
     LOG.info("Starting clustering operation on input file ids.");

If a concrete external strategy can truly rewrite Hudi metadata, plumb this flag into transformFile(...) instead of ignoring it.

Also applies to: 78-80
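
The plumbing alternative mentioned above could look roughly like the sketch below. Everything in it is hypothetical: `FlagAwareStrategySketch`, the `supportsMetadataRewrite()` hook, and the three-argument `transformFile` are illustrative names, not part of the PR, and string paths stand in for storage paths.

```java
/** Hypothetical stand-in for the abstract strategy; not a Hudi class. */
abstract class FlagAwareStrategySketch {

  /** Override to return true when the transform can regenerate Hudi metadata itself. */
  protected boolean supportsMetadataRewrite() {
    return false;
  }

  /** The flag is passed through so an implementation can honor it. */
  protected abstract String transformFile(String oldPath, String newPath, boolean preserveHoodieMetadata);

  /** Rejects the request up front when the flag cannot be honored. */
  public final String cluster(String oldPath, String newPath, boolean preserveHoodieMetadata) {
    if (!preserveHoodieMetadata && !supportsMetadataRewrite()) {
      throw new UnsupportedOperationException(
          "preserveHoodieMetadata=false is not supported by " + getClass().getName());
    }
    return transformFile(oldPath, newPath, preserveHoodieMetadata);
  }
}

/** Copy-style strategy that cannot rewrite metadata, so it keeps the default hook. */
final class CopyOnlyStrategySketch extends FlagAwareStrategySketch {
  @Override
  protected String transformFile(String oldPath, String newPath, boolean preserveHoodieMetadata) {
    return oldPath + " -> " + newPath;
  }
}
```

This keeps the fail-fast behavior as the default while leaving room for a concrete strategy to opt in to metadata rewriting.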


Comment on lines +78 to +90
try {
  // Executes the file transformation.
  transformFile(oldFilePath, writeHandler.getPath());
} catch (Exception e) {
  // Clean up partial output file if transformation fails.
  try {
    getHoodieTable().getStorage().deleteFile(writeHandler.getPath());
  } catch (Exception deleteEx) {
    LOG.warn("Failed to clean up partial output file: " + writeHandler.getPath(), deleteEx);
  }
  throw new HoodieClusteringException("Failed to transform file: " + dataFilePathStr, e);
}
return writeHandler.close();

⚠️ Potential issue | 🟠 Major

Clean up the output file when writeHandler.close() fails too.

transformFile() is not the only failing step here. close() immediately re-opens the generated file to derive row counts and write stats, so a bad footer/path currently skips this cleanup block and leaves the new file behind.

♻️ Suggested fix
-    try {
-      // Executes the file transformation.
-      transformFile(oldFilePath, writeHandler.getPath());
-    } catch (Exception e) {
+    try {
+      // Executes the file transformation.
+      transformFile(oldFilePath, writeHandler.getPath());
+      return writeHandler.close();
+    } catch (Exception e) {
       // Clean up partial output file if transformation fails.
       try {
         getHoodieTable().getStorage().deleteFile(writeHandler.getPath());
       } catch (Exception deleteEx) {
         LOG.warn("Failed to clean up partial output file: " + writeHandler.getPath(), deleteEx);
       }
-      throw new HoodieClusteringException("Failed to transform file: " + dataFilePathStr, e);
+      throw new HoodieClusteringException("Failed to externalize clustered file: " + dataFilePathStr, e);
     }
-    return writeHandler.close();
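
The widened cleanup scope can be exercised on its own with `java.nio.file`. In this sketch `CleanupOnFailureSketch` and `IoStep` are hypothetical names, the two steps stand in for `transformFile()` and `close()`, and `IllegalStateException` stands in for the clustering exception.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of one cleanup block covering both the transform and the close step. */
final class CleanupOnFailureSketch {

  /** A step that may fail with an I/O error, such as the transform or the handle's close(). */
  interface IoStep {
    void run(Path output) throws IOException;
  }

  private CleanupOnFailureSketch() {
  }

  /** Runs both steps; on any failure deletes the output file and rethrows a wrapped error. */
  static void transformThenClose(Path output, IoStep transform, IoStep close) {
    try {
      transform.run(output);
      close.run(output);
    } catch (Exception e) {
      try {
        Files.deleteIfExists(output);
      } catch (IOException deleteEx) {
        e.addSuppressed(deleteEx); // keep the cleanup failure visible without masking the cause
      }
      throw new IllegalStateException("Failed to produce clustered file: " + output, e);
    }
  }
}
```

Because both steps sit inside the same `try`, a failure while reading back the freshly written file removes it just as a failed transform would.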
