Conversation
|
Important Review skippedToo many files! This PR contains 206 files, which is 56 over the limit of 150. ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (206)
You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis pull request introduces an empty clean feature for Hudi, allowing scheduled empty clean commits when no partitions require deletion. A new configuration property Changes
Sequence DiagramsequenceDiagram
participant Client as Clean Planner
participant Executor as CleanPlanActionExecutor
participant Timeline as Active Timeline
participant Metadata as Metadata Generator
participant Config as HoodieWriteConfig
Client->>Executor: requestClean()
Executor->>Config: maxDurationToCreateEmptyCleanMs()
Config-->>Executor: duration threshold
alt Plan has work
Executor->>Executor: createNormalCleanerPlan()
Executor-->>Client: return plan
else Plan is empty & incremental mode
Executor->>Timeline: getLastCleanInstantTime()
Timeline-->>Executor: last clean timestamp
Executor->>Executor: compare elapsed time vs threshold
alt Time threshold exceeded & retention valid
Executor->>Executor: getEmptyCleanerPlan()
Executor->>Timeline: reload active timeline
Executor->>Metadata: createEmptyCleanMetadata()
Metadata-->>Executor: empty clean metadata
Executor-->>Client: return empty plan & metadata
else Threshold not met or retention invalid
Executor-->>Client: return empty (skip clean)
end
end
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Greptile SummaryThis PR introduces an empty clean commit feature to Hudi's cleaning subsystem. For datasets using incremental clean that receive few or no updates, the cleaner planner previously never generated a plan (nothing to clean), meaning the
Key findings:
Confidence Score: 3/5Not safe to merge as-is due to a latent NPE in The feature design and the threshold/regression-guard logic are sound. However, the NPE on
Important Files Changed
Sequence DiagramsequenceDiagram
participant CE as CleanActionExecutor
participant CPAE as CleanPlanActionExecutor
participant CP as CleanPlanner
participant AT as ActiveTimeline
CE->>CPAE: execute()
CPAE->>CP: getEarliestCommitToRetain()
CP-->>CPAE: earliestInstant (may be empty)
CPAE->>CP: getPartitionPathsToClean(earliestInstant)
CP-->>CPAE: partitionsToClean
alt partitionsToClean is empty
CPAE->>CPAE: getEmptyCleanerPlan(earliestInstant)
CPAE-->>CE: cleanerPlan (empty files, set/null earliestInstantToRetain)
else partitions exist
CPAE->>CP: getDeletePaths per partition
CPAE-->>CE: cleanerPlan (with file paths)
end
CE->>CE: Check empty plan eligibility
note over CE: cleanPlanOpt.isEmpty() &&<br/>incrementalCleanerMode &&<br/>earliestInstantToRetain != null &&<br/>maxDuration > 0 &&<br/>time since last clean > threshold
alt eligible for empty clean
CE->>AT: saveToCleanRequested (empty plan)
CE->>CE: runClean(table, cleanInstant, cleanerPlan)
CE->>CE: clean(context, cleanerPlan) → cleanStats (empty list)
CE->>CE: createEmptyCleanMetadata(cleanerPlan, ...)
note over CE: NPE if getEarliestInstantToRetain() == null
CE->>AT: transitionCleanInflightToComplete
else not eligible
CE-->>CE: return cleanPlanOpt (empty)
end
|
| .setTimeTakenInMillis(timeTakenMillis) | ||
| .setTotalFilesDeleted(0) | ||
| .setLastCompletedCommitTimestamp(cleanerPlan.getLastCompletedCommitTimestamp()) | ||
| .setEarliestCommitToRetain(cleanerPlan.getEarliestInstantToRetain().getTimestamp()) |
There was a problem hiding this comment.
Potential NullPointerException on
getEarliestInstantToRetain()
createEmptyCleanMetadata is reached whenever cleanStats.isEmpty(). cleanStats is built by streaming over cleanerPlan.getFilePathsToBeDeletedPerPartition().keySet(), so it is empty when the partition-file map is empty — which is exactly the case for any plan that has only partitionsToBeDeleted entries with no file paths.
In the regular (non-empty-clean-commit) code path inside requestClean(context):
return new HoodieCleanerPlan(
earliestInstant.map(...).orElse(null), // ← null when no earliest instant
...
);If earliestInstant is absent (e.g. KEEP_LATEST_FILE_VERSIONS policy with no commits beyond the threshold) the plan's EarliestInstantToRetain field is null. If partitionsToBeDeleted is non-empty for that plan and there are no file paths, cleanStats will be empty, createEmptyCleanMetadata will be called, and line 250 will NPE:
.setEarliestCommitToRetain(cleanerPlan.getEarliestInstantToRetain().getTimestamp())
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ can be null| .setEarliestCommitToRetain(cleanerPlan.getEarliestInstantToRetain().getTimestamp()) | |
| .setEarliestCommitToRetain(cleanerPlan.getEarliestInstantToRetain() != null | |
| ? cleanerPlan.getEarliestInstantToRetain().getTimestamp() : null) |
| } else { | ||
| cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()); | ||
| } | ||
| return cleanBuilder.build(); |
There was a problem hiding this comment.
setVersion is missing in the else branch of getEmptyCleanerPlan
When earliestInstant is absent, the builder never calls .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION), so the serialised plan will have the Avro-schema default (typically 1 or 0 depending on the schema evolution). When this plan is later read back by CleanerUtils.getCleanerPlan, a version mismatch can cause silent compatibility issues or an UnsupportedOperationException in version-dispatch logic.
setVersion should be applied unconditionally:
| } else { | |
| cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()); | |
| } | |
| return cleanBuilder.build(); | |
| } else { | |
| cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()) | |
| .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION); | |
| } |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java (1)
244-255: Prefer the latest clean-metadata version constant here.Line 251 hardcodes
CLEAN_METADATA_VERSION_2for the empty-clean path. The normal path goes throughCleanerUtils.convertCleanMetadata(...), so this branch will drift the next time clean metadata is version-bumped.♻️ Minimal alignment
-import static org.apache.hudi.common.util.CleanerUtils.CLEAN_METADATA_VERSION_2; +import static org.apache.hudi.common.util.CleanerUtils.LATEST_CLEAN_METADATA_VERSION; ... - .setVersion(CLEAN_METADATA_VERSION_2) + .setVersion(LATEST_CLEAN_METADATA_VERSION)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java` around lines 244 - 255, The empty-clean branch in createEmptyCleanMetadata hardcodes CLEAN_METADATA_VERSION_2 which will drift when versions bump; replace that hardcoded constant by using the same version source as the normal path — e.g., call CleanerUtils.convertCleanMetadata(...) to produce canonical metadata (or retrieve the current clean-metadata version via CleanerUtils’ public accessor) and use that value instead of CLEAN_METADATA_VERSION_2 so both paths stay aligned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java`:
- Around line 247-257: Add explicit validation for
MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS by enforcing the value is either -1 or >=
0; in the builder setter method maxDurationToCreateEmptyCleanMs(...) validate
the incoming long and throw an IllegalArgumentException for invalid values, and
add the same check in the build() method to catch property-file or deserialized
configurations before constructing the HoodieCleanConfig instance.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java`:
- Around line 218-228: The code in CleanActionExecutor should not treat
cleanStats.isEmpty() alone as the signal for an empty clean because partition
deletions (tracked on the CleanerPlan) are real work; change the conditional so
that createEmptyCleanMetadata(...) is used only when both cleanStats.isEmpty()
AND there are no partition deletes recorded on the cleanerPlan (e.g., check
cleanerPlan.getPartitionsToDelete() or equivalent), otherwise call
CleanerUtils.convertCleanMetadata(...) as before (passing
inflightInstant.requestedTime() and Option.of(timer.endTimer())) so
partition-delete metadata is preserved; update the logic around cleanStats,
cleanerPlan, createEmptyCleanMetadata, and CleanerUtils.convertCleanMetadata
accordingly.
In
`@hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java`:
- Around line 893-894: The test creates a write client just to call savepoint
via getHoodieWriteClient(config).savepoint(...) but never closes that client;
change the code to obtain the HoodieWriteClient instance (e.g.,
HoodieWriteClient client = getHoodieWriteClient(config)), call
client.savepoint(fourthCommitTs, "user", "comment"), and then close the client
(client.close()) or use a try-with-resources block to ensure the write client is
always closed to release background resources.
In
`@hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieCleanerTestBase.java`:
- Around line 103-109: The overload of runCleaner creates a SparkRDDWriteClient
via getHoodieWriteClient(config) and returns without closing it; update this
method (the runCleaner overload that builds cleanInstantTs) to ensure the
allocated writeClient is closed in all cases by calling writeClient.close() (or
try-with-resources/try-finally) after delegating to the other runCleaner(String
cleanInstantTs, SparkRDDWriteClient<?> writeClient, ...) so the client is always
closed even on exceptions; reference the methods/variables: runCleaner(...),
getHoodieWriteClient, writeClient, makeNewCommitTime, and ensure the close
happens after runCleaner returns or in finally.
---
Nitpick comments:
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java`:
- Around line 244-255: The empty-clean branch in createEmptyCleanMetadata
hardcodes CLEAN_METADATA_VERSION_2 which will drift when versions bump; replace
that hardcoded constant by using the same version source as the normal path —
e.g., call CleanerUtils.convertCleanMetadata(...) to produce canonical metadata
(or retrieve the current clean-metadata version via CleanerUtils’ public
accessor) and use that value instead of CLEAN_METADATA_VERSION_2 so both paths
stay aligned.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f2e6c2ce-8bde-4af9-9312-b151090a4f18
📒 Files selected for processing (8)
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.javahudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/TestCleanPlanner.javahudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.javahudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.javahudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieCleanerTestBase.java
| public static final ConfigProperty<Long> MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS = ConfigProperty | ||
| .key("hoodie.write.empty.clean.create.duration.ms") | ||
| .defaultValue(-1L) | ||
| .markAdvanced() | ||
| .withDocumentation("In some cases empty clean commit needs to be created to ensure the clean planner " | ||
| + "does not look through entire dataset if there are no clean plans. This is possible for append-only " | ||
| + "dataset. Also, for these datasets we cannot ignore clean completely since in the future there could " | ||
| + "be upsert or replace operations. By creating empty clean commit, earliest_commit_to_retain value " | ||
| + "will be updated so that now clean planner can only check for partitions that are modified after the " | ||
| + "last empty clean's earliest_commit_toRetain value there by optimizing the clean planning"); | ||
|
|
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Usages of empty-clean duration config =="
rg -n -C3 --type=java 'MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS|maxDurationToCreateEmptyCleanMs|withMaxDurationToCreateEmptyClean'
echo
echo "== Inspect CleanPlanActionExecutor logic =="
fd -i "CleanPlanActionExecutor.java" --exec sed -n '1,260p' {}
echo
echo "== Inspect validations in clean/write config =="
fd -i "HoodieCleanConfig.java" --exec rg -n -C3 'MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS|withMaxDurationToCreateEmptyClean|build\(' {}
fd -i "HoodieWriteConfig.java" --exec rg -n -C3 'maxDurationToCreateEmptyCleanMs|validate\(' {}Repository: yihua/hudi
Length of output: 37609
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check for all maxDurationToCreateEmptyCleanMs accesses ==="
rg -n 'maxDurationToCreateEmptyCleanMs' --type=java
echo
echo "=== Verify no other executors or code paths use this config ==="
rg -n 'MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS' --type=java
echo
echo "=== Check if this config can be loaded from properties directly ==="
rg -n 'fromProperties\|fromFile' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java | head -20Repository: yihua/hudi
Length of output: 1473
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check HoodieWriteConfig initialization and getLong method ==="
rg -n 'public long getLong|private long getLong' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java -A 5 | head -30
echo
echo "=== Check if HoodieCleanConfig has any factory methods ==="
fd -i "HoodieCleanConfig.java" --exec grep -n 'public static\|public HoodieCleanConfig' {} | head -20
echo
echo "=== Verify the build() method in HoodieCleanConfig ==="
fd -i "HoodieCleanConfig.java" --exec sed -n '433,445p' {}Repository: yihua/hudi
Length of output: 119
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Find the getLong implementation in config hierarchy ==="
fd -i "HoodieConfig.java" --exec grep -n 'public long getLong' {}
echo
echo "=== Check HoodieWriteConfig build method and validation ==="
fd -i "HoodieWriteConfig.java" --exec sed -n '3789,3800p' {}
echo
echo "=== Full HoodieCleanConfig build method ==="
fd -i "HoodieCleanConfig.java" --exec sed -n '433,450p' {}
echo
echo "=== Check if value is parsed/validated anywhere in build or getter ==="
rg -n 'parseLong|Long.parseLong' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/ --type=java | grep -i 'empty\|clean' | head -10Repository: yihua/hudi
Length of output: 113
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Search for AbstractHoodieConfig or base config class ==="
fd 'AbstractHoodieConfig.java|HoodieConfig.java' --type f
echo
echo "=== Get full HoodieCleanConfig build method ==="
cat hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java | sed -n '433,445p'
echo
echo "=== Check all references to getLong in config directory ==="
find hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config -name '*.java' -exec grep -l 'getLong' {} \;
echo
echo "=== Search for where config value is actually retrieved (check if Long.parseLong is used) ==="
rg 'Long.parseLong.*MAX_DURATION|getLong.*MAX_DURATION' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/ --type=javaRepository: yihua/hudi
Length of output: 1138
🏁 Script executed:
#!/bin/bash
set -euoripefail
echo "=== Check HoodieConfig.getLong implementation ==="
sed -n '1,500p' hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java | grep -A 10 'public.*getLong'
echo
echo "=== Verify how getLong handles Long.parseLong ==="
rg -n 'Long.parseLong' hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java -A 3
echo
echo "=== Check if validation happens on config build in HoodieWriteConfig ==="
sed -n '3789,3810p' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
echo
echo "=== Verify if properties file can bypass builder validation ==="
rg -n 'MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java -B 5 -A 5 | head -30Repository: yihua/hudi
Length of output: 728
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check HoodieConfig.getLong implementation ==="
sed -n '1,500p' hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java | grep -A 10 'public.*getLong'
echo
echo "=== Verify how getLong handles Long.parseLong ==="
rg -n 'Long.parseLong' hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java -A 3
echo
echo "=== Check if validation happens on config build in HoodieWriteConfig ==="
sed -n '3789,3810p' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
echo
echo "=== Verify if properties file can bypass builder validation ==="
rg -n 'MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java -B 5 -A 5 | head -30Repository: yihua/hudi
Length of output: 3584
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "=== Check HoodieWriteConfig.validate() method to see if it validates MAX_DURATION ==="
rg -n 'private void validate' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java -A 100 | head -150Repository: yihua/hudi
Length of output: 5864
Consider adding validation for MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS to reject invalid ranges.
Line 429 accepts any long value without validation. While downstream code at line 251 safely gates the feature with config.maxDurationToCreateEmptyCleanMs() > 0, adding explicit validation would improve clarity and prevent silent misconfigurations. Enforce duration == -1 || duration >= 0 either in the builder setter (line 428) or the build() method (line 433) to match the documented semantics.
The proposed fix remains valid:
🔧 Suggested fix
public HoodieCleanConfig.Builder withMaxDurationToCreateEmptyClean(long duration) {
+ if (duration < -1) {
+ throw new IllegalArgumentException(
+ "hoodie.write.empty.clean.create.duration.ms must be -1 (disabled) or >= 0");
+ }
cleanConfig.setValue(MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS, String.valueOf(duration));
return this;
}Also applies to: 433-439 (add similar validation to build() method to catch property-file configurations)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java`
around lines 247 - 257, Add explicit validation for
MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS by enforcing the value is either -1 or >=
0; in the builder setter method maxDurationToCreateEmptyCleanMs(...) validate
the incoming long and throw an IllegalArgumentException for invalid values, and
add the same check in the build() method to catch property-file or deserialized
configurations before constructing the HoodieCleanConfig instance.
| table.getMetaClient().reloadActiveTimeline(); | ||
| HoodieCleanMetadata metadata; | ||
| if (cleanStats.isEmpty()) { | ||
| return HoodieCleanMetadata.newBuilder().build(); | ||
| metadata = createEmptyCleanMetadata(cleanerPlan, inflightInstant, timer.endTimer()); | ||
| } else { | ||
| metadata = CleanerUtils.convertCleanMetadata( | ||
| inflightInstant.requestedTime(), | ||
| Option.of(timer.endTimer()), | ||
| cleanStats, | ||
| cleanerPlan.getExtraMetadata() | ||
| ); |
There was a problem hiding this comment.
Don't use cleanStats.isEmpty() as the empty-clean signal.
Line 220 now treats every stat-less execution as a no-op clean, but clean() still deletes partitionsToBeDeleted without emitting a HoodieCleanStat for them. After CleanPlanActionExecutor started treating partition deletes as real work, a partition-only clean will land here, get serialized as an empty clean, and lose its partition-delete metadata.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java`
around lines 218 - 228, The code in CleanActionExecutor should not treat
cleanStats.isEmpty() alone as the signal for an empty clean because partition
deletions (tracked on the CleanerPlan) are real work; change the conditional so
that createEmptyCleanMetadata(...) is used only when both cleanStats.isEmpty()
AND there are no partition deletes recorded on the cleanerPlan (e.g., check
cleanerPlan.getPartitionsToDelete() or equivalent), otherwise call
CleanerUtils.convertCleanMetadata(...) as before (passing
inflightInstant.requestedTime() and Option.of(timer.endTimer())) so
partition-delete metadata is preserved; update the logic around cleanStats,
cleanerPlan, createEmptyCleanMetadata, and CleanerUtils.convertCleanMetadata
accordingly.
Mirror of apache#18337 for automated bot review.
Original author: @nsivabalan
Base branch: master
Summary by CodeRabbit
Release Notes
New Features
hoodie.write.empty.clean.create.duration.msto control when empty clean commits are created based on time intervals.Bug Fixes
Tests