feat(clean): Adding empty clean support to hudi #18337
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -250,6 +250,16 @@ public class HoodieCleanConfig extends HoodieConfig { | |
| .markAdvanced() | ||
| .withDocumentation("Maximum number of commits to clean in one clean commit. Applicable only when the clean policy is based on KEEP_LATEST_COMMITS or KEEP_LATEST_HOURS"); | ||
|
|
||
| public static final ConfigProperty<Long> MAX_INTERVAL_TO_CREATE_EMPTY_CLEAN_HOURS = ConfigProperty | ||
| .key("hoodie.write.empty.clean.interval.hours") | ||
| .defaultValue(-1L) | ||
|
Contributor
Don't reject the property's own default. Line 255 sets
Suggested fix:
- if (maxDurationMs == 0 || maxDurationMs <= -1) {
-   throw new IllegalArgumentException(MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS.key() + " must be >= 1, but was " + maxDurationMs);
+ if (maxDurationMs == 0 || maxDurationMs < -1) {
+   throw new IllegalArgumentException(
+       MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS.key() + " must be -1 (disabled) or >= 1, but was " + maxDurationMs);
  }
Also applies to: 452-455
CodeRabbit (original) (source:comment#3111681444)
||
| .markAdvanced() | ||
| .withDocumentation("In some cases an empty clean commit needs to be created to ensure the clean planner " | ||
|
|
||
| + "does not scan the entire dataset when there are no clean plans. This can happen for an append-only " | ||
| + "dataset. Cleaning cannot be skipped entirely for such datasets, since in the future there could " | ||
| + "be upsert or replace operations. By creating an empty clean commit, the earliest_commit_to_retain value " | ||
| + "is updated so that the clean planner only needs to check partitions that are modified after the " | ||
|
|
||
| + "last empty clean's earliest_commit_to_retain value, thereby optimizing the clean planning"); | ||
|
|
||
|
|
||
| /** @deprecated Use {@link #CLEANER_POLICY} and its methods instead */ | ||
| @Deprecated | ||
|
|
@@ -426,6 +436,11 @@ public HoodieCleanConfig.Builder withMaxCommitsToClean(long maxCommitsToClean) { | |
| return this; | ||
| } | ||
|
|
||
| public HoodieCleanConfig.Builder withMaxIntervalToCreateEmptyCleanHours(long durationHours) { | ||
|
|
||
| cleanConfig.setValue(MAX_INTERVAL_TO_CREATE_EMPTY_CLEAN_HOURS, String.valueOf(durationHours)); | ||
| return this; | ||
| } | ||
|
|
||
| public HoodieCleanConfig build() { | ||
| cleanConfig.setDefaults(HoodieCleanConfig.class.getName()); | ||
| HoodieCleaningPolicy.valueOf(cleanConfig.getString(CLEANER_POLICY)); | ||
|
|
@@ -434,6 +449,10 @@ public HoodieCleanConfig build() { | |
| if (maxCommitsToClean < 1) { | ||
| throw new IllegalArgumentException(MAX_COMMITS_TO_CLEAN.key() + " must be >= 1, but was " + maxCommitsToClean); | ||
| } | ||
| long maxDurationHours = cleanConfig.getLong(MAX_INTERVAL_TO_CREATE_EMPTY_CLEAN_HOURS); | ||
|
|
||
| if (maxDurationHours == 0 || maxDurationHours < -1) { | ||
| throw new IllegalArgumentException(MAX_INTERVAL_TO_CREATE_EMPTY_CLEAN_HOURS.key() + " must be -1 (disabled) or >= 1, but was " + maxDurationHours); | ||
| } | ||
| return cleanConfig; | ||
| } | ||
| } | ||
|
|
||
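The builder check in this file accepts -1 (the default, which leaves the feature disabled) or any value of at least 1, and rejects 0 and anything below -1. The rule can be sketched on its own as follows. The class name `EmptyCleanIntervalCheck` and its methods are hypothetical, chosen only for illustration; they are not part of the PR and use no Hudi classes.

```java
/** Hypothetical helper illustrating the validation rule; not part of Hudi. */
final class EmptyCleanIntervalCheck {
  /** Sentinel taken from the diff: the default of -1 means "disabled". */
  static final long DISABLED = -1L;

  /** True when the interval is -1 (disabled) or a positive number of hours. */
  static boolean isValid(long hours) {
    return hours == DISABLED || hours >= 1;
  }

  /** Mirrors the builder check: throws for 0 and for values below -1. */
  static long validate(String key, long hours) {
    if (!isValid(hours)) {
      throw new IllegalArgumentException(
          key + " must be -1 (disabled) or >= 1, but was " + hours);
    }
    return hours;
  }

  public static void main(String[] args) {
    // "example.key" is a placeholder, not the real config key.
    System.out.println(validate("example.key", DISABLED));
    System.out.println(validate("example.key", 24L));
    try {
      validate("example.key", 0L);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Writing the accepted set positively (`-1` or `>= 1`) keeps the error message and the condition in step, which is the mismatch the review comment above points at.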
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1842,6 +1842,10 @@ public boolean isAutoClean() { | |
| return getBoolean(HoodieCleanConfig.AUTO_CLEAN); | ||
| } | ||
|
|
||
| public long maxIntervalToCreateEmptyCleanHours() { | ||
|
Contributor
getIntervalHoursToCreateEmptyClean
||
| return getLong(HoodieCleanConfig.MAX_INTERVAL_TO_CREATE_EMPTY_CLEAN_HOURS); | ||
| } | ||
|
|
||
| public boolean shouldArchiveBeyondSavepoint() { | ||
| return getBooleanOrDefault(HoodieArchivalConfig.ARCHIVE_BEYOND_SAVEPOINT); | ||
| } | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,12 +25,13 @@ | |
| import org.apache.hudi.common.engine.HoodieEngineContext; | ||
| import org.apache.hudi.common.engine.HoodieLocalEngineContext; | ||
| import org.apache.hudi.common.model.CleanFileInfo; | ||
| import org.apache.hudi.common.model.HoodieCleaningPolicy; | ||
| import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; | ||
| import org.apache.hudi.common.table.timeline.HoodieInstant; | ||
| import org.apache.hudi.common.table.timeline.HoodieInstantTimeGenerator; | ||
| import org.apache.hudi.common.table.timeline.HoodieTimeline; | ||
| import org.apache.hudi.common.util.CleanerUtils; | ||
| import org.apache.hudi.common.util.Option; | ||
| import org.apache.hudi.common.util.StringUtils; | ||
| import org.apache.hudi.common.util.collection.Pair; | ||
| import org.apache.hudi.config.HoodieWriteConfig; | ||
| import org.apache.hudi.exception.HoodieException; | ||
|
|
@@ -42,13 +43,18 @@ | |
| import lombok.extern.slf4j.Slf4j; | ||
|
|
||
| import java.io.IOException; | ||
| import java.text.ParseException; | ||
| import java.time.ZonedDateTime; | ||
| import java.util.ArrayList; | ||
| import java.util.Collections; | ||
| import java.util.HashMap; | ||
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.concurrent.TimeUnit; | ||
| import java.util.stream.Collectors; | ||
|
|
||
| import static org.apache.hudi.common.table.timeline.InstantComparison.LESSER_THAN; | ||
| import static org.apache.hudi.common.table.timeline.InstantComparison.compareTimestamps; | ||
| import static org.apache.hudi.common.util.CleanerUtils.SAVEPOINTED_TIMESTAMPS; | ||
| import static org.apache.hudi.common.util.MapUtils.nonEmpty; | ||
|
|
||
|
|
@@ -94,6 +100,23 @@ private boolean needsCleaning(CleaningTriggerStrategy strategy) { | |
| } | ||
| } | ||
|
|
||
| private HoodieCleanerPlan getEmptyCleanerPlan(Option<HoodieInstant> earliestInstant, CleanPlanner<T, I, K, O> planner) throws IOException { | ||
| HoodieCleanerPlan.Builder cleanBuilder = HoodieCleanerPlan.newBuilder() | ||
| .setFilePathsToBeDeletedPerPartition(Collections.emptyMap()) | ||
| .setExtraMetadata(prepareExtraMetadata(planner.getSavepointedTimestamps())); | ||
| if (earliestInstant.isPresent()) { | ||
| HoodieInstant hoodieInstant = earliestInstant.get(); | ||
| cleanBuilder.setPolicy(config.getCleanerPolicy().name()) | ||
| .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION) | ||
| .setEarliestInstantToRetain(new HoodieActionInstant(hoodieInstant.requestedTime(), hoodieInstant.getAction(), hoodieInstant.getState().name())) | ||
| .setLastCompletedCommitTimestamp(planner.getLastCompletedCommitTimestamp()); | ||
|
|
||
| } else { | ||
| cleanBuilder.setPolicy(config.getCleanerPolicy().name()) | ||
| .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION); | ||
| } | ||
| return cleanBuilder.build(); | ||
|
|
||
| } | ||
|
|
||
| /** | ||
| * Generates List of files to be cleaned. | ||
| * | ||
|
|
@@ -109,8 +132,8 @@ HoodieCleanerPlan requestClean(HoodieEngineContext context) { | |
| context.clearJobStatus(); | ||
|
|
||
| if (partitionsToClean.isEmpty()) { | ||
| log.info("Nothing to clean here. It is already clean"); | ||
| return HoodieCleanerPlan.newBuilder().setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name()).build(); | ||
| log.info("Partitions to clean returned empty. Checking to see if empty clean needs to be created."); | ||
| return getEmptyCleanerPlan(earliestInstant, planner); | ||
| } | ||
| log.info( | ||
| "Earliest commit to retain for clean : {}", | ||
|
|
@@ -213,14 +236,61 @@ protected Option<HoodieCleanerPlan> requestClean() { | |
| cleanerEngineContext = context; | ||
| } | ||
| final HoodieCleanerPlan cleanerPlan = requestClean(cleanerEngineContext); | ||
| Option<HoodieCleanerPlan> option = Option.empty(); | ||
| if (nonEmpty(cleanerPlan.getFilePathsToBeDeletedPerPartition()) | ||
| && cleanerPlan.getFilePathsToBeDeletedPerPartition().values().stream().mapToInt(List::size).sum() > 0) { | ||
| Option<HoodieCleanerPlan> cleanPlanOpt = Option.empty(); | ||
| if ((cleanerPlan.getPartitionsToBeDeleted() != null && !cleanerPlan.getPartitionsToBeDeleted().isEmpty()) | ||
| || (nonEmpty(cleanerPlan.getFilePathsToBeDeletedPerPartition()) | ||
| && cleanerPlan.getFilePathsToBeDeletedPerPartition().values().stream().mapToInt(List::size).sum() > 0)) { | ||
| // Only create cleaner plan which does some work | ||
| option = Option.of(cleanerPlan); | ||
| cleanPlanOpt = Option.of(cleanerPlan); | ||
| } | ||
| // If the cleaner plan is empty, incremental clean is enabled, and no clean has completed | ||
| // within the interval configured by MAX_INTERVAL_TO_CREATE_EMPTY_CLEAN_HOURS, | ||
| // create an empty clean to avoid a full scan in the future. | ||
| // Note: for a dataset with incremental clean enabled that does not receive any updates, the cleaner plan always | ||
| // has an empty list of files to be cleaned, so CleanActionExecutor would never be invoked for it. | ||
|
|
||
| // To avoid a full scan of the dataset on every ingestion run, an empty clean commit is created here. | ||
| if (cleanPlanOpt.isEmpty() && config.incrementalCleanerModeEnabled() && cleanerPlan.getEarliestInstantToRetain() != null && config.maxIntervalToCreateEmptyCleanHours() > 0) { | ||
| // Only create an empty clean commit if earliestInstantToRetain is present in the plan | ||
| boolean eligibleForEmptyCleanCommit = true; | ||
|
|
||
| // If there is no previous clean instant, or the previous clean is older than the configured interval, schedule an empty clean commit | ||
| Option<HoodieInstant> lastCleanInstant = table.getCleanTimeline().filterCompletedInstants().lastInstant(); | ||
| if (lastCleanInstant.isPresent()) { | ||
| try { | ||
| ZonedDateTime latestDateTime = ZonedDateTime.ofInstant(java.time.Instant.now(), table.getMetaClient().getTableConfig().getTimelineTimezone().getZoneId()); | ||
|
|
||
| long currentCleanTimeMs = latestDateTime.toInstant().toEpochMilli(); | ||
| long lastCleanTimeMs = HoodieInstantTimeGenerator.parseDateFromInstantTime(lastCleanInstant.get().requestedTime()).toInstant().toEpochMilli(); | ||
| eligibleForEmptyCleanCommit = currentCleanTimeMs - lastCleanTimeMs > (TimeUnit.HOURS.toMillis(config.maxIntervalToCreateEmptyCleanHours())); | ||
| } catch (ParseException e) { | ||
| log.error("Unable to parse last clean commit time", e); | ||
| throw new HoodieException("Unable to parse last clean commit time", e); | ||
|
|
||
| } | ||
| } | ||
| if (eligibleForEmptyCleanCommit) { | ||
| // Ensure earliestCommitToRetain doesn't go backwards when user changes cleaner configuration | ||
| if (lastCleanInstant.isPresent()) { | ||
| try { | ||
| HoodieCleanMetadata lastCleanMetadata = table.getActiveTimeline().readCleanMetadata(lastCleanInstant.get()); | ||
| String previousEarliestCommitToRetain = lastCleanMetadata.getEarliestCommitToRetain(); | ||
| String currentEarliestCommitToRetain = cleanerPlan.getEarliestInstantToRetain().getTimestamp(); | ||
|
|
||
| return option; | ||
| if (!StringUtils.isNullOrEmpty(previousEarliestCommitToRetain) && !StringUtils.isNullOrEmpty(currentEarliestCommitToRetain) | ||
| && compareTimestamps(currentEarliestCommitToRetain, LESSER_THAN, previousEarliestCommitToRetain)) { | ||
| log.warn("Skipping empty clean creation because earliestCommitToRetain would go backwards. " | ||
| + "Previous: {}, Current: {}. This can happen when cleaner configuration is changed.", | ||
| previousEarliestCommitToRetain, currentEarliestCommitToRetain); | ||
|
Comment on lines +279 to +281
Contributor
Should the
Contributor Author
We can set getEarliestInstantToRetain in the current plan to previousEarliestCommitToRetain and go ahead with the empty clean.
Contributor Author
Hey @yihua: the completed clean instant will only contain the I don't think it's worth fetching the HoodieInstant from the previous clean plan (deserializing the requested clean) for this case.
Contributor
I don't understand the claim. The previous ECTR is already fetched as
||
| return Option.empty(); | ||
| } | ||
| } catch (IOException e) { | ||
| log.error("Unable to read last clean metadata", e); | ||
| throw new HoodieException("Unable to read last clean metadata", e); | ||
| } | ||
| } | ||
| log.info("Creating an empty clean instant with earliestCommitToRetain of {}", cleanerPlan.getEarliestInstantToRetain().getTimestamp()); | ||
| return Option.of(cleanerPlan); | ||
|
|
||
| } | ||
| } | ||
| return cleanPlanOpt; | ||
| } | ||
|
|
||
| @Override | ||
|
|
||
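The time-based part of the decision in `requestClean()` above boils down to: the feature must be enabled (interval > 0), and either no clean has completed yet or the last one is older than the configured number of hours. A rough standalone sketch of that comparison follows. The class `EmptyCleanEligibility` is hypothetical, and the instant-time layout (`yyyyMMddHHmmss` plus three millisecond digits) is an assumption made here for illustration; the PR itself delegates parsing to `HoodieInstantTimeGenerator.parseDateFromInstantTime`.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of the elapsed-interval check; not Hudi code. */
final class EmptyCleanEligibility {
  // Assumed layout: yyyyMMddHHmmss followed by exactly 3 millisecond digits.
  private static final DateTimeFormatter INSTANT_FORMAT = new DateTimeFormatterBuilder()
      .appendPattern("yyyyMMddHHmmss")
      .appendValue(ChronoField.MILLI_OF_SECOND, 3)
      .toFormatter();

  /** Converts an instant-time string to epoch milliseconds in the given zone. */
  static long toEpochMillis(String instantTime, ZoneId zone) {
    return LocalDateTime.parse(instantTime, INSTANT_FORMAT)
        .atZone(zone)
        .toInstant()
        .toEpochMilli();
  }

  /**
   * Returns true when an empty clean may be scheduled on time grounds:
   * the interval is enabled and either there is no previous clean or
   * strictly more than intervalHours have passed since it was requested.
   */
  static boolean intervalElapsed(String lastCleanInstantTime, long nowMs,
                                 long intervalHours, ZoneId zone) {
    if (intervalHours <= 0) {
      return false; // -1 (or any non-positive value) means the feature is off
    }
    if (lastCleanInstantTime == null) {
      return true; // no completed clean yet
    }
    long lastCleanMs = toEpochMillis(lastCleanInstantTime, zone);
    return nowMs - lastCleanMs > TimeUnit.HOURS.toMillis(intervalHours);
  }

  public static void main(String[] args) {
    ZoneId utc = ZoneOffset.UTC;
    long now = toEpochMillis("20240101030000000", utc);
    // Last clean was requested three hours before "now" in this example.
    System.out.println(intervalElapsed("20240101000000000", now, 2, utc));
    System.out.println(intervalElapsed("20240101000000000", now, 4, utc));
  }
}
```

Passing `nowMs` in as a parameter, rather than reading the clock inside the method, is a choice made for this sketch so the comparison can be exercised with fixed inputs.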
`MAX_INTERVAL_TO_CREATE_EMPTY_CLEAN_HOURS` -> `INTERVAL_TO_CREATE_EMPTY_CLEAN_HOURS`
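For the earliestCommitToRetain guard debated in the review thread earlier on this page, the two options on the table can be sketched side by side: skip the empty clean when the value would move backwards (what the diff does), or keep the previous value and proceed (the alternative the author raised). The class `EarliestCommitGuard` is hypothetical, and plain lexicographic string comparison is an assumption that holds only when both timestamps share the same fixed-width format; the PR uses `compareTimestamps` with `LESSER_THAN` instead.

```java
/** Hypothetical sketch of the backwards-movement guard; not Hudi code. */
final class EarliestCommitGuard {
  static boolean isNullOrEmpty(String s) {
    return s == null || s.isEmpty();
  }

  /**
   * True when the current value is strictly earlier than the previously
   * recorded one. Missing values on either side never count as backwards.
   */
  static boolean wouldGoBackwards(String previous, String current) {
    return !isNullOrEmpty(previous)
        && !isNullOrEmpty(current)
        && current.compareTo(previous) < 0; // assumes fixed-width timestamps
  }

  /** Alternative raised in review: keep the later value instead of skipping. */
  static String clampToPrevious(String previous, String current) {
    return wouldGoBackwards(previous, current) ? previous : current;
  }

  public static void main(String[] args) {
    String previous = "20240102000000000";
    String current = "20240101000000000";
    System.out.println(wouldGoBackwards(previous, current));
    System.out.println(clampToPrevious(previous, current));
  }
}
```

Skipping is the more conservative option because it writes nothing; clamping still records an empty clean but requires the plan to carry a value it did not compute itself.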