
[OSS PR #18337] feat(clean): Adding empty clean support to hudi#31

Open
yihua wants to merge 1 commit into master from oss-18337


@yihua yihua commented Apr 10, 2026

Mirror of apache#18337 for automated bot review.

Original author: @nsivabalan
Base branch: master

Summary by CodeRabbit

Release Notes

  • New Features

    • Added configuration property hoodie.write.empty.clean.create.duration.ms to control when empty clean commits are created based on time intervals.
  • Bug Fixes

    • Fixed cleaner parallelism calculation to prevent zero-parallelism execution.
    • Improved handling of empty partition scenarios during clean operations.
  • Tests

    • Added comprehensive test coverage for empty clean operation scenarios.
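
As a rough illustration of how the new property might be set, the sketch below builds a plain `java.util.Properties` object carrying the key named in the release notes. The class name, the helper method, and the six-hour value are assumptions chosen for illustration; the PR also adds a builder method on HoodieCleanConfig, whose exact signature is not shown on this page, so it is not used here.

```java
import java.util.Properties;

// Hypothetical helper class; only the property key comes from the PR summary.
final class EmptyCleanConfigExample {
    // Key name as listed in the release notes.
    static final String EMPTY_CLEAN_DURATION_KEY =
        "hoodie.write.empty.clean.create.duration.ms";

    // Builds writer properties that allow an empty clean after the given interval.
    static Properties withEmptyCleanEvery(long durationMs) {
        Properties props = new Properties();
        props.setProperty(EMPTY_CLEAN_DURATION_KEY, Long.toString(durationMs));
        return props;
    }

    public static void main(String[] args) {
        // Six hours expressed in milliseconds (an arbitrary example value).
        Properties props = withEmptyCleanEvery(6L * 60 * 60 * 1000);
        System.out.println(props.getProperty(EMPTY_CLEAN_DURATION_KEY));
    }
}
```

These properties would then be passed to whatever builds the write config for the job.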


coderabbitai Bot commented Apr 10, 2026

Important

Review skipped

Too many files!

This PR contains 206 files, which is 56 over the limit of 150.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3d0287d3-57bd-4995-bb0d-eb0da886a794

📥 Commits

Reviewing files that changed from the base of the PR and between 56371a0 and 5208c76.

📒 Files selected for processing (206)
  • docker/build_docker_images.sh
  • hudi-cli/src/main/java/org/apache/hudi/cli/utils/InputStreamConsumer.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/utils/ArchivalUtils.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/FileGroupReaderBasedAppendHandle.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/FileGroupReaderBasedMergeHandle.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanner.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/BaseRollbackActionExecutor.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/MarkerBasedRollbackStrategy.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/RollbackHelper.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/RollbackHelperFactory.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/RollbackHelperV1.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/ZeroToOneUpgradeHandler.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/util/DistributedRegistryUtil.java
  • hudi-client/hudi-client-common/src/test/java/org/apache/hudi/client/TestBaseHoodieTableServiceClient.java
  • hudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.java
  • hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieMetadataWriteUtils.java
  • hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/TestCleanPlanner.java
  • hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/rollback/TestRollbackHelper.java
  • hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/util/HoodieSchemaConverter.java
  • hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.java
  • hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java
  • hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java
  • hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metrics/DistributedRegistry.java
  • hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/avro/HoodieSparkSchemaConverters.scala
  • hudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/hudi/SparkAdapter.scala
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/common/table/log/TestLogReaderUtils.java
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metrics/TestDistributedRegistry.java
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestMarkerBasedRollbackStrategy.java
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieCleanerTestBase.java
  • hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java
  • hudi-common/src/main/java/org/apache/hudi/common/engine/HoodieEngineContext.java
  • hudi-common/src/main/java/org/apache/hudi/common/model/DefaultHoodieRecordPayload.java
  • hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java
  • hudi-common/src/main/java/org/apache/hudi/common/model/HoodiePayloadProps.java
  • hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/log/BaseHoodieLogRecordReader.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/log/LogReaderUtils.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/read/HoodieReadStats.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/read/buffer/LogScanningRecordBufferLoader.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/timeline/BaseHoodieTimeline.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/timeline/versioning/v1/ArchivedTimelineV1.java
  • hudi-common/src/main/java/org/apache/hudi/common/util/CleanerUtils.java
  • hudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.java
  • hudi-common/src/main/java/org/apache/hudi/common/util/collection/RocksDBDAO.java
  • hudi-common/src/main/java/org/apache/hudi/common/util/queue/DisruptorMessageQueue.java
  • hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieFileReaderFactory.java
  • hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieFileWriterFactory.java
  • hudi-common/src/main/java/org/apache/hudi/metadata/FileSystemBackedTableMetadata.java
  • hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
  • hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadata.java
  • hudi-common/src/main/java/org/apache/hudi/util/PartitionPathFilterUtil.java
  • hudi-common/src/test/java/org/apache/hudi/common/model/TestDefaultHoodieRecordPayload.java
  • hudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.java
  • hudi-common/src/test/java/org/apache/hudi/common/table/read/TestHoodieReadStats.java
  • hudi-common/src/test/java/org/apache/hudi/common/util/TestCleanerUtils.java
  • hudi-common/src/test/java/org/apache/hudi/util/TestPartitionPathFilterUtil.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkMdtCompactionMetrics.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/metrics/FlinkRocksDBIndexMetrics.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/CleanFunction.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunctionWithDisruptorBufferSort.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bootstrap/RLIBootstrapOperator.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/common/AbstractStreamWriteFunction.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactOperator.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionCommitSink.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CleanHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CompactHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CompactionCommitHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CompactionPlanHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CompositeCleanHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CompositeCompactHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CompositeCompactionCommitHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CompositeCompactionPlanHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CompositeTableServiceHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/DataTableCompactHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/DataTableCompactionCommitHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/DataTableCompactionPlanHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/DefaultCleanHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/MetadataTableCompactHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/MetadataTableCompactionCommitHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/MetadataTableCompactionPlanHandler.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/TableServiceHandlerFactory.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/event/Correspondent.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/event/WriteMetadataEvent.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/BucketAssignFunction.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/index/IndexBackend.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/index/RocksDBIndexBackend.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/EventBuffers.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/HoodieScanContext.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/HoodieSource.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/HoodieSourceReader.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/HoodieSourceSplitReader.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/RecordLimiter.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/function/AbstractSplitReaderFunction.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/function/HoodieCdcSplitReaderFunction.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/function/HoodieSplitReaderFunction.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/split/assign/HoodieSplitAssigners.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/split/assign/HoodieSplitBucketAssigner.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/catalog/HoodieHiveCatalog.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/FlinkRowDataReaderContext.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcImageManager.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcInputFormat.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcIterators.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/metrics/TestFlinkCompactionMetrics.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteCopyOnWrite.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteMergeOnRead.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/append/TestAppendWriteFunctionWithBufferSort.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/compact/handler/TestCompositeHandlers.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/event/TestWriteMetadataEvent.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/partitioner/index/TestRocksDBIndexBackend.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/MockStateSnapshotContext.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestEventBuffers.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/TestWriteBase.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/TestHoodieSourceSplitReader.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/TestRecordLimiter.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestAbstractSplitReaderFunction.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieCdcSplitReaderFunction.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieSplitReaderFunction.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/split/TestDefaultHoodieSplitProvider.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/split/assign/TestHoodieSplitBucketAssigner.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/TestHoodieTableFactory.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/catalog/TestHoodieHiveCatalog.java
  • hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/format/TestFlinkRowDataReaderContext.java
  • hudi-gcp/src/main/java/org/apache/hudi/gcp/bigquery/BigQuerySchemaResolver.java
  • hudi-gcp/src/test/java/org/apache/hudi/gcp/bigquery/TestBigQuerySchemaResolver.java
  • hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/ParquetReaderIterator.java
  • hudi-hadoop-common/src/main/java/org/apache/hudi/hadoop/fs/HoodieWrapperFileSystem.java
  • hudi-hadoop-common/src/test/java/org/apache/hudi/common/table/TestHoodieTableConfig.java
  • hudi-hadoop-common/src/test/java/org/apache/hudi/common/table/timeline/TestArchivedTimelineV1.java
  • hudi-hadoop-common/src/test/java/org/apache/hudi/common/util/TestCompactionUtils.java
  • hudi-hadoop-common/src/test/java/org/apache/hudi/common/util/TestParquetReaderIterator.java
  • hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HiveHoodieReaderContext.java
  • hudi-io/src/main/java/org/apache/hudi/common/metrics/LocalRegistry.java
  • hudi-io/src/main/java/org/apache/hudi/common/metrics/Registry.java
  • hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileMetaIndexBlock.java
  • hudi-io/src/main/java/org/apache/hudi/io/hfile/HFileRootIndexBlock.java
  • hudi-io/src/main/java/org/apache/hudi/io/util/IOUtils.java
  • hudi-io/src/test/java/org/apache/hudi/io/hfile/TestHFileWriter.java
  • hudi-io/src/test/java/org/apache/hudi/io/util/TestIOUtils.java
  • hudi-spark-datasource/hudi-spark-common/pom.xml
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelationV1.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelationV2.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/metadata/CatalogBackedTableMetadata.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/util/IncrementalRelationUtil.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/HoodieTableChanges.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/HoodieVectorSearchTableValuedFunction.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieVectorSearchPlanBuilder.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/analysis/TableValuedFunctions.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/analysis/VectorDistanceUtils.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/streaming/HoodieStreamSourceV1.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/streaming/HoodieStreamSourceV2.scala
  • hudi-spark-datasource/hudi-spark/pom.xml
  • hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieAnalysis.scala
  • hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/RunCleanProcedure.scala
  • hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowCommitsProcedure.scala
  • hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/table/TestHoodieMergeOnReadTable.java
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestDataSourceDefaults.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndex.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestHoodieVectorSearchFunction.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestIncrementalQueryColumnPruning.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsIndex.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestPartitionStatsPruning.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/avro/TestSchemaConverters.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/common/TestCatalogBackedTableMetadata.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/common/TestInstantTimeValidation.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/common/TestSqlConf.scala
  • hudi-spark-datasource/hudi-spark3-common/src/main/scala/org/apache/spark/sql/adapter/BaseSpark3Adapter.scala
  • hudi-spark-datasource/hudi-spark3.3.x/src/main/scala/org/apache/spark/sql/hudi/analysis/Spark33HoodiePruneFileSourcePartitions.scala
  • hudi-spark-datasource/hudi-spark4-common/src/main/scala/org/apache/spark/sql/adapter/BaseSpark4Adapter.scala
  • hudi-spark-datasource/hudi-spark4.0.x/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala
  • hudi-spark-datasource/hudi-spark4.0.x/src/test/java/org/apache/hudi/io/storage/row/TestHoodieRowParquetWriteSupportVariant.java
  • hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java
  • hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestSparkSchemaUtils.java
  • hudi-sync/hudi-hive-sync/src/test/java/org/apache/hudi/hive/util/TestHiveSchemaUtil.java
  • hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SparkSchemaUtils.java
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
  • hudi-utilities/src/test/java/org/apache/hudi/utilities/offlinejob/TestHoodieClusteringJob.java
  • pom.xml

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

This pull request introduces an empty clean feature for Hudi, allowing scheduled empty clean commits when no partitions require deletion. A new configuration property MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS controls whether empty clean operations are created based on elapsed time since the last clean. Configuration accessors, plan generation logic, and metadata handling have been added across multiple executor and test files.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration Properties: hudi-client/.../HoodieCleanConfig.java, hudi-client/.../HoodieWriteConfig.java | Added MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS configuration property with default -1L, builder method withMaxDurationToCreateEmptyClean(...), and public accessor maxDurationToCreateEmptyCleanMs() to expose the setting. |
| Clean Executors: hudi-client/.../CleanActionExecutor.java, hudi-client/.../CleanPlanActionExecutor.java | Added parallelism minimum enforcement; added getEmptyCleanerPlan(...) to generate plans with no deletions; refactored runClean(...) to create empty clean metadata via new createEmptyCleanMetadata(...) helper; introduced conditional empty clean creation based on time threshold and eligibility checks to prevent backward retention movement. |
| Unit Tests – Core Functionality: hudi-client/.../TestCleaner.java | Updated partition constant references from HoodieTestDataGenerator.NO_PARTITION_PATH to HoodieTestUtils.NO_PARTITION_PATH; added testEmptyClean() to verify empty clean execution creates proper metadata with no partitions. |
| Unit Tests – Plan Execution: hudi-client/.../TestCleanPlanExecutor.java | Added two comprehensive test cases: testEmptyCleansAddedAfterThreshold() verifies timing-based empty clean scheduling, and testEmptyCleanDoesNotGoBackwardsOnConfigChange() ensures retention boundaries don't regress; added commitToTestTable(...) helper and supporting imports. |
| Unit Tests – Utilities: hudi-client/.../TestCleanPlanner.java, hudi-client/.../HoodieCleanerTestBase.java | Updated test metadata computation in getCleanCommitMetadata(...); refactored runCleaner(...) test helper to externalize SparkRDDWriteClient acquisition and timestamp formatting with new method overload. |
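
The "parallelism minimum enforcement" mentioned for the clean executors can be pictured with the sketch below. It is not the PR's code: the class and method names are hypothetical, and it only assumes that the effective parallelism is the configured value capped by the number of items to process, with a floor of one so that an empty plan never requests zero tasks.

```java
// Hypothetical sketch; the actual Hudi calculation is not shown on this page.
final class CleanParallelismSketch {
    // Cap the configured parallelism by the amount of work, but never go below 1.
    static int cleanerParallelism(int configuredParallelism, int numTargets) {
        return Math.max(1, Math.min(configuredParallelism, numTargets));
    }

    public static void main(String[] args) {
        // An empty plan has zero targets; without the floor this would be 0.
        System.out.println(cleanerParallelism(200, 0));
    }
}
```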

Sequence Diagram

sequenceDiagram
    participant Client as Clean Planner
    participant Executor as CleanPlanActionExecutor
    participant Timeline as Active Timeline
    participant Metadata as Metadata Generator
    participant Config as HoodieWriteConfig

    Client->>Executor: requestClean()
    Executor->>Config: maxDurationToCreateEmptyCleanMs()
    Config-->>Executor: duration threshold
    
    alt Plan has work
        Executor->>Executor: createNormalCleanerPlan()
        Executor-->>Client: return plan
    else Plan is empty & incremental mode
        Executor->>Timeline: getLastCleanInstantTime()
        Timeline-->>Executor: last clean timestamp
        Executor->>Executor: compare elapsed time vs threshold
        
        alt Time threshold exceeded & retention valid
            Executor->>Executor: getEmptyCleanerPlan()
            Executor->>Timeline: reload active timeline
            Executor->>Metadata: createEmptyCleanMetadata()
            Metadata-->>Executor: empty clean metadata
            Executor-->>Client: return empty plan & metadata
        else Threshold not met or retention invalid
            Executor-->>Client: return empty (skip clean)
        end
    end
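
The branching in the diagram above can be condensed into a single predicate. The sketch below is an interpretation of the described conditions, not the PR's implementation: the class, method, and parameter names are hypothetical, and the inputs are passed as plain values so the example stays self-contained.

```java
// Hypothetical condensation of the eligibility checks described in the review.
final class EmptyCleanEligibilitySketch {
    static boolean shouldCreateEmptyClean(
            boolean planHasWork,
            boolean incrementalCleanMode,
            boolean earliestInstantPresent,
            boolean retentionWouldRegress,
            long maxDurationMs,
            long elapsedSinceLastCleanMs) {
        if (planHasWork) {
            return false; // a normal clean plan is produced instead
        }
        if (!incrementalCleanMode || !earliestInstantPresent) {
            return false; // empty cleans only apply to incremental cleaning
        }
        if (maxDurationMs <= 0) {
            return false; // non-positive threshold (default -1) disables the feature
        }
        if (retentionWouldRegress) {
            return false; // never move the retention boundary backwards
        }
        return elapsedSinceLastCleanMs > maxDurationMs;
    }

    public static void main(String[] args) {
        // Threshold of one hour, last clean two hours ago.
        System.out.println(shouldCreateEmptyClean(false, true, true, false, 3_600_000L, 7_200_000L));
    }
}
```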

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A hoppy commit hops in with care,
Empty cleans floating through the air,
Timing checks keep things just right,
Metadata glows in retention's light,
Clean schedules now hop with delight! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main feature addition: empty clean support for Hudi, which is reflected across multiple files including configuration, executor, and comprehensive test coverage. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


greptile-apps Bot commented Apr 10, 2026

Greptile Summary

This PR introduces an empty clean commit feature to Hudi's cleaning subsystem. For datasets using incremental clean that receive few or no updates, the cleaner planner previously never generated a plan (nothing to clean), meaning the earliestCommitToRetain pointer was never advanced. On the next run, the cleaner would then need to do a full-table scan. The feature adds:

  • A new config hoodie.write.empty.clean.create.duration.ms (default -1, disabled) that gates how often an empty clean commit may be written.
  • Logic in CleanPlanActionExecutor to emit an "empty" plan (no files to delete) if no work is needed but the threshold has elapsed, and a safety guard to prevent earliestCommitToRetain from regressing when the cleaner policy is changed.
  • A corresponding createEmptyCleanMetadata builder in CleanActionExecutor for the execution side.
  • New tests in TestCleanPlanExecutor covering the threshold, the regression guard, and extra metadata (savepoints).

Key findings:

  • A potential NullPointerException in createEmptyCleanMetadata when the clean plan's EarliestInstantToRetain is null.
  • Missing setVersion() call in getEmptyCleanerPlan for the earliestInstant-absent branch.
  • Unintended in-place mutation of the caller-supplied extraMetadata map in prepareExtraMetadata.

Confidence Score: 3/5

Not safe to merge as-is due to a latent NPE in createEmptyCleanMetadata that can be triggered by plans with only partition deletions and no earliest instant to retain.

The feature design and the threshold/regression-guard logic are sound. However, the NPE on cleanerPlan.getEarliestInstantToRetain().getTimestamp() in createEmptyCleanMetadata (line 250, CleanActionExecutor) is a real defect — it can be hit when a clean plan carries only partitionsToBeDeleted entries with no file paths and a null earliestInstantToRetain. Additionally, the missing setVersion in the absent-earliestInstant branch of getEmptyCleanerPlan risks plan-version skew. Both issues have targeted one-line fixes; once addressed, the PR is in good shape.

CleanActionExecutor.java (NPE on line 250) and CleanPlanActionExecutor.java (missing setVersion, extraMetadata mutation on lines 113–116 and 193–194) need attention before merge.
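
One way to picture a guard against the NPE described above is to wrap the nullable field before dereferencing it. The sketch below uses hypothetical stand-in classes (PlanStub, InstantStub) rather than Hudi's real plan types, and it is only one possible shape for the fix, not the change the reviewer or author proposed.

```java
import java.util.Optional;

// Hypothetical stand-ins for the clean plan and its instant field.
final class EarliestInstantNullSafetySketch {
    static final class InstantStub {
        private final String timestamp;
        InstantStub(String timestamp) { this.timestamp = timestamp; }
        String getTimestamp() { return timestamp; }
    }

    static final class PlanStub {
        private final InstantStub earliestInstantToRetain; // may be null
        PlanStub(InstantStub earliestInstantToRetain) {
            this.earliestInstantToRetain = earliestInstantToRetain;
        }
        InstantStub getEarliestInstantToRetain() { return earliestInstantToRetain; }
    }

    // Returns the timestamp if present, avoiding a direct chained dereference.
    static Optional<String> earliestTimestamp(PlanStub plan) {
        return Optional.ofNullable(plan.getEarliestInstantToRetain())
                       .map(InstantStub::getTimestamp);
    }

    public static void main(String[] args) {
        System.out.println(earliestTimestamp(new PlanStub(null)).isPresent());
    }
}
```

The caller then decides what an absent value means, for example skipping the field in the metadata.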

Important Files Changed

| Filename | Overview |
|---|---|
| hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java | Adds createEmptyCleanMetadata helper and a new branch in runClean to emit empty metadata when cleanStats is empty; contains a NPE risk on getEarliestInstantToRetain().getTimestamp() when the plan's field is null. |
| hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java | Adds getEmptyCleanerPlan, threshold-based eligibility logic, and an earliestCommitToRetain regression guard; missing setVersion in the absent-earliestInstant branch and mutates the caller-supplied extraMetadata map. |
| hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java | Adds MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS config (default -1, opt-in) with builder method; config definition is clean and safe. |
| hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java | Adds maxDurationToCreateEmptyCleanMs() accessor delegating to the new HoodieCleanConfig key; straightforward change. |
| hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java | Adds testEmptyCleansAddedAfterThreshold and testEmptyCleanDoesNotGoBackwardsOnConfigChange covering the happy path, regression guard, and savepoint metadata; no test covers the null-earliestInstantToRetain NPE path in createEmptyCleanMetadata. |
| hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java | Adds testEmptyClean validating execution of a manually constructed empty clean plan; test always provides a non-null EarliestInstantToRetain so the NPE path is not exercised. |
| hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/TestCleanPlanner.java | No changes to existing planner tests; file referenced for context only. |
| hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieCleanerTestBase.java | Adds a runCleaner overload accepting a SparkRDDWriteClient and an explicit instant time to support time-based threshold testing. |

Sequence Diagram

sequenceDiagram
    participant CE as CleanActionExecutor
    participant CPAE as CleanPlanActionExecutor
    participant CP as CleanPlanner
    participant AT as ActiveTimeline

    CE->>CPAE: execute()
    CPAE->>CP: getEarliestCommitToRetain()
    CP-->>CPAE: earliestInstant (may be empty)
    CPAE->>CP: getPartitionPathsToClean(earliestInstant)
    CP-->>CPAE: partitionsToClean

    alt partitionsToClean is empty
        CPAE->>CPAE: getEmptyCleanerPlan(earliestInstant)
        CPAE-->>CE: cleanerPlan (empty files, set/null earliestInstantToRetain)
    else partitions exist
        CPAE->>CP: getDeletePaths per partition
        CPAE-->>CE: cleanerPlan (with file paths)
    end

    CE->>CE: Check empty plan eligibility
    note over CE: cleanPlanOpt.isEmpty() &&<br/>incrementalCleanerMode &&<br/>earliestInstantToRetain != null &&<br/>maxDuration > 0 &&<br/>time since last clean > threshold

    alt eligible for empty clean
        CE->>AT: saveToCleanRequested (empty plan)
        CE->>CE: runClean(table, cleanInstant, cleanerPlan)
        CE->>CE: clean(context, cleanerPlan) → cleanStats (empty list)
        CE->>CE: createEmptyCleanMetadata(cleanerPlan, ...)
        note over CE: NPE if getEarliestInstantToRetain() == null
        CE->>AT: transitionCleanInflightToComplete
    else not eligible
        CE-->>CE: return cleanPlanOpt (empty)
    end
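The eligibility note in the diagram can be read as a single boolean predicate. The following is a minimal sketch of that reading, not the PR's code: the class, method, and parameter names are hypothetical, and java.util.Optional stands in for Hudi's own option type.

```java
import java.util.Optional;

// Hypothetical, self-contained model of the eligibility note in the diagram.
// None of these names are Hudi APIs.
final class EmptyCleanEligibility {

  static boolean isEligible(boolean planIsEmpty,
                            boolean incrementalCleanerMode,
                            Optional<String> earliestInstantToRetain,
                            long maxDurationMs,
                            long lastCleanTimeMs,
                            long nowMs) {
    return planIsEmpty
        && incrementalCleanerMode
        && earliestInstantToRetain.isPresent()
        // The default of -1 keeps the feature disabled.
        && maxDurationMs > 0
        // Only emit an empty clean once the threshold has elapsed.
        && (nowMs - lastCleanTimeMs) > maxDurationMs;
  }
}
```

Under these assumptions, a missing earliest instant rules out the empty-clean path before createEmptyCleanMetadata is reached; the NPE flagged in the diagram concerns plans that arrive there by another route.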

Comments Outside Diff (1)

  1. hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java, line 193-194 (link)

    P2 In-place mutation of the caller-supplied extraMetadata map

    extraMetadata.orElseGet(() -> new HashMap<>()) returns the contained map when extraMetadata is present. The subsequent .put(SAVEPOINTED_TIMESTAMPS, ...) then mutates the original map that was passed into the executor constructor. If that map is:

    • unmodifiable (e.g. Collections.emptyMap() or Collections.unmodifiableMap(...)) this will throw UnsupportedOperationException.
    • shared across calls, subsequent calls will carry over stale savepoint data from a previous call.

    Create a defensive copy instead:
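
The suggested snippet did not survive in this copy of the review. As a rough illustration only, a defensive copy could look like the sketch below. It assumes java.util.Optional in place of Hudi's option type, and the class name, method name, and key value are placeholders rather than the PR's actual identifiers.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; names and the key value are placeholders.
final class ExtraMetadataCopy {
  static final String SAVEPOINTED_TIMESTAMPS = "savepointed_timestamps";

  static Map<String, String> withSavepoints(Optional<Map<String, String>> extraMetadata,
                                            String savepoints) {
    // Copy first, so the caller's map is never mutated,
    // even when it is unmodifiable or shared across calls.
    Map<String, String> copy =
        new HashMap<>(extraMetadata.orElse(Collections.emptyMap()));
    copy.put(SAVEPOINTED_TIMESTAMPS, savepoints);
    return copy;
  }
}
```

Copying before the put means an unmodifiable input no longer throws and a shared input does not accumulate savepoint data between calls.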

Reviews (1): Last reviewed commit: "Addressing feedback"

.setTimeTakenInMillis(timeTakenMillis)
.setTotalFilesDeleted(0)
.setLastCompletedCommitTimestamp(cleanerPlan.getLastCompletedCommitTimestamp())
.setEarliestCommitToRetain(cleanerPlan.getEarliestInstantToRetain().getTimestamp())

P1 Potential NullPointerException on getEarliestInstantToRetain()

createEmptyCleanMetadata is reached whenever cleanStats.isEmpty(). cleanStats is built by streaming over cleanerPlan.getFilePathsToBeDeletedPerPartition().keySet(), so it is empty when the partition-file map is empty — which is exactly the case for any plan that has only partitionsToBeDeleted entries with no file paths.

In the regular (non-empty-clean-commit) code path inside requestClean(context):

return new HoodieCleanerPlan(
    earliestInstant.map(...).orElse(null),   // ← null when no earliest instant
    ...
);

If earliestInstant is absent (e.g. KEEP_LATEST_FILE_VERSIONS policy with no commits beyond the threshold) the plan's EarliestInstantToRetain field is null. If partitionsToBeDeleted is non-empty for that plan and there are no file paths, cleanStats will be empty, createEmptyCleanMetadata will be called, and line 250 will NPE:

.setEarliestCommitToRetain(cleanerPlan.getEarliestInstantToRetain().getTimestamp())
//                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ can be null
Suggested change
-.setEarliestCommitToRetain(cleanerPlan.getEarliestInstantToRetain().getTimestamp())
+.setEarliestCommitToRetain(cleanerPlan.getEarliestInstantToRetain() != null
+    ? cleanerPlan.getEarliestInstantToRetain().getTimestamp() : null)

Comment on lines +113 to +116
} else {
cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name());
}
return cleanBuilder.build();

P1 setVersion is missing in the else branch of getEmptyCleanerPlan

When earliestInstant is absent, the builder never calls .setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION), so the serialised plan will have the Avro-schema default (typically 1 or 0 depending on the schema evolution). When this plan is later read back by CleanerUtils.getCleanerPlan, a version mismatch can cause silent compatibility issues or an UnsupportedOperationException in version-dispatch logic.

setVersion should be applied unconditionally:

Suggested change
-} else {
-cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name());
-}
-return cleanBuilder.build();
+} else {
+cleanBuilder.setPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name())
+.setVersion(CleanPlanner.LATEST_CLEAN_PLAN_VERSION);
+}

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🧹 Nitpick comments (1)
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java (1)

244-255: Prefer the latest clean-metadata version constant here.

Line 251 hardcodes CLEAN_METADATA_VERSION_2 for the empty-clean path. The normal path goes through CleanerUtils.convertCleanMetadata(...), so this branch will drift the next time clean metadata is version-bumped.

♻️ Minimal alignment
-import static org.apache.hudi.common.util.CleanerUtils.CLEAN_METADATA_VERSION_2;
+import static org.apache.hudi.common.util.CleanerUtils.LATEST_CLEAN_METADATA_VERSION;
...
-        .setVersion(CLEAN_METADATA_VERSION_2)
+        .setVersion(LATEST_CLEAN_METADATA_VERSION)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java`
around lines 244 - 255, The empty-clean branch in createEmptyCleanMetadata
hardcodes CLEAN_METADATA_VERSION_2 which will drift when versions bump; replace
that hardcoded constant by using the same version source as the normal path —
e.g., call CleanerUtils.convertCleanMetadata(...) to produce canonical metadata
(or retrieve the current clean-metadata version via CleanerUtils’ public
accessor) and use that value instead of CLEAN_METADATA_VERSION_2 so both paths
stay aligned.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java`:
- Around line 247-257: Add explicit validation for
MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS by enforcing that the value is either -1 or
>= 0; in the builder setter method withMaxDurationToCreateEmptyClean(...) validate
the incoming long and throw an IllegalArgumentException for invalid values, and
add the same check in the build() method to catch property-file or deserialized
configurations before constructing the HoodieCleanConfig instance.

In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java`:
- Around line 218-228: The code in CleanActionExecutor should not treat
cleanStats.isEmpty() alone as the signal for an empty clean because partition
deletions (tracked on the CleanerPlan) are real work; change the conditional so
that createEmptyCleanMetadata(...) is used only when both cleanStats.isEmpty()
AND there are no partition deletes recorded on the cleanerPlan (e.g., check
cleanerPlan.getPartitionsToBeDeleted() or equivalent), otherwise call
CleanerUtils.convertCleanMetadata(...) as before (passing
inflightInstant.requestedTime() and Option.of(timer.endTimer())) so
partition-delete metadata is preserved; update the logic around cleanStats,
cleanerPlan, createEmptyCleanMetadata, and CleanerUtils.convertCleanMetadata
accordingly.

In
`@hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java`:
- Around line 893-894: The test creates a write client just to call savepoint
via getHoodieWriteClient(config).savepoint(...) but never closes that client;
change the code to obtain the HoodieWriteClient instance (e.g.,
HoodieWriteClient client = getHoodieWriteClient(config)), call
client.savepoint(fourthCommitTs, "user", "comment"), and then close the client
(client.close()) or use a try-with-resources block to ensure the write client is
always closed to release background resources.

In
`@hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieCleanerTestBase.java`:
- Around line 103-109: The overload of runCleaner creates a SparkRDDWriteClient
via getHoodieWriteClient(config) and returns without closing it; update this
method (the runCleaner overload that builds cleanInstantTs) to ensure the
allocated writeClient is closed in all cases by calling writeClient.close() (or
try-with-resources/try-finally) after delegating to the other runCleaner(String
cleanInstantTs, SparkRDDWriteClient<?> writeClient, ...) so the client is always
closed even on exceptions; reference the methods/variables: runCleaner(...),
getHoodieWriteClient, writeClient, makeNewCommitTime, and ensure the close
happens after runCleaner returns or in finally.

---

Nitpick comments:
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java`:
- Around line 244-255: The empty-clean branch in createEmptyCleanMetadata
hardcodes CLEAN_METADATA_VERSION_2 which will drift when versions bump; replace
that hardcoded constant by using the same version source as the normal path —
e.g., call CleanerUtils.convertCleanMetadata(...) to produce canonical metadata
(or retrieve the current clean-metadata version via CleanerUtils’ public
accessor) and use that value instead of CLEAN_METADATA_VERSION_2 so both paths
stay aligned.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f2e6c2ce-8bde-4af9-9312-b151090a4f18

📥 Commits

Reviewing files that changed from the base of the PR and between 35e2bbf and 56371a0.

📒 Files selected for processing (8)
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java
  • hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/action/TestCleanPlanner.java
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/TestCleaner.java
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieCleanerTestBase.java

Comment on lines +247 to +257
public static final ConfigProperty<Long> MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS = ConfigProperty
.key("hoodie.write.empty.clean.create.duration.ms")
.defaultValue(-1L)
.markAdvanced()
.withDocumentation("In some cases empty clean commit needs to be created to ensure the clean planner "
+ "does not look through entire dataset if there are no clean plans. This is possible for append-only "
+ "dataset. Also, for these datasets we cannot ignore clean completely since in the future there could "
+ "be upsert or replace operations. By creating empty clean commit, earliest_commit_to_retain value "
+ "will be updated so that now clean planner can only check for partitions that are modified after the "
+ "last empty clean's earliest_commit_toRetain value there by optimizing the clean planning");


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Usages of empty-clean duration config =="
rg -n -C3 --type=java 'MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS|maxDurationToCreateEmptyCleanMs|withMaxDurationToCreateEmptyClean'

echo
echo "== Inspect CleanPlanActionExecutor logic =="
fd -i "CleanPlanActionExecutor.java" --exec sed -n '1,260p' {}

echo
echo "== Inspect validations in clean/write config =="
fd -i "HoodieCleanConfig.java" --exec rg -n -C3 'MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS|withMaxDurationToCreateEmptyClean|build\(' {}
fd -i "HoodieWriteConfig.java" --exec rg -n -C3 'maxDurationToCreateEmptyCleanMs|validate\(' {}

Repository: yihua/hudi

Length of output: 37609


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check for all maxDurationToCreateEmptyCleanMs accesses ==="
rg -n 'maxDurationToCreateEmptyCleanMs' --type=java

echo
echo "=== Verify no other executors or code paths use this config ==="
rg -n 'MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS' --type=java

echo
echo "=== Check if this config can be loaded from properties directly ==="
rg -n 'fromProperties\|fromFile' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java | head -20

Repository: yihua/hudi

Length of output: 1473


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check HoodieWriteConfig initialization and getLong method ==="
rg -n 'public long getLong|private long getLong' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java -A 5 | head -30

echo
echo "=== Check if HoodieCleanConfig has any factory methods ==="
fd -i "HoodieCleanConfig.java" --exec grep -n 'public static\|public HoodieCleanConfig' {} | head -20

echo
echo "=== Verify the build() method in HoodieCleanConfig ==="
fd -i "HoodieCleanConfig.java" --exec sed -n '433,445p' {}

Repository: yihua/hudi

Length of output: 119


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Find the getLong implementation in config hierarchy ==="
fd -i "HoodieConfig.java" --exec grep -n 'public long getLong' {} 

echo
echo "=== Check HoodieWriteConfig build method and validation ==="
fd -i "HoodieWriteConfig.java" --exec sed -n '3789,3800p' {}

echo
echo "=== Full HoodieCleanConfig build method ==="
fd -i "HoodieCleanConfig.java" --exec sed -n '433,450p' {}

echo
echo "=== Check if value is parsed/validated anywhere in build or getter ==="
rg -n 'parseLong|Long.parseLong' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/ --type=java | grep -i 'empty\|clean' | head -10

Repository: yihua/hudi

Length of output: 113


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Search for AbstractHoodieConfig or base config class ==="
fd 'AbstractHoodieConfig.java|HoodieConfig.java' --type f

echo
echo "=== Get full HoodieCleanConfig build method ==="
cat hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java | sed -n '433,445p'

echo
echo "=== Check all references to getLong in config directory ==="
find hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config -name '*.java' -exec grep -l 'getLong' {} \;

echo
echo "=== Search for where config value is actually retrieved (check if Long.parseLong is used) ==="
rg 'Long.parseLong.*MAX_DURATION|getLong.*MAX_DURATION' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/ --type=java

Repository: yihua/hudi

Length of output: 1138


🏁 Script executed:

#!/bin/bash
set -euoripefail

echo "=== Check HoodieConfig.getLong implementation ==="
sed -n '1,500p' hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java | grep -A 10 'public.*getLong'

echo
echo "=== Verify how getLong handles Long.parseLong ==="
rg -n 'Long.parseLong' hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java -A 3

echo
echo "=== Check if validation happens on config build in HoodieWriteConfig ==="
sed -n '3789,3810p' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java

echo
echo "=== Verify if properties file can bypass builder validation ==="
rg -n 'MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java -B 5 -A 5 | head -30

Repository: yihua/hudi

Length of output: 728


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check HoodieConfig.getLong implementation ==="
sed -n '1,500p' hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java | grep -A 10 'public.*getLong'

echo
echo "=== Verify how getLong handles Long.parseLong ==="
rg -n 'Long.parseLong' hudi-common/src/main/java/org/apache/hudi/common/config/HoodieConfig.java -A 3

echo
echo "=== Check if validation happens on config build in HoodieWriteConfig ==="
sed -n '3789,3810p' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java

echo
echo "=== Verify if properties file can bypass builder validation ==="
rg -n 'MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java -B 5 -A 5 | head -30

Repository: yihua/hudi

Length of output: 3584


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Check HoodieWriteConfig.validate() method to see if it validates MAX_DURATION ==="
rg -n 'private void validate' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java -A 100 | head -150

Repository: yihua/hudi

Length of output: 5864


Consider adding validation for MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS to reject invalid ranges.

Line 429 accepts any long value without validation. While downstream code at line 251 safely gates the feature with config.maxDurationToCreateEmptyCleanMs() > 0, adding explicit validation would improve clarity and prevent silent misconfigurations. Enforce duration == -1 || duration >= 0 either in the builder setter (line 428) or the build() method (line 433) to match the documented semantics.

The proposed fix remains valid:

🔧 Suggested fix
    public HoodieCleanConfig.Builder withMaxDurationToCreateEmptyClean(long duration) {
+     if (duration < -1) {
+       throw new IllegalArgumentException(
+           "hoodie.write.empty.clean.create.duration.ms must be -1 (disabled) or >= 0");
+     }
      cleanConfig.setValue(MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS, String.valueOf(duration));
      return this;
    }

Also applies to: 433-439 (add similar validation to build() method to catch property-file configurations)
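
The builder check above does not cover values loaded straight from a properties file. A minimal sketch of the same range check applied to a raw property value follows; it uses java.util.Properties as a stand-in for Hudi's config classes, and the class and method names are hypothetical. Only the key string and the -1 default come from the config definition quoted above.

```java
import java.util.Properties;

// Hypothetical validator; not part of HoodieCleanConfig.
final class EmptyCleanDurationValidator {
  static final String KEY = "hoodie.write.empty.clean.create.duration.ms";

  static long validate(Properties props) {
    // Fall back to the documented default of -1 (feature disabled).
    long duration = Long.parseLong(props.getProperty(KEY, "-1"));
    if (duration < -1) {
      throw new IllegalArgumentException(
          KEY + " must be -1 (disabled) or >= 0, got " + duration);
    }
    return duration;
  }
}
```

Calling a check of this kind from build() would reject a value such as -5 regardless of how the property was set.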

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java`
around lines 247 - 257, Add explicit validation for
MAX_DURATION_TO_CREATE_EMPTY_CLEAN_MS by enforcing that the value is either -1 or
>= 0; in the builder setter method withMaxDurationToCreateEmptyClean(...) validate
the incoming long and throw an IllegalArgumentException for invalid values, and
add the same check in the build() method to catch property-file or deserialized
configurations before constructing the HoodieCleanConfig instance.

Comment on lines +218 to +228
table.getMetaClient().reloadActiveTimeline();
HoodieCleanMetadata metadata;
if (cleanStats.isEmpty()) {
-return HoodieCleanMetadata.newBuilder().build();
+metadata = createEmptyCleanMetadata(cleanerPlan, inflightInstant, timer.endTimer());
} else {
metadata = CleanerUtils.convertCleanMetadata(
inflightInstant.requestedTime(),
Option.of(timer.endTimer()),
cleanStats,
cleanerPlan.getExtraMetadata()
);

⚠️ Potential issue | 🟠 Major

Don't use cleanStats.isEmpty() as the empty-clean signal.

Line 220 now treats every stat-less execution as a no-op clean, but clean() still deletes partitionsToBeDeleted without emitting a HoodieCleanStat for them. After CleanPlanActionExecutor started treating partition deletes as real work, a partition-only clean will land here, get serialized as an empty clean, and lose its partition-delete metadata.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java`
around lines 218 - 228, The code in CleanActionExecutor should not treat
cleanStats.isEmpty() alone as the signal for an empty clean because partition
deletions (tracked on the CleanerPlan) are real work; change the conditional so
that createEmptyCleanMetadata(...) is used only when both cleanStats.isEmpty()
AND there are no partition deletes recorded on the cleanerPlan (e.g., check
cleanerPlan.getPartitionsToBeDeleted() or equivalent), otherwise call
CleanerUtils.convertCleanMetadata(...) as before (passing
inflightInstant.requestedTime() and Option.of(timer.endTimer())) so
partition-delete metadata is preserved; update the logic around cleanStats,
cleanerPlan, createEmptyCleanMetadata, and CleanerUtils.convertCleanMetadata
accordingly.
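
The combined condition described in this comment can be sketched as a small predicate. This is an illustration under assumptions rather than the fix itself: the class and method names are hypothetical, and plain lists stand in for the HoodieCleanStat list and the plan's partitionsToBeDeleted field.

```java
import java.util.List;

// Hypothetical predicate; plain lists stand in for Hudi types.
final class EmptyCleanSignal {

  static boolean isTrulyEmptyClean(List<?> cleanStats,
                                   List<String> partitionsToBeDeleted) {
    // Partition deletes are real work even though they produce no clean stats.
    boolean noPartitionDeletes =
        partitionsToBeDeleted == null || partitionsToBeDeleted.isEmpty();
    return cleanStats.isEmpty() && noPartitionDeletes;
  }
}
```

With this reading, a partition-only clean returns false and would fall through to the regular metadata conversion path instead of being recorded as an empty clean.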


2 participants