[OSS PR #18295] feat(metadata): Allow users to safely execute compaction plans on metadata table concurrently through a table service platform (rather than only inline during write)#27
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughMultiple features and refactors across modules: clustering-expiration/heartbeat rollback, metadata-table service delegation, distributed metric registries, Flink reader refactors with global record-limit push-down, VECTOR nesting validation, compaction/log-compaction helpers, lock-config derivation, catalog-backed partition listing, CDC iterator/image refactor, many handler/factory abstractions, and extensive tests. Changes
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
|
There was a problem hiding this comment.
Actionable comments posted: 14
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java (1)
1352-1359:⚠️ Potential issue | 🟠 MajorValidate
table.service.manager.actionsfor property-based builders too.Only
withTableServiceManagerActions()callsvalidateTableServiceManagerActions(...). Values supplied throughfromFile(),fromProperties(), orwithProperties()bypass validation entirely, sobuild()currently accepts invalid action lists and pushes the failure downstream.Fail fast during build
public HoodieMetadataConfig build() { metadataConfig.setDefaultValue(ENABLE, getDefaultMetadataEnable(engineType)); metadataConfig.setDefaultValue(ENABLE_METADATA_INDEX_COLUMN_STATS, getDefaultColStatsEnable(engineType)); metadataConfig.setDefaultValue(SECONDARY_INDEX_ENABLE_PROP, getDefaultSecondaryIndexEnable(engineType)); metadataConfig.setDefaultValue(STREAMING_WRITE_ENABLED, getDefaultForStreamingWriteEnabled(engineType)); // fix me: disable when schema on read is enabled. metadataConfig.setDefaults(HoodieMetadataConfig.class.getName()); + String actions = metadataConfig.getString(TABLE_SERVICE_MANAGER_ACTIONS); + if (StringUtils.nonEmpty(actions)) { + validateTableServiceManagerActions(actions); + } return metadataConfig; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java` around lines 1352 - 1359, build() lets property-based builders bypass validation of table.service.manager.actions because only withTableServiceManagerActions() invokes validateTableServiceManagerActions(...); fix by invoking validateTableServiceManagerActions(...) in HoodieMetadataConfig.build() for any actions coming from properties/file defaults as well—retrieve the configured table.service.manager.actions value (the same key used by withTableServiceManagerActions), call validateTableServiceManagerActions(actions) before returning metadataConfig, and ensure the validation path covers values supplied via fromFile(), fromProperties(), and withProperties(), not just the fluent withTableServiceManagerActions() setter.
🟡 Minor comments (7)
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelationV1.scala-308-309 (1)
308-309:⚠️ Potential issue | 🟡 MinorDuplicate log statement.
The "Additional Filters to be applied" message is logged twice - once at line 244 and again at line 309. Remove the duplicate to avoid log noise.
Suggested fix
scanDf = IncrementalRelationUtil.filterRequiredColumnsFromDF(scanDf, requiredColumns, metaClient) - log.info("Additional Filters to be applied to incremental source are :" + filters.mkString("Array(", ", ", ")")) filters.foldLeft(scanDf)((e, f) => e.filter(f)).rdd🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelationV1.scala` around lines 308 - 309, Remove the duplicate log.info call that prints "Additional Filters to be applied..." after calling IncrementalRelationUtil.filterRequiredColumnsFromDF; keep the earlier log emitted at the original location (around the previous log at line ~244). Specifically, delete the log.info(...) invocation that uses filters.mkString("Array(", ", ", ")") in IncrementalRelationV1.scala (the one immediately following the scanDf = IncrementalRelationUtil.filterRequiredColumnsFromDF(...) line) so the message is logged only once.hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/split/TestDefaultHoodieSplitProvider.java-567-568 (1)
567-568:⚠️ Potential issue | 🟡 MinorPin the bucket config in this test.
These assertions now depend on the default value of
FlinkOptions.BUCKET_INDEX_NUM_BUCKETS. If that default changes, this test can fail for the wrong reason. Set the option explicitly here so the bucket→task mapping stays stable.Suggested test hardening
+import org.apache.hudi.configuration.FlinkOptions; ... - HoodieSplitBucketAssigner assigner = - new HoodieSplitBucketAssigner(4, new Configuration()); + Configuration conf = new Configuration(); + conf.set(FlinkOptions.BUCKET_INDEX_NUM_BUCKETS, 4); + HoodieSplitBucketAssigner assigner = + new HoodieSplitBucketAssigner(4, conf);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/split/TestDefaultHoodieSplitProvider.java` around lines 567 - 568, The test relies on the default for FlinkOptions.BUCKET_INDEX_NUM_BUCKETS which makes it brittle; instead create a Configuration, set FlinkOptions.BUCKET_INDEX_NUM_BUCKETS explicitly (e.g. to 4) and pass that Configuration into the HoodieSplitBucketAssigner constructor (HoodieSplitBucketAssigner) so the bucket→task mapping is stable regardless of global defaults.hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/split/assign/TestHoodieSplitBucketAssigner.java-177-194 (1)
177-194:⚠️ Potential issue | 🟡 MinorThis test never proves the assignments actually differ.
"country=DE"only guarantees a non-zerohashCode, not a non-zerohashCode() % parallelism. If that modulo is0,taskWith4andtaskWith8can be identical and this still passes, so the behavior promised by the test name is not covered. Pick a partition with a known modulo and assert the two task IDs are different, or rename the test.Suggested assertion fix
+import static org.junit.jupiter.api.Assertions.assertNotEquals; ... - String partition = "country=DE"; // non-zero hashCode guaranteed + String partition = "p1"; // hashCode = 3521, so hashCode % 5 == 1 ... assertEquals(expectedTaskId(partition, bucket, 4, parallelism), taskWith4); assertEquals(expectedTaskId(partition, bucket, 8, parallelism), taskWith8); + assertNotEquals(taskWith4, taskWith8, + "Changing numBuckets should change the assignment for this partition");🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/split/assign/TestHoodieSplitBucketAssigner.java` around lines 177 - 194, The test testDifferentNumBucketsProduceDifferentAssignments currently can pass even when assignments are identical because "country=DE" may produce the same task id after modulo; update the test so it actually proves different assignments by choosing a partition value with a known hashCode behavior or by explicitly asserting the two assignments differ. Concretely, keep the existing calls to new HoodieSplitBucketAssigner(parallelism, conf4).assign(split(...)) and conf8, then add an assertion like assertNotEquals(taskWith4, taskWith8) (or replace the partition with a literal whose hashCode % parallelism yields different expectedTaskId values) in addition to the existing expectedTaskId checks; reference the test method testDifferentNumBucketsProduceDifferentAssignments, the HoodieSplitBucketAssigner constructor, the assign(...) call, the split(...) helper, and expectedTaskId(...) when making the change.hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java-3244-3247 (1)
3244-3247:⚠️ Potential issue | 🟡 MinorAssert the filtered row, not just the row count.
This passes even if the engine applies
LIMIT 1before the partition filter, because any single row satisfies the assertion. Please also assert that the returned row belongs topar1so the test actually protects filter+limit push-down behavior.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java` around lines 3244 - 3247, Update the assertion for the query result stored in result1 to verify the filtered row's partition value instead of only its count: after executing tableEnv.sqlQuery("select * from t1 where `partition` = 'par1' limit 1").execute().collect() and converting to result1, assert that result1 has size 1 and that the single Row in result1 has its `partition` field equal to "par1" (or assert the expected id value for that partition), so the test verifies filter+limit push-down rather than just row count.hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieMetadataWriteUtils.java-163-183 (1)
163-183:⚠️ Potential issue | 🟡 MinorAssert the streaming-write flag in this test.
testCreateMetadataWriteConfigForcesStreamingWritesOffWithMultiWritercurrently only re-validates the failed-write policy, concurrency mode, and lock provider. IfSTREAMING_WRITE_ENABLEDregresses back totrue, this test still passes. Please add an explicit assertion for the MDT streaming-write setting here.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieMetadataWriteUtils.java` around lines 163 - 183, The test testCreateMetadataWriteConfigForcesStreamingWritesOffWithMultiWriter is missing an explicit assertion that streaming writes are disabled on the metadata config; after obtaining metadataWriteConfig from HoodieMetadataWriteUtils.createMetadataWriteConfig, add an assertion that the metadataWriteConfig's metadata config streaming flag is false (e.g., assertFalse(metadataWriteConfig.getMetadataConfig().isStreamingWriteEnabled()) or equivalent) so the test fails if STREAMING_WRITE_ENABLED regresses to true.hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java-279-285 (1)
279-285:⚠️ Potential issue | 🟡 MinorReject negative clustering expiration thresholds.
EXPIRATION_THRESHOLD_MINScurrently accepts anylong. A negative value effectively removes the age guardrail and makes heartbeat-expired clustering instants immediately rollback-eligible. Please validate this as non-negative inBuilder.validate().🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java` around lines 279 - 285, The EXPIRATION_THRESHOLD_MINS ConfigProperty currently allows negative values; update HoodieClusteringConfig.Builder.validate() to ensure the value retrieved from EXPIRATION_THRESHOLD_MINS is non-negative and throw a clear IllegalArgumentException (or similar) if it is negative. Locate the check in Builder.validate(), read the long via EXPIRATION_THRESHOLD_MINS.get(config) (or equivalent accessor), and add a guard like if (value < 0) throw new IllegalArgumentException("hoodie.clustering.expiration.threshold.mins must be non-negative: " + value) so invalid configs are rejected early.hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestAbstractSplitReaderFunction.java-213-245 (1)
213-245:⚠️ Potential issue | 🟡 MinorAssert cross-instance non-identity in the “independence” tests.
These tests only prove that each instance is stable across repeated calls. A broken implementation that shares one static
Configuration/HoodieWriteConfigacross every reader would still pass.Suggested test tightening
import static org.junit.jupiter.api.Assertions.assertFalse; import static org.junit.jupiter.api.Assertions.assertNotNull; +import static org.junit.jupiter.api.Assertions.assertNotSame; import static org.junit.jupiter.api.Assertions.assertSame; import static org.junit.jupiter.api.Assertions.assertTrue; @@ assertNotNull(hadoopConf1); assertNotNull(hadoopConf2); + assertNotSame(hadoopConf1, hadoopConf2, + "distinct reader instances should not share the same hadoopConf singleton"); // Each instance must keep returning its own stable singleton. assertSame(hadoopConf1, fn1.hadoopConfForTest(), "fn1's hadoopConf must remain the same singleton across calls"); @@ assertNotNull(wc1); assertNotNull(wc2); + assertNotSame(wc1, wc2, + "distinct reader instances should not share the same writeConfig singleton"); // Each instance must keep returning its own stable singleton. assertSame(wc1, fn1.writeConfigForTest(), "fn1's writeConfig must remain the same singleton across calls");🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestAbstractSplitReaderFunction.java` around lines 213 - 245, The tests testTwoInstancesHaveStableIndependentHadoopConf and testTwoInstancesHaveStableIndependentWriteConfig currently only assert per-instance stability; update each to also assert the two instances do not share the same object by adding an assertNotSame between fn1.hadoopConfForTest() and fn2.hadoopConfForTest() in the Hadoop config test, and an assertNotSame between fn1.writeConfigForTest() and fn2.writeConfigForTest() in the write config test so that hadoopConfForTest and writeConfigForTest prove cross-instance independence as well as stability.
🧹 Nitpick comments (7)
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala (1)
2605-2617: Strengthen this regression test by asserting_hoodie_partition_pathvalues explicitly.Line 2606 selects
_hoodie_partition_path, but current assertions only check_row_keyanddata. Adding value checks here will validate the duplicate-meta-field behavior end-to-end.Proposed test assertion enhancement
assertEquals("row1", results(0).getAs[String]("_row_key")) + assertEquals("src_partition1", results(0).getAs[String]("_hoodie_partition_path")) assertEquals("value1", results(0).getAs[String]("data")) assertEquals("row2", results(1).getAs[String]("_row_key")) + assertEquals("src_partition1", results(1).getAs[String]("_hoodie_partition_path")) assertEquals("value2", results(1).getAs[String]("data")) assertEquals("row3", results(2).getAs[String]("_row_key")) + assertEquals("src_partition2", results(2).getAs[String]("_hoodie_partition_path")) assertEquals("value3", results(2).getAs[String]("data"))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala` around lines 2605 - 2617, Add explicit assertions for the _hoodie_partition_path column in the TestCOWDataSource regression check: after collecting incrementalDf into results and before or alongside the existing asserts on "_row_key" and "data", assert that results(0).getAs[String]("_hoodie_partition_path"), results(1).getAs[String]("_hoodie_partition_path"), and results(2).getAs[String]("_hoodie_partition_path") equal the expected partition values for row1, row2, and row3 respectively; update the test near the incrementalDf.select(...).collect() block (references: incrementalDf, results) to include these checks so duplicate-meta-field behavior is validated end-to-end.hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/util/IncrementalRelationUtil.scala (3)
28-75: Consider using explicit return type annotation.The
getPrunedSchemamethod lacks an explicit return type. While Scala can infer it, adding: StructTypeimproves code readability and API clarity.Suggested fix
def getPrunedSchema(requiredColumns: Array[String], usedSchema: StructType, - metaClient: HoodieTableMetaClient) = { + metaClient: HoodieTableMetaClient): StructType = { var prunedSchema = StructType(Seq())🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/util/IncrementalRelationUtil.scala` around lines 28 - 75, Add an explicit return type to the getPrunedSchema method signature: declare it as returning org.apache.spark.sql.types.StructType (or simply StructType if the import is present) so the signature becomes def getPrunedSchema(...): StructType = { ... }; leave the implementation unchanged—only update the method declaration to include the return type for clarity.
77-109: Consider using explicit return type annotation.Similarly,
filterRequiredColumnsFromDFwould benefit from an explicit return type for consistency and clarity.Suggested fix
def filterRequiredColumnsFromDF(df: DataFrame, requiredColumns: Array[String], - metaClient: HoodieTableMetaClient) = { + metaClient: HoodieTableMetaClient): DataFrame = { var updatedDF = df🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/util/IncrementalRelationUtil.scala` around lines 77 - 109, The method filterRequiredColumnsFromDF lacks an explicit return type; update its signature to declare the return type (DataFrame) — e.g., change def filterRequiredColumnsFromDF(df: DataFrame, requiredColumns: Array[String], metaClient: HoodieTableMetaClient): DataFrame = { ... } — so the compiler and readers see the intended return type clearly while leaving the method body unchanged.
82-84: RedundanttoDF()calls beforedrop().The
toDF()calls are unnecessary sincedrop()already returns aDataFrame. This applies to lines 83, 92, and 103.Suggested simplification
// If _hoodie_commit_time is not part of the required columns remove it from the resultant Dataframe if (!requiredColumns.contains(HoodieRecord.COMMIT_TIME_METADATA_FIELD)) { - updatedDF = updatedDF.toDF().drop(HoodieRecord.COMMIT_TIME_METADATA_FIELD) + updatedDF = updatedDF.drop(HoodieRecord.COMMIT_TIME_METADATA_FIELD) }Apply similar changes to lines 92 and 103.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/util/IncrementalRelationUtil.scala` around lines 82 - 84, Remove the redundant toDF() calls before drop on updatedDF in IncrementalRelationUtil.scala: instead of updatedDF = updatedDF.toDF().drop(HoodieRecord.COMMIT_TIME_METADATA_FIELD) (and the similar lines for HoodieRecord.COMMIT_SEQNO_METADATA_FIELD and HoodieRecord.RECORD_KEY_METADATA_FIELD), call drop directly on updatedDF (updatedDF = updatedDF.drop(...)) because drop already returns a DataFrame; update the three occurrences referencing updatedDF and those HoodieRecord.* metadata fields.hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala (1)
1022-1025: Make nested VECTOR error matching less brittle.Matching the full literal message is fragile; prefer checking stable tokens (for example,
"Nested VECTOR columns"and"top-level field").💡 Suggested refactor
- private def nestedVectorMessageInCauseChain(ex: Throwable): Boolean = - ex != null && (Option(ex.getMessage).exists(_.contains( - "VECTOR column 'embedding' must be a top-level field. Nested VECTOR columns (inside STRUCT, ARRAY, or MAP) are not supported.")) - || nestedVectorMessageInCauseChain(ex.getCause)) + private def nestedVectorMessageInCauseChain(ex: Throwable): Boolean = { + var cause = ex + while (cause != null) { + val msg = Option(cause.getMessage).getOrElse("") + if (msg.contains("Nested VECTOR columns") && msg.contains("top-level field")) { + return true + } + cause = cause.getCause + } + false + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala` around lines 1022 - 1025, The current nestedVectorMessageInCauseChain method matches the full error string which is brittle; update it to look for stable tokens instead (e.g., check that Option(ex.getMessage).exists(msg => msg.contains("Nested VECTOR columns") && msg.contains("top-level field")) ) while preserving the recursive cause traversal (nestedVectorMessageInCauseChain(ex.getCause)); keep the function name and behavior but replace the exact-literal contains check with the two-token check to make matching resilient to minor wording changes.hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieSplitReaderFunction.java (1)
389-445: Drop or repurpose these duplicated constructor smoke tests.
testConstructorWithEmitDeleteon Line 394 still passesfalse, and the null-schema/default-constructor cases are already covered earlier in this file. This block adds maintenance noise without increasing coverage.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieSplitReaderFunction.java` around lines 389 - 445, These tests duplicate existing constructor coverage for HoodieSplitReaderFunction (testConstructorWithNullTableSchemaThrows, testConstructorWithNullRequiredSchemaThrows, testDefaultConstructor) and add noise; remove the redundant methods testConstructorWithEmitDelete, testConstructorWithNullTableSchemaThrows, testConstructorWithNullRequiredSchemaThrows, and testDefaultConstructor from this block OR repurpose testConstructorWithEmitDelete to actually exercise emit-delete behavior by passing true for the emitDelete parameter and asserting expected behavior/state for emit-delete so the test provides unique coverage.hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieBackedTableMetadataWriter.java (1)
130-185: Please add coverage for the delegatedcompactIfNecessary(...)branch too.These tests exercise delegation for pending services, but the production change also added a second branch inside
compactIfNecessary(...)where scheduling proceeds and inlinecompact()/logCompact()are skipped. A regression there would still pass this suite.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieBackedTableMetadataWriter.java` around lines 130 - 185, Add a unit test that targets the delegated branch inside HoodieBackedTableMetadataWriter.compactIfNecessary(...) by mocking BaseHoodieWriteClient and HoodieTableMetaClient so that shouldDelegateToTableServiceManager(...) returns true for compaction (and/or logcompaction) while compactIfNecessary is triggered (e.g., set pending instants via metaClient.getActiveTimeline().filterPendingCompactionTimeline()/filterPendingLogCompactionTimeline()). Verify that when delegation is enabled the inline methods writeClient.compact(...) and writeClient.logCompact(...) are NOT invoked and that metaClient.reloadActiveTimeline() is called (or not) as appropriate; use the same pattern as existing tests (mock timelines, set HoodieWriteConfig TABLE_SERVICE_MANAGER_* props, call compactIfNecessary via HoodieBackedTableMetadataWriter) to assert the skip behavior and timeline reload for the delegated branch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java`:
- Around line 635-637: Rework the clustering heartbeat lifecycle in
BaseHoodieTableServiceClient so the executor (the code path that actually starts
cluster execution) claims/starts or refreshes the heartbeat when execution
begins (e.g., at the start of cluster()) rather than when the plan is merely
scheduled; ensure heartbeatClient.stop(...) is executed in a failure-safe path
(finally/failure handling) so stopped on both success and error to avoid orphan
heartbeats blocking expiration; update the symmetric heartbeat handling location
in this class as well (the other clustering heartbeat block) to follow the same
claim/start-at-execution and guaranteed-stop pattern, and keep using
config.isExpirationOfClusteringEnabled() to gate the behavior.
- Around line 1210-1215: The age calculation in hasInstantExpired() uses the
table timeline timezone for currentTimeMs but parses instantTime with a
different/default timezone; update hasInstantExpired() to parse/convert
instantTime using the same ZoneId from
metaClient.getTableConfig().getTimelineTimezone() so currentTimeMs and
instantTimeMs are computed in the same timezone. Locate hasInstantExpired(),
replace the call to
HoodieInstantTimeGenerator.parseDateFromInstantTime(instantTime) with the
overload or conversion that accepts/uses the ZoneId
(metaClient.getTableConfig().getTimelineTimezone().getZoneId()), and ensure both
timestamps are derived using that zone before computing ageMs.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java`:
- Around line 278-317: getLockConfigForBuiltInLockProvider builds lockProps only
from lock-prefixed keys, causing inferred values like ZK_LOCK_KEY (which
defaults from HoodieWriteConfig.TBL_NAME) to be lost; fix by copying any needed
non-prefixed inferred entries into lockProps before building the
HoodieLockConfig. Specifically, in getLockConfigForBuiltInLockProvider (use
dataProps and lockProps), after the provider-specific copyPropsWithPrefix(...)
call but before newBuilder().fromProperties(...).build(), check for keys that
are inferred by providers (e.g., ZK_LOCK_KEY and any other provider-specific
inferred keys) and if absent in lockProps, populate them from
writeConfig.getProps()/HoodieWriteConfig.TBL_NAME or compute the same default
the provider expects so the rebuilt HoodieLockConfig preserves inferred inputs.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/util/DistributedRegistryUtil.java`:
- Around line 39-54: The helper createWrapperFileSystemRegistries only calls
HoodieWrapperFileSystem.setMetricsRegistry(...) when config.isMetricsOn() is
true, leaving process-wide registries from a previous table intact when metrics
are off; modify createWrapperFileSystemRegistries so that in the metrics-off
branch you explicitly clear or replace the registries (e.g., by setting
neutral/no-op or null registries) via
HoodieWrapperFileSystem.setMetricsRegistry(...) so the global state is reset for
the current context; update the branch that currently calls
Registry.getRegistryOfClass(...) (or the else path) to call
HoodieWrapperFileSystem.setMetricsRegistry(...) with the appropriate
cleared/empty Registry instances to ensure metrics isolation across
tables/executors.
In
`@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java`:
- Around line 96-97: DISTRIBUTED_REGISTRY_MAP is currently keyed only by
tableName + "." + registryName which is JVM-global and causes
DistributedRegistry instances to be reused across different Spark applications;
change the cache key to include the Spark application identifier (e.g.,
SparkContext.applicationId or SparkEnv.get.conf / SparkContext.hashCode when
applicationId is unavailable) so each Spark application gets its own
DistributedRegistry; update code paths that construct keys (references:
DISTRIBUTED_REGISTRY_MAP, DistributedRegistry, and register(javaSparkContext))
to build and use the new composite key (applicationId + "." + tableName + "." +
registryName) and ensure entries are looked up/inserted with that composite key
in all places including the other usages around lines 282-289.
In `@hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java`:
- Around line 424-427: The current checks only inspect the immediate non-null
type and validateNoVectorInNestedRecord() only traverses RECORDs, so VECTOR can
still be hidden inside UNION/ARRAY/MAP (e.g., ARRAY<UNION<VECTOR,INT>>). Add a
recursive type-walking check (e.g., containsVectorInType or
containsVectorInNestedType) and call it from the places that currently only
check elementSchema.getNonNullType() (the ARRAY element check around
elementSchema, the MAP value/key checks, and the RECORD child validation
locations referenced) so any VECTOR nested under UNION, ARRAY, MAP, or RECORD
triggers a HoodieSchemaException with the same "VECTOR columns must be top-level
fields" message; update validateNoVectorInNestedRecord() to use this helper
instead of only walking child RECORDs and reuse the helper in the other two
spots (lines around 445-448 and 484-515).
In
`@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/HoodieSourceReader.java`:
- Around line 50-52: The current code creates a RecordLimiter per
HoodieSourceReader from scanContext.getLimit(), which applies the full limit per
subtask and overproduces when parallel; fix by computing a per-subtask quota and
passing that to HoodieSourceSplitReader instead of the full limit: if
scanContext.getLimit() == RecordLimiter.NO_LIMIT keep Option.empty(), otherwise
compute int parallelism = context.getNumberOfParallelSubtasks() (or equivalent
method on context), compute perSubtaskLimit = Math.max(1,
(scanContext.getLimit() + parallelism - 1) / parallelism) (or use floor/ceil
policy you prefer), and create Option.of(new RecordLimiter(perSubtaskLimit))
when constructing HoodieSourceSplitReader; update the constructor call in
HoodieSourceReader accordingly so split readers get the divided limit rather
than the global limit.
In
`@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/RecordLimiter.java`:
- Around line 53-69: The current wrap() implementation in
RecordsWithSplitIds.nextRecordFromSplit uses separate totalReadCount.get() >=
limit and totalReadCount.incrementAndGet() calls which can race and overshoot
the limit and also treat NO_LIMIT (-1) as exhausted; change the logic to treat
negative limit as unlimited and use an atomic combine-and-check (e.g., call
totalReadCount.incrementAndGet() or getAndIncrement() and base the decision on
the returned value) so a thread only returns a record when the atomic result
stays within the limit, otherwise discard the record and return null; update the
wrapped method in wrap() (reference: RecordsWithSplitIds.nextRecordFromSplit,
totalReadCount, limit, wrap) accordingly.
In
`@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/split/assign/HoodieSplitBucketAssigner.java`:
- Around line 44-50: Add a legacy overload public HoodieSplitBucketAssigner(int
parallelism) that preserves the old public API and delegates to the new
constructor; implement it to build a legacy-compatible Configuration (e.g., a
default Configuration populated with legacy/default
BUCKET_INDEX_PARTITION_EXPRESSIONS, BUCKET_INDEX_PARTITION_RULE, and
BUCKET_INDEX_NUM_BUCKETS) and call this(parallelism, legacyConf). Update
references to the new fields (numBucketsFunction and partitionIndexFunc) remain
unchanged; ensure the legacy ctor still validates parallelism the same way
before delegation.
In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieCdcSplitReaderFunction.java`:
- Around line 168-219: The tests currently only call the same 6-arg constructor
and never exercise the limit logic; update the tests to invoke the 7-arg
constructor of HoodieCdcSplitReaderFunction with explicit limit values
(positive, zero, and -1 sentinel) and assert the limit behavior: for example,
construct HoodieCdcSplitReaderFunction(..., limit=5),
HoodieCdcSplitReaderFunction(..., limit=0) and HoodieCdcSplitReaderFunction(...,
limit=-1) and verify the class uses the provided limit by either asserting a
public/package-private getter (add getLimit() if missing) or, if no getter
exists, read the private field 'limit' via reflection or assert observable
behavior (e.g., that record iteration is wrapped/truncated when limit>0 and not
wrapped for limit<=0); update the test methods (testConstructorWithLimit,
testConstructorWithLimitAndEmptyFieldTypes,
testDefaultConstructorUsesNoLimitSentinel,
testConstructorWithLimitZeroIsAccepted) to perform these concrete assertions
instead of only assertNotNull.
In
`@hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/ParquetReaderIterator.java`:
- Around line 48-65: Ensure the iterator's closed flag is set on every teardown
path before attempting to close the underlying reader: in
ParquetReaderIterator.hasNext() and any exception handlers set closed = true
prior to calling parquetReader.close() (or
FileIOUtils.closeQuietly(parquetReader)), so that failures during read() or
close() cannot leave closed=false; update all places referencing
parquetReader.close(), FileIOUtils.closeQuietly(parquetReader), or throwing
HoodieException to flip the closed field first (use the existing closed field,
parquetReader, hasNext(), next and close() symbols) and then attempt to close
the reader.
In `@hudi-io/src/main/java/org/apache/hudi/common/metrics/Registry.java`:
- Around line 105-109: When an existing registry's implementation class doesn't
match the requested clazz, fail fast instead of logging and returning the wrong
instance: replace the LOG.error(...) branch in the Registry lookup (the code
that inspects registry, registryName and clazz) with a thrown exception (e.g.,
throw new IllegalStateException or IllegalArgumentException) that includes
registryName, registry.getClass().getName(), and the requested clazz so callers
cannot continue with an incompatible registry implementation.
- Around line 152-155: setRegistries is re-keying table-scoped registries
incorrectly: when iterating registries it uses makeKey("", registry.getName())
which produces "::"+registry.getName(), but getRegistryOfClass expects keys
created by makeKey(tableName, registryName) (tableName+"::"+registryName) and
registry.getName() already contains "tableName.registryName"; fix setRegistries
by parsing registry.getName() into its tableName and registryName parts (or by
calling the same key-construction logic used by getRegistryOfClass) and insert
into REGISTRY_MAP with makeKey(tableName, registryName) so lookups by
getRegistryOfClass find the propagated instance (update code in setRegistries to
derive tableName and registryName from registry.getName() or use registry
methods that return them).
In
`@hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java`:
- Around line 332-350: The current rollbackFailedClusteringForPartitions method
aborts on the first ineligible instant by throwing from the
BaseHoodieTableServiceClient.isClusteringInstantEligibleForRollback check,
causing partial rollback; change this to skip ineligible instants instead of
throwing: for each instant from getPendingClusteringInstantsForPartitions, call
isClusteringInstantEligibleForRollback and if it returns false, log a
warning/info referencing the instant.requestedTime() and continue to the next
instant (optionally collect skipped instants for metrics), otherwise proceed
with metaClient.reloadActiveTimeline(), re-check containsInstant and call
client.rollback(instant.requestedTime()) with LOG.info around successful
rollbacks; do not throw during per-instant processing so other eligible instants
can still be cleaned up.
---
Outside diff comments:
In
`@hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java`:
- Around line 1352-1359: build() lets property-based builders bypass validation
of table.service.manager.actions because only withTableServiceManagerActions()
invokes validateTableServiceManagerActions(...); fix by invoking
validateTableServiceManagerActions(...) in HoodieMetadataConfig.build() for any
actions coming from properties/file defaults as well—retrieve the configured
table.service.manager.actions value (the same key used by
withTableServiceManagerActions), call
validateTableServiceManagerActions(actions) before returning metadataConfig, and
ensure the validation path covers values supplied via fromFile(),
fromProperties(), and withProperties(), not just the fluent
withTableServiceManagerActions() setter.
---
Minor comments:
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java`:
- Around line 279-285: The EXPIRATION_THRESHOLD_MINS ConfigProperty currently
allows negative values; update HoodieClusteringConfig.Builder.validate() to
ensure the value retrieved from EXPIRATION_THRESHOLD_MINS is non-negative and
throw a clear IllegalArgumentException (or similar) if it is negative. Locate
the check in Builder.validate(), read the long via
EXPIRATION_THRESHOLD_MINS.get(config) (or equivalent accessor), and add a guard
like if (value < 0) throw new
IllegalArgumentException("hoodie.clustering.expiration.threshold.mins must be
non-negative: " + value) so invalid configs are rejected early.
In
`@hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieMetadataWriteUtils.java`:
- Around line 163-183: The test
testCreateMetadataWriteConfigForcesStreamingWritesOffWithMultiWriter is missing
an explicit assertion that streaming writes are disabled on the metadata config;
after obtaining metadataWriteConfig from
HoodieMetadataWriteUtils.createMetadataWriteConfig, add an assertion that the
metadataWriteConfig's metadata config streaming flag is false (e.g.,
assertFalse(metadataWriteConfig.getMetadataConfig().isStreamingWriteEnabled())
or equivalent) so the test fails if STREAMING_WRITE_ENABLED regresses to true.
In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestAbstractSplitReaderFunction.java`:
- Around line 213-245: The tests testTwoInstancesHaveStableIndependentHadoopConf
and testTwoInstancesHaveStableIndependentWriteConfig currently only assert
per-instance stability; update each to also assert the two instances do not
share the same object by adding an assertNotSame between fn1.hadoopConfForTest()
and fn2.hadoopConfForTest() in the Hadoop config test, and an assertNotSame
between fn1.writeConfigForTest() and fn2.writeConfigForTest() in the write
config test so that hadoopConfForTest and writeConfigForTest prove
cross-instance independence as well as stability.
In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/split/assign/TestHoodieSplitBucketAssigner.java`:
- Around line 177-194: The test
testDifferentNumBucketsProduceDifferentAssignments currently can pass even when
assignments are identical because "country=DE" may produce the same task id
after modulo; update the test so it actually proves different assignments by
choosing a partition value with a known hashCode behavior or by explicitly
asserting the two assignments differ. Concretely, keep the existing calls to new
HoodieSplitBucketAssigner(parallelism, conf4).assign(split(...)) and conf8, then
add an assertion like assertNotEquals(taskWith4, taskWith8) (or replace the
partition with a literal whose hashCode % parallelism yields different
expectedTaskId values) in addition to the existing expectedTaskId checks;
reference the test method testDifferentNumBucketsProduceDifferentAssignments,
the HoodieSplitBucketAssigner constructor, the assign(...) call, the split(...)
helper, and expectedTaskId(...) when making the change.
In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/split/TestDefaultHoodieSplitProvider.java`:
- Around line 567-568: The test relies on the default for
FlinkOptions.BUCKET_INDEX_NUM_BUCKETS which makes it brittle; instead create a
Configuration, set FlinkOptions.BUCKET_INDEX_NUM_BUCKETS explicitly (e.g. to 4)
and pass that Configuration into the HoodieSplitBucketAssigner constructor
(HoodieSplitBucketAssigner) so the bucket→task mapping is stable regardless of
global defaults.
In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java`:
- Around line 3244-3247: Update the assertion for the query result stored in
result1 to verify the filtered row's partition value instead of only its count:
after executing tableEnv.sqlQuery("select * from t1 where `partition` = 'par1'
limit 1").execute().collect() and converting to result1, assert that result1 has
size 1 and that the single Row in result1 has its `partition` field equal to
"par1" (or assert the expected id value for that partition), so the test
verifies filter+limit push-down rather than just row count.
In
`@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelationV1.scala`:
- Around line 308-309: Remove the duplicate log.info call that prints
"Additional Filters to be applied..." after calling
IncrementalRelationUtil.filterRequiredColumnsFromDF; keep the earlier log
emitted at the original location (around the previous log at line ~244).
Specifically, delete the log.info(...) invocation that uses
filters.mkString("Array(", ", ", ")") in IncrementalRelationV1.scala (the one
immediately following the scanDf =
IncrementalRelationUtil.filterRequiredColumnsFromDF(...) line) so the message is
logged only once.
---
Nitpick comments:
In
`@hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieBackedTableMetadataWriter.java`:
- Around line 130-185: Add a unit test that targets the delegated branch inside
HoodieBackedTableMetadataWriter.compactIfNecessary(...) by mocking
BaseHoodieWriteClient and HoodieTableMetaClient so that
shouldDelegateToTableServiceManager(...) returns true for compaction (and/or
logcompaction) while compactIfNecessary is triggered (e.g., set pending instants
via
metaClient.getActiveTimeline().filterPendingCompactionTimeline()/filterPendingLogCompactionTimeline()).
Verify that when delegation is enabled the inline methods
writeClient.compact(...) and writeClient.logCompact(...) are NOT invoked and
that metaClient.reloadActiveTimeline() is called (or not) as appropriate; use
the same pattern as existing tests (mock timelines, set HoodieWriteConfig
TABLE_SERVICE_MANAGER_* props, call compactIfNecessary via
HoodieBackedTableMetadataWriter) to assert the skip behavior and timeline reload
for the delegated branch.
In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieSplitReaderFunction.java`:
- Around line 389-445: These tests duplicate existing constructor coverage for
HoodieSplitReaderFunction (testConstructorWithNullTableSchemaThrows,
testConstructorWithNullRequiredSchemaThrows, testDefaultConstructor) and add
noise; remove the redundant methods testConstructorWithEmitDelete,
testConstructorWithNullTableSchemaThrows,
testConstructorWithNullRequiredSchemaThrows, and testDefaultConstructor from
this block OR repurpose testConstructorWithEmitDelete to actually exercise
emit-delete behavior by passing true for the emitDelete parameter and asserting
expected behavior/state for emit-delete so the test provides unique coverage.
In
`@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/util/IncrementalRelationUtil.scala`:
- Around line 28-75: Add an explicit return type to the getPrunedSchema method
signature: declare it as returning org.apache.spark.sql.types.StructType (or
simply StructType if the import is present) so the signature becomes def
getPrunedSchema(...): StructType = { ... }; leave the implementation
unchanged—only update the method declaration to include the return type for
clarity.
- Around line 77-109: The method filterRequiredColumnsFromDF lacks an explicit
return type; update its signature to declare the return type (DataFrame) — e.g.,
change def filterRequiredColumnsFromDF(df: DataFrame, requiredColumns:
Array[String], metaClient: HoodieTableMetaClient): DataFrame = { ... } — so the
compiler and readers see the intended return type clearly while leaving the
method body unchanged.
- Around line 82-84: Remove the redundant toDF() calls before drop on updatedDF
in IncrementalRelationUtil.scala: instead of updatedDF =
updatedDF.toDF().drop(HoodieRecord.COMMIT_TIME_METADATA_FIELD) (and the similar
lines for HoodieRecord.COMMIT_SEQNO_METADATA_FIELD and
HoodieRecord.RECORD_KEY_METADATA_FIELD), call drop directly on updatedDF
(updatedDF = updatedDF.drop(...)) because drop already returns a DataFrame;
update the three occurrences referencing updatedDF and those HoodieRecord.*
metadata fields.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala`:
- Around line 2605-2617: Add explicit assertions for the _hoodie_partition_path
column in the TestCOWDataSource regression check: after collecting incrementalDf
into results and before or alongside the existing asserts on "_row_key" and
"data", assert that results(0).getAs[String]("_hoodie_partition_path"),
results(1).getAs[String]("_hoodie_partition_path"), and
results(2).getAs[String]("_hoodie_partition_path") equal the expected partition
values for row1, row2, and row3 respectively; update the test near the
incrementalDf.select(...).collect() block (references: incrementalDf, results)
to include these checks so duplicate-meta-field behavior is validated
end-to-end.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala`:
- Around line 1022-1025: The current nestedVectorMessageInCauseChain method
matches the full error string which is brittle; update it to look for stable
tokens instead (e.g., check that Option(ex.getMessage).exists(msg =>
msg.contains("Nested VECTOR columns") && msg.contains("top-level field")) )
while preserving the recursive cause traversal
(nestedVectorMessageInCauseChain(ex.getCause)); keep the function name and
behavior but replace the exact-literal contains check with the two-token check
to make matching resilient to minor wording changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 77107fc0-3ad3-4e3d-be84-bb7c0d616f5e
📒 Files selected for processing (68)
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriterTableVersionSix.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/ScheduleCompactionActionExecutor.javahudi-client/hudi-client-common/src/main/java/org/apache/hudi/util/DistributedRegistryUtil.javahudi-client/hudi-client-common/src/test/java/org/apache/hudi/client/TestBaseHoodieTableServiceClient.javahudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieLockConfig.javahudi-client/hudi-client-common/src/test/java/org/apache/hudi/config/TestHoodieWriteConfig.javahudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieBackedTableMetadataWriter.javahudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieMetadataWriteUtils.javahudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/SparkRDDWriteClient.javahudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.javahudi-client/hudi-spark-client/src/main/java/org/apache/hudi/metrics/DistributedRegistry.javahudi-client/hudi-spark-client/src/main/scala/org/apache/spark/sql/avro/HoodieSparkSchemaConverters.scalahudi-client/hudi-spark-client/src/test/java/org/apache/hudi/metrics/TestDistributedRegistry.javahudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.javahudi-common/src/main/java/org/apache/hudi/common/engine/HoodieEngineContext.javahudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.javahudi-common/src/main/java/org/apache/hudi/common/util/CompactionUtils.javahudi-common/src/test/java/org/apache/hudi/common/config/TestHoodieMetadataConfig.javahudi-common/src/test/java/org/apache/hudi/common/schema/TestHoodieSchema.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/HoodieScanContext.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/HoodieSource.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/HoodieSourceReader.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/HoodieSourceSplitReader.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/RecordLimiter.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/function/AbstractSplitReaderFunction.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/function/HoodieCdcSplitReaderFunction.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/reader/function/HoodieSplitReaderFunction.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/split/assign/HoodieSplitAssigners.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/source/split/assign/HoodieSplitBucketAssigner.javahudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.javahudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestWriteCopyOnWrite.javahudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/TestHoodieSourceSplitReader.javahudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/TestRecordLimiter.javahudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestAbstractSplitReaderFunction.javahudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieCdcSplitReaderFunction.javahudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieSplitReaderFunction.javahudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/split/TestDefaultHoodieSplitProvider.javahudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/split/assign/TestHoodieSplitBucketAssigner.javahudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.javahudi-hadoop-common/src/main/java/org/apache/hudi/common/util/ParquetReaderIterator.javahudi-hadoop-common/src/main/java/org/apache/hudi/hadoop/fs/HoodieWrapperFileSystem.javahudi-hadoop-common/src/test/java/org/apache/hudi/common/util/TestCompactionUtils.javahudi-hadoop-common/src/test/java/org/apache/hudi/common/util/TestParquetReaderIterator.javahudi-io/src/main/java/org/apache/hudi/common/metrics/LocalRegistry.javahudi-io/src/main/java/org/apache/hudi/common/metrics/Registry.javahudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scalahudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelationV1.scalahudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/IncrementalRelationV2.scalahudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/util/IncrementalRelationUtil.scalahudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/HoodieTableChanges.scalahudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/HoodieSqlCommonUtils.scalahudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/streaming/HoodieStreamSourceV1.scalahudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/streaming/HoodieStreamSourceV2.scalahudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scalahudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestIncrementalQueryColumnPruning.scalahudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scalahudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scalahudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/avro/TestSchemaConverters.scalahudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/common/TestInstantTimeValidation.scalahudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.javahudi-utilities/src/test/java/org/apache/hudi/utilities/offlinejob/TestHoodieClusteringJob.java
💤 Files with no reviewable changes (1)
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java
| if (config.isExpirationOfClusteringEnabled()) { | ||
| heartbeatClient.stop(clusteringCommitTime); | ||
| } |
There was a problem hiding this comment.
Rework the clustering heartbeat lifecycle so the executor owns it.
The heartbeat now starts when the plan is scheduled, but this class never reclaims/refreshes it when cluster() actually begins, and stop() only runs on the success path. That makes queued/delegated plans look abandoned before a healthy worker starts them, while failed executions can leave the old heartbeat alive indefinitely and block expiration-based rollback. Mirror the compaction flow: claim/start the heartbeat when execution begins and stop it in a failure-safe path as well.
Also applies to: 739-742
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java`
around lines 635 - 637, Rework the clustering heartbeat lifecycle in
BaseHoodieTableServiceClient so the executor (the code path that actually starts
cluster execution) claims/starts or refreshes the heartbeat when execution
begins (e.g., at the start of cluster()) rather than when the plan is merely
scheduled; ensure heartbeatClient.stop(...) is executed in a failure-safe path
(finally/failure handling) so stopped on both success and error to avoid orphan
heartbeats blocking expiration; update the symmetric heartbeat handling location
in this class as well (the other clustering heartbeat block) to follow the same
claim/start-at-execution and guaranteed-stop pattern, and keep using
config.isExpirationOfClusteringEnabled() to gate the behavior.
| private static boolean hasInstantExpired(HoodieTableMetaClient metaClient, String instantTime, long expirationMins) throws ParseException { | ||
| ZoneId zoneId = metaClient.getTableConfig().getTimelineTimezone().getZoneId(); | ||
| long currentTimeMs = ZonedDateTime.ofInstant(java.time.Instant.now(), zoneId).toInstant().toEpochMilli(); | ||
| long instantTimeMs = HoodieInstantTimeGenerator.parseDateFromInstantTime(instantTime).toInstant().toEpochMilli(); | ||
| long ageMs = currentTimeMs - instantTimeMs; | ||
| return ageMs >= TimeUnit.MINUTES.toMillis(expirationMins); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -e
echo "== HoodieInstantTimeGenerator definitions =="
fd -i 'HoodieInstantTimeGenerator.java' --exec sed -n '1,260p' {}
echo
echo "== parseDateFromInstantTime call sites =="
rg -n --type=java 'parseDateFromInstantTime\s*\('
echo
echo "== timeline timezone usage =="
rg -n --type=java 'getTimelineTimezone\s*\('Repository: yihua/hudi
Length of output: 21136
Fix timezone mismatch in hasInstantExpired() instant age calculation.
The current implementation uses the table's configured timeline timezone to compute currentTimeMs but hardcodes ZoneId.systemDefault() when parsing the instant time via parseDateFromInstantTime(). If these timezones differ, the age calculation produces incorrect results. Align the instant time parsing with the table's timeline timezone:
Original code snippet
private static boolean hasInstantExpired(HoodieTableMetaClient metaClient, String instantTime, long expirationMins) throws ParseException {
ZoneId zoneId = metaClient.getTableConfig().getTimelineTimezone().getZoneId();
long currentTimeMs = ZonedDateTime.ofInstant(java.time.Instant.now(), zoneId).toInstant().toEpochMilli();
long instantTimeMs = HoodieInstantTimeGenerator.parseDateFromInstantTime(instantTime).toInstant().toEpochMilli();
long ageMs = currentTimeMs - instantTimeMs;
return ageMs >= TimeUnit.MINUTES.toMillis(expirationMins);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java`
around lines 1210 - 1215, The age calculation in hasInstantExpired() uses the
table timeline timezone for currentTimeMs but parses instantTime with a
different/default timezone; update hasInstantExpired() to parse/convert
instantTime using the same ZoneId from
metaClient.getTableConfig().getTimelineTimezone() so currentTimeMs and
instantTimeMs are computed in the same timezone. Locate hasInstantExpired(),
replace the call to
HoodieInstantTimeGenerator.parseDateFromInstantTime(instantTime) with the
overload or conversion that accepts/uses the ZoneId
(metaClient.getTableConfig().getTimelineTimezone().getZoneId()), and ensure both
timestamps are derived using that zone before computing ageMs.
| public static void createWrapperFileSystemRegistries(HoodieEngineContext context, HoodieWriteConfig config) { | ||
| if (config.isMetricsOn()) { | ||
| Registry registry; | ||
| Registry registryMeta; | ||
| if (config.isExecutorMetricsEnabled()) { | ||
| // Create and set distributed registry for HoodieWrapperFileSystem | ||
| registry = context.getMetricRegistry(config.getTableName(), HoodieWrapperFileSystem.REGISTRY_NAME); | ||
| registryMeta = context.getMetricRegistry(config.getTableName(), HoodieWrapperFileSystem.REGISTRY_META_NAME); | ||
| } else { | ||
| registry = Registry.getRegistryOfClass(config.getTableName(), HoodieWrapperFileSystem.REGISTRY_NAME, | ||
| Registry.getRegistry(HoodieWrapperFileSystem.REGISTRY_NAME).getClass().getName()); | ||
| registryMeta = Registry.getRegistryOfClass(config.getTableName(), HoodieWrapperFileSystem.REGISTRY_META_NAME, | ||
| Registry.getRegistry(HoodieWrapperFileSystem.REGISTRY_META_NAME).getClass().getName()); | ||
| } | ||
| HoodieWrapperFileSystem.setMetricsRegistry(registry, registryMeta); | ||
| } |
There was a problem hiding this comment.
Reset wrapper-FS registries in the metrics-off path.
HoodieWrapperFileSystem.setMetricsRegistry(...) updates process-wide state, but this helper only touches it when metrics are enabled. A client initialized after a metrics-enabled table will otherwise keep using the previous table's registries, which breaks metrics isolation across tables/executors. Please clear or replace the registries explicitly when config.isMetricsOn() is false.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/util/DistributedRegistryUtil.java`
around lines 39 - 54, The helper createWrapperFileSystemRegistries only calls
HoodieWrapperFileSystem.setMetricsRegistry(...) when config.isMetricsOn() is
true, leaving process-wide registries from a previous table intact when metrics
are off; modify createWrapperFileSystemRegistries so that in the metrics-off
branch you explicitly clear or replace the registries (e.g., by setting
neutral/no-op or null registries) via
HoodieWrapperFileSystem.setMetricsRegistry(...) so the global state is reset for
the current context; update the branch that currently calls
Registry.getRegistryOfClass(...) (or the else path) to call
HoodieWrapperFileSystem.setMetricsRegistry(...) with the appropriate
cleared/empty Registry instances to ensure metrics isolation across
tables/executors.
| private static final Map<String, Registry> DISTRIBUTED_REGISTRY_MAP = new ConcurrentHashMap<>(); | ||
|
|
There was a problem hiding this comment.
Scope the distributed-registry cache to the Spark application, not just the table name.
DISTRIBUTED_REGISTRY_MAP is JVM-global, and the cache key is only tableName + "." + registryName. If two Spark contexts in the same JVM use the same logical table name, the second one reuses the first DistributedRegistry and never runs register(javaSparkContext) for its own context. That can leak metrics across tenants/apps and bind later jobs to a stale registry instance.
💡 Suggested direction
- private static final Map<String, Registry> DISTRIBUTED_REGISTRY_MAP = new ConcurrentHashMap<>();
+ private final Map<String, Registry> distributedRegistryMap = new ConcurrentHashMap<>();
...
`@Override`
public Registry getMetricRegistry(String tableName, String registryName) {
- final String prefixedName = tableName.isEmpty() ? registryName : tableName + "." + registryName;
- return DISTRIBUTED_REGISTRY_MAP.computeIfAbsent(prefixedName, key -> {
+ final String key = String.join("::",
+ javaSparkContext.sc().applicationId(),
+ tableName,
+ registryName);
+ return distributedRegistryMap.computeIfAbsent(key, unused -> {
Registry registry = Registry.getRegistryOfClass(tableName, registryName, DistributedRegistry.class.getName());
((DistributedRegistry) registry).register(javaSparkContext);
return registry;
});
}Also applies to: 282-289
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/client/common/HoodieSparkEngineContext.java`
around lines 96 - 97, DISTRIBUTED_REGISTRY_MAP is currently keyed only by
tableName + "." + registryName which is JVM-global and causes
DistributedRegistry instances to be reused across different Spark applications;
change the cache key to include the Spark application identifier (e.g.,
SparkContext.applicationId or SparkEnv.get.conf / SparkContext.hashCode when
applicationId is unavailable) so each Spark application gets its own
DistributedRegistry; update code paths that construct keys (references:
DISTRIBUTED_REGISTRY_MAP, DistributedRegistry, and register(javaSparkContext))
to build and use the new composite key (applicationId + "." + tableName + "." +
registryName) and ensure entries are looked up/inserted with that composite key
in all places including the other usages around lines 282-289.
| // ------------------------------------------------------------------------- | ||
| // Limit push-down constructor tests | ||
| // ------------------------------------------------------------------------- | ||
|
|
||
| @Test | ||
| public void testConstructorWithLimit() { | ||
| HoodieCdcSplitReaderFunction function = new HoodieCdcSplitReaderFunction( | ||
| conf, | ||
| tableState, | ||
| internalSchemaManager, | ||
| ROW_DATA_TYPE.getChildren(), | ||
| Collections.emptyList(), | ||
| false); | ||
|
|
||
| assertNotNull(function); | ||
| } | ||
|
|
||
| @Test | ||
| public void testConstructorWithLimitAndEmptyFieldTypes() { | ||
| HoodieCdcSplitReaderFunction function = new HoodieCdcSplitReaderFunction( | ||
| conf, | ||
| tableState, | ||
| internalSchemaManager, | ||
| Collections.emptyList(), | ||
| Collections.emptyList(), | ||
| false); | ||
|
|
||
| assertNotNull(function); | ||
| } | ||
|
|
||
| @Test | ||
| public void testDefaultConstructorUsesNoLimitSentinel() { | ||
| // 6-arg constructor must delegate to 7-arg with limit=-1, both should succeed. | ||
| HoodieCdcSplitReaderFunction defaultLimit = new HoodieCdcSplitReaderFunction( | ||
| conf, tableState, internalSchemaManager, | ||
| ROW_DATA_TYPE.getChildren(), Collections.emptyList(), false); | ||
| HoodieCdcSplitReaderFunction explicitNoLimit = new HoodieCdcSplitReaderFunction( | ||
| conf, tableState, internalSchemaManager, | ||
| ROW_DATA_TYPE.getChildren(), Collections.emptyList(), false); | ||
|
|
||
| assertNotNull(defaultLimit); | ||
| assertNotNull(explicitNoLimit); | ||
| } | ||
|
|
||
| @Test | ||
| public void testConstructorWithLimitZeroIsAccepted() { | ||
| // limit=0 is a valid constructor argument (limitIterator won't wrap since limit <= 0). | ||
| HoodieCdcSplitReaderFunction function = new HoodieCdcSplitReaderFunction( | ||
| conf, tableState, internalSchemaManager, | ||
| ROW_DATA_TYPE.getChildren(), Collections.emptyList(), false); | ||
| assertNotNull(function); | ||
| } |
There was a problem hiding this comment.
These “limit” tests never exercise a limit.
Every new test in this block still calls the same constructor shape and only asserts non-null. That means Lines 173-218 do not verify the limit value, the zero-limit case, or the default-sentinel delegation described in the names/comments, so this suite will not catch regressions in the new limit-pushdown behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieCdcSplitReaderFunction.java`
around lines 168 - 219, The tests currently only call the same 6-arg constructor
and never exercise the limit logic; update the tests to invoke the 7-arg
constructor of HoodieCdcSplitReaderFunction with explicit limit values
(positive, zero, and -1 sentinel) and assert the limit behavior: for example,
construct HoodieCdcSplitReaderFunction(..., limit=5),
HoodieCdcSplitReaderFunction(..., limit=0) and HoodieCdcSplitReaderFunction(...,
limit=-1) and verify the class uses the provided limit by either asserting a
public/package-private getter (add getLimit() if missing) or, if no getter
exists, read the private field 'limit' via reflection or assert observable
behavior (e.g., that record iteration is wrapped/truncated when limit>0 and not
wrapped for limit<=0); update the test methods (testConstructorWithLimit,
testConstructorWithLimitAndEmptyFieldTypes,
testDefaultConstructorUsesNoLimitSentinel,
testConstructorWithLimitZeroIsAccepted) to perform these concrete assertions
instead of only assertNotNull.
| public boolean hasNext() { | ||
| try { | ||
| // Short-circuit if already closed | ||
| if (closed) { | ||
| return false; | ||
| } | ||
| // To handle when hasNext() is called multiple times for idempotency and/or the first time | ||
| if (this.next == null) { | ||
| this.next = parquetReader.read(); | ||
| } | ||
| return this.next != null; | ||
| boolean hasNextRecord = this.next != null; | ||
| if (!hasNextRecord) { | ||
| close(); | ||
| } | ||
| return hasNextRecord; | ||
| } catch (Exception e) { | ||
| FileIOUtils.closeQuietly(parquetReader); | ||
| throw new HoodieException("unable to read next record from parquet file ", e); |
There was a problem hiding this comment.
Keep closed consistent on all teardown paths.
Lines 64-65 and 82-83 still close the raw ParquetReader directly, while Line 93 flips closed only after parquetReader.close() succeeds. If read() or close() throws, the iterator state still says “open”, so later hasNext() / next() calls can retry against a reader that is already being torn down.
♻️ Suggested fix
`@Override`
public boolean hasNext() {
+ if (closed) {
+ return false;
+ }
try {
- // Short-circuit if already closed
- if (closed) {
- return false;
- }
// To handle when hasNext() is called multiple times for idempotency and/or the first time
if (this.next == null) {
this.next = parquetReader.read();
@@
}
return hasNextRecord;
} catch (Exception e) {
+ closed = true;
FileIOUtils.closeQuietly(parquetReader);
throw new HoodieException("unable to read next record from parquet file ", e);
}
}
@@
} catch (Exception e) {
+ closed = true;
FileIOUtils.closeQuietly(parquetReader);
throw new HoodieException("unable to read next record from parquet file ", e);
}
}
public void close() {
if (closed) {
return;
}
+ closed = true;
try {
parquetReader.close();
- closed = true;
} catch (IOException e) {
throw new HoodieException("Exception while closing the parquet reader", e);
}
}Also applies to: 69-95
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-hadoop-common/src/main/java/org/apache/hudi/common/util/ParquetReaderIterator.java`
around lines 48 - 65, Ensure the iterator's closed flag is set on every teardown
path before attempting to close the underlying reader: in
ParquetReaderIterator.hasNext() and any exception handlers set closed = true
prior to calling parquetReader.close() (or
FileIOUtils.closeQuietly(parquetReader)), so that failures during read() or
close() cannot leave closed=false; update all places referencing
parquetReader.close(), FileIOUtils.closeQuietly(parquetReader), or throwing
HoodieException to flip the closed field first (use the existing closed field,
parquetReader, hasNext(), next and close() symbols) and then attempt to close
the reader.
| if (!registry.getClass().getName().equals(clazz)) { | ||
| LOG.error("Registry with name " + registryName + " already exists with a different class " + registry.getClass().getName() | ||
| + " than the requested class " + clazz); | ||
| } | ||
| return registry; |
There was a problem hiding this comment.
Don't silently return the wrong registry implementation.
If a name is already registered under one class, logging and then returning that instance for a different requested class means callers keep running with incompatible semantics and the first registrant wins forever. This should fail fast instead of degrading into a hidden runtime mismatch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hudi-io/src/main/java/org/apache/hudi/common/metrics/Registry.java` around
lines 105 - 109, When an existing registry's implementation class doesn't match
the requested clazz, fail fast instead of logging and returning the wrong
instance: replace the LOG.error(...) branch in the Registry lookup (the code
that inspects registry, registryName and clazz) with a thrown exception (e.g.,
throw new IllegalStateException or IllegalArgumentException) that includes
registryName, registry.getClass().getName(), and the requested clazz so callers
cannot continue with an incompatible registry implementation.
| static void setRegistries(Collection<Registry> registries) { | ||
| for (Registry registry : registries) { | ||
| REGISTRY_MAP.putIfAbsent(makeKey("", registry.getName()), registry); | ||
| } |
There was a problem hiding this comment.
setRegistries() re-keys propagated table-scoped registries under the wrong map key.
getRegistryOfClass(tableName, registryName, ...) stores entries under tableName + "::" + registryName, but setRegistries() reinserts them as "::" + registry.getName(), where getName() is already "tableName.registryName". A later lookup by (tableName, registryName) misses the propagated instance and creates a second registry.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hudi-io/src/main/java/org/apache/hudi/common/metrics/Registry.java` around
lines 152 - 155, setRegistries is re-keying table-scoped registries incorrectly:
when iterating registries it uses makeKey("", registry.getName()) which produces
"::"+registry.getName(), but getRegistryOfClass expects keys created by
makeKey(tableName, registryName) (tableName+"::"+registryName) and
registry.getName() already contains "tableName.registryName"; fix setRegistries
by parsing registry.getName() into its tableName and registryName parts (or by
calling the same key-construction logic used by getRegistryOfClass) and insert
into REGISTRY_MAP with makeKey(tableName, registryName) so lookups by
getRegistryOfClass find the propagated instance (update code in setRegistries to
derive tableName and registryName from registry.getName() or use registry
methods that return them).
| public static void rollbackFailedClusteringForPartitions( | ||
| SparkRDDWriteClient<?> client, | ||
| HoodieTableMetaClient metaClient, | ||
| List<String> partitions) { | ||
| for (HoodieInstant instant : getPendingClusteringInstantsForPartitions(metaClient, partitions)) { | ||
| if (!BaseHoodieTableServiceClient.isClusteringInstantEligibleForRollback( | ||
| metaClient, instant, client.getConfig(), client.getHeartbeatClient())) { | ||
| throw new HoodieException("Clustering instant " + instant.requestedTime() | ||
| + " targeting requested partitions is not eligible for rollback " | ||
| + "(heartbeat still active or instant too recent)"); | ||
| } | ||
| // Reload timeline to handle the case where the instant committed and cleaned up | ||
| // its heartbeat after the timeline was first loaded | ||
| metaClient.reloadActiveTimeline(); | ||
| if (metaClient.getActiveTimeline().filterInflightsAndRequested() | ||
| .containsInstant(instant.requestedTime())) { | ||
| LOG.info("Rolling back expired clustering instant {}", instant.requestedTime()); | ||
| client.rollback(instant.requestedTime()); | ||
| } |
There was a problem hiding this comment.
Avoid all-or-nothing rollback in this cleanup helper.
rollbackFailedClusteringForPartitions is documented as rolling back expired/eligible instants, but Lines 337-342 throw on the first live/recent instant. If the partition set contains both stale and active clustering plans, this can rollback some instants and then fail, leaving partial state and preventing cleanup of the rest.
Possible direction
for (HoodieInstant instant : getPendingClusteringInstantsForPartitions(metaClient, partitions)) {
if (!BaseHoodieTableServiceClient.isClusteringInstantEligibleForRollback(
metaClient, instant, client.getConfig(), client.getHeartbeatClient())) {
- throw new HoodieException("Clustering instant " + instant.requestedTime()
- + " targeting requested partitions is not eligible for rollback "
- + "(heartbeat still active or instant too recent)");
+ LOG.info("Skipping pending clustering instant {} because it is not rollback-eligible yet",
+ instant.requestedTime());
+ continue;
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java`
around lines 332 - 350, The current rollbackFailedClusteringForPartitions method
aborts on the first ineligible instant by throwing from the
BaseHoodieTableServiceClient.isClusteringInstantEligibleForRollback check,
causing partial rollback; change this to skip ineligible instants instead of
throwing: for each instant from getPendingClusteringInstantsForPartitions, call
isClusteringInstantEligibleForRollback and if it returns false, log a
warning/info referencing the instant.requestedTime() and continue to the next
instant (optionally collect skipped instants for metrics), otherwise proceed
with metaClient.reloadActiveTimeline(), re-check containsInstant and call
client.rollback(instant.requestedTime()) with LOG.info around successful
rollbacks; do not throw during per-instant processing so other eligible instants
can still be cleaned up.
Greptile SummaryThis PR adds support for safely executing metadata-table (MDT) compaction plans from an external table service platform (rather than inline during writes), along with clustering expiration/rollback via heartbeat-based detection, column pruning for incremental reads, distributed metrics support, log-compaction scheduling improvements, and VECTOR schema validation for nested types. Key changes:
Confidence Score: 3/5PR introduces several independently useful features but has multiple open structural issues that need resolution before merge Several previous-thread issues remain unresolved (Registry key mismatch, static DISTRIBUTED_REGISTRY_MAP, fail-fast in rollbackFailedClusteringForPartitions, inflightInstantsStream duplication). Two new issues are identified: the hasInstantExpired timezone dead-code affecting non-UTC tables, and a potential PUSH_DOWN_INCR_FILTERS regression from the TableScan→PrunedScan upgrade. The core MDT delegation and lock-config derivation logic are sound, and tests are substantial. BaseHoodieTableServiceClient.java (timezone bug), IncrementalRelationUtil.scala (PUSH_DOWN_INCR_FILTERS regression), Registry.java (key mismatch), HoodieSparkEngineContext.java (static registry map) Important Files Changed
Sequence DiagramsequenceDiagram
participant DT as Data Table Writer
participant MDT_W as HoodieBackedTableMetadataWriter
participant MDT_C as MDT Write Client
participant TSM as Table Service Manager (External)
DT->>MDT_W: syncDataTableCommit()
MDT_W->>MDT_W: Check pending compactions on MDT timeline
alt shouldDelegateToTableServiceManager(compaction)?
MDT_W->>MDT_C: scheduleCompactionAtInstant()
MDT_C-->>MDT_W: plan scheduled (REQUESTED state)
MDT_W-->>DT: skip inline execution
note over TSM: TSM independently picks up and executes MDT compaction plan
TSM->>MDT_C: compact(instantTime)
else inline execution
MDT_W->>MDT_C: scheduleCompactionAtInstant()
MDT_W->>MDT_C: compact(instantTime)
end
Note over DT,MDT_C: Clustering Expiration Flow
DT->>DT: rollFailedWrites (LAZY policy)
DT->>DT: filterPendingClusteringTimeline()
DT->>DT: isClusteringInstantEligibleForRollback()
alt eligible for rollback
DT->>DT: rollback(clusteringInstant)
end
Reviews (3): Last reviewed commit: "change name" | Re-trigger Greptile |
| /** | ||
| * Map of all distributed registries created via getMetricRegistry(). | ||
| * This map is passed to Spark executors to make the registries available there. | ||
| */ |
There was a problem hiding this comment.
Static
DISTRIBUTED_REGISTRY_MAP causes cross-table registry collisions
DISTRIBUTED_REGISTRY_MAP is static final, so it is shared across every HoodieSparkEngineContext (and therefore every SparkRDDWriteClient) in the same JVM. In a Spark application running both a data-table writer and a metadata-table writer concurrently, setRegistries pushes all registries into Registry.REGISTRY_MAP using Registry.makeKey("", registry.getName()), which can silently merge metrics from two different tables into one accumulator.
Consider making DISTRIBUTED_REGISTRY_MAP an instance field, or scoping the map key to include a unique client/session identifier.
| } | ||
| } |
There was a problem hiding this comment.
The "Additional Filters to be applied to incremental source are :" message is printed twice: once inside the else branch above and again here after filterRequiredColumnsFromDF. The second log call is a copy-paste artifact and should be removed.
| } | |
| } | |
| scanDf = IncrementalRelationUtil.filterRequiredColumnsFromDF(scanDf, requiredColumns, metaClient) |
| metaClient: HoodieTableMetaClient) = { | ||
| var prunedSchema = StructType(Seq()) |
There was a problem hiding this comment.
Public methods lack explicit return type annotations
Both getPrunedSchema and filterRequiredColumnsFromDF are public and infer their return types (StructType and DataFrame respectively). In Scala, public API methods should declare explicit return types for clarity and binary-compatibility guarantees:
def getPrunedSchema(requiredColumns: Array[String],
usedSchema: StructType,
metaClient: HoodieTableMetaClient): StructType = {
def filterRequiredColumnsFromDF(df: DataFrame,
requiredColumns: Array[String],
metaClient: HoodieTableMetaClient): DataFrame = {| /** | ||
| * Set all registries if they are not already registered. | ||
| */ | ||
| static void setRegistries(Collection<Registry> registries) { | ||
| for (Registry registry : registries) { | ||
| REGISTRY_MAP.putIfAbsent(makeKey("", registry.getName()), registry); | ||
| } |
There was a problem hiding this comment.
setRegistries stores registries under a mismatched key
registry.getName() returns the full name as set in getRegistryOfClass — e.g. "myTable.HoodieWrapperFileSystem". So makeKey("", registry.getName()) produces "::myTable.HoodieWrapperFileSystem".
But the driver stored the same registry under makeKey("myTable", "HoodieWrapperFileSystem") = "myTable::HoodieWrapperFileSystem". Any executor-side code that calls Registry.getRegistry("HoodieWrapperFileSystem") (key "::HoodieWrapperFileSystem") will find neither and silently create a new LocalRegistry, losing those metrics.
The intent appears to be forwarding the distributed accumulators so that executors accumulate into the same object. The key used here should be consistent with how REGISTRY_MAP is keyed elsewhere, for example:
static void setRegistries(Collection<Registry> registries) {
for (Registry registry : registries) {
// Re-derive the compound key the same way getRegistryOfClass() stored it.
// getName() returns "tableName.registryName", so we need to split on the first ".".
String fullName = registry.getName();
int dotIdx = fullName.indexOf('.');
String tableName = dotIdx >= 0 ? fullName.substring(0, dotIdx) : "";
String registryName = dotIdx >= 0 ? fullName.substring(dotIdx + 1) : fullName;
REGISTRY_MAP.putIfAbsent(makeKey(tableName, registryName), registry);
}
}|
|
||
| static final String SPARK_REJECT_UPDATE_STRATEGY_CLASS_NAME = | ||
| "org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy"; | ||
|
|
||
| public static final ConfigProperty<String> UPDATES_STRATEGY = ConfigProperty | ||
| .key("hoodie.clustering.updates.strategy") | ||
| .defaultValue("org.apache.hudi.client.clustering.update.strategy.SparkRejectUpdateStrategy") | ||
| .noDefaultValue() | ||
| .withInferFunction(cfg -> { | ||
| String strategy = cfg.getStringOrDefault(HoodieLockConfig.WRITE_CONFLICT_RESOLUTION_STRATEGY_CLASS_NAME, ""); | ||
| if (PreferWriterConflictResolutionStrategy.class.getName().equals(strategy)) { | ||
| return Option.of(SPARK_ALLOW_UPDATE_STRATEGY_CLASS_NAME); | ||
| } | ||
| return Option.of(SPARK_REJECT_UPDATE_STRATEGY_CLASS_NAME); |
There was a problem hiding this comment.
noDefaultValue() makes UPDATES_STRATEGY.defaultValue() return null
Replacing the hard-coded .defaultValue("org.apache.hudi...SparkRejectUpdateStrategy") with .noDefaultValue() means UPDATES_STRATEGY.defaultValue() now returns null. The infer function computes the right value at config-resolution time, but any call-site that uses the raw ConfigProperty.defaultValue() accessor (e.g. for documentation generation, validation, or legacy code paths that haven't been updated) will now receive null and may NPE.
The deprecated constant DEFAULT_CLUSTERING_UPDATES_STRATEGY was correctly updated to point at SPARK_REJECT_UPDATE_STRATEGY_CLASS_NAME directly, so that one is safe. But review all direct usages of UPDATES_STRATEGY.defaultValue() across the codebase to confirm none depend on a non-null return.
Instead of throwing an error when streaming writes and multi-writer are both configured on the metadata table, silently force streaming writes off since multi-writer mode is for separate table service execution and is not compatible with streaming writes.
There was a problem hiding this comment.
Actionable comments posted: 15
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
hudi-common/src/main/java/org/apache/hudi/common/util/collection/RocksDBDAO.java (1)
509-528:⚠️ Potential issue | 🟡 MinorReorder
defaultWriteOptionsclose afterrocksDBclose.Per RocksDB Java best practices, resources should be closed in reverse order of construction: RocksDB instance first, then Options/WriteOptions, then Statistics. Currently,
defaultWriteOptionsis closed beforerocksDB.close()at line 516. MovedefaultWriteOptions.close()to aftergetRocksDB().close()(line 518) to follow the recommended resource cleanup order.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/main/java/org/apache/hudi/common/util/collection/RocksDBDAO.java` around lines 509 - 528, The close() method currently closes defaultWriteOptions before closing the RocksDB instance; change the teardown order to follow reverse-construction best practices by closing the RocksDB instance first (call getRocksDB().close()), then close defaultWriteOptions (defaultWriteOptions.close()), and finally close statistics (statistics.close()). Update the sequence in the synchronized close() method so managedHandlesMap/managedDescriptorMap are cleared, then getRocksDB().close(), then defaultWriteOptions.close() if non-null, and then statistics cleanup and filesystem deletion.hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java (1)
101-109:⚠️ Potential issue | 🟠 MajorFix validation order: move
sanityCheck()aftersetupConfOptions().The
sanityCheck()method (line 107) callscheckIndexType(), which asserts thatINDEX_GLOBAL_ENABLEDmust be true whenGLOBAL_RECORD_LEVEL_INDEXis configured. However, this assertion runs beforesetupConfOptions()(line 108), which internally callssetupWriteOptions()(lines 428-442) to auto-enableINDEX_GLOBAL_ENABLED=trueforGLOBAL_RECORD_LEVEL_INDEXconfigurations. This ordering prevents users from setting onlyindex.type=GLOBAL_RECORD_LEVEL_INDEXwithout also explicitly settingindex.global.enabled=true, even though the factory would normalize the config successfully if allowed to proceed.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java` around lines 101 - 109, In createDynamicTableSink, the call order causes sanityCheck() (which invokes checkIndexType() and asserts INDEX_GLOBAL_ENABLED when index.type == GLOBAL_RECORD_LEVEL_INDEX) to run before setupConfOptions() (which calls setupWriteOptions() and may auto-enable INDEX_GLOBAL_ENABLED); move the sanityCheck(conf, schema) call to after setupConfOptions(conf, context.getObjectIdentifier(), context.getCatalogTable(), schema) so configuration normalization in setupWriteOptions() occurs before validation, ensuring GLOBAL_RECORD_LEVEL_INDEX is accepted when setupConfOptions enables INDEX_GLOBAL_ENABLED.hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bootstrap/RLIBootstrapOperator.java (1)
63-72:⚠️ Potential issue | 🟠 MajorGuard the bootstrap preload on restore.
initializeState()is called during both startup and recovery. The unconditionalpreLoadRLIRecords()will re-scan the metadata table and re-emit the full bootstrap payload after failover. The siblingBootstrapOperatorguards this withcontext.isRestored()and operator state persistence, butRLIBootstrapOperatorlacks both mechanisms, making recovery O(size of RLI) and risking duplicate index initialization downstream.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bootstrap/RLIBootstrapOperator.java` around lines 63 - 72, initializeState currently always calls preLoadRLIRecords which causes a full re-scan on recovery; wrap the preload so it only runs on fresh start by checking the StateInitializationContext restore flag (use context.isRestored()) and skip preLoadRLIRecords when restored, and persist/restore a small operator state (e.g., loadedCnt) so the operator knows whether RLI was already loaded across restarts; update initializeState, preLoadRLIRecords usage, and add operator state handling for loadedCnt to match BootstrapOperator's approach.hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/MetadataTableCompactHandler.java (1)
96-116:⚠️ Potential issue | 🟡 MinorStop metadata compaction metrics in a
finallyblock.If the log-compaction path throws,
endCompaction()is never reached and the metric stays stuck in the running state.Proposed fix
} else { compactionMetrics.startCompaction(); - // Create a write client specifically for the metadata table - HoodieFlinkMergeOnReadTableCompactor<?> compactor = new HoodieFlinkMergeOnReadTableCompactor<>(); - HoodieTableMetaClient metaClient = table.getMetaClient(); - if (needReloadMetaClient) { - // reload the timeline - metaClient.reload(); - } - Option<InstantRange> instantRange = CompactHelpers.getInstance().getInstantRange(metaClient); - List<WriteStatus> writeStatuses = compactor.logCompact( - writeClient.getConfig(), - event.getOperation(), - event.getCompactionInstantTime(), - instantRange, - table, - table.getTaskContextSupplier()); - compactionMetrics.endCompaction(); - collector.collect(createCommitEvent(event, writeStatuses)); + try { + HoodieFlinkMergeOnReadTableCompactor<?> compactor = new HoodieFlinkMergeOnReadTableCompactor<>(); + HoodieTableMetaClient metaClient = table.getMetaClient(); + if (needReloadMetaClient) { + metaClient.reload(); + } + Option<InstantRange> instantRange = CompactHelpers.getInstance().getInstantRange(metaClient); + List<WriteStatus> writeStatuses = compactor.logCompact( + writeClient.getConfig(), + event.getOperation(), + event.getCompactionInstantTime(), + instantRange, + table, + table.getTaskContextSupplier()); + collector.collect(createCommitEvent(event, writeStatuses)); + } finally { + compactionMetrics.endCompaction(); + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/MetadataTableCompactHandler.java` around lines 96 - 116, The log-compaction path calls compactionMetrics.startCompaction() but currently only calls compactionMetrics.endCompaction() on the success path after HoodieFlinkMergeOnReadTableCompactor.logCompact(...), so if an exception is thrown the metric remains running; wrap the compaction work (the call to compactor.logCompact(...) and subsequent collector.collect(createCommitEvent(...))) in a try block and move compactionMetrics.endCompaction() into a finally block so it always runs, rethrowing the exception after the finally to preserve error propagation; ensure you keep use of HoodieFlinkMergeOnReadTableCompactor, writeClient.getConfig(), event.getCompactionInstantTime(), instantRange, table, table.getTaskContextSupplier(), createCommitEvent(...) and collector.collect(...) in the try so successful results are still collected.
♻️ Duplicate comments (2)
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieCdcSplitReaderFunction.java (1)
172-219:⚠️ Potential issue | 🟠 MajorLimit-pushdown tests still don’t validate any limit behavior.
On Lines 174-180, 187-193, 201-207, and 215-217, all tests still use the 6-arg constructor and only assert non-null, so none of the named limit cases (
limit>0,limit=0, default sentinel) are actually exercised.Proposed test fix
@@ `@Test` public void testConstructorWithLimit() { HoodieCdcSplitReaderFunction function = new HoodieCdcSplitReaderFunction( conf, tableState, internalSchemaManager, ROW_DATA_TYPE.getChildren(), Collections.emptyList(), - false); + false, + 5L); assertNotNull(function); + assertEquals(5L, extractLimit(function)); } @@ public void testConstructorWithLimitAndEmptyFieldTypes() { HoodieCdcSplitReaderFunction function = new HoodieCdcSplitReaderFunction( conf, tableState, internalSchemaManager, Collections.emptyList(), Collections.emptyList(), - false); + false, + 5L); assertNotNull(function); + assertEquals(5L, extractLimit(function)); } @@ public void testDefaultConstructorUsesNoLimitSentinel() { @@ HoodieCdcSplitReaderFunction explicitNoLimit = new HoodieCdcSplitReaderFunction( conf, tableState, internalSchemaManager, - ROW_DATA_TYPE.getChildren(), Collections.emptyList(), false); + ROW_DATA_TYPE.getChildren(), Collections.emptyList(), false, -1L); @@ assertNotNull(defaultLimit); assertNotNull(explicitNoLimit); + assertEquals(-1L, extractLimit(defaultLimit)); + assertEquals(-1L, extractLimit(explicitNoLimit)); } @@ public void testConstructorWithLimitZeroIsAccepted() { @@ HoodieCdcSplitReaderFunction function = new HoodieCdcSplitReaderFunction( conf, tableState, internalSchemaManager, - ROW_DATA_TYPE.getChildren(), Collections.emptyList(), false); + ROW_DATA_TYPE.getChildren(), Collections.emptyList(), false, 0L); assertNotNull(function); + assertEquals(0L, extractLimit(function)); } + + private long extractLimit(HoodieCdcSplitReaderFunction function) { + try { + java.lang.reflect.Field f = HoodieCdcSplitReaderFunction.class.getDeclaredField("limit"); + f.setAccessible(true); + return (long) f.get(function); + } catch (ReflectiveOperationException e) { + throw new AssertionError("Unable to inspect limit field", e); + } + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieCdcSplitReaderFunction.java` around lines 172 - 219, Tests for HoodieCdcSplitReaderFunction are currently only calling the 6-arg constructor and asserting non-null, so the limit cases are not exercised; update each test (testConstructorWithLimit, testConstructorWithLimitAndEmptyFieldTypes, testDefaultConstructorUsesNoLimitSentinel, testConstructorWithLimitZeroIsAccepted) to call the 7-argument HoodieCdcSplitReaderFunction constructor that includes the limit parameter with explicit values (e.g., limit>0 for positive case, 0 for zero case, -1 for sentinel/default) and replace the bare assertNotNull with assertions that verify the limit was applied (either by calling a public accessor if available, or by using reflection to read the private limit-related field, or by asserting observable behavior that depends on the limit such as wrapping of iterators); ensure you reference the constructor signature HoodieCdcSplitReaderFunction(..., List<RowType> fieldTypes, List<String> projectedFields, boolean someFlag, long limit) when making the changes.hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java (1)
424-427:⚠️ Potential issue | 🟠 MajorThe new VECTOR guard still misses nested UNION/ARRAY/MAP shapes.
This still only rejects the immediate non-null child in
createArray()/createMap(), andvalidateNoVectorInNestedRecord()only walks into childRECORDs. Shapes likeARRAY<UNION<VECTOR,INT>>orMAP<STRING,UNION<RECORD<...VECTOR...>>>will still pass even though the error text says nested VECTORs are unsupported.Also applies to: 445-448, 484-515
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java` around lines 424 - 427, The current VECTOR guard only checks the immediate non-null child and RECORDs, so nested shapes like ARRAY<UNION<VECTOR,INT>> or MAP<STRING,UNION<RECORD<...VECTOR...>>> slip through; update createArray() and createMap() and the validateNoVectorInNestedRecord() flow to recursively walk all possible nested type constructors (UNION, ARRAY, MAP, RECORD) by following getNonNullType() and inspecting getType() (HoodieSchemaType) until you either hit a non-VECTOR leaf or find a VECTOR; extract this into a helper (e.g., containsVectorRecursively(Type t)) that returns true if any nested/non-null child is VECTOR and call it from the places that currently only check the immediate child, then throw the existing HoodieSchemaException with the same message when the helper returns true. Ensure the helper traverses UNION members, ARRAY element types, MAP value/key types as applicable and reuses getNonNullType() on each visited node.
🟡 Minor comments (6)
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcInputFormat.java-71-77 (1)
71-77:⚠️ Potential issue | 🟡 MinorPotential resource leak if exception occurs before iterator takes ownership.
If an exception is thrown between creating
CdcImageManager(line 71-74) and constructingCdcFileSplitsIterator(line 77), the manager won't be closed.🛡️ Proposed fix to ensure cleanup on exception
HoodieCDCSupplementalLoggingMode mode = OptionsResolver.getCDCSupplementalLoggingMode(conf); CdcImageManager manager = new CdcImageManager( tableState.getRowType(), FlinkWriteClients.getHoodieClientConfig(conf), this::getFileSliceIterator); - Function<HoodieCDCFileSplit, ClosableIterator<RowData>> recordIteratorFunc = - cdcFileSplit -> getRecordIteratorSafe(split.getTablePath(), split.getMaxCompactionMemoryInBytes(), cdcFileSplit, mode, manager); - return new CdcIterators.CdcFileSplitsIterator(((CdcInputSplit) split).getChanges(), manager, recordIteratorFunc); + try { + Function<HoodieCDCFileSplit, ClosableIterator<RowData>> recordIteratorFunc = + cdcFileSplit -> getRecordIteratorSafe(split.getTablePath(), split.getMaxCompactionMemoryInBytes(), cdcFileSplit, mode, manager); + return new CdcIterators.CdcFileSplitsIterator(((CdcInputSplit) split).getChanges(), manager, recordIteratorFunc); + } catch (Exception e) { + manager.close(); + throw e; + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcInputFormat.java` around lines 71 - 77, Creating CdcImageManager and then calling getRecordIteratorSafe may throw before the CdcIterators.CdcFileSplitsIterator takes ownership, leaking resources; wrap the creation and subsequent setup in a try/catch (or try/finally) so that if any exception occurs before you return the new CdcFileSplitsIterator you call manager.close() (or manager.closeQuietly()) and rethrow the exception; ensure the returned CdcFileSplitsIterator still assumes ownership of the manager when successfully constructed (i.e., only close on the error path), referencing CdcImageManager, getRecordIteratorSafe, and CdcIterators.CdcFileSplitsIterator to locate the code.docker/build_docker_images.sh-108-112 (1)
108-112:⚠️ Potential issue | 🟡 MinorQuote
--build-argassignments to prevent shell splitting if version strings contain spaces.At lines 109–111, unquoted variable expansions can be split by the shell if the values passed via command-line arguments contain spaces or special characters (e.g.,
./build_docker_images.sh --hadoop-version "2.8.4 test"). While default version strings are safe, the script accepts arbitrary values from users without validation, making this a defensive improvement.Suggested fix
if ! docker build \ - --build-arg HADOOP_VERSION=${HADOOP_VERSION} \ - --build-arg SPARK_VERSION=${SPARK_VERSION} \ - --build-arg HIVE_VERSION=${HIVE_VERSION} \ + --build-arg "HADOOP_VERSION=${HADOOP_VERSION}" \ + --build-arg "SPARK_VERSION=${SPARK_VERSION}" \ + --build-arg "HIVE_VERSION=${HIVE_VERSION}" \ "$IMAGE_CONTEXT" -t "$TAG_LATEST" -t "$TAG_VERSIONED"; then🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/build_docker_images.sh` around lines 108 - 112, The docker build command is using unquoted variable expansions for the --build-arg assignments (HADOOP_VERSION, SPARK_VERSION, HIVE_VERSION) which can be split if a version string contains spaces; update the invocation so each build-arg assignment is quoted (e.g. --build-arg "HADOOP_VERSION=${HADOOP_VERSION}") to prevent shell word-splitting and preserve special characters when calling docker build in this block.hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CompositeCleanHandler.java-24-48 (1)
24-48:⚠️ Potential issue | 🟡 MinorAdd exception handling to ensure both handlers are always invoked.
The
forEachHandlermethod inCompositeTableServiceHandlerlacks exception handling. If the first handler throws an exception duringclean(),waitForCleaningFinish(),startAsyncCleaning(), orclose(), the second handler's method is never invoked. This could leave the metadata table in an inconsistent state relative to the data table.For
close()specifically, if the data table handler fails to close, the metadata table handler's resources may leak.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/CompositeCleanHandler.java` around lines 24 - 48, The composite currently delegates via forEachHandler which stops on the first exception; change each method in CompositeCleanHandler (clean, waitForCleaningFinish, startAsyncCleaning, close) to invoke the two underlying handlers individually (use the same handler references used by CompositeTableServiceHandler or call the existing accessors), wrap each handler call in its own try/catch, and ensure the second handler is always invoked even if the first throws; if one or both throw, capture the first Throwable, add any subsequent Throwable as suppressed to the first, and after both calls rethrow the captured exception so callers still see failures while preventing resource leaks (apply this pattern for close as well).hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/DataTableCompactionCommitHandler.java-97-101 (1)
97-101:⚠️ Potential issue | 🟡 MinorThe NPE concern is mitigated by Flink's lifecycle contract, but defensive null checks would improve robustness.
While
compactionMetricscould theoretically be null ifregisterMetrics()is not called beforecommitIfNecessary()(lines 133, 143, 180, 221), this is prevented in practice by Flink's standard SinkFunction lifecycle:open()(which callsregisterMetrics()at CompactionCommitSink line 60) is always invoked beforeinvoke()(which callscommitIfNecessary()).However, the interface lacks explicit documentation of this required call order, and the implementation has no defensive null checks. Adding either a null check or a Javadoc contract to
CompactionCommitHandlerwould make the dependency explicit and improve defensiveness.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/DataTableCompactionCommitHandler.java` around lines 97 - 101, The implementation assumes registerMetrics() is always called before commitIfNecessary(), but lacks defensive checks; update CompactionCommitHandler (or DataTableCompactionCommitHandler) so the contract is explicit and robust by either adding a short Javadoc on CompactionCommitHandler.registerMetrics() stating it must be called prior to commitIfNecessary(), and/or adding null checks in DataTableCompactionCommitHandler.commitIfNecessary() (and any other methods that reference compactionMetrics) to handle a missing compactionMetrics gracefully (log an error/warning and skip metrics operations) so NPEs cannot occur if registerMetrics() was not invoked.hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java-105-106 (1)
105-106:⚠️ Potential issue | 🟡 MinorGuard
close()against partial initialization.If
createCompactionPlanHandler(...)fails inopen(),close()now dereferences a null handler and can mask the real startup error.💡 Suggested fix
`@Override` public void close() throws Exception { - this.compactionPlanHandler.close(); + if (this.compactionPlanHandler != null) { + this.compactionPlanHandler.close(); + } super.close(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionPlanOperator.java` around lines 105 - 106, The close() method currently calls this.compactionPlanHandler.close() without null-checking, which can NPE if createCompactionPlanHandler(...) failed during open(); update close() in CompactionPlanOperator to guard against partial initialization by checking compactionPlanHandler != null before invoking its close (or wrap the call in a null-safe try/catch), ensuring any exception from open() is not masked by a secondary NPE from close().hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/DataTableCompactHandler.java-127-147 (1)
127-147:⚠️ Potential issue | 🟡 MinorAlways end compaction metrics on failure.
endCompaction()only runs on the success path here. If the compactor orcollector.collect(...)throws, the metric remains open and subsequent measurements are skewed.Proposed fix
protected void doCompaction(CompactionPlanEvent event, Collector<CompactionCommitEvent> collector, boolean needReloadMetaClient) throws Exception { compactionMetrics.startCompaction(); - HoodieFlinkMergeOnReadTableCompactor<?> compactor = new HoodieFlinkMergeOnReadTableCompactor<>(); - HoodieTableMetaClient metaClient = table.getMetaClient(); - if (needReloadMetaClient) { - // reload the timeline - metaClient.reload(); - } - // schema evolution - CompactionUtil.setAvroSchema(writeClient.getConfig(), metaClient); - List<WriteStatus> writeStatuses = compactor.compact( - writeClient.getConfig(), - event.getOperation(), - event.getCompactionInstantTime(), - table.getTaskContextSupplier(), - createReaderContext(needReloadMetaClient), - table); - compactionMetrics.endCompaction(); - collector.collect(createCommitEvent(event, writeStatuses)); + try { + HoodieFlinkMergeOnReadTableCompactor<?> compactor = new HoodieFlinkMergeOnReadTableCompactor<>(); + HoodieTableMetaClient metaClient = table.getMetaClient(); + if (needReloadMetaClient) { + metaClient.reload(); + } + CompactionUtil.setAvroSchema(writeClient.getConfig(), metaClient); + List<WriteStatus> writeStatuses = compactor.compact( + writeClient.getConfig(), + event.getOperation(), + event.getCompactionInstantTime(), + table.getTaskContextSupplier(), + createReaderContext(needReloadMetaClient), + table); + collector.collect(createCommitEvent(event, writeStatuses)); + } finally { + compactionMetrics.endCompaction(); + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/DataTableCompactHandler.java` around lines 127 - 147, The compaction metrics are only ended on the success path in doCompaction; wrap the compaction work (the call to HoodieFlinkMergeOnReadTableCompactor.compact(...) and collector.collect(...)) in a try/finally so compactionMetrics.endCompaction() is always invoked (and rethrow the exception after the finally block if one occurs). Specifically, in DataTableCompactHandler.doCompaction(), call compactionMetrics.startCompaction() as now, then perform compactor.compact(...) and collector.collect(...) inside a try block and move compactionMetrics.endCompaction() into the finally to guarantee it runs on both success and failure.
🧹 Nitpick comments (20)
hudi-io/src/test/java/org/apache/hudi/io/util/TestIOUtils.java (1)
95-127: Add negative-value encode coverage forwriteVarInt.Current new tests are good for positive/boundary values, but adding signed cases would better lock down compatibility for the full API contract.
✅ Suggested test extension
`@Test` public void testWriteVarIntRoundTrip() { @@ - int[] testValues = {0, 1, 127, 128, 146, 200, 255, 256, 300, 1000, 32080, 65535, 100000, - 2034958, 632492350, Integer.MAX_VALUE}; + int[] testValues = { + Integer.MIN_VALUE, -203, -128, -113, -112, -1, + 0, 1, 127, 128, 146, 200, 255, 256, 300, 1000, 32080, 65535, 100000, + 2034958, 632492350, Integer.MAX_VALUE + };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-io/src/test/java/org/apache/hudi/io/util/TestIOUtils.java` around lines 95 - 127, Add negative-value coverage to the writeVarInt tests by feeding a few negative ints (e.g., -1, -128, Integer.MIN_VALUE and a couple mid-range negatives) into IOUtils.writeVarInt and asserting round-trip correctness: use IOUtils.decodeVarLongSizeOnDisk(encoded, 0) to verify size equals encoded.length and IOUtils.readVarLong(encoded, 0, size) (or readVarLong(encoded, 0) where applicable) to assert the decoded value equals the original negative int; place these checks into testWriteVarIntRoundTrip (or a new test) alongside the existing positive cases to ensure writeVarInt/readVarLong handle signed inputs correctly.hudi-common/src/main/java/org/apache/hudi/common/table/log/BaseHoodieLogRecordReader.java (3)
279-280: PreferifPresent()overmap()for side-effect-only operations.Using
Optional.map()when you're only interested in the side effect (and discarding the return value) is non-idiomatic. TheifPresent()method better expresses intent.♻️ Suggested fix
- logBlock.getBlockContentLocation() - .map(contentLocation -> totalLogBlocksSize.addAndGet(contentLocation.getBlockSize())); + logBlock.getBlockContentLocation() + .ifPresent(contentLocation -> totalLogBlocksSize.addAndGet(contentLocation.getBlockSize()));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/main/java/org/apache/hudi/common/table/log/BaseHoodieLogRecordReader.java` around lines 279 - 280, In BaseHoodieLogRecordReader replace the use of Optional.map when invoking logBlock.getBlockContentLocation() for a side-effect with Optional.ifPresent to express intent: call logBlock.getBlockContentLocation().ifPresent(contentLocation -> totalLogBlocksSize.addAndGet(contentLocation.getBlockSize())) so the Optional return value is not misused and the side-effect (totalLogBlocksSize.addAndGet(...)) is clearly represented.
546-548: Consider returning an unmodifiable view to prevent external mutation.
getBlocksStats()returns a direct reference to the internal mutable list. While current callers only read from it, returning an unmodifiable view would provide better encapsulation.♻️ Suggested fix
public List<Map<String, Object>> getBlocksStats() { - return blocksStats; + return Collections.unmodifiableList(blocksStats); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/main/java/org/apache/hudi/common/table/log/BaseHoodieLogRecordReader.java` around lines 546 - 548, getBlocksStats() currently returns the internal mutable list blocksStats allowing external mutation; change its implementation to return an unmodifiable view by returning Collections.unmodifiableList(blocksStats) from getBlocksStats(), adding an import for java.util.Collections; optionally, if you want to also protect contained maps, wrap each map with Collections.unmodifiableMap before returning or return a defensive copy of the list of maps.
495-511: Inconsistent null-check forrecordBuffer; preferifPresent()for side-effect-onlyOptionalusage.
Line 500 calls
recordBuffer.processDataBlock()without a null check, while lines 498 and 501 defensively checkrecordBuffer != null. IfrecordBuffercan be null when this method is called, line 500 will NPE.Line 506-507 uses
Optional.map()for a side effect (putting into the map), which is non-idiomatic.♻️ Suggested fix
private void processDataBlock(HoodieDataBlock dataBlock, Option<KeySpec> keySpecOpt) throws IOException { String blockInstantTime = dataBlock.getLogBlockHeader().get(INSTANT_TIME); LOG.debug("Processing log block with instant time {}", blockInstantTime); long totalLogRecordsBefore = recordBuffer != null ? recordBuffer.getTotalLogRecords() : 0L; HoodieTimer blockReadTimer = HoodieTimer.start(); - recordBuffer.processDataBlock(dataBlock, keySpecOpt); + if (recordBuffer != null) { + recordBuffer.processDataBlock(dataBlock, keySpecOpt); + } long recordsInBlock = recordBuffer != null ? recordBuffer.getTotalLogRecords() - totalLogRecordsBefore : 0L; Map<String, Object> blockReadMetrics = new HashMap<>(); blockReadMetrics.put(LOG_BLOCK_FULL_READ_DURATION_IN_MILLIS, blockReadTimer.endTimer()); blockReadMetrics.put(TOTAL_RECORDS_PRESENT_IN_LOG_BLOCK, recordsInBlock); dataBlock.getBlockContentLocation() - .map(contentLocation -> blockReadMetrics.put(BLOCK_SIZE_IN_BYTES, contentLocation.getBlockSize())); + .ifPresent(contentLocation -> blockReadMetrics.put(BLOCK_SIZE_IN_BYTES, contentLocation.getBlockSize())); blockReadMetrics.put(HoodieLogBlock.HeaderMetadataType.INSTANT_TIME.toString(), blockInstantTime); blocksStats.add(blockReadMetrics); LOG.debug("For log block, scan metrics are {}", blockReadMetrics); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/main/java/org/apache/hudi/common/table/log/BaseHoodieLogRecordReader.java` around lines 495 - 511, The method processDataBlock currently calls recordBuffer.processDataBlock(...) without guarding against a null recordBuffer while other lines use recordBuffer != null; fix by checking recordBuffer is present (e.g., if (recordBuffer != null) or recordBuffer.ifPresent(...) depending on its type) before invoking processDataBlock and when computing recordsInBlock use the same presence check to avoid NPEs; also replace dataBlock.getBlockContentLocation().map(...) which uses map for a side effect with a proper ifPresent/ifPresentOrElse so you call blocksStats.put(BLOCK_SIZE_IN_BYTES, ...) (or blockReadMetrics.put(...)) only when the content location Optional is present.hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcIterators.java (3)
587-593: RedundantsetRowKindcall.Same issue as
BeforeImageIterator.getAfterImage()- the row kind is already set byimageManager.getImageRecord().♻️ Remove redundant setRowKind
`@Override` protected RowData getBeforeImage(RowKind rowKind, GenericRecord cdcRecord) { String recordKey = cdcRecord.get(1).toString(); RowData row = imageManager.getImageRecord(recordKey, beforeImages, rowKind); - row.setRowKind(rowKind); return projection.project(row); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcIterators.java` around lines 587 - 593, In getBeforeImage (class CdcIterators), remove the redundant row.setRowKind(rowKind) call since imageManager.getImageRecord(...) already sets the RowKind (same fix as applied to BeforeImageIterator.getAfterImage()); simply return projection.project(row) after obtaining the row from imageManager.getImageRecord(recordKey, beforeImages, rowKind).
541-547: RedundantsetRowKindcall.
imageManager.getImageRecord()already sets the row kind (line 544), but it's set again on line 545. Remove the redundant call.♻️ Remove redundant setRowKind
`@Override` protected RowData getAfterImage(RowKind rowKind, GenericRecord cdcRecord) { String recordKey = cdcRecord.get(1).toString(); RowData row = imageManager.getImageRecord(recordKey, afterImages, rowKind); - row.setRowKind(rowKind); return projection.project(row); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcIterators.java` around lines 541 - 547, In getAfterImage (method getAfterImage in CdcIterators), remove the redundant call to row.setRowKind(rowKind) since imageManager.getImageRecord(recordKey, afterImages, rowKind) already sets the RowKind; leave the retrieval via imageManager.getImageRecord and the final return projection.project(row) intact.
334-338: Double-close of sharedCdcImageManager.
DataLogFileIterator.close()closes theimageManager, butCdcFileSplitsIterator(which owns and created the manager) also closes it in itsclose()method. WhileCdcImageManager.close()is idempotent, this design is confusing and error-prone.Consider having only the owner (
CdcFileSplitsIterator) responsible for closing the manager, or add a flag to track closure state.♻️ Proposed fix: Remove imageManager.close() from DataLogFileIterator
`@Override` public void close() { logRecordIterator.close(); - imageManager.close(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcIterators.java` around lines 334 - 338, DataLogFileIterator.close() currently calls imageManager.close() even though CdcFileSplitsIterator (the owner/creator of CdcImageManager) also closes it, causing a double-close pattern; remove the imageManager.close() call from DataLogFileIterator.close() so only CdcFileSplitsIterator is responsible for closing CdcImageManager, or alternatively add an ownership/closed flag on CdcImageManager and only close when the owner (CdcFileSplitsIterator) invokes close; update DataLogFileIterator.close() to stop closing imageManager and ensure CdcFileSplitsIterator.close() remains the single point that calls CdcImageManager.close().hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/HoodieVectorSearchTableValuedFunction.scala (1)
90-108: Potential integer overflow when parsing k from non-Int expressions.The
parseKmethod convertsrawValue.toString.toInt, which could silently truncate if the expression evaluates to aLonglarger thanInt.MaxValue. While unlikely in practice (k values that large would be impractical), explicit type handling would be safer.Suggested improvement for type-safe parsing
private[logical] def parseK(funcName: String, expr: Expression): Int = { val rawValue = expr.eval() if (rawValue == null) { throw new HoodieAnalysisException( s"Function '$funcName': k must be a positive integer, got null") } - val kValue = try { - rawValue.toString.toInt - } catch { - case _: NumberFormatException => - throw new HoodieAnalysisException( - s"Function '$funcName': k must be a positive integer, got '$rawValue'") + val kValue: Int = rawValue match { + case i: Int => i + case l: Long if l >= Int.MinValue && l <= Int.MaxValue => l.toInt + case l: Long => + throw new HoodieAnalysisException( + s"Function '$funcName': k value $l exceeds maximum allowed (${Int.MaxValue})") + case other => + try { other.toString.toInt } + catch { case _: NumberFormatException => + throw new HoodieAnalysisException( + s"Function '$funcName': k must be a positive integer, got '$other'") + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/HoodieVectorSearchTableValuedFunction.scala` around lines 90 - 108, parseK currently calls rawValue.toString.toInt which can silently truncate large numeric types (e.g., Long/BigInt); change parseK (method parseK in HoodieVectorSearchTableValuedFunction) to handle numeric types explicitly: if rawValue is an Int use it, if it's a Long/BigInt/Short/Byte/Number convert to a Long/BigInt and validate it fits in Int range and is > 0, if it's a String try parsing it as a BigInt/Long then validate range, and otherwise throw HoodieAnalysisException; ensure all invalid or out-of-range values produce the same user-friendly error messages instead of silent truncation.hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/analysis/VectorDistanceUtils.scala (1)
86-97: Floating-point equality comparison may cause edge-case issues.Line 92 uses exact equality (
denom == 0.0) to check for zero denominator. Due to floating-point precision, vectors with extremely small norms might produce a non-zero but near-zero denominator, leading to potential division issues or numerical instability.Consider using a small epsilon threshold
case DistanceMetric.COSINE => (a, b, bNorm) => val aNorm = Vectors.norm(a, 2.0) val denom = aNorm * bNorm - if (denom == 0.0) 1.0 else math.min(2.0, math.max(0.0, 1.0 - (a.dot(b) / denom))) + if (denom < 1e-15) 1.0 else math.min(2.0, math.max(0.0, 1.0 - (a.dot(b) / denom)))This provides a safety margin for near-zero denominators while maintaining the same behavior for actual zero vectors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/analysis/VectorDistanceUtils.scala` around lines 86 - 97, The COSINE branch in resolveDistanceFn uses exact floating-point equality (denom == 0.0); change this to a safe near-zero check by introducing a small epsilon and treating denom as zero when |denom| <= eps (for example eps = 1e-12) to avoid division by a near-zero value; update the DistanceMetric.COSINE case (variables aNorm, bNorm, denom) to compute denom, compare against eps, and return the same fallback (1.0) when denom is effectively zero, otherwise perform the division as before.hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala (1)
341-344: Consider preserving the cause chain for debugging.The exception wrapper only includes the message but not the original exception as the cause. This can make debugging harder when unexpected errors occur.
Suggested improvement
} catch { - case e: Exception => throw new HoodieAnalysisException( - s"$funcName: unable to resolve table '$table': ${e.getMessage}") + case e: Exception => throw new HoodieAnalysisException( + s"$funcName: unable to resolve table '$table': ${e.getMessage}", e) }Note: This requires
HoodieAnalysisExceptionto have a constructor that accepts a cause. If it doesn't, the current approach is acceptable.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/analysis/HoodieSparkBaseAnalysis.scala` around lines 341 - 344, The catch block that currently throws new HoodieAnalysisException(s"$funcName: unable to resolve table '$table': ${e.getMessage}") should preserve the original exception as the cause; change the throw to construct HoodieAnalysisException with the message and pass the caught exception `e` as the cause (i.e., use the HoodieAnalysisException constructor that accepts a Throwable), referencing the catch in HoodieSparkBaseAnalysis's resolver logic (the catch handling `e: Exception`, using `funcName` and `table`) so the original stacktrace is retained for debugging.hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/partitioner/index/TestRocksDBIndexBackend.java (1)
102-158: Consider adding upper bound assertions for hit ratios.Lines 147-150 assert that hit ratios are
>= 0D, but Line 151 only adds the upper bound check forROCKSDB_BLOCK_CACHE_HIT_RATIO. For completeness, consider adding<= 1Dassertions for the other hit ratio metrics as well (data, index, filter).Proposed fix
assertTrue(((Number) gauges.get(FlinkRocksDBIndexMetrics.ROCKSDB_BLOCK_CACHE_HIT_RATIO).getValue()).doubleValue() >= 0D); assertTrue(((Number) gauges.get(FlinkRocksDBIndexMetrics.ROCKSDB_BLOCK_CACHE_DATA_HIT_RATIO).getValue()).doubleValue() >= 0D); assertTrue(((Number) gauges.get(FlinkRocksDBIndexMetrics.ROCKSDB_BLOCK_CACHE_INDEX_HIT_RATIO).getValue()).doubleValue() >= 0D); assertTrue(((Number) gauges.get(FlinkRocksDBIndexMetrics.ROCKSDB_BLOCK_CACHE_FILTER_HIT_RATIO).getValue()).doubleValue() >= 0D); assertTrue(((Number) gauges.get(FlinkRocksDBIndexMetrics.ROCKSDB_BLOCK_CACHE_HIT_RATIO).getValue()).doubleValue() <= 1D); + assertTrue(((Number) gauges.get(FlinkRocksDBIndexMetrics.ROCKSDB_BLOCK_CACHE_DATA_HIT_RATIO).getValue()).doubleValue() <= 1D); + assertTrue(((Number) gauges.get(FlinkRocksDBIndexMetrics.ROCKSDB_BLOCK_CACHE_INDEX_HIT_RATIO).getValue()).doubleValue() <= 1D); + assertTrue(((Number) gauges.get(FlinkRocksDBIndexMetrics.ROCKSDB_BLOCK_CACHE_FILTER_HIT_RATIO).getValue()).doubleValue() <= 1D);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/partitioner/index/TestRocksDBIndexBackend.java` around lines 102 - 158, The test testMetricsReflectWritesReadsAndAutoFlush in TestRocksDBIndexBackend currently only asserts an upper bound (<= 1D) for FlinkRocksDBIndexMetrics.ROCKSDB_BLOCK_CACHE_HIT_RATIO; add corresponding upper-bound assertions for ROCKSDB_BLOCK_CACHE_DATA_HIT_RATIO, ROCKSDB_BLOCK_CACHE_INDEX_HIT_RATIO, and ROCKSDB_BLOCK_CACHE_FILTER_HIT_RATIO so each hit-ratio gauge is asserted to be >= 0D and <= 1D. Locate the assertions using the test method name and the FlinkRocksDBIndexMetrics constant names and add the missing <= 1D checks immediately after their existing >= 0D assertions.hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/common/table/log/TestLogReaderUtils.java (1)
119-121: Tighten the boundary assertion to the actual cutoff commit.This test calls the helper with
secondCommitas max, so asserting againstthirdCommitis looser than the contract being validated.Proposed refinement
- assertTrue(commitTime.compareTo(thirdCommit) < 0, - "Commit time should be before third commit: " + commitTime); + assertTrue(commitTime.compareTo(secondCommit) <= 0, + "Commit time should be <= second commit: " + commitTime);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/common/table/log/TestLogReaderUtils.java` around lines 119 - 121, The assertion in TestLogReaderUtils compares commitTime against thirdCommit but the helper was invoked with secondCommit as the max cutoff; tighten the boundary by asserting commitTime.compareTo(secondCommit) < 0 (or the equivalent strict comparison) so the test verifies the actual cutoff passed to the helper (update the assertion using the commitTime and secondCommit variables in the test).hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala (1)
1148-1156: Consider scoping instant normalization to incremental query mode.Line 1150–1153 normalizes begin/end instants regardless of
QUERY_TYPE. Limiting this to incremental reads would reduce surprising behavior for snapshot/read-optimized calls that happen to carry these keys.Suggested refinement
val startCommitKey = DataSourceReadOptions.START_COMMIT.key val endCommitKey = DataSourceReadOptions.END_COMMIT.key - val normalizedStartCommit = translatedParams.get(startCommitKey) - .map(v => startCommitKey -> HoodieSqlCommonUtils.formatIncrementalInstant(v)) - val normalizedEndCommit = translatedParams.get(endCommitKey) - .map(v => endCommitKey -> HoodieSqlCommonUtils.formatIncrementalInstant(v)) + val normalizeInstants = queryType == DataSourceReadOptions.QUERY_TYPE_INCREMENTAL_OPT_VAL + val normalizedStartCommit = if (normalizeInstants) { + translatedParams.get(startCommitKey) + .map(v => startCommitKey -> HoodieSqlCommonUtils.formatIncrementalInstant(v)) + } else None + val normalizedEndCommit = if (normalizeInstants) { + translatedParams.get(endCommitKey) + .map(v => endCommitKey -> HoodieSqlCommonUtils.formatIncrementalInstant(v)) + } else None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala` around lines 1148 - 1156, The start/end instant normalization currently always runs; restrict it to only run for incremental queries by checking the query type (compare queryType against DataSourceReadOptions.QUERY_TYPE possible incremental value) before calling HoodieSqlCommonUtils.formatIncrementalInstant. Concretely, wrap the normalizedStartCommit/normalizedEndCommit creation so they are computed only when QUERY_TYPE.key -> queryType indicates an incremental read (otherwise leave them as None), and then merge those option pairs into the final Map as before.hudi-common/src/main/java/org/apache/hudi/util/PartitionPathFilterUtil.java (1)
26-37: Harden null handling in shared predicate utility.At Line 29 and Line 36, null
relativePathPrefixesor nullpathcan trigger runtime failures. Since this is a shared utility, add explicit guards/fail-fast handling to keep call-sites safer.Suggested hardening
public class PartitionPathFilterUtil { public static Predicate<String> relativePathPrefixPredicate(List<String> relativePathPrefixes) { + if (relativePathPrefixes == null) { + return path -> false; + } return path -> relativePathPrefixes.stream().anyMatch(relativePathPrefix -> // Partition paths stored in metadata table do not have the slash at the end. // If the relativePathPrefix is empty, return all partition paths; // else if the relative path prefix is the same as the path, this is an exact match; // else, we need to make sure the path is a subdirectory of relativePathPrefix, by // checking if the path starts with relativePathPrefix appended by a slash ("/"). StringUtils.isNullOrEmpty(relativePathPrefix) - || path.equals(relativePathPrefix) || path.startsWith(relativePathPrefix + "/")); + || (path != null && (path.equals(relativePathPrefix) || path.startsWith(relativePathPrefix + "/")))); } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/main/java/org/apache/hudi/util/PartitionPathFilterUtil.java` around lines 26 - 37, Ensure relativePathPrefixes is non-null and make the returned predicate null-safe: in PartitionPathFilterUtil.relativePathPrefixPredicate call Objects.requireNonNull(relativePathPrefixes, "relativePathPrefixes must not be null") to fail fast if the list is null, and change the lambda to first check path != null (return false for null path) before evaluating the stream anyMatch over relativePathPrefixes (the existing StringUtils.isNullOrEmpty(relativePathPrefix) already handles null prefixes), referencing the method relativePathPrefixPredicate, the parameter relativePathPrefixes, the lambda variable path, and the stream variable relativePathPrefix.hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java (1)
374-380: Minor typo in comment."recommiting" should be "recommitting".
📝 Fix typo
`@Override` public void subtaskReset(int i, long resetCkpId) { - // There exists pending instants waiting for recommiting. + // There exists pending instants waiting for recommitting. if (tableState.isRLIWithBootstrap && !eventBuffers.getPendingInstantsBefore(resetCkpId).isEmpty()) {🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java` around lines 374 - 380, In subtaskReset (use symbols: subtaskReset, tableState.isRLIWithBootstrap, eventBuffers.getPendingInstantsBefore, executor.executeSync, commitInstants) fix the comment typo by changing "recommiting" to "recommitting" so the inline comment reads "use sync execution here to make sure the recommitting finishes before RLI bootstrapping".hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java (1)
428-447: Exception typeCompletionExceptionmay mask root cause during debugging.Wrapping exceptions in
CompletionExceptionat line 445 could make debugging harder sinceCompletionExceptionis typically used for async operations. Consider usingHoodieExceptionor rethrowing as aRuntimeExceptionto maintain consistency with other exception handling in this test utility.♻️ Suggested alternative exception handling
} catch (Exception e) { - throw new CompletionException(e); + throw new RuntimeException("Failed to setup index bootstrap function", e); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java` around lines 428 - 447, The test helper method setupIndexBootstrapFunction currently catches Exception and rethrows it as a CompletionException which can obscure the real root cause; change the catch to rethrow a more appropriate unchecked exception used across the codebase (e.g., HoodieException) or at minimum a RuntimeException to preserve stack traces and consistency—update the catch block that now does "throw new CompletionException(e)" to instead throw new HoodieException(e) (or new RuntimeException(e)) so failures from bootstrapOperator.initializeState or bucketAssignerFunction.processElement surface clearly.hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/TestHoodieTableFactory.java (1)
231-238: Consider cleaning up config after test scenario.The test adds
DEFER_RLI_INIT_FOR_FRESH_TABLEconfiguration but doesn't clean it up. While this test method ends after the assertion, leaving the configuration set could affect subsequent tests if the test order changes or if additional scenarios are added to this method.♻️ Add config cleanup
final MockContext sourceContext6 = MockContext.getInstance(this.conf, schema, "f2"); assertThrows(IllegalArgumentException.class, () -> new HoodieTableFactory().createDynamicTableSink(sourceContext6)); + this.conf.removeConfig(FlinkOptions.INDEX_TYPE); + this.conf.removeConfig(FlinkOptions.METADATA_ENABLED); + this.conf.removeConfig(FlinkOptions.INDEX_GLOBAL_ENABLED); + this.conf.setString(HoodieMetadataConfig.DEFER_RLI_INIT_FOR_FRESH_TABLE.key(), "false"); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/TestHoodieTableFactory.java` around lines 231 - 238, The test sets HoodieMetadataConfig.DEFER_RLI_INIT_FOR_FRESH_TABLE in this.conf but does not restore or clear it, which can leak state to other tests; wrap the creation/assertion that uses the modified config (the block creating MockContext.getInstance(this.conf, schema, "f2") and calling new HoodieTableFactory().createDynamicTableSink(...)) in a try/finally (or reset the conf afterwards) and in the finally remove or reset the DEFER_RLI_INIT_FOR_FRESH_TABLE entry from this.conf so the config is cleaned up for subsequent tests.hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionCommitSink.java (1)
76-80: Consider null-check forcompactCommitHandlerinclose().If
open()fails after the factory call or before completing,compactCommitHandlercould potentially be null whenclose()is invoked (though Flink typically won't callclose()ifopen()throws). A defensive null check would be safer.🛡️ Optional defensive fix
`@Override` public void close() throws Exception { + if (compactCommitHandler != null) { compactCommitHandler.close(); + } super.close(); }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactionCommitSink.java` around lines 76 - 80, The close() method may call compactCommitHandler.close() when compactCommitHandler is null if open() failed, so modify CompactionCommitSink.close() to defensively check compactCommitHandler != null before invoking compactCommitHandler.close(); ensure you still call super.close() afterward (and preserve the method signature throws Exception) so resources are cleaned even when the handler was never initialized.hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/MetadataTableCompactionCommitHandler.java (1)
82-100: Minor: Stray blank line between@Overrideand method signature.Line 83 has a blank line between
@Overrideandprotected void completeCompaction(...). This is unconventional and may cause style checker warnings.🔧 Suggested fix
`@Override` - protected void completeCompaction(String instant,🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/handler/MetadataTableCompactionCommitHandler.java` around lines 82 - 100, Remove the stray blank line between the `@Override` annotation and the method signature for completeCompaction in MetadataTableCompactionCommitHandler; ensure the `@Override` directly precedes the protected void completeCompaction(String instant, boolean isLogCompaction, List<WriteStatus> statuses) throws IOException declaration so the annotation attaches correctly and satisfies style checks.hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactOperator.java (1)
125-131: Consider null-check forcompactHandlerinclose().Similar to
CompactionCommitSink, ifopen()fails before the handler is assigned,compactHandler.close()at line 130 could throw NPE.🛡️ Optional defensive fix
`@Override` public void close() throws Exception { if (null != this.executor) { this.executor.close(); } + if (compactHandler != null) { compactHandler.close(); + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/compact/CompactOperator.java` around lines 125 - 131, The close() implementation in CompactOperator calls compactHandler.close() without a null-check which can NPE if open() failed before compactHandler was assigned; update CompactOperator.close() to only call compactHandler.close() when compactHandler != null (mirroring the defensive pattern used for executor and in CompactionCommitSink), and ensure any exceptions during close are propagated or logged consistently with existing resource cleanup semantics.
| # Parse command-line arguments | ||
| while [[ "$#" -gt 0 ]]; do | ||
| case $1 in | ||
| --hadoop-version) HADOOP_VERSION="$2"; shift ;; | ||
| --spark-version) SPARK_VERSION="$2"; shift ;; | ||
| --hive-version) HIVE_VERSION="$2"; shift ;; | ||
| --version-tag) VERSION_TAG_ARG="$2"; shift ;; | ||
| *) echo "Unknown parameter passed: $1"; exit 1 ;; | ||
| esac | ||
| shift | ||
| done |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, find the file
find . -name "build_docker_images.sh" -type fRepository: yihua/hudi
Length of output: 85
🏁 Script executed:
cd /repo && cat -n docker/build_docker_images.sh | head -60Repository: yihua/hudi
Length of output: 109
🏁 Script executed:
cat -n docker/build_docker_images.sh | head -60Repository: yihua/hudi
Length of output: 2720
🏁 Script executed:
# Test the vulnerability claim by simulating the argument parsing
bash -x docker/build_docker_images.sh --hadoop-version 2>&1 | head -50Repository: yihua/hudi
Length of output: 1235
🏁 Script executed:
# Test with the fix to verify it would work correctly
cat > /tmp/test_fix.sh << 'EOF'
#!/bin/bash
HADOOP_VERSION="2.8.4"
SPARK_VERSION="3.5.3"
HIVE_VERSION="2.3.10"
VERSION_TAG_ARG=""
# Parse with validation (the suggested fix)
while [[ "$#" -gt 0 ]]; do
case "$1" in
--hadoop-version)
[[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --hadoop-version requires a value" >&2; exit 1; }
HADOOP_VERSION="$2"; shift 2 ;;
--spark-version)
[[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --spark-version requires a value" >&2; exit 1; }
SPARK_VERSION="$2"; shift 2 ;;
--hive-version)
[[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --hive-version requires a value" >&2; exit 1; }
HIVE_VERSION="$2"; shift 2 ;;
--version-tag)
[[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --version-tag requires a value" >&2; exit 1; }
VERSION_TAG_ARG="$2"; shift 2 ;;
*) echo "Unknown parameter passed: $1" >&2; exit 1 ;;
esac
done
echo "HADOOP_VERSION=$HADOOP_VERSION"
echo "SPARK_VERSION=$SPARK_VERSION"
echo "HIVE_VERSION=$HIVE_VERSION"
echo "VERSION_TAG_ARG=$VERSION_TAG_ARG"
EOF
# Test the original issue
echo "=== Testing original code issue ==="
bash -x docker/build_docker_images.sh --hadoop-version 2>&1 | grep -A2 "HADOOP_VERSION="
# Test fixed version with missing value
echo -e "\n=== Testing fixed version (should fail with missing value) ==="
bash /tmp/test_fix.sh --hadoop-version 2>&1 || echo "Exit code: $?"
# Test fixed version with valid value
echo -e "\n=== Testing fixed version (should succeed with valid value) ==="
bash /tmp/test_fix.sh --hadoop-version 2.10.0 --spark-version 3.2.0Repository: yihua/hudi
Length of output: 455
Missing value validation for CLI flags allows silent failures with empty variables.
At lines 42–45, the script assumes $2 exists without validation. Passing --hadoop-version (or similar) without a value silently sets the variable to an empty string instead of reporting an error.
Test case: Running ./build_docker_images.sh --hadoop-version sets HADOOP_VERSION="" instead of failing.
Add validation to ensure each flag receives a non-empty, non-flag argument:
Suggested fix
while [[ "$#" -gt 0 ]]; do
- case $1 in
- --hadoop-version) HADOOP_VERSION="$2"; shift ;;
- --spark-version) SPARK_VERSION="$2"; shift ;;
- --hive-version) HIVE_VERSION="$2"; shift ;;
- --version-tag) VERSION_TAG_ARG="$2"; shift ;;
- *) echo "Unknown parameter passed: $1"; exit 1 ;;
- esac
- shift
+ case "$1" in
+ --hadoop-version)
+ [[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --hadoop-version requires a value" >&2; exit 1; }
+ HADOOP_VERSION="$2"; shift 2 ;;
+ --spark-version)
+ [[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --spark-version requires a value" >&2; exit 1; }
+ SPARK_VERSION="$2"; shift 2 ;;
+ --hive-version)
+ [[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --hive-version requires a value" >&2; exit 1; }
+ HIVE_VERSION="$2"; shift 2 ;;
+ --version-tag)
+ [[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --version-tag requires a value" >&2; exit 1; }
+ VERSION_TAG_ARG="$2"; shift 2 ;;
+ *) echo "Unknown parameter passed: $1" >&2; exit 1 ;;
+ esac
done📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Parse command-line arguments | |
| while [[ "$#" -gt 0 ]]; do | |
| case $1 in | |
| --hadoop-version) HADOOP_VERSION="$2"; shift ;; | |
| --spark-version) SPARK_VERSION="$2"; shift ;; | |
| --hive-version) HIVE_VERSION="$2"; shift ;; | |
| --version-tag) VERSION_TAG_ARG="$2"; shift ;; | |
| *) echo "Unknown parameter passed: $1"; exit 1 ;; | |
| esac | |
| shift | |
| done | |
| # Parse command-line arguments | |
| while [[ "$#" -gt 0 ]]; do | |
| case "$1" in | |
| --hadoop-version) | |
| [[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --hadoop-version requires a value" >&2; exit 1; } | |
| HADOOP_VERSION="$2"; shift 2 ;; | |
| --spark-version) | |
| [[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --spark-version requires a value" >&2; exit 1; } | |
| SPARK_VERSION="$2"; shift 2 ;; | |
| --hive-version) | |
| [[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --hive-version requires a value" >&2; exit 1; } | |
| HIVE_VERSION="$2"; shift 2 ;; | |
| --version-tag) | |
| [[ -n "${2:-}" && "$2" != --* ]] || { echo "Error: --version-tag requires a value" >&2; exit 1; } | |
| VERSION_TAG_ARG="$2"; shift 2 ;; | |
| *) echo "Unknown parameter passed: $1" >&2; exit 1 ;; | |
| esac | |
| done |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docker/build_docker_images.sh` around lines 39 - 49, The argument-parsing
loop (while [[ "$#" -gt 0 ]], case entries for --hadoop-version,
--spark-version, --hive-version, --version-tag) currently assigns $2 without
validation, so flags with missing values result in empty variables; update each
flag branch to validate that $2 exists and is not another flag (e.g., test -n
"$2" and [[ "$2" != -* ]] or ! -z "$2" && "${2:0:1}" != "-"), and if the check
fails print a clear error mentioning the specific flag (e.g., "--hadoop-version
requires a value") and exit non-zero, otherwise assign the value to
HADOOP_VERSION/SPARK_VERSION/HIVE_VERSION/VERSION_TAG_ARG and shift as before.
| if (!lockProps.containsKey(LockConfiguration.FILESYSTEM_LOCK_PATH_PROP_KEY)) { | ||
| throw new IllegalArgumentException(LockConfiguration.FILESYSTEM_LOCK_PATH_PROP_KEY | ||
| + " must be explicitly set when using " + lockProviderClass | ||
| + ". Without it, the lock path is derived from the table's base path," | ||
| + " which would resolve to a different location when building a lock config for a different table."); | ||
| } | ||
| } else if (ZookeeperBasedLockProvider.class.getCanonicalName().equals(lockProviderClass)) { | ||
| copyPropsWithPrefix(dataProps, lockProps, LockConfiguration.ZOOKEEPER_BASED_LOCK_PROPERTY_PREFIX); | ||
| if (!lockProps.containsKey(LockConfiguration.ZK_LOCK_KEY_PROP_KEY)) { | ||
| throw new IllegalArgumentException(LockConfiguration.ZK_LOCK_KEY_PROP_KEY | ||
| + " must be explicitly set when using " + lockProviderClass | ||
| + ". The inferred default from table name is not propagated" | ||
| + " when building a lock config for a different table."); | ||
| } | ||
| } else if (ZookeeperBasedImplicitBasePathLockProvider.class.getCanonicalName().equals(lockProviderClass)) { | ||
| throw new IllegalArgumentException(lockProviderClass | ||
| + " is not supported because it derives its lock identity from the table's base path," | ||
| + " which would resolve to a different lock when building a lock config for a different table." | ||
| + " Use " + ZookeeperBasedLockProvider.class.getCanonicalName() + " with an explicit " | ||
| + LockConfiguration.ZK_LOCK_KEY_PROP_KEY + " instead."); | ||
| } else if (HIVE_METASTORE_BASED_LOCK_PROVIDER_CLASS.equals(lockProviderClass)) { | ||
| copyPropsWithPrefix(dataProps, lockProps, LockConfiguration.HIVE_METASTORE_LOCK_PROPERTY_PREFIX); | ||
| if (!lockProps.containsKey(LockConfiguration.HIVE_DATABASE_NAME_PROP_KEY)) { | ||
| throw new IllegalArgumentException(LockConfiguration.HIVE_DATABASE_NAME_PROP_KEY | ||
| + " must be explicitly set when using " + lockProviderClass); | ||
| } | ||
| if (!lockProps.containsKey(LockConfiguration.HIVE_TABLE_NAME_PROP_KEY)) { | ||
| throw new IllegalArgumentException(LockConfiguration.HIVE_TABLE_NAME_PROP_KEY | ||
| + " must be explicitly set when using " + lockProviderClass); | ||
| } | ||
| } else if (DYNAMODB_BASED_LOCK_PROVIDER_CLASS.equals(lockProviderClass)) { | ||
| copyPropsWithPrefix(dataProps, lockProps, DYNAMODB_BASED_LOCK_PROPERTY_PREFIX); | ||
| if (!lockProps.containsKey(DYNAMODB_LOCK_PARTITION_KEY_PROP_KEY)) { | ||
| throw new IllegalArgumentException(DYNAMODB_LOCK_PARTITION_KEY_PROP_KEY | ||
| + " must be explicitly set when using " + lockProviderClass | ||
| + ". The inferred default from table name is not propagated" | ||
| + " when building a lock config for a different table."); | ||
| } |
There was a problem hiding this comment.
Validate required lock-identity properties as non-blank, not just present.
Current checks use containsKey(...), so empty values pass validation and can still produce invalid lock identities (for example an empty ZK lock key or filesystem lock path).
Proposed fix
- if (!lockProps.containsKey(LockConfiguration.FILESYSTEM_LOCK_PATH_PROP_KEY)) {
+ if (isBlank(lockProps.getProperty(LockConfiguration.FILESYSTEM_LOCK_PATH_PROP_KEY))) {
throw new IllegalArgumentException(LockConfiguration.FILESYSTEM_LOCK_PATH_PROP_KEY
+ " must be explicitly set when using " + lockProviderClass
+ ". Without it, the lock path is derived from the table's base path,"
+ " which would resolve to a different location when building a lock config for a different table.");
}
@@
- if (!lockProps.containsKey(LockConfiguration.ZK_LOCK_KEY_PROP_KEY)) {
+ if (isBlank(lockProps.getProperty(LockConfiguration.ZK_LOCK_KEY_PROP_KEY))) {
throw new IllegalArgumentException(LockConfiguration.ZK_LOCK_KEY_PROP_KEY
+ " must be explicitly set when using " + lockProviderClass
+ ". The inferred default from table name is not propagated"
+ " when building a lock config for a different table.");
}
@@
- if (!lockProps.containsKey(LockConfiguration.HIVE_DATABASE_NAME_PROP_KEY)) {
+ if (isBlank(lockProps.getProperty(LockConfiguration.HIVE_DATABASE_NAME_PROP_KEY))) {
throw new IllegalArgumentException(LockConfiguration.HIVE_DATABASE_NAME_PROP_KEY
+ " must be explicitly set when using " + lockProviderClass);
}
- if (!lockProps.containsKey(LockConfiguration.HIVE_TABLE_NAME_PROP_KEY)) {
+ if (isBlank(lockProps.getProperty(LockConfiguration.HIVE_TABLE_NAME_PROP_KEY))) {
throw new IllegalArgumentException(LockConfiguration.HIVE_TABLE_NAME_PROP_KEY
+ " must be explicitly set when using " + lockProviderClass);
}
@@
- if (!lockProps.containsKey(DYNAMODB_LOCK_PARTITION_KEY_PROP_KEY)) {
+ if (isBlank(lockProps.getProperty(DYNAMODB_LOCK_PARTITION_KEY_PROP_KEY))) {
throw new IllegalArgumentException(DYNAMODB_LOCK_PARTITION_KEY_PROP_KEY
+ " must be explicitly set when using " + lockProviderClass
+ ". The inferred default from table name is not propagated"
+ " when building a lock config for a different table.");
}
@@
private static void copyPropsWithPrefix(TypedProperties source, Properties target, String prefix) {
for (String key : source.stringPropertyNames()) {
if (key.startsWith(prefix)) {
target.setProperty(key, source.getProperty(key));
}
}
}
+
+ private static boolean isBlank(String value) {
+ return value == null || value.trim().isEmpty();
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java`
around lines 308 - 345, The current validations in HoodieLockConfig use
lockProps.containsKey(...) which allows empty values; update each validation
after copyPropsWithPrefix(...) to fetch the value (e.g.,
lockProps.get(LockConfiguration.FILESYSTEM_LOCK_PATH_PROP_KEY)) and throw
IllegalArgumentException when the value is null or blank
(value.trim().isEmpty()), not just when the key is missing. Apply this non-blank
check for the keys LockConfiguration.FILESYSTEM_LOCK_PATH_PROP_KEY,
LockConfiguration.ZK_LOCK_KEY_PROP_KEY,
LockConfiguration.HIVE_DATABASE_NAME_PROP_KEY,
LockConfiguration.HIVE_TABLE_NAME_PROP_KEY and
DYNAMODB_LOCK_PARTITION_KEY_PROP_KEY in the same conditional branches (those
that reference ZookeeperBasedLockProvider,
ZookeeperBasedImplicitBasePathLockProvider,
HIVE_METASTORE_BASED_LOCK_PROVIDER_CLASS and
DYNAMODB_BASED_LOCK_PROVIDER_CLASS), leaving copyPropsWithPrefix(...) usage
intact.
| /** | ||
| * Generates the header for a rollback command block. | ||
| */ | ||
| private Map<HoodieLogBlock.HeaderMetadataType, String> generateHeader(String commit) { | ||
| Map<HoodieLogBlock.HeaderMetadataType, String> header = new HashMap<>(3); | ||
| header.put(HoodieLogBlock.HeaderMetadataType.INSTANT_TIME, metaClient.getActiveTimeline().lastInstant().get().requestedTime()); | ||
| header.put(HoodieLogBlock.HeaderMetadataType.TARGET_INSTANT_TIME, commit); | ||
| header.put(HoodieLogBlock.HeaderMetadataType.COMMAND_BLOCK_TYPE, | ||
| String.valueOf(HoodieCommandBlock.HoodieCommandBlockTypeEnum.ROLLBACK_BLOCK.ordinal())); | ||
| return header; | ||
| } |
There was a problem hiding this comment.
Potential NoSuchElementException if timeline has no instants.
Line 163 calls .get() on metaClient.getActiveTimeline().lastInstant() without checking if the Option is present. If the active timeline is empty, this will throw NoSuchElementException.
Consider adding a presence check or using orElseThrow with a descriptive exception:
🛡️ Proposed fix
private Map<HoodieLogBlock.HeaderMetadataType, String> generateHeader(String commit) {
Map<HoodieLogBlock.HeaderMetadataType, String> header = new HashMap<>(3);
- header.put(HoodieLogBlock.HeaderMetadataType.INSTANT_TIME, metaClient.getActiveTimeline().lastInstant().get().requestedTime());
+ header.put(HoodieLogBlock.HeaderMetadataType.INSTANT_TIME,
+ metaClient.getActiveTimeline().lastInstant()
+ .orElseThrow(() -> new HoodieRollbackException("Cannot generate rollback header: active timeline has no instants"))
+ .requestedTime());
header.put(HoodieLogBlock.HeaderMetadataType.TARGET_INSTANT_TIME, commit);
header.put(HoodieLogBlock.HeaderMetadataType.COMMAND_BLOCK_TYPE,
String.valueOf(HoodieCommandBlock.HoodieCommandBlockTypeEnum.ROLLBACK_BLOCK.ordinal()));
return header;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /** | |
| * Generates the header for a rollback command block. | |
| */ | |
| private Map<HoodieLogBlock.HeaderMetadataType, String> generateHeader(String commit) { | |
| Map<HoodieLogBlock.HeaderMetadataType, String> header = new HashMap<>(3); | |
| header.put(HoodieLogBlock.HeaderMetadataType.INSTANT_TIME, metaClient.getActiveTimeline().lastInstant().get().requestedTime()); | |
| header.put(HoodieLogBlock.HeaderMetadataType.TARGET_INSTANT_TIME, commit); | |
| header.put(HoodieLogBlock.HeaderMetadataType.COMMAND_BLOCK_TYPE, | |
| String.valueOf(HoodieCommandBlock.HoodieCommandBlockTypeEnum.ROLLBACK_BLOCK.ordinal())); | |
| return header; | |
| } | |
| /** | |
| * Generates the header for a rollback command block. | |
| */ | |
| private Map<HoodieLogBlock.HeaderMetadataType, String> generateHeader(String commit) { | |
| Map<HoodieLogBlock.HeaderMetadataType, String> header = new HashMap<>(3); | |
| header.put(HoodieLogBlock.HeaderMetadataType.INSTANT_TIME, | |
| metaClient.getActiveTimeline().lastInstant() | |
| .orElseThrow(() -> new HoodieRollbackException("Cannot generate rollback header: active timeline has no instants")) | |
| .requestedTime()); | |
| header.put(HoodieLogBlock.HeaderMetadataType.TARGET_INSTANT_TIME, commit); | |
| header.put(HoodieLogBlock.HeaderMetadataType.COMMAND_BLOCK_TYPE, | |
| String.valueOf(HoodieCommandBlock.HoodieCommandBlockTypeEnum.ROLLBACK_BLOCK.ordinal())); | |
| return header; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/rollback/RollbackHelperV1.java`
around lines 158 - 168, The generateHeader method calls
metaClient.getActiveTimeline().lastInstant().get() which can throw
NoSuchElementException if the active timeline is empty; update generateHeader to
check presence of lastInstant() (e.g., use lastInstant().isPresent() or
lastInstant().orElseThrow(...)) and handle the empty-case by either throwing a
clear descriptive exception (with context including commit) or providing a
sensible default; ensure you reference
HoodieLogBlock.HeaderMetadataType.INSTANT_TIME and set its value only when the
instant is present (or use the explicit orElseThrow) so generateHeader never
blindly calls .get().
| } else if (dataType instanceof StructType | ||
| && SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantShreddingStruct((StructType) dataType)) { | ||
| return makeShreddedVariantWriter((StructType) dataType); |
There was a problem hiding this comment.
Pass the original HoodieSchema into shredded child writers.
makeShreddedVariantWriter() rebuilds every nested writer with null schema, so logical-type handling inside typed_value silently falls back to defaults. That is unsafe for types whose Parquet encoding depends on HoodieSchema rather than Spark DataType alone — most notably VECTOR and millisecond TIMESTAMP. In those cases convertField() still emits the Parquet type from the original schema, but the writer serializes a different physical shape. Example: a shredded VECTOR typed value will get a FIXED_LEN_BYTE_ARRAY schema and a LIST writer.
💡 Proposed fix
- } else if (dataType instanceof StructType
- && SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantShreddingStruct((StructType) dataType)) {
- return makeShreddedVariantWriter((StructType) dataType);
+ } else if (dataType instanceof StructType
+ && SparkAdapterSupport$.MODULE$.sparkAdapter().isVariantShreddingStruct((StructType) dataType)) {
+ return makeShreddedVariantWriter((StructType) dataType, resolvedSchema);
} else if (dataType instanceof StructType) {
StructType structType = (StructType) dataType;
ValueWriter[] fieldWriters = getFieldWriters(structType, resolvedSchema);
@@
- private ValueWriter makeShreddedVariantWriter(StructType shreddedStructType) {
+ private ValueWriter makeShreddedVariantWriter(StructType shreddedStructType, HoodieSchema sourceSchema) {
// Create writers for the shredded struct fields
// The shreddedStructType contains: metadata (binary), value (binary), typed_value (optional)
ValueWriter[] shreddedFieldWriters = Arrays.stream(shreddedStructType.fields())
- .map(field -> makeWriter(null, field.dataType()))
+ .map(field -> {
+ HoodieSchema fieldSchema = Option.ofNullable(sourceSchema)
+ .map(HoodieSchema::getNonNullType)
+ .flatMap(s -> s.hasFields() ? s.getField(field.name()) : Option.empty())
+ .map(HoodieSchemaField::schema)
+ .orElse(null);
+ return makeWriter(fieldSchema, field.dataType());
+ })
.toArray(ValueWriter[]::new);Also applies to: 573-578
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java`
around lines 551 - 553, The shredded-variant writer is being built with null
schema which causes nested writers to lose HoodieSchema-driven logical-type info
(affecting VECTOR, millisecond TIMESTAMP, etc.); update
makeShreddedVariantWriter in HoodieRowParquetWriteSupport to accept the original
HoodieSchema (pass the HoodieSchema from the caller where you currently call
makeShreddedVariantWriter) and propagate that HoodieSchema into all child-writer
construction paths so convertField/typed_value logic uses the provided
HoodieSchema rather than null; also update the other call sites that build
shredded writers (the similar calls around the 573-578 area) to pass the same
HoodieSchema through.
| protected List<PartitionPath> listPartitionPaths(List<String> relativePartitionPaths, | ||
| Types.RecordType partitionFields, | ||
| Expression partitionColumnPredicates) { | ||
| Expression partitionColumnPredicates, | ||
| List<Object> partitionPredicateExpressions) { | ||
| List<String> matchedPartitionPaths; | ||
| try { | ||
| matchedPartitionPaths = tableMetadata.getPartitionPathWithPathPrefixUsingFilterExpression(relativePartitionPaths, | ||
| partitionFields, partitionColumnPredicates); | ||
| partitionFields, partitionColumnPredicates, partitionPredicateExpressions); |
There was a problem hiding this comment.
Preserve incremental partition scoping in this new overload.
Line 361 adds a second listPartitionPaths(...) path, but unlike the existing overload it never applies the incremental-read restriction from TimelineUtils.getWrittenPartitions(findInstantsInRange()). That means partition-filtered incremental queries can list partitions that match the predicate but were not written in the requested range, and then enumerate the wrong file slices. Please keep the incremental branch here too, or intersect the filtered result with the written partitions before converting to PartitionPath.
🐛 Suggested fix
protected List<PartitionPath> listPartitionPaths(List<String> relativePartitionPaths,
Types.RecordType partitionFields,
Expression partitionColumnPredicates,
List<Object> partitionPredicateExpressions) {
List<String> matchedPartitionPaths;
try {
matchedPartitionPaths = tableMetadata.getPartitionPathWithPathPrefixUsingFilterExpression(relativePartitionPaths,
partitionFields, partitionColumnPredicates, partitionPredicateExpressions);
+ if (queryType == HoodieTableQueryType.INCREMENTAL
+ && incrementalQueryStartTime.isPresent()
+ && !isBeforeTimelineStarts()) {
+ Set<String> writtenPartitions = TimelineUtils.getWrittenPartitions(findInstantsInRange()).stream()
+ .collect(Collectors.toSet());
+ matchedPartitionPaths = matchedPartitionPaths.stream()
+ .filter(writtenPartitions::contains)
+ .collect(Collectors.toList());
+ }
} catch (IOException e) {
throw new HoodieIOException("Error fetching partition paths", e);
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java`
around lines 361 - 368, The new overload of listPartitionPaths in
BaseHoodieTableFileIndex currently calls
tableMetadata.getPartitionPathWithPathPrefixUsingFilterExpression(...) but omits
the incremental-read scoping; update this method to apply the same incremental
partition restriction used by the other overload by obtaining the written
partitions via TimelineUtils.getWrittenPartitions(findInstantsInRange()) (or by
intersecting the matchedPartitionPaths result with that written-partitions set)
before converting to List<PartitionPath>, ensuring only partitions written in
the requested instant range are returned.
| @Override | ||
| public void skipBytesToRead(int numBytes) throws IOException { | ||
| while (numBytes > 0) { | ||
| int skipped = skipBytes(numBytes); | ||
| numBytes -= skipped; | ||
| } | ||
| } |
There was a problem hiding this comment.
Potential infinite loop when end-of-stream is reached.
DataInputStream.skipBytes(int) can return 0 when the end of stream is reached or when no bytes are available. This would cause an infinite loop since numBytes would never decrease.
🐛 Proposed fix to add end-of-stream check
`@Override`
public void skipBytesToRead(int numBytes) throws IOException {
while (numBytes > 0) {
int skipped = skipBytes(numBytes);
+ if (skipped <= 0) {
+ throw new IOException("Unable to skip remaining " + numBytes + " bytes; end of stream reached");
+ }
numBytes -= skipped;
}
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcImageManager.java`
around lines 169 - 175, The skipBytesToRead method can loop forever if
skipBytes(numBytes) returns 0 at end-of-stream; modify
CdcImageManager.skipBytesToRead to detect skipped == 0 and throw an EOFException
(or IOException) so the caller fails instead of spinning, or alternatively call
readFully to consume remaining bytes; specifically update skipBytesToRead to
check the return value of skipBytes and break/throw on zero, referencing the
skipBytesToRead and skipBytes methods.
| @Override | ||
| public boolean hasNext() { | ||
| if (recordIterator != null) { | ||
| if (recordIterator.hasNext()) { | ||
| return true; | ||
| } else { | ||
| recordIterator.close(); | ||
| recordIterator = null; | ||
| } | ||
| } | ||
| if (fileSplitIterator.hasNext()) { | ||
| recordIterator = recordIteratorFunc.apply(fileSplitIterator.next()); | ||
| return recordIterator.hasNext(); | ||
| } | ||
| return false; | ||
| } |
There was a problem hiding this comment.
Empty splits cause premature termination and potential resource leak.
If recordIteratorFunc returns an iterator with no elements, hasNext() returns false immediately on line 123 without:
- Closing the empty
recordIterator - Continuing to try the next file split
This can cause the iterator to incorrectly report no more data when subsequent splits may have records.
🐛 Proposed fix using a loop to skip empty splits
`@Override`
public boolean hasNext() {
- if (recordIterator != null) {
- if (recordIterator.hasNext()) {
- return true;
- } else {
- recordIterator.close();
- recordIterator = null;
+ while (true) {
+ if (recordIterator != null) {
+ if (recordIterator.hasNext()) {
+ return true;
+ } else {
+ recordIterator.close();
+ recordIterator = null;
+ }
}
- }
- if (fileSplitIterator.hasNext()) {
- recordIterator = recordIteratorFunc.apply(fileSplitIterator.next());
- return recordIterator.hasNext();
+ if (fileSplitIterator.hasNext()) {
+ recordIterator = recordIteratorFunc.apply(fileSplitIterator.next());
+ } else {
+ return false;
+ }
}
- return false;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| @Override | |
| public boolean hasNext() { | |
| if (recordIterator != null) { | |
| if (recordIterator.hasNext()) { | |
| return true; | |
| } else { | |
| recordIterator.close(); | |
| recordIterator = null; | |
| } | |
| } | |
| if (fileSplitIterator.hasNext()) { | |
| recordIterator = recordIteratorFunc.apply(fileSplitIterator.next()); | |
| return recordIterator.hasNext(); | |
| } | |
| return false; | |
| } | |
| `@Override` | |
| public boolean hasNext() { | |
| while (true) { | |
| if (recordIterator != null) { | |
| if (recordIterator.hasNext()) { | |
| return true; | |
| } else { | |
| recordIterator.close(); | |
| recordIterator = null; | |
| } | |
| } | |
| if (fileSplitIterator.hasNext()) { | |
| recordIterator = recordIteratorFunc.apply(fileSplitIterator.next()); | |
| } else { | |
| return false; | |
| } | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/format/cdc/CdcIterators.java`
around lines 111 - 126, hasNext() currently returns false if recordIteratorFunc
produces an empty iterator, leaking the empty recordIterator and stopping
iteration prematurely; change hasNext() so that when fileSplitIterator.hasNext()
yields a candidate, you create recordIterator via recordIteratorFunc.apply(...),
and if recordIterator.hasNext() is false you must call recordIterator.close(),
set recordIterator to null, and continue looping to the next file split (use a
while loop over fileSplitIterator) until you either find a non-empty
recordIterator or exhaust splits; ensure any closed/empty recordIterator is
nulled to avoid leaks.
| public boolean isAlreadyBootstrap() throws Exception { | ||
| return this.bootstrapOperator.isAlreadyBootstrap(); | ||
| if (this.bootstrapOperator instanceof BootstrapOperator) { | ||
| return ((BootstrapOperator) this.bootstrapOperator).isAlreadyBootstrap(); | ||
| } | ||
| return this.bootstrapOperator != null; | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check if RLIBootstrapOperator has an isAlreadyBootstrap method or inherits one
ast-grep --pattern 'class RLIBootstrapOperator {
$$$
isAlreadyBootstrap($$$) {
$$$
}
$$$
}'
# Also check AbstractBootstrapOperator
rg -n "isAlreadyBootstrap" --type java -C 3Repository: yihua/hudi
Length of output: 4008
🏁 Script executed:
# Search for RLIBootstrapOperator class definition
rg -n "class RLIBootstrapOperator" --type java -A 20
# Search for AbstractBootstrapOperator definition
rg -n "class AbstractBootstrapOperator\|interface AbstractBootstrapOperator" --type java -A 15Repository: yihua/hudi
Length of output: 3054
🏁 Script executed:
# Search for AbstractBootstrapOperator definition
rg -n "abstract class AbstractBootstrapOperator\|class AbstractBootstrapOperator" --type java -A 50
# Also check complete RLIBootstrapOperator for isAlreadyBootstrap
rg -n "isAlreadyBootstrap" hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bootstrap/RLIBootstrapOperator.javaRepository: yihua/hudi
Length of output: 36
🏁 Script executed:
# Find AbstractBootstrapOperator file
fd "AbstractBootstrapOperator" --type f
# Search for AbstractBootstrapOperator more broadly
rg -n "AbstractBootstrapOperator" --type java | head -20Repository: yihua/hudi
Length of output: 1124
🏁 Script executed:
# Read AbstractBootstrapOperator class definition
cat -n hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bootstrap/AbstractBootstrapOperator.javaRepository: yihua/hudi
Length of output: 4037
🏁 Script executed:
# Check BootstrapOperator to understand the full picture
cat -n hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/bootstrap/BootstrapOperator.java | grep -A 5 -B 5 "isAlreadyBootstrap"Repository: yihua/hudi
Length of output: 377
Consider exposing bootstrap completion state for RLIBootstrapOperator.
The fallback logic return this.bootstrapOperator != null assumes any non-BootstrapOperator AbstractBootstrapOperator is fully bootstrapped when instantiated. Since RLIBootstrapOperator does not override or expose isAlreadyBootstrap(), the method cannot accurately reflect whether its bootstrap process has completed. Either add an isAlreadyBootstrap() method to RLIBootstrapOperator or define a default implementation in AbstractBootstrapOperator that all subclasses inherit.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/StreamWriteFunctionWrapper.java`
around lines 398 - 403, The method
StreamWriteFunctionWrapper.isAlreadyBootstrap() incorrectly treats any
non-BootstrapOperator AbstractBootstrapOperator instance as already bootstrapped
by returning bootstrapOperator != null; update the design so the bootstrap
completion is explicitly queryable: add an isAlreadyBootstrap() implementation
to RLIBootstrapOperator (or add a default isAlreadyBootstrap() in
AbstractBootstrapOperator) and have StreamWriteFunctionWrapper call that method
instead of using the null check; ensure the symbol names involved are
AbstractBootstrapOperator.isAlreadyBootstrap(),
RLIBootstrapOperator.isAlreadyBootstrap(), and
StreamWriteFunctionWrapper.isAlreadyBootstrap() so the wrapper can accurately
reflect bootstrap completion.
| // Should not throw exception | ||
| function.read(cdcSplit); |
There was a problem hiding this comment.
Direct read(cdcSplit) call makes this test unstable and contradicts test intent.
Line 241 now performs real read without guarding expected downstream I/O failure. In this temp setup, that can fail for reasons unrelated to type acceptance and turn the test flaky/failing.
Proposed fix
- // Should not throw exception
- function.read(cdcSplit);
+ Exception ex = assertThrows(Exception.class, () -> function.read(cdcSplit));
+ if (ex instanceof IllegalArgumentException) {
+ throw new AssertionError("read() should accept HoodieCdcSourceSplit; IllegalArgumentException indicates type-guard rejection", ex);
+ }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/source/reader/function/TestHoodieCdcSplitReaderFunction.java`
around lines 240 - 241, The direct call to function.read(cdcSplit) can surface
unrelated downstream I/O failures and make the test flaky; replace the plain
call with a guarded invocation that catches IOException from the read and
ignores or treats it as non-fatal for this test (while keeping any assertions
about type acceptance intact). Specifically, modify the test around
function.read(cdcSplit) to wrap the call in a try-catch for IOException
(catching exceptions thrown by function.read) and suppress/ignore those
IOExceptions so the test only validates the intended type-acceptance behavior of
the reader.
| val partitionPredicateExpressionSeq = partitionPredicateExpressions.asScala.map(_.asInstanceOf[Expression]).toSeq | ||
| filterPartitionsBasedOnRelativePathPrefixes(relativePathPrefix, | ||
| sparkSession.sessionState.catalog.externalCatalog | ||
| .listPartitionsByFilter(catalogDatabaseName, catalogTableName, partitionPredicateExpressionSeq, | ||
| SQLConf.get.sessionLocalTimeZone)) |
There was a problem hiding this comment.
Guard nullable/heterogeneous predicate list before Spark-catalyst casting.
At Line 100, direct .asScala.map(_.asInstanceOf[Expression]) can fail with NPE (null list) or ClassCastException (unexpected object type), causing partition listing failures in catalog mode.
Defensive fix
- val partitionPredicateExpressionSeq = partitionPredicateExpressions.asScala.map(_.asInstanceOf[Expression]).toSeq
+ val partitionPredicateExpressionSeq: Seq[Expression] =
+ Option(partitionPredicateExpressions)
+ .map(_.asScala.map {
+ case e: Expression => e
+ case other => throw new IllegalArgumentException(
+ s"Unsupported partition predicate expression type: ${other.getClass.getName}")
+ }.toSeq)
+ .getOrElse(Seq.empty)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/metadata/CatalogBackedTableMetadata.scala`
around lines 100 - 104, The partition predicate list is being cast unsafely in
CatalogBackedTableMetadata, risking NPE or ClassCastException; before calling
sparkSession.sessionState.catalog.externalCatalog.listPartitionsByFilter,
defensively handle partitionPredicateExpressions by treating null as empty and
only converting/casting values that are actually catalyst Expressions (e.g., use
an Option or null check and a safe collect/flatMap to keep only instances of
org.apache.spark.sql.catalyst.expressions.Expression), producing a
Seq[Expression] to pass into filterPartitionsBasedOnRelativePathPrefixes so no
nulls or wrong-typed elements reach the catalog call.
There was a problem hiding this comment.
🧹 Nitpick comments (2)
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java (1)
1420-1436: Consider filtering empty strings before validation for clearer error messages.The validation doesn't explicitly filter empty/whitespace-only entries. If a user provides
"compaction,"(trailing comma), the split produces["compaction", ""], andActionType.valueOf("")throws with a less descriptive error: "Unknown metadata table service manager action: ".This is consistent with the strict validation philosophy (reject malformed input), but the error message could be clearer. The runtime check in
HoodieTableServiceManagerConfig.isEnabledAndActionSupported()filters empty strings, so this is only a validation UX concern.💡 Optional improvement for clearer error messages
private static void validateTableServiceManagerActions(String actions) { for (String action : actions.split(",")) { String trimmed = action.trim(); + if (trimmed.isEmpty()) { + throw new IllegalArgumentException("Empty action found in metadata table service manager actions. " + + "Supported actions are: " + SUPPORTED_TABLE_SERVICE_MANAGER_ACTIONS); + } ActionType actionType; try { actionType = ActionType.valueOf(trimmed);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java` around lines 1420 - 1436, The validateTableServiceManagerActions method currently passes empty/whitespace tokens to ActionType.valueOf causing unclear errors; update validateTableServiceManagerActions to trim and skip any empty tokens (e.g., after split, if trimmed.isEmpty() continue) or explicitly throw a clearer IllegalArgumentException for empty entries, and keep the existing checks against SUPPORTED_TABLE_SERVICE_MANAGER_ACTIONS for non-empty tokens; reference validateTableServiceManagerActions and hoodie config helper HoodieTableServiceManagerConfig.isEnabledAndActionSupported to ensure consistent behavior with the runtime filter.hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieMetadataWriteUtils.java (1)
329-351: Minor duplication in custom lock provider rejection tests.
testCreateMetadataWriteConfigRejectsCustomLockProviderandtestCreateMetadataWriteConfigRejectsCustomLockProviderstest the same behavior with different class names. Consider consolidating into a parameterized test if desired, but this is a nitpick.Also applies to: 480-501
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieMetadataWriteUtils.java` around lines 329 - 351, Two nearly identical tests (testCreateMetadataWriteConfigRejectsCustomLockProvider and testCreateMetadataWriteConfigRejectsCustomLockProviders) duplicate coverage of rejecting non-built-in lock providers; consolidate them into a single parameterized test that iterates over the different custom lock provider class names to assert HoodieMetadataWriteUtils.createMetadataWriteConfig throws HoodieException and the message contains "only supported for built-in lock providers" and the class name. Update or replace the two existing test methods in TestHoodieMetadataWriteUtils to use a parameterized JUnit construct (e.g., `@ParameterizedTest` with `@ValueSource` or `@MethodSource`) to supply the custom class name strings and perform the same assertions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieMetadataWriteUtils.java`:
- Around line 329-351: Two nearly identical tests
(testCreateMetadataWriteConfigRejectsCustomLockProvider and
testCreateMetadataWriteConfigRejectsCustomLockProviders) duplicate coverage of
rejecting non-built-in lock providers; consolidate them into a single
parameterized test that iterates over the different custom lock provider class
names to assert HoodieMetadataWriteUtils.createMetadataWriteConfig throws
HoodieException and the message contains "only supported for built-in lock
providers" and the class name. Update or replace the two existing test methods
in TestHoodieMetadataWriteUtils to use a parameterized JUnit construct (e.g.,
`@ParameterizedTest` with `@ValueSource` or `@MethodSource`) to supply the custom
class name strings and perform the same assertions.
In
`@hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java`:
- Around line 1420-1436: The validateTableServiceManagerActions method currently
passes empty/whitespace tokens to ActionType.valueOf causing unclear errors;
update validateTableServiceManagerActions to trim and skip any empty tokens
(e.g., after split, if trimmed.isEmpty() continue) or explicitly throw a clearer
IllegalArgumentException for empty entries, and keep the existing checks against
SUPPORTED_TABLE_SERVICE_MANAGER_ACTIONS for non-empty tokens; reference
validateTableServiceManagerActions and hoodie config helper
HoodieTableServiceManagerConfig.isEnabledAndActionSupported to ensure consistent
behavior with the runtime filter.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0085b203-3b22-4dd7-8590-7f1d8f84950e
📒 Files selected for processing (4)
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/metadata/TestHoodieMetadataWriteUtils.javahudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.javahudi-common/src/main/java/org/apache/hudi/common/config/HoodieTableServiceManagerConfig.javahudi-common/src/test/java/org/apache/hudi/common/config/TestHoodieMetadataConfig.java
Mirror of apache#18295 for automated bot review.
Original author: @kbuci
Base branch: master
Summary by CodeRabbit
New Features
Improvements
Tests