[OSS PR #18597] test(schema): Add lance fileformat test for custom types on MOR #23
hudi-agent wants to merge 35 commits into
Conversation
…che#18508) * fix: whitelist Flink _2.12 artifacts in scala-2.13 enforcer rule - Flink only publishes flink-table-planner_2.12 (no _2.13 variant exists). The scala-2.13 profile's blanket ban on *_2.12 artifacts breaks the build when combining -Dspark4.0 -Dscala-2.13 with -Dflink1.20. - Add <includes> exceptions for org.apache.flink:*_2.12 and its transitive _2.12 dependencies (org.scala-lang.modules, com.twitter) since Flink has largely decoupled from Scala since 1.15 and the _2.12 suffix is internal. * fix: tighten Flink _2.12 whitelist and add rationale comment Address review feedback on apache#18508: - Narrow scala-lang.modules and com.twitter includes to the specific transitives (scala-xml_2.12, chill_2.12) so the enforcer still catches unexpected _2.12 leakage from those groups. - Add an XML comment explaining why these _2.12 artifacts are whitelisted despite the blanket scala-2.13 ban (Flink's _2.12 suffix is a legacy naming artifact post 1.15 Scala decoupling).
…int.sh (apache#18527) - The property was inserted into yarn-site.xml twice with the same value in the MULTIHOMED_NETWORK=1 block. - Duplicates are harmless at runtime (Hadoop's Configuration parser takes the last value for duplicates and both writes use the same value), but the second write is dead code. - Applies to base, base_java11, and base_java17.
- MapUtils.isNullOrEmpty(Map) was byte-identical to CollectionUtils.isNullOrEmpty(Map), and MapUtils.nonEmpty(Map) had no matching overload in CollectionUtils. - Move the sole remaining unique helper (containsAll) into CollectionUtils, delete MapUtils, and fold its test into TestCollectionUtils so there is one canonical utility class for Map/Collection emptiness and containment checks.
…bleList (apache#18530) - Replace Stream.of(elements).collect(Collectors.toList()) with a direct Arrays.asList copy. - Same semantics (mutable snapshot wrapped in an unmodifiable view), one less stream/spliterator/collector allocation per call. - Called from constants/initializers, so the saving is per-call but free.
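The swap above can be sketched as follows. This is a minimal illustration of the pattern described in the commit, not Hudi's actual helper (the class and method names here are invented for the example):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ImmutableListDemo {
    // Sketch: a fresh snapshot wrapped in an unmodifiable view.
    @SafeVarargs
    public static <T> List<T> createImmutableList(T... elements) {
        // Before: Stream.of(elements).collect(Collectors.toList()) allocated a
        // stream, a spliterator, and a collector on every call.
        // After: Arrays.asList wraps the fresh varargs array directly, so the
        // snapshot semantics are preserved with one less set of allocations.
        return Collections.unmodifiableList(Arrays.asList(elements));
    }
}
```

Since the varargs array is allocated per call, wrapping it directly is still a snapshot; nothing else holds a reference to it.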
…pache#18083) * Implemented a continuous sorting mode for the append sink to maintain sorted order incrementally and avoid single-partition lag during ingestion by reducing large pause time from sort and backpressure Summary: - Added AppendWriteFunctionWithContinuousSort which keeps records in a TreeMap keyed by a code-generated normalized key and an insertion sequence, drains oldest entries when a configurable threshold is reached, and writes drained records immediately; snapshot/endInput drain remaining records. - Updated AppendWriteFunctions.create to instantiate the continuous sorter when WRITE_BUFFER_SORT_CONTINUOUS_ENABLED is true. - Introduced three new FlinkOptions: WRITE_BUFFER_SORT_CONTINUOUS_ENABLED, WRITE_BUFFER_SORT_CONTINUOUS_DRAIN_THRESHOLD_PERCENT, and WRITE_BUFFER_SORT_CONTINUOUS_DRAIN_SIZE, and added runtime validation (buffer > 0, 0 < threshold < 100, drainSize > 0, parsed non-empty sort keys). - Added ITTestAppendWriteFunctionWithContinuousSort integration tests covering buffer flush triggers, sorted output correctness (with and without continuous drain), drain threshold/size behaviors, and invalid-parameter error cases. --------- Co-authored-by: dsaisharath <dsaisharath@uber.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
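The drain mechanics described above can be sketched like this. All names here are illustrative stand-ins (strings instead of Flink records, a string-concatenated key instead of the code-generated normalized key); this is not the actual AppendWriteFunctionWithContinuousSort implementation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Records are held in a TreeMap keyed by (sort key, insertion sequence); once
// the buffer passes a drain threshold, the oldest (smallest-key) entries are
// drained in sorted order and written immediately.
public class ContinuousSortBuffer {
    private final TreeMap<String, String> buffer = new TreeMap<>();
    private long seq = 0;
    private final int capacity;
    private final int drainThresholdPercent;
    private final int drainSize;

    public ContinuousSortBuffer(int capacity, int drainThresholdPercent, int drainSize) {
        this.capacity = capacity;
        this.drainThresholdPercent = drainThresholdPercent;
        this.drainSize = drainSize;
    }

    /** Buffer a record; returns any records drained in sorted-key order. */
    public List<String> add(String sortKey, String record) {
        // Zero-padded sequence keeps duplicate sort keys distinct and stable.
        buffer.put(sortKey + "#" + String.format("%019d", seq++), record);
        List<String> drained = new ArrayList<>();
        if (buffer.size() * 100 >= capacity * drainThresholdPercent) {
            while (!buffer.isEmpty() && drained.size() < drainSize) {
                drained.add(buffer.pollFirstEntry().getValue());
            }
        }
        return drained;
    }

    /** Drain everything remaining (the snapshot / endInput path). */
    public List<String> drainAll() {
        List<String> out = new ArrayList<>(buffer.values());
        buffer.clear();
        return out;
    }
}
```

Draining only the oldest entries bounds per-record pause time, which is the point of the continuous mode versus one large sort at flush.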
…apache#18204) - Added HudiHiveSyncJob under hudi-utilities as an external runner for Hive sync. - Added CLI/config support for base path, base file format, props file, and override configs. - Wired the job to build sync properties and invoke HiveSyncTool directly. --------- Co-authored-by: sivabalan <n.siva.b@gmail.com>
…() (apache#18461) * fix: close HoodieStorage resource in FileSystemBasedLockProvider.close() Adds a finally block to properly close the HoodieStorage instance after the lock file is deleted, preventing resource leaks. Fixes apache#14922 * fix: simplify log message per review feedback Addresses nit from @yihua: shorten log to "Failed to close HoodieStorage" since logger already includes class context. * fix: add comment explaining HoodieStorage.close() is no-op for HadoopStorage Addresses review feedback from danny0405: HoodieHadoopStorage.close() is a no-op because Hadoop FileSystem instances are shared within the JVM process lifecycle. Added a comment to explain this behavior and why the call is still retained for interface contract correctness.
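The close-in-finally shape described above looks roughly like this. The types here are hypothetical stand-ins, not Hudi's HoodieStorage API:

```java
public class LockCloseDemo {
    interface MyStorage extends AutoCloseable {
        boolean deleteLockFile();
    }

    public static boolean closeLock(MyStorage storage) {
        try {
            return storage.deleteLockFile();
        } finally {
            try {
                // Close even if the delete throws, so the resource never leaks.
                storage.close();
            } catch (Exception e) {
                // Log only: closing must not mask a delete failure.
                System.err.println("Failed to close storage: " + e.getMessage());
            }
        }
    }
}
```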
…lters (apache#18531) * perf(common): Avoid double-iterating log files in file-system-view filters - filterUncommittedFiles and filterUncommittedLogs each called fileSlice.getLogFiles() twice: once to filter+collect into committedLogFiles, and again for .count() to compare sizes. - Each call produced a fresh stream over the underlying collection. - Materialize the log files once and compare against the resulting list size. - Also switch fileSlices.size() == 0 to isEmpty() in fetchAllLogsMergedFileSlice. - No behaviour changes. * perf(common): add FileSlice#getLogFileCnt and drop intermediate list - Address review comment on apache#18531: - The backing TreeSet's size() is O(1), so expose it as FileSlice#getLogFileCnt and use it for the size comparison in filterUncommittedFiles / filterUncommittedLogs. - This removes the allLogFiles materialization and keeps a single stream pass for the filter.
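The single-pass idea can be sketched as below. The names are illustrative, not Hudi's FileSlice API: filter once into a list, then compare against an O(1) size instead of streaming the collection a second time for count():

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class UncommittedFilterDemo {
    public static <T> boolean allCommitted(List<T> logFiles, Predicate<T> isCommitted) {
        // One stream pass to filter; logFiles.size() is O(1) on the backing
        // collection, so no second pass is needed for the comparison.
        List<T> committed = logFiles.stream().filter(isCommitted).collect(Collectors.toList());
        return committed.size() == logFiles.size();
    }
}
```

The follow-up in the review goes one step further: when the backing collection's size() is already O(1) (a TreeSet in FileSlice), even the materialized list can be dropped.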
…pache#18488) * feat(vector): Add Spark SQL DDL CREATE TABLE support for VECTOR type - Enable VECTOR(dim[, elementType]) syntax in Spark SQL DDL so users can create tables with vector columns directly via SQL instead of only through the DataFrame API. - Changes: - Extend ANTLR grammar to accept identifier params in primitiveDataType - Add VECTOR case in visitPrimitiveDataType (supports FLOAT, DOUBLE, INT8) - Add VECTOR metadata attachment in addMetadataForType - Add DDL tests for VECTOR columns in TestCreateTable * Fix tests and address comments * Improve VECTOR DDL test coverage with targeted tests and routing - Relax isHoodieCommand VECTOR check from " vector(" to " vector" in all 4 extended parser files. - The stricter " vector(" variant only routes SQL containing VECTOR type declarations with parentheses (e.g. VECTOR(128)), which means VECTOR without parens is delegated to Spark's native parser and never reaches our Hudi code path. - Relaxing to " vector" routes all VECTOR-related SQL through our parser, enabling us to exercise the "vector with empty params" branch of the `case ("vector", _ :: _)` pattern - previously reported as partial coverage because the empty-list side of the `_ :: _` check was never hit. - This is also consistent with the existing BLOB routing pattern " blob". Add two targeted tests: 1. test create table with INT8 VECTOR column - isolated INT8 test that independently exercises the `case INT8 => ByteType` branch 2. test create table with VECTOR without dimension fails - routes VECTOR alone through the Hudi parser to cover the empty-list branch
* [MINOR] Bump lance to 4.0.0 and lance-spark to 0.4.0 Bumps lance-core from 1.0.2 to 4.0.0 and lance-spark connector from 0.0.15 to 0.4.0. Updates affected import paths and adapts to the LanceArrowUtils.toArrowSchema signature change (drops the errorOnDuplicatedFieldNames parameter). * [MINOR] Rename Hudi's ShowIndexes logical plan to HoodieShowIndexes Lance-spark 0.4.0 (bumped in 7e4967c) ships its own `org.apache.spark.sql.catalyst.plans.logical.ShowIndexes` inside `lance-spark-base_*.jar`. This collides with Hudi's own same-FQCN case class (added in hudi-spark-common). Both jars end up on the classpath of hudi-spark3.3.x/3.4.x/3.5.x/4.0.x, and since the two classes have different case-class arity (Lance's is 1-arg, Hudi's is 2-arg), Scala pattern matches like `case ShowIndexes(table, output)` fail to compile. Rename Hudi's class to `HoodieShowIndexes` (and its companion object) to sidestep the collision. This is an internal logical-plan class consumed only by Hudi's own parser / CatalystPlanUtils / analyzer — no public SQL or API surface changes. Call-sites updated: - Index.scala (definition + companion) - HoodieSpark{33,34,35,40}CatalystPlanUtils.scala (pattern match) - HoodieSpark{3_3,3_4,3_5,4_0}ExtendedSqlAstBuilder.scala (construct) - IndexCommands.scala (doc reference) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix(lance): look up Arrow vectors by field name in LanceRecordIterator With lance-spark 0.4.0, VectorSchemaRoot.getFieldVectors() returns vectors in the file's on-disk order rather than in the order of the projection requested via LanceFileReader.readAll(). Wrapping vectors positionally therefore mismatches the UnsafeProjection built from the requested schema, causing UnsafeProjection to call type accessors on the wrong column (e.g. getInt on a VarCharVector) and fail with UnsupportedOperationException for MoR reads where the FileGroupRecordBuffer rearranges columns relative to the file's write order.
Fix by looking up each vector by field name from the requested schema so the ColumnVector[] order matches what UnsafeProjection expects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * [MINOR] Tighten comments on LanceRecordIterator and HoodieShowIndexes Shorten the prose blocks above the column-order remapping in LanceRecordIterator and above HoodieShowIndexes to 2-3 sentences each, keeping the why (lance-spark 0.4.0 on-disk column order; FQCN shadow from lance-spark-base) without the full incident narrative. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…es (apache#18380)

This PR adds support to block archival based on the Earliest Commit To Retain (ECTR) from the last completed clean operation, preventing potential data leaks when cleaning configurations change between clean and archival runs.

Problem: Currently, archival recomputes ECTR independently based on cleaning configs at archival time, rather than reading it from the last clean plan. When cleaning configs change between clean and archival operations, archival may archive commits whose data files haven't been cleaned yet, leading to timeline metadata loss for existing data files.

Summary and Changelog

User-facing summary: Users can now optionally enable archival blocking based on ECTR from the last clean to prevent archiving commits whose data files haven't been cleaned. This is useful when cleaning configurations may change over time or when strict data retention guarantees are needed.

Detailed changelog:

Configuration Changes:
- Added new advanced config hoodie.archive.block.on.latest.clean.ectr (default: false)
- When enabled, archival reads ECTR from last completed clean metadata
- Blocks archival of commits with timestamp >= ECTR
- Marked as advanced config for power users
- Available since version 1.2.0

Implementation Changes:

TimelineArchiverV1.java:
- Added ECTR blocking logic in getCommitInstantsToArchive() method
- Reads ECTR from last completed clean's metadata (lines 274-294)
- Filters commit timeline to exclude commits >= ECTR (lines 322-326)
- Follows same pattern as existing compaction/clustering retention checks
- Includes error handling with graceful degradation (logs warning if metadata read fails)

HoodieArchivalConfig.java:
- Added config property BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR
- Builder method: withBlockArchivalOnCleanECTR(boolean)

HoodieWriteConfig.java:
- Added access method shouldBlockArchivalOnCleanECTR()
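The blocking rule reduces to a simple timestamp filter, sketched below with plain strings standing in for Hudi instants (the class and method names are invented for this example):

```java
import java.util.List;
import java.util.stream.Collectors;

public class EctrArchivalDemo {
    // Drop any candidate commit whose timestamp is >= the ECTR recorded by the
    // last completed clean; with no completed clean, nothing is blocked.
    public static List<String> filterArchivable(List<String> candidateCommits, String ectr) {
        if (ectr == null) {
            return candidateCommits;
        }
        // Hudi instant timestamps order lexicographically, so compareTo works.
        return candidateCommits.stream()
            .filter(ts -> ts.compareTo(ectr) < 0)
            .collect(Collectors.toList());
    }
}
```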
* fix: prevent parseTypeDescriptor crash for non-custom logical types in schema conversion guards - The BLOB/VECTOR guard conditions in HoodieSparkSchemaConverters called parseTypeDescriptor() for any StructType/ArrayType with hudi_type metadata, which threw IllegalArgumentException for types like VARIANT that are not custom logical types. - Add isCustomLogicalTypeDescriptor() safe check to short-circuit the guards before parseTypeDescriptor() is called. - Add regression test that reproduces the struct+metadata VARIANT path. * fix: treat hudi_type=VARIANT as a first-class custom logical type - Address review feedback on apache#18510, restructure the crash fix so a StructType tagged with hudi_type=VARIANT is handled consistently with BLOB/VECTOR. - The hudi_type metadata is the deliberate escape hatch for engines without a native representation (notably Spark 3.5), so its presence is itself a custom-logical-type signal. - Add VARIANT to CUSTOM_LOGICAL_TYPES and give it a case in parseTypeDescriptor, mirroring BLOB. - In HoodieSparkSchemaConverters, add a dedicated VARIANT pattern case that validates the expected unshredded structure ({metadata, value} binary fields) and produces HoodieSchema.Variant. - On Spark 4.0+ the column round-trips as native VariantType via the existing reverse conversion path. - Remove the isCustomLogicalTypeDescriptor short-circuit helper; with VARIANT now properly registered, the BLOB/VECTOR guards no longer need the pre-check. - Add unit tests for parseTypeDescriptor VARIANT (success, case insensitivity, parameter rejection) and integration tests asserting VARIANT promotion and malformed-struct rejection. * Address missing code coverage complaints
…#18511) - Hive 2.x/3.x does not support VARIANT type natively. - When creating a Hudi table with VARIANT columns via SQL CREATE TABLE, Spark's HiveClient passes "variant" as a literal type string which Hive rejects. - Convert VariantType to struct<value:binary, metadata:binary> in the CatalogTable schema before passing to HiveClient, while preserving the original VariantType in table properties so Spark can reconstruct it when reading. - Includes unit test for the conversion. - Recursively convert VariantType inside nested StructType/ArrayType/MapType so columns like STRUCT<a:VARIANT>, ARRAY<VARIANT>, and MAP<STRING,VARIANT> are also rewritten to the Hive-compatible physical struct. - Emit the variant struct with canonical (metadata, value) field order to match HoodieSchema.createVariant() and the Parquet/Iceberg convention. - Extract buildHiveCompatibleCatalogTable helper so the schema conversion and property merge are directly unit-testable. - Expand TestVariantDataType with nested-variant cases, canonical-order assertions, and coverage for buildHiveCompatibleCatalogTable. - Clean up Scaladoc (use backticks) and rename the test field from v to variant_col.
…8448) Closes apache#18050

In multi-writer mode, when multiple writers detect a failed inflight commit, each writer independently schedules and executes its own rollback. This leads to duplicate rollback instants on the timeline for the same failed commit — causing unnecessary work, potential conflicts, and timeline clutter.

Summary and Changelog

Adds a mechanism to avoid duplicate rollback plans in multi-writer mode by:
- Timeline reload under lock: Before scheduling a new rollback, the writer reloads the active timeline inside the lock to check if another writer already scheduled a rollback for the same failed commit. If found, it reuses the existing plan instead of creating a new one.
- Heartbeat-based ownership: Before executing a rollback, the writer acquires a heartbeat for the rollback instant under a transaction lock. If another writer already holds an active heartbeat (i.e., is currently executing the rollback), the current writer skips execution. If the heartbeat is expired, the writer takes ownership and proceeds.
- Completed rollback detection: Inside the heartbeat acquisition lock, the writer also checks if the rollback was already completed on the timeline by another writer, and skips if so.

Changes:
- BaseHoodieTableServiceClient: Refactored rollback() into schedule and execute phases. Extracted resolveOrScheduleRollback() (reuses existing pending rollbacks or schedules new ones under lock) and acquireRollbackHeartbeatIfMultiWriter() (heartbeat-based ownership with completed-rollback detection). Wrapped heartbeatClient.stop() in try-catch in the finally block to avoid masking rollback exceptions.
- HoodieWriteConfig: New advanced config hoodie.rollback.avoid.duplicate.plan (default false), gated on multi-writer mode via shouldAvoidDuplicateRollbackPlan().
- TestClientRollback: Added 6 tests covering: expired heartbeat takeover, active heartbeat skip, commit-not-in-timeline, first-writer-schedules-new-plan, already-completed-by-another-writer, and concurrent two-writer rollback of the same commit.

Impact
- New advanced config: hoodie.rollback.avoid.duplicate.plan — opt-in, default false, only effective in multi-writer mode.
- No breaking changes to public APIs or storage format.
- No behavioral change for single-writer mode or when the config is disabled.

--------- Co-authored-by: Lokesh Jain <ljain@Lokeshs-MacBook-Pro.local> Co-authored-by: sivabalan <n.siva.b@gmail.com>
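The ownership decision described above boils down to a small predicate, sketched here with illustrative names (this is not Hudi's HoodieHeartbeatClient API):

```java
public class RollbackOwnershipDemo {
    // Skip if another writer finished the rollback, or holds a fresh heartbeat;
    // take over only when the heartbeat has expired (writer presumed dead).
    public static boolean shouldExecuteRollback(long lastHeartbeatMs, long nowMs,
                                                long expiryMs, boolean alreadyCompleted) {
        if (alreadyCompleted) {
            return false;
        }
        boolean heartbeatActive = (nowMs - lastHeartbeatMs) < expiryMs;
        return !heartbeatActive;
    }
}
```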
…he#18552) * fix: Parquet small-precision decimals decode ClassCastException Parquet encodes DECIMAL physically as INT32 (precision <= 9), INT64 (precision <= 18), or BINARY / FIXED_LEN_BYTE_ARRAY. The Hudi-Flink 1.18 ParquetDecimalVector predated this and unconditionally cast the child vector to BytesColumnVector inside getDecimal(), throwing ClassCastException whenever the reader materialized a small-precision decimal as a HeapIntVector / HeapLongVector. Sync ParquetDecimalVector with Apache Flink 2.1's implementation: - getDecimal dispatches on the physical child vector type (ParquetSchemaConverter#is32BitDecimal / is64BitDecimal) and decodes int / long / bytes accordingly. - Wrapper additionally implements WritableLongVector, WritableIntVector, and WritableBytesVector by delegation so that column readers can write through it without unwrapping. - Underlying vector field is now private; accessed via getVector(). Migrated ArrayColumnReader's nine direct-field accesses. - Adds TestParquetDecimalVector covering every dispatch path, the backward-compatible bytes-at-small-precision case, unsupported vector types, null handling, and the Writable* delegation contracts.
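The dispatch logic above follows Parquet's precision thresholds for DECIMAL's physical representation. A minimal sketch (the helper name is illustrative; Flink's real checks live in ParquetSchemaConverter#is32BitDecimal / is64BitDecimal):

```java
public class DecimalPhysicalType {
    // Parquet stores DECIMAL as INT32 for precision <= 9, INT64 for
    // precision <= 18, and BINARY / FIXED_LEN_BYTE_ARRAY above that.
    public static String physicalTypeFor(int precision) {
        if (precision <= 9) {
            return "INT32";   // unscaled value fits in 32 bits
        } else if (precision <= 18) {
            return "INT64";   // unscaled value fits in 64 bits
        } else {
            return "BINARY";  // or FIXED_LEN_BYTE_ARRAY
        }
    }
}
```

The old code assumed the BINARY branch unconditionally, which is exactly why small-precision decimals materialized as int/long vectors triggered the ClassCastException.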
…le format (apache#18162) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
… for RocksDBIndexBackend (apache#18560)
…ache#18545) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…n Spark 4 (apache#18564) Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…che#18571)

RECURSION_OVERFLOW_SCHEMA passes new byte[0] (via getUTF8Bytes("")) as the default value of a BYTES HoodieSchemaField, which wraps an Avro Schema.Field. Avro 1.12.0's Schema.validateDefault rejects byte[] for BYTES defaults — it now requires a String (interpreted as ISO-8859-1 bytes, Avro's canonical JSON form for BYTES defaults). Under 1.11.x the validator was lenient.

Because the failure is in a static initializer, the JVM caches the ExceptionInInitializerError and every subsequent class access throws NoClassDefFoundError: Could not initialize class org.apache.hudi.utilities.sources.helpers.ProtoConversionUtil$AvroSupport. Any downstream service loading this class (e.g. DeltaStreamer using ProtoClassBasedSchemaProvider) is permanently broken on JVMs that have Avro 1.12+ on the classpath, even though Hudi itself still pins 1.11.4.

Fix: use "" (empty string) instead of getUTF8Bytes(""). The default value is never read at runtime — it is always overwritten by ByteBuffer.wrap(messageValue.toByteArray()) at line ~365 before the field is populated — and "" serializes bit-for-bit identically to new byte[0] on the wire (both produce "default":"" in the JSON schema), so behavior is strictly preserved under Avro 1.11 while satisfying 1.12's stricter validator.

Rejected alternatives (experimental testing against both Avro versions):

| default arg | 1.11.x | 1.12.0 |
| --- | --- | --- |
| new byte[0] (current) | OK | FAIL |
| "" (fix) | OK | OK |
| ByteBuffer.wrap(...) | OK | FAIL |
| null | OK | OK, but strips the default entirely |

Also drops the now-unused getUTF8Bytes static import.

Fixes: apache#18569
Signed-off-by: tiennguyen-onehouse <tien@onehouse.ai>
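The ISO-8859-1 equivalence that makes "" a safe BYTES default can be shown in isolation (the helper name is invented for this demo):

```java
import java.nio.charset.StandardCharsets;

public class BytesDefaultDemo {
    // Avro's canonical JSON form for a BYTES default is a String whose chars
    // are interpreted as ISO-8859-1 bytes; "" therefore decodes to the exact
    // same payload as new byte[0].
    public static byte[] decodeBytesDefault(String jsonDefault) {
        return jsonDefault.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

ISO-8859-1 maps each char 0-255 to the byte of the same value, so any byte sequence round-trips through a Java String losslessly, which is why Avro chose it for the JSON encoding of binary defaults.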
apache#18570) HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues builds two schemas side-by-side: requestedSchema (what to return to Spark) and dataSchema (what to read from parquet). It augments requestedSchema with any partition fields in mandatoryFields before pruning, but pipes dataStructType through unchanged. Spark's dataStructType excludes partition columns by convention, and HoodieSchemaUtils.pruneDataSchema iterates over its second arg, so any mandatory partition field is silently dropped from the resulting dataSchema. The FileGroupReader then does not read the column from the parquet base file, and for non-projection-compatible CUSTOM mergers (e.g. PostgresDebeziumAvroPayload) the output converter writes null for every affected row. Most visible on MOR file slices that have both a base file and a log file, since the readBaseFile path (which would append partition values from the directory name) is skipped in favor of the FileGroupReader path. Regression introduced by apache#13711 ("Improve Logical Type Handling on Col Stats"), which added the pruneDataSchema wrapping but only on the requested-schema side. Fix: mirror requestedStructType's construction — augment dataStructType with the mandatory partition fields before pruning. Also adds a regression test (TestFileGroupReaderPartitionColumn) that reproduces the scenario end-to-end: MOR + CustomKeyGenerator + PostgresDebeziumAvroPayload + GLOBAL_SIMPLE with update.partition.path=true, round-2 partition-key change producing a base+log slice, then verifies untouched records in that slice read back with the correct partition-column value. Fixes: apache#18568 Signed-off-by: tiennguyen-onehouse <tien@onehouse.ai>
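The fix mirrors, on the data-schema side, the augmentation already done for the requested schema. A sketch with plain strings standing in for Spark StructType fields (names invented for the example):

```java
import java.util.ArrayList;
import java.util.List;

public class MandatoryFieldDemo {
    // Spark's dataStructType excludes partition columns by convention, so any
    // mandatory partition field must be appended before pruning, or pruning
    // (which iterates the data schema) silently drops it.
    public static List<String> augmentWithMandatory(List<String> dataFields,
                                                    List<String> mandatoryFields) {
        List<String> out = new ArrayList<>(dataFields);
        for (String f : mandatoryFields) {
            if (!out.contains(f)) {
                out.add(f);
            }
        }
        return out;
    }
}
```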
…he#18379)

This PR adds support for custom Parquet configuration injection across all file writer factories in Apache Hudi. This feature allows users to inject custom Parquet configurations (e.g., native Parquet bloom filters, custom compression settings, dictionary encoding overrides) at runtime without modifying Hudi's core code.

Motivation: Users sometimes need to apply specific Parquet configurations for certain tables or partitions (e.g., disable dictionary encoding for high-cardinality columns, enable native Parquet bloom filters for specific columns, or apply custom encoding strategies). Previously, these configurations were hard-coded or required code changes. This PR introduces a pluggable mechanism via the HoodieParquetConfigInjector interface.

Summary and Changelog

Summary: Added support for custom Parquet configuration injection across Spark, Avro, and Flink file writers. Users can now implement the HoodieParquetConfigInjector interface and specify it via the hoodie.parquet.config.injector.class configuration to inject custom Parquet settings at write time.

Changes:

Core Implementation (hudi-client-common):
- Added HoodieParquetConfigInjector interface with withProps() method that accepts StoragePath, StorageConfiguration, and HoodieConfig and returns modified configurations

Spark Integration (hudi-spark-client):
- Modified HoodieSparkFileWriterFactory.newParquetFileWriter() to check for and invoke config injector (lines 66-79)
- Added comprehensive tests in TestHoodieParquetConfigInjector:
  - testDisableDictionaryEncodingViaInjector() - validates dictionary encoding can be disabled
  - testInvalidInjectorClassThrowsException() - validates error handling
  - testNoInjectorUsesDefaultConfig() - validates backward compatibility
- Tests validate actual Parquet metadata (encodings) rather than just configuration

Avro Integration (hudi-hadoop-common):
- Modified HoodieAvroFileWriterFactory.newParquetFileWriter() to support config injection (lines 71-85)
- Updated getHoodieAvroWriteSupport() signature to accept StorageConfiguration
- Added TestHoodieAvroParquetConfigInjector with similar test coverage

Flink Integration (hudi-flink-client):
- Modified HoodieRowDataFileWriterFactory.newParquetFileWriter() to support config injection (lines 126-140)
- Added TestHoodieRowDataParquetConfigInjector with similar test coverage

Configuration:
- Added HOODIE_PARQUET_CONFIG_INJECTOR_CLASS config key in HoodieStorageConfig
- Added withParquetConfigInjectorClass() builder method

Impact

Public API:
- New interface: HoodieParquetConfigInjector (marked with appropriate annotations)
- New configuration: hoodie.parquet.config.injector.class (optional, defaults to empty string)

User-facing changes:
- Users can now customize Parquet settings per file/partition without modifying Hudi code
- Fully backward compatible - existing code continues to work without changes
- No performance impact when feature is not used (single isNullOrEmpty() check)
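The pluggable-injector shape can be sketched as below. The real interface takes Hudi's StoragePath, StorageConfiguration, and HoodieConfig; here plain Strings and Maps stand in so the example is self-contained, and all names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

public class InjectorDemo {
    public interface ParquetConfigInjector {
        Map<String, String> withProps(String path, Map<String, String> storageConf);
    }

    // Loading by class name mirrors the hoodie.parquet.config.injector.class
    // mechanism: empty config means defaults pass through untouched.
    public static Map<String, String> applyInjector(String injectorClass, String path,
                                                    Map<String, String> conf) {
        if (injectorClass == null || injectorClass.isEmpty()) {
            return conf;
        }
        try {
            ParquetConfigInjector injector = (ParquetConfigInjector)
                Class.forName(injectorClass).getDeclaredConstructor().newInstance();
            return injector.withProps(path, conf);
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException("Could not load injector " + injectorClass, e);
        }
    }

    // Example injector: disable dictionary encoding for whatever path it sees.
    public static class NoDictionaryInjector implements ParquetConfigInjector {
        @Override
        public Map<String, String> withProps(String path, Map<String, String> conf) {
            Map<String, String> out = new HashMap<>(conf);
            out.put("parquet.enable.dictionary", "false");
            return out;
        }
    }
}
```

Returning a copy rather than mutating the input keeps the no-injector fast path a pure pass-through, matching the "no performance impact when unused" claim above.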
This PR adds support for creating empty clean commits to optimize clean planning performance for append-only datasets.

Problem: In datasets with incremental cleaning enabled that receive infrequent updates or are primarily append-only, the clean planner performs a full table scan on every ingestion run because there are no clean plans to mark progress. This leads to significant performance overhead, especially for large tables.

Solution: Introduce a new configuration hoodie.write.empty.clean.internval.hours that allows creating empty clean commits after a configurable duration. These empty clean commits update the earliestCommitToRetain value, enabling subsequent clean planning operations to only scan partitions modified after the last empty clean, avoiding expensive full table scans.

Summary and Changelog

User-facing changes:
- New advanced config hoodie.write.empty.clean.create.duration.ms (default: -1, disabled) to control when empty clean commits should be created
- When enabled with incremental cleaning, Hudi will create empty clean commits after the specified duration (in milliseconds) to optimize clean planning performance

Detailed changes:

Config Addition (HoodieCleanConfig.java):
- Added hoodie.write.empty.clean.internval.hours config property with builder method
- Marked as advanced config for power users

Clean Execution (CleanActionExecutor.java):
- Modified clean parallelism calculation to ensure minimum of 1 (was causing issues with empty plans)
- Added createEmptyCleanMetadata() method to construct metadata for empty cleans
- Updated runClean() to handle empty clean stats by creating appropriate metadata

Clean Planning (CleanPlanActionExecutor.java):
- Added getEmptyCleanerPlan() method to construct cleaner plans with no files to delete
- Modified requestClean() to return empty plans when partitions list is empty
- Added logic in requestCleanInternal() to check if empty clean commit should be created based on:
  - Incremental cleaning enabled
  - Time since last clean > configured threshold
  - Valid earliestInstantToRetain present

Impact

Performance Impact: Positive - significantly reduces clean planning time for append-only or infrequently updated datasets by avoiding full table scans
API Changes: None - purely additive configuration
Behavior Changes: When enabled, users will see empty clean commits in the timeline at the configured intervals. These commits have totalFilesDeleted=0 and empty partition metadata but contain valid earliestCommitToRetain metadata.
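The decision of when to create an empty clean reduces to an interval check, sketched here with an invented method name (not the actual requestCleanInternal() logic):

```java
public class EmptyCleanDemo {
    // Create an empty clean only when incremental cleaning is on, the interval
    // feature is enabled (config default is -1, i.e. disabled), and enough time
    // has elapsed since the last clean.
    public static boolean shouldCreateEmptyClean(boolean incrementalCleanEnabled,
                                                 long lastCleanMs, long nowMs,
                                                 long intervalMs) {
        if (!incrementalCleanEnabled || intervalMs < 0) {
            return false;
        }
        return (nowMs - lastCleanMs) >= intervalMs;
    }
}
```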
…ad (apache#18582) Fixes: apache#18579 Hudi writes VECTOR as bare FIXED_LEN_BYTE_ARRAY with no Parquet logical-type annotation, so Hive's Parquet reader reconstructs the Avro schema as plain FIXED named after the column. Projecting that to the canonical VECTOR schema failed with "cannot support rewrite value for schema type" because HoodieSchemaType.FIXED and VECTOR are distinct and rewritePrimaryTypeWithDiffSchemaType had no VECTOR case. FIXED_BYTES-backed VECTOR is byte-identical to FIXED at matching size, so pass the writable through.
Increase -Xmx from 3g to 4g in the default, java11 and java17 Maven profiles to prevent OOM failures during CI test runs.
Cover the invariant that the HoodieSchema.TYPE_METADATA_FIELD descriptor and payload shape of a custom-typed column survive inline compaction of a log-only MOR table into a base file. - TestVectorDataSource: add testMorLogOnlyCompactionPreservesVectorMetadata (5 commits via SQL + MERGE INTO to trigger default inline compaction). - TestVariantDataType: equivalent VARIANT test, gated on Spark 4.0+, asserting native VariantType round-trips through compaction. - TestBlobDataType (new): BLOB INLINE and BLOB OUT_OF_LINE cases. Inline uses named_struct with hex byte literals; out-of-line creates real files via BlobTestHelpers.createTestFile and verifies bytes via read_blob().
📝 Walkthrough

Adds new Vector/Blob/Variant data types and Lance file format support, pluggable Parquet config injection, a Flink continuous-sort buffer, enhanced rollback/archival/clean logic, extensive Spark/Flink/Hadoop test coverage, Docker Compose stacks for integration testing, and numerous utilities, refactors, and dependency updates across the codebase.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client (SQL/API)
    participant Planner as Planner (ReadBlobRule / Extension)
    participant Driver as Driver (BatchedBlobReaderStrategy)
    participant Executor as Executor (BatchedBlobReadExec Task)
    participant Storage as Storage (HoodieStorage / FileReader)
    Client->>Planner: parse query with read_blob()
    Planner->>Planner: rewrite Project/Filter -> insert BatchedBlobRead nodes
    Planner->>Driver: plan with BatchedBlobReaderStrategy
    Driver->>Executor: produce BatchedBlobReadExec (broadcast storageConf)
    Executor->>Storage: request merged ranges (group by file, maxGap / lookahead)
    Storage-->>Executor: return merged byte buffers
    Executor->>Executor: slice buffers into per-row payloads
    Executor-->>Client: return rows with blob bytes
```

(The diagram shows the high-level batched blob read flow.)

Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
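The "merged ranges (group by file, maxGap)" step in the diagram can be sketched as a classic range-coalescing pass. Names here are illustrative, not Hudi's actual reader API:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeMergeDemo {
    /**
     * Coalesce sorted, non-overlapping [start, end) byte ranges for one file:
     * two adjacent ranges are merged into a single read when the gap between
     * them is at most maxGap, trading a few wasted bytes for fewer round trips.
     */
    public static List<long[]> mergeRanges(List<long[]> ranges, long maxGap) {
        List<long[]> merged = new ArrayList<>();
        for (long[] r : ranges) {
            if (!merged.isEmpty() && r[0] - merged.get(merged.size() - 1)[1] <= maxGap) {
                // Extend the previous read instead of issuing a new one.
                long[] last = merged.get(merged.size() - 1);
                last[1] = Math.max(last[1], r[1]);
            } else {
                merged.add(new long[]{r[0], r[1]});
            }
        }
        return merged;
    }
}
```

After the merged buffers come back, each original range is sliced back out of its containing buffer, which is the "slice buffers into per-row payloads" step.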
Actionable comments posted: 3
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
hudi-spark-datasource/hudi-spark4.0.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark4_0ExtendedSqlParser.scala (1)

131-143: ⚠️ Potential issue | 🟠 Major

`isHoodieCommand` can miss VECTOR/BLOB in valid SQL forms. Line 141 and Line 142 only match when preceded by a literal space. This misses cases like `ARRAY<VECTOR(...)>` or `MAP<STRING,VECTOR(...)>`, so Hoodie parsing may not be triggered.

💡 Suggested fix

```diff
 private def isHoodieCommand(sqlText: String): Boolean = {
   val normalized = sqlText.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ")
+  val hasCustomTypeKeyword = "\\b(blob|vector)\\b".r.findFirstIn(normalized).isDefined
   normalized.contains("system_time as of") ||
     normalized.contains("timestamp as of") ||
     normalized.contains("system_version as of") ||
     normalized.contains("version as of") ||
     normalized.contains("create index") ||
     normalized.contains("drop index") ||
     normalized.contains("show indexes") ||
     normalized.contains("refresh index") ||
-    normalized.contains(" blob") ||
-    normalized.contains(" vector")
+    hasCustomTypeKeyword
 }
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark4.0.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark4_0ExtendedSqlParser.scala` around lines 131-143: the isHoodieCommand method currently checks for " blob" and " vector", which require a leading space and will miss forms like ARRAY<VECTOR(...)> or MAP<STRING,VECTOR(...)>; update isHoodieCommand to detect blob/vector more robustly (e.g., replace normalized.contains(" blob") and normalized.contains(" vector") with a regex word-boundary check like normalized.matches(".*\\b(blob|vector)\\b.*"), or checks for "vector(" as well) so VECTOR and BLOB are matched regardless of surrounding punctuation; update only the isHoodieCommand function to use the new contains/match logic.

hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_4ExtendedSqlParser.scala (1)
122-133: ⚠️ Potential issue | 🟠 Major

Include `VARIANT` in the Hoodie-command detection. This method now routes `VECTOR` DDL correctly, but `VARIANT` is still missing. `CREATE TABLE ... col VARIANT` will keep falling back to Spark's parser, so the new custom type support is still incomplete on Spark 3.4.

Suggested fix

```diff
 normalized.contains("show indexes") ||
 normalized.contains("refresh index") ||
 normalized.contains(" blob") ||
-normalized.contains(" vector")
+normalized.contains(" vector") ||
+normalized.contains(" variant")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_4ExtendedSqlParser.scala` around lines 122 - 133, The isHoodieCommand method currently checks for " vector" and " blob" but omits VARIANT, so DDL with a column type VARIANT won't be routed to Hoodie parser; update the normalized.contains checks in isHoodieCommand to also include a check for " variant" (matching the same lowercase/whitespace-normalized approach as the other type checks) so CREATE TABLE ... col VARIANT is detected and handled by the Hoodie parser.

hudi-spark-datasource/hudi-spark3.3.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_3ExtendedSqlParser.scala (1)
122-133: ⚠️ Potential issue | 🟠 Major

Route `VARIANT` DDL through the Hudi parser too. `isHoodieCommand` now catches `VECTOR`, but it still misses `VARIANT`. That means `CREATE TABLE ... col VARIANT` will continue to go through Spark's parser instead of the Hudi path, so the new custom type support won't be reachable from SQL on Spark 3.3.

Suggested fix

```diff
 normalized.contains("show indexes") ||
 normalized.contains("refresh index") ||
 normalized.contains(" blob") ||
-normalized.contains(" vector")
+normalized.contains(" vector") ||
+normalized.contains(" variant")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark3.3.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_3ExtendedSqlParser.scala` around lines 122 - 133, The isHoodieCommand method in HoodieSpark3_3ExtendedSqlParser.scala currently checks for type tokens like " blob" and " vector" but misses " variant", so add a normalized.contains(" variant") check to the boolean chain inside the isHoodieCommand function (where normalized is computed) so CREATE TABLE ... col VARIANT will be routed through the Hudi parser; keep the same normalization logic (toLowerCase + trim + replaceAll("\\s+", " ")) and add the contains(" variant") entry alongside the other contains checks.
🟠 Major comments (18)
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/index/CodedRecordGlobalLocationSerializer.java-33-35 (1)
33-35: ⚠️ Potential issue | 🟠 Major

Dictionary state is not persisted across serializer instances, causing restart failures.

Lines 33–35 maintain instance-local partition path dictionary mappings that are lost when the serializer instance is recreated. Since `RocksDBIndexBackend` (line 48) creates a fresh `CodedRecordGlobalLocationSerializer()` on each instantiation, reusing a persisted RocksDB path after task restart will fail: old entries store only dictionary IDs (integers), which cannot be decoded without the original dictionary. The `getPartitionPath()` method (lines 58–64) throws `IllegalStateException` when encountering unknown IDs from prior serializer instances.

The javadoc at lines 30–31 explicitly documents this limitation: "Serialized bytes can only be deserialized correctly by the same serializer instance that encoded them."
To support restart safety, either persist dictionary metadata (via a dedicated RocksDB column family with bootstrap loading), switch to non-coded serialization for persisted index backends, or document that RocksDB paths cannot be reused across instances.
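The persist-and-bootstrap option can be sketched independently of RocksDB. `PartitionPathDictionary` and its methods below are hypothetical names, not the Hudi/Flink API; the point is only that a persisted `(id -> path)` snapshot lets a freshly created instance decode ids assigned by an earlier one:

```java
import java.util.HashMap;
import java.util.Map;

// Toy dictionary coder: ids are handed out upward from 0, and the full
// (id -> path) mapping can be snapshotted and later re-loaded.
class PartitionPathDictionary {
  private final Map<String, Integer> pathToId = new HashMap<>();
  private final Map<Integer, String> idToPath = new HashMap<>();
  private int nextId = 0;

  // Rebuild state from a persisted snapshot (e.g. a dedicated column family).
  void bootstrap(Map<Integer, String> persisted) {
    persisted.forEach((id, path) -> {
      idToPath.put(id, path);
      pathToId.put(path, id);
      nextId = Math.max(nextId, id + 1);
    });
  }

  int encode(String path) {
    Integer existing = pathToId.get(path);
    if (existing != null) {
      return existing;
    }
    int id = nextId++;
    pathToId.put(path, id);
    idToPath.put(id, path);
    return id;
  }

  String decode(int id) {
    String path = idToPath.get(id);
    if (path == null) {
      throw new IllegalStateException("Unknown dictionary id: " + id);
    }
    return path;
  }

  Map<Integer, String> snapshot() {
    return new HashMap<>(idToPath);
  }
}

public class DictionaryBootstrapDemo {
  public static void main(String[] args) {
    PartitionPathDictionary first = new PartitionPathDictionary();
    int id = first.encode("region=us/2024-01-01");

    // A fresh instance (as after a task restart) can decode old ids only
    // after it is bootstrapped from the persisted snapshot.
    PartitionPathDictionary second = new PartitionPathDictionary();
    second.bootstrap(first.snapshot());
    System.out.println(second.decode(id));
  }
}
```

Writing the snapshot atomically with the index updates (or loading it in the serializer constructor, as the prompt below suggests) is what would make restarts safe.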
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/index/CodedRecordGlobalLocationSerializer.java` around lines 33 - 35, CodedRecordGlobalLocationSerializer currently keeps partitionPathToDictId and dictIdToPartitionPath as instance-local fields which are lost across restarts; modify CodedRecordGlobalLocationSerializer to persist the dictionary to RocksDB so deserialization after restart can reconstruct mappings: add a RocksDB column family (e.g., "partition_dict") accessed from the serializer constructor to bootstrap existing mappings into partitionPathToDictId and dictIdToPartitionPath, update that column family whenever you assign a new dict id (the code paths that allocate ids—where partitionPathToDictId is mutated), and read from it in getPartitionPath/deserialize to resolve historic ids; ensure writes are atomic with the index updates in RocksDBIndexBackend (or, if you prefer, replace CodedRecordGlobalLocationSerializer with a non-coded serializer for persisted backends) so that old integer-only serialized entries remain decodable after task restart.

hudi-flink-datasource/hudi-flink1.18.x/src/main/java/org/apache/hudi/table/format/cow/vector/ParquetDecimalVector.java-75-225 (1)
75-225: ⚠️ Potential issue | 🟠 Major

Fix inconsistent delegation in wrapped-vector type checks.

`ParquetDecimalVector` implements typed interfaces (`WritableLongVector`, `WritableIntVector`, `WritableBytesVector`) while accepting any `ColumnVector` in the constructor. This creates two problems:

- Asymmetric failure modes: `getDecimal()` works with readable types (`IntColumnVector`, `LongColumnVector`, `BytesColumnVector`), but `getInt()`, `getLong()`, and `getBytes()` require the writable variants and throw `RuntimeException` if the child lacks them. Other mutators (`setInt()`, `setLong()`, `fill()`) silently no-op instead.
- Type contract violation: A read-only `IntColumnVector` wrapper passes construction but fails later when calling typed getters, or silently drops writes via mutators.

Constrain the constructor to require a writable child type and validate up front, or delegate getters uniformly through readable interfaces and make all unsupported mutators throw consistently rather than silently failing.
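The fail-fast variant can be illustrated with simplified stand-ins (`ColumnVector`, `WritableIntVector`, and `DecimalVectorWrapper` below are hypothetical, not the real Flink interfaces): validate the child once in the constructor, then delegate every typed accessor uniformly.

```java
// Minimal stand-in for a readable column vector.
interface ColumnVector {
  boolean isNullAt(int row);
}

// Minimal stand-in for its writable refinement.
interface WritableIntVector extends ColumnVector {
  void setInt(int row, int value);
  int getInt(int row);
}

class DecimalVectorWrapper {
  private final WritableIntVector child;

  DecimalVectorWrapper(ColumnVector child) {
    // Validate up front: construction fails immediately for a read-only
    // child, instead of a lazy RuntimeException or a silent no-op later.
    if (!(child instanceof WritableIntVector)) {
      throw new IllegalArgumentException(
          "decimal wrapper requires a writable child, got " + child.getClass().getName());
    }
    this.child = (WritableIntVector) child;
  }

  void setInt(int row, int value) {
    child.setInt(row, value); // uniform delegation, never a silent no-op
  }

  int getInt(int row) {
    return child.getInt(row);
  }
}

public class FailFastWrapperDemo {
  public static void main(String[] args) {
    ColumnVector readOnly = row -> false; // readable-only child
    boolean rejected = false;
    try {
      new DecimalVectorWrapper(readOnly);
    } catch (IllegalArgumentException expected) {
      rejected = true;
    }
    System.out.println("read-only child rejected at construction: " + rejected);
  }
}
```

The design choice being argued for is simply to move the failure from first typed access (or worse, a dropped write) to construction, where the mismatch is cheap to diagnose.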
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink1.18.x/src/main/java/org/apache/hudi/table/format/cow/vector/ParquetDecimalVector.java` around lines 75 - 225, ParquetDecimalVector currently accepts any ColumnVector but mixes readable and writable interfaces inconsistently; update ParquetDecimalVector to validate and enforce a writable child at construction (throw IllegalArgumentException if the supplied vector is not a WritableColumnVector and the concrete writable types required), and then change all getters/mutators to consistently cast to the appropriate Writable* interfaces (e.g. getInt/getLong/getBytes -> use WritableIntVector/WritableLongVector/WritableBytesVector; setInt/setLong/fill and other mutators -> call the writable methods directly) rather than silently no-op or throw different exceptions later; ensure the constructor and class-level validation covers the specific writable subtypes used by methods like getDecimal, getInt, getLong, getBytes, setInt, setLong, fill so behavior is deterministic and failures occur at construction time.

hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunctionWithContinuousSort.java-179-213 (1)
179-213: ⚠️ Potential issue | 🟠 Major

The memory guard does not account for the row being inserted.

`processElement` decides whether to drain based on the current `sizeTracer.bufferSize`, then inserts the next row and only afterwards updates the tracer. Because that tracer is also pinned to the first row's size, a larger later record can push the buffer well past `maxBufferSize` and keep it there until another record or a checkpoint arrives. That defeats the intended memory bound for variable-width payloads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunctionWithContinuousSort.java` around lines 179 - 213, The memory check must account for the incoming row's size before accepting it: in processElement compute the currentRecordSize (use ObjectSizeCalculator.getObjectSize(data) after doing the object-reuse copy) and use sizeTracer.bufferSize + currentRecordSize > sizeTracer.maxBufferSize (or >=) as the condition to call drainRecords(drainSize); after draining re-check using the same currentRecordSize to decide whether to throw; set estimatedRecordSize if it was zero (e.g., estimatedRecordSize = currentRecordSize) and call sizeTracer.trace(currentRecordSize) when inserting instead of tracing the stale estimatedRecordSize; refer to processElement, sizeTracer.bufferSize, sizeTracer.maxBufferSize, ObjectSizeCalculator.getObjectSize, estimatedRecordSize, sortedRecords.put and sizeTracer.trace to locate edits.

hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java-79-101 (1)
79-101: ⚠️ Potential issue | 🟠 Major

Avoid logging raw config/properties; this can leak credentials.

`cfg` and `props` may include sensitive values (for example `hive.pass`), and these are currently logged verbatim.

Proposed fix

```diff
-    LOG.info("Cfg received: {}", cfg);
+    LOG.info("Cfg received: basePath={}, baseFileFormat={}, propsFilePathSet={}, hoodieConfCount={}",
+        cfg.basePath, cfg.baseFileFormat, cfg.propsFilePath != null, cfg.configs.size());
@@
-    LOG.info("HiveSyncConfig props used to sync data {}", props);
+    LOG.info("HiveSyncConfig prepared for sync. propertyKeys={}", props.keySet());
```

Also applies to: 136-142
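A masking helper of the kind described here might look like the following sketch (`maskSensitive` is a hypothetical name, not an existing Hudi utility; the key pattern is an assumption about what counts as sensitive):

```java
import java.util.Properties;
import java.util.regex.Pattern;

public class RedactProps {
  // Keys that look credential-like get their values replaced before logging.
  private static final Pattern SENSITIVE_KEY =
      Pattern.compile("(?i).*(pass|secret|token|credential).*");

  static Properties maskSensitive(Properties props) {
    Properties masked = new Properties();
    for (String name : props.stringPropertyNames()) {
      masked.setProperty(name, SENSITIVE_KEY.matcher(name).matches()
          ? "***REDACTED***"
          : props.getProperty(name));
    }
    return masked;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("hive.user", "hive");
    props.setProperty("hive.pass", "hunter2");
    // Only the redacted copy is ever handed to the logger.
    System.out.println(maskSensitive(props));
  }
}
```

Logging only `props.keySet()` (as in the diff above) is the simpler alternative; a redacted copy preserves the non-sensitive values for debugging.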
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java` around lines 79 - 101, Remove or sanitize sensitive fields before logging the configuration and properties in HudiHiveSyncJob; specifically stop logging cfg and props verbatim in the static main block (the LOG.info("Cfg received: {}", cfg) call) and in the run() method where LOG.info("HiveSyncConfig props used to sync data {}", props) is invoked. Implement and call a helper like sanitizeConfig or maskSensitive(Map/Config) that redacts known sensitive keys (e.g., hive.pass, password, secret, key) and only logs non-sensitive keys or a redacted summary, and replace the direct uses of cfg and props in those LOG.info calls with the sanitized output. Ensure the helper is referenced from HudiHiveSyncJob.run() and the main startup logging so both locations (including the later mentions around lines 136-142) no longer expose raw credentials.

hudi-common/src/main/java/org/apache/hudi/common/util/CollectionUtils.java-77-79 (1)
77-79: ⚠️ Potential issue | 🟠 Major

`containsAll(Map, Map)` should be null-safe to avoid runtime NPEs.

The current implementation throws when either map is null. This can break callers that compare metastore/serde parameter maps that may be absent.

Proposed fix

```diff
 public static boolean containsAll(Map<?, ?> m1, Map<?, ?> m2) {
-  return m1.entrySet().containsAll(m2.entrySet());
+  if (isNullOrEmpty(m2)) {
+    return true;
+  }
+  if (isNullOrEmpty(m1)) {
+    return false;
+  }
+  return m1.entrySet().containsAll(m2.entrySet());
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/main/java/org/apache/hudi/common/util/CollectionUtils.java` around lines 77 - 79, The containsAll(Map<?,?> m1, Map<?,?> m2) method in CollectionUtils is not null-safe and can throw NPEs; update it to treat null maps as empty (i.e., return true if both are null or if m2 is null/empty, return false if m1 is null but m2 is non-empty) so callers comparing optional metastore/serde parameter maps don't crash; implement these null checks at the start of containsAll(Map, Map) (in class CollectionUtils) before calling entrySet().containsAll().

hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java-81-85 (1)
81-85: ⚠️ Potential issue | 🟠 Major

Use a `buildSparkContext` variant that doesn't hardcode a default master.

When `--spark-master` is omitted, the code forces `local[2]` execution via the `defaultMaster` parameter. In cluster environments, this prevents inheriting the cluster's configured master (e.g., from `spark-defaults.conf`).

Use `UtilHelpers.buildSparkContext(String appName, boolean enableHiveSupport, Map<String, String> configs)` instead, which respects the Spark configuration without forcing a fallback master. This allows proper delegation to cluster settings when `--spark-master` is not specified.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java` around lines 81 - 85, The code in HudiHiveSyncJob currently forces "local[2]" when cfg.sparkMaster is empty by calling UtilHelpers.buildSparkContext("HudiHiveSyncJob", "local[2]", true); change this to call the overload that accepts appName, enableHiveSupport and a configs map so it does not hardcode a default master: use UtilHelpers.buildSparkContext("HudiHiveSyncJob", true, configs) (construct configs from existing Spark/Hudi config values or an empty map) when cfg.sparkMaster is null/empty, and keep using the master-aware overload when cfg.sparkMaster is provided; ensure hive support remains enabled.

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java-1352-1379 (1)
1352-1379: ⚠️ Potential issue | 🟠 Major

Re-read the target instant after taking the rollback lock.

`commitInstantOpt` is captured before `beginStateChange(...)` and is only refreshed when `shouldAvoidDuplicateRollbackPlan()` is enabled. With that flag off, this method can still schedule a rollback from a stale snapshot if another writer changes the timeline in between.

Suggested fix

```diff
 if (!skipLocking) {
   txnManager.beginStateChange(Option.empty(), Option.empty());
 }
 try {
+  table.getMetaClient().reloadActiveTimeline();
+  commitInstantOpt = Option.fromJavaOptional(table.getActiveTimeline().getCommitsTimeline().getInstantsAsStream()
+      .filter(instant -> EQUALS.test(instant.requestedTime(), commitInstantTime))
+      .findFirst());
+
   if (config.shouldAvoidDuplicateRollbackPlan()) {
     // Check if another writer already scheduled a rollback for this instant to avoid duplicates.
-    table.getMetaClient().reloadActiveTimeline();
     Option<HoodiePendingRollbackInfo> pendingRollbackOpt = getPendingRollbackInfo(table.getMetaClient(), commitInstantTime);
     if (pendingRollbackOpt.isPresent()) {
       // Case 2a: a concurrent writer already scheduled the rollback — re-use it.
       return Option.of(Pair.of(pendingRollbackOpt.get().getRollbackInstant(), Option.of(pendingRollbackOpt.get().getRollbackPlan())));
     }
-    commitInstantOpt = Option.fromJavaOptional(table.getActiveTimeline().getCommitsTimeline().getInstantsAsStream()
-        .filter(instant -> EQUALS.test(instant.requestedTime(), commitInstantTime))
-        .findFirst());
   }
   if (commitInstantOpt.isEmpty()) {
     log.error("Cannot find instant {} in the timeline of table {} for rollback", commitInstantTime, config.getBasePath());
     return Option.empty();
   }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java` around lines 1352 - 1379, commitInstantOpt is captured before acquiring the rollback lock (txnManager.beginStateChange) and when shouldAvoidDuplicateRollbackPlan() is false we never refresh it, so we may operate on a stale timeline; after calling beginStateChange(...) (i.e., once the rollback lock is held) reload the timeline via table.getMetaClient().reloadActiveTimeline() and recompute commitInstantOpt using the same logic that uses table.getActiveTimeline().getCommitsTimeline().getInstantsAsStream() (the code that filters by EQUALS.test(instant.requestedTime(), commitInstantTime)), before proceeding to log, scheduleRollback (table.scheduleRollback) or create the new rollback instant; ensure this refresh happens regardless of shouldAvoidDuplicateRollbackPlan() so the code always uses the latest timeline snapshot under lock.

docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml-230-243 (1)
230-243: ⚠️ Potential issue | 🟠 Major

Update MinIO to use current credential variables.

`MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` have been deprecated since April 2021 (MinIO RELEASE.2021-04-22T15-44-28Z). Using `minio/minio:latest` with these deprecated variables creates a maintenance risk—future MinIO releases may remove support entirely. Switch to the current standard:

Required changes

```diff
   environment:
-    - MINIO_ACCESS_KEY=minio
-    - MINIO_SECRET_KEY=minio123
+    - MINIO_ROOT_USER=minio
+    - MINIO_ROOT_PASSWORD=minio123
     - MINIO_DOMAIN=minio
```

Additionally, consider pinning the image version instead of using `latest` to avoid unexpected behavioral changes in future MinIO releases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml` around lines 230 - 243, The minio service is using deprecated environment vars MINIO_ACCESS_KEY and MINIO_SECRET_KEY and an unpinned image; update the minio service to use the current variables MINIO_ROOT_USER and MINIO_ROOT_PASSWORD instead of MINIO_ACCESS_KEY / MINIO_SECRET_KEY, replace or add those values under the environment block for the "minio" service, and pin the image to a stable MinIO release tag (avoid image: 'minio/minio:latest') to prevent unexpected upgrades; keep the existing hostname/container_name/ports/volumes/command intact.

docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml-19-19 (1)
19-19: ⚠️ Potential issue | 🟠 Major

Pin all service images to specific version tags instead of using `:latest`.

This compose file contains twelve services using floating `:latest` tags or untagged images, making the test environment non-deterministic and prone to unexpected failures when base images are updated. Lines affected: 19, 37, 60, 93, 116, 154, 175, 193, 213, 231, 246. For comparison, zookeeper (line 135), kafka (line 144), and postgresql (line 86) already use pinned versions—apply the same practice to the remaining services.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml` at line 19, Replace all Docker image references that use floating tags or omit tags (e.g., the apachehudi/hudi-hadoop_3.4.0-namenode:latest entry and the other image lines mentioned) with explicit, immutable tags or digests to make the compose deterministic; update the image keys for each service to a specific version (or `@sha256` digest) consistent with your CI/test matrix, mirroring how zookeeper, kafka, and postgresql are already pinned, and ensure the service names (e.g., the namenode image reference, datanode, worker, spark, hive metastore, etc.) are updated accordingly so no service relies on :latest.

docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml-230-243 (1)
230-243: ⚠️ Potential issue | 🟠 Major

Migrate MinIO from deprecated credentials to current standard.

MinIO deprecated `MINIO_ACCESS_KEY` and `MINIO_SECRET_KEY` in April 2021. Current deployments should use `MINIO_ROOT_USER` and `MINIO_ROOT_PASSWORD` instead. Combined with `minio/minio:latest`, this creates a maintenance risk: future MinIO releases may drop backward compatibility for the deprecated variables.

This pattern appears across multiple docker-compose files in the repository (`docker-compose_hadoop340_hive2310_spark402_amd64.yml`, `docker-compose_hadoop340_hive2310_spark402_arm64.yml`, `docker-compose_hadoop340_hive313_spark401_amd64.yml`, `docker-compose_hadoop340_hive313_spark401_arm64.yml`, `docker-compose_hadoop334_hive313_spark353_amd64.yml`, and `docker-compose_hadoop334_hive313_spark353_arm64.yml`).

Proposed fix

```diff
 minio:
-  image: 'minio/minio:latest'
+  image: 'minio/minio:<pinned-release>'
   hostname: minio
   container_name: minio
@@
   environment:
-    - MINIO_ACCESS_KEY=minio
-    - MINIO_SECRET_KEY=minio123
+    - MINIO_ROOT_USER=minio
+    - MINIO_ROOT_PASSWORD=minio123
     - MINIO_DOMAIN=minio
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml` around lines 230 - 243, The MinIO service block named "minio" uses deprecated environment variables MINIO_ACCESS_KEY and MINIO_SECRET_KEY; update this to use MINIO_ROOT_USER and MINIO_ROOT_PASSWORD (preserving existing credential values) in the minio service definition and the other listed docker-compose files so the container runs with the modern variables; ensure the command and ports remain unchanged and verify the variable names are consistently replaced in all occurrences across the repository (e.g., entries in docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml and the other listed compose files).

docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml-19-19 (1)
19-19: ⚠️ Potential issue | 🟠 Major

Address missing ARM64 support in legacy service images.

This ARM64-specific compose file itself is the intended solution for apachehudi images (which lack unified multi-arch manifests per HUDI-3601), but three supporting service images are blocking ARM64 execution:

- `bde2020/hive-metastore-postgresql:2.3.0` — AMD64-only (postgres:9.5.3 base from 2016, no ARM64 variant)
- `bitnamilegacy/kafka:3.4.1` — AMD64-only (ARM64 support starts at 3.3.2+)
- `bitnamilegacy/zookeeper:3.6.4` — AMD64-only (ARM64 support starts at 3.9.3+)

Either substitute these with ARM64-supporting versions (e.g., `bitnamilegacy/kafka:4.0.0-debian-12-r10`, `bitnamilegacy/zookeeper:3.9.3-debian-12-r22`) or document that this stack only runs on AMD64 systems despite the filename.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml` at line 19, The compose file references legacy AMD64-only service images (symbols: bde2020/hive-metastore-postgresql:2.3.0, bitnamilegacy/kafka:3.4.1, bitnamilegacy/zookeeper:3.6.4) which block ARM64 execution; update the docker-compose entries to ARM64-compatible images (e.g., replace kafka and zookeeper with bitnamilegacy/kafka:4.0.0-debian-12-r10 and bitnamilegacy/zookeeper:3.9.3-debian-12-r22, and swap hive-metastore-postgresql to a maintained ARM64 PostgreSQL-backed metastore image) or add a clear note in the compose file header stating the stack is AMD64-only if replacements aren’t viable. Ensure the image keys in the compose (the same lines that currently set those three image values) are changed to the ARM64 tags or the header comment is added.

hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala-693-700 (1)
693-700: ⚠️ Potential issue | 🟠 Major

Pin the inline compaction threshold explicitly.

Both tests hard-code "still log-only" through the 4th write and "compacts on the 5th," but neither table definition sets `hoodie.compact.inline.max.delta.commits`. That makes these assertions depend on a global default instead of the test setup.

Suggested fix

```diff
 | hoodie.index.type = 'INMEMORY',
 | hoodie.compact.inline = 'true',
+| hoodie.compact.inline.max.delta.commits = '4',
 | hoodie.clean.commits.retained = '1'
```

Apply the same change to the Lance variant.
Also applies to: 817-825
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala` around lines 693 - 700, The table definitions in TestVectorDataSource that set tblproperties (the block including primaryKey, type, preCombineField, hoodie.index.type, hoodie.compact.inline, hoodie.clean.commits.retained) do not explicitly set the inline compaction threshold; add hoodie.compact.inline.max.delta.commits = '4' to that properties map so the tests that expect "still log-only" for the 4th write and compaction on the 5th are deterministic, and apply the identical change to the Lance variant table-definition block as well.

hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/blob/TestBatchedBlobReader.scala-254-266 (1)
254-266: ⚠️ Potential issue | 🟠 Major

Make the preserved-column assertions deterministic.

This test indexes directly into `resultDF.collect()`, but there is no ordering step here. If `readBatched` changes partitioning or row order, the assertions on `results(0)`/`results(1)` will become flaky even when the implementation is correct.

Suggested fix

```diff
-val results = resultDF.collect()
+val results = resultDF.orderBy("record_id").collect()
 assertEquals(2, results.length)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/blob/TestBatchedBlobReader.scala` around lines 254 - 266, The assertions index into resultDF.collect() without a defined ordering, which makes the test flaky; before collecting, apply a deterministic sort (e.g., orderBy the preserved key/record_id or other unique column) on resultDF returned by readBatched in TestBatchedBlobReader so the rows are in a stable order, then collect and assert values for the expected rows (use record_id or sequence in orderBy to locate rows deterministically).

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/ReadBlobRule.scala-79-82 (1)
79-82: ⚠️ Potential issue | 🟠 Major

Replace generic exceptions with `AnalysisException` for user-facing query validation errors.

These branches (79–82, 102–103, 165–167) reject user SQL and should report analysis errors consistent with Spark's optimizer/analyzer conventions, not generic runtime exceptions. `AnalysisException` is already imported in this file and is the standard mechanism for query-level validation failures during planning.

Also applies to: 93, 169 (consider whether internal consistency errors should also use `AnalysisException`).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/ReadBlobRule.scala` around lines 79 - 82, Replace user-facing IllegalArgumentException throws with Spark's AnalysisException in ReadBlobRule: for the case matching containsReadBlobInAnyExpression(node) (and the other branches flagged at lines ~93, 102–103, 165–167, 169), change throw new IllegalArgumentException(...) to throw new AnalysisException(...) preserving the existing error messages; ensure you only convert branches that are validating user SQL (keep internal consistency checks as-is or review individually) and that the file's imported AnalysisException is used.

hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala-397-401 (1)
397-401: ⚠️ Potential issue | 🟠 Major

The Lance BLOB test is enabled in Spark 3.5+ profiles despite being documented as incomplete.

The test at lines 397–401 contains a comment stating "Lance writer has no BLOB handling today (RFC-100 Phase 2). Expected to fail until support lands in HoodieSparkLanceWriter." However, this test is only skipped when `lance.skip.tests=true`, which applies to Spark 3.3/3.4 profiles. For Spark 3.5 and 4.0 profiles (where `lance.skip.tests=false`), and for any default build run, the test will execute and fail, creating red builds instead of a properly tracked pending test. Either fix the BLOB handling in Lance or add an explicit skip/pending marker for this test across all profiles that enable it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala` around lines 397 - 401, The test "Test Query Log Only MOR Table With BLOB OUT_OF_LINE column triggers compaction (Lance)" in TestBlobDataType.scala is currently only skipped when the system property lance.skip.tests=true, which leaves it enabled in 3.5/4.0 profiles; mark it as pending/ignored across all profiles by either replacing test(...) with ignore(...) or adding an unconditional assume(false, "Lance writer has no BLOB handling (RFC-100 Phase 2) — test pending until implemented") at the start of that test; update the single test definition so it no longer runs by default until HoodieSparkLanceWriter gains BLOB support.

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala-318-329 (1)
318-329: ⚠️ Potential issue | 🟠 Major

Allow overlapping or duplicate ranges instead of throwing.
Two rows can legitimately reference the same bytes, or partially overlapping slices, in the same external file. The current overlap check turns that into an exception as soon as both rows land in the same lookahead batch.
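The tolerant merge can be sketched in isolation (a toy model, not the `BatchedBlobReader` code; `Range` and `coalesce` are illustrative names). Assuming ranges sorted by start offset, an overlapping or duplicate row yields a negative gap and simply folds into the current range:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeCoalesce {
  public record Range(long start, long end) {} // half-open [start, end)

  // Merge ranges sorted by start: anything whose gap to the current range is
  // <= maxGap is folded in. Overlapping or duplicate ranges have a gap < 0,
  // so they merge naturally instead of raising an error.
  public static List<Range> coalesce(List<Range> sorted, long maxGap) {
    List<Range> merged = new ArrayList<>();
    Range current = null;
    for (Range r : sorted) {
      if (current == null) {
        current = r;
      } else if (r.start() - current.end() <= maxGap) {
        // Math.max keeps the end correct when r is fully contained in current.
        current = new Range(current.start(), Math.max(current.end(), r.end()));
      } else {
        merged.add(current);
        current = r;
      }
    }
    if (current != null) {
      merged.add(current);
    }
    return merged;
  }

  public static void main(String[] args) {
    // Two rows referencing overlapping slices of the same file: merged, not rejected.
    List<Range> out = coalesce(
        List.of(new Range(0, 100), new Range(50, 120), new Range(1000, 1100)), 16);
    System.out.println(out);
  }
}
```

Downstream, each row then slices its bytes out of the merged read, so duplicates and overlaps cost nothing extra.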
Proposed fix
```diff
-      val gap = row.offset - currentEndOffset
-      // Check for overlap
-      if (row.offset < currentEndOffset) {
-        throw new IllegalArgumentException(
-          s"Overlapping blob ranges detected: previous range [${currentStartOffset}, ${currentEndOffset}) and current row [${row.offset}, ${row.offset + row.length}) in file ${row.filePath}"
-        )
-      }
-      if (gap >= 0 && gap <= maxGap) {
+      val gap = row.offset - currentEndOffset
+      if (gap <= maxGap) {
         // Merge into current range
         currentEndOffset = math.max(currentEndOffset, row.offset + row.length)
         currentRows += row
       } else {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala` around lines 318 - 329, The overlap check in BatchedBlobReader.scala (within the batching logic that uses currentStartOffset, currentEndOffset, currentRows and row) incorrectly throws on overlapping or duplicate ranges; change it to accept overlaps by removing the IllegalArgumentException and treating overlapping/duplicate rows like adjacent ones: update currentEndOffset = math.max(currentEndOffset, row.offset + row.length) and add row to currentRows so the batch includes both references, ensuring downstream read logic can handle duplicate/overlapping slices; keep the existing gap/maxGap handling for non-overlapping gaps.

hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala-237-243 (1)
237-243: ⚠️ Potential issue | 🟠 Major

Quarantine this known-failing Lance VARIANT test until support is implemented.

The test acknowledges that Lance writer has no VARIANT handling today and will fail until RFC-100 Phase 2 lands. With `lance.skip.tests=false` by default in the Spark 4.0 build configuration, this test will execute and fail on every default run instead of being skipped. Other Lance tests in the codebase use `@DisabledIfSystemProperty(named = "lance.skip.tests", matches = "true")` for proper test isolation; this test should either use the same annotation or be marked `ignore()` until VARIANT support is available.

Suggested quarantine

```diff
-test("Test Query Log Only MOR Table With VARIANT column triggers compaction (Lance)") {
+ignore("Test Query Log Only MOR Table With VARIANT column triggers compaction (Lance)") {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala` around lines 237 - 243, The test "Test Query Log Only MOR Table With VARIANT column triggers compaction (Lance)" is a known-failing Lance VARIANT test and should be quarantined; update the test declaration in TestVariantDataType.scala to either add the JUnit annotation `@DisabledIfSystemProperty`(named = "lance.skip.tests", matches = "true") above the test or convert the test(...) call to ignore(...) (or equivalent test-framework skip) so the test is skipped when lance.skip.tests is not set to true, referencing the test string or the enclosing test method to locate the change.

hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/InternalSchemaConverter.java-71-79 (1)
71-79:⚠️ Potential issue | 🟠 Major
Avoid positive field IDs inside the BLOB sentinel record.
Using 0..3 for external_reference child fields can collide with regular schema field IDs (which are also allocated from 0), creating ambiguity in ID-based schema operations. Use dedicated negative sentinel IDs for these nested fields too.
🔧 Suggested fix
+ static final int BLOB_REFERENCE_PATH_FIELD_ID = -13;
+ static final int BLOB_REFERENCE_OFFSET_FIELD_ID = -14;
+ static final int BLOB_REFERENCE_LENGTH_FIELD_ID = -15;
+ static final int BLOB_REFERENCE_MANAGED_FIELD_ID = -16;
+
  private static Types.RecordType buildBlobInternalRecordType() {
    List<Types.Field> referenceFields = new ArrayList<>(4);
-   referenceFields.add(Types.Field.get(0, false,
+   referenceFields.add(Types.Field.get(BLOB_REFERENCE_PATH_FIELD_ID, false,
      HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH, Types.StringType.get()));
-   referenceFields.add(Types.Field.get(1, true,
+   referenceFields.add(Types.Field.get(BLOB_REFERENCE_OFFSET_FIELD_ID, true,
      HoodieSchema.Blob.EXTERNAL_REFERENCE_OFFSET, Types.LongType.get()));
-   referenceFields.add(Types.Field.get(2, true,
+   referenceFields.add(Types.Field.get(BLOB_REFERENCE_LENGTH_FIELD_ID, true,
      HoodieSchema.Blob.EXTERNAL_REFERENCE_LENGTH, Types.LongType.get()));
-   referenceFields.add(Types.Field.get(3, false,
+   referenceFields.add(Types.Field.get(BLOB_REFERENCE_MANAGED_FIELD_ID, false,
      HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED, Types.BooleanType.get()));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/InternalSchemaConverter.java` around lines 71 - 79, The BLOB sentinel's nested fields in InternalSchemaConverter use positive IDs (referenceFields list), which can clash with real schema field IDs; change the field ID integers for the four referenceFields (the entries created with Types.Field.get(...) for HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH, EXTERNAL_REFERENCE_OFFSET, EXTERNAL_REFERENCE_LENGTH, EXTERNAL_REFERENCE_IS_MANAGED) to dedicated negative sentinel IDs (e.g., -1, -2, -3, -4 or another reserved negative sequence) so these nested IDs cannot collide with normal allocated IDs; update any comments/constants if present to document the negative-id sentinel choice.
🟡 Minor comments (10)
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java-666-671 (1)
666-671:⚠️ Potential issue | 🟡 Minor
getWritePipelineWithObjectReuse bypasses pipeline-mode routing.
This helper hardcodes InsertFunctionWrapper, unlike getWritePipeline(...) which respects bulk-insert/bucket/stream modes. That can create the wrong test pipeline for non-append configs and reduce test validity.
Proposed adjustment
  public static TestFunctionWrapper<RowData> getWritePipelineWithObjectReuse(
      String basePath, Configuration conf) throws Exception {
-   org.apache.flink.api.common.ExecutionConfig execConfig = new org.apache.flink.api.common.ExecutionConfig();
-   execConfig.enableObjectReuse();
-   return new InsertFunctionWrapper<>(basePath, conf, execConfig);
+   org.apache.flink.api.common.ExecutionConfig execConfig = new org.apache.flink.api.common.ExecutionConfig();
+   execConfig.enableObjectReuse();
+   if (OptionsResolver.isAppendMode(conf)) {
+     return new InsertFunctionWrapper<>(basePath, conf, execConfig);
+   }
+   return getWritePipeline(basePath, conf);
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java` around lines 666 - 671, The helper getWritePipelineWithObjectReuse currently hardcodes returning new InsertFunctionWrapper<>(basePath, conf, execConfig) which bypasses the routing logic used by getWritePipeline and can produce incorrect pipelines for non-append configs; change getWritePipelineWithObjectReuse to enable object reuse on a local ExecutionConfig and then delegate to the existing getWritePipeline routing logic (call getWritePipeline(basePath, conf, execConfig) or the overload that accepts an ExecutionConfig) instead of directly instantiating InsertFunctionWrapper so the same bulk-insert / bucket / stream selection is preserved.
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/InsertFunctionWrapper.java-83-91 (1)
83-91:⚠️ Potential issue | 🟡 Minor
Propagate the supplied ExecutionConfig through the entire wrapper.
This constructor only updates the mocked environment/runtime context. The StreamTask created later still uses its own hard-coded ExecutionConfig, so async-clustering tests can end up running with different object-reuse semantics than the caller requested.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/InsertFunctionWrapper.java` around lines 83 - 91, The InsertFunctionWrapper constructor sets up a MockEnvironment and MockStreamingRuntimeContext with the supplied ExecutionConfig but doesn't propagate that ExecutionConfig to the StreamTask (or any later-created task) which still uses a hard-coded/default ExecutionConfig; update the wrapper so the same ExecutionConfig instance passed into InsertFunctionWrapper(...) is used when constructing the StreamTask/MockStreamTask and any related task builders/creators (e.g., where StreamTask is instantiated or ExecutionConfig is fetched), by passing the executionConfig variable through those constructors or setters so object-reuse semantics match the caller's ExecutionConfig.
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java-877-883 (1)
877-883:⚠️ Potential issue | 🟡 Minor
Prefer try-with-resources for the new write clients.
These clients are closed after assertions, and writeClient1.close() can also NPE if construction fails, which can mask the real test failure.
Proposed fix
- SparkRDDWriteClient writeClient1 = null;
- try {
-   writeClient1 = getHoodieWriteClient(config);
-   writeClient1.savepoint(fourthCommitTs, "user", "comment");
- } finally {
-   writeClient1.close();
- }
+ try (SparkRDDWriteClient<?> writeClient1 = getHoodieWriteClient(config)) {
+   writeClient1.savepoint(fourthCommitTs, "user", "comment");
+ }
Apply the same pattern to the other new writeClient usages in these tests.
Also applies to: 893-899, 905-926, 977-986, 1011-1023
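To make the failure mode concrete, here is a minimal standalone sketch (using a hypothetical Resource class, not Hudi's SparkRDDWriteClient) of why the manual null-plus-finally pattern is risky: if the constructor throws, close() runs on a null reference and the resulting NullPointerException replaces the original error, while try-with-resources skips close() entirely and surfaces the real failure.

```java
public class TryWithResourcesDemo {
  // Hypothetical resource whose constructor can fail, standing in for a write client.
  static class Resource implements AutoCloseable {
    Resource(boolean failCtor) {
      if (failCtor) {
        throw new IllegalStateException("construction failed");
      }
    }
    @Override
    public void close() {}
  }

  // The reviewed pattern: the NPE thrown from the finally block masks the
  // constructor's IllegalStateException.
  static Throwable withManualClose() {
    Resource r = null;
    try {
      try {
        r = new Resource(true);
      } finally {
        r.close(); // r is still null here -> NullPointerException
      }
      return null;
    } catch (Throwable t) {
      return t;
    }
  }

  // try-with-resources: close() is only invoked on a successfully acquired
  // resource, so the original error propagates.
  static Throwable withTryWithResources() {
    try {
      try (Resource r = new Resource(true)) {
        // never reached
      }
      return null;
    } catch (Throwable t) {
      return t;
    }
  }

  public static void main(String[] args) {
    assert withManualClose() instanceof NullPointerException;
    assert withTryWithResources() instanceof IllegalStateException;
    System.out.println("ok");
  }
}
```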
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java` around lines 877 - 883, Wrap creation of SparkRDDWriteClient instances returned by getHoodieWriteClient in try-with-resources so they are always closed even if construction or subsequent assertions fail; specifically replace manual new/close usage around variables like writeClient (used with runCleaner and metaClient checks) with try (SparkRDDWriteClient<?> writeClient = getHoodieWriteClient(config)) { ... } blocks, and apply the same pattern to the other writeClient occurrences referenced in this test (lines around 893-899, 905-926, 977-986, 1011-1023) so no close() is called on a null/partially-constructed client and resources are reliably released.
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java-260-263 (1)
260-263:⚠️ Potential issue | 🟡 Minor
Use >= for the empty-clean interval boundary.
At exactly the configured interval, this still suppresses the empty clean, so the planner does one more full-scan cycle than necessary. This looks like an off-by-one in the threshold check.
Proposed fix
- eligibleForEmptyCleanCommit = currentCleanTimeMs - lastCleanTimeMs > (TimeUnit.HOURS.toMillis(config.getIntervalToCreateEmptyCleanHours()));
+ eligibleForEmptyCleanCommit = currentCleanTimeMs - lastCleanTimeMs >= TimeUnit.HOURS.toMillis(config.getIntervalToCreateEmptyCleanHours());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java` around lines 260 - 263, The empty-clean eligibility check currently computes currentCleanTimeMs and lastCleanTimeMs and sets eligibleForEmptyCleanCommit using a strict greater-than comparison; change the condition in the assignment to use a greater-than-or-equal (>=) comparison so that when the elapsed time equals TimeUnit.HOURS.toMillis(config.getIntervalToCreateEmptyCleanHours()) the empty clean is allowed; update the expression that assigns eligibleForEmptyCleanCommit (which references latestDateTime, currentCleanTimeMs, lastCleanTimeMs, HoodieInstantTimeGenerator.parseDateFromInstantTime(...), and config.getIntervalToCreateEmptyCleanHours()) to use >= instead of >.
docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml-18-34 (1)
18-34:⚠️ Potential issue | 🟡 Minor
Either mount the declared namenode volume or remove it.
namenode is declared as a top-level volume, but the namenode service never mounts it. Right now the file advertises persistence while HDFS metadata still lives in the container filesystem and disappears with the container.
Also applies to: 259-263
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml` around lines 18 - 34, The docker-compose advertises a top-level volume named "namenode" but the namenode service does not mount it, so either mount the volume into the namenode service (e.g., add a volumes: entry under the namenode service like " - namenode:/hadoop/dfs/name" or another appropriate HDFS namenode metadata directory) or remove the top-level "namenode" volume declaration if persistence is not desired; update the same pattern for the other service mentioned (the other service declared at lines 259-263) so declared volumes are actually mounted or removed.
docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml-18-34 (1)
18-34:⚠️ Potential issue | 🟡 Minor
Either mount the declared namenode volume or remove it.
namenode is declared as a top-level volume, but the namenode service never mounts it. Right now the file advertises persistence while HDFS metadata still lives in the container filesystem and disappears with the container.
Also applies to: 259-263
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml` around lines 18 - 34, The docker-compose declares a top-level volume named "namenode" but the namenode service does not mount it, so either mount it into the service or remove the unused volume declaration; fix by adding a volumes section to the namenode service (e.g., under service "namenode" add volumes: - namenode:/hadoop/dfs/name or the container's NameNode metadata path) and ensure the top-level volumes: block still defines "namenode", or alternatively delete the top-level "namenode" volume if persistence is not desired; apply the same change to any other services that declare volumes but do not mount them.
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java-56-77 (1)
56-77:⚠️ Potential issue | 🟡 Minor
Use the injected config as the single source of truth.
populateMetaFields is captured before applyConfigInjector, while the rest of the parquet writer settings come from hoodieConfig. If an injector changes any setting that affects meta-field or bloom-filter behavior, this path initializes the writer from a mixed config snapshot.
Suggested fix
- boolean populateMetaFields = config.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
-
  Pair<StorageConfiguration, HoodieConfig> injectedConfigs = HoodieParquetConfigInjector.applyConfigInjector(path, storage.getConf(), config);
  StorageConfiguration storageConfiguration = injectedConfigs.getLeft();
  HoodieConfig hoodieConfig = injectedConfigs.getRight();
+ boolean populateMetaFields = hoodieConfig.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java` around lines 56 - 77, The code reads populateMetaFields from config before calling HoodieParquetConfigInjector.applyConfigInjector, producing a mixed snapshot; move the populateMetaFields assignment to after the applyConfigInjector call and derive it from the injected HoodieConfig (i.e., use hoodieConfig.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS)), then pass that value into enableBloomFilter(populateMetaFields, hoodieConfig) and into the HoodieRowParquetConfig construction so all parquet/writer settings consistently come from the injected hoodieConfig; update references to the populateMetaFields local variable and ensure HoodieParquetConfigInjector.applyConfigInjector, enableBloomFilter, and HoodieRowParquetConfig use the injected hoodieConfig value.
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReaderStrategy.scala-40-48 (1)
40-48:⚠️ Potential issue | 🟡 Minor
Validate batching configs before calling toInt.
A bad session conf currently bubbles up as a raw NumberFormatException during planning, and non-positive values are not rejected here. Please fail fast with a clear config error before building the exec node.
Suggested fix
+import scala.util.Try
+
 case class BatchedBlobReaderStrategy(sparkSession: SparkSession) extends SparkStrategy {
   def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
     case read @ BatchedBlobRead(child, _, _) =>
       // TODO find proper way to access these configs
-      val maxGapBytes = HoodieSparkConfUtils.getConfigValue(
-        Map.empty, sparkSession.sessionState.conf,
-        BatchedBlobReader.MAX_GAP_BYTES_CONF,
-        String.valueOf(BatchedBlobReader.DEFAULT_MAX_GAP_BYTES)).toInt
+      val maxGapBytes = parsePositiveInt(
+        BatchedBlobReader.MAX_GAP_BYTES_CONF,
+        BatchedBlobReader.DEFAULT_MAX_GAP_BYTES)
-      val lookaheadSize = HoodieSparkConfUtils.getConfigValue(
-        Map.empty, sparkSession.sessionState.conf,
-        BatchedBlobReader.LOOKAHEAD_SIZE_CONF,
-        String.valueOf(BatchedBlobReader.DEFAULT_LOOKAHEAD_SIZE)).toInt
+      val lookaheadSize = parsePositiveInt(
+        BatchedBlobReader.LOOKAHEAD_SIZE_CONF,
+        BatchedBlobReader.DEFAULT_LOOKAHEAD_SIZE)
       val storageConf = new HadoopStorageConfiguration(sparkSession.sparkContext.hadoopConfiguration)
       BatchedBlobReadExec(
@@
     case _ => Nil
   }
+
+  private def parsePositiveInt(key: String, defaultValue: Int): Int = {
+    val raw = HoodieSparkConfUtils.getConfigValue(
+      Map.empty,
+      sparkSession.sessionState.conf,
+      key,
+      defaultValue.toString)
+
+    Try(raw.toInt).filter(_ > 0).getOrElse {
+      throw new IllegalArgumentException(s"$key must be a positive integer, got: $raw")
+    }
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReaderStrategy.scala` around lines 40 - 48, The two config reads for maxGapBytes and lookaheadSize in BatchedBlobReaderStrategy call .toInt directly and can throw NumberFormatException or accept non-positive values; change the logic to validate parsing and value ranges before building the exec node: for each config key (BatchedBlobReader.MAX_GAP_BYTES_CONF and BatchedBlobReader.LOOKAHEAD_SIZE_CONF) retrieve the string via HoodieSparkConfUtils.getConfigValue, attempt a safe parse (e.g., parseInt with explicit error handling or toIntOption), if parsing fails throw an IllegalArgumentException with a clear message including the config name and bad value, and if parsed value <= 0 also throw a clear IllegalArgumentException; then assign the validated integers to maxGapBytes and lookaheadSize (using DEFAULT_* when appropriate or only after validation).
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestFileGroupReaderPartitionColumn.scala-130-135 (1)
130-135:⚠️ Potential issue | 🟡 Minor
Avoid toMap here; it can mask duplicate-key regressions.
collect().map(...).toMap silently keeps one value per id. If MOR merge regresses and returns duplicate rows per key, this test can still pass.
💡 Suggested hardening
- val rows = spark.read.format("hudi").load(basePath)
-   .select("id", "country")
-   .collect()
-   .map(r => r.getLong(0) -> (if (r.isNullAt(1)) null else r.getString(1)))
-   .toMap
+ val grouped = spark.read.format("hudi").load(basePath)
+   .select("id", "country")
+   .collect()
+   .groupBy(_.getLong(0))
+
+ assertEquals(
+   Set.empty[Long],
+   grouped.collect { case (id, rs) if rs.lengthCompare(1) != 0 => id }.toSet,
+   "Expected exactly one row per id")
+
+ val rows = grouped.map { case (id, rs) =>
+   id -> (if (rs.head.isNullAt(1)) null else rs.head.getString(1))
+ }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestFileGroupReaderPartitionColumn.scala` around lines 130 - 135, The current test builds `rows` using `collect().map(...).toMap`, which hides duplicate `id` values; change the test to detect duplicates explicitly by collecting into a Seq (keep the `collect().map(r => r.getLong(0) -> (if (r.isNullAt(1)) null else r.getString(1)))` transformation but do not call `toMap`), then group by the id (e.g., call `.groupBy(_._1)`), assert each group has size 1 (fail the test if any id has >1 rows), and only then build a Map of id -> value for subsequent assertions; reference the `rows` binding and the `collect().map(...).toMap` usage when locating where to change the code.
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala-752-775 (1)
752-775:⚠️ Potential issue | 🟡 Minor
Use the actual metadata contract in the docs and error text.
This helper no longer searches for hudi_blob=true; it searches for hudi_type=BLOB. Leaving the old wording here will send callers to the wrong metadata key when resolution fails.
Proposed fix
- * Find the blob column by searching for metadata hudi_blob=true.
+ * Find the blob column by searching for metadata hudi_type=BLOB.
@@
- // Find field with metadata key hudi_blob=true
+ // Find field with metadata key hudi_type=BLOB
@@
- s"No blob column found with metadata hudi_blob=true. " +
+ s"No blob column found with metadata ${HoodieSchema.TYPE_METADATA_FIELD}=BLOB. " +
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala` around lines 752 - 775, Update the Javadoc and error message in getBlobColumn to reference the actual metadata contract ("hudi_type=BLOB") instead of the outdated "hudi_blob=true"; specifically change the method comment and the IllegalArgumentException text to mention the metadata key HoodieSchema.TYPE_METADATA_FIELD with value HoodieSchemaType.BLOB (or the literal "hudi_type=BLOB") so callers are directed to the correct metadata key used by the lookup logic that checks field.metadata.getString(HoodieSchema.TYPE_METADATA_FIELD).
if (!this.heartbeatClient.isHeartbeatExpired(rollbackInstantOpt.get().requestedTime())) {
  LOG.error("Rollback heartbeat already exists for instant {}", rollbackInstantOpt.get().requestedTime());
  return false;
}
if (table.getMetaClient().reloadActiveTimeline().getRollbackTimeline().filterCompletedInstants().getInstantsAsStream()
    .anyMatch(instant -> EQUALS.test(instant.requestedTime(), rollbackInstantOpt.get().requestedTime()))) {
  LOG.info("Requested rollback instant {} is already completed in the active timeline", rollbackInstantOpt.get().requestedTime());
  return false;
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
sed -n '116,135p' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
rg -n '\bLOG\b|\blog\b' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
Repository: hudi-agent/hudi
Length of output: 5789
Change LOG to log for compilation.
@Slf4j generates log, not LOG. Lines 1320 and 1325 use an undefined logger field and will not compile.
Suggested fix
- LOG.error("Rollback heartbeat already exists for instant {}", rollbackInstantOpt.get().requestedTime());
+ log.error("Rollback heartbeat already exists for instant {}", rollbackInstantOpt.get().requestedTime());
return false;
}
if (table.getMetaClient().reloadActiveTimeline().getRollbackTimeline().filterCompletedInstants().getInstantsAsStream()
.anyMatch(instant -> EQUALS.test(instant.requestedTime(), rollbackInstantOpt.get().requestedTime()))) {
- LOG.info("Requested rollback instant {} is already completed in the active timeline", rollbackInstantOpt.get().requestedTime());
+ log.info("Requested rollback instant {} is already completed in the active timeline", rollbackInstantOpt.get().requestedTime());
return false;
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java`
around lines 1319 - 1326, The code in BaseHoodieTableServiceClient is using an
undefined logger name `LOG`; change those calls to the SLF4J-generated `log`
instance. Specifically, replace the two occurrences where `LOG.error(...)` and
`LOG.info(...)` are invoked around the heartbeat and rollback timeline checks
(the block that calls
heartbeatClient.isHeartbeatExpired(rollbackInstantOpt.get().requestedTime()) and
the subsequent
table.getMetaClient().reloadActiveTimeline().getRollbackTimeline()...anyMatch(...)
check) to use `log.error(...)` and `log.info(...)` respectively so the class
compiles with the `@Slf4j-provided` logger.
+ // Build ColumnVector[] in Spark-schema order by looking each field up by name;
+ // lance-spark 0.4.0's VectorSchemaRoot may return the file's on-disk order, which
+ // would misalign the UnsafeProjection. Cached on the first batch and reused thereafter.
  if (columnVectors == null) {
-   columnVectors = root.getFieldVectors().stream()
-       .map(LanceArrowColumnVector::new)
-       .toArray(ColumnVector[]::new);
+   List<FieldVector> fieldVectors = root.getFieldVectors();
+   Map<String, FieldVector> byName = new HashMap<>(fieldVectors.size() * 2);
+   for (FieldVector fv : fieldVectors) {
+     byName.put(fv.getName(), fv);
+   }
+   StructField[] sparkFields = sparkSchema.fields();
+   if (sparkFields.length != fieldVectors.size()) {
+     throw new HoodieException("Lance batch column count " + fieldVectors.size()
+         + " does not match expected Spark schema size " + sparkFields.length
+         + " for file: " + path);
+   }
+   columnVectors = new ColumnVector[sparkFields.length];
+   for (int i = 0; i < sparkFields.length; i++) {
+     String name = sparkFields[i].name();
+     FieldVector fv = byName.get(name);
+     if (fv == null) {
+       throw new HoodieException("Lance batch missing expected column '" + name
+           + "' for file: " + path + "; available columns: " + byName.keySet());
+     }
+     columnVectors[i] = new LanceArrowColumnVector(fv);
+   }
  }
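The name-based alignment above can be illustrated with a minimal standalone sketch (plain String[] stand-ins, not Arrow vectors): when a file may return columns in on-disk order, look each expected field up by name rather than trusting positions, and fail loudly on a missing column.

```java
import java.util.HashMap;
import java.util.Map;

public class ColumnReorderDemo {
  // Align file columns (possibly in on-disk order) to the expected schema
  // order by name; throws if an expected column is absent.
  static String[] alignToSchema(String[] expectedNames, String[] fileNames) {
    Map<String, Integer> byName = new HashMap<>();
    for (int i = 0; i < fileNames.length; i++) {
      byName.put(fileNames[i], i);
    }
    String[] aligned = new String[expectedNames.length];
    for (int i = 0; i < expectedNames.length; i++) {
      Integer idx = byName.get(expectedNames[i]);
      if (idx == null) {
        throw new IllegalArgumentException("missing expected column: " + expectedNames[i]);
      }
      aligned[i] = fileNames[idx];
    }
    return aligned;
  }

  public static void main(String[] args) {
    // File stores (age, name, id); the query schema expects (id, name, age).
    String[] aligned = alignToSchema(
        new String[] {"id", "name", "age"},
        new String[] {"age", "name", "id"});
    assert aligned[0].equals("id") && aligned[2].equals("age");
    System.out.println("ok");
  }
}
```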
🧩 Analysis chain
🌐 Web query:
Does lance-spark 0.4.0 ArrowReader reuse FieldVector instances across batches?
💡 Result:
No, lance-spark 0.4.0 ArrowReader does not reuse FieldVector instances across batches. No source explicitly states reuse of FieldVector instances. Lance Spark uses JNI bindings to lance-core Rust library for reading Lance data into Arrow RecordBatches. Each batch read likely creates new FieldVectors, as typical in Arrow readers without explicit pooling or zero-copy mechanisms documented. Memory management mentions like "fix: Arrow memory leak in LanceCountStarPartitionReader" suggest allocation per operation rather than reuse. Official docs, releases up to v0.4.0-beta.1, and source searches show no evidence of FieldVector reuse across batches.
Citations:
- 1: https://github.com/lancedb/lance-spark/releases
- 2: https://github.com/lance-format/lance-spark
- 3: lance-format/lance-spark@3f84c3e
- 4: https://github.com/lance-format/lance-spark/releases/tag/v0.4.0-beta.1
Do not cache columnVectors across batches — Lance creates new FieldVector instances per batch, causing stale data to be read.
The code caches the columnVectors array on the first batch (if (columnVectors == null)) and reuses it for all subsequent batches. However, lance-spark 0.4.0's ArrowReader returns new FieldVector instances with each batch read. This means the cached LanceArrowColumnVector wrappers continue referencing the original FieldVectors from the first batch, reading stale data for all subsequent batches.
Remove the null check and rebuild the column vectors on each batch read. Alternatively, update the LanceArrowColumnVector references to point to the new FieldVectors from the current batch.
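The staleness hazard can be shown with a minimal standalone sketch (a plain int[] buffer as a stand-in for an Arrow FieldVector, and a trivial wrapper standing in for LanceArrowColumnVector): a wrapper cached on the first batch keeps reading the first batch's buffer, while a wrapper rebuilt per batch tracks the current one.

```java
public class StaleWrapperDemo {
  // Stand-in for a column-vector wrapper: it captures the buffer it was
  // constructed with and keeps reading from it.
  static class VectorWrapper {
    private final int[] buffer;
    VectorWrapper(int[] buffer) { this.buffer = buffer; }
    int get(int i) { return buffer[i]; }
  }

  public static void main(String[] args) {
    int[] batch1 = {1, 2, 3};
    int[] batch2 = {4, 5, 6}; // the reader hands back a NEW buffer per batch

    // Cache-on-first-batch pattern from the review: wrapper built once...
    VectorWrapper cached = new VectorWrapper(batch1);
    // ...still reads batch1's data while processing batch2.
    assert cached.get(0) == 1; // stale: batch2's value is 4

    // Rebuilding the wrapper per batch reads the current buffer.
    VectorWrapper rebuilt = new VectorWrapper(batch2);
    assert rebuilt.get(0) == 4;
    System.out.println("ok");
  }
}
```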
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceRecordIterator.java`
around lines 113 - 138, The current implementation caches the ColumnVector[] in
LanceRecordIterator (the columnVectors field) on the first batch, but
lance-spark returns new FieldVector instances each batch causing stale data;
remove the "if (columnVectors == null)" guard so that for every batch you
rebuild columnVectors by calling root.getFieldVectors(), mapping byName and
creating new LanceArrowColumnVector(fv) instances (or alternatively update each
existing LanceArrowColumnVector to point to the new FieldVector), ensuring the
logic around sparkSchema.fields() and the existing error checks (throwing
HoodieException with path) remains the same.
)
val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema).coalesce(1)

writeDataframe(tableType, tableName, tablePath, df, saveMode = SaveMode.Overwrite)
These writes target a precombine field that does not exist in the test schema.
writeDataframe(...) hardcodes hoodie.datasource.write.precombine.field=age, but neither testNullableVectorRoundTrip nor testVectorProjection defines an age column. These writes will fail before they exercise the vector path.
Suggested fix
- writeDataframe(tableType, tableName, tablePath, df, saveMode = SaveMode.Overwrite)
+ writeDataframe(
+   tableType, tableName, tablePath, df, saveMode = SaveMode.Overwrite,
+   extraOptions = Map(PRECOMBINE_FIELD.key() -> "id"))
- writeDataframe(tableType, tableName, tablePath, df, saveMode = SaveMode.Overwrite)
+ writeDataframe(
+   tableType, tableName, tablePath, df, saveMode = SaveMode.Overwrite,
+   extraOptions = Map(PRECOMBINE_FIELD.key() -> "id"))
Also applies to: 1080-1080
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala`
at line 1048, The writes in testNullableVectorRoundTrip and testVectorProjection
call writeDataframe which currently hardcodes
hoodie.datasource.write.precombine.field=age, but those tests' DataFrames have
no age column so the write fails; fix by making writeDataframe accept a
precombineField parameter (or accept a Map of write options) and update the two
test calls to pass a valid existing column name from their schemas (e.g., "id"
or the timestamp column) instead of "age", or alternatively add an "age" column
to the test DataFrames if that is preferred; update references to writeDataframe
in these tests to use the new parameter and ensure the property name
hoodie.datasource.write.precombine.field is set to the chosen existing column.
Mirror the parquet MOR log-only compaction tests for VECTOR, VARIANT, and BLOB onto the Lance base file format, and extend all variants with a 6th deltacommit so the cleaner has a chance to retire the post-compaction log-only slice and write a .clean instant.
- VECTOR Lance: passes; verifies HoodieFileFormat.LANCE on the table config and that a .lance base file exists under the table path after compaction.
- VARIANT Lance / BLOB INLINE Lance / BLOB OUT_OF_LINE Lance: gated by -Dlance.skip.tests; expected to fail at HoodieSparkLanceWriter -> LanceArrowUtils.toArrowType (RFC-100 Phase 2 gap). Each asserts the LANCE format config sticks to hoodie.properties immediately after CREATE TABLE so the table-level invariant is checked even when the writer fails downstream.
- All 8 tests (4 parquet + 4 Lance) now drive a 6th merge-update after the compaction-triggering 5th commit. The 5th commit's auto-clean runs before inline compaction, so the prior log slice is not yet superseded; the 6th commit's postCommit clean retires it and writes the .clean instant. The cleaner-timeline assertion uses reloadActiveTimeline() to avoid a stale cached view.
Actionable comments posted: 5
🧹 Nitpick comments (2)
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala (1)
127-234: Consider extracting shared MOR compaction flow into a helper.
The default and LANCE tests duplicate a long commit/merge/assert sequence. A shared helper (parameterized by base file format + extra assertions) would reduce drift and maintenance cost.
Also applies to: 244-349
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala` around lines 127 - 234, The test duplicates a long MOR compaction flow; extract the repeated sequence into a helper method (e.g., runMorCompactionFlow or assertMorCompactionRoundtrip) that accepts parameters like the target table name, tmp path, and an optional callback for extra assertions; move the repeated spark.sql insert/merge calls, the DataSourceTestUtils.isLogFileOnly checks, checkAnswer verifications, the variant field round-trip assertion (finding schema field "v" and asserting typeName == "variant"), and the createMetaClient timeline assertion into that helper, then call it from withRecordType() and the LANCE test with the appropriate parameters to remove duplication while preserving existing checks (references: withRecordType(), DataSourceTestUtils.isLogFileOnly, checkAnswer, createMetaClient, and the schema lookup for "v").
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala (1)
56-283: Consider extracting a shared MOR compaction scenario helper.
The four tests duplicate the same create/insert/merge/compaction/clean assertions with only blob-shape and format differences. A small helper would reduce drift and make future custom-type additions easier to maintain.
Also applies to: 285-517
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala` around lines 56 - 283, Extract the repeated MOR compaction scenario into a single helper (e.g., runMorCompactionScenario) that accepts parameters for the blob value builder (inlineBlobLiteral or outOfLineBlobLiteral), any pre-created blob files, and two validator callbacks (post-compaction bytes check and blob-shape assertions); move the shared steps—table creation (using create table SQL with type='mor' and hoodie.compact.inline='true'), the three inserts/merges that drive compaction, the DataSourceTestUtils.isLogFileOnly assertions, the post-compaction read_blob verification (using spark.sql(...).collect().map(...) and DataSourceTestUtils/BlobTestHelpers checks), metadata assertions on spark.table(...).schema (checking HoodieSchema.TYPE_METADATA_FIELD and HoodieSchemaType.BLOB), the final merge that triggers cleaning, and metaClient cleaner timeline assertion—into that helper and replace the four near-duplicate test bodies (Test Query Log Only MOR Table With BLOB INLINE/OUT_OF_LINE and the other two similar tests) to call it with the appropriate literal generator (inlineBlobLiteral or outOfLineBlobLiteral), file args (file1..file4 or none), and the specific byte/shape assertions; reference helpers and symbols in the diff such as inlineBlobLiteral, outOfLineBlobLiteral, DataSourceTestUtils.isLogFileOnly, read_blob, BlobTestHelpers.assertBytesContent, HoodieSchema.TYPE_METADATA_FIELD, HoodieSchema.Blob.* and createMetaClient when wiring parameters.
📒 Files selected for processing (4)
- hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
- hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala
- hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
- hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala
🚧 Files skipped from review as they are similar to previous changes (1)
- hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala
// Verify the LANCE config was actually persisted to hoodie.properties.
assertResult(HoodieFileFormat.LANCE)(
  createMetaClient(spark, tablePath).getTableConfig.getBaseFileFormat)
Assert actual .lance base-file creation after compaction.
Line 314 and Line 435 only validate table config, not emitted file artifacts. If compaction silently writes a non-Lance base file, these tests can still pass. Please add a filesystem assertion after Line 343 and Line 467 that at least one .lance base file exists.
🧪 Proposed test hardening
+ private def hasBaseFileWithExtension(root: String, ext: String): Boolean = {
+ val stream = java.nio.file.Files.walk(new File(root).toPath)
+ try {
+ stream.anyMatch(path =>
+ java.nio.file.Files.isRegularFile(path) &&
+ path.getFileName.toString.endsWith(ext))
+ } finally {
+ stream.close()
+ }
+ }
...
assertResult(false)(DataSourceTestUtils.isLogFileOnly(tablePath))
+ assert(hasBaseFileWithExtension(tablePath, ".lance"),
+ s"Expected at least one .lance base file after compaction under $tablePath")
...
assertResult(false)(DataSourceTestUtils.isLogFileOnly(tablePath))
+ assert(hasBaseFileWithExtension(tablePath, ".lance"),
+ s"Expected at least one .lance base file after compaction under $tablePath")
Also applies to: 343-343, 434-437, 467-467
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`
around lines 313 - 316, Tests in TestBlobDataType.scala currently only assert
the table config reports HoodieFileFormat.LANCE but do not verify that
compaction actually produced a .lance base file; add a filesystem-level
assertion after the existing config check (after where createMetaClient(spark,
tablePath).getTableConfig.getBaseFileFormat is asserted) that scans the
tablePath recursively via Hadoop FileSystem (e.g.,
FileSystem.get(spark.sparkContext.hadoopConfiguration)) and asserts at least one
file path endsWith ".lance" (or matches the expected Lance base-file suffix) to
ensure a real .lance artifact was written; use the existing tablePath variable
and place the check in both spots mentioned so the test fails if no .lance base
file exists after compaction.
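The filesystem check this prompt asks for can be sketched with the plain JDK (helper name hypothetical; a Hadoop `FileSystem.listFiles(path, true)` scan would have the same shape):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class BaseFileScan {
    // True if any regular file under root ends with the given extension.
    public static boolean hasBaseFileWithExtension(Path root, String ext) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.anyMatch(p ->
                Files.isRegularFile(p) && p.getFileName().toString().endsWith(ext));
        }
    }
}
```

Asserting this after compaction makes the test fail if a non-Lance base file is silently written, which the table-config check alone cannot catch.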
// Lance materializes the `reference` struct as non-null with all-null leaves for
// INLINE rows (vs. a null struct on Parquet). `type` is the canonical INLINE
// discriminator per RFC-100; tolerate either shape and just check the leaves.
val refIdx = blob.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE)
if (!blob.isNullAt(refIdx)) {
  val ref = blob.getStruct(refIdx)
  assert(ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH)))
}
Strengthen INLINE Lance reference-shape validation.
Line 364 checks only external_path. A partially populated reference struct (offset/length/managed) would still pass. Please assert all reference leaves are null (or expected defaults) when type = INLINE.
🔍 Proposed assertion extension
if (!blob.isNullAt(refIdx)) {
val ref = blob.getStruct(refIdx)
assert(ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH)))
+ assert(ref.isNullAt(ref.fieldIndex("offset")))
+ assert(ref.isNullAt(ref.fieldIndex("length")))
+ assert(ref.isNullAt(ref.fieldIndex("managed")))
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`
around lines 358 - 365, The test currently only asserts EXTERNAL_REFERENCE_PATH
is null for an INLINE reference; tighten it by verifying all leaves of the
reference struct are the expected null/default values: after obtaining ref from
blob.getStruct(refIdx) assert
ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH)),
ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_OFFSET)), and
ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_LENGTH)), and
assert the managed flag (HoodieSchema.Blob.EXTERNAL_REFERENCE_MANAGED) is the
expected default (e.g., false or null as per schema) to ensure no partially
populated INLINE reference passes.
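The tightened rule reduces to: for an INLINE row, the reference struct is either null (the Parquet shape) or every one of its leaves is null (the Lance shape). A Java sketch over a map-shaped row (leaf names are placeholders standing in for the HoodieSchema.Blob.EXTERNAL_REFERENCE_* constants, not the real identifiers):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class InlineRefCheck {
    // Placeholder leaf names mirroring HoodieSchema.Blob.EXTERNAL_REFERENCE_* constants.
    static final List<String> REF_LEAVES =
        Arrays.asList("external_path", "external_offset", "external_length", "external_managed");

    // Valid INLINE reference: null struct, or a struct whose leaves are all null.
    public static boolean isValidInlineReference(Map<String, Object> ref) {
        if (ref == null) {
            return true;
        }
        return REF_LEAVES.stream().allMatch(leaf -> ref.get(leaf) == null);
    }
}
```

Checking all leaves, not just the path, is what catches a partially populated reference struct.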
test("Test Query Log Only MOR Table With BLOB OUT_OF_LINE column triggers compaction (Lance)") {
  assume(System.getProperty("lance.skip.tests") != "true",
    "Lance tests disabled via -Dlance.skip.tests=true")
  // Lance writer has no BLOB handling today (RFC-100 Phase 2). Expected to fail
  // until support lands in HoodieSparkLanceWriter; this test pins the gap.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Verify whether CI/build defaults force skipping Lance tests.
rg -n -C2 --iglob '*pom*.xml' --iglob '*.gradle*' --iglob '*.yml' --iglob '*.yaml' \
'lance\.skip\.tests|Dlance\.skip\.tests|skip.*lance|lance.*skip'
# Inspect all usages to confirm intended gating semantics repo-wide.
rg -n -C2 --iglob '*.scala' \
'System\.getProperty\("lance\.skip\.tests"\)|assume\(.*lance\.skip\.tests'
Repository: hudi-agent/hudi
Length of output: 5814
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Get the full test context around line 397-402
sed -n '390,410p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
# Check if test has any `@Ignore`, `@Pending`, or other skip annotations
rg -B5 'Test Query Log Only MOR Table With BLOB OUT_OF_LINE' \
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
# Look for test implementation to see if there's failure handling
sed -n '397,460p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
Repository: hudi-agent/hudi
Length of output: 4505
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check for assertThrows, expectedException or try-catch patterns in the OUT_OF_LINE test
sed -n '397,500p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | \
grep -i "assertthrows\|expected\|exception\|try\|catch"
# Check if there's a non-Lance variant to compare the expected behavior
grep -n "Test Query Log Only MOR Table With BLOB OUT_OF_LINE" hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
# Verify test count - how many similar Lance gated tests exist
rg -c 'assume\(System\.getProperty\("lance\.skip\.tests"\)' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
Repository: hudi-agent/hudi
Length of output: 416
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Get the complete test implementation to see where it ends and check for failure handling
sed -n '397,550p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | tail -60
# Also check the non-Lance version (line 161) to see if it has assertions or different behavior
sed -n '161,280p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | grep -A120 'test("Test Query Log Only MOR Table With BLOB OUT_OF_LINE' | head -80
Repository: hudi-agent/hudi
Length of output: 6345
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Check if there are any `@Ignore` or `@Pending` annotations on Lance tests in this file
rg -B3 'assume\(System\.getProperty\("lance\.skip\.tests"\)' \
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | \
grep -E '@Ignore|@Pending|@Disabled' || echo "No @Ignore/@Pending/@Disabled found"
# Check for similar pattern in other test files to see how other "expected to fail" tests are handled
rg -B2 'Expected to fail' hudi-spark-datasource/hudi-spark/src/test/scala/ | head -30
Repository: hudi-agent/hudi
Length of output: 1737
Test runs by default despite documented expected failure - needs opt-in gating or quarantine.
The opt-out gate at line 398 uses != "true", but pom.xml defaults lance.skip.tests to false, causing the test to run. Comments at lines 400–401 state the test is expected to fail because Lance lacks BLOB handling (RFC-100 Phase 2). The test executes full assertions and operations with no failure-handling mechanism (@Ignore, try-catch, assertThrows), so it will deterministically fail if Lance doesn't support BLOBs, destabilizing default test runs. Switch to opt-in gating (check == "true" instead) or add @Ignore until support lands.
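The two gating styles differ only in what happens when the property is unset; a quick Java sketch makes the default behavior explicit (property names taken from the review, helper names hypothetical):

```java
public class TestGating {
    // Opt-out (current): the test RUNS unless -Dlance.skip.tests=true is passed,
    // so an expected-to-fail test executes on default runs.
    public static boolean optOutRuns(String propValue) {
        return !"true".equals(propValue);
    }

    // Opt-in (suggested): the test is SKIPPED unless the enable flag is
    // explicitly set, e.g. -Dlance.enable.variant.tests=true.
    public static boolean optInRuns(String propValue) {
        return "true".equals(propValue);
    }
}
```

With opt-out gating, both an unset property and the pom default of "false" let the test run, which is exactly the failure mode described above.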
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`
around lines 397 - 402, The test "Test Query Log Only MOR Table With BLOB
OUT_OF_LINE column triggers compaction (Lance)" currently runs by default
because the assume check uses System.getProperty("lance.skip.tests") != "true"
while the property defaults to "false"; change the gating to opt-in by checking
for equality (e.g., System.getProperty("lance.skip.tests") == "true") so the
test only runs when explicitly enabled, or alternatively mark the test with
`@Ignore` (or wrap assertions in an assertThrows/try-catch) until
HoodieSparkLanceWriter adds BLOB support to avoid failing default test runs.
tblproperties (
  primaryKey = 'id',
  type = 'mor',
  preCombineField = 'ts',
  hoodie.index.type = 'INMEMORY',
  hoodie.compact.inline = 'true',
  hoodie.clean.commits.retained = '1'
)
Pin compaction trigger configs to avoid commit-count flakiness.
These tests assert exact behavior at commits 3/4/5/6, but the threshold is implicit. Please set it explicitly in table properties so defaults/config drift cannot break the assertions.
Suggested fix
  | hoodie.compact.inline = 'true',
+ | hoodie.compact.inline.max.delta.commits = '5',
  | hoodie.clean.commits.retained = '1'

  | hoodie.compact.inline = 'true',
+ | hoodie.compact.inline.max.delta.commits = '5',
  | hoodie.clean.commits.retained = '1',
  | 'hoodie.table.base.file.format' = 'LANCE'
Also applies to: 255-262
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala`
around lines 137 - 144, The table DDL in TestVariantDataType sets
hoodie.compact.inline='true' but leaves the compaction trigger threshold
implicit, causing commit-count flakiness; add an explicit compaction trigger
like hoodie.compact.inline.max.delta.commits='<expected_count>' to the
tblproperties in the shown DDL block (and the other occurrence around lines
255-262) so the test’s expected commit boundaries (the commits asserted at
3/4/5/6) are stable; update the table properties in the TestVariantDataType test
cases to include this property with the threshold the test expects.
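Why pinning matters: with inline compaction, which slot in the timeline the compaction commit occupies depends directly on the delta-commit threshold. A simplified model (an illustration only, not Hudi's actual scheduler, which supports multiple trigger strategies) makes the drift visible:

```java
import java.util.ArrayList;
import java.util.List;

public class CompactionModel {
    // Simplified trigger: schedule a compaction after every `maxDeltaCommits`
    // delta commits since the last compaction.
    public static List<String> timeline(int deltaCommits, int maxDeltaCommits) {
        List<String> t = new ArrayList<>();
        int sinceCompaction = 0;
        for (int i = 1; i <= deltaCommits; i++) {
            t.add("delta#" + i);
            if (++sinceCompaction == maxDeltaCommits) {
                t.add("compact");
                sinceCompaction = 0;
            }
        }
        return t;
    }
}
```

If the effective default ever shifted from 5 to 4, the compaction commit would move one slot earlier and assertions pinned to specific commit numbers (3/4/5/6 in these tests) would break, which is the flakiness the comment flags.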
assume(System.getProperty("lance.skip.tests") != "true",
  "Lance tests disabled via -Dlance.skip.tests=true")
// Lance writer has no VARIANT handling today (RFC-100 Phase 2). Expected to fail
// until support lands in HoodieSparkLanceWriter; this test pins the gap.
Make the LANCE VARIANT test semantics explicit (currently “expected to fail” but asserted as pass-path).
The comment says this case is expected to fail until support lands, but the test runs normal success assertions unless a manual JVM flag is set. That can make default test runs unstable.
Suggested fix (opt-in while feature is incomplete)
- assume(System.getProperty("lance.skip.tests") != "true",
- "Lance tests disabled via -Dlance.skip.tests=true")
- // Lance writer has no VARIANT handling today (RFC-100 Phase 2). Expected to fail
- // until support lands in HoodieSparkLanceWriter; this test pins the gap.
+ assume(System.getProperty("lance.enable.variant.tests") == "true",
+ "Experimental: enable with -Dlance.enable.variant.tests=true")
+ // Enable by opt-in until HoodieSparkLanceWriter fully supports VARIANT.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala`
around lines 239 - 243, The test currently runs as a normal pass-path but
comments say VARIANT support is missing; change the `assume(...)` guard in
TestVariantDataType so VARIANT tests are opt-in instead of running by default:
replace the existing `assume(System.getProperty("lance.skip.tests") != "true",
...)` check with an explicit opt-in like
`assume(System.getProperty("lance.enable.variant") == "true", "LANCE VARIANT
tests disabled unless -Dlance.enable.variant=true")` (or equivalent logic) so
the test is skipped by default until Lance VARIANT support is implemented.
Mirror of apache#18597 for automated bot review.
Original author: @voonhous
Base branch: master