[OSS PR #18597] test(schema): Add lance fileformat test for custom types on MOR#23

Open
hudi-agent wants to merge 35 commits into master from oss-18597

Conversation

@hudi-agent hudi-agent commented Apr 27, 2026

Mirror of apache#18597 for automated bot review.

Original author: @voonhous
Base branch: master

Summary by CodeRabbit

  • New Features

    • VECTOR, BLOB and VARIANT types; read_blob() SQL for efficient blob access; Lance as a base-file format; batched blob read execution and SQL integration; continuous-sort buffering for Flink ingestion; pluggable Parquet writer config injection.
  • Improvements

    • Safer empty-clean behavior and optional archival blocking based on latest clean ECTR to avoid premature commit removal.
  • Bug Fixes

    • Removed duplicate YARN setting; improved lock/cleanup robustness; fixed partition-column handling in reads.
  • Chores

    • Updated Lance/Spark versions and added Docker Compose stacks for local integration testing.

aaaZayne and others added 30 commits April 20, 2026 04:38
…che#18508)

* fix: whitelist Flink _2.12 artifacts in scala-2.13 enforcer rule

- Flink only publishes flink-table-planner_2.12 (no _2.13 variant exists). The scala-2.13 profile's blanket ban on *_2.12 artifacts breaks the build when combining -Dspark4.0 -Dscala-2.13 with -Dflink1.20.
- Add <includes> exceptions for org.apache.flink:*_2.12 and its transitive _2.12 dependencies (org.scala-lang.modules, com.twitter) since Flink has largely decoupled from Scala since 1.15 and the _2.12 suffix is internal.

* fix: tighten Flink _2.12 whitelist and add rationale comment

Address review feedback on apache#18508:
- Narrow scala-lang.modules and com.twitter includes to the specific transitives (scala-xml_2.12, chill_2.12) so the enforcer still catches unexpected _2.12 leakage from those groups.
- Add an XML comment explaining why these _2.12 artifacts are whitelisted despite the blanket scala-2.13 ban (Flink's _2.12 suffix is a legacy naming artifact post 1.15 Scala decoupling).
…int.sh (apache#18527)

- The property was inserted into yarn-site.xml twice with the same value in the MULTIHOMED_NETWORK=1 block.
- Duplicates are harmless at runtime (Hadoop's Configuration parser takes the last value for duplicates and both writes use the same value), but the second write is dead code.
- Applies to base, base_java11, and base_java17.
- MapUtils.isNullOrEmpty(Map) was byte-identical to CollectionUtils.isNullOrEmpty(Map), and MapUtils.nonEmpty(Map) had no matching overload in CollectionUtils.
- Move the sole remaining unique helper (containsAll) into CollectionUtils, delete MapUtils, and fold its test into TestCollectionUtils so there is one canonical utility class for Map/Collection emptiness and containment checks.
…bleList (apache#18530)

- Replace Stream.of(elements).collect(Collectors.toList()) with a direct Arrays.asList copy.
- Same semantics (mutable snapshot wrapped in an unmodifiable view), one less stream/spliterator/collector allocation per call.
- Called from constants/initializers, so the per-call saving is small but comes for free.
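For reference, a minimal before/after sketch of this change (class and method names are illustrative, not the actual Hudi helper):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public final class UnmodifiableListSketch {

  // Before: builds a stream, spliterator, collector, and intermediate list per call.
  static <T> List<T> beforeFix(T... elements) {
    return Collections.unmodifiableList(
        Stream.of(elements).collect(Collectors.toList()));
  }

  // After: the varargs array is already a fresh snapshot per call, so
  // Arrays.asList can back the unmodifiable view with one less allocation chain.
  static <T> List<T> afterFix(T... elements) {
    return Collections.unmodifiableList(Arrays.asList(elements));
  }
}
```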
…pache#18083)

* Implemented a continuous sorting mode for the append sink to maintain sorted order incrementally and avoid single-partition lag during ingestion by reducing large pause time from sort and backpressure

Summary:
- Added AppendWriteFunctionWithContinuousSort which keeps records in a TreeMap keyed by a code-generated normalized key and an insertion sequence, drains oldest entries when a configurable threshold is reached, and writes drained records immediately; snapshot/endInput drain remaining records.
- Updated AppendWriteFunctions.create to instantiate the continuous sorter when WRITE_BUFFER_SORT_CONTINUOUS_ENABLED is true.
- Introduced three new FlinkOptions: WRITE_BUFFER_SORT_CONTINUOUS_ENABLED, WRITE_BUFFER_SORT_CONTINUOUS_DRAIN_THRESHOLD_PERCENT, and WRITE_BUFFER_SORT_CONTINUOUS_DRAIN_SIZE, and added runtime validation (buffer > 0, 0 < threshold < 100, drainSize > 0, parsed non-empty sort keys).
- Added ITTestAppendWriteFunctionWithContinuousSort integration tests covering buffer flush triggers, sorted output correctness (with and without continuous drain), drain threshold/size behaviors, and invalid-parameter error cases.
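A compact, hypothetical sketch of the buffering scheme these commits describe; the real AppendWriteFunctionWithContinuousSort additionally handles Flink state, object reuse, memory tracing, and option validation:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

// Hypothetical sketch: records are ordered by (normalizedSortKey, insertion
// sequence) so equal sort keys keep arrival order; the lowest-keyed entries
// are drained once the buffer crosses the configured threshold.
final class ContinuousSortBuffer<R> {
  private static final class Key implements Comparable<Key> {
    final byte[] normalizedKey;
    final long seq; // disambiguates equal sort keys by arrival order

    Key(byte[] normalizedKey, long seq) {
      this.normalizedKey = normalizedKey;
      this.seq = seq;
    }

    @Override
    public int compareTo(Key o) {
      int c = Arrays.compare(normalizedKey, o.normalizedKey);
      return c != 0 ? c : Long.compare(seq, o.seq);
    }
  }

  private final TreeMap<Key, R> buffer = new TreeMap<>();
  private final int drainThreshold;
  private final int drainSize;
  private final Consumer<R> writer;
  private long seq = 0;

  ContinuousSortBuffer(int drainThreshold, int drainSize, Consumer<R> writer) {
    this.drainThreshold = drainThreshold;
    this.drainSize = drainSize;
    this.writer = writer;
  }

  void add(byte[] normalizedKey, R record) {
    buffer.put(new Key(normalizedKey, seq++), record);
    if (buffer.size() >= drainThreshold) {
      drain(drainSize);
    }
  }

  // Called from add() when full; snapshot/endInput would call
  // drain(buffer.size()) to flush everything that remains.
  void drain(int n) {
    for (int i = 0; i < n && !buffer.isEmpty(); i++) {
      Map.Entry<Key, R> lowest = buffer.pollFirstEntry();
      writer.accept(lowest.getValue()); // write drained records immediately
    }
  }
}
```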

---------

Co-authored-by: dsaisharath <dsaisharath@uber.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…apache#18204)

- Added HudiHiveSyncJob under hudi-utilities as an external runner for Hive sync.
- Added CLI/config support for base path, base file format, props file, and override configs.
- Wired the job to build sync properties and invoke HiveSyncTool directly.

---------

Co-authored-by: sivabalan <n.siva.b@gmail.com>
…() (apache#18461)

* fix: close HoodieStorage resource in FileSystemBasedLockProvider.close()

Adds a finally block to properly close the HoodieStorage instance
after the lock file is deleted, preventing resource leaks.

Fixes apache#14922

* fix: simplify log message per review feedback

Addresses nit from @yihua: shorten log to "Failed to close HoodieStorage" since logger already includes class context.

* fix: add comment explaining HoodieStorage.close() is no-op for HadoopStorage

Addresses review feedback from danny0405: HoodieHadoopStorage.close() is
a no-op because Hadoop FileSystem instances are shared within the JVM
process lifecycle. Added a comment to explain this behavior and why the
call is still retained for interface contract correctness.
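Sketched, the resulting close() shape looks roughly like this; the types and names below are stand-ins, not the actual FileSystemBasedLockProvider code:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative shape of the close-in-finally fix; the real delete call,
// storage type, and logging differ in FileSystemBasedLockProvider.
final class LockProviderCloseSketch implements AutoCloseable {
  private static final Logger LOG = LoggerFactory.getLogger(LockProviderCloseSketch.class);

  interface HoodieStorageLike extends AutoCloseable {
    void deleteFile(String path) throws Exception;
  }

  private final HoodieStorageLike storage;
  private final String lockFilePath;

  LockProviderCloseSketch(HoodieStorageLike storage, String lockFilePath) {
    this.storage = storage;
    this.lockFilePath = lockFilePath;
  }

  @Override
  public void close() {
    try {
      storage.deleteFile(lockFilePath); // release the lock file first
    } catch (Exception e) {
      LOG.warn("Failed to delete lock file", e);
    } finally {
      try {
        // No-op for HoodieHadoopStorage (FileSystem instances are JVM-shared),
        // but retained for interface contract correctness.
        storage.close();
      } catch (Exception e) {
        LOG.warn("Failed to close HoodieStorage", e);
      }
    }
  }
}
```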
…lters (apache#18531)

* perf(common): Avoid double-iterating log files in file-system-view filters

- filterUncommittedFiles and filterUncommittedLogs each called fileSlice.getLogFiles() twice: once to filter+collect into committedLogFiles, and again for .count() to compare sizes.
- Each call produced a fresh stream over the underlying collection.
- Materialize the log files once and compare against the resulting list size.
- Also switch fileSlices.size() == 0 to isEmpty() in fetchAllLogsMergedFileSlice.
- No behaviour changes.

* perf(common): add FileSlice#getLogFileCnt and drop intermediate list

- Address review comment on apache#18531:
  - The backing TreeSet's size() is O(1), so expose it as FileSlice#getLogFileCnt and use it for the size comparison in filterUncommittedFiles / filterUncommittedLogs.
- This removes the allLogFiles materialization and keeps a single stream pass for the filter.
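Condensed, the single-pass check reads approximately as follows; the predicate and helper names are illustrative:

```java
import java.util.Set;

import org.apache.hudi.common.model.FileSlice;

// Rough shape of the fixed filter: stream the log files once and compare the
// committed count against the O(1) TreeSet-backed FileSlice#getLogFileCnt.
final class UncommittedLogFilterSketch {
  static boolean allLogsCommitted(FileSlice slice, Set<String> committedTimes) {
    long committed = slice.getLogFiles()
        .filter(lf -> committedTimes.contains(lf.getDeltaCommitTime())) // illustrative predicate
        .count();
    return committed == slice.getLogFileCnt();
  }
}
```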
…pache#18488)

* feat(vector): Add Spark SQL DDL CREATE TABLE support for VECTOR type

- Enable VECTOR(dim[, elementType]) syntax in Spark SQL DDL so users can create tables with vector columns directly via SQL instead of only through the DataFrame API.

- Changes:
  - Extend ANTLR grammar to accept identifier params in primitiveDataType
  - Add VECTOR case in visitPrimitiveDataType (supports FLOAT, DOUBLE, INT8)
  - Add VECTOR metadata attachment in addMetadataForType
  - Add DDL tests for VECTOR columns in TestCreateTable

* Fix tests and address comments

* Improve VECTOR DDL test coverage with targeted tests and routing

- Relax isHoodieCommand VECTOR check from " vector(" to " vector" in all 4 extended parser files.
- The stricter " vector(" variant only routes SQL containing VECTOR type declarations with parentheses (e.g. VECTOR(128)), which means VECTOR without parens is delegated to Spark's native parser and never reaches our Hudi code path.
- Relaxing to " vector" routes all VECTOR-related SQL through our parser, enabling us to exercise the "vector with empty params" branch of the `case ("vector", _ :: _)` pattern - previously reported as partial coverage because the empty-list side of the `_ :: _` check was never hit.
- This is also consistent with the existing BLOB routing pattern " blob".

Add two targeted tests:
1. test create table with INT8 VECTOR column - isolated INT8 test that independently exercises the `case INT8 => ByteType` branch
2. test create table with VECTOR without dimension fails - routes VECTOR alone through the Hudi parser to cover the empty-list branch
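For context, a minimal sketch of the DDL surface these tests exercise, assuming a Spark session with the Hudi extensions enabled (table name and options are illustrative; real Hudi tables typically need more configuration):

```java
import org.apache.spark.sql.SparkSession;

// Sketch only: exercises the VECTOR(dim[, elementType]) DDL described above.
public class VectorDdlSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("vector-ddl-sketch")
        .master("local[1]")
        .getOrCreate();

    // The second parameter is optional and accepts FLOAT, DOUBLE, or INT8
    // per the grammar change; VECTOR(dim) alone uses the default element type.
    spark.sql("CREATE TABLE embeddings (id INT, emb VECTOR(128, FLOAT)) USING hudi");

    spark.stop();
  }
}
```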
* [MINOR] Bump lance to 4.0.0 and lance-spark to 0.4.0

Bumps lance-core from 1.0.2 to 4.0.0 and lance-spark connector
from 0.0.15 to 0.4.0. Updates affected import paths and adapts to
the LanceArrowUtils.toArrowSchema signature change (drops the
errorOnDuplicatedFieldNames parameter).

* [MINOR] Rename Hudi's ShowIndexes logical plan to HoodieShowIndexes

Lance-spark 0.4.0 (bumped in 7e4967c) ships its own
`org.apache.spark.sql.catalyst.plans.logical.ShowIndexes` inside
`lance-spark-base_*.jar`. This collides with Hudi's own same-FQCN
case class (added in hudi-spark-common). Both jars end up on the
classpath of hudi-spark3.3.x/3.4.x/3.5.x/4.0.x, and since the two
classes have different case-class arity (Lance's is 1-arg, Hudi's
is 2-arg), Scala pattern matches like `case ShowIndexes(table, output)`
fail to compile.

Rename Hudi's class to `HoodieShowIndexes` (and its companion
object) to sidestep the collision. This is an internal logical-plan
class consumed only by Hudi's own parser / CatalystPlanUtils /
analyzer — no public SQL or API surface changes.

Call-sites updated:
- Index.scala (definition + companion)
- HoodieSpark{33,34,35,40}CatalystPlanUtils.scala (pattern match)
- HoodieSpark{3_3,3_4,3_5,4_0}ExtendedSqlAstBuilder.scala (construct)
- IndexCommands.scala (doc reference)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(lance): look up Arrow vectors by field name in LanceRecordIterator

With lance-spark 0.4.0, VectorSchemaRoot.getFieldVectors() returns
vectors in the file's on-disk order rather than in the order of the
projection requested via LanceFileReader.readAll(). Wrapping vectors
positionally therefore mismatches the UnsafeProjection built from the
requested schema, causing UnsafeProjection to call type accessors on
the wrong column (e.g. getInt on a VarCharVector) and fail with
UnsupportedOperationException for MoR reads where the FileGroupRecord
Buffer rearranges columns relative to the file's write order.

Fix by looking up each vector by field name from the requested schema
so the ColumnVector[] order matches what UnsafeProjection expects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
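The core of that fix, sketched; the helper below is hypothetical, but Arrow's VectorSchemaRoot.getVector(String) is the name-based lookup it relies on:

```java
import org.apache.arrow.vector.FieldVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Hypothetical helper: resolve each Arrow vector by the requested field name
// so the resulting array matches the projection order UnsafeProjection
// expects, instead of trusting root.getFieldVectors()'s on-disk order.
final class ArrowVectorAlignmentSketch {
  static FieldVector[] alignToRequestedSchema(VectorSchemaRoot root, StructType requested) {
    StructField[] fields = requested.fields();
    FieldVector[] aligned = new FieldVector[fields.length];
    for (int i = 0; i < fields.length; i++) {
      aligned[i] = root.getVector(fields[i].name());
    }
    return aligned;
  }
}
```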

* [MINOR] Tighten comments on LanceRecordIterator and HoodieShowIndexes

Shorten the prose blocks above the column-order remapping in
LanceRecordIterator and above HoodieShowIndexes to 2-3 sentences each,
keeping the why (lance-spark 0.4.0 on-disk column order; FQCN shadow
from lance-spark-base) without the full incident narrative.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…es (apache#18380)

This PR adds support to block archival based on the Earliest Commit To Retain (ECTR) from the last completed clean operation, preventing potential data leaks when cleaning configurations change between clean and archival runs.

Problem: Currently, archival recomputes ECTR independently based on cleaning configs at archival time, rather than reading it from the last clean plan. When cleaning configs change between clean and archival operations, archival may archive commits whose data files haven't been cleaned yet, leading to timeline metadata loss for existing data files.

Summary and Changelog
User-facing summary: Users can now optionally enable archival blocking based on ECTR from the last clean to prevent archiving commits whose data files haven't been cleaned. This is useful when cleaning configurations may change over time or when strict data retention guarantees are needed.

Detailed changelog:

Configuration Changes:

Added new advanced config hoodie.archive.block.on.latest.clean.ectr (default: false)
When enabled, archival reads ECTR from last completed clean metadata
Blocks archival of commits with timestamp >= ECTR
Marked as advanced config for power users
Available since version 1.2.0

Implementation Changes:

TimelineArchiverV1.java: Added ECTR blocking logic in getCommitInstantsToArchive() method
Reads ECTR from last completed clean's metadata (lines 274-294)
Filters commit timeline to exclude commits >= ECTR (lines 322-326)
Follows same pattern as existing compaction/clustering retention checks
Includes error handling with graceful degradation (logs warning if metadata read fails)
HoodieArchivalConfig.java: Added config property BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR
Builder method: withBlockArchivalOnCleanECTR(boolean)
HoodieWriteConfig.java: Added access method shouldBlockArchivalOnCleanECTR()
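Opting in via the builder named above would look roughly like this (base path and table name are illustrative; other options elided):

```java
import org.apache.hudi.config.HoodieArchivalConfig;
import org.apache.hudi.config.HoodieWriteConfig;

// Sketch: enable archival blocking on the last completed clean's ECTR.
public class EctrArchivalConfigSketch {
  public static void main(String[] args) {
    HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
        .withPath("/tmp/hudi_table")      // illustrative base path
        .forTable("hudi_table")           // illustrative table name
        .withArchivalConfig(HoodieArchivalConfig.newBuilder()
            .withBlockArchivalOnCleanECTR(true) // hoodie.archive.block.on.latest.clean.ectr
            .build())
        .build();
    System.out.println(writeConfig.shouldBlockArchivalOnCleanECTR());
  }
}
```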
* fix: prevent parseTypeDescriptor crash for non-custom logical types in schema conversion guards

- The BLOB/VECTOR guard conditions in HoodieSparkSchemaConverters called parseTypeDescriptor() for any StructType/ArrayType with hudi_type metadata, which threw IllegalArgumentException for types like VARIANT that are not custom logical types.
- Add isCustomLogicalTypeDescriptor() safe check to short-circuit the guards before parseTypeDescriptor() is called.
- Add regression test that reproduces the struct+metadata VARIANT path.

* fix: treat hudi_type=VARIANT as a first-class custom logical type

- Address review feedback on apache#18510, restructure the crash fix so a StructType tagged with hudi_type=VARIANT is handled consistently with BLOB/VECTOR.
- The hudi_type metadata is the deliberate escape hatch for engines without a native representation (notably Spark 3.5), so its presence is itself the custom-logical-type signal.
- Add VARIANT to CUSTOM_LOGICAL_TYPES and give it a case in parseTypeDescriptor, mirroring BLOB.
- In HoodieSparkSchemaConverters, add a dedicated VARIANT pattern case that validates the expected unshredded structure ({metadata, value} binary fields) and produces HoodieSchema.Variant.
- On Spark 4.0+ the column round-trips as native VariantType via the existing reverse conversion path.
- Remove the isCustomLogicalTypeDescriptor short-circuit helper; with VARIANT now properly registered, the BLOB/VECTOR guards no longer need the pre-check.
- Add unit tests for parseTypeDescriptor VARIANT (success, case insensitivity, parameter rejection) and integration tests asserting VARIANT promotion and malformed-struct rejection.

* Address missing code coverage complaints
…#18511)

- Hive 2.x/3.x does not support VARIANT type natively.
- When creating a Hudi table with VARIANT columns via SQL CREATE TABLE, Spark's HiveClient passes "variant" as a literal type string which Hive rejects.
- Convert VariantType to struct<value:binary, metadata:binary> in the CatalogTable schema before passing to HiveClient, while preserving the original VariantType in table properties so Spark can reconstruct it when reading.
- Includes unit test for the conversion.
- Recursively convert VariantType inside nested StructType/ArrayType/MapType so columns like STRUCT<a:VARIANT>, ARRAY<VARIANT>, and MAP<STRING,VARIANT> are also rewritten to the Hive-compatible physical struct.
- Emit the variant struct with canonical (metadata, value) field order to match HoodieSchema.createVariant() and the Parquet/Iceberg convention.
- Extract buildHiveCompatibleCatalogTable helper so the schema conversion and property merge are directly unit-testable.
- Expand TestVariantDataType with nested-variant cases, canonical-order assertions, and coverage for buildHiveCompatibleCatalogTable.
- Clean up Scaladoc (use backticks) and rename the test field from v to variant_col.
…8448)

Closes apache#18050

In multi-writer mode, when multiple writers detect a failed inflight commit, each writer independently schedules and executes its own rollback. This leads to duplicate rollback instants on the timeline for the same failed commit — causing unnecessary work, potential conflicts, and timeline clutter.

Summary and Changelog
Adds a mechanism to avoid duplicate rollback plans in multi-writer mode by:

Timeline reload under lock: Before scheduling a new rollback, the writer reloads the active timeline inside the lock to check if another writer already scheduled a rollback for the same failed commit. If found, it reuses the existing plan instead of creating a new one.
Heartbeat-based ownership: Before executing a rollback, the writer acquires a heartbeat for the rollback instant under a transaction lock. If another writer already holds an active heartbeat (i.e., is currently executing the rollback), the current writer skips execution. If the heartbeat is expired, the writer takes ownership and proceeds.
Completed rollback detection: Inside the heartbeat acquisition lock, the writer also checks if the rollback was already completed on the timeline by another writer, and skips if so.
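In rough pseudo-Java, the schedule phase behaves like the sketch below; the timeline, lock, and scheduling calls are simplified stand-ins for the logic in resolveOrScheduleRollback():

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Pseudo-Java sketch of the duplicate-rollback avoidance described above;
// all collaborators are hypothetical stand-ins.
final class RollbackDedupSketch {
  private final ReentrantLock txnLock = new ReentrantLock();
  private final Map<String, String> pendingRollbacks = new ConcurrentHashMap<>();

  String resolveOrScheduleRollback(String failedCommit) {
    txnLock.lock();
    try {
      reloadActiveTimeline(); // re-check the timeline under the lock
      String existing = pendingRollbacks.get(failedCommit);
      if (existing != null) {
        return existing; // reuse the plan another writer already scheduled
      }
      String rollbackInstant = scheduleNewRollback(failedCommit);
      pendingRollbacks.put(failedCommit, rollbackInstant);
      return rollbackInstant; // first writer schedules the new plan
    } finally {
      txnLock.unlock();
    }
  }

  private void reloadActiveTimeline() { /* stand-in for metaClient reload */ }

  private String scheduleNewRollback(String failedCommit) {
    return failedCommit + ".rollback"; // stand-in for a real rollback instant
  }
}
```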

Changes:

BaseHoodieTableServiceClient: Refactored rollback() into schedule and execute phases. Extracted resolveOrScheduleRollback() (reuses existing pending rollbacks or schedules new ones under lock) and acquireRollbackHeartbeatIfMultiWriter() (heartbeat-based ownership with completed-rollback detection). Wrapped heartbeatClient.stop() in try-catch in the finally block to avoid masking rollback exceptions.
HoodieWriteConfig: New advanced config hoodie.rollback.avoid.duplicate.plan (default false), gated on multi-writer mode via shouldAvoidDuplicateRollbackPlan().
TestClientRollback: Added 6 tests covering: expired heartbeat takeover, active heartbeat skip, commit-not-in-timeline, first-writer-schedules-new-plan, already-completed-by-another-writer, and concurrent two-writer rollback of the same commit.

Impact
New advanced config: hoodie.rollback.avoid.duplicate.plan — opt-in, default false, only effective in multi-writer mode.
No breaking changes to public APIs or storage format.
No behavioral change for single-writer mode or when the config is disabled.

---------

Co-authored-by: Lokesh Jain <ljain@Lokeshs-MacBook-Pro.local>
Co-authored-by: sivabalan <n.siva.b@gmail.com>
…he#18552)

* fix: Parquet small-precision decimals decode ClassCastException

Parquet encodes DECIMAL physically as INT32 (precision <= 9), INT64
(precision <= 18), or BINARY / FIXED_LEN_BYTE_ARRAY. The Hudi-Flink 1.18
ParquetDecimalVector predated this and unconditionally cast the child
vector to BytesColumnVector inside getDecimal(), throwing
ClassCastException whenever the reader materialized a small-precision
decimal as a HeapIntVector / HeapLongVector.

Sync ParquetDecimalVector with Apache Flink 2.1's implementation:
 - getDecimal dispatches on the physical child vector type
   (ParquetSchemaConverter#is32BitDecimal / is64BitDecimal) and decodes
   int / long / bytes accordingly.
 - Wrapper additionally implements WritableLongVector,
   WritableIntVector, and WritableBytesVector by delegation so that
   column readers can write through it without unwrapping.
 - Underlying vector field is now private; accessed via getVector().
   Migrated ArrayColumnReader's nine direct-field accesses.
 - Adds TestParquetDecimalVector covering every dispatch path, the
   backward-compatible bytes-at-small-precision case, unsupported
   vector types, null handling, and the Writable* delegation contracts.
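The dispatch, condensed into a sketch using Flink 1.18 columnar vector types; the real class routes through ParquetSchemaConverter#is32BitDecimal / is64BitDecimal and its getVector() accessor rather than plain instanceof checks:

```java
import java.math.BigDecimal;
import java.math.BigInteger;

import org.apache.flink.table.data.DecimalData;
import org.apache.flink.table.data.columnar.vector.BytesColumnVector;
import org.apache.flink.table.data.columnar.vector.ColumnVector;
import org.apache.flink.table.data.columnar.vector.IntColumnVector;
import org.apache.flink.table.data.columnar.vector.LongColumnVector;

// Condensed form of the getDecimal dispatch: decode the unscaled value from
// whichever physical vector type the reader actually materialized.
final class DecimalDispatchSketch {
  private final ColumnVector vector;

  DecimalDispatchSketch(ColumnVector vector) {
    this.vector = vector;
  }

  DecimalData getDecimal(int i, int precision, int scale) {
    if (vector instanceof IntColumnVector) {          // INT32, precision <= 9
      return DecimalData.fromUnscaledLong(((IntColumnVector) vector).getInt(i), precision, scale);
    } else if (vector instanceof LongColumnVector) {  // INT64, precision <= 18
      return DecimalData.fromUnscaledLong(((LongColumnVector) vector).getLong(i), precision, scale);
    } else {                                          // BINARY / FIXED_LEN_BYTE_ARRAY
      BytesColumnVector.Bytes bytes = ((BytesColumnVector) vector).getBytes(i);
      BigDecimal bd = new BigDecimal(new BigInteger(bytes.getBytes()), scale);
      return DecimalData.fromBigDecimal(bd, precision, scale);
    }
  }
}
```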
…le format (apache#18162)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…n Spark 4 (apache#18564)

Co-authored-by: Y Ethan Guo <ethan.guoyihua@gmail.com>
…che#18571)

RECURSION_OVERFLOW_SCHEMA passes new byte[0] (via getUTF8Bytes("")) as
the default value of a BYTES HoodieSchemaField, which wraps an Avro
Schema.Field. Avro 1.12.0's Schema.validateDefault rejects byte[] for
BYTES defaults — it now requires a String (interpreted as ISO-8859-1
bytes, Avro's canonical JSON form for BYTES defaults). Under 1.11.x the
validator was lenient.

Because the failure is in a static initializer, the JVM caches the
ExceptionInInitializerError and every subsequent class access throws
NoClassDefFoundError: Could not initialize class
org.apache.hudi.utilities.sources.helpers.ProtoConversionUtil$AvroSupport.
Any downstream service loading this class (e.g. DeltaStreamer using
ProtoClassBasedSchemaProvider) is permanently broken on JVMs that have
Avro 1.12+ on the classpath, even though Hudi itself still pins 1.11.4.

Fix: use "" (empty string) instead of getUTF8Bytes(""). The default
value is never read at runtime — it is always overwritten by
ByteBuffer.wrap(messageValue.toByteArray()) at line ~365 before the
field is populated — and "" serializes bit-for-bit identically to
new byte[0] on the wire (both produce "default":"" in the JSON schema),
so behavior is strictly preserved under Avro 1.11 while satisfying
1.12's stricter validator.

Rejected alternatives (experimental testing against both Avro versions):
| default arg | 1.11.x | 1.12.0 |
| --- | --- | --- |
| new byte[0] (current) | OK | FAIL |
| "" (fix) | OK | OK |
| ByteBuffer.wrap(...) | OK | FAIL |
| null | OK | OK, but strips the default entirely |

Also drops the now-unused getUTF8Bytes static import.
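The essence of the fix as a standalone snippet (record and field names illustrative):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

// Demonstrates the validator difference: "" is Avro's canonical JSON default
// for BYTES (interpreted as ISO-8859-1 bytes), so it passes 1.12's stricter
// validateDefault, while a byte[] default is only tolerated by 1.11's
// lenient check.
public class BytesDefaultSketch {
  public static void main(String[] args) {
    Schema schema = SchemaBuilder.record("Rec").fields()
        .name("payload").type().bytesType().bytesDefault("") // the fix
        .endRecord();
    System.out.println(schema.toString(true)); // emits "default" : ""
  }
}
```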

Fixes: apache#18569

Signed-off-by: tiennguyen-onehouse <tien@onehouse.ai>
apache#18570)

HoodieFileGroupReaderBasedFileFormat.buildReaderWithPartitionValues builds
two schemas side-by-side: requestedSchema (what to return to Spark) and
dataSchema (what to read from parquet). It augments requestedSchema with
any partition fields in mandatoryFields before pruning, but pipes
dataStructType through unchanged. Spark's dataStructType excludes
partition columns by convention, and HoodieSchemaUtils.pruneDataSchema
iterates over its second arg, so any mandatory partition field is
silently dropped from the resulting dataSchema. The FileGroupReader then
does not read the column from the parquet base file, and for
non-projection-compatible CUSTOM mergers (e.g. PostgresDebeziumAvroPayload)
the output converter writes null for every affected row.

Most visible on MOR file slices that have both a base file and a log
file, since the readBaseFile path (which would append partition values
from the directory name) is skipped in favor of the FileGroupReader path.

Regression introduced by apache#13711 ("Improve Logical Type Handling on Col
Stats"), which added the pruneDataSchema wrapping but only on the
requested-schema side.

Fix: mirror requestedStructType's construction — augment dataStructType
with the mandatory partition fields before pruning.

Also adds a regression test (TestFileGroupReaderPartitionColumn) that
reproduces the scenario end-to-end: MOR + CustomKeyGenerator +
PostgresDebeziumAvroPayload + GLOBAL_SIMPLE with
update.partition.path=true, round-2 partition-key change producing a
base+log slice, then verifies untouched records in that slice read back
with the correct partition-column value.

Fixes: apache#18568

Signed-off-by: tiennguyen-onehouse <tien@onehouse.ai>
…he#18379)

This PR adds support for custom Parquet configuration injection across all file writer factories in Apache Hudi. This feature allows users to inject custom Parquet configurations (e.g., native Parquet bloom filters, custom compression settings, dictionary encoding overrides) at runtime without modifying Hudi's core code.

Motivation: Users sometimes need to apply specific Parquet configurations for certain tables or partitions (e.g., disable dictionary encoding for high-cardinality columns, enable native Parquet bloom filters for specific columns, or apply custom encoding strategies). Previously, these configurations were hard-coded or required code changes. This PR introduces a pluggable mechanism via the HoodieParquetConfigInjector interface.

Summary and Changelog
Summary: Added support for custom Parquet configuration injection across Spark, Avro, and Flink file writers. Users can now implement
the HoodieParquetConfigInjector interface and specify it via the hoodie.parquet.config.injector.class configuration to inject custom
Parquet settings at write time.

Changes:

Core Implementation (hudi-client-common):
- Added HoodieParquetConfigInjector interface with withProps() method that accepts StoragePath, StorageConfiguration, and HoodieConfig
and returns modified configurations
Spark Integration (hudi-spark-client):
- Modified HoodieSparkFileWriterFactory.newParquetFileWriter() to check for and invoke config injector (lines 66-79)
- Added comprehensive tests in TestHoodieParquetConfigInjector:
  - testDisableDictionaryEncodingViaInjector() - validates dictionary encoding can be disabled
  - testInvalidInjectorClassThrowsException() - validates error handling
  - testNoInjectorUsesDefaultConfig() - validates backward compatibility
- Tests validate actual Parquet metadata (encodings) rather than just configuration
Avro Integration (hudi-hadoop-common):
- Modified HoodieAvroFileWriterFactory.newParquetFileWriter() to support config injection (lines 71-85)
- Updated getHoodieAvroWriteSupport() signature to accept StorageConfiguration
- Added TestHoodieAvroParquetConfigInjector with similar test coverage
Flink Integration (hudi-flink-client):
- Modified HoodieRowDataFileWriterFactory.newParquetFileWriter() to support config injection (lines 126-140)
- Added TestHoodieRowDataParquetConfigInjector with similar test coverage
Configuration:
- Added HOODIE_PARQUET_CONFIG_INJECTOR_CLASS config key in HoodieStorageConfig
- Added withParquetConfigInjectorClass() builder method

Impact
Public API:

New interface: HoodieParquetConfigInjector (marked with appropriate annotations)
New configuration: hoodie.parquet.config.injector.class (optional, defaults to empty string)

User-facing changes:
Users can now customize Parquet settings per file/partition without modifying Hudi code
Fully backward compatible - existing code continues to work without changes
No performance impact when feature is not used (single isNullOrEmpty() check)
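A hedged sketch of what a user-supplied injector could look like; the withProps signature and return type below are approximated from the description above and may differ from the merged interface:

```java
import org.apache.hudi.common.config.HoodieConfig;
import org.apache.hudi.io.HoodieParquetConfigInjector;
import org.apache.hudi.storage.StorageConfiguration;
import org.apache.hudi.storage.StoragePath;

// Sketch only: disable dictionary encoding for one high-cardinality table.
// The withProps contract is approximated from the PR description.
public class DisableDictionaryInjector implements HoodieParquetConfigInjector {
  @Override
  public StorageConfiguration<?> withProps(StoragePath path,
                                           StorageConfiguration<?> storageConf,
                                           HoodieConfig hoodieConfig) {
    if (path.toString().contains("/events/")) { // illustrative predicate
      storageConf.set("parquet.enable.dictionary", "false");
    }
    return storageConf;
  }
}
```

It would then be wired in by setting hoodie.parquet.config.injector.class to the implementation's fully qualified class name.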
This PR adds support for creating empty clean commits to optimize clean planning performance for append-only datasets.

Problem: In datasets with incremental cleaning enabled that receive infrequent updates or are primarily append-only, the clean planner performs a full table scan on every ingestion run because there are
no clean plans to mark progress. This leads to significant performance overhead, especially for large tables.

Solution: Introduce a new configuration hoodie.write.empty.clean.interval.hours that allows creating empty clean commits after a configurable duration. These empty clean commits update the
earliestCommitToRetain value, enabling subsequent clean planning operations to only scan partitions modified after the last empty clean, avoiding expensive full table scans.

Summary and Changelog
User-facing changes:
New advanced config hoodie.write.empty.clean.create.duration.ms (default: -1, disabled) to control when empty clean commits should be created
When enabled with incremental cleaning, Hudi will create empty clean commits after the specified duration (in milliseconds) to optimize clean planning performance

Detailed changes:

Config Addition (HoodieCleanConfig.java):
- Added hoodie.write.empty.clean.interval.hours config property with builder method
- Marked as advanced config for power users
Clean Execution (CleanActionExecutor.java):
- Modified clean parallelism calculation to ensure minimum of 1 (was causing issues with empty plans)
- Added createEmptyCleanMetadata() method to construct metadata for empty cleans
- Updated runClean() to handle empty clean stats by creating appropriate metadata
Clean Planning (CleanPlanActionExecutor.java):
- Added getEmptyCleanerPlan() method to construct cleaner plans with no files to delete
- Modified requestClean() to return empty plans when partitions list is empty
- Added logic in requestCleanInternal() to check if empty clean commit should be created based on:
  - Incremental cleaning enabled
  - Time since last clean > configured threshold
  - Valid earliestInstantToRetain present

Impact
Performance Impact: Positive - significantly reduces clean planning time for append-only or infrequently updated datasets by avoiding full table scans

API Changes: None - purely additive configuration

Behavior Changes:

When enabled, users will see empty clean commits in the timeline at the configured intervals
These commits have totalFilesDeleted=0 and empty partition metadata but contain valid earliestCommitToRetain metadata
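Enabling the behavior might look like the following; note the PR text spells the new key two ways (interval.hours vs. create.duration.ms), so verify the final name against the merged code:

```java
import org.apache.hudi.common.config.TypedProperties;

// Sketch: opt in to synthetic empty cleans for an append-only table.
// Key names follow the user-facing summary above.
public class EmptyCleanConfigSketch {
  public static void main(String[] args) {
    TypedProperties props = new TypedProperties();
    // Prerequisite per the description: incremental cleaning enabled.
    props.setProperty("hoodie.cleaner.incremental.mode", "true");
    // Create an empty clean roughly every 6 hours without real cleans.
    props.setProperty("hoodie.write.empty.clean.create.duration.ms",
        String.valueOf(6 * 60 * 60 * 1000L));
    System.out.println(props);
  }
}
```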
voonhous and others added 4 commits April 24, 2026 15:56
…ad (apache#18582)

Fixes: apache#18579

Hudi writes VECTOR as bare FIXED_LEN_BYTE_ARRAY with no Parquet logical-type
annotation, so Hive's Parquet reader reconstructs the Avro schema as plain
FIXED named after the column. Projecting that to the canonical VECTOR schema
failed with "cannot support rewrite value for schema type" because
HoodieSchemaType.FIXED and VECTOR are distinct and rewritePrimaryTypeWithDiffSchemaType
had no VECTOR case. FIXED_BYTES-backed VECTOR is byte-identical to FIXED at
matching size, so pass the writable through.
Increase -Xmx from 3g to 4g in the default, java11 and java17
Maven profiles to prevent OOM failures during CI test runs.
Cover the invariant that the HoodieSchema.TYPE_METADATA_FIELD descriptor
and payload shape of a custom-typed column survive inline compaction of
a log-only MOR table into a base file.

- TestVectorDataSource: add testMorLogOnlyCompactionPreservesVectorMetadata
  (5 commits via SQL + MERGE INTO to trigger default inline compaction).
- TestVariantDataType: equivalent VARIANT test, gated on Spark 4.0+,
  asserting native VariantType round-trips through compaction.
- TestBlobDataType (new): BLOB INLINE and BLOB OUT_OF_LINE cases. Inline
  uses named_struct with hex byte literals; out-of-line creates real files
  via BlobTestHelpers.createTestFile and verifies bytes via read_blob().

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

Walkthrough

Adds new Vector/Blob/Variant data types and Lance file format support, pluggable Parquet config injection, Flink continuous-sort buffer, enhanced rollback/archival/clean logic, extensive Spark/Flink/Hadoop test coverage, Docker Compose stacks for integration testing, and numerous utilities, refactors, and dependency updates across the codebase.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Docker Compose Environments**<br>`docker/compose/docker-compose_hadoop340_hive2310_spark402_*.yml` | Adds full AMD64/ARM64 compose stacks provisioning HDFS, YARN, Hive Metastore/Postgres, HiveServer2, Spark master/workers/adhoc, ZooKeeper, Kafka, MinIO, mc init container, volumes, healthchecks, and a hudi network. |
| **Hadoop Entrypoints**<br>`docker/hoodie/hadoop/base*/entrypoint.sh` | Removes duplicate yarn.nodemanager.bind-host insertion in multihomed YARN setup across Java11/17/base variants. |
| **Parquet Config Injection & Storage Config**<br>hudi-common/src/main/java/org/apache/hudi/io/HoodieParquetConfigInjector.java, hudi-common/src/main/java/org/apache/hudi/common/config/HoodieStorageConfig.java, writer factories (hudi-hadoop-common, hudi-client/hudi-spark-client, hudi-client/hudi-flink-client) | Adds pluggable Parquet writer config injector interface, apply utility, storage config option and builder helpers, integrates injection into Avro/Parquet writer factories; streaming writes disallowed for injector. Includes test injectors and unit/integration tests. |
| **Vector / Variant / Blob Types & Schema Handling**<br>hudi-common/src/main/java/.../HoodieSchema.java, internal/schema/..., VectorConversionUtils.java, HoodieSparkSchemaConverters.scala, multiple parser/AST files across Spark 3.3–4.0/4.1 variants | Implements VECTOR and VARIANT custom logical types, canonical metadata keys/serialization, parse/validation, internal-schema round-trip support (including BLOB), and utilities to build/parse vector footer values. Parser/AST/DDL changes propagate VECTOR/VARIANT support across Spark versions. |
| **Lance File Format Integration**<br>hudi-client/.../HoodieSparkLanceReader.java, HoodieSparkLanceWriter.java, LanceRecordIterator.java, hudi-hadoop-mr/.../HoodieLanceInputFormat*.java, SparkLanceReaderBase.scala | Adds Lance base-file support, vector metadata preservation and restoration, name-aligned Arrow→Spark vector handling, Lance writer/reader changes, and Hive InputFormat placeholders (unsupported for reading via Hive). |
| **Batched Blob Reader & SQL Integration**<br>hudi-spark-datasource/.../blob/* (ReadBlobExpression, ReadBlobRule, BatchedBlobReader/Exec/Strategy, ScalarFunctions) | Implements batched blob I/O: marker expression, logical rewrite rule, planner strategy, physical exec node, broadcasted storage conf, scalar read_blob() function registration, and end-to-end batching/merging logic. |
| **Spark Extension Hooks**<br>hudi-spark/src/main/scala/.../SparkAdapter.scala, per-Spark adapters (Spark3_3Adapter, Spark3_4Adapter, Spark4_0Adapter, Spark3_5Adapter, Spark4DefaultSource) | Adds adapter extension points to inject scalar functions and planner strategies; registers read_blob functions and BatchedBlobReaderStrategy across Spark adapters and Spark4 data-source registration for Variant support. |
| **Rollback Scheduling & Execution**<br>hudi-client/hudi-client-common/.../BaseHoodieTableServiceClient.java, HoodieWriteConfig additions, rollback tests | Refactors rollback into schedule vs execute phases, adds avoid-duplicate-plan config for multi-writer mode, heartbeat acquisition path for multi-writer, and extensive concurrency/pending-plan tests. |
| **Timeline Archival & Empty-Clean Handling**<br>TimelineArchiverV1.java, HoodieArchivalConfig.java, HoodieCleanConfig.java, CleanActionExecutor.java, CleanPlanActionExecutor.java, multiple clean tests | Adds archival blocking on latest-clean ECTR (config flag), explicit empty-clean metadata/plan construction, new clean interval config for synthetic empty cleans, ECTR-aware archival filtering, and comprehensive unit/functional tests. |
| **Flink Continuous Sort Buffer**<br>hudi-flink-datasource/hudi-flink/.../AppendWriteFunctionWithContinuousSort.java, AppendWriteFunctions.java, BufferType.java, Flink tests | Adds CONTINUOUS_SORT buffer implementation (TreeMap ordering, normalized-key copying), drain/batch controls and memory tracking, option plumbing, and integration tests including object-reuse scenarios. |
| **RocksDB Global Index: Dictionary Encoding**<br>CodedRecordGlobalLocationSerializer.java, RecordGlobalLocationSerializer.java, RocksDBIndexBackend.java, index tests | Introduces dictionary-encoded partition-path serialization for partitioned tables, protected helper methods for (de)serialization, backend flag for partitioned tables switching serializer, and tests for compatibility/size/unknown-id errors. |
| **Utility Consolidation & Small API Adds**<br>CollectionUtils.java (map helpers), removed MapUtils.java, FileSlice.getLogFileCnt(), HoodieNativeAvroHFileReader helper, AbstractTableFileSystemView tweaks, Hive/Glue sync small updates | Moves map helpers into CollectionUtils, deletes MapUtils, adds log-file count accessor, centralizes HFile close helper, and updates clients to use consolidated utilities. |
| **Hudi Hive Sync Job & Utilities**<br>hudi-utilities/.../HudiHiveSyncJob.java, tests | Adds standalone HudiHiveSyncJob CLI (JCommander) that builds Spark context and invokes HiveSyncTool; includes integration test scaffolding. |
| **Numerous Tests & Test Infra**<br>many test files across hudi-common, hudi-hadoop-common, hudi-client, hudi-flink, hudi-spark-datasource (vector/blob/variant/lance/archival/clean/rollback), test utils updates | Massive additions and updates to unit, integration, and functional tests covering new features (Vector/Blob/Variant/Lance), injection behaviors, Flink continuous sort, RocksDB serializer, archival/clean/rollback scenarios, and object-reuse test harness plumbing. |
| **Dependency/Build Updates**<br>pom.xml, various pom changes (Lance groupId/version, Spark 4.0.2, argLine JVM heap increases) | Bumps Lance coordinates (groupId and versions), upgrades to Spark 4.0.2, increases Maven surefire argLine Xmx defaults, and adjusts the enforcer allowlist for Scala-2.12 transitive Flink artifacts. |
| **Miscellaneous Fixes & Refactors**<br>assorted files (Hive tests, HoodieAvroWriteSupport, HoodieBaseLanceWriter, Parquet/Avro writer factories, reader vector handling, runtime utils, Flink RuntimeContext helpers) | Small behavior changes: safer close semantics, additional extension points, writer close hooks for Lance metadata, vector handling fixes, new runtime helpers for Flink object-reuse, and related small refactorings. |

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client (SQL/API)
    participant Planner as Planner (ReadBlobRule / Extension)
    participant Driver as Driver (BatchedBlobReaderStrategy)
    participant Executor as Executor (BatchedBlobReadExec Task)
    participant Storage as Storage (HoodieStorage / FileReader)

    Client->>Planner: parse query with read_blob()
    Planner->>Planner: rewrite Project/Filter -> insert BatchedBlobRead nodes
    Planner->>Driver: plan with BatchedBlobReaderStrategy
    Driver->>Executor: produce BatchedBlobReadExec (broadcast storageConf)
    Executor->>Storage: request merged ranges (group by file, maxGap / lookahead)
    Storage-->>Executor: return merged byte buffers
    Executor->>Executor: slice buffers into per-row payloads
    Executor-->>Client: return rows with blob bytes

(The diagram shows the high-level batched blob read flow.)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐰 I hopped through Parquet, Lance, and blob,
Built vectors, variants, and a Flink-sorted mob,
Docker buckets, MinIO cheer,
Rollbacks tidy — no duplicate fear,
Hudi hops forward — kit complete, full-throb!



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
hudi-spark-datasource/hudi-spark4.0.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark4_0ExtendedSqlParser.scala (1)

131-143: ⚠️ Potential issue | 🟠 Major

isHoodieCommand can miss VECTOR/BLOB in valid SQL forms.

Line 141 and Line 142 only match when preceded by a literal space. This misses cases like ARRAY<VECTOR(...)> or MAP<STRING,VECTOR(...)>, so Hoodie parsing may not be triggered.

💡 Suggested fix
   private def isHoodieCommand(sqlText: String): Boolean = {
     val normalized = sqlText.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ")
+    val hasCustomTypeKeyword = "\\b(blob|vector)\\b".r.findFirstIn(normalized).isDefined
     normalized.contains("system_time as of") ||
       normalized.contains("timestamp as of") ||
       normalized.contains("system_version as of") ||
       normalized.contains("version as of") ||
       normalized.contains("create index") ||
       normalized.contains("drop index") ||
       normalized.contains("show indexes") ||
       normalized.contains("refresh index") ||
-      normalized.contains(" blob") ||
-      normalized.contains(" vector")
+      hasCustomTypeKeyword
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark4.0.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark4_0ExtendedSqlParser.scala`
around lines 131 - 143, The isHoodieCommand method currently checks for " blob"
and " vector" which require a leading space and will miss forms like
ARRAY<VECTOR(...)> or MAP<STRING,VECTOR(...)>; update isHoodieCommand to detect
blob/vector more robustly (e.g., replace normalized.contains(" blob") and
normalized.contains(" vector") with a regex word-boundary check like
normalized.matches(".*\\b(blob|vector)\\b.*") or checks for "vector(" as well)
so VECTOR and BLOB are matched regardless of surrounding punctuation; update
only the isHoodieCommand function to use the new contains/match logic.
hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_4ExtendedSqlParser.scala (1)

122-133: ⚠️ Potential issue | 🟠 Major

Include VARIANT in the Hoodie-command detection.

This method now routes VECTOR DDL correctly, but VARIANT is still missing. CREATE TABLE ... col VARIANT will keep falling back to Spark's parser, so the new custom type support is still incomplete on Spark 3.4.

Suggested fix
     normalized.contains("show indexes") ||
     normalized.contains("refresh index") ||
     normalized.contains(" blob") ||
-    normalized.contains(" vector")
+    normalized.contains(" vector") ||
+    normalized.contains(" variant")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_4ExtendedSqlParser.scala`
around lines 122 - 133, The isHoodieCommand method currently checks for "
vector" and " blob" but omits VARIANT, so DDL with a column type VARIANT won't
be routed to Hoodie parser; update the normalized.contains checks in
isHoodieCommand to also include a check for " variant" (matching the same
lowercase/whitespace-normalized approach as the other type checks) so CREATE
TABLE ... col VARIANT is detected and handled by the Hoodie parser.
hudi-spark-datasource/hudi-spark3.3.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_3ExtendedSqlParser.scala (1)

122-133: ⚠️ Potential issue | 🟠 Major

Route VARIANT DDL through the Hudi parser too.

isHoodieCommand now catches VECTOR, but it still misses VARIANT. That means CREATE TABLE ... col VARIANT will continue to go through Spark's parser instead of the Hudi path, so the new custom type support won't be reachable from SQL on Spark 3.3.

Suggested fix
     normalized.contains("show indexes") ||
     normalized.contains("refresh index") ||
     normalized.contains(" blob") ||
-    normalized.contains(" vector")
+    normalized.contains(" vector") ||
+    normalized.contains(" variant")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark3.3.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_3ExtendedSqlParser.scala`
around lines 122 - 133, The isHoodieCommand method in
HoodieSpark3_3ExtendedSqlParser.scala currently checks for type tokens like "
blob" and " vector" but misses " variant", so add a normalized.contains("
variant") check to the boolean chain inside the isHoodieCommand function (where
normalized is computed) so CREATE TABLE ... col VARIANT will be routed through
the Hudi parser; keep the same normalization logic (toLowerCase + trim +
replaceAll("\\s+", " ")) and add the contains(" variant") entry alongside the
other contains checks.
🟠 Major comments (18)
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/index/CodedRecordGlobalLocationSerializer.java-33-35 (1)

33-35: ⚠️ Potential issue | 🟠 Major

Dictionary state is not persisted across serializer instances, causing restart failures.

Lines 33–35 maintain instance-local partition path dictionary mappings that are lost when the serializer instance is recreated. Since RocksDBIndexBackend (line 48) creates a fresh CodedRecordGlobalLocationSerializer() on each instantiation, reusing a persisted RocksDB path after task restart will fail: old entries store only dictionary IDs (integers), which cannot be decoded without the original dictionary. The getPartitionPath() method (lines 58–64) throws IllegalStateException when encountering unknown IDs from prior serializer instances.

The javadoc at lines 30–31 explicitly documents this limitation: "Serialized bytes can only be deserialized correctly by the same serializer instance that encoded them."

To support restart safety, either persist dictionary metadata (via a dedicated RocksDB column family with bootstrap loading), switch to non-coded serialization for persisted index backends, or document that RocksDB paths cannot be reused across instances.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/index/CodedRecordGlobalLocationSerializer.java`
around lines 33 - 35, CodedRecordGlobalLocationSerializer currently keeps
partitionPathToDictId and dictIdToPartitionPath as instance-local fields which
are lost across restarts; modify CodedRecordGlobalLocationSerializer to persist
the dictionary to RocksDB so deserialization after restart can reconstruct
mappings: add a RocksDB column family (e.g., "partition_dict") accessed from the
serializer constructor to bootstrap existing mappings into partitionPathToDictId
and dictIdToPartitionPath, update that column family whenever you assign a new
dict id (the code paths that allocate ids—where partitionPathToDictId is
mutated), and read from it in getPartitionPath/deserialize to resolve historic
ids; ensure writes are atomic with the index updates in RocksDBIndexBackend (or,
if you prefer, replace CodedRecordGlobalLocationSerializer with a non-coded
serializer for persisted backends) so that old integer-only serialized entries
remain decodable after task restart.
hudi-flink-datasource/hudi-flink1.18.x/src/main/java/org/apache/hudi/table/format/cow/vector/ParquetDecimalVector.java-75-225 (1)

75-225: ⚠️ Potential issue | 🟠 Major

Fix inconsistent delegation in wrapped-vector type checks.

ParquetDecimalVector implements typed interfaces (WritableLongVector, WritableIntVector, WritableBytesVector) while accepting any ColumnVector in the constructor. This creates two problems:

  1. Asymmetric failure modes: getDecimal() works with readable types (IntColumnVector, LongColumnVector, BytesColumnVector), but getInt(), getLong(), and getBytes() require the writable variants and throw RuntimeException if the child lacks them. Other mutators (setInt(), setLong(), fill()) silently no-op instead.

  2. Type contract violation: A read-only IntColumnVector wrapper passes construction but fails later when calling typed getters, or silently drops writes via mutators.

Constrain the constructor to require a writable child type and validate up front, or delegate getters uniformly through readable interfaces and make all unsupported mutators throw consistently rather than silently failing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-flink-datasource/hudi-flink1.18.x/src/main/java/org/apache/hudi/table/format/cow/vector/ParquetDecimalVector.java`
around lines 75 - 225, ParquetDecimalVector currently accepts any ColumnVector
but mixes readable and writable interfaces inconsistently; update
ParquetDecimalVector to validate and enforce a writable child at construction
(throw IllegalArgumentException if the supplied vector is not a
WritableColumnVector and the concrete writable types required), and then change
all getters/mutators to consistently cast to the appropriate Writable*
interfaces (e.g. getInt/getLong/getBytes -> use
WritableIntVector/WritableLongVector/WritableBytesVector; setInt/setLong/fill
and other mutators -> call the writable methods directly) rather than silently
no-op or throw different exceptions later; ensure the constructor and
class-level validation covers the specific writable subtypes used by methods
like getDecimal, getInt, getLong, getBytes, setInt, setLong, fill so behavior is
deterministic and failures occur at construction time.
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunctionWithContinuousSort.java-179-213 (1)

179-213: ⚠️ Potential issue | 🟠 Major

The memory guard does not account for the row being inserted.

processElement decides whether to drain based on the current sizeTracer.bufferSize, then inserts the next row and only afterwards updates the tracer. Because that tracer is also pinned to the first row's size, a larger later record can push the buffer well past maxBufferSize and keep it there until another record or a checkpoint arrives. That defeats the intended memory bound for variable-width payloads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunctionWithContinuousSort.java`
around lines 179 - 213, The memory check must account for the incoming row's
size before accepting it: in processElement compute the currentRecordSize (use
ObjectSizeCalculator.getObjectSize(data) after doing the object-reuse copy) and
use sizeTracer.bufferSize + currentRecordSize > sizeTracer.maxBufferSize (or >=)
as the condition to call drainRecords(drainSize); after draining re-check using
the same currentRecordSize to decide whether to throw; set estimatedRecordSize
if it was zero (e.g., estimatedRecordSize = currentRecordSize) and call
sizeTracer.trace(currentRecordSize) when inserting instead of tracing the stale
estimatedRecordSize; refer to processElement, sizeTracer.bufferSize,
sizeTracer.maxBufferSize, ObjectSizeCalculator.getObjectSize,
estimatedRecordSize, sortedRecords.put and sizeTracer.trace to locate edits.
hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java-79-101 (1)

79-101: ⚠️ Potential issue | 🟠 Major

Avoid logging raw config/properties; this can leak credentials.

cfg and props may include sensitive values (for example hive.pass), and these are currently logged verbatim.

Proposed fix
-    LOG.info("Cfg received: {}", cfg);
+    LOG.info("Cfg received: basePath={}, baseFileFormat={}, propsFilePathSet={}, hoodieConfCount={}",
+        cfg.basePath, cfg.baseFileFormat, cfg.propsFilePath != null, cfg.configs.size());
@@
-      LOG.info("HiveSyncConfig props used to sync data {}", props);
+      LOG.info("HiveSyncConfig prepared for sync. propertyKeys={}", props.keySet());

Also applies to: 136-142

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java`
around lines 79 - 101, Remove or sanitize sensitive fields before logging the
configuration and properties in HudiHiveSyncJob; specifically stop logging cfg
and props verbatim in the static main block (the LOG.info("Cfg received: {}",
cfg) call) and in the run() method where LOG.info("HiveSyncConfig props used to
sync data {}", props) is invoked. Implement and call a helper like
sanitizeConfig or maskSensitive(Map/Config) that redacts known sensitive keys
(e.g., hive.pass, password, secret, key) and only logs non-sensitive keys or a
redacted summary, and replace the direct uses of cfg and props in those LOG.info
calls with the sanitized output. Ensure the helper is referenced from
HudiHiveSyncJob.run() and the main startup logging so both locations (including
the later mentions around lines 136-142) no longer expose raw credentials.
hudi-common/src/main/java/org/apache/hudi/common/util/CollectionUtils.java-77-79 (1)

77-79: ⚠️ Potential issue | 🟠 Major

containsAll(Map, Map) should be null-safe to avoid runtime NPEs.

Current implementation throws when either map is null. This can break callers that compare metastore/serde parameter maps that may be absent.

Proposed fix
   public static boolean containsAll(Map<?, ?> m1, Map<?, ?> m2) {
-    return m1.entrySet().containsAll(m2.entrySet());
+    if (isNullOrEmpty(m2)) {
+      return true;
+    }
+    if (isNullOrEmpty(m1)) {
+      return false;
+    }
+    return m1.entrySet().containsAll(m2.entrySet());
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hudi-common/src/main/java/org/apache/hudi/common/util/CollectionUtils.java`
around lines 77 - 79, The containsAll(Map<?,?> m1, Map<?,?> m2) method in
CollectionUtils is not null-safe and can throw NPEs; update it to treat null
maps as empty (i.e., return true if both are null or if m2 is null/empty, return
false if m1 is null but m2 is non-empty) so callers comparing optional
metastore/serde parameter maps don't crash; implement these null checks at the
start of containsAll(Map, Map) (in class CollectionUtils) before calling
entrySet().containsAll().
hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java-81-85 (1)

81-85: ⚠️ Potential issue | 🟠 Major

Use a buildSparkContext variant that doesn't hardcode a default master.

When --spark-master is omitted, the code forces local[2] execution via the defaultMaster parameter. In cluster environments, this prevents inheriting the cluster's configured master (e.g., from spark-defaults.conf).

Use UtilHelpers.buildSparkContext(String appName, boolean enableHiveSupport, Map<String, String> configs) instead, which respects the Spark configuration without forcing a fallback master. This allows proper delegation to cluster settings when --spark-master is not specified.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@hudi-utilities/src/main/java/org/apache/hudi/utilities/HudiHiveSyncJob.java`
around lines 81 - 85, The code in HudiHiveSyncJob currently forces "local[2]"
when cfg.sparkMaster is empty by calling
UtilHelpers.buildSparkContext("HudiHiveSyncJob", "local[2]", true); change this
to call the overload that accepts appName, enableHiveSupport and a configs map
so it does not hardcode a default master: use
UtilHelpers.buildSparkContext("HudiHiveSyncJob", true, configs) (construct
configs from existing Spark/Hudi config values or an empty map) when
cfg.sparkMaster is null/empty, and keep using the master-aware overload when
cfg.sparkMaster is provided; ensure hive support remains enabled.
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java-1352-1379 (1)

1352-1379: ⚠️ Potential issue | 🟠 Major

Re-read the target instant after taking the rollback lock.

commitInstantOpt is captured before beginStateChange(...) and is only refreshed when shouldAvoidDuplicateRollbackPlan() is enabled. With that flag off, this method can still schedule a rollback from a stale snapshot if another writer changes the timeline in between.

Suggested fix
     if (!skipLocking) {
       txnManager.beginStateChange(Option.empty(), Option.empty());
     }
     try {
+      table.getMetaClient().reloadActiveTimeline();
+      commitInstantOpt = Option.fromJavaOptional(table.getActiveTimeline().getCommitsTimeline().getInstantsAsStream()
+          .filter(instant -> EQUALS.test(instant.requestedTime(), commitInstantTime))
+          .findFirst());
+
       if (config.shouldAvoidDuplicateRollbackPlan()) {
         // Check if another writer already scheduled a rollback for this instant to avoid duplicates.
-        table.getMetaClient().reloadActiveTimeline();
         Option<HoodiePendingRollbackInfo> pendingRollbackOpt = getPendingRollbackInfo(table.getMetaClient(), commitInstantTime);
         if (pendingRollbackOpt.isPresent()) {
           // Case 2a: a concurrent writer already scheduled the rollback — re-use it.
           return Option.of(Pair.of(pendingRollbackOpt.get().getRollbackInstant(),
               Option.of(pendingRollbackOpt.get().getRollbackPlan())));
         }
-        commitInstantOpt = Option.fromJavaOptional(table.getActiveTimeline().getCommitsTimeline().getInstantsAsStream()
-            .filter(instant -> EQUALS.test(instant.requestedTime(), commitInstantTime))
-            .findFirst());
       }
       if (commitInstantOpt.isEmpty()) {
         log.error("Cannot find instant {} in the timeline of table {} for rollback", commitInstantTime, config.getBasePath());
         return Option.empty();
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java`
around lines 1352 - 1379, commitInstantOpt is captured before acquiring the
rollback lock (txnManager.beginStateChange) and when
shouldAvoidDuplicateRollbackPlan() is false we never refresh it, so we may
operate on a stale timeline; after calling beginStateChange(...) (i.e., once the
rollback lock is held) reload the timeline via
table.getMetaClient().reloadActiveTimeline() and recompute commitInstantOpt
using the same logic that uses
table.getActiveTimeline().getCommitsTimeline().getInstantsAsStream() (the code
that filters by EQUALS.test(instant.requestedTime(), commitInstantTime)), before
proceeding to log, scheduleRollback (table.scheduleRollback) or create the new
rollback instant; ensure this refresh happens regardless of
shouldAvoidDuplicateRollbackPlan() so the code always uses the latest timeline
snapshot under lock.
docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml-230-243 (1)

230-243: ⚠️ Potential issue | 🟠 Major

Update MinIO to use current credential variables.

MINIO_ACCESS_KEY and MINIO_SECRET_KEY have been deprecated since April 2021 (MinIO RELEASE.2021-04-22T15-44-28Z). Using minio/minio:latest with these deprecated variables creates a maintenance risk—future MinIO releases may remove support entirely. Switch to the current standard:

Required changes
     environment:
-      - MINIO_ACCESS_KEY=minio
-      - MINIO_SECRET_KEY=minio123
+      - MINIO_ROOT_USER=minio
+      - MINIO_ROOT_PASSWORD=minio123
       - MINIO_DOMAIN=minio

Additionally, consider pinning the image version instead of using latest to avoid unexpected behavioral changes in future MinIO releases.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml` around
lines 230 - 243, The minio service is using deprecated environment vars
MINIO_ACCESS_KEY and MINIO_SECRET_KEY and an unpinned image; update the minio
service to use the current variables MINIO_ROOT_USER and MINIO_ROOT_PASSWORD
instead of MINIO_ACCESS_KEY / MINIO_SECRET_KEY, replace or add those values
under the environment block for the "minio" service, and pin the image to a
stable MinIO release tag (avoid image: 'minio/minio:latest') to prevent
unexpected upgrades; keep the existing
hostname/container_name/ports/volumes/command intact.
docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml-19-19 (1)

19-19: ⚠️ Potential issue | 🟠 Major

Pin all service images to specific version tags instead of using :latest.

This compose file contains eleven services using floating :latest tags or untagged images (lines 19, 37, 60, 93, 116, 154, 175, 193, 213, 231, 246), making the test environment non-deterministic and prone to unexpected failures when base images are updated. For comparison, zookeeper (line 135), kafka (line 144), and postgresql (line 86) already use pinned versions; apply the same practice to the remaining services.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml` at line
19, Replace all Docker image references that use floating tags or omit tags
(e.g., the apachehudi/hudi-hadoop_3.4.0-namenode:latest entry and the other
image lines mentioned) with explicit, immutable tags or digests to make the
compose deterministic; update the image keys for each service to a specific
version (or `@sha256` digest) consistent with your CI/test matrix, mirroring how
zookeeper, kafka, and postgresql are already pinned, and ensure the service
names (e.g., the namenode image reference, datanode, worker, spark, hive
metastore, etc.) are updated accordingly so no service relies on :latest.
docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml-230-243 (1)

230-243: ⚠️ Potential issue | 🟠 Major

Migrate MinIO from deprecated credentials to current standard.

MinIO deprecated MINIO_ACCESS_KEY and MINIO_SECRET_KEY in April 2021. Current deployments should use MINIO_ROOT_USER and MINIO_ROOT_PASSWORD instead. Combined with minio/minio:latest, this creates a maintenance risk: future MinIO releases may drop backward compatibility for the deprecated variables.

This pattern appears across multiple docker-compose files in the repository (docker-compose_hadoop340_hive2310_spark402_amd64.yml, docker-compose_hadoop340_hive2310_spark402_arm64.yml, docker-compose_hadoop340_hive313_spark401_amd64.yml, docker-compose_hadoop340_hive313_spark401_arm64.yml, docker-compose_hadoop334_hive313_spark353_amd64.yml, and docker-compose_hadoop334_hive313_spark353_arm64.yml).

Proposed fix
   minio:
-    image: 'minio/minio:latest'
+    image: 'minio/minio:<pinned-release>'
     hostname: minio
     container_name: minio
@@
     environment:
-      - MINIO_ACCESS_KEY=minio
-      - MINIO_SECRET_KEY=minio123
+      - MINIO_ROOT_USER=minio
+      - MINIO_ROOT_PASSWORD=minio123
       - MINIO_DOMAIN=minio
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml` around
lines 230 - 243, The MinIO service block named "minio" uses deprecated
environment variables MINIO_ACCESS_KEY and MINIO_SECRET_KEY; update this to use
MINIO_ROOT_USER and MINIO_ROOT_PASSWORD (preserving existing credential values)
in the minio service definition and the other listed docker-compose files so the
container runs with the modern variables; ensure the command and ports remain
unchanged and verify the variable names are consistently replaced in all
occurrences across the repository (e.g., entries in
docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml and the
other listed compose files).
docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml-19-19 (1)

19-19: ⚠️ Potential issue | 🟠 Major

Address missing ARM64 support in legacy service images.

This ARM64-specific compose file itself is the intended solution for apachehudi images (which lack unified multi-arch manifests per HUDI-3601), but three supporting service images are blocking ARM64 execution:

  • bde2020/hive-metastore-postgresql:2.3.0 — AMD64-only (postgres:9.5.3 base from 2016, no ARM64 variant)
  • bitnamilegacy/kafka:3.4.1 — AMD64-only (no ARM64 variant published for this tag)
  • bitnamilegacy/zookeeper:3.6.4 — AMD64-only (ARM64 support starts at 3.9.3+)

Either substitute these with ARM64-supporting versions (e.g., bitnamilegacy/kafka:4.0.0-debian-12-r10, bitnamilegacy/zookeeper:3.9.3-debian-12-r22) or document that this stack only runs on AMD64 systems despite the filename.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml` at line
19, The compose file references legacy AMD64-only service images (symbols:
bde2020/hive-metastore-postgresql:2.3.0, bitnamilegacy/kafka:3.4.1,
bitnamilegacy/zookeeper:3.6.4) which block ARM64 execution; update the
docker-compose entries to ARM64-compatible images (e.g., replace kafka and
zookeeper with bitnamilegacy/kafka:4.0.0-debian-12-r10 and
bitnamilegacy/zookeeper:3.9.3-debian-12-r22, and swap hive-metastore-postgresql
to a maintained ARM64 PostgreSQL-backed metastore image) or add a clear note in
the compose file header stating the stack is AMD64-only if replacements aren’t
viable. Ensure the image keys in the compose (the same lines that currently set
those three image values) are changed to the ARM64 tags or the header comment is
added.
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala-693-700 (1)

693-700: ⚠️ Potential issue | 🟠 Major

Pin the inline compaction threshold explicitly.

Both tests hard-code “still log-only” through the 4th write and “compacts on the 5th,” but neither table definition sets hoodie.compact.inline.max.delta.commits. That makes these assertions depend on a global default instead of the test setup.

Suggested fix
           |  hoodie.index.type = 'INMEMORY',
           |  hoodie.compact.inline = 'true',
+          |  hoodie.compact.inline.max.delta.commits = '4',
           |  hoodie.clean.commits.retained = '1'

Apply the same change to the Lance variant.

Also applies to: 817-825

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala`
around lines 693 - 700, The table definitions in TestVectorDataSource that set
tblproperties (the block including primaryKey, type, preCombineField,
hoodie.index.type, hoodie.compact.inline, hoodie.clean.commits.retained) do not
explicitly set the inline compaction threshold; add
hoodie.compact.inline.max.delta.commits = '4' to that properties map so the
tests that expect "still log-only" for the 4th write and compaction on the 5th
are deterministic, and apply the identical change to the Lance variant
table-definition block as well.
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/blob/TestBatchedBlobReader.scala-254-266 (1)

254-266: ⚠️ Potential issue | 🟠 Major

Make the preserved-column assertions deterministic.

This test indexes directly into resultDF.collect(), but there is no ordering step here. If readBatched changes partitioning or row order, the assertions on results(0)/results(1) will become flaky even when the implementation is correct.

Suggested fix
-    val results = resultDF.collect()
+    val results = resultDF.orderBy("record_id").collect()
     assertEquals(2, results.length)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/blob/TestBatchedBlobReader.scala`
around lines 254 - 266, The assertions index into resultDF.collect() without a
defined ordering, which makes the test flaky; before collecting, apply a
deterministic sort (e.g., orderBy the preserved key/record_id or other unique
column) on resultDF returned by readBatched in TestBatchedBlobReader so the rows
are in a stable order, then collect and assert values for the expected rows (use
record_id or sequence in orderBy to locate rows deterministically).
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/ReadBlobRule.scala-79-82 (1)

79-82: ⚠️ Potential issue | 🟠 Major

Replace generic exceptions with AnalysisException for user-facing query validation errors.

These branches (79–82, 102–103, 165–167) reject user SQL and should report analysis errors consistent with Spark's optimizer/analyzer conventions, not generic runtime exceptions. AnalysisException is already imported in this file and is the standard mechanism for query-level validation failures during planning.

Also applies to: 93, 169 (consider whether internal consistency errors should also use AnalysisException).
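
A minimal sketch of the intended change (hedged: the match arm comes from the prompt below, the message text is illustrative, and it assumes the message-only AnalysisException constructor available in this Spark lineage):

  case node if containsReadBlobInAnyExpression(node) =>
    // Report a planning-time analysis error instead of a generic runtime
    // exception; keep the existing message text when converting real branches.
    throw new AnalysisException("read_blob() used in an unsupported position")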

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/ReadBlobRule.scala`
around lines 79 - 82, Replace user-facing IllegalArgumentException throws with
Spark's AnalysisException in ReadBlobRule: for the case matching
containsReadBlobInAnyExpression(node) (and the other branches flagged at lines
~93, 102–103, 165–167, 169), change throw new IllegalArgumentException(...) to
throw new AnalysisException(...) preserving the existing error messages; ensure
you only convert branches that are validating user SQL (keep internal
consistency checks as-is or review individually) and that the file's imported
AnalysisException is used.
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala-397-401 (1)

397-401: ⚠️ Potential issue | 🟠 Major

The Lance BLOB test is enabled in Spark 3.5+ profiles despite being documented as incomplete.

The test at lines 397–401 contains a comment stating "Lance writer has no BLOB handling today (RFC-100 Phase 2). Expected to fail until support lands in HoodieSparkLanceWriter." However, this test is only skipped when lance.skip.tests=true, which applies to Spark 3.3/3.4 profiles. For Spark 3.5 and 4.0 profiles (where lance.skip.tests=false), and for any default build run, the test will execute and fail, creating red builds instead of a properly tracked pending test. Either fix the BLOB handling in Lance or add an explicit skip/pending marker for this test across all profiles that enable it.
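One possible quarantine, as a hedged sketch (the test title is the existing one; ignore(...) is the suite-level skip mechanism suggested in the prompt below):

  // Pending until HoodieSparkLanceWriter gains BLOB handling (RFC-100 Phase 2).
  ignore("Test Query Log Only MOR Table With BLOB OUT_OF_LINE column triggers compaction (Lance)") {
    // existing test body unchanged
  }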

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`
around lines 397 - 401, The test "Test Query Log Only MOR Table With BLOB
OUT_OF_LINE column triggers compaction (Lance)" in TestBlobDataType.scala is
currently only skipped when the system property lance.skip.tests=true, which
leaves it enabled in 3.5/4.0 profiles; mark it as pending/ignored across all
profiles by either replacing test(...) with ignore(...) or adding an
unconditional assume(false, "Lance writer has no BLOB handling (RFC-100 Phase 2)
— test pending until implemented") at the start of that test; update the single
test definition so it no longer runs by default until HoodieSparkLanceWriter
gains BLOB support.
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala-318-329 (1)

318-329: ⚠️ Potential issue | 🟠 Major

Allow overlapping or duplicate ranges instead of throwing.

Two rows can legitimately reference the same bytes, or partially overlapping slices, in the same external file. The current overlap check turns that into an exception as soon as both rows land in the same lookahead batch.

Proposed fix
-        val gap = row.offset - currentEndOffset
-        // Check for overlap
-        if (row.offset < currentEndOffset) {
-          throw new IllegalArgumentException(
-            s"Overlapping blob ranges detected: previous range [${currentStartOffset}, ${currentEndOffset}) and current row [${row.offset}, ${row.offset + row.length}) in file ${row.filePath}"
-          )
-        }
-        if (gap >= 0 && gap <= maxGap) {
+        val gap = row.offset - currentEndOffset
+        if (gap <= maxGap) {
           // Merge into current range
           currentEndOffset = math.max(currentEndOffset, row.offset + row.length)
           currentRows += row
         } else {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala`
around lines 318 - 329, The overlap check in BatchedBlobReader.scala (within the
batching logic that uses currentStartOffset, currentEndOffset, currentRows and
row) incorrectly throws on overlapping or duplicate ranges; change it to accept
overlaps by removing the IllegalArgumentException and treating
overlapping/duplicate rows like adjacent ones: update currentEndOffset =
math.max(currentEndOffset, row.offset + row.length) and add row to currentRows
so the batch includes both references, ensuring downstream read logic can handle
duplicate/overlapping slices; keep the existing gap/maxGap handling for
non-overlapping gaps.
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala-237-243 (1)

237-243: ⚠️ Potential issue | 🟠 Major

Quarantine this known-failing Lance VARIANT test until support is implemented.

The test acknowledges that Lance writer has no VARIANT handling today and will fail until RFC-100 Phase 2 lands. With lance.skip.tests=false by default in the Spark 4.0 build configuration, this test will execute and fail on every default run instead of being skipped. Other Lance tests in the codebase use @DisabledIfSystemProperty(named = "lance.skip.tests", matches = "true") for proper test isolation; this test should either use the same annotation or be marked ignore() until VARIANT support is available.

Suggested quarantine
-  test("Test Query Log Only MOR Table With VARIANT column triggers compaction (Lance)") {
+  ignore("Test Query Log Only MOR Table With VARIANT column triggers compaction (Lance)") {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala`
around lines 237 - 243, The test "Test Query Log Only MOR Table With VARIANT
column triggers compaction (Lance)" is a known-failing Lance VARIANT test and
should be quarantined; update the test declaration in TestVariantDataType.scala
to either add the JUnit annotation `@DisabledIfSystemProperty(named = "lance.skip.tests", matches = "true")` above the test or convert the test(...)
call to ignore(...) (or equivalent test-framework skip) so the test is skipped
when lance.skip.tests is not set to true, referencing the test string or the
enclosing test method to locate the change.
hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/InternalSchemaConverter.java-71-79 (1)

71-79: ⚠️ Potential issue | 🟠 Major

Avoid positive field IDs inside the BLOB sentinel record.

Using 0..3 for external_reference child fields can collide with regular schema field IDs (which are also allocated from 0), creating ambiguity in ID-based schema operations. Use dedicated negative sentinel IDs for these nested fields too.

🔧 Suggested fix
+  static final int BLOB_REFERENCE_PATH_FIELD_ID = -13;
+  static final int BLOB_REFERENCE_OFFSET_FIELD_ID = -14;
+  static final int BLOB_REFERENCE_LENGTH_FIELD_ID = -15;
+  static final int BLOB_REFERENCE_MANAGED_FIELD_ID = -16;
+
   private static Types.RecordType buildBlobInternalRecordType() {
     List<Types.Field> referenceFields = new ArrayList<>(4);
-    referenceFields.add(Types.Field.get(0, false,
+    referenceFields.add(Types.Field.get(BLOB_REFERENCE_PATH_FIELD_ID, false,
         HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH, Types.StringType.get()));
-    referenceFields.add(Types.Field.get(1, true,
+    referenceFields.add(Types.Field.get(BLOB_REFERENCE_OFFSET_FIELD_ID, true,
         HoodieSchema.Blob.EXTERNAL_REFERENCE_OFFSET, Types.LongType.get()));
-    referenceFields.add(Types.Field.get(2, true,
+    referenceFields.add(Types.Field.get(BLOB_REFERENCE_LENGTH_FIELD_ID, true,
         HoodieSchema.Blob.EXTERNAL_REFERENCE_LENGTH, Types.LongType.get()));
-    referenceFields.add(Types.Field.get(3, false,
+    referenceFields.add(Types.Field.get(BLOB_REFERENCE_MANAGED_FIELD_ID, false,
         HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED, Types.BooleanType.get()));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-common/src/main/java/org/apache/hudi/internal/schema/convert/InternalSchemaConverter.java`
around lines 71 - 79, The BLOB sentinel's nested fields in
InternalSchemaConverter use positive IDs (referenceFields list), which can clash
with real schema field IDs; change the field ID integers for the four
referenceFields (the entries created with Types.Field.get(...) for
HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH, EXTERNAL_REFERENCE_OFFSET,
EXTERNAL_REFERENCE_LENGTH, EXTERNAL_REFERENCE_IS_MANAGED) to dedicated negative
sentinel IDs (e.g., -1, -2, -3, -4 or another reserved negative sequence) so
these nested IDs cannot collide with normal allocated IDs; update any
comments/constants if present to document the negative-id sentinel choice.
🟡 Minor comments (10)
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java-666-671 (1)

666-671: ⚠️ Potential issue | 🟡 Minor

getWritePipelineWithObjectReuse bypasses pipeline-mode routing.

This helper hardcodes InsertFunctionWrapper, unlike getWritePipeline(...) which respects bulk-insert/bucket/stream modes. That can create the wrong test pipeline for non-append configs and reduce test validity.

Proposed adjustment
 public static TestFunctionWrapper<RowData> getWritePipelineWithObjectReuse(
     String basePath, Configuration conf) throws Exception {
-  org.apache.flink.api.common.ExecutionConfig execConfig = new org.apache.flink.api.common.ExecutionConfig();
-  execConfig.enableObjectReuse();
-  return new InsertFunctionWrapper<>(basePath, conf, execConfig);
+  org.apache.flink.api.common.ExecutionConfig execConfig = new org.apache.flink.api.common.ExecutionConfig();
+  execConfig.enableObjectReuse();
+  if (OptionsResolver.isAppendMode(conf)) {
+    return new InsertFunctionWrapper<>(basePath, conf, execConfig);
+  }
+  return getWritePipeline(basePath, conf);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestData.java`
around lines 666 - 671, The helper getWritePipelineWithObjectReuse currently
hardcodes returning new InsertFunctionWrapper<>(basePath, conf, execConfig)
which bypasses the routing logic used by getWritePipeline and can produce
incorrect pipelines for non-append configs; change
getWritePipelineWithObjectReuse to enable object reuse on a local
ExecutionConfig and then delegate to the existing getWritePipeline routing logic
(call getWritePipeline(basePath, conf, execConfig) or the overload that accepts
an ExecutionConfig) instead of directly instantiating InsertFunctionWrapper so
the same bulk-insert / bucket / stream selection is preserved.
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/InsertFunctionWrapper.java-83-91 (1)

83-91: ⚠️ Potential issue | 🟡 Minor

Propagate the supplied ExecutionConfig through the entire wrapper.

This constructor only updates the mocked environment/runtime context. The StreamTask created later still uses its own hard-coded ExecutionConfig, so async-clustering tests can end up running with different object-reuse semantics than the caller requested.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/utils/InsertFunctionWrapper.java`
around lines 83 - 91, The InsertFunctionWrapper constructor sets up a
MockEnvironment and MockStreamingRuntimeContext with the supplied
ExecutionConfig but doesn't propagate that ExecutionConfig to the StreamTask (or
any later-created task) which still uses a hard-coded/default ExecutionConfig;
update the wrapper so the same ExecutionConfig instance passed into
InsertFunctionWrapper(...) is used when constructing the
StreamTask/MockStreamTask and any related task builders/creators (e.g., where
StreamTask is instantiated or ExecutionConfig is fetched), by passing the
executionConfig variable through those constructors or setters so object-reuse
semantics match the caller's ExecutionConfig.
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java-877-883 (1)

877-883: ⚠️ Potential issue | 🟡 Minor

Prefer try-with-resources for the new write clients.

These clients are closed after assertions, and writeClient1.close() can also NPE if construction fails, which can mask the real test failure.

Proposed fix
-      SparkRDDWriteClient writeClient1 = null;
-      try {
-        writeClient1 = getHoodieWriteClient(config);
-        writeClient1.savepoint(fourthCommitTs, "user", "comment");
-      } finally {
-        writeClient1.close();
-      }
+      try (SparkRDDWriteClient<?> writeClient1 = getHoodieWriteClient(config)) {
+        writeClient1.savepoint(fourthCommitTs, "user", "comment");
+      }

Apply the same pattern to the other new writeClient usages in these tests.

Also applies to: 893-899, 905-926, 977-986, 1011-1023

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/table/functional/TestCleanPlanExecutor.java`
around lines 877 - 883, Wrap creation of SparkRDDWriteClient instances returned
by getHoodieWriteClient in try-with-resources so they are always closed even if
construction or subsequent assertions fail; specifically replace manual
new/close usage around variables like writeClient (used with runCleaner and
metaClient checks) with try (SparkRDDWriteClient<?> writeClient =
getHoodieWriteClient(config)) { ... } blocks, and apply the same pattern to the
other writeClient occurrences referenced in this test (lines around 893-899,
905-926, 977-986, 1011-1023) so no close() is called on a
null/partially-constructed client and resources are reliably released.
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java-260-263 (1)

260-263: ⚠️ Potential issue | 🟡 Minor

Use >= for the empty-clean interval boundary.

At exactly the configured interval, this still suppresses the empty clean, so the planner does one more full-scan cycle than necessary. This looks like an off-by-one in the threshold check.

Proposed fix
-          eligibleForEmptyCleanCommit = currentCleanTimeMs - lastCleanTimeMs > (TimeUnit.HOURS.toMillis(config.getIntervalToCreateEmptyCleanHours()));
+          eligibleForEmptyCleanCommit = currentCleanTimeMs - lastCleanTimeMs >= TimeUnit.HOURS.toMillis(config.getIntervalToCreateEmptyCleanHours());
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/clean/CleanPlanActionExecutor.java`
around lines 260 - 263, The empty-clean eligibility check currently computes
currentCleanTimeMs and lastCleanTimeMs and sets eligibleForEmptyCleanCommit
using a strict greater-than comparison; change the condition in the assignment
to use a greater-than-or-equal (>=) comparison so that when the elapsed time
equals TimeUnit.HOURS.toMillis(config.getIntervalToCreateEmptyCleanHours()) the
empty clean is allowed; update the expression that assigns
eligibleForEmptyCleanCommit (which references latestDateTime,
currentCleanTimeMs, lastCleanTimeMs,
HoodieInstantTimeGenerator.parseDateFromInstantTime(...), and
config.getIntervalToCreateEmptyCleanHours()) to use >= instead of >.
docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml-18-34 (1)

18-34: ⚠️ Potential issue | 🟡 Minor

Either mount the declared namenode volume or remove it.

namenode is declared as a top-level volume, but the namenode service never mounts it. Right now the file advertises persistence while HDFS metadata still lives in the container filesystem and disappears with the container.

Also applies to: 259-263

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_arm64.yml` around
lines 18 - 34, The docker-compose advertises a top-level volume named "namenode"
but the namenode service does not mount it, so either mount the volume into the
namenode service (e.g., add a volumes: entry under the namenode service like " -
namenode:/hadoop/dfs/name" or another appropriate HDFS namenode metadata
directory) or remove the top-level "namenode" volume declaration if persistence
is not desired; update the same pattern for the other service mentioned (the
other service declared at lines 259-263) so declared volumes are actually
mounted or removed.
docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml-18-34 (1)

18-34: ⚠️ Potential issue | 🟡 Minor

Either mount the declared namenode volume or remove it.

namenode is declared as a top-level volume, but the namenode service never mounts it. Right now the file advertises persistence while HDFS metadata still lives in the container filesystem and disappears with the container.

Also applies to: 259-263

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/compose/docker-compose_hadoop340_hive2310_spark402_amd64.yml` around
lines 18 - 34, The docker-compose declares a top-level volume named "namenode"
but the namenode service does not mount it, so either mount it into the service
or remove the unused volume declaration; fix by adding a volumes section to the
namenode service (e.g., under service "namenode" add volumes: -
namenode:/hadoop/dfs/name or the container's NameNode metadata path) and ensure
the top-level volumes: block still defines "namenode", or alternatively delete
the top-level "namenode" volume if persistence is not desired; apply the same
change to any other services that declare volumes but do not mount them.
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java-56-77 (1)

56-77: ⚠️ Potential issue | 🟡 Minor

Use the injected config as the single source of truth.

populateMetaFields is captured before applyConfigInjector, while the rest of the parquet writer settings come from hoodieConfig. If an injector changes any setting that affects meta-field or bloom-filter behavior, this path initializes the writer from a mixed config snapshot.

Suggested fix
-    boolean populateMetaFields = config.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
-
     Pair<StorageConfiguration, HoodieConfig> injectedConfigs = HoodieParquetConfigInjector.applyConfigInjector(path, storage.getConf(), config);
     StorageConfiguration storageConfiguration = injectedConfigs.getLeft();
     HoodieConfig hoodieConfig = injectedConfigs.getRight();
+    boolean populateMetaFields = hoodieConfig.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkFileWriterFactory.java`
around lines 56 - 77, The code reads populateMetaFields from config before
calling HoodieParquetConfigInjector.applyConfigInjector, producing a mixed
snapshot; move the populateMetaFields assignment to after the
applyConfigInjector call and derive it from the injected HoodieConfig (i.e., use
hoodieConfig.getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS)), then
pass that value into enableBloomFilter(populateMetaFields, hoodieConfig) and
into the HoodieRowParquetConfig construction so all parquet/writer settings
consistently come from the injected hoodieConfig; update references to the
populateMetaFields local variable and ensure
HoodieParquetConfigInjector.applyConfigInjector, enableBloomFilter, and
HoodieRowParquetConfig use the injected hoodieConfig value.
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReaderStrategy.scala-40-48 (1)

40-48: ⚠️ Potential issue | 🟡 Minor

Validate batching configs before calling toInt.

A bad session conf currently bubbles up as a raw NumberFormatException during planning, and non-positive values are not rejected here. Please fail fast with a clear config error before building the exec node.

Suggested fix
+import scala.util.Try
+
 case class BatchedBlobReaderStrategy(sparkSession: SparkSession) extends SparkStrategy {
   def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
     case read @ BatchedBlobRead(child, _, _) =>
       // TODO find proper way to access these configs
-      val maxGapBytes = HoodieSparkConfUtils.getConfigValue(
-        Map.empty, sparkSession.sessionState.conf,
-        BatchedBlobReader.MAX_GAP_BYTES_CONF,
-        String.valueOf(BatchedBlobReader.DEFAULT_MAX_GAP_BYTES)).toInt
+      val maxGapBytes = parsePositiveInt(
+        BatchedBlobReader.MAX_GAP_BYTES_CONF,
+        BatchedBlobReader.DEFAULT_MAX_GAP_BYTES)
 
-      val lookaheadSize = HoodieSparkConfUtils.getConfigValue(
-        Map.empty, sparkSession.sessionState.conf,
-        BatchedBlobReader.LOOKAHEAD_SIZE_CONF,
-        String.valueOf(BatchedBlobReader.DEFAULT_LOOKAHEAD_SIZE)).toInt
+      val lookaheadSize = parsePositiveInt(
+        BatchedBlobReader.LOOKAHEAD_SIZE_CONF,
+        BatchedBlobReader.DEFAULT_LOOKAHEAD_SIZE)
 
       val storageConf = new HadoopStorageConfiguration(sparkSession.sparkContext.hadoopConfiguration)
       BatchedBlobReadExec(
@@
     case _ => Nil
   }
+
+  private def parsePositiveInt(key: String, defaultValue: Int): Int = {
+    val raw = HoodieSparkConfUtils.getConfigValue(
+      Map.empty,
+      sparkSession.sessionState.conf,
+      key,
+      defaultValue.toString)
+
+    Try(raw.toInt).filter(_ > 0).getOrElse {
+      throw new IllegalArgumentException(s"$key must be a positive integer, got: $raw")
+    }
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReaderStrategy.scala`
around lines 40 - 48, The two config reads for maxGapBytes and lookaheadSize in
BatchedBlobReaderStrategy call .toInt directly and can throw
NumberFormatException or accept non-positive values; change the logic to
validate parsing and value ranges before building the exec node: for each config
key (BatchedBlobReader.MAX_GAP_BYTES_CONF and
BatchedBlobReader.LOOKAHEAD_SIZE_CONF) retrieve the string via
HoodieSparkConfUtils.getConfigValue, attempt a safe parse (e.g., parseInt with
explicit error handling or toIntOption), if parsing fails throw an
IllegalArgumentException with a clear message including the config name and bad
value, and if parsed value <= 0 also throw a clear IllegalArgumentException;
then assign the validated integers to maxGapBytes and lookaheadSize (using
DEFAULT_* when appropriate or only after validation).
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestFileGroupReaderPartitionColumn.scala-130-135 (1)

130-135: ⚠️ Potential issue | 🟡 Minor

Avoid toMap here; it can mask duplicate-key regressions.

collect().map(...).toMap silently keeps one value per id. If MOR merge regresses and returns duplicate rows per key, this test can still pass.

💡 Suggested hardening
-    val rows = spark.read.format("hudi").load(basePath)
-      .select("id", "country")
-      .collect()
-      .map(r => r.getLong(0) -> (if (r.isNullAt(1)) null else r.getString(1)))
-      .toMap
+    val grouped = spark.read.format("hudi").load(basePath)
+      .select("id", "country")
+      .collect()
+      .groupBy(_.getLong(0))
+
+    assertEquals(
+      Set.empty[Long],
+      grouped.collect { case (id, rs) if rs.lengthCompare(1) != 0 => id }.toSet,
+      "Expected exactly one row per id")
+
+    val rows = grouped.map { case (id, rs) =>
+      id -> (if (rs.head.isNullAt(1)) null else rs.head.getString(1))
+    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestFileGroupReaderPartitionColumn.scala`
around lines 130 - 135, The current test builds `rows` using
`collect().map(...).toMap`, which hides duplicate `id` values; change the test
to detect duplicates explicitly by collecting into a Seq (keep the
`collect().map(r => r.getLong(0) -> (if (r.isNullAt(1)) null else
r.getString(1)))` transformation but do not call `toMap`), then group by the id
(e.g., call `.groupBy(_._1)`), assert each group has size 1 (fail the test if
any id has >1 rows), and only then build a Map of id -> value for subsequent
assertions; reference the `rows` binding and the `collect().map(...).toMap`
usage when locating where to change the code.
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala-752-775 (1)

752-775: ⚠️ Potential issue | 🟡 Minor

Use the actual metadata contract in the docs and error text.

This helper no longer searches for hudi_blob=true; it searches for hudi_type=BLOB. Leaving the old wording here will send callers to the wrong metadata key when resolution fails.

Proposed fix
-   * Find the blob column by searching for metadata hudi_blob=true.
+   * Find the blob column by searching for metadata hudi_type=BLOB.
@@
-    // Find field with metadata key hudi_blob=true
+    // Find field with metadata key hudi_type=BLOB
@@
-        s"No blob column found with metadata hudi_blob=true. " +
+        s"No blob column found with metadata ${HoodieSchema.TYPE_METADATA_FIELD}=BLOB. " +
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/blob/BatchedBlobReader.scala`
around lines 752 - 775, Update the Javadoc and error message in getBlobColumn to
reference the actual metadata contract ("hudi_type=BLOB") instead of the
outdated "hudi_blob=true"; specifically change the method comment and the
IllegalArgumentException text to mention the metadata key
HoodieSchema.TYPE_METADATA_FIELD with value HoodieSchemaType.BLOB (or the
literal "hudi_type=BLOB") so callers are directed to the correct metadata key
used by the lookup logic that checks
field.metadata.getString(HoodieSchema.TYPE_METADATA_FIELD).

Comment on lines +1319 to +1326
      if (!this.heartbeatClient.isHeartbeatExpired(rollbackInstantOpt.get().requestedTime())) {
        LOG.error("Rollback heartbeat already exists for instant {}", rollbackInstantOpt.get().requestedTime());
        return false;
      }
      if (table.getMetaClient().reloadActiveTimeline().getRollbackTimeline().filterCompletedInstants().getInstantsAsStream()
          .anyMatch(instant -> EQUALS.test(instant.requestedTime(), rollbackInstantOpt.get().requestedTime()))) {
        LOG.info("Requested rollback instant {} is already completed in the active timeline", rollbackInstantOpt.get().requestedTime());
        return false;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
sed -n '116,135p' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java
rg -n '\bLOG\b|\blog\b' hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java

Repository: hudi-agent/hudi

Length of output: 5789


Change LOG to log for compilation.

@Slf4j generates log, not LOG. Lines 1320 and 1325 use an undefined logger field and will not compile.

Suggested fix
-        LOG.error("Rollback heartbeat already exists for instant {}", rollbackInstantOpt.get().requestedTime());
+        log.error("Rollback heartbeat already exists for instant {}", rollbackInstantOpt.get().requestedTime());
         return false;
       }
       if (table.getMetaClient().reloadActiveTimeline().getRollbackTimeline().filterCompletedInstants().getInstantsAsStream()
           .anyMatch(instant -> EQUALS.test(instant.requestedTime(), rollbackInstantOpt.get().requestedTime()))) {
-        LOG.info("Requested rollback instant {} is already completed in the active timeline", rollbackInstantOpt.get().requestedTime());
+        log.info("Requested rollback instant {} is already completed in the active timeline", rollbackInstantOpt.get().requestedTime());
         return false;
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieTableServiceClient.java`
around lines 1319 - 1326, The code in BaseHoodieTableServiceClient is using an
undefined logger name `LOG`; change those calls to the SLF4J-generated `log`
instance. Specifically, replace the two occurrences where `LOG.error(...)` and
`LOG.info(...)` are invoked around the heartbeat and rollback timeline checks
(the block that calls
heartbeatClient.isHeartbeatExpired(rollbackInstantOpt.get().requestedTime()) and
the subsequent
table.getMetaClient().reloadActiveTimeline().getRollbackTimeline()...anyMatch(...)
check) to use `log.error(...)` and `log.info(...)` respectively so the class
compiles with the `@Slf4j`-provided logger.

Comment on lines +113 to 138
    // Build ColumnVector[] in Spark-schema order by looking each field up by name;
    // lance-spark 0.4.0's VectorSchemaRoot may return the file's on-disk order, which
    // would misalign the UnsafeProjection. Cached on the first batch and reused thereafter.
    if (columnVectors == null) {
      List<FieldVector> fieldVectors = root.getFieldVectors();
      Map<String, FieldVector> byName = new HashMap<>(fieldVectors.size() * 2);
      for (FieldVector fv : fieldVectors) {
        byName.put(fv.getName(), fv);
      }
      StructField[] sparkFields = sparkSchema.fields();
      if (sparkFields.length != fieldVectors.size()) {
        throw new HoodieException("Lance batch column count " + fieldVectors.size()
            + " does not match expected Spark schema size " + sparkFields.length
            + " for file: " + path);
      }
      columnVectors = new ColumnVector[sparkFields.length];
      for (int i = 0; i < sparkFields.length; i++) {
        String name = sparkFields[i].name();
        FieldVector fv = byName.get(name);
        if (fv == null) {
          throw new HoodieException("Lance batch missing expected column '" + name
              + "' for file: " + path + "; available columns: " + byName.keySet());
        }
        columnVectors[i] = new LanceArrowColumnVector(fv);
      }
    }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Does lance-spark 0.4.0 ArrowReader reuse FieldVector instances across batches?

💡 Result:

No, lance-spark 0.4.0 ArrowReader does not reuse FieldVector instances across batches. No source explicitly states reuse of FieldVector instances. Lance Spark uses JNI bindings to lance-core Rust library for reading Lance data into Arrow RecordBatches. Each batch read likely creates new FieldVectors, as typical in Arrow readers without explicit pooling or zero-copy mechanisms documented. Memory management mentions like "fix: Arrow memory leak in LanceCountStarPartitionReader" suggest allocation per operation rather than reuse. Official docs, releases up to v0.4.0-beta.1, and source searches show no evidence of FieldVector reuse across batches.


Do not cache columnVectors across batches — Lance creates new FieldVector instances per batch, causing stale data to be read.

The code caches the columnVectors array on the first batch (if (columnVectors == null)) and reuses it for all subsequent batches. However, lance-spark 0.4.0's ArrowReader returns new FieldVector instances with each batch read. This means the cached LanceArrowColumnVector wrappers continue referencing the original FieldVectors from the first batch, reading stale data for all subsequent batches.

Remove the null check and rebuild the column vectors on each batch read. Alternatively, update the LanceArrowColumnVector references to point to the new FieldVectors from the current batch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/LanceRecordIterator.java`
around lines 113 - 138, The current implementation caches the ColumnVector[] in
LanceRecordIterator (the columnVectors field) on the first batch, but
lance-spark returns new FieldVector instances each batch causing stale data;
remove the "if (columnVectors == null)" guard so that for every batch you
rebuild columnVectors by calling root.getFieldVectors(), mapping byName and
creating new LanceArrowColumnVector(fv) instances (or alternatively update each
existing LanceArrowColumnVector to point to the new FieldVector), ensuring the
logic around sparkSchema.fields() and the existing error checks (throwing
HoodieException with path) remains the same.

    )
    val df = spark.createDataFrame(spark.sparkContext.parallelize(data), schema).coalesce(1)

    writeDataframe(tableType, tableName, tablePath, df, saveMode = SaveMode.Overwrite)

⚠️ Potential issue | 🔴 Critical

These writes target a precombine field that does not exist in the test schema.

writeDataframe(...) hardcodes hoodie.datasource.write.precombine.field=age, but neither testNullableVectorRoundTrip nor testVectorProjection defines an age column. These writes will fail before they exercise the vector path.

Suggested fix
-    writeDataframe(tableType, tableName, tablePath, df, saveMode = SaveMode.Overwrite)
+    writeDataframe(
+      tableType, tableName, tablePath, df, saveMode = SaveMode.Overwrite,
+      extraOptions = Map(PRECOMBINE_FIELD.key() -> "id"))
-    writeDataframe(tableType, tableName, tablePath, df, saveMode = SaveMode.Overwrite)
+    writeDataframe(
+      tableType, tableName, tablePath, df, saveMode = SaveMode.Overwrite,
+      extraOptions = Map(PRECOMBINE_FIELD.key() -> "id"))

Also applies to: 1080-1080

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestLanceDataSource.scala`
at line 1048, The writes in testNullableVectorRoundTrip and testVectorProjection
call writeDataframe which currently hardcodes
hoodie.datasource.write.precombine.field=age, but those tests' DataFrames have
no age column so the write fails; fix by making writeDataframe accept a
precombineField parameter (or accept a Map of write options) and update the two
test calls to pass a valid existing column name from their schemas (e.g., "id"
or the timestamp column) instead of "age", or alternatively add an "age" column
to the test DataFrames if that is preferred; update references to writeDataframe
in these tests to use the new parameter and ensure the property name
hoodie.datasource.write.precombine.field is set to the chosen existing column.

Mirror the parquet MOR log-only compaction tests for VECTOR, VARIANT, and
BLOB onto the Lance base file format, and extend all variants with a
6th deltacommit so the cleaner has a chance to retire the post-compaction
log-only slice and write a .clean instant.

- VECTOR Lance: passes; verifies HoodieFileFormat.LANCE on the table
  config and that a .lance base file exists under the table path after
  compaction.
- VARIANT Lance / BLOB INLINE Lance / BLOB OUT_OF_LINE Lance: gated by
  -Dlance.skip.tests; expected to fail at HoodieSparkLanceWriter ->
  LanceArrowUtils.toArrowType (RFC-100 Phase 2 gap). Each asserts the
  LANCE format config sticks to hoodie.properties immediately after
  CREATE TABLE so the table-level invariant is checked even when the
  writer fails downstream.
- All 8 tests (4 parquet + 4 Lance) now drive a 6th merge-update after
  the compaction-triggering 5th commit. The 5th commit's auto-clean
  runs before inline compaction, so the prior log slice is not yet
  superseded; the 6th commit's postCommit clean retires it and writes
  the .clean instant. The cleaner-timeline assertion uses
  reloadActiveTimeline() to avoid a stale cached view.
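
A hedged sketch of that assertion's shape (assumes the suite's createMetaClient helper and the standard Hudi timeline accessors; the actual test may differ):

// Reload before asserting so the check sees the .clean instant written by the
// 6th commit's postCommit clean instead of a stale cached timeline view.
val timeline = createMetaClient(spark, tablePath).reloadActiveTimeline()
assert(timeline.getCleanerTimeline.filterCompletedInstants().countInstants() >= 1,
  s"Expected a completed .clean instant after the 6th deltacommit under $tablePath")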
@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala (1)

127-234: Consider extracting shared MOR compaction flow into a helper.

The default and LANCE tests duplicate a long commit/merge/assert sequence. A shared helper (parameterized by base file format + extra assertions) would reduce drift and maintenance cost.

Also applies to: 244-349
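
A possible helper shape, sketched with hypothetical names and parameters:

  // Runs the shared create/insert/merge/compaction flow once; each caller supplies
  // its base file format and any format-specific assertions via the callback.
  private def runMorCompactionFlow(
      tableName: String,
      tablePath: String,
      baseFileFormat: String = "PARQUET",
      extraAssertions: String => Unit = _ => ()): Unit = {
    // shared spark.sql inserts/merges, isLogFileOnly checks, checkAnswer calls ...
    extraAssertions(tablePath)
  }

The LANCE variant would then pass baseFileFormat = "LANCE" along with its .lance base-file assertion, keeping the two tests from drifting apart.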

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala`
around lines 127 - 234, The test duplicates a long MOR compaction flow; extract
the repeated sequence into a helper method (e.g., runMorCompactionFlow or
assertMorCompactionRoundtrip) that accepts parameters like the target table
name, tmp path, and an optional callback for extra assertions; move the repeated
spark.sql insert/merge calls, the DataSourceTestUtils.isLogFileOnly checks,
checkAnswer verifications, the variant field round-trip assertion (finding
schema field "v" and asserting typeName == "variant"), and the createMetaClient
timeline assertion into that helper, then call it from withRecordType() and the
LANCE test with the appropriate parameters to remove duplication while
preserving existing checks (references: withRecordType(),
DataSourceTestUtils.isLogFileOnly, checkAnswer, createMetaClient, and the schema
lookup for "v").
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala (1)

56-283: Consider extracting a shared MOR compaction scenario helper.

The four tests duplicate the same create/insert/merge/compaction/clean assertions with only blob-shape and format differences. A small helper would reduce drift and make future custom-type additions easier to maintain.

Also applies to: 285-517

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`
around lines 56 - 283, Extract the repeated MOR compaction scenario into a
single helper (e.g., runMorCompactionScenario) that accepts parameters for the
blob value builder (inlineBlobLiteral or outOfLineBlobLiteral), any pre-created
blob files, and two validator callbacks (post-compaction bytes check and
blob-shape assertions); move the shared steps—table creation (using create table
SQL with type='mor' and hoodie.compact.inline='true'), the three inserts/merges
that drive compaction, the DataSourceTestUtils.isLogFileOnly assertions, the
post-compaction read_blob verification (using spark.sql(...).collect().map(...)
and DataSourceTestUtils/BlobTestHelpers checks), metadata assertions on
spark.table(...).schema (checking HoodieSchema.TYPE_METADATA_FIELD and
HoodieSchemaType.BLOB), the final merge that triggers cleaning, and metaClient
cleaner timeline assertion—into that helper and replace the four near-duplicate
test bodies (Test Query Log Only MOR Table With BLOB INLINE/OUT_OF_LINE and the
other two similar tests) to call it with the appropriate literal generator
(inlineBlobLiteral or outOfLineBlobLiteral), file args (file1..file4 or none),
and the specific byte/shape assertions; reference helpers and symbols in the
diff such as inlineBlobLiteral, outOfLineBlobLiteral,
DataSourceTestUtils.isLogFileOnly, read_blob,
BlobTestHelpers.assertBytesContent, HoodieSchema.TYPE_METADATA_FIELD,
HoodieSchema.Blob.* and createMetaClient when wiring parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`:
- Around line 313-316: Tests in TestBlobDataType.scala currently only assert the
table config reports HoodieFileFormat.LANCE but do not verify that compaction
actually produced a .lance base file; add a filesystem-level assertion after the
existing config check (after where createMetaClient(spark,
tablePath).getTableConfig.getBaseFileFormat is asserted) that scans the
tablePath recursively via Hadoop FileSystem (e.g.,
FileSystem.get(spark.sparkContext.hadoopConfiguration)) and asserts at least one
file path endsWith ".lance" (or matches the expected Lance base-file suffix) to
ensure a real .lance artifact was written; use the existing tablePath variable
and place the check in both spots mentioned so the test fails if no .lance base
file exists after compaction.
- Around line 358-365: The test currently only asserts EXTERNAL_REFERENCE_PATH
is null for an INLINE reference; tighten it by verifying all leaves of the
reference struct are the expected null/default values: after obtaining ref from
blob.getStruct(refIdx) assert
ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH)),
ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_OFFSET)), and
ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_LENGTH)), and
assert the managed flag (HoodieSchema.Blob.EXTERNAL_REFERENCE_MANAGED) is the
expected default (e.g., false or null as per schema) to ensure no partially
populated INLINE reference passes.
- Around line 397-402: The test "Test Query Log Only MOR Table With BLOB
OUT_OF_LINE column triggers compaction (Lance)" currently runs by default
because the assume check uses System.getProperty("lance.skip.tests") != "true"
while the property defaults to "false"; change the gating to opt-in by checking
for equality (e.g., System.getProperty("lance.skip.tests") == "true") so the
test only runs when explicitly enabled, or alternatively mark the test with
`@Ignore` (or wrap assertions in an assertThrows/try-catch) until
HoodieSparkLanceWriter adds BLOB support to avoid failing default test runs.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala`:
- Around line 137-144: The table DDL in TestVariantDataType sets
hoodie.compact.inline='true' but leaves the compaction trigger threshold
implicit, causing commit-count flakiness; add an explicit compaction trigger
like hoodie.compact.inline.max.delta.commits='<expected_count>' to the
tblproperties in the shown DDL block (and the other occurrence around lines
255-262) so the test’s expected commit boundaries (the commits asserted at
3/4/5/6) are stable; update the table properties in the TestVariantDataType test
cases to include this property with the threshold the test expects.
- Around line 239-243: The test currently runs as a normal pass-path but
comments say VARIANT support is missing; change the `assume(...)` guard in
TestVariantDataType so VARIANT tests are opt-in instead of running by default:
replace the existing `assume(System.getProperty("lance.skip.tests") != "true",
...)` check with an explicit opt-in like
`assume(System.getProperty("lance.enable.variant") == "true", "LANCE VARIANT
tests disabled unless -Dlance.enable.variant=true")` (or equivalent logic) so
the test is skipped by default until Lance VARIANT support is implemented.

---

Nitpick comments:
In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`:
- Around line 56-283: Extract the repeated MOR compaction scenario into a single
helper (e.g., runMorCompactionScenario) that accepts parameters for the blob
value builder (inlineBlobLiteral or outOfLineBlobLiteral), any pre-created blob
files, and two validator callbacks (post-compaction bytes check and blob-shape
assertions); move the shared steps—table creation (using create table SQL with
type='mor' and hoodie.compact.inline='true'), the three inserts/merges that
drive compaction, the DataSourceTestUtils.isLogFileOnly assertions, the
post-compaction read_blob verification (using spark.sql(...).collect().map(...)
and DataSourceTestUtils/BlobTestHelpers checks), metadata assertions on
spark.table(...).schema (checking HoodieSchema.TYPE_METADATA_FIELD and
HoodieSchemaType.BLOB), the final merge that triggers cleaning, and metaClient
cleaner timeline assertion—into that helper and replace the four near-duplicate
test bodies (Test Query Log Only MOR Table With BLOB INLINE/OUT_OF_LINE and the
other two similar tests) to call it with the appropriate literal generator
(inlineBlobLiteral or outOfLineBlobLiteral), file args (file1..file4 or none),
and the specific byte/shape assertions; reference helpers and symbols in the
diff such as inlineBlobLiteral, outOfLineBlobLiteral,
DataSourceTestUtils.isLogFileOnly, read_blob,
BlobTestHelpers.assertBytesContent, HoodieSchema.TYPE_METADATA_FIELD,
HoodieSchema.Blob.* and createMetaClient when wiring parameters.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala`:
- Around line 127-234: The test duplicates a long MOR compaction flow; extract
the repeated sequence into a helper method (e.g., runMorCompactionFlow or
assertMorCompactionRoundtrip) that accepts parameters like the target table
name, tmp path, and an optional callback for extra assertions; move the repeated
spark.sql insert/merge calls, the DataSourceTestUtils.isLogFileOnly checks,
checkAnswer verifications, the variant field round-trip assertion (finding
schema field "v" and asserting typeName == "variant"), and the createMetaClient
timeline assertion into that helper, then call it from withRecordType() and the
LANCE test with the appropriate parameters to remove duplication while
preserving existing checks (references: withRecordType(),
DataSourceTestUtils.isLogFileOnly, checkAnswer, createMetaClient, and the schema
lookup for "v").

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 1c13b86e-e4d3-45ee-8bfa-0ad371455b11

📥 Commits

Reviewing files that changed from the base of the PR and between 19c9a6a and cf663ca.

📒 Files selected for processing (4)
  • hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala
🚧 Files skipped from review as they are similar to previous changes (1)
  • hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestVectorDataSource.scala

Comment on lines +313 to +316
// Verify the LANCE config was actually persisted to hoodie.properties.
assertResult(HoodieFileFormat.LANCE)(
  createMetaClient(spark, tablePath).getTableConfig.getBaseFileFormat)


⚠️ Potential issue | 🟠 Major

Assert actual .lance base-file creation after compaction.

Line 314 and Line 435 only validate table config, not emitted file artifacts. If compaction silently writes a non-Lance base file, these tests can still pass. Please add a filesystem assertion after Line 343 and Line 467 that at least one .lance base file exists.

🧪 Proposed test hardening
+  private def hasBaseFileWithExtension(root: String, ext: String): Boolean = {
+    val stream = java.nio.file.Files.walk(new File(root).toPath)
+    try {
+      stream.anyMatch(path =>
+        java.nio.file.Files.isRegularFile(path) &&
+          path.getFileName.toString.endsWith(ext))
+    } finally {
+      stream.close()
+    }
+  }
...
       assertResult(false)(DataSourceTestUtils.isLogFileOnly(tablePath))
+      assert(hasBaseFileWithExtension(tablePath, ".lance"),
+        s"Expected at least one .lance base file after compaction under $tablePath")
...
       assertResult(false)(DataSourceTestUtils.isLogFileOnly(tablePath))
+      assert(hasBaseFileWithExtension(tablePath, ".lance"),
+        s"Expected at least one .lance base file after compaction under $tablePath")

Also applies to: 343-343, 434-437, 467-467

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`
around lines 313 - 316, Tests in TestBlobDataType.scala currently only assert
the table config reports HoodieFileFormat.LANCE but do not verify that
compaction actually produced a .lance base file; add a filesystem-level
assertion after the existing config check (after where createMetaClient(spark,
tablePath).getTableConfig.getBaseFileFormat is asserted) that scans the
tablePath recursively via Hadoop FileSystem (e.g.,
FileSystem.get(spark.sparkContext.hadoopConfiguration)) and asserts at least one
file path endsWith ".lance" (or matches the expected Lance base-file suffix) to
ensure a real .lance artifact was written; use the existing tablePath variable
and place the check in both spots mentioned so the test fails if no .lance base
file exists after compaction.
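
A minimal sketch of that Hadoop FileSystem scan, assuming the test's `spark`
session and `tablePath` are in scope; resolving the filesystem from the path
(rather than FileSystem.get on the default FS) keeps it correct for
non-default filesystems:

  import org.apache.hadoop.fs.{FileSystem, Path}

  private def hasLanceBaseFile(tablePath: String): Boolean = {
    val fs: FileSystem = new Path(tablePath)
      .getFileSystem(spark.sparkContext.hadoopConfiguration)
    val files = fs.listFiles(new Path(tablePath), true) // recursive listing
    var found = false
    while (!found && files.hasNext) {
      found = files.next().getPath.getName.endsWith(".lance")
    }
    found
  }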

Comment on lines +358 to +365
// Lance materializes the `reference` struct as non-null with all-null leaves for
// INLINE rows (vs. a null struct on Parquet). `type` is the canonical INLINE
// discriminator per RFC-100; tolerate either shape and just check the leaves.
val refIdx = blob.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE)
if (!blob.isNullAt(refIdx)) {
  val ref = blob.getStruct(refIdx)
  assert(ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH)))
}

⚠️ Potential issue | 🟡 Minor

Strengthen INLINE Lance reference-shape validation.

Line 364 checks only external_path. A partially populated reference struct (offset/length/managed) would still pass. Please assert all reference leaves are null (or expected defaults) when type = INLINE.

🔍 Proposed assertion extension
         if (!blob.isNullAt(refIdx)) {
           val ref = blob.getStruct(refIdx)
           assert(ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH)))
+          assert(ref.isNullAt(ref.fieldIndex("offset")))
+          assert(ref.isNullAt(ref.fieldIndex("length")))
+          assert(ref.isNullAt(ref.fieldIndex("managed")))
         }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Lance materializes the `reference` struct as non-null with all-null leaves for
// INLINE rows (vs. a null struct on Parquet). `type` is the canonical INLINE
// discriminator per RFC-100; tolerate either shape and just check the leaves.
val refIdx = blob.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE)
if (!blob.isNullAt(refIdx)) {
  val ref = blob.getStruct(refIdx)
  assert(ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH)))
}

// Lance materializes the `reference` struct as non-null with all-null leaves for
// INLINE rows (vs. a null struct on Parquet). `type` is the canonical INLINE
// discriminator per RFC-100; tolerate either shape and just check the leaves.
val refIdx = blob.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE)
if (!blob.isNullAt(refIdx)) {
  val ref = blob.getStruct(refIdx)
  assert(ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH)))
  assert(ref.isNullAt(ref.fieldIndex("offset")))
  assert(ref.isNullAt(ref.fieldIndex("length")))
  assert(ref.isNullAt(ref.fieldIndex("managed")))
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`
around lines 358 - 365, The test currently only asserts EXTERNAL_REFERENCE_PATH
is null for an INLINE reference; tighten it by verifying all leaves of the
reference struct are the expected null/default values: after obtaining ref from
blob.getStruct(refIdx) assert
ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_PATH)),
ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_OFFSET)), and
ref.isNullAt(ref.fieldIndex(HoodieSchema.Blob.EXTERNAL_REFERENCE_LENGTH)), and
assert the managed flag (HoodieSchema.Blob.EXTERNAL_REFERENCE_MANAGED) is the
expected default (e.g., false or null as per schema) to ensure no partially
populated INLINE reference passes.

Comment on lines +397 to +402
test("Test Query Log Only MOR Table With BLOB OUT_OF_LINE column triggers compaction (Lance)") {
assume(System.getProperty("lance.skip.tests") != "true",
"Lance tests disabled via -Dlance.skip.tests=true")
// Lance writer has no BLOB handling today (RFC-100 Phase 2). Expected to fail
// until support lands in HoodieSparkLanceWriter; this test pins the gap.


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Verify whether CI/build defaults force skipping Lance tests.
rg -n -C2 --iglob '*pom*.xml' --iglob '*.gradle*' --iglob '*.yml' --iglob '*.yaml' \
  'lance\.skip\.tests|Dlance\.skip\.tests|skip.*lance|lance.*skip'

# Inspect all usages to confirm intended gating semantics repo-wide.
rg -n -C2 --iglob '*.scala' \
  'System\.getProperty\("lance\.skip\.tests"\)|assume\(.*lance\.skip\.tests'

Repository: hudi-agent/hudi

Length of output: 5814


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get the full test context around line 397-402
sed -n '390,410p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala

# Check if test has any `@Ignore`, `@Pending`, or other skip annotations
rg -B5 'Test Query Log Only MOR Table With BLOB OUT_OF_LINE' \
  hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala

# Look for test implementation to see if there's failure handling
sed -n '397,460p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala

Repository: hudi-agent/hudi

Length of output: 4505


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check for assertThrows, expectedException or try-catch patterns in the OUT_OF_LINE test
sed -n '397,500p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | \
  grep -i "assertthrows\|expected\|exception\|try\|catch"

# Check if there's a non-Lance variant to compare the expected behavior
grep -n "Test Query Log Only MOR Table With BLOB OUT_OF_LINE" hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala

# Verify test count - how many similar Lance gated tests exist
rg -c 'assume\(System\.getProperty\("lance\.skip\.tests"\)' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala

Repository: hudi-agent/hudi

Length of output: 416


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Get the complete test implementation to see where it ends and check for failure handling
sed -n '397,550p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | tail -60

# Also check the non-Lance version (line 161) to see if it has assertions or different behavior
sed -n '161,280p' hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | grep -A120 'test("Test Query Log Only MOR Table With BLOB OUT_OF_LINE' | head -80

Repository: hudi-agent/hudi

Length of output: 6345


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Check if there are any `@Ignore` or `@Pending` annotations on Lance tests in this file
rg -B3 'assume\(System\.getProperty\("lance\.skip\.tests"\)' \
  hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | \
  grep -E '@Ignore|@Pending|@Disabled' || echo "No @Ignore/@Pending/@Disabled found"

# Check for similar pattern in other test files to see how other "expected to fail" tests are handled
rg -B2 'Expected to fail' hudi-spark-datasource/hudi-spark/src/test/scala/ | head -30

Repository: hudi-agent/hudi

Length of output: 1737


Test runs by default despite documented expected failure - needs opt-in gating or quarantine.

The opt-out gate at line 398 uses != "true", but pom.xml defaults lance.skip.tests to false, causing the test to run. Comments at lines 400–401 state the test is expected to fail because Lance lacks BLOB handling (RFC-100 Phase 2). The test executes full assertions and operations with no failure-handling mechanism (@Ignore, try-catch, assertThrows), so it will deterministically fail if Lance doesn't support BLOBs, destabilizing default test runs. Switch to opt-in gating behind a dedicated enable property (simply inverting lance.skip.tests to == "true" would turn a skip flag into an enable flag) or add @Ignore until support lands.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala`
around lines 397 - 402, The test "Test Query Log Only MOR Table With BLOB
OUT_OF_LINE column triggers compaction (Lance)" currently runs by default
because the assume check uses System.getProperty("lance.skip.tests") != "true"
while the property defaults to "false"; change the gating to opt-in behind a
dedicated enable property (e.g., System.getProperty("lance.enable.blob.tests")
== "true", mirroring the VARIANT suggestion below) so the test only runs when
explicitly enabled, or alternatively mark the test with `@Ignore` (or wrap
assertions in an assertThrows/try-catch) until HoodieSparkLanceWriter adds BLOB
support to avoid failing default test runs.
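
A sketch of that opt-in guard, where `lance.enable.blob.tests` is a
hypothetical property name chosen to mirror the VARIANT suggestion below:

  // Hypothetical opt-in flag; the test is skipped unless explicitly enabled.
  assume(System.getProperty("lance.enable.blob.tests") == "true",
    "Experimental: enable with -Dlance.enable.blob.tests=true")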

Comment on lines +137 to +144
| tblproperties (
|   primaryKey = 'id',
|   type = 'mor',
|   preCombineField = 'ts',
|   hoodie.index.type = 'INMEMORY',
|   hoodie.compact.inline = 'true',
|   hoodie.clean.commits.retained = '1'
| )

⚠️ Potential issue | 🟡 Minor

Pin compaction trigger configs to avoid commit-count flakiness.

These tests assert exact behavior at commits 3/4/5/6, but the threshold is implicit. Please set it explicitly in table properties so defaults/config drift cannot break the assertions.

Suggested fix
            |  hoodie.compact.inline = 'true',
+           |  hoodie.compact.inline.max.delta.commits = '5',
            |  hoodie.clean.commits.retained = '1'
            |  hoodie.compact.inline = 'true',
+           |  hoodie.compact.inline.max.delta.commits = '5',
            |  hoodie.clean.commits.retained = '1',
            |  'hoodie.table.base.file.format' = 'LANCE'

Also applies to: 255-262

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala`
around lines 137 - 144, The table DDL in TestVariantDataType sets
hoodie.compact.inline='true' but leaves the compaction trigger threshold
implicit, causing commit-count flakiness; add an explicit compaction trigger
like hoodie.compact.inline.max.delta.commits='<expected_count>' to the
tblproperties in the shown DDL block (and the other occurrence around lines
255-262) so the test’s expected commit boundaries (the commits asserted at
3/4/5/6) are stable; update the table properties in the TestVariantDataType test
cases to include this property with the threshold the test expects.
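
Consolidated, the first DDL block would then read as follows; the '5' threshold
is the value proposed in the diff above, and the right number depends on the
commit sequence the test actually asserts:

  | tblproperties (
  |   primaryKey = 'id',
  |   type = 'mor',
  |   preCombineField = 'ts',
  |   hoodie.index.type = 'INMEMORY',
  |   hoodie.compact.inline = 'true',
  |   hoodie.compact.inline.max.delta.commits = '5',
  |   hoodie.clean.commits.retained = '1'
  | )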

Comment on lines +239 to +243
assume(System.getProperty("lance.skip.tests") != "true",
  "Lance tests disabled via -Dlance.skip.tests=true")
// Lance writer has no VARIANT handling today (RFC-100 Phase 2). Expected to fail
// until support lands in HoodieSparkLanceWriter; this test pins the gap.


⚠️ Potential issue | 🟠 Major

Make the LANCE VARIANT test semantics explicit (currently “expected to fail” but asserted as pass-path).

The comment says this case is expected to fail until support lands, but the test runs normal success assertions unless a manual JVM flag is set. That can make default test runs unstable.

Suggested fix (opt-in while feature is incomplete)
-    assume(System.getProperty("lance.skip.tests") != "true",
-      "Lance tests disabled via -Dlance.skip.tests=true")
-    // Lance writer has no VARIANT handling today (RFC-100 Phase 2). Expected to fail
-    // until support lands in HoodieSparkLanceWriter; this test pins the gap.
+    assume(System.getProperty("lance.enable.variant.tests") == "true",
+      "Experimental: enable with -Dlance.enable.variant.tests=true")
+    // Enable by opt-in until HoodieSparkLanceWriter fully supports VARIANT.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
assume(System.getProperty("lance.skip.tests") != "true",
  "Lance tests disabled via -Dlance.skip.tests=true")
// Lance writer has no VARIANT handling today (RFC-100 Phase 2). Expected to fail
// until support lands in HoodieSparkLanceWriter; this test pins the gap.

assume(System.getProperty("lance.enable.variant.tests") == "true",
  "Experimental: enable with -Dlance.enable.variant.tests=true")
// Enable by opt-in until HoodieSparkLanceWriter fully supports VARIANT.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala`
around lines 239 - 243, The test currently runs as a normal pass-path but
comments say VARIANT support is missing; change the `assume(...)` guard in
TestVariantDataType so VARIANT tests are opt-in instead of running by default:
replace the existing `assume(System.getProperty("lance.skip.tests") != "true",
...)` check with an explicit opt-in like
`assume(System.getProperty("lance.enable.variant") == "true", "LANCE VARIANT
tests disabled unless -Dlance.enable.variant=true")` (or equivalent logic) so
the test is skipped by default until Lance VARIANT support is implemented.

