feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists #18497
Translate the Hudi VECTOR logical-type metadata (`hudi_type = "VECTOR(dim[,elem])"`)
into the lance-spark metadata key `arrow.fixed-size-list.size` before calling
`LanceArrowUtils.toArrowSchema`, so the Lance writer emits a native Arrow
FixedSizeList<Float32|Float64, dim> (Lance's vector column encoding) instead
of a plain variable-length list. No change needed at `LanceFileWriter.open(...)`;
the encoding is driven by the Arrow schema itself.
- New private helper `enrichSparkSchemaForLanceVectors` in `HoodieSparkLanceWriter`
reuses `VectorConversionUtils.detectVectorColumnsFromMetadata` to find VECTOR
fields and attaches the Lance metadata key; non-vector fields pass through
unchanged.
- Fails fast with `HoodieNotSupportedException` for non-ArrayType or non-
Float/Double element types (matches lance-spark's `shouldBeFixedSizeList`).
- Tests in `TestLanceDataSource` (COW + MOR):
- `testFloatVectorRoundTrip`
- `testDoubleVectorRoundTrip`
- `testMultipleVectorColumns`
Each opens the written `.lance` file via `LanceFileReader` and asserts the
field is `ArrowType.FixedSizeList` with the expected `listSize` — the direct
regression guard that fails pre-fix and passes post-fix.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
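As an illustration of the translation step described in the commit message above, here is a minimal sketch of parsing a `VECTOR(dim[,elem])` descriptor into the dimension that would be stored under `arrow.fixed-size-list.size`. The class `VectorDescriptorSketch` and its members are hypothetical and are not Hudi's parser. It assumes that an omitted element type means FLOAT, and it uses `UnsupportedOperationException` as a stand-in for `HoodieNotSupportedException`.

```java
import java.util.Locale;

/** Hypothetical sketch; not Hudi's actual descriptor parser. */
final class VectorDescriptorSketch {
  /** Element types the Lance write path accepts, per the commit message. */
  enum Elem { FLOAT, DOUBLE }

  final int dim;
  final Elem elem;

  VectorDescriptorSketch(int dim, Elem elem) {
    this.dim = dim;
    this.elem = elem;
  }

  /** Parses "VECTOR(dim)" or "VECTOR(dim,ELEM)"; a missing ELEM is assumed to mean FLOAT. */
  static VectorDescriptorSketch parse(String hudiType) {
    String s = hudiType.trim();
    if (!s.startsWith("VECTOR(") || !s.endsWith(")")) {
      throw new IllegalArgumentException("Not a VECTOR descriptor: " + hudiType);
    }
    String[] parts = s.substring("VECTOR(".length(), s.length() - 1).split(",");
    if (parts.length > 2) {
      throw new IllegalArgumentException("Too many arguments: " + hudiType);
    }
    int dim = Integer.parseInt(parts[0].trim());
    if (dim <= 0) {
      throw new IllegalArgumentException("Dimension must be positive: " + hudiType);
    }
    Elem elem = Elem.FLOAT;
    if (parts.length == 2) {
      String name = parts[1].trim().toUpperCase(Locale.ROOT);
      try {
        elem = Elem.valueOf(name);
      } catch (IllegalArgumentException e) {
        // Stand-in for the fail-fast HoodieNotSupportedException in the real writer.
        throw new UnsupportedOperationException("Unsupported VECTOR element type: " + name);
      }
    }
    return new VectorDescriptorSketch(dim, elem);
  }

  /** Value that would be attached under the arrow.fixed-size-list.size metadata key. */
  long fixedSizeListSize() {
    return dim;
  }
}
```

For example, `VectorDescriptorSketch.parse("VECTOR(64,DOUBLE)")` yields a dimension of 64 with element type DOUBLE, while `parse("VECTOR(8,INT8)")` fails fast.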
Companion to the Lance writer's native FixedSizeList encoding: on read, rehydrate the Hudi `hudi_type = VECTOR(...)` Spark metadata that `LanceArrowUtils.fromArrowSchema` drops, so the read schema matches the Parquet path.
Gate the Parquet-only ArrayType→BinaryType vector rewrite in HoodieFileGroupReaderBasedFileFormat on format == PARQUET; Lance returns vectors natively as ArrayType so the rewrite would trigger a spurious cast and break the read.
- VectorConversionUtils.restoreVectorMetadataFromArrowSchema walks the Arrow schema and re-attaches VECTOR(dim[,DOUBLE]) for FixedSizeList<Float32|Float64, dim> fields.
- HoodieSparkLanceReader.getSchema and SparkLanceReaderBase.read now call it so downstream VECTOR-aware code sees the same schema as on Parquet.
- TestLanceDataSource: assert hudi_type metadata is restored on read for float, double, and multi-vector round-trips.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mirrors the Parquet writer: emit the comma-separated `colName:VECTOR(dim[,elemType])` descriptor list under the existing `hoodie.vector.columns` key in the Lance file-footer key-value metadata.
Reader still derives VECTOR identity from the Arrow FixedSizeList type today; this footer entry is insurance for future descriptor fields the Arrow type cannot express (quantization tags, distance metrics, etc.) and keeps Lance files symmetric with Parquet files.
- HoodieBaseLanceWriter: new protected `additionalSchemaMetadata()` hook invoked during close(), so subclasses can contribute footer KV entries alongside bloom-filter metadata.
- HoodieSparkLanceWriter: override `additionalSchemaMetadata()` to emit `hoodie.vector.columns` when the Spark schema has any VECTOR column.
- VectorConversionUtils: add `buildVectorColumnsMetadataValue(StructType)` matching the Parquet-path helper's output format.
- TestLanceDataSource: assert footer carries the expected descriptor list for float, double, and multi-vector round-trips.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
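The footer value described in this commit can be sketched as follows. This is a simplified model, not the actual `buildVectorColumnsMetadataValue` implementation: the `Field` class stands in for a Spark `StructField`, and the descriptor strings are assumed to be already formatted. Walking the fields in schema order is one way to keep the emitted value deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of building the hoodie.vector.columns footer value. */
final class VectorFooterSketch {
  /** Minimal stand-in for a schema field; vectorDescriptor is null for non-VECTOR fields. */
  static final class Field {
    final String name;
    final String vectorDescriptor;

    Field(String name, String vectorDescriptor) {
      this.name = name;
      this.vectorDescriptor = vectorDescriptor;
    }
  }

  /**
   * Joins "name:descriptor" entries with commas, in field order.
   * Returns empty when the schema has no VECTOR column, so no footer entry is written.
   */
  static Optional<String> buildFooterValue(List<Field> fields) {
    List<String> entries = new ArrayList<>();
    for (Field f : fields) {
      if (f.vectorDescriptor != null) {
        entries.add(f.name + ":" + f.vectorDescriptor);
      }
    }
    return entries.isEmpty() ? Optional.empty() : Optional.of(String.join(",", entries));
  }
}
```

Note that a descriptor such as `VECTOR(64,DOUBLE)` itself contains a comma, so a reader of this value cannot split on commas naively.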
|
Writing VECTOR columns as native Lance |
|
Lance artifact coordinates are out of date: |
|
The
|
|
The three added tests (
|
|
Minor code-quality nits
|
SparkFileFormatInternalRowReaderContext.getFileRecordIterator had a second, unconditional rewrite of VECTOR columns from ArrayType to BinaryType (the earlier withVectorRewrite gate in HoodieFileGroupReaderBasedFileFormat only covered the non-FileGroupReader branch). On the MOR / FileGroupReader path this caused Lance reads to fail with scala.MatchError: ArrayType(FloatType,true) in Cast.castToBinaryCode, because Lance returns vectors natively as ArrayType while the caller-supplied schema had been rewritten to BinaryType; the generated UnsafeProjection then injected an unsupported Cast(ArrayType -> BinaryType).
Gate the detection + rewrite on the file format: skip it for .lance base files. Hudi log files are always parquet-encoded so they still take the Parquet path.
Fixes 14 TestLanceDataSource vector errors (COW + MOR) observed in spark3.5 / spark3.4 CI, including the spark3.4 part2 6h timeout that was the same failure retrying.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
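The gating rule in this commit can be summarized in a small sketch. The enum and method below are hypothetical and only model the decision; the real code operates on Spark schemas inside `SparkFileFormatInternalRowReaderContext`.

```java
/** Hypothetical sketch of the per-file decision; not the actual Hudi code. */
final class VectorRewriteGateSketch {
  /** Stand-in for the base file formats discussed in this PR. */
  enum BaseFileFormat { PARQUET, LANCE }

  /**
   * Returns true when VECTOR columns in the requested schema should be rewritten
   * from ArrayType to BinaryType before reading the given file.
   */
  static boolean needsBinaryRewrite(BaseFileFormat baseFormat, boolean isLogFile) {
    // Log files are Parquet-encoded regardless of the table's base format,
    // so they always take the Parquet path.
    if (isLogFile) {
      return true;
    }
    // Lance base files return vectors natively as ArrayType: skip the rewrite.
    return baseFormat == BaseFileFormat.PARQUET;
  }
}
```

Under this model, a Lance base file skips the rewrite, while a log file in the same table still gets it.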
On Scala 2.13, Row.getAs[Seq[Float]] fails at runtime with ClassCastException: scala.collection.mutable.ArraySeq$ofRef cannot be cast to scala.collection.immutable.Seq, because Seq in Scala 2.13 defaults to immutable.Seq while Spark holds array columns as mutable.ArraySeq internally.
Row.getSeq[T] is declared as scala.collection.Seq[T] (general), so it works on both 2.12 (where Seq = scala.collection.Seq) and 2.13 (where Seq = scala.collection.immutable.Seq). Same runtime object, no cast.
Fixes the 14 TestLanceDataSource errors on java17 CI (scala-2.13, spark3.5 / spark4.0). The earlier VECTOR->BinaryType rewrite fix resolved the scala.MatchError in the read path; this change resolves the subsequent 2.13-only test-side ClassCastException.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… size check
- SparkFileFormatInternalRowReaderContext: use tableConfig.getBaseFileFormat instead of filename extension sniff to detect Lance base files
- VectorConversionUtils.restoreVectorMetadataFromArrowSchema: remove confusing arrowFields.size != sparkFields.length defensive guard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Why does this matter though, since we didn't change any Lance dependency version in this PR? I have a separate Lance bump PR, #18498, to isolate this.
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Style & Readability Review — one small inconsistency to clean up: the new additionalSchemaMetadata method uses a fully-qualified java.util.Map return type even though Map is imported in the same file.
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Nice work on this PR! The round-trip design is clean — enriching the Spark schema with arrow.fixed-size-list.size on write and rebuilding the VECTOR descriptor from the Arrow type on read mirrors the Parquet path nicely, and the footer-key unification (VECTOR_COLUMNS_METADATA_KEY) keeps both formats in lockstep. One correctness concern around deterministic ordering of the footer value is worth double-checking in the inline comment. Once that is addressed, this should be ready for a Hudi committer or PMC member to take it from here.
// Only Parquet needs the BinaryType rewrite; other formats (Lance) return ArrayType natively.
if (hoodieFileFormat != HoodieFileFormat.PARQUET) {
  (schema, Map.empty[Int, HoodieSchema.Vector])
} else {
🤖 Gating on hoodieFileFormat != HoodieFileFormat.PARQUET disables the vector→Binary rewrite for the whole table, but MOR log files are always Parquet even when the base format is Lance. Could you confirm that log file reads in a Lance MOR table don't flow through withVectorRewrite? If they do, a Lance table with VECTOR columns + log updates would try to read FIXED_LEN_BYTE_ARRAY vectors with an ArrayType schema and fail.
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Let's follow up with MOR support in a separate PR if needed.
  continue;
}
HoodieSchema.Vector.VectorElementType elemType = vec.getVectorElementType();
if (elemType != HoodieSchema.Vector.VectorElementType.FLOAT
🤖 enrichSparkSchemaForLanceVectors throws HoodieNotSupportedException at writer construction for any non-FLOAT/DOUBLE VECTOR. Is there a schema-validation layer upstream that catches this before the writer is created, or will a user with e.g. a VECTOR(N, INT8) column first discover the limitation as a failed write task? If it's the latter, consider surfacing this earlier (at DDL/commit planning) so partial writes don't leave orphan files.
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
I think it is OK to add this defensive check here to throw HoodieNotSupportedException. We should have a follow-up (non-blocker) to add such a check in schema validation as well. @rahil-c to create a GH issue.
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
CodeRabbit Walkthrough: This PR enhances vector metadata handling across Hudi to support Lance file format persistence and recovery. It introduces utilities to serialize/restore vector descriptors, adds vector schema enrichment during Lance writes, restores vector type information during reads, renames a format-agnostic metadata key constant, and implements conditional vector schema rewriting based on file format. Test coverage for Lance vectors is substantially expanded.
Sequence Diagram (CodeRabbit):
sequenceDiagram
actor User
participant SparkSchema as Spark Schema<br/>(with VECTOR metadata)
participant Writer as HoodieSparkLanceWriter
participant Enrich as VectorConversionUtils
participant Arrow as Arrow Schema
participant Lance as Lance File<br/>(with footer metadata)
User->>Writer: Write DataFrame with VECTOR columns
Writer->>Enrich: enrichVectorMetadataInSchema()
Enrich->>Enrich: Detect VECTOR fields<br/>Add dimension to metadata
Enrich-->>Writer: Enriched StructType
Writer->>Arrow: Convert to Arrow schema<br/>(with sizing info)
Arrow->>Lance: Write with metadata
Writer->>Lance: addSchemaMetadata()<br/>(vector columns footer)
Lance-->>User: Persisted file
Sequence Diagram (CodeRabbit):
sequenceDiagram
actor User
participant Lance as Lance File<br/>(with footer metadata)
participant Reader as LanceReader
participant Arrow as Arrow Schema<br/>(FixedSizeList)
participant Restore as VectorConversionUtils
participant SparkSchema as Spark Schema<br/>(restored VECTOR metadata)
User->>Reader: Read Lance file
Reader->>Lance: Load Arrow schema
Lance-->>Arrow: Return schema<br/>(FixedSizeList fields)
Reader->>Arrow: Convert to Spark StructType
Arrow-->>Reader: Basic StructType<br/>(no VECTOR info)
Reader->>Restore: restoreVectorMetadataFromArrowSchema()
Restore->>Restore: Detect FixedSizeList of<br/>Float32/Float64
Restore->>SparkSchema: Attach VECTOR metadata
SparkSchema-->>User: Return enriched schema
CodeRabbit: yihua#49 (review)
@yihua to take a look
 * {@code LanceArrowUtils.fromArrowSchema} drops all field metadata, so without this
 * step VECTOR columns are indistinguishable from plain arrays.
Should we upstream a fix to LanceArrowUtils.fromArrowSchema so we can get rid of this restore logic in the future for simplicity?
Yes, we will need to track this in a GitHub issue and then upstream a fix to lance-spark.
- HoodieSparkLanceWriter.additionalSchemaMetadata: drop redundant java.util.Map FQN (Map is already imported)
- VectorConversionUtils.buildVectorColumnsFooterValue: walk fields in ordinal order rather than iterating the detected HashMap, so the hoodie.vector.columns footer value is stable across JDKs
- VectorConversionUtils & HoodieSchema: use import for LinkedHashMap instead of fully-qualified class name
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st consolidation
- VectorConversionUtils.restoreVectorMetadataFromArrowSchema: only restore VECTOR descriptor metadata for fields listed in the hoodie.vector.columns footer, so non-VECTOR Arrow FixedSizeList<Float/Double> fields cannot be misinterpreted. Drop the anyChanged optimization for simplicity.
- HoodieSchema: add parseVectorColumnNames helper to extract field names from the footer value, handling commas inside descriptor parentheses.
- HoodieSparkLanceWriter.enrichSparkSchemaForLanceVectors: expand javadoc to explain why we auto-attach arrow.fixed-size-list.size from the VECTOR dimension.
- TestLanceDataSource: consolidate testFloatVectorRoundTrip, testDoubleVectorRoundTrip, testVectorMorUpdatePath, testVectorPartitionedTable into testMultipleVectorColumns (two VECTOR columns of different element types + dims, insert then upsert covers MOR merge path). Keep testNullableVectorRoundTrip separate: nullable VECTOR in the upsert/merge path hits a Lance reader bug; tracked as a follow-up. Rename forEachLanceSchema -> validateLanceFileSchema per review suggestion.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
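Extracting column names from the footer value while "handling commas inside descriptor parentheses" can be sketched as below. This is an independent illustration, not the `HoodieSchema.parseVectorColumnNames` source: the class and method names are hypothetical, and it assumes each entry has the form `name:VECTOR(...)`.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch of parsing column names out of the hoodie.vector.columns value. */
final class VectorColumnNamesSketch {
  /** Splits on commas that are outside parentheses and keeps the text before each ':'. */
  static Set<String> parseColumnNames(String footerValue) {
    Set<String> names = new LinkedHashSet<>();
    if (footerValue == null || footerValue.trim().isEmpty()) {
      return names;
    }
    int depth = 0;      // current parenthesis nesting depth
    int entryStart = 0; // index where the current entry begins
    for (int i = 0; i < footerValue.length(); i++) {
      char c = footerValue.charAt(i);
      if (c == '(') {
        depth++;
      } else if (c == ')') {
        depth--;
      } else if (c == ',' && depth == 0) {
        // A top-level comma separates entries; commas inside VECTOR(...) are skipped.
        addName(names, footerValue.substring(entryStart, i));
        entryStart = i + 1;
      }
    }
    addName(names, footerValue.substring(entryStart));
    return names;
  }

  private static void addName(Set<String> names, String entry) {
    int colon = entry.indexOf(':');
    String name = (colon < 0 ? entry : entry.substring(0, colon)).trim();
    if (!name.isEmpty()) {
      names.add(name);
    }
  }
}
```

With this approach, `"emb:VECTOR(128),emb2:VECTOR(64,DOUBLE)"` yields the names `emb` and `emb2`, and a `FixedSizeList` field whose name is not in that set would be left without VECTOR metadata.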
Set<String> vectorColumnNames = HoodieSchema.parseVectorColumnNames(
    customMetadata == null ? null : customMetadata.get(HoodieSchema.VECTOR_COLUMNS_METADATA_KEY));
nit: VECTOR_COLUMNS_METADATA_KEY in the footer is now used. We can remove this usage later.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Describe the issue this Pull Request addresses
Hudi's VECTOR logical type (RFC-99) already round-trips correctly on the Parquet path via the `hoodie.vector.columns` footer metadata + FIXED_LEN_BYTE_ARRAY storage. On the newly-added Lance base-file path, VECTOR columns silently degraded to plain `List<Float>`/`List<Double>` Arrow fields on write and lost their `hudi_type` descriptor on read. This PR wires up the Lance path so VECTOR columns land as native Arrow `FixedSizeList<Float32|Float64, dim>` (Lance's native vector column encoding) and are surfaced with the same `hudi_type = VECTOR(...)` metadata on read as Parquet.

Summary and Changelog
Users writing a Hudi table with a `VECTOR(dim[, elem])` column and `hoodie.table.base.file.format = LANCE` now get Lance's native fixed-size-list vector encoding end-to-end, with the same schema-level metadata they see for Parquet-backed tables.

Changes:
- `HoodieSparkLanceWriter.enrichSparkSchemaForLanceVectors`: translate each `hudi_type = VECTOR(...)` field's descriptor into the lance-spark metadata key `arrow.fixed-size-list.size` (long) before calling `LanceArrowUtils.toArrowSchema`, so lance-spark's `shouldBeFixedSizeList` chooses `FixedSizeListWriter`. FLOAT and DOUBLE elements only; anything else fails fast with `HoodieNotSupportedException`.
- `HoodieSparkLanceWriter.additionalSchemaMetadata` emits the canonical `hoodie.vector.columns` footer entry (delegating to `HoodieSchema.serializeVectorColumnsMetadata`) so readers, including future readers, can identify VECTOR columns without a Hudi schema store.
- `VectorConversionUtils.restoreVectorMetadataFromArrowSchema`: re-attach `hudi_type = VECTOR(...)` Spark metadata onto fields the Lance reader produces, gated on the `hoodie.vector.columns` footer value so non-Hudi-written `FixedSizeList<Float/Double>` fields cannot be misinterpreted. `HoodieSchema.parseVectorColumnNames` is a new helper for parsing the comma-separated footer value.
- `SparkFileFormatInternalRowReaderContext` uses `tableConfig.getBaseFileFormat` (not a filename extension sniff) to decide whether to apply the Parquet-only VECTOR→BinaryType rewrite; `HoodieFileGroupReaderBasedFileFormat.withVectorRewrite` skips the rewrite entirely for non-Parquet base formats.
- `TestLanceDataSource`, parameterized across COW + MOR:
  - `testMultipleVectorColumns` (consolidated): two VECTOR columns (FLOAT + DOUBLE) at different dimensions; insert then upsert exercises the COW rewrite and MOR log-merge paths.
  - `testNullableVectorRoundTrip`: nullable VECTOR with a null row (kept separate; folding null + upsert together triggers a Lance reader bug tracked as a follow-up).
  - `testVectorProjection`: project only the VECTOR column, and the VECTOR column with Hudi meta-fields.
  Each test opens the written `.lance` file via `LanceFileReader` and asserts the physical column is `ArrowType.FixedSizeList` with the expected `listSize`, and that the `hoodie.vector.columns` footer entry matches the expected descriptor list.

Impact
- Tables with `hoodie.table.base.file.format = LANCE` and a `VECTOR` column now produce Lance files where that column is a native fixed-size list (previously a plain variable-length list). This is the representation Lance's vector-search features operate on, so future vector-search integrations can consume these files directly.
- `HoodieSparkLanceWriter` gains a private helper and overrides the existing `additionalSchemaMetadata` hook on `HoodieBaseLanceWriter`.

Risk Level
low

Only the Lance base-file write/read path is affected. Parquet and other base formats keep their existing behavior (explicitly gated by `HoodieFileFormat` checks at every rewrite site). The VECTOR-column changes are local to a narrow conversion layer between Hudi and lance-spark. Covered by the new `TestLanceDataSource` cases (COW + MOR, multi-column, upsert, nullable, projection).

Documentation Update
none — Lance base-file support is itself still new. The user-facing VECTOR descriptor on Lance is identical to Parquet (already documented in RFC-99).
Contributor's checklist