feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists by rahil-c · Pull Request #18497 · apache/hudi

rahil-c · 2026-04-13T16:04:17Z

Describe the issue this Pull Request addresses

Hudi's VECTOR logical type (RFC-99) already round-trips correctly on the Parquet path via the hoodie.vector.columns footer metadata + FIXED_LEN_BYTE_ARRAY storage. On the newly-added Lance base-file path, VECTOR columns silently degraded to plain List<Float> / List<Double> Arrow fields on write and lost their hudi_type descriptor on read. This PR wires up the Lance path so VECTOR columns land as native Arrow FixedSizeList<Float32|Float64, dim> (Lance's native vector column encoding) and are surfaced with the same hudi_type = VECTOR(...) metadata on read as Parquet.

Summary and Changelog

Users writing a Hudi table with a VECTOR(dim[, elem]) column and hoodie.table.base.file.format = LANCE now get Lance's native fixed-size-list vector encoding end-to-end, with the same schema-level metadata they see for Parquet-backed tables.

Changes:

Writer (HoodieSparkLanceWriter.enrichSparkSchemaForLanceVectors): translate each hudi_type = VECTOR(...) field's descriptor into the lance-spark metadata key arrow.fixed-size-list.size (long) before calling LanceArrowUtils.toArrowSchema, so lance-spark's shouldBeFixedSizeList chooses FixedSizeListWriter. FLOAT and DOUBLE elements only; anything else fails fast with HoodieNotSupportedException.
Writer footer: HoodieSparkLanceWriter.additionalSchemaMetadata emits the canonical hoodie.vector.columns footer entry (delegating to HoodieSchema.serializeVectorColumnsMetadata) so readers — including future readers — can identify VECTOR columns without a Hudi schema store.
Reader (VectorConversionUtils.restoreVectorMetadataFromArrowSchema): re-attach hudi_type = VECTOR(...) Spark metadata onto fields the Lance reader produces, gated on the hoodie.vector.columns footer value so non-Hudi-written FixedSizeList<Float/Double> fields cannot be misinterpreted. HoodieSchema.parseVectorColumnNames is a new helper for parsing the comma-separated footer value.
File-format plumbing: SparkFileFormatInternalRowReaderContext uses tableConfig.getBaseFileFormat (not a filename extension sniff) to decide whether to apply the Parquet-only VECTOR→BinaryType rewrite; HoodieFileGroupReaderBasedFileFormat.withVectorRewrite skips the rewrite entirely for non-Parquet base formats.
Tests (TestLanceDataSource, parameterized across COW + MOR):
- testMultipleVectorColumns — consolidated: two VECTOR columns (FLOAT + DOUBLE) at different dimensions; insert then upsert exercises the COW rewrite and MOR log-merge paths.
- testNullableVectorRoundTrip — nullable VECTOR with a null row (kept separate; folding null + upsert together triggers a Lance reader bug tracked as a follow-up).
- testVectorProjection — project only the VECTOR column, and the VECTOR column with Hudi meta-fields.
- Each test opens the written .lance file via LanceFileReader and asserts the physical column is ArrowType.FixedSizeList with the expected listSize, and that the hoodie.vector.columns footer entry matches the expected descriptor list.

Impact

User-facing: Hudi tables with hoodie.table.base.file.format = LANCE and a VECTOR column now produce Lance files where that column is a native fixed-size list (previously a plain variable-length list). This is the representation Lance's vector-search features operate on, so future vector-search integrations can consume these files directly.
No change to the Parquet path.
No public Java/Scala API signature changes. HoodieSparkLanceWriter gains a private helper and overrides the existing additionalSchemaMetadata hook on HoodieBaseLanceWriter.

Risk Level

low

Only the Lance base-file write/read path is affected. Parquet and other base formats keep their existing behavior (explicitly gated by HoodieFileFormat checks at every rewrite site). The VECTOR-column changes are local to a narrow conversion layer between Hudi and lance-spark. Covered by the new TestLanceDataSource cases (COW + MOR, multi-column, upsert, nullable, projection).

Documentation Update

none — Lance base-file support is itself still new. The user-facing VECTOR descriptor on Lance is identical to Parquet (already documented in RFC-99).

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

Translate the Hudi VECTOR logical-type metadata (`hudi_type = "VECTOR(dim[,elem])"`) into the lance-spark metadata key `arrow.fixed-size-list.size` before calling `LanceArrowUtils.toArrowSchema`, so the Lance writer emits a native Arrow FixedSizeList<Float32|Float64, dim> (Lance's vector column encoding) instead of a plain variable-length list. No change needed at `LanceFileWriter.open(...)`; the encoding is driven by the Arrow schema itself.

- New private helper `enrichSparkSchemaForLanceVectors` in `HoodieSparkLanceWriter` reuses `VectorConversionUtils.detectVectorColumnsFromMetadata` to find VECTOR fields and attaches the Lance metadata key; non-vector fields pass through unchanged.
- Fails fast with `HoodieNotSupportedException` for non-ArrayType or non-Float/Double element types (matches lance-spark's `shouldBeFixedSizeList`).
- Tests in `TestLanceDataSource` (COW + MOR):
  - `testFloatVectorRoundTrip`
  - `testDoubleVectorRoundTrip`
  - `testMultipleVectorColumns`

Each opens the written `.lance` file via `LanceFileReader` and asserts the field is `ArrowType.FixedSizeList` with the expected `listSize`: the direct regression guard that fails pre-fix and passes post-fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
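The descriptor-to-metadata translation described in this commit can be illustrated with a minimal, dependency-free sketch. It is not the code in `HoodieSparkLanceWriter`: the class name, the regex grammar, the assumption that an omitted element type means FLOAT, and the use of `UnsupportedOperationException` as a stand-in for `HoodieNotSupportedException` are all hypothetical. Only the key `arrow.fixed-size-list.size` and the FLOAT/DOUBLE restriction come from the PR text.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VectorDescriptorSketch {
  // Hypothetical descriptor grammar: VECTOR(dim) or VECTOR(dim, ELEM).
  private static final Pattern VECTOR =
      Pattern.compile("VECTOR\\(\\s*(\\d+)\\s*(?:,\\s*([A-Za-z0-9_]+)\\s*)?\\)");

  /** Returns the extra field metadata a writer would attach for one VECTOR field. */
  static Map<String, Long> fixedSizeListMetadata(String hudiType) {
    Matcher m = VECTOR.matcher(hudiType.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("Not a VECTOR descriptor: " + hudiType);
    }
    long dim = Long.parseLong(m.group(1));
    // Assumption: an omitted element type defaults to FLOAT.
    String elem = m.group(2) == null ? "FLOAT" : m.group(2).toUpperCase(Locale.ROOT);
    if (!elem.equals("FLOAT") && !elem.equals("DOUBLE")) {
      // Stand-in for HoodieNotSupportedException: fail fast on other element types.
      throw new UnsupportedOperationException("Unsupported VECTOR element type: " + elem);
    }
    Map<String, Long> metadata = new LinkedHashMap<>();
    metadata.put("arrow.fixed-size-list.size", dim); // key named in this PR
    return metadata;
  }

  public static void main(String[] args) {
    System.out.println(fixedSizeListMetadata("VECTOR(128)"));
    System.out.println(fixedSizeListMetadata("VECTOR(3, DOUBLE)"));
  }
}
```

In the real writer the resulting entry is merged into the Spark field's metadata before `LanceArrowUtils.toArrowSchema` runs; the sketch only shows the parsing and the fail-fast check.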

Companion to the Lance writer's native FixedSizeList encoding: on read, rehydrate the Hudi `hudi_type = VECTOR(...)` Spark metadata that `LanceArrowUtils.fromArrowSchema` drops, so the read schema matches the Parquet path. Gate the Parquet-only ArrayType→BinaryType vector rewrite in HoodieFileGroupReaderBasedFileFormat on format == PARQUET; Lance returns vectors natively as ArrayType so the rewrite would trigger a spurious cast and break the read.

- VectorConversionUtils.restoreVectorMetadataFromArrowSchema walks the Arrow schema and re-attaches VECTOR(dim[,DOUBLE]) for FixedSizeList<Float32|Float64, dim> fields.
- HoodieSparkLanceReader.getSchema and SparkLanceReaderBase.read now call it so downstream VECTOR-aware code sees the same schema as on Parquet.
- TestLanceDataSource: assert hudi_type metadata is restored on read for float, double, and multi-vector round-trips.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Mirrors the Parquet writer: emit the comma-separated `colName:VECTOR(dim[,elemType])` descriptor list under the existing `hoodie.vector.columns` key in the Lance file-footer key-value metadata. Reader still derives VECTOR identity from the Arrow FixedSizeList type today; this footer entry is insurance for future descriptor fields the Arrow type cannot express (quantization tags, distance metrics, etc.) and keeps Lance files symmetric with Parquet files.

- HoodieBaseLanceWriter: new protected `additionalSchemaMetadata()` hook invoked during close(), so subclasses can contribute footer KV entries alongside bloom-filter metadata.
- HoodieSparkLanceWriter: override `additionalSchemaMetadata()` to emit `hoodie.vector.columns` when the Spark schema has any VECTOR column.
- VectorConversionUtils: add `buildVectorColumnsMetadataValue(StructType)` matching the Parquet-path helper's output format.
- TestLanceDataSource: assert footer carries the expected descriptor list for float, double, and multi-vector round-trips.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

wombatu-kun · 2026-04-14T07:21:00Z

Writing VECTOR columns as native Lance FixedSizeList is the right direction, and the writer/reader symmetry plus the forward-compat footer are nice design choices. I've been working on the same problem on a parallel branch (different strategy: plain List<Float> + BinaryType rewrite reversal), and while comparing the two approaches I spotted a handful of things worth addressing before this lands.

wombatu-kun · 2026-04-14T07:21:20Z

Lance artifact coordinates are out of date: com.lancedb → org.lance, 0.0.15 → 0.4.0

wombatu-kun · 2026-04-14T07:22:57Z

The hoodieFileFormat != PARQUET early-return in withVectorRewrite is clean. Please double-check that every call site that previously may have hit the rewrite now correctly skips it for Lance:

buildReaderWithPartitionValues (all three invocations: requiredSchema, outputSchema, requestedSchema).
Any other places in the file where detectVectorColumns / replaceVectorFieldsWithBinary are called independently of the helper.

wombatu-kun · 2026-04-14T07:23:10Z

The three added tests (testFloatVectorRoundTrip, testDoubleVectorRoundTrip, testMultipleVectorColumns) are good writer/reader guards but cover only the trivial insert + full-table-read path. Recommended additions:

Nullable vector column (null row values, null whole struct).
Partitioned table.
MOR log merging (write → update → read).
Schema evolution: add VECTOR column to an existing Lance table and read old + new rows.
Clustering: verify clustered output Lance files also carry native FixedSizeList + footer.
Projection: read only the vector column; read vector column alongside metadata columns.
Time travel / incremental query.

wombatu-kun · 2026-04-14T07:26:30Z

Minor code-quality nits

HoodieSparkLanceWriter#enrichSparkSchemaForLanceVectors: the local DataType dt = field.dataType() is assigned before the ArrayType check, then unused once the cast is done. Can collapse to if (!(field.dataType() instanceof ArrayType)).
VectorConversionUtils#buildVectorColumnsMetadataValue duplicates the format produced by HoodieSchema#buildVectorColumnsMetadataValue for Avro schemas. Consider adding a Javadoc cross-reference and asserting the two produce identical strings for equivalent schemas (could even delegate).
VectorConversionUtils#restoreVectorMetadataFromArrowSchema walks only top-level fields. If a nested struct ever carries a VECTOR child (unlikely today but possible), it would be missed. Worth a Javadoc note: "Top-level VECTORs only; nested struct children are not recursed into."
HoodieBaseLanceWriter#close: the new additionalSchemaMetadata hook is called after bloom filter metadata — fine — but the nested if (writer != null) check is redundant (already inside the outer writer != null for bloom filter); can collapse.
buildVectorColumnsMetadataValue returns "" for schemas without vectors, and the override early-returns Collections.emptyMap() in that case — so no footer entry is emitted. Correct, just worth a comment explaining that the hook is called unconditionally but is a no-op when there are no vectors (otherwise a future reader maintaining this code might wonder why a non-VECTOR Lance file has no hoodie.vector.columns).

SparkFileFormatInternalRowReaderContext.getFileRecordIterator had a second, unconditional rewrite of VECTOR columns from ArrayType to BinaryType (the earlier withVectorRewrite gate in HoodieFileGroupReaderBasedFileFormat only covered the non-FileGroupReader branch). On the MOR / FileGroupReader path this caused Lance reads to fail with scala.MatchError: ArrayType(FloatType,true) in Cast.castToBinaryCode, because Lance returns vectors natively as ArrayType while the caller-supplied schema had been rewritten to BinaryType; the generated UnsafeProjection then injected an unsupported Cast(ArrayType -> BinaryType).

Gate the detection + rewrite on the file format: skip it for .lance base files. Hudi log files are always parquet-encoded so they still take the Parquet path.

Fixes 14 TestLanceDataSource vector errors (COW + MOR) observed in spark3.5 / spark3.4 CI, including the spark3.4 part2 6h timeout that was the same failure retrying.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
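The gating rule in this commit message boils down to a small predicate. The sketch below is an illustration only: the enum, class, and method names are hypothetical and do not mirror `HoodieFileFormat` or the Scala reader context, and the log-file branch simply encodes the statement above that log files are Parquet-encoded.

```java
public class VectorRewriteGateSketch {
  // Hypothetical stand-in for the base-file format enum.
  enum FileFormat { PARQUET, LANCE }

  /**
   * Decides whether the VECTOR ArrayType -> BinaryType schema rewrite applies to a file.
   * Log files are treated as Parquet-encoded, per the commit message above.
   */
  static boolean needsBinaryRewrite(FileFormat baseFormat, boolean isLogFile) {
    return isLogFile || baseFormat == FileFormat.PARQUET;
  }

  public static void main(String[] args) {
    System.out.println(needsBinaryRewrite(FileFormat.PARQUET, false));
    System.out.println(needsBinaryRewrite(FileFormat.LANCE, false));
    System.out.println(needsBinaryRewrite(FileFormat.LANCE, true));
  }
}
```

Keeping the decision in one predicate is one way to avoid the situation described here, where a second call site applied the rewrite unconditionally.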

On Scala 2.13, Row.getAs[Seq[Float]] fails at runtime with ClassCastException: scala.collection.mutable.ArraySeq$ofRef cannot be cast to scala.collection.immutable.Seq, because Seq in Scala 2.13 defaults to immutable.Seq while Spark holds array columns as mutable.ArraySeq internally. Row.getSeq[T] is declared as scala.collection.Seq[T] (general), so it works on both 2.12 (where Seq = scala.collection.Seq) and 2.13 (where Seq = scala.collection.immutable.Seq). Same runtime object, no cast.

Fixes the 14 TestLanceDataSource errors on java17 CI (scala-2.13, spark3.5 / spark4.0). The earlier VECTOR->BinaryType rewrite fix resolved the scala.MatchError in the read path; this change resolves the subsequent 2.13-only test-side ClassCastException.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… size check

- SparkFileFormatInternalRowReaderContext: use tableConfig.getBaseFileFormat instead of filename extension sniff to detect Lance base files
- VectorConversionUtils.restoreVectorMetadataFromArrowSchema: remove confusing arrowFields.size != sparkFields.length defensive guard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rahil-c · 2026-04-16T14:20:38Z

Lance artifact coordinates are out of date: com.lancedb → org.lance, 0.0.15 → 0.4.0

Why does this matter, though, since we didn't change any Lance dependency version in this PR? I have a separate Lance bump PR, #18498, to isolate this.

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — one small inconsistency to clean up: the new additionalSchemaMetadata method uses a fully-qualified java.util.Map return type even though Map is imported in the same file.

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Nice work on this PR! The round-trip design is clean — enriching the Spark schema with arrow.fixed-size-list.size on write and rebuilding the VECTOR descriptor from the Arrow type on read mirrors the Parquet path nicely, and the footer-key unification (VECTOR_COLUMNS_METADATA_KEY) keeps both formats in lockstep. One correctness concern around deterministic ordering of the footer value is worth double-checking in the inline comment. Once that is addressed, this should be ready for a Hudi committer or PMC member to take it from here.

yihua · 2026-04-18T01:20:44Z

+    // Only Parquet needs the BinaryType rewrite; other formats (Lance) return ArrayType natively.
+    if (hoodieFileFormat != HoodieFileFormat.PARQUET) {
+      (schema, Map.empty[Int, HoodieSchema.Vector])
+    } else {


🤖 Gating on hoodieFileFormat != HoodieFileFormat.PARQUET disables the vector→Binary rewrite for the whole table, but MOR log files are always Parquet even when the base format is Lance. Could you confirm that log file reads in a Lance MOR table don't flow through withVectorRewrite? If they do, a Lance table with VECTOR columns + log updates would try to read FIXED_LEN_BYTE_ARRAY vectors with an ArrayType schema and fail.

(Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.)

Let's follow-up with MOR support in a separate PR if needed.

yihua · 2026-04-18T01:20:44Z

+        continue;
+      }
+      HoodieSchema.Vector.VectorElementType elemType = vec.getVectorElementType();
+      if (elemType != HoodieSchema.Vector.VectorElementType.FLOAT


🤖 enrichSparkSchemaForLanceVectors throws HoodieNotSupportedException at writer construction for any non-FLOAT/DOUBLE VECTOR. Is there a schema-validation layer upstream that catches this before the writer is created, or will a user with e.g. a VECTOR(N, INT8) column first discover the limitation as a failed write task? If it's the latter, consider surfacing this earlier (at DDL/commit planning) so partial writes don't leave orphan files.

(Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.)

I think it is OK to add this defensive check here to throw HoodieNotSupportedException. We should have a follow-up (non-blocker) to add such a check in schema validation as well. @rahil-c to create a GH issue.

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: This PR enhances vector metadata handling across Hudi to support Lance file format persistence and recovery. It introduces utilities to serialize/restore vector descriptors, adds vector schema enrichment during Lance writes, restores vector type information during reads, renames a format-agnostic metadata key constant, and implements conditional vector schema rewriting based on file format. Test coverage for Lance vectors is substantially expanded.

Sequence Diagram (CodeRabbit):

sequenceDiagram
    actor User
    participant SparkSchema as Spark Schema<br/>(with VECTOR metadata)
    participant Writer as HoodieSparkLanceWriter
    participant Enrich as VectorConversionUtils
    participant Arrow as Arrow Schema
    participant Lance as Lance File<br/>(with footer metadata)
    
    User->>Writer: Write DataFrame with VECTOR columns
    Writer->>Enrich: enrichVectorMetadataInSchema()
    Enrich->>Enrich: Detect VECTOR fields<br/>Add dimension to metadata
    Enrich-->>Writer: Enriched StructType
    Writer->>Arrow: Convert to Arrow schema<br/>(with sizing info)
    Arrow->>Lance: Write with metadata
    Writer->>Lance: addSchemaMetadata()<br/>(vector columns footer)
    Lance-->>User: Persisted file

Sequence Diagram (CodeRabbit):

sequenceDiagram
    actor User
    participant Lance as Lance File<br/>(with footer metadata)
    participant Reader as LanceReader
    participant Arrow as Arrow Schema<br/>(FixedSizeList)
    participant Restore as VectorConversionUtils
    participant SparkSchema as Spark Schema<br/>(restored VECTOR metadata)
    
    User->>Reader: Read Lance file
    Reader->>Lance: Load Arrow schema
    Lance-->>Arrow: Return schema<br/>(FixedSizeList fields)
    Reader->>Arrow: Convert to Spark StructType
    Arrow-->>Reader: Basic StructType<br/>(no VECTOR info)
    Reader->>Restore: restoreVectorMetadataFromArrowSchema()
    Restore->>Restore: Detect FixedSizeList of<br/>Float32/Float64
    Restore->>SparkSchema: Attach VECTOR metadata
    SparkSchema-->>User: Return enriched schema

CodeRabbit: yihua#49 (review)

rahil-c · 2026-04-21T16:23:55Z

@yihua to take a look

yihua · 2026-04-22T18:36:49Z

+        continue;
+      }
+      HoodieSchema.Vector.VectorElementType elemType = vec.getVectorElementType();
+      if (elemType != HoodieSchema.Vector.VectorElementType.FLOAT


I think it is OK to add this defensive check here to throw HoodieNotSupportedException. We should have a follow-up (non-blocker) to add such a check in schema validation as well. @rahil-c to create a GH issue.

yihua · 2026-04-22T19:04:00Z

+    // Only Parquet needs the BinaryType rewrite; other formats (Lance) return ArrayType natively.
+    if (hoodieFileFormat != HoodieFileFormat.PARQUET) {
+      (schema, Map.empty[Int, HoodieSchema.Vector])
+    } else {


Let's follow-up with MOR support in a separate PR if needed.

yihua · 2026-04-22T19:12:59Z

+   * {@code LanceArrowUtils.fromArrowSchema} drops all field metadata, so without this
+   * step VECTOR columns are indistinguishable from plain arrays.


Should we upstream a fix to LanceArrowUtils.fromArrowSchema so we can get rid of this restore logic in the future for simplicity?

Yes, we will need to track this in a GitHub issue and then upstream a fix to lance-spark.

Actually, after a deeper dive, I think the field metadata for this property is preserved.

rahil-c · 2026-04-22T21:54:45Z

Let's follow-up with MOR support in a separate PR if needed.
#17628 is the GitHub tracking issue for Lance as a log format.

- HoodieSparkLanceWriter.additionalSchemaMetadata: drop redundant java.util.Map FQN (Map is already imported)
- VectorConversionUtils.buildVectorColumnsFooterValue: walk fields in ordinal order rather than iterating the detected HashMap, so the hoodie.vector.columns footer value is stable across JDKs
- VectorConversionUtils & HoodieSchema: use import for LinkedHashMap instead of fully-qualified class name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
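The ordering fix in the second bullet can be sketched as follows. This is a simplified illustration, not `VectorConversionUtils.buildVectorColumnsFooterValue`: it assumes the schema is available as an ordered list of field names and that detected vectors sit in a map keyed by field name (both hypothetical), and it reuses the `colName:VECTOR(dim[,elemType])` entry format described earlier in this PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class VectorFooterBuildSketch {
  /**
   * Builds the comma-separated footer value by walking fields in schema order,
   * so the result does not depend on the iteration order of the descriptor map.
   */
  static String buildFooterValue(List<String> fieldNamesInOrder,
                                 Map<String, String> descriptorsByName) {
    List<String> entries = new ArrayList<>();
    for (String name : fieldNamesInOrder) {
      String descriptor = descriptorsByName.get(name);
      if (descriptor != null) {
        entries.add(name + ":" + descriptor);
      }
    }
    // Empty string when the schema has no VECTOR columns.
    return String.join(",", entries);
  }

  public static void main(String[] args) {
    Map<String, String> detected = new HashMap<>();
    detected.put("emb_d", "VECTOR(4,DOUBLE)");
    detected.put("emb_f", "VECTOR(8)");
    List<String> schemaOrder = Arrays.asList("id", "emb_f", "emb_d");
    System.out.println(buildFooterValue(schemaOrder, detected));
  }
}
```

Driving the loop from the schema rather than the map means two writers on different JDKs produce byte-identical footer values for the same schema.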

…st consolidation

- VectorConversionUtils.restoreVectorMetadataFromArrowSchema: only restore VECTOR descriptor metadata for fields listed in the hoodie.vector.columns footer, so non-VECTOR Arrow FixedSizeList<Float/Double> fields cannot be misinterpreted. Drop the anyChanged optimization for simplicity.
- HoodieSchema: add parseVectorColumnNames helper to extract field names from the footer value, handling commas inside descriptor parentheses.
- HoodieSparkLanceWriter.enrichSparkSchemaForLanceVectors: expand javadoc to explain why we auto-attach arrow.fixed-size-list.size from the VECTOR dimension.
- TestLanceDataSource: consolidate testFloatVectorRoundTrip, testDoubleVectorRoundTrip, testVectorMorUpdatePath, testVectorPartitionedTable into testMultipleVectorColumns (two VECTOR columns of different element types + dims, insert then upsert covers MOR merge path). Keep testNullableVectorRoundTrip separate: nullable VECTOR in the upsert/merge path hits a Lance reader bug; tracked as a follow-up. Rename forEachLanceSchema -> validateLanceFileSchema per review suggestion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
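Extracting field names from the footer value needs care because a descriptor such as `VECTOR(4,DOUBLE)` contains a comma of its own. One way to handle this, shown below as a hedged sketch rather than the actual `HoodieSchema.parseVectorColumnNames`, is to split only on commas at parenthesis depth zero. The class and method names are hypothetical, and the behavior for null or blank input (an empty set) is an assumption.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

public class VectorFooterParseSketch {
  /** Returns column names from a value like "a:VECTOR(3),b:VECTOR(4,DOUBLE)". */
  static Set<String> parseColumnNames(String footerValue) {
    if (footerValue == null || footerValue.trim().isEmpty()) {
      return Collections.emptySet(); // assumption: missing footer means no VECTOR columns
    }
    Set<String> names = new LinkedHashSet<>();
    int depth = 0;      // current parenthesis nesting
    int entryStart = 0; // start index of the entry being scanned
    for (int i = 0; i <= footerValue.length(); i++) {
      boolean atEnd = i == footerValue.length();
      char c = atEnd ? ',' : footerValue.charAt(i);
      if (!atEnd && c == '(') {
        depth++;
      } else if (!atEnd && c == ')') {
        depth--;
      } else if (atEnd || (c == ',' && depth == 0)) {
        // Entry boundary: only commas outside parentheses separate entries.
        String entry = footerValue.substring(entryStart, i).trim();
        int colon = entry.indexOf(':');
        if (colon > 0) {
          names.add(entry.substring(0, colon));
        }
        entryStart = i + 1;
      }
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(parseColumnNames("emb_f:VECTOR(8),emb_d:VECTOR(4,DOUBLE)"));
  }
}
```

A reader can then restore VECTOR metadata only for fields whose names appear in the returned set, which is the gating described in the first bullet above.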

yihua

LGTM

yihua · 2026-04-23T07:23:04Z

+      Set<String> vectorColumnNames = HoodieSchema.parseVectorColumnNames(
+          customMetadata == null ? null : customMetadata.get(HoodieSchema.VECTOR_COLUMNS_METADATA_KEY));


nit: VECTOR_COLUMNS_METADATA_KEY in the footer is now used. We can remove this usage later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

hudi-bot · 2026-04-23T22:14:56Z

CI report:

a8f2c55 Azure: SUCCESS

codecov-commenter · 2026-04-29T01:52:32Z

Codecov Report

❌ Patch coverage is 81.29496% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.87%. Comparing base (ace2871) to head (a8f2c55).
⚠️ Report is 23 commits behind head on master.

Files with missing lines	Patch %	Lines
.../apache/hudi/io/storage/VectorConversionUtils.java	73.91%	5 Missing and 7 partials ⚠️
...rg/apache/hudi/io/lance/HoodieBaseLanceWriter.java	55.55%	1 Missing and 3 partials ⚠️
...apache/hudi/io/storage/HoodieSparkLanceWriter.java	89.28%	2 Missing and 1 partial ⚠️
...va/org/apache/hudi/common/schema/HoodieSchema.java	90.90%	0 Missing and 3 partials ⚠️
...apache/hudi/io/storage/HoodieSparkLanceReader.java	80.00%	0 Missing and 1 partial ⚠️
...i/io/storage/row/HoodieRowParquetWriteSupport.java	0.00%	1 Missing ⚠️
...hudi/SparkFileFormatInternalRowReaderContext.scala	80.00%	0 Missing and 1 partial ⚠️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala	83.33%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             master   #18497    +/-   ##
==========================================
  Coverage     68.87%   68.87%            
- Complexity    28482    28515    +33     
==========================================
  Files          2478     2478            
  Lines        136699   136801   +102     
  Branches      16634    16659    +25     
==========================================
+ Hits          94150    94228    +78     
- Misses        34980    34989     +9     
- Partials       7569     7584    +15

Flag	Coverage Δ
common-and-other-modules	`44.43% <5.03%> (-0.04%)`	⬇️
hadoop-mr-java-client	`44.75% <9.30%> (-0.02%)`	⬇️
spark-client-hadoop-common	`48.47% <3.12%> (-0.07%)`	⬇️
spark-java-tests	`49.46% <81.29%> (+<0.01%)`	⬆️
spark-scala-tests	`45.28% <7.91%> (-0.04%)`	⬇️
utilities	`37.99% <7.19%> (-0.06%)`	⬇️

Files with missing lines	Coverage Δ
...a/org/apache/hudi/avro/HoodieAvroWriteSupport.java	`100.00% <100.00%> (ø)`
...ution/datasources/lance/SparkLanceReaderBase.scala	`87.03% <100.00%> (+1.03%)`	⬆️
...apache/hudi/io/storage/HoodieSparkLanceReader.java	`71.25% <80.00%> (+0.19%)`	⬆️
...i/io/storage/row/HoodieRowParquetWriteSupport.java	`72.87% <0.00%> (ø)`
...hudi/SparkFileFormatInternalRowReaderContext.scala	`76.71% <80.00%> (-0.05%)`	⬇️
...parquet/HoodieFileGroupReaderBasedFileFormat.scala	`85.58% <83.33%> (-0.19%)`	⬇️
...apache/hudi/io/storage/HoodieSparkLanceWriter.java	`93.58% <89.28%> (-2.57%)`	⬇️
...va/org/apache/hudi/common/schema/HoodieSchema.java	`87.65% <90.90%> (+0.01%)`	⬆️
...rg/apache/hudi/io/lance/HoodieBaseLanceWriter.java	`69.04% <55.55%> (-0.58%)`	⬇️
.../apache/hudi/io/storage/VectorConversionUtils.java	`80.39% <73.91%> (-5.82%)`	⬇️

... and 15 files with indirect coverage changes

github-actions Bot added the size:M PR with lines of changes in (100, 300] label Apr 13, 2026

github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Apr 13, 2026

rahil-c changed the title ~~[WIP] feat(lance): write Hudi VECTOR columns as native Lance fixed-size lists~~ [WIP] feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists Apr 13, 2026

rahil-c requested a review from wombatu-kun April 13, 2026 17:12

rahil-c and others added 5 commits April 14, 2026 15:45

address comments

ff5beb3

intial self review

66276af

fix comment to be concise

e703935

rahil-c marked this pull request as ready for review April 15, 2026 20:36

rahil-c requested review from bvaradar, voonhous and yihua April 15, 2026 20:36

bvaradar reviewed Apr 16, 2026

Comment thread ...di-spark-client/src/main/scala/org/apache/hudi/SparkFileFormatInternalRowReaderContext.scala Outdated

bvaradar reviewed Apr 16, 2026

Comment thread ...client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/VectorConversionUtils.java Outdated

rahil-c requested a review from bvaradar April 16, 2026 05:17

bvaradar approved these changes Apr 17, 2026

yihua reviewed Apr 18, 2026

Comment thread ...lient/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/HoodieSparkLanceWriter.java

yihua reviewed Apr 18, 2026

yihua mentioned this pull request Apr 18, 2026

[OSS PR #18497] [WIP] feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists yihua/hudi#49

Open

yihua reviewed Apr 18, 2026

rahil-c changed the title ~~[WIP] feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists~~ feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists Apr 20, 2026

yihua reviewed Apr 22, 2026

rahil-c and others added 4 commits April 22, 2026 15:51

address ethan comments

92d75af

Merge branch 'master' into rahil/lance-vector-write

574aca0

yihua reviewed Apr 23, 2026

rahil-c added 2 commits April 22, 2026 23:07

fix merge conflict and adress ethan

e1d6825

fix minor issue

01b6733

yihua approved these changes Apr 23, 2026

rahil-c and others added 2 commits April 23, 2026 10:26

Merge branch 'master' into rahil/lance-vector-write

158b2d0

fix(lance): reorder imports in TestLanceDataSource to satisfy scalastyle

a8f2c55

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

yihua merged commit 1e64662 into apache:master Apr 23, 2026
56 checks passed
