feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists #18497

Merged: yihua merged 17 commits into apache:master from rahil-c:rahil/lance-vector-write on Apr 23, 2026
@rahil-c
Collaborator

@rahil-c rahil-c commented Apr 13, 2026

Describe the issue this Pull Request addresses

Hudi's VECTOR logical type (RFC-99) already round-trips correctly on the Parquet path via the hoodie.vector.columns footer metadata + FIXED_LEN_BYTE_ARRAY storage. On the newly-added Lance base-file path, VECTOR columns silently degraded to plain List<Float> / List<Double> Arrow fields on write and lost their hudi_type descriptor on read. This PR wires up the Lance path so VECTOR columns land as native Arrow FixedSizeList<Float32|Float64, dim> (Lance's native vector column encoding) and are surfaced with the same hudi_type = VECTOR(...) metadata on read as Parquet.

Summary and Changelog

Users writing a Hudi table with a VECTOR(dim[, elem]) column and hoodie.table.base.file.format = LANCE now get Lance's native fixed-size-list vector encoding end-to-end, with the same schema-level metadata they see for Parquet-backed tables.

Changes:

  • Writer (HoodieSparkLanceWriter.enrichSparkSchemaForLanceVectors): translate each hudi_type = VECTOR(...) field's descriptor into the lance-spark metadata key arrow.fixed-size-list.size (long) before calling LanceArrowUtils.toArrowSchema, so lance-spark's shouldBeFixedSizeList chooses FixedSizeListWriter. FLOAT and DOUBLE elements only; anything else fails fast with HoodieNotSupportedException.
  • Writer footer: HoodieSparkLanceWriter.additionalSchemaMetadata emits the canonical hoodie.vector.columns footer entry (delegating to HoodieSchema.serializeVectorColumnsMetadata) so readers — including future readers — can identify VECTOR columns without a Hudi schema store.
  • Reader (VectorConversionUtils.restoreVectorMetadataFromArrowSchema): re-attach hudi_type = VECTOR(...) Spark metadata onto fields the Lance reader produces, gated on the hoodie.vector.columns footer value so non-Hudi-written FixedSizeList<Float/Double> fields cannot be misinterpreted. HoodieSchema.parseVectorColumnNames is a new helper for parsing the comma-separated footer value.
  • File-format plumbing: SparkFileFormatInternalRowReaderContext uses tableConfig.getBaseFileFormat (not a filename extension sniff) to decide whether to apply the Parquet-only VECTOR→BinaryType rewrite; HoodieFileGroupReaderBasedFileFormat.withVectorRewrite skips the rewrite entirely for non-Parquet base formats.
  • Tests (TestLanceDataSource, parameterized across COW + MOR):
    • testMultipleVectorColumns — consolidated: two VECTOR columns (FLOAT + DOUBLE) at different dimensions; insert then upsert exercises the COW rewrite and MOR log-merge paths.
    • testNullableVectorRoundTrip — nullable VECTOR with a null row (kept separate; folding null + upsert together triggers a Lance reader bug tracked as a follow-up).
    • testVectorProjection — project only the VECTOR column, and the VECTOR column with Hudi meta-fields.
    • Each test opens the written .lance file via LanceFileReader and asserts the physical column is ArrowType.FixedSizeList with the expected listSize, and that the hoodie.vector.columns footer entry matches the expected descriptor list.
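The writer-side translation in the first bullet can be sketched in plain Java. This is a minimal illustration, not the Hudi code: the class name, the map-based field metadata, and the descriptor regex are hypothetical, and it assumes the element type defaults to FLOAT when the descriptor omits it. Only the two metadata keys come from the description above.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the writer-side enrichment; not the Hudi implementation.
final class LanceVectorEnrichmentSketch {
  // Both keys are named in the PR description.
  static final String HUDI_TYPE_KEY = "hudi_type";
  static final String FIXED_SIZE_LIST_KEY = "arrow.fixed-size-list.size";

  // Assumed descriptor grammar: VECTOR(dim) or VECTOR(dim, ELEM).
  private static final Pattern VECTOR =
      Pattern.compile("VECTOR\\(\\s*(\\d+)\\s*(?:,\\s*(\\w+)\\s*)?\\)");

  /** Returns the field metadata, with the fixed-size-list size added for VECTOR fields. */
  static Map<String, Object> enrich(Map<String, Object> fieldMetadata) {
    Object descriptor = fieldMetadata.get(HUDI_TYPE_KEY);
    if (!(descriptor instanceof String)) {
      return fieldMetadata; // non-vector fields pass through unchanged
    }
    Matcher m = VECTOR.matcher(((String) descriptor).trim());
    if (!m.matches()) {
      return fieldMetadata;
    }
    // Assumption: the element type defaults to FLOAT when omitted.
    String elem = m.group(2) == null ? "FLOAT" : m.group(2).toUpperCase(Locale.ROOT);
    if (!elem.equals("FLOAT") && !elem.equals("DOUBLE")) {
      // The PR throws HoodieNotSupportedException here; a JDK exception stands in for it.
      throw new UnsupportedOperationException("Unsupported VECTOR element type: " + elem);
    }
    Map<String, Object> enriched = new LinkedHashMap<>(fieldMetadata);
    enriched.put(FIXED_SIZE_LIST_KEY, Long.parseLong(m.group(1))); // stored as a long
    return enriched;
  }
}
```

Copying the map rather than mutating it keeps the caller's schema untouched, which mirrors the "non-vector fields pass through unchanged" behavior described in the commit message.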

Impact

  • User-facing: Hudi tables with hoodie.table.base.file.format = LANCE and a VECTOR column now produce Lance files where that column is a native fixed-size list (previously a plain variable-length list). This is the representation Lance's vector-search features operate on, so future vector-search integrations can consume these files directly.
  • No change to the Parquet path.
  • No public Java/Scala API signature changes. HoodieSparkLanceWriter gains a private helper and overrides the existing additionalSchemaMetadata hook on HoodieBaseLanceWriter.

Risk Level

low

Only the Lance base-file write/read path is affected. Parquet and other base formats keep their existing behavior (explicitly gated by HoodieFileFormat checks at every rewrite site). The VECTOR-column changes are local to a narrow conversion layer between Hudi and lance-spark. Covered by the new TestLanceDataSource cases (COW + MOR, multi-column, upsert, nullable, projection).

Documentation Update

none — Lance base-file support is itself still new. The user-facing VECTOR descriptor on Lance is identical to Parquet (already documented in RFC-99).

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Translate the Hudi VECTOR logical-type metadata (`hudi_type = "VECTOR(dim[,elem])"`)
into the lance-spark metadata key `arrow.fixed-size-list.size` before calling
`LanceArrowUtils.toArrowSchema`, so the Lance writer emits a native Arrow
FixedSizeList<Float32|Float64, dim> (Lance's vector column encoding) instead
of a plain variable-length list. No change needed at `LanceFileWriter.open(...)`;
the encoding is driven by the Arrow schema itself.

- New private helper `enrichSparkSchemaForLanceVectors` in `HoodieSparkLanceWriter`
  reuses `VectorConversionUtils.detectVectorColumnsFromMetadata` to find VECTOR
  fields and attaches the Lance metadata key; non-vector fields pass through
  unchanged.
- Fails fast with `HoodieNotSupportedException` for non-ArrayType or non-
  Float/Double element types (matches lance-spark's `shouldBeFixedSizeList`).
- Tests in `TestLanceDataSource` (COW + MOR):
    - `testFloatVectorRoundTrip`
    - `testDoubleVectorRoundTrip`
    - `testMultipleVectorColumns`
  Each opens the written `.lance` file via `LanceFileReader` and asserts the
  field is `ArrowType.FixedSizeList` with the expected `listSize` — the direct
  regression guard that fails pre-fix and passes post-fix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Apr 13, 2026
Companion to the Lance writer's native FixedSizeList encoding: on read,
rehydrate the Hudi `hudi_type = VECTOR(...)` Spark metadata that
`LanceArrowUtils.fromArrowSchema` drops, so the read schema matches the
Parquet path. Gate the Parquet-only ArrayType→BinaryType vector rewrite
in HoodieFileGroupReaderBasedFileFormat on format == PARQUET; Lance
returns vectors natively as ArrayType so the rewrite would trigger a
spurious cast and break the read.

- VectorConversionUtils.restoreVectorMetadataFromArrowSchema walks the
  Arrow schema and re-attaches VECTOR(dim[,DOUBLE]) for
  FixedSizeList<Float32|Float64, dim> fields.
- HoodieSparkLanceReader.getSchema and SparkLanceReaderBase.read now
  call it so downstream VECTOR-aware code sees the same schema as on
  Parquet.
- TestLanceDataSource: assert hudi_type metadata is restored on read
  for float, double, and multi-vector round-trips.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Apr 13, 2026
@rahil-c rahil-c changed the title [WIP] feat(lance): write Hudi VECTOR columns as native Lance fixed-size lists [WIP] feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists Apr 13, 2026
@rahil-c rahil-c requested a review from wombatu-kun April 13, 2026 17:12
Mirrors the Parquet writer: emit the comma-separated
`colName:VECTOR(dim[,elemType])` descriptor list under the existing
`hoodie.vector.columns` key in the Lance file-footer key-value metadata.
Reader still derives VECTOR identity from the Arrow FixedSizeList type
today; this footer entry is insurance for future descriptor fields the
Arrow type cannot express (quantization tags, distance metrics, etc.)
and keeps Lance files symmetric with Parquet files.

- HoodieBaseLanceWriter: new protected `additionalSchemaMetadata()` hook
  invoked during close(), so subclasses can contribute footer KV
  entries alongside bloom-filter metadata.
- HoodieSparkLanceWriter: override `additionalSchemaMetadata()` to emit
  `hoodie.vector.columns` when the Spark schema has any VECTOR column.
- VectorConversionUtils: add `buildVectorColumnsMetadataValue(StructType)`
  matching the Parquet-path helper's output format.
- TestLanceDataSource: assert footer carries the expected descriptor
  list for float, double, and multi-vector round-trips.
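As a rough illustration of the footer value format described above (a comma-separated list of `colName:VECTOR(dim[,elemType])` entries), the following plain-Java sketch builds the string from a list of fields. The `Field` class and `buildFooterValue` name are hypothetical and stand in for the Spark `StructType` walk; walking the fields in schema order is a choice made here so the output is deterministic.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical sketch of the footer value builder; not the Hudi helper itself.
final class VectorFooterValueSketch {
  /** Hypothetical field model: a name plus its hudi_type descriptor (null when absent). */
  static final class Field {
    final String name;
    final String hudiType;

    Field(String name, String hudiType) {
      this.name = name;
      this.hudiType = hudiType;
    }
  }

  /** Joins "name:descriptor" entries for VECTOR fields, in the order the fields are given. */
  static String buildFooterValue(List<Field> fieldsInSchemaOrder) {
    StringJoiner joined = new StringJoiner(",");
    for (Field f : fieldsInSchemaOrder) {
      if (f.hudiType != null && f.hudiType.startsWith("VECTOR(")) {
        joined.add(f.name + ":" + f.hudiType);
      }
    }
    // Empty string when there are no VECTOR fields, so the caller can skip the footer entry.
    return joined.toString();
  }
}
```

For a schema with fields `id`, `emb` (`VECTOR(3)`), and `img` (`VECTOR(4,DOUBLE)`), this sketch yields `emb:VECTOR(3),img:VECTOR(4,DOUBLE)`.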

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@wombatu-kun
Contributor

Writing VECTOR columns as native Lance FixedSizeList is the right direction, and the writer/reader symmetry plus the forward-compat footer are nice design choices. I've been working on the same problem on a parallel branch (different strategy: plain List<Float> + BinaryType rewrite reversal), and while comparing the two approaches I spotted a handful of things worth addressing before this lands.

@wombatu-kun
Contributor

Lance artifact coordinates are out of date: com.lancedb → org.lance, 0.0.15 → 0.4.0

@wombatu-kun
Contributor

The hoodieFileFormat != PARQUET early-return in withVectorRewrite is clean. Please double-check that every call site that previously may have hit the rewrite now correctly skips it for Lance:

  • buildReaderWithPartitionValues (all three invocations: requiredSchema, outputSchema, requestedSchema).
  • Any other places in the file where detectVectorColumns / replaceVectorFieldsWithBinary are called independently of the helper.

@wombatu-kun
Contributor

The three added tests (testFloatVectorRoundTrip, testDoubleVectorRoundTrip, testMultipleVectorColumns) are good writer/reader guards but cover only the trivial insert + full-table-read path. Recommended additions:

  • Nullable vector column (null row values, null whole struct).
  • Partitioned table.
  • MOR log merging (write → update → read).
  • Schema evolution: add VECTOR column to an existing Lance table and read old + new rows.
  • Clustering: verify clustered output Lance files also carry native FixedSizeList + footer.
  • Projection: read only the vector column; read vector column alongside metadata columns.
  • Time travel / incremental query.

@wombatu-kun
Contributor

Minor code-quality nits

  • HoodieSparkLanceWriter#enrichSparkSchemaForLanceVectors: the local DataType dt = field.dataType() is assigned before the ArrayType check, then unused once the cast is done. Can collapse to if (!(field.dataType() instanceof ArrayType)).
  • VectorConversionUtils#buildVectorColumnsMetadataValue duplicates the format produced by HoodieSchema#buildVectorColumnsMetadataValue for Avro schemas. Consider adding a Javadoc cross-reference and asserting the two produce identical strings for equivalent schemas (could even delegate).
  • VectorConversionUtils#restoreVectorMetadataFromArrowSchema walks only top-level fields. If a nested struct ever carries a VECTOR child (unlikely today but possible), it would be missed. Worth a Javadoc note: "Top-level VECTORs only; nested struct children are not recursed into."
  • HoodieBaseLanceWriter#close: the new additionalSchemaMetadata hook is called after bloom filter metadata — fine — but the nested if (writer != null) check is redundant (already inside the outer writer != null for bloom filter); can collapse.
  • buildVectorColumnsMetadataValue returns "" for schemas without vectors, and the override early-returns Collections.emptyMap() in that case — so no footer entry is emitted. Correct, just worth a comment explaining that the hook is called unconditionally but is a no-op when there are no vectors (otherwise a future reader maintaining this code might wonder why a non-VECTOR Lance file has no hoodie.vector.columns).

rahil-c and others added 5 commits April 14, 2026 15:45
SparkFileFormatInternalRowReaderContext.getFileRecordIterator had a
second, unconditional rewrite of VECTOR columns from ArrayType to
BinaryType (the earlier withVectorRewrite gate in
HoodieFileGroupReaderBasedFileFormat only covered the non-FileGroupReader
branch). On the MOR / FileGroupReader path this caused Lance reads to
fail with scala.MatchError: ArrayType(FloatType,true) in
Cast.castToBinaryCode, because Lance returns vectors natively as
ArrayType while the caller-supplied schema had been rewritten to
BinaryType — the generated UnsafeProjection then injected an
unsupported Cast(ArrayType -> BinaryType).

Gate the detection + rewrite on the file format: skip it for .lance
base files. Hudi log files are always parquet-encoded so they still
take the Parquet path.

Fixes 14 TestLanceDataSource vector errors (COW + MOR) observed in
spark3.5 / spark3.4 CI, including the spark3.4 part2 6h timeout that
was the same failure retrying.
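The format gate described in this commit can be pictured with a small plain-Java sketch. Everything here is hypothetical: `BaseFormat` stands in for `HoodieFileFormat`, and the schema is modeled as a map from column name to a type string rather than a Spark `StructType`. It only illustrates the rule that the ArrayType to BinaryType rewrite applies to Parquet base files and is skipped otherwise.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the Parquet-only VECTOR rewrite gate; not the Hudi code.
final class VectorRewriteGateSketch {
  enum BaseFormat { PARQUET, LANCE } // stand-in for HoodieFileFormat

  /** Returns the schema to hand to the file reader for the given base file format. */
  static Map<String, String> readSchema(
      BaseFormat format, Map<String, String> schema, Set<String> vectorColumns) {
    if (format != BaseFormat.PARQUET) {
      // Lance returns vectors natively as arrays, so the schema is used as-is.
      return schema;
    }
    // Parquet stores vectors as fixed-length binary, so vector columns are read as binary.
    Map<String, String> rewritten = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : schema.entrySet()) {
      rewritten.put(e.getKey(), vectorColumns.contains(e.getKey()) ? "binary" : e.getValue());
    }
    return rewritten;
  }
}
```

Returning the original schema instance on the non-Parquet branch makes the skip explicit: no cast is ever generated for formats that already produce array values.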

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
On Scala 2.13, Row.getAs[Seq[Float]] fails at runtime with
ClassCastException: scala.collection.mutable.ArraySeq$ofRef cannot
be cast to scala.collection.immutable.Seq, because Seq in Scala 2.13
defaults to immutable.Seq while Spark holds array columns as
mutable.ArraySeq internally.

Row.getSeq[T] is declared as scala.collection.Seq[T] (general), so it
works on both 2.12 (where Seq = scala.collection.Seq) and 2.13 (where
Seq = scala.collection.immutable.Seq). Same runtime object, no cast.

Fixes the 14 TestLanceDataSource errors on java17 CI (scala-2.13,
spark3.5 / spark4.0). The earlier VECTOR->BinaryType rewrite fix
resolved the scala.MatchError in the read path; this change resolves
the subsequent 2.13-only test-side ClassCastException.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rahil-c rahil-c marked this pull request as ready for review April 15, 2026 20:36
@rahil-c rahil-c requested review from bvaradar, voonhous and yihua April 15, 2026 20:36
… size check

- SparkFileFormatInternalRowReaderContext: use tableConfig.getBaseFileFormat
  instead of filename extension sniff to detect Lance base files
- VectorConversionUtils.restoreVectorMetadataFromArrowSchema: remove confusing
  arrowFields.size != sparkFields.length defensive guard

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@rahil-c rahil-c requested a review from bvaradar April 16, 2026 05:17
@rahil-c
Collaborator Author

rahil-c commented Apr 16, 2026

> Lance artifact coordinates are out of date: com.lancedb → org.lance, 0.0.15 → 0.4.0

Why does this matter, though, since we didn't change any Lance dependency version in this PR? I have a separate Lance bump PR, #18498, to isolate this.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — one small inconsistency to clean up: the new additionalSchemaMetadata method uses a fully-qualified java.util.Map return type even though Map is imported in the same file.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Nice work on this PR! The round-trip design is clean — enriching the Spark schema with arrow.fixed-size-list.size on write and rebuilding the VECTOR descriptor from the Arrow type on read mirrors the Parquet path nicely, and the footer-key unification (VECTOR_COLUMNS_METADATA_KEY) keeps both formats in lockstep. One correctness concern around deterministic ordering of the footer value is worth double-checking in the inline comment. Once that is addressed, this should be ready for a Hudi committer or PMC member to take it from here.

// Only Parquet needs the BinaryType rewrite; other formats (Lance) return ArrayType natively.
if (hoodieFileFormat != HoodieFileFormat.PARQUET) {
  (schema, Map.empty[Int, HoodieSchema.Vector])
} else {
Contributor

🤖 Gating on hoodieFileFormat != HoodieFileFormat.PARQUET disables the vector→Binary rewrite for the whole table, but MOR log files are always Parquet even when the base format is Lance. Could you confirm that log file reads in a Lance MOR table don't flow through withVectorRewrite? If they do, a Lance table with VECTOR columns + log updates would try to read FIXED_LEN_BYTE_ARRAY vectors with an ArrayType schema and fail.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Contributor

Let's follow-up with MOR support in a separate PR if needed.

continue;
}
HoodieSchema.Vector.VectorElementType elemType = vec.getVectorElementType();
if (elemType != HoodieSchema.Vector.VectorElementType.FLOAT
Contributor

🤖 enrichSparkSchemaForLanceVectors throws HoodieNotSupportedException at writer construction for any non-FLOAT/DOUBLE VECTOR. Is there a schema-validation layer upstream that catches this before the writer is created, or will a user with e.g. a VECTOR(N, INT8) column first discover the limitation as a failed write task? If it's the latter, consider surfacing this earlier (at DDL/commit planning) so partial writes don't leave orphan files.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Contributor

I think it is OK to add this defensive check here to throw HoodieNotSupportedException. We should have a follow-up (non-blocker) to add such a check in schema validation as well. @rahil-c to create a GH issue.

Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: This PR enhances vector metadata handling across Hudi to support Lance file format persistence and recovery. It introduces utilities to serialize/restore vector descriptors, adds vector schema enrichment during Lance writes, restores vector type information during reads, renames a format-agnostic metadata key constant, and implements conditional vector schema rewriting based on file format. Test coverage for Lance vectors is substantially expanded.

Sequence Diagram (CodeRabbit):

sequenceDiagram
    actor User
    participant SparkSchema as Spark Schema<br/>(with VECTOR metadata)
    participant Writer as HoodieSparkLanceWriter
    participant Enrich as VectorConversionUtils
    participant Arrow as Arrow Schema
    participant Lance as Lance File<br/>(with footer metadata)
    
    User->>Writer: Write DataFrame with VECTOR columns
    Writer->>Enrich: enrichVectorMetadataInSchema()
    Enrich->>Enrich: Detect VECTOR fields<br/>Add dimension to metadata
    Enrich-->>Writer: Enriched StructType
    Writer->>Arrow: Convert to Arrow schema<br/>(with sizing info)
    Arrow->>Lance: Write with metadata
    Writer->>Lance: addSchemaMetadata()<br/>(vector columns footer)
    Lance-->>User: Persisted file

Sequence Diagram (CodeRabbit):

sequenceDiagram
    actor User
    participant Lance as Lance File<br/>(with footer metadata)
    participant Reader as LanceReader
    participant Arrow as Arrow Schema<br/>(FixedSizeList)
    participant Restore as VectorConversionUtils
    participant SparkSchema as Spark Schema<br/>(restored VECTOR metadata)
    
    User->>Reader: Read Lance file
    Reader->>Lance: Load Arrow schema
    Lance-->>Arrow: Return schema<br/>(FixedSizeList fields)
    Reader->>Arrow: Convert to Spark StructType
    Arrow-->>Reader: Basic StructType<br/>(no VECTOR info)
    Reader->>Restore: restoreVectorMetadataFromArrowSchema()
    Restore->>Restore: Detect FixedSizeList of<br/>Float32/Float64
    Restore->>SparkSchema: Attach VECTOR metadata
    SparkSchema-->>User: Return enriched schema

CodeRabbit: yihua#49 (review)

@rahil-c rahil-c changed the title [WIP] feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists feat(lance): round-trip Hudi VECTOR columns as native Lance fixed-size lists Apr 20, 2026
@rahil-c
Collaborator Author

rahil-c commented Apr 21, 2026

@yihua to take a look

Comment thread hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java Outdated

Comment on lines +269 to +270
* {@code LanceArrowUtils.fromArrowSchema} drops all field metadata, so without this
* step VECTOR columns are indistinguishable from plain arrays.
Contributor

Should we upstream a fix to LanceArrowUtils.fromArrowSchema so we can get rid of this restore logic in the future, for simplicity?

Collaborator Author

Yes, we will need to track this in a GitHub issue and then upstream a fix to lance-spark.

Collaborator Author

Actually, after a deeper dive, I think the field metadata for this property is preserved.

Image

@rahil-c
Collaborator Author

rahil-c commented Apr 22, 2026

> Let's follow-up with MOR support in a separate PR if needed.

#17628 is the GitHub tracking issue for Lance as a log format.

rahil-c and others added 4 commits April 22, 2026 15:51
- HoodieSparkLanceWriter.additionalSchemaMetadata: drop redundant
  java.util.Map FQN (Map is already imported)
- VectorConversionUtils.buildVectorColumnsFooterValue: walk fields in ordinal
  order rather than iterating the detected HashMap, so the hoodie.vector.columns
  footer value is stable across JDKs
- VectorConversionUtils & HoodieSchema: use import for LinkedHashMap instead
  of fully-qualified class name

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…st consolidation

- VectorConversionUtils.restoreVectorMetadataFromArrowSchema: only restore VECTOR
  descriptor metadata for fields listed in the hoodie.vector.columns footer, so
  non-VECTOR Arrow FixedSizeList<Float/Double> fields cannot be misinterpreted.
  Drop the anyChanged optimization for simplicity.
- HoodieSchema: add parseVectorColumnNames helper to extract field names from the
  footer value, handling commas inside descriptor parentheses.
- HoodieSparkLanceWriter.enrichSparkSchemaForLanceVectors: expand javadoc to
  explain why we auto-attach arrow.fixed-size-list.size from the VECTOR dimension.
- TestLanceDataSource: consolidate testFloatVectorRoundTrip, testDoubleVectorRoundTrip,
  testVectorMorUpdatePath, testVectorPartitionedTable into testMultipleVectorColumns
  (two VECTOR columns of different element types + dims, insert then upsert covers
  MOR merge path). Keep testNullableVectorRoundTrip separate — nullable VECTOR in the
  upsert/merge path hits a Lance reader bug; tracked as a follow-up. Rename
  forEachLanceSchema -> validateLanceFileSchema per review suggestion.
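The footer parsing mentioned in this commit has one subtlety: a descriptor such as `VECTOR(4,DOUBLE)` contains a comma of its own, so a plain `split(",")` would cut entries in half. A minimal sketch of one way to handle that, by tracking parenthesis depth, is shown below. The class is hypothetical and only borrows the `parseVectorColumnNames` name from the commit message; the actual Hudi helper may be implemented differently.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of parsing the hoodie.vector.columns footer value; not the Hudi helper.
final class VectorFooterParseSketch {
  /** Extracts column names from "name:VECTOR(dim[,elem]),..." while ignoring commas in parentheses. */
  static Set<String> parseVectorColumnNames(String footerValue) {
    Set<String> names = new LinkedHashSet<>();
    if (footerValue == null || footerValue.isEmpty()) {
      return names; // no footer entry means no VECTOR columns
    }
    int depth = 0;
    int start = 0;
    for (int i = 0; i <= footerValue.length(); i++) {
      boolean atEnd = i == footerValue.length();
      char c = atEnd ? ',' : footerValue.charAt(i);
      if (!atEnd && c == '(') {
        depth++;
      } else if (!atEnd && c == ')') {
        depth--;
      }
      // Split only on commas outside parentheses, or at the end of the input.
      if (atEnd || (c == ',' && depth == 0)) {
        String entry = footerValue.substring(start, i).trim();
        int colon = entry.indexOf(':');
        if (colon > 0) {
          names.add(entry.substring(0, colon).trim());
        }
        start = i + 1;
      }
    }
    return names;
  }
}
```

A reader could then restore the VECTOR descriptor only for fields whose names appear in the returned set, which is the gating behavior the first bullet of this commit describes.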

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchema.java Outdated
Contributor

@yihua yihua left a comment

LGTM

Comment on lines +206 to +207
Set<String> vectorColumnNames = HoodieSchema.parseVectorColumnNames(
customMetadata == null ? null : customMetadata.get(HoodieSchema.VECTOR_COLUMNS_METADATA_KEY));
Contributor

nit: VECTOR_COLUMNS_METADATA_KEY in the footer is now used. We can remove this usage later.

@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua merged commit 1e64662 into apache:master Apr 23, 2026
56 checks passed
@codecov-commenter

Codecov Report

❌ Patch coverage is 81.29496% with 26 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.87%. Comparing base (ace2871) to head (a8f2c55).
⚠️ Report is 23 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| .../apache/hudi/io/storage/VectorConversionUtils.java | 73.91% | 5 Missing and 7 partials ⚠️ |
| ...rg/apache/hudi/io/lance/HoodieBaseLanceWriter.java | 55.55% | 1 Missing and 3 partials ⚠️ |
| ...apache/hudi/io/storage/HoodieSparkLanceWriter.java | 89.28% | 2 Missing and 1 partial ⚠️ |
| ...va/org/apache/hudi/common/schema/HoodieSchema.java | 90.90% | 0 Missing and 3 partials ⚠️ |
| ...apache/hudi/io/storage/HoodieSparkLanceReader.java | 80.00% | 0 Missing and 1 partial ⚠️ |
| ...i/io/storage/row/HoodieRowParquetWriteSupport.java | 0.00% | 1 Missing ⚠️ |
| ...hudi/SparkFileFormatInternalRowReaderContext.scala | 80.00% | 0 Missing and 1 partial ⚠️ |
| ...parquet/HoodieFileGroupReaderBasedFileFormat.scala | 83.33% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             master   #18497    +/-   ##
==========================================
  Coverage     68.87%   68.87%            
- Complexity    28482    28515    +33     
==========================================
  Files          2478     2478            
  Lines        136699   136801   +102     
  Branches      16634    16659    +25     
==========================================
+ Hits          94150    94228    +78     
- Misses        34980    34989     +9     
- Partials       7569     7584    +15     
| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | 44.43% <5.03%> (-0.04%) ⬇️ |
| hadoop-mr-java-client | 44.75% <9.30%> (-0.02%) ⬇️ |
| spark-client-hadoop-common | 48.47% <3.12%> (-0.07%) ⬇️ |
| spark-java-tests | 49.46% <81.29%> (+<0.01%) ⬆️ |
| spark-scala-tests | 45.28% <7.91%> (-0.04%) ⬇️ |
| utilities | 37.99% <7.19%> (-0.06%) ⬇️ |

| Files with missing lines | Coverage Δ |
|---|---|
| ...a/org/apache/hudi/avro/HoodieAvroWriteSupport.java | 100.00% <100.00%> (ø) |
| ...ution/datasources/lance/SparkLanceReaderBase.scala | 87.03% <100.00%> (+1.03%) ⬆️ |
| ...apache/hudi/io/storage/HoodieSparkLanceReader.java | 71.25% <80.00%> (+0.19%) ⬆️ |
| ...i/io/storage/row/HoodieRowParquetWriteSupport.java | 72.87% <0.00%> (ø) |
| ...hudi/SparkFileFormatInternalRowReaderContext.scala | 76.71% <80.00%> (-0.05%) ⬇️ |
| ...parquet/HoodieFileGroupReaderBasedFileFormat.scala | 85.58% <83.33%> (-0.19%) ⬇️ |
| ...apache/hudi/io/storage/HoodieSparkLanceWriter.java | 93.58% <89.28%> (-2.57%) ⬇️ |
| ...va/org/apache/hudi/common/schema/HoodieSchema.java | 87.65% <90.90%> (+0.01%) ⬆️ |
| ...rg/apache/hudi/io/lance/HoodieBaseLanceWriter.java | 69.04% <55.55%> (-0.58%) ⬇️ |
| .../apache/hudi/io/storage/VectorConversionUtils.java | 80.39% <73.91%> (-5.82%) ⬇️ |

... and 15 files with indirect coverage changes

Labels

size:L PR with lines of changes in (300, 1000]

6 participants