test(schema): Add lance fileformat test for custom types on MOR by voonhous · Pull Request #18597 · apache/hudi

voonhous · 2026-04-26T15:11:09Z

Describe the issue this Pull Request addresses

Fixes: #18602

Follow-up to #18583: adds the same tests, but with Lance as the base file format.

Note: Merge this after #18583 and #18599 are merged.

Summary and Changelog

Add the MERGE INTO log-only tests from #18583, using the Lance base file format.

Impact

None

Risk Level

low

The change only re-attaches metadata that the catalog already owns, scoped to fields with a matching target. Nullability narrowing is opt-in and only enabled on paths where Spark guarantees no nulls upstream.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

Cover the invariant that the HoodieSchema.TYPE_METADATA_FIELD descriptor and payload shape of a custom-typed column survive inline compaction of a log-only MOR table into a base file.

- TestVectorDataSource: add testMorLogOnlyCompactionPreservesVectorMetadata (5 commits via SQL + MERGE INTO to trigger default inline compaction).
- TestVariantDataType: equivalent VARIANT test, gated on Spark 4.0+, asserting native VariantType round-trips through compaction.
- TestBlobDataType (new): BLOB INLINE and BLOB OUT_OF_LINE cases. Inline uses named_struct with hex byte literals; out-of-line creates real files via BlobTestHelpers.createTestFile and verifies bytes via read_blob().

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR mirrors the merge-into log-only tests from #18583 against the Lance base file format and adds a corresponding non-Lance vector test. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming/duplication nit in the vector test — the readOrdered and embeddingOf helpers are defined twice across the two near-identical test methods; otherwise the tests are clean and well-commented.

cc @yihua

hudi-agent · 2026-04-26T15:36:56Z

+           | )
+       """.stripMargin)
+
+      def readOrdered(): Seq[Row] =


🤖 nit: readOrdered and embeddingOf are defined identically inside both testMorLogOnlyCompactionPreservesVectorMetadata and this Lance variant. Could you hoist them to private class-level helpers so changes only need to be made in one place?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
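A minimal sketch of the hoisting suggested above, runnable without Spark. `FakeRow`, `VectorReadHelpers`, `VectorHelpersDemo`, and both helper signatures are hypothetical: the real helpers work on Spark `Row`s, and their bodies are not shown in this thread, so the sorting and lookup logic here is only an assumption about what they do.

```scala
// Hypothetical stand-in for org.apache.spark.sql.Row so the sketch runs without Spark.
final case class FakeRow(id: Int, embedding: Seq[Float])

// Assumed shape of the hoisted helpers; the real bodies are not shown in this thread.
trait VectorReadHelpers {
  // Return rows in a deterministic order so assertions do not depend on scan order.
  protected def readOrdered(rows: Seq[FakeRow]): Seq[FakeRow] = rows.sortBy(_.id)

  // Look up the embedding for one key, failing loudly if the key is missing.
  protected def embeddingOf(rows: Seq[FakeRow], id: Int): Seq[Float] =
    rows.find(_.id == id).map(_.embedding)
      .getOrElse(throw new NoSuchElementException(s"no row with id=$id"))
}

// The parquet and Lance test methods would share the helpers through the trait.
object VectorHelpersDemo extends VectorReadHelpers {
  def main(args: Array[String]): Unit = {
    val rows = Seq(FakeRow(2, Seq(0.5f)), FakeRow(1, Seq(1.0f, 2.0f)))
    assert(readOrdered(rows).map(_.id) == Seq(1, 2))
    assert(embeddingOf(rows, 1) == Seq(1.0f, 2.0f))
  }
}
```

With the helpers defined once at class level, a change to the read ordering or the embedding extraction only has to be made in one place.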

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR extends the existing custom-type tests (VECTOR / BLOB / VARIANT on log-only MOR + compaction) with parallel Lance base-file variants, gated on lance.skip.tests. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of "expected to fail" comments in the Lance tests have no corresponding failure assertion, so a reader (or CI triage) can't tell whether a red build is intentional or a regression.

cc @yihua

hudi-agent · 2026-04-27T10:48:41Z

+  }
+
+  test("Test Query Log Only MOR Table With BLOB OUT_OF_LINE column triggers compaction (Lance)") {
+    assume(System.getProperty("lance.skip.tests") != "true",


🤖 nit: 'Expected to fail' without a fail(), assertThrows, or @Disabled leaves a future reader — or someone triaging a red CI run — unable to tell whether this is an intentional known gap or a regression. Could you make the intent explicit, e.g. add fail("Lance OUT_OF_LINE BLOB not yet supported — remove when RFC-100 Phase 2 lands") after the assume guard?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
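The comment above asks for the "expected to fail" intent to be made explicit. One possible alternative to an unconditional fail(...) is to assert that the known gap still throws, so the test stays green while the gap exists and turns red once it is closed. The sketch below uses plain Scala only; `expectKnownGap` and `unsupportedLanceWrite` are hypothetical names and are not part of Hudi or ScalaTest. Inside a ScalaTest suite, `intercept` or `assertThrows` would play a similar role.

```scala
import scala.util.{Failure, Success, Try}

object KnownGapSketch {
  // Hypothetical helper: passes only while `body` still throws.
  // If `body` succeeds, the gap has been closed and the guard should be removed.
  def expectKnownGap(reason: String)(body: => Unit): Unit =
    Try(body) match {
      case Failure(_) => () // known gap still present, nothing to report
      case Success(_) =>
        throw new AssertionError(s"Known gap appears closed, remove this guard: $reason")
    }

  // Stand-in for a write path that is expected to fail for now (simulated).
  def unsupportedLanceWrite(): Unit =
    throw new UnsupportedOperationException("simulated: type not supported by the writer")

  def main(args: Array[String]): Unit = {
    // Passes: the simulated writer still throws.
    expectKnownGap("Lance OUT_OF_LINE BLOB not yet supported")(unsupportedLanceWrite())

    // Once the body stops throwing, the helper itself fails with a clear message.
    val closed = Try(expectKnownGap("gap closed")(()))
    assert(closed.isFailure)
  }
}
```

Either approach puts the reason for the failure into the test output, which is what the review comment is after.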

hudi-agent · 2026-04-27T10:48:42Z

+
+  test("Test Query Log Only MOR Table With VARIANT column triggers compaction (Lance)") {
+    assume(HoodieSparkUtils.gteqSpark4_0, "Variant type requires Spark 4.0 or higher")
+    assume(System.getProperty("lance.skip.tests") != "true",


🤖 nit: same pattern as in TestBlobDataType — 'Expected to fail' with no explicit fail() or @Disabled means a CI failure from this test is indistinguishable from a regression. Could you add a fail("Lance VARIANT not yet supported — remove when RFC-100 Phase 2 lands") so the intent is clear in the failure output?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Mirror the parquet MOR log-only compaction tests for VECTOR, VARIANT, and BLOB onto the Lance base file format, and extend all variants with a 6th deltacommit so the cleaner has a chance to retire the post-compaction log-only slice and write a .clean instant.

- VECTOR Lance: passes; verifies HoodieFileFormat.LANCE on the table config and that a .lance base file exists under the table path after compaction.
- VARIANT Lance / BLOB INLINE Lance / BLOB OUT_OF_LINE Lance: gated by -Dlance.skip.tests; expected to fail at HoodieSparkLanceWriter -> LanceArrowUtils.toArrowType (RFC-100 Phase 2 gap). Each asserts the LANCE format config sticks to hoodie.properties immediately after CREATE TABLE so the table-level invariant is checked even when the writer fails downstream.
- All 8 tests (4 parquet + 4 Lance) now drive a 6th merge-update after the compaction-triggering 5th commit. The 5th commit's auto-clean runs before inline compaction, so the prior log slice is not yet superseded; the 6th commit's postCommit clean retires it and writes the .clean instant. The cleaner-timeline assertion uses reloadActiveTimeline() to avoid a stale cached view.
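The ordering argument for the 6th commit can be illustrated with a toy model. This is not Hudi code: `State`, `commit`, and `CompactionThreshold` are hypothetical, the cleaner is simplified to "retire every slice except the latest", and the threshold of 5 is taken from the description above. The only point is that a clean which runs before inline compaction has nothing to retire on the 5th commit.

```scala
object CleanOrderingSketch {
  // Toy model of one file group: number of slices, delta commits since the
  // last compaction, and how many clean instants have been written.
  final case class State(slices: Int, deltaCommitsSinceCompaction: Int, cleanInstants: Int)

  // Assumed from the description above: inline compaction after 5 delta commits.
  val CompactionThreshold = 5

  // One merge-update: write a log file, then clean, then maybe compact inline.
  def commit(s: State): State = {
    val written = s.copy(deltaCommitsSinceCompaction = s.deltaCommitsSinceCompaction + 1)
    // The clean runs first. It records an instant only if a superseded slice exists.
    val cleaned =
      if (written.slices > 1) written.copy(slices = 1, cleanInstants = written.cleanInstants + 1)
      else written
    // Inline compaction runs after the clean and opens a new slice with a base file.
    if (cleaned.deltaCommitsSinceCompaction >= CompactionThreshold)
      cleaned.copy(slices = cleaned.slices + 1, deltaCommitsSinceCompaction = 0)
    else cleaned
  }

  def main(args: Array[String]): Unit = {
    val initial = State(slices = 1, deltaCommitsSinceCompaction = 0, cleanInstants = 0)
    val afterFive = (1 to 5).foldLeft(initial)((s, _) => commit(s))
    val afterSix = commit(afterFive)
    // 5th commit: compaction created a second slice, but the clean ran too early to see it.
    assert(afterFive.slices == 2 && afterFive.cleanInstants == 0)
    // 6th commit: the clean retires the old log-only slice and writes one clean instant.
    assert(afterSix.slices == 1 && afterSix.cleanInstants == 1)
  }
}
```

In the model, a clean instant appears only after the 6th commit, which matches the reason given above for extending the tests by one more merge-update.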

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds Lance file format variants of the existing MoR log-only compaction tests for VECTOR, BLOB, and VARIANT custom types, plus a test-harness fix to clear the JVM-static HoodieInMemoryHashIndex map between tests. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

hudi-bot · 2026-04-27T21:39:44Z

CI report:

cf663ca Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

codecov-commenter · 2026-05-10T15:20:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.80%. Comparing base (4f3e885) to head (cf663ca).
⚠️ Report is 41 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18597      +/-   ##
============================================
- Coverage     68.90%   67.80%   -1.11%     
- Complexity    28541    28810     +269     
============================================
  Files          2480     2518      +38     
  Lines        136910   140594    +3684     
  Branches      16679    17420     +741     
============================================
+ Hits          94341    95327     +986     
- Misses        34980    37446    +2466     
- Partials       7589     7821     +232

Flag	Coverage Δ
common-and-other-modules	`44.35% <ø> (-0.08%)`	⬇️
hadoop-mr-java-client	`44.85% <ø> (+0.02%)`	⬆️
spark-client-hadoop-common	`48.41% <ø> (-0.12%)`	⬇️
spark-java-tests	`49.05% <ø> (-0.44%)`	⬇️
spark-scala-tests	`44.98% <ø> (-0.30%)`	⬇️
utilities	`37.34% <ø> (-0.65%)`	⬇️

Flags with carried forward coverage won't be shown.
See 116 files with indirect coverage changes.

voonhous requested review from rahil-c and yihua April 26, 2026 15:13

voonhous changed the title ~~test(schema): Add lance fileformat test~~ test(schema): Add lance fileformat test for custom types on MOR Apr 26, 2026

github-actions Bot added the size:XL PR with lines of changes > 1000 label Apr 26, 2026

hudi-agent reviewed Apr 26, 2026

voonhous mentioned this pull request Apr 26, 2026

feat(lance): Add VariantType support to Lance base files #18599

Open

3 tasks

voonhous force-pushed the add-lance-fileformat-test branch from 210f287 to 19c9a6a Compare April 27, 2026 10:39

hudi-agent reviewed Apr 27, 2026

hudi-agent mentioned this pull request Apr 27, 2026

[OSS PR #18597] test(schema): Add lance fileformat test for custom types on MOR hudi-agent/hudi#23

Open

voonhous force-pushed the add-lance-fileformat-test branch from 19c9a6a to cf663ca Compare April 27, 2026 20:27

hudi-agent reviewed Apr 27, 2026

voonhous mentioned this pull request Apr 30, 2026

feat(blob): Accept partial {type,data} or {type,reference} structs on write #18665

Open

3 tasks

voonhous added this to the release-1.2.0 milestone May 13, 2026

rahil-c removed this from the release-1.2.0 milestone May 15, 2026

voonhous wants to merge 2 commits into apache:master from voonhous:add-lance-fileformat-test
