test(schema): Add lance fileformat test for custom types on MOR #18597

Open
voonhous wants to merge 2 commits into apache:master from voonhous:add-lance-fileformat-test

Conversation

Member

@voonhous voonhous commented Apr 26, 2026

Describe the issue this Pull Request addresses

Fixes: #18602

Follow-up to #18583: adds the same tests, but with Lance as the base file format.

Note: Merge this after #18583 and #18599 are merged.

Summary and Changelog

Add the MERGE INTO log-only tests from #18583, using Lance as the base file format.

Impact

None

Risk Level

low

The change only re-attaches metadata that the catalog already owns, scoped to fields with a matching target. Nullability narrowing is opt-in and only enabled on paths where Spark guarantees no nulls upstream.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Cover the invariant that the HoodieSchema.TYPE_METADATA_FIELD descriptor
and payload shape of a custom-typed column survive inline compaction of
a log-only MOR table into a base file.

- TestVectorDataSource: add testMorLogOnlyCompactionPreservesVectorMetadata
  (5 commits via SQL + MERGE INTO to trigger default inline compaction).
- TestVariantDataType: equivalent VARIANT test, gated on Spark 4.0+,
  asserting native VariantType round-trips through compaction.
- TestBlobDataType (new): BLOB INLINE and BLOB OUT_OF_LINE cases. Inline
  uses named_struct with hex byte literals; out-of-line creates real files
  via BlobTestHelpers.createTestFile and verifies bytes via read_blob().
@voonhous voonhous requested review from rahil-c and yihua April 26, 2026 15:13
@voonhous voonhous changed the title from "test(schema): Add lance fileformat test" to "test(schema): Add lance fileformat test for custom types on MOR" Apr 26, 2026
@github-actions github-actions Bot added the size:XL PR with lines of changes > 1000 label Apr 26, 2026
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR mirrors the merge-into log-only tests from #18583 against the Lance base file format and adds a corresponding non-Lance vector test. No correctness issues found. One naming/duplication nit in the vector test: the readOrdered and embeddingOf helpers are defined twice across the two near-identical test methods. Otherwise the tests are clean and well-commented. The inline comments carry a few style/readability suggestions; please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.

cc @yihua

| )
""".stripMargin)

def readOrdered(): Seq[Row] =

🤖 nit: readOrdered and embeddingOf are defined identically inside both testMorLogOnlyCompactionPreservesVectorMetadata and this Lance variant. Could you hoist them to private class-level helpers so changes only need to be made in one place?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
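
A minimal sketch of the hoisting the comment above asks for, not the PR's actual code. The real helpers work on Spark `Row`s and their signatures are not shown in this conversation, so `Rec`, `loadRows`, and the table names below are hypothetical stand-ins that let the example run without Spark. The only point is the shape: both helpers live once at class (here, trait) level and every test method calls the shared copy.

```scala
// Hypothetical stand-in for org.apache.spark.sql.Row so the sketch runs without Spark.
final case class Rec(id: Int, embedding: Seq[Float])

// Shared helpers, defined once. In the real suite these would be private
// methods on the test class rather than members of a trait.
trait VectorReadHelpers {
  // Hypothetical hook: the real tests would run a query against the table here.
  def loadRows(tableName: String): Seq[Rec]

  // Read all rows in a deterministic order so assertions are stable.
  def readOrdered(tableName: String): Seq[Rec] =
    loadRows(tableName).sortBy(_.id)

  // Look up the embedding of a single row by id.
  def embeddingOf(rows: Seq[Rec], id: Int): Seq[Float] =
    rows.find(_.id == id).map(_.embedding)
      .getOrElse(throw new NoSuchElementException(s"no row with id=$id"))
}

object HelperDemo extends VectorReadHelpers {
  // In-memory data standing in for a parquet-backed and a Lance-backed table.
  private val tables: Map[String, Seq[Rec]] = Map(
    "parquet_tbl" -> Seq(Rec(2, Seq(0.5f, 0.25f)), Rec(1, Seq(1.0f, 2.0f))),
    "lance_tbl"   -> Seq(Rec(1, Seq(1.0f, 2.0f)))
  )

  def loadRows(tableName: String): Seq[Rec] = tables(tableName)

  def main(args: Array[String]): Unit = {
    // Both "test methods" reuse the same helpers instead of redefining them.
    val parquetRows = readOrdered("parquet_tbl")
    val lanceRows = readOrdered("lance_tbl")
    println(parquetRows.map(_.id))
    println(embeddingOf(lanceRows, 1))
  }
}
```

With this shape, a change to the ordering or lookup logic is made in one place and picked up by both the parquet and the Lance variant.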

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR extends the existing custom-type tests (VECTOR / BLOB / VARIANT on log-only MOR + compaction) with parallel Lance base-file variants, gated on lance.skip.tests. No correctness issues found. A couple of "expected to fail" comments in the Lance tests have no corresponding failure assertion, so a reader (or CI triage) can't tell whether a red build is intentional or a regression. The inline comments carry a few style/readability suggestions; please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.

cc @yihua

}

test("Test Query Log Only MOR Table With BLOB OUT_OF_LINE column triggers compaction (Lance)") {
assume(System.getProperty("lance.skip.tests") != "true",

🤖 nit: 'Expected to fail' without a fail(), assertThrows, or @Disabled leaves a future reader — or someone triaging a red CI run — unable to tell whether this is an intentional known gap or a regression. Could you make the intent explicit, e.g. add fail("Lance OUT_OF_LINE BLOB not yet supported — remove when RFC-100 Phase 2 lands") after the assume guard?

- AI-generated; verify before applying. React 👍/👎 to flag quality.
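
One way to make the intent explicit, sketched under assumptions rather than taken from the PR: the suite is assumed to be on ScalaTest 3.1+ (`AnyFunSuite`), and `writeOutOfLineBlob`, its exception type, and the assume message are hypothetical placeholders for the real write path. The first test uses `intercept` (ScalaTest's counterpart to the `assertThrows` mentioned above) so the expected failure is asserted. The second uses ScalaTest's `pendingUntilFixed`, a swapped-in alternative to a bare `fail()`, which reports the test as pending while the body throws and fails once the body starts passing.

```scala
import org.scalatest.funsuite.AnyFunSuite

class ExpectedFailureSketch extends AnyFunSuite {

  // Hypothetical stand-in for the write that is expected to fail today.
  // The real exception type and message are not shown in this conversation.
  private def writeOutOfLineBlob(): Unit =
    throw new UnsupportedOperationException("hypothetical: type not supported by the writer")

  test("expected failure is asserted instead of only commented") {
    // Same gate expression as the tests under review; the message is illustrative.
    assume(System.getProperty("lance.skip.tests") != "true", "Lance tests skipped")

    // The test is green only while the known gap still exists.
    val e = intercept[UnsupportedOperationException] {
      writeOutOfLineBlob()
    }
    assert(e.getMessage.contains("not supported"))
  }

  test("alternative: report as pending until the gap is closed") {
    assume(System.getProperty("lance.skip.tests") != "true", "Lance tests skipped")

    // Pending while the body throws; turns into a failure once the body
    // succeeds, which is the reminder to remove this wrapper.
    pendingUntilFixed {
      writeOutOfLineBlob()
    }
  }
}
```

Either form keeps CI output unambiguous: a red build then means something other than the known gap changed.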


test("Test Query Log Only MOR Table With VARIANT column triggers compaction (Lance)") {
assume(HoodieSparkUtils.gteqSpark4_0, "Variant type requires Spark 4.0 or higher")
assume(System.getProperty("lance.skip.tests") != "true",

🤖 nit: same pattern as in TestBlobDataType — 'Expected to fail' with no explicit fail() or @Disabled means a CI failure from this test is indistinguishable from a regression. Could you add a fail("Lance VARIANT not yet supported — remove when RFC-100 Phase 2 lands") so the intent is clear in the failure output?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Mirror the parquet MOR log-only compaction tests for VECTOR, VARIANT, and
BLOB onto the Lance base file format, and extend all variants with a
6th deltacommit so the cleaner has a chance to retire the post-compaction
log-only slice and write a .clean instant.

- VECTOR Lance: passes; verifies HoodieFileFormat.LANCE on the table
  config and that a .lance base file exists under the table path after
  compaction.
- VARIANT Lance / BLOB INLINE Lance / BLOB OUT_OF_LINE Lance: gated by
  -Dlance.skip.tests; expected to fail at HoodieSparkLanceWriter ->
  LanceArrowUtils.toArrowType (RFC-100 Phase 2 gap). Each asserts the
  LANCE format config sticks to hoodie.properties immediately after
  CREATE TABLE so the table-level invariant is checked even when the
  writer fails downstream.
- All 8 tests (4 parquet + 4 Lance) now drive a 6th merge-update after
  the compaction-triggering 5th commit. The 5th commit's auto-clean
  runs before inline compaction, so the prior log slice is not yet
  superseded; the 6th commit's postCommit clean retires it and writes
  the .clean instant. The cleaner-timeline assertion uses
  reloadActiveTimeline() to avoid a stale cached view.
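
For readers unfamiliar with the `-Dlance.skip.tests` gate mentioned in the commit message above, here is a small stand-alone sketch of how the guard expression used in these tests, `System.getProperty("lance.skip.tests") != "true"`, evaluates. The object and method names are hypothetical; only the property key and the comparison come from the test code quoted in this conversation.

```scala
object LanceGateSketch {
  val Key = "lance.skip.tests"

  // Mirrors the guard expression from the tests: the Lance tests run unless
  // the property is exactly the string "true". Scala's != is null-safe, so an
  // unset property (null) also counts as "run the tests".
  def lanceTestsEnabled(): Boolean = System.getProperty(Key) != "true"

  def main(args: Array[String]): Unit = {
    System.clearProperty(Key)
    println(s"unset -> ${lanceTestsEnabled()}")   // unset -> true

    System.setProperty(Key, "true")
    println(s"true  -> ${lanceTestsEnabled()}")   // true  -> false

    System.setProperty(Key, "false")
    println(s"false -> ${lanceTestsEnabled()}")   // false -> true

    // The comparison is case-sensitive, so "TRUE" does not skip the tests.
    System.setProperty(Key, "TRUE")
    println(s"TRUE  -> ${lanceTestsEnabled()}")   // TRUE  -> true

    System.clearProperty(Key)
  }
}
```

In other words, the Lance variants run by default and are skipped only when the JVM is started with `-Dlance.skip.tests=true`.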
@voonhous voonhous force-pushed the add-lance-fileformat-test branch from 19c9a6a to cf663ca Compare April 27, 2026 20:27
Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds Lance file format variants of the existing MoR log-only compaction tests for VECTOR, BLOB, and VARIANT custom types, plus a test-harness fix to clear the JVM-static HoodieInMemoryHashIndex map between tests. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@hudi-bot
Collaborator

CI report:

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.80%. Comparing base (4f3e885) to head (cf663ca).
⚠️ Report is 41 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18597      +/-   ##
============================================
- Coverage     68.90%   67.80%   -1.11%     
- Complexity    28541    28810     +269     
============================================
  Files          2480     2518      +38     
  Lines        136910   140594    +3684     
  Branches      16679    17420     +741     
============================================
+ Hits          94341    95327     +986     
- Misses        34980    37446    +2466     
- Partials       7589     7821     +232     
| Flag | Coverage Δ | |
|---|---|---|
| common-and-other-modules | 44.35% <ø> (-0.08%) | ⬇️ |
| hadoop-mr-java-client | 44.85% <ø> (+0.02%) | ⬆️ |
| spark-client-hadoop-common | 48.41% <ø> (-0.12%) | ⬇️ |
| spark-java-tests | 49.05% <ø> (-0.44%) | ⬇️ |
| spark-scala-tests | 44.98% <ø> (-0.30%) | ⬇️ |
| utilities | 37.34% <ø> (-0.65%) | ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.
see 116 files with indirect coverage changes

@voonhous voonhous added this to the release-1.2.0 milestone May 13, 2026
@rahil-c rahil-c removed this from the release-1.2.0 milestone May 15, 2026

Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

INLINE blob + Lance basefile format write fail

5 participants