test(schema): Add lance fileformat test for custom types on MOR#18597
test(schema): Add lance fileformat test for custom types on MOR#18597voonhous wants to merge 2 commits into
Conversation
Cover the invariant that the HoodieSchema.TYPE_METADATA_FIELD descriptor and payload shape of a custom-typed column survive inline compaction of a log-only MOR table into a base file. - TestVectorDataSource: add testMorLogOnlyCompactionPreservesVectorMetadata (5 commits via SQL + MERGE INTO to trigger default inline compaction). - TestVariantDataType: equivalent VARIANT test, gated on Spark 4.0+, asserting native VariantType round-trips through compaction. - TestBlobDataType (new): BLOB INLINE and BLOB OUT_OF_LINE cases. Inline uses named_struct with hex byte literals; out-of-line creates real files via BlobTestHelpers.createTestFile and verifies bytes via read_blob().
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR mirrors the merge-into log-only tests from #18583 against the Lance base file format and adds a corresponding non-Lance vector test. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming/duplication nit in the vector test — the readOrdered and embeddingOf helpers are defined twice across the two near-identical test methods; otherwise the tests are clean and well-commented.
cc @yihua
| | ) | ||
| """.stripMargin) | ||
|
|
||
| def readOrdered(): Seq[Row] = |
There was a problem hiding this comment.
🤖 nit: readOrdered and embeddingOf are defined identically inside both testMorLogOnlyCompactionPreservesVectorMetadata and this Lance variant. Could you hoist them to private class-level helpers so changes only need to be made in one place?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
210f287 to
19c9a6a
Compare
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR extends the existing custom-type tests (VECTOR / BLOB / VARIANT on log-only MOR + compaction) with parallel Lance base-file variants, gated on lance.skip.tests. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of "expected to fail" comments in the Lance tests that have no corresponding failure assertion — a reader (or CI triage) can't tell whether a red build is intentional or a regression.
cc @yihua
| } | ||
|
|
||
| test("Test Query Log Only MOR Table With BLOB OUT_OF_LINE column triggers compaction (Lance)") { | ||
| assume(System.getProperty("lance.skip.tests") != "true", |
There was a problem hiding this comment.
🤖 nit: 'Expected to fail' without a fail(), assertThrows, or @Disabled leaves a future reader — or someone triaging a red CI run — unable to tell whether this is an intentional known gap or a regression. Could you make the intent explicit, e.g. add fail("Lance OUT_OF_LINE BLOB not yet supported — remove when RFC-100 Phase 2 lands") after the assume guard?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
|
|
||
| test("Test Query Log Only MOR Table With VARIANT column triggers compaction (Lance)") { | ||
| assume(HoodieSparkUtils.gteqSpark4_0, "Variant type requires Spark 4.0 or higher") | ||
| assume(System.getProperty("lance.skip.tests") != "true", |
There was a problem hiding this comment.
🤖 nit: same pattern as in TestBlobDataType — 'Expected to fail' with no explicit fail() or @Disabled means a CI failure from this test is indistinguishable from a regression. Could you add a fail("Lance VARIANT not yet supported — remove when RFC-100 Phase 2 lands") so the intent is clear in the failure output?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
Mirror the parquet MOR log-only compaction tests for VECTOR, VARIANT, and BLOB onto the Lance base file format, and extend all variants with a 6th deltacommit so the cleaner has a chance to retire the post-compaction log-only slice and write a .clean instant. - VECTOR Lance: passes; verifies HoodieFileFormat.LANCE on the table config and that a .lance base file exists under the table path after compaction. - VARIANT Lance / BLOB INLINE Lance / BLOB OUT_OF_LINE Lance: gated by -Dlance.skip.tests; expected to fail at HoodieSparkLanceWriter -> LanceArrowUtils.toArrowType (RFC-100 Phase 2 gap). Each asserts the LANCE format config sticks to hoodie.properties immediately after CREATE TABLE so the table-level invariant is checked even when the writer fails downstream. - All 8 tests (4 parquet + 4 Lance) now drive a 6th merge-update after the compaction-triggering 5th commit. The 5th commit's auto-clean runs before inline compaction, so the prior log slice is not yet superseded; the 6th commit's postCommit clean retires it and writes the .clean instant. The cleaner-timeline assertion uses reloadActiveTimeline() to avoid a stale cached view.
19c9a6a to
cf663ca
Compare
hudi-agent
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds Lance file format variants of the existing MoR log-only compaction tests for VECTOR, BLOB, and VARIANT custom types, plus a test-harness fix to clear the JVM-static HoodieInMemoryHashIndex map between tests. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #18597 +/- ##
============================================
- Coverage 68.90% 67.80% -1.11%
- Complexity 28541 28810 +269
============================================
Files 2480 2518 +38
Lines 136910 140594 +3684
Branches 16679 17420 +741
============================================
+ Hits 94341 95327 +986
- Misses 34980 37446 +2466
- Partials 7589 7821 +232
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
Describe the issue this Pull Request addresses
Fixes: #18602
Following up with #18583 to add the same tests, but using Lance as file format.
Note: Merge this after #18583 and #18599 is merged.
Summary and Changelog
Add merge-into log only tests in #18583 using lance basefile format.
Impact
None
Risk Level
low
The change only re-attaches metadata that the catalog already owns, scoped to fields with a matching target. Nullability narrowing is opt-in and only enabled on paths where Spark guarantees no nulls upstream.
Documentation Update
none
Contributor's checklist