feat(blob): Accept partial {type,data} or {type,reference} structs on write #18665

voonhous wants to merge 2 commits into
Conversation
Force-pushed from dcb21ee to 9947842
hudi-agent left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds a quality-of-life improvement that lets users write BLOB columns with partial {type,data} (INLINE) or {type,reference} (OUT_OF_LINE) structs, padding the missing sibling at the writer entry point. The recursion through struct/array/map and the SQL-side canonicalizer for INSERT INTO are nicely scoped, and the canonical-input short-circuit keeps it cheap. A couple of observations worth double-checking are called out in the inline comments, around metadata preservation in nested rebuilds and a few smaller edge cases. Please take a look at the inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of naming nits: rebuiltType breaks the surrounding pad* convention, and dfPreBlobPad is hard to parse at a glance — suggestions below.
```scala
private def padBlobStructValue(blobCol: Column, st: StructType, caseSensitive: Boolean): Column = {
  val key: String => String =
    if (caseSensitive) identity else (_: String).toLowerCase(Locale.ROOT)
  val present = st.fields.map(f => key(f.name)).toSet
```
🤖 When rebuilding a non-blob struct that contains a nested partial blob, the inner padField(...).as(f.name) doesn't propagate f.metadata, so the rebuilt struct's nested BLOB-tagged field loses its HoodieSchema.TYPE_METADATA_FIELD on the resulting DataFrame schema. The null branch via rebuiltType does preserve metadata, which makes the two branches asymmetric. Was this intentional (relying on the catalog/table schema downstream) or should it be .as(f.name, f.metadata) to keep validators / consumers that introspect the DF schema happy?
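A minimal sketch of the suggested fix, assuming the rebuild loop and the `padField` helper named in this diff (its exact signature is assumed here); the two-argument `Column.as(alias, metadata)` overload carries the field metadata through the rebuilt struct:

```scala
// Sketch only: rebuild a non-blob struct while keeping each field's metadata,
// so a nested BLOB-tagged field retains HoodieSchema.TYPE_METADATA_FIELD on the DF schema.
val rebuilt: Column = struct(st.fields.map { f =>
  // `padField` is assumed to be the recursive helper from this PR; `.as(name, metadata)`
  // is the Spark Column overload that attaches metadata to the resulting alias.
  padField(blobCol(f.name), f.dataType, caseSensitive).as(f.name, f.metadata)
}: _*)
```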
```scala
    Binding(attr, attr, None)
  case other =>
    // Deterministic alias name from the canonicalized form so identical nested
    // sub-trees (e.g. across SELECT and WHERE) get the same alias name.
```
🤖 Integer.toHexString(other.canonicalized.hashCode()) can collide between two genuinely different source expressions, in which case we'd inject two Aliases with the same name into the synthetic Project. Different ExprIds make plan addressing safe, but Spark name resolution within that Project would be ambiguous. Probably benign in practice — flagging in case a longer or UUID-suffixed alias name is preferable.
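One possible variant of the naming scheme, sketched under the assumption that the alias only needs to be deterministic and unique within the synthetic Project; the helper name and digest width are illustrative, not the PR's actual code:

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Derive a deterministic, collision-resistant alias suffix from the canonicalized
// expression's string form instead of the 32-bit hashCode.
def aliasSuffix(canonicalForm: String): String = {
  val digest = MessageDigest.getInstance("SHA-256")
    .digest(canonicalForm.getBytes(StandardCharsets.UTF_8))
  digest.take(8).map("%02x".format(_)).mkString // 64 bits of the digest, hex-encoded
}
```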
```scala
// RFC-100 BLOB QoL: accept partial INLINE-only `{type,data}` or OUT_OF_LINE-only
// `{type,reference}` user inputs by padding the missing sibling field with null at the
// ingest boundary. No-op for already-canonical 3-field structs.
val df = HoodieSparkSchemaConverters.padPartialBlobColumns(dfPreBlobPad)
```
🤖 Padding now runs on the prepped/streaming branches too (where the early-return previously kept sourceDf untouched). The canonical-input short-circuit should make this a no-op, but is there any prepped path where the input dataframe has already been encoded such that running this projection through it could change semantics (e.g. lazy plan invalidation, breaking pre-bound encoders)? If not, all good — just calling it out.
```scala
/**
 * Returns the post-padding DataType corresponding to `dataType`: every accepted partial
 * blob struct is replaced by `expectedBlobStructType`; nested struct/array/map containers
 * are rebuilt with their inner types similarly transformed. Used to provide the
```
🤖 nit: rebuiltType sits in the middle of a pad*-named family (padPartialBlobColumns, padDataType, padBlobStructValue, padField) but doesn't follow that convention. Could you rename it to paddedType (or canonicalType) so readers immediately understand its role without having to read the Javadoc?
| @@ -356,11 +357,15 @@ class HoodieSparkSqlWriterInternal { | |||
|
|
|||
| val shouldReconcileSchema = parameters(DataSourceWriteOptions.RECONCILE_SCHEMA.key()).toBoolean | |||
| val latestTableSchemaOpt = getLatestTableSchema(tableMetaClient, schemaFromCatalog) | |||
🤖 nit: dfPreBlobPad is a bit hard to parse at a glance. Could you call it unpaddedDf (or rawDf) to make the before/after relationship with val df = …padPartialBlobColumns(…) on line 367 immediately clear?
Mirror the parquet MOR log-only compaction tests for VECTOR, VARIANT, and BLOB onto the Lance base file format, and extend all variants with a 6th deltacommit so the cleaner has a chance to retire the post-compaction log-only slice and write a .clean instant.
- VECTOR Lance: passes; verifies HoodieFileFormat.LANCE on the table config and that a .lance base file exists under the table path after compaction.
- VARIANT Lance / BLOB INLINE Lance / BLOB OUT_OF_LINE Lance: gated by -Dlance.skip.tests; expected to fail at HoodieSparkLanceWriter -> LanceArrowUtils.toArrowType (RFC-100 Phase 2 gap). Each asserts the LANCE format config sticks to hoodie.properties immediately after CREATE TABLE so the table-level invariant is checked even when the writer fails downstream.
- All 8 tests (4 parquet + 4 Lance) now drive a 6th merge-update after the compaction-triggering 5th commit. The 5th commit's auto-clean runs before inline compaction, so the prior log slice is not yet superseded; the 6th commit's postCommit clean retires it and writes the .clean instant. The cleaner-timeline assertion uses reloadActiveTimeline() to avoid a stale cached view.
… write
INLINE writes now accept the natural `{type, data}` shape and OUT_OF_LINE
writes accept `{type, reference}`; the missing sibling field is auto-padded
with null at the writer ingest boundary so the canonical 3-field BLOB layout
is preserved on disk. Padding recurses through StructType, ArrayType, and
MapType (via Spark's transform / transform_values) so nested partial blobs
are handled too. Already-canonical inputs are a no-op.
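A rough sketch of how such a padding recursion can look against Spark's public Column API. The canonical field names, the `canonicalBlob` layout, and the INLINE-only match below are assumptions for illustration, not the PR's actual helpers:

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Assumed canonical 3-field blob layout, for illustration only.
val canonicalBlob: StructType = StructType(Seq(
  StructField("type", StringType),
  StructField("data", BinaryType),
  StructField("reference", StringType)))

// Recursively rebuild a column so a partial INLINE-only {type, data} struct gains a
// null `reference` sibling (the OUT_OF_LINE-only case is symmetric and omitted here).
def pad(colValue: Column, dt: DataType): Column = dt match {
  case st: StructType if st.fieldNames.toSet == Set("type", "data") =>
    // Preserve null structs, otherwise append the missing sibling as null.
    when(colValue.isNull, lit(null).cast(canonicalBlob)).otherwise(
      struct(
        colValue("type").as("type"),
        colValue("data").as("data"),
        lit(null).cast(StringType).as("reference")))
  case st: StructType =>
    // Rebuild non-blob structs field by field so nested partial blobs are reached.
    struct(st.fields.map(f => pad(colValue(f.name), f.dataType).as(f.name)): _*)
  case ArrayType(elementType, _) =>
    transform(colValue, e => pad(e, elementType))
  case MapType(_, valueType, _) =>
    transform_values(colValue, (_, v) => pad(v, valueType))
  case _ => colValue
}
```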
Force-pushed from 9947842 to b4f6f5e
hudi-agent left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds writer-side padding for partial BLOB struct inputs (INLINE {type,data} and OUT_OF_LINE {type,reference}) on both the DataFrame and SQL ingest paths, plus extends ReadBlobRule to handle nested source expressions. After tracing through the padding recursion (struct/array/map), the canonical-shape rebuild in padBlobStructValue, the InsertBlobCanonicalizer SQL path, and the HoodieInMemoryHashIndex.clear() test cleanup, the substantive concerns I'd raise are already captured by the prior round of review comments (metadata propagation through padField(...).as(f.name), hashCode-based alias naming collision risk, padding now applied to prepped/streaming branches, and a couple of naming nits). No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A couple of small naming and defensive-coding suggestions below.
cc @yihua
```diff
   val shouldReconcileSchema = parameters(DataSourceWriteOptions.RECONCILE_SCHEMA.key()).toBoolean
   val latestTableSchemaOpt = getLatestTableSchema(tableMetaClient, schemaFromCatalog)
-  val df = if (preppedWriteOperation || preppedSparkSqlWrites || preppedSparkSqlMergeInto || sourceDf.isStreaming) {
+  val dfPreBlobPad = if (preppedWriteOperation || preppedSparkSqlWrites || preppedSparkSqlMergeInto || sourceDf.isStreaming) {
```
🤖 nit: dfPreBlobPad names the variable by its position in the pipeline rather than by what it holds. Something like dfWithoutMetaCols or sourceDfStripped would communicate the semantic purpose more directly to a reader who hasn't just read the adjacent lines.
```scala
// field when a user supplies an INLINE-only `{type, data}` blob struct.
private lazy val expectedBlobReferenceStructType: DataType =
  expectedBlobStructType.fields
    .find(_.name == HoodieSchema.Blob.EXTERNAL_REFERENCE).get.dataType
```
🤖 nit: .find(...).get in a lazy val will surface as a bare NoSuchElementException with no context if EXTERNAL_REFERENCE is ever absent from expectedBlobStructType. Using .getOrElse(throw new IllegalStateException("Missing EXTERNAL_REFERENCE in canonical blob struct")) would give an actionable error if the schema ever changes.
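A sketch of that defensive variant; the surrounding declaration is copied from the diff above, only the lookup's failure path changes:

```scala
// Fail with an actionable message if the canonical blob struct ever loses its
// EXTERNAL_REFERENCE field, instead of surfacing a bare NoSuchElementException.
private lazy val expectedBlobReferenceStructType: DataType =
  expectedBlobStructType.fields
    .find(_.name == HoodieSchema.Blob.EXTERNAL_REFERENCE)
    .map(_.dataType)
    .getOrElse(throw new IllegalStateException(
      s"Missing ${HoodieSchema.Blob.EXTERNAL_REFERENCE} in canonical blob struct: $expectedBlobStructType"))
```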
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@             Coverage Diff              @@
##             master   #18665       +/-   ##
=============================================
+ Coverage     46.80%   68.07%    +21.26%
- Complexity    15316    29043     +13727
=============================================
  Files          1963     2518       +555
  Lines        107192   141251     +34059
  Branches      13007    17561      +4554
=============================================
+ Hits          50173    96156     +45983
+ Misses        51757    37183     -14574
- Partials       5262     7912      +2650
```

Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
BLOB writes require the full 3-field `{type, data, reference}` struct on every row, even when only one sibling is used (`reference` is unused for INLINE, `data` is unused for OUT_OF_LINE). The boilerplate is the first thing people hit when writing a blob.

Note: Merge this after:
Summary and Changelog
- INLINE writes accept `{type, data}`.
- OUT_OF_LINE writes accept `{type, reference}`.
- Padding recurses through `StructType`, `ArrayType`, `MapType` (via Spark `transform` / `transform_values`), so partial blobs nested inside complex types work too.

Changes:
- `HoodieSparkSchemaConverters`: new public `padPartialBlobColumns(df)` plus recursive helpers (`padField`, `padDataType`, `padBlobStructValue`, `rebuiltType`).
- `HoodieSparkSqlWriter.writeInternal`: pads the source DataFrame just before the schema-conversion / validation call.
- `BlobTestHelpers`: added `inlineBlobStructColMinimal` and `outOfLineBlobStructColMinimal`.
- `TestReadBlobSQL`: minimal-struct tests for INLINE and OUT_OF_LINE plus a nested struct/array/map case.
- `TestBlobDataType`: SQL `named_struct` minimal-literal tests for both INLINE and OUT_OF_LINE.

Impact
User-facing: BLOB writes accept fewer fields. On-disk layout: unchanged (still canonical 3-field).
Read path: untouched.
Performance: padding short-circuits on canonical inputs (single schema walk, no projection emitted).
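For reference, a minimal sketch of what such a single-pass short-circuit scan can look like; `isPartialBlob`, the field names, and the traversal shape are assumptions here, not the PR's actual code:

```scala
import org.apache.spark.sql.types._

// Assumed predicate for a partial {type,data} / {type,reference} blob struct.
def isPartialBlob(st: StructType): Boolean = {
  val names = st.fieldNames.toSet
  names == Set("type", "data") || names == Set("type", "reference")
}

// Single schema walk: if nothing needs padding, the caller can return the input
// DataFrame untouched and never emit a projection.
def needsPadding(dt: DataType): Boolean = dt match {
  case st: StructType if isPartialBlob(st) => true
  case st: StructType                      => st.fields.exists(f => needsPadding(f.dataType))
  case ArrayType(elementType, _)           => needsPadding(elementType)
  case MapType(keyType, valueType, _)      => needsPadding(keyType) || needsPadding(valueType)
  case _                                   => false
}
```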
Risk Level
low
Padding only fires when a partial blob field is detected by a quick schema scan. Canonical inputs hit an early return. Null-struct semantics are preserved with `when(col.isNull, lit(null))`.

Documentation Update
none
Contributor's checklist