
feat(blob): default blob.inline.mode to DESCRIPTOR for Lance #18744

Open

voonhous wants to merge 4 commits into apache:master from voonhous:address-#18742

Conversation

@voonhous
Member

@voonhous voonhous commented May 15, 2026

Describe the issue this Pull Request addresses

Closes: #18742

Summary and Changelog

Flip the default of hoodie.read.blob.inline.mode from CONTENT to DESCRIPTOR so plain reads on Lance tables stop materializing inline blob payloads on every row. CONTENT remains available as an opt-in.

Changes:

  • HoodieReaderConfig: default flipped to DESCRIPTOR; doc updated.
  • BatchedBlobReader: INLINE branch now falls back to the synthesized reference when inline_data is null, so read_blob() returns bytes under both modes. Previously it short-circuited on storage_type=INLINE and read the null field.
  • TestLanceDataSource:
    • testBlobInlineRoundTrip now opts into CONTENT explicitly (it exists to validate CONTENT semantics).
    • New testBlobInlineCompactionRoundTrip verifies INLINE bytes survive MOR compaction under the new default, asserted via the realistic user paths (plain read returns descriptor shape; read_blob() returns bytes).
  • rfc/rfc-100/rfc-100.md: replaced the ASCII visual with two mermaid diagrams covering row shape per (storage type × query × mode × file format) and read_blob() byte-resolution hop counts.

The existing compaction-side CONTENT pin in HoodieSparkLanceReader is already in place and unchanged.
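The BatchedBlobReader fallback described above can be sketched as follows. This is a minimal, self-contained illustration, not the actual reader code: the `BlobRow` record, its field names, and the `readInlineBlob` helper are hypothetical stand-ins for the Arrow accessors and dispatch in the real class.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class InlineFallbackSketch {
    // Hypothetical stand-in for the fields read_blob() needs from one row.
    record BlobRow(byte[] inlineData, String filePath, long offset, long length) {}

    // Dispatch for a storage_type=INLINE row: prefer materialized bytes
    // (CONTENT shape); otherwise fall back to a raw positional read via
    // the synthesized reference (DESCRIPTOR shape).
    static byte[] readInlineBlob(BlobRow row) throws IOException {
        if (row.inlineData() != null) {
            return row.inlineData(); // CONTENT: bytes were decoded at scan time
        }
        // DESCRIPTOR: pread the (offset, length) range from the base file.
        try (RandomAccessFile f = new RandomAccessFile(row.filePath(), "r")) {
            byte[] buf = new byte[(int) row.length()];
            f.seek(row.offset());
            f.readFully(buf);
            return buf;
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempFile("blob", ".bin");
        Files.write(base, "xxxxhello".getBytes()); // blob bytes start at offset 4

        byte[] content = readInlineBlob(new BlobRow("hi".getBytes(), null, 0, 0));
        byte[] descriptor = readInlineBlob(new BlobRow(null, base.toString(), 4, 5));
        System.out.println(new String(content) + " " + new String(descriptor));
    }
}
```

The key point is the order of the checks: the old dispatch branched on storage_type alone, so DESCRIPTOR-shaped INLINE rows hit the null `inline_data` field; the fix is to branch on whether bytes are present.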

Impact

User-facing behavior change for Lance INLINE blobs: SELECT * now returns data=NULL plus a populated descriptor by default; callers materializing bytes should use read_blob() (or explicitly set hoodie.read.blob.inline.mode=CONTENT).

Parquet is unaffected. The Parquet reader does not currently honor hoodie.read.blob.inline.mode, so on Parquet tables the config is ignored and INLINE rows always come back in CONTENT shape regardless of the setting. The default flip has no observable effect there. The eventual spec'd behavior for Parquet + DESCRIPTOR (return data=NULL, reference=NULL, bytes via read_blob) is tracked separately and is not part of this PR.

Performance trade-off: plain reads skip blob decoding entirely (strict win over CONTENT); read_blob() on INLINE rows now does 2 hops instead of 1 when DESCRIPTOR mode is used.

Why 2 hops is the right default:

  • Hop 1 (scan) returns (position, size) pairs instead of decoded blob bytes — strictly less work than CONTENT mode's scan.
  • Hop 2 is a raw pread against the .lance file at (offset, length). Lance's blob encoding stores bytes contiguously at the reported position, so the read bypasses Lance's decoder entirely — no column-chunk walking, no Arrow materialization.
  • For the common query pattern (column reads, filters, joins that don't touch the blob), blob bytes are never decoded at all. This is the workload Lance is typically chosen for.
  • For byte-heavy workloads that materialize every blob (batch image processing, ML feature extraction, full-table exports), the second hop is real per-row overhead. Those callers should explicitly opt in to CONTENT:
    hoodie.read.blob.inline.mode = CONTENT
    
    This collapses back to a single hop and matches the prior default's behavior exactly.
  • Auto-promotion (detect read_blob() in the plan and force CONTENT for that scan) is feasible as a follow-up but carries non-trivial planner-rewrite complexity (relation cloning under shared subtrees, option-flow timing, column granularity). Deferred until usage data shows it matters.

In short: the default optimizes for the dominant Lance access pattern; byte-heavy callers flip one config to get the optimum for their pattern.
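As a usage sketch, the opt-in looks like any other Hudi read option. This assumes the config is passed through the Spark datasource the same way as other reader configs; the table path is illustrative.

```java
// Byte-heavy pipeline (e.g. full-table image export): opt back into the
// prior eager-materialization behavior for this read only.
Dataset<Row> df = spark.read().format("hudi")
    .option("hoodie.read.blob.inline.mode", "CONTENT")
    .load("/tables/image_table");
```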

Risk Level

Medium. The default flip changes observable read behavior for Lance INLINE blob columns. Mitigations:

  • The BatchedBlobReader fallback keeps read_blob() working in both modes, so the canonical bytes API is unaffected by the flip.
  • Compaction-side reader is hard-pinned to CONTENT and the new compaction test exercises the round-trip end-to-end.
  • Users on the prior default can restore it via hoodie.read.blob.inline.mode=CONTENT.

Documentation Update

  • HoodieReaderConfig.BLOB_INLINE_READ_MODE description updated to reflect the new default.
  • RFC-100 visual section refreshed with mermaid diagrams covering the new default.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@voonhous voonhous added this to the release-1.2.0 milestone May 15, 2026
@voonhous voonhous requested review from rahil-c and yihua May 15, 2026 09:13
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the RFC update! This flips the default of hoodie.read.blob.inline.mode from CONTENT to DESCRIPTOR so Lance INLINE blobs are no longer materialized on every plain read, with a BatchedBlobReader fallback so read_blob() still works under either shape. A few design-level points could use more attention — see inline comments. Once those are addressed, a Hudi committer or PMC member can take it from here for a deeper review.

Comment thread rfc/rfc-100/rfc-100.md Outdated
  public static final ConfigProperty<String> BLOB_INLINE_READ_MODE = ConfigProperty
      .key("hoodie.read.blob.inline.mode")
-     .defaultValue(BLOB_INLINE_READ_MODE_CONTENT)
+     .defaultValue(BLOB_INLINE_READ_MODE_DESCRIPTOR)
Contributor


🤖 This is a user-visible behavior change for SELECT * on INLINE blob columns. The config carries sinceVersion("1.2.0") — has 1.2.0 already shipped CONTENT as the default? If yes, the flip should bump sinceVersion/document the breaking change explicitly; if no, it would help to call that out so reviewers can confirm no released version locks in the old default. Either way, a short "Migration / Upgrade" subsection in the RFC pointing users to set hoodie.read.blob.inline.mode=CONTENT to preserve prior behavior would make this easier to roll out.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

val referenceStruct = accessor.getStruct(blobStruct, 2, HoodieSchema.Blob.getReferenceFieldCount)
val filePath = accessor.getString(referenceStruct, 0)
require(filePath != null && filePath.nonEmpty,
s"INLINE blob descriptor at row $rowIndex must have non-empty external_path")
Contributor


🤖 The INLINE-with-null-inline_data branch silently performs a remote pread against the synthetic descriptor's external_path. Could the RFC discuss the failure mode and security implications of trusting that path? If a malformed/forged base file produces a descriptor pointing at an arbitrary file or with an is_managed=true claim that isn't validated against the table's known file set, read_blob() could end up reading bytes outside the table. Some validation (e.g. is_managed must be true and the path must lie under the table base) seems worth calling out as a design invariant.


Comment thread rfc/rfc-100/rfc-100.md
- For DESCRIPTOR + Lance, hop 2 is a raw filesystem `pread` against the `.lance` file at the descriptor's `(offset, length)`. It bypasses Lance's decoder entirely — the blob encoding (`lance-encoding:blob=true` on a `LargeBinary` column) stores blob bytes contiguously at the position Lance reports, so direct byte access is safe.
- Plain `SELECT col` (no `read_blob`) is always 1 hop. DESCRIPTOR's win is that hop 1 skips blob decoding when bytes aren't needed.

### 3. Writer
Contributor


🤖 The RFC notes the DESCRIPTOR + Lance hop-2 read "bypasses Lance's decoder entirely" via raw pread at the recorded offset/length. Is there a stability contract from Lance that the (offset, length) reported on a blob-encoded LargeBinary column is guaranteed to be the on-disk byte range across Lance versions? If Lance ever switches blob storage to a chunked/compressed layout, raw pread would silently return garbage. Worth either citing the Lance contract here or noting the version pin / detection strategy.


Comment thread rfc/rfc-100/rfc-100.md
@@ -163,32 +163,46 @@ Lance's native blob encoding stores blobs in a way that already exposes a `(file

**Visual**
Contributor


🤖 It would help to add a short "Alternatives considered" note for the default flip — e.g., gating DESCRIPTOR per-table/per-query vs flipping the global default, or making DESCRIPTOR the default only when the engine signals projection without read_blob(). The current text presents the flip as the only option; documenting why a global flip is preferable would strengthen the RFC.


@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label May 15, 2026
voonhous added 3 commits May 15, 2026 19:34
Flip the default of hoodie.read.blob.inline.mode from CONTENT to
DESCRIPTOR so plain reads (e.g. SELECT *) on Lance tables stop paying
the I/O cost of materializing inline blob payloads. CONTENT remains
available as an opt-in for callers that need the raw bytes in `data`.

The compactor's CONTENT pin is already in place in HoodieSparkLanceReader,
so compaction continues to read bytes regardless of the user-facing
default and round-trips them through the base-file rewrite.

Also fix BatchedBlobReader: under DESCRIPTOR mode, INLINE rows arrive
with null `inline_data` and a synthesized reference. The previous
dispatch short-circuited on storage_type=INLINE and read the null
field, breaking read_blob(). The INLINE branch now falls back to the
descriptor's positional range read when inline_data is absent, so
read_blob() works in both modes.

Tests:
- testBlobInlineRoundTrip now opts into CONTENT explicitly (it exists
  specifically to validate CONTENT semantics).
- New testBlobInlineCompactionRoundTrip verifies INLINE bytes survive
  MOR compaction read/rewrite under the new default, asserted via the
  realistic user paths (plain read + read_blob()).
Replace the ASCII visual with two mermaid diagrams: a storage-rooted
dispatch tree showing what the user gets back for SELECT col vs
SELECT read_blob(col) across modes and file formats, and a hop-count
view of read_blob() byte resolution that makes the CONTENT vs
DESCRIPTOR + Lance trade-off explicit.
…o-op

- Swap CONTENT/DESCRIPTOR default labels in the mermaid diagrams and
  surrounding prose so the doc matches HoodieReaderConfig's new
  default.
- Make the matrix's Parquet + DESCRIPTOR row reflect current behavior
  (config is ignored on Parquet today; rows come back in CONTENT
  shape) with a footnote, and split the eventual spec'd behavior into
  its own paragraph marked not-yet-implemented. Mermaid Parquet
  branches updated to match.
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the RFC update! The default flip plus the BatchedBlobReader fallback and new compaction round-trip test together make a coherent story, and the mermaid diagrams are a nice readability win. A few architectural questions remain around lifecycle of the synthetic descriptor and an undocumented invariant in the compaction reader path that I think are worth surfacing in the RFC itself before a Hudi committer or PMC member takes it from here.

Comment thread rfc/rfc-100/rfc-100.md
- `hoodie.read.blob.inline.mode` — values `CONTENT` (default) | `DESCRIPTOR`.
- `hoodie.read.blob.inline.mode` — values `DESCRIPTOR` (default) | `CONTENT`.
- `DESCRIPTOR` (default): the engine returns an `OUT_OF_LINE`-shaped descriptor in the `reference` field where the underlying file format supports it (Lance today), enabling lazy byte materialization via `read_blob`. For file formats without a native descriptor for inline payloads (Parquet), both `data` and `reference` are returned `NULL`, and the caller must use `read_blob` to retrieve bytes.
- `CONTENT`: the engine eagerly materializes inline bytes into the struct's `data` field.
Contributor


🤖 Could the RFC discuss the lifecycle of the synthetic DESCRIPTOR returned for INLINE+Lance? The descriptor encodes (base file path, offset, length) into a managed Hudi base file — if cleaning, clustering, or a subsequent compaction rewrites/removes that file before the caller invokes read_blob(), the deferred resolution will fail or read stale bytes. OUT_OF_LINE references are user-supplied and assumed externally stable; a synthetic descriptor pointing at a Hudi-managed file is not. Is there a contract about how long these descriptors stay valid (e.g. only within a single query/stage), and how that's enforced?


Comment thread rfc/rfc-100/rfc-100.md
| `SELECT col FROM table` | INLINE | Parquet | `CONTENT` | bytes | NULL | Yes — via `data` |
| `SELECT col FROM table` | INLINE | Parquet | `DESCRIPTOR` (default) | bytes¹ | NULL¹ | Yes — via `data`¹ |
| `SELECT col FROM table` | INLINE | Lance | `CONTENT` | bytes | NULL | Yes — via `data` |
| `SELECT col FROM table` | INLINE | Lance | `DESCRIPTOR` (default) | NULL | populated (Lance blob enc.) | No — descriptor visible; use `read_blob` for bytes|
Contributor


🤖 The new compaction test reveals that HoodieSparkLanceReader must hard-pin CONTENT regardless of hoodie.read.blob.inline.mode — otherwise the compactor reads data=NULL for untouched rows and silently rewrites the base file without bytes. That's a critical correctness invariant that the RFC doesn't currently mention. Could you add a note (probably in this section or in the Writer/Compaction section) that internal table-services readers MUST ignore this config and always materialize CONTENT? Otherwise a future refactor that "unifies" reader configs could re-introduce silent data loss. @yihua might want to weigh in.


filePath = filePath,
offset = offset,
length = length,
index = rowIndex
Contributor


🤖 The require(...) calls here will throw IllegalArgumentException mid-batch on any malformed INLINE+DESCRIPTOR row (null inline_data and null/incomplete reference). On a corrupted file this kills the whole task and surfaces as a confusing user error far from the cause. Would it be worth using a Hudi-typed exception with the file path + row index, or at least a top-level wrapper that names the column and source file? Same question for the offset/length sign checks — those would only fire on writer bugs, but the diagnostic is what determines whether anyone can debug it.


Comment thread rfc/rfc-100/rfc-100.md
F -->|Lance| LD["type = INLINE<br/>inline_data = NULL<br/>reference = synthetic managed<br/>path, offset, length, is_managed=true"]
F -->|"Parquet (today: mode no-op)"| PD["Parquet reader does not<br/>implement DESCRIPTOR yet —<br/>returns CONTENT shape:<br/>inline_data = bytes, reference = NULL"]

QI -->|"SELECT read_blob(col)"| RM{hoodie.read.blob.inline.mode}
Contributor


🤖 The Lance branch shows reference = synthetic managed path, offset, length, is_managed=true. Could you spell out what is_managed=true means semantically and how it differs from a user-supplied OUT_OF_LINE reference downstream — e.g., does it gate cleaning, does read_blob() route differently, can a writer round-trip a row carrying this flag back into the table? This flag appears to be the only signal distinguishing "safe to clean — Hudi owns the bytes" from "do not clean — user owns the bytes", which is load-bearing.


The INLINE-descriptor fallback in BatchedBlobReader (the branch that
preads bytes via the synthesized reference when inline_data is null) is
load-bearing for mixed-usage queries. If read_blob() force-flipped the
scan back to CONTENT shape, reference.* would all be null in the same
projection. Fold a single co-projection SELECT into the existing
testBlobInlineDescriptorMode fixture to lock in this contract without
adding a new test method.
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the iteration! The descriptor-fallback path in BatchedBlobReader, the MOR compaction round-trip test, and the clarified Parquet no-op documentation address the substantive concerns. Prior rounds have already raised the key design points (compaction CONTENT pin, mid-batch require behavior, synthetic descriptor lifecycle, Parquet semantics). A Hudi committer or PMC member can take it from here for the final design call.

cc @yihua

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.00%. Comparing base (071b3f1) to head (97bc5de).
⚠️ Report is 6 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (071b3f1) and HEAD (97bc5de).

HEAD has 32 uploads less than BASE
Flag BASE (071b3f1) HEAD (97bc5de)
spark-scala-tests 12 0
utilities 1 0
spark-java-tests 18 0
common-and-other-modules 1 0
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18744       +/-   ##
=============================================
- Coverage     68.14%   54.00%   -14.14%     
+ Complexity    29094    12459    -16635     
=============================================
  Files          2517     1434     -1083     
  Lines        141113    72161    -68952     
  Branches      17508     8245     -9263     
=============================================
- Hits          96160    38974    -57186     
+ Misses        37046    29691     -7355     
+ Partials       7907     3496     -4411     
Flag Coverage Δ
common-and-other-modules ?
hadoop-mr-java-client 45.01% <100.00%> (+<0.01%) ⬆️
spark-client-hadoop-common 48.32% <100.00%> (-0.01%) ⬇️
spark-java-tests ?
spark-scala-tests ?
utilities ?

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
.../apache/hudi/common/config/HoodieReaderConfig.java 98.14% <100.00%> (-1.86%) ⬇️

... and 1861 files with indirect coverage changes


@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build


Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Default hoodie.read.blob.inline.mode to DESCRIPTOR for Lance (compaction pinned to CONTENT)

4 participants