test(spark): Add tests for batch-mode blob reads #18736
Conversation
…tching

Existing tests for the BatchedBlobReader (PR apache#18098) only assert byte correctness of the data returned, so they cannot detect a regression that causes the batching optimization to stop reducing I/O. Add two test classes that close that gap:

- TestBatchedBlobReaderMerge: direct unit tests on mergeRanges and identifyConsecutiveRanges (newly package-private), asserting merged range counts, gap-threshold boundaries, multi-file grouping, sort, index preservation, and overlap rejection. No Spark, no I/O.
- TestBatchedBlobReaderIO: integration tests that drive processPartition with a CountingHoodieStorage wrapper around a real storage, asserting openSeekable/seek counts across four scenarios: many blobs in one file, contiguous zero-gap blobs, threshold-controlled small/large gaps including the inclusive boundary, and multi-file queries (per-file batching, mixed gap patterns, interleaved input order).
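For orientation, here is a hedged sketch of the gap-merge rule these scenarios pin down. Names and signatures are simplified stand-ins, not the actual BatchedBlobReader API; the point is the inclusive-threshold comparison the commit message describes:

```scala
// Simplified model of the range-merge rule under test: two byte ranges
// in the same file merge when the gap between them is at most
// maxGapBytes (inclusive boundary). Input is assumed sorted by
// (filePath, start). All names here are illustrative.
case class Range(filePath: String, start: Long, end: Long)

def mergeSorted(ranges: Seq[Range], maxGapBytes: Long): Seq[Range] =
  ranges.foldLeft(List.empty[Range]) {
    case (prev :: rest, next)
        if prev.filePath == next.filePath &&
           next.start - prev.end <= maxGapBytes =>
      // Gap within threshold: extend the previous merged range.
      prev.copy(end = math.max(prev.end, next.end)) :: rest
    case (acc, next) =>
      // New file, or gap above threshold: start a new merged range.
      next :: acc
  }.reverse
```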
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds unit tests for BatchedBlobReader's merge algorithm and integration tests that count I/O operations via a CountingHoodieStorage decorator, plus a minimal private → private[blob] visibility relaxation on two helper methods. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming nit in the merge-test helper — everything else reads cleanly.
cc @yihua
```scala
private def reader(maxGapBytes: Int = 4096) =
  new BatchedBlobReader(storage = null, maxGapBytes = maxGapBytes, lookaheadRows = 50)

private def row(filePath: String, offset: Long, length: Long, index: Long = 0L): RowInfo[Row] =
```
🤖 nit: the helper is named row but returns a RowInfo[Row], and org.apache.spark.sql.Row is imported at the top — could you rename it to rowInfo to avoid the double-take when reading call sites like Seq(row("/f", 1000, 200))?
TestSparkSqlHudiPackageStructure rejects Scala test classes under org.apache.spark.sql.hudi.* that are not in the curated allow-list, which is mirrored in two CI configs that drive the wildcard suite filter. The new BatchedBlobReader merge/IO tests live in the .blob package so they can reach private[blob] helpers (mergeRanges, identifyConsecutiveRanges, RowAccessor) without widening internal visibility, matching the same-package pattern already used for the .analysis, .catalog, and .command production sub-packages. Add 'org.apache.spark.sql.hudi.blob' to all three places:

- ALLOWED_PACKAGES in TestSparkSqlHudiPackageStructure
- job6HudiSparkDdlOthersWildcardSuites in azure-pipelines-20230430.yml
- SCALA_TEST_OTHERS_FILTER in .github/workflows/bot.yml
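A minimal sketch of why same-package placement grants access, shown as two files in one block for brevity. The member signature is an assumption simplified from the commit text, not the real one:

```scala
// File: BatchedBlobReader.scala (production sources)
package org.apache.spark.sql.hudi.blob

class BatchedBlobReader(maxGapBytes: Int) {
  // Visible to any class under org.apache.spark.sql.hudi.blob,
  // including test classes declared in the same package.
  private[blob] def mergeRanges(ranges: Seq[(Long, Long)]): Seq[(Long, Long)] = ???
}

// File: TestBatchedBlobReaderMerge.scala (test sources, same package)
package org.apache.spark.sql.hudi.blob

class TestBatchedBlobReaderMerge {
  def demo(): Unit = {
    val reader = new BatchedBlobReader(maxGapBytes = 4096)
    reader.mergeRanges(Seq((0L, 10L))) // compiles: private[blob] access
  }
}
```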
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds a focused suite of merge-algorithm unit tests and I/O-count integration tests for BatchedBlobReader, plus a minimal private → private[blob] visibility relaxation needed to call the merge helpers directly. The test math (gap boundaries, file-grouping, interleaved input) checks out, and the CountingHoodieStorage decorator delegates correctly — close() propagation is safe because the underlying HoodieHadoopStorage.close() is intentionally a no-op. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.
cc @yihua
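The decorator idea the review refers to can be sketched as follows. This is a simplified stand-in: the real HoodieStorage interface has many more methods, all of which the test wrapper would delegate unchanged, and the method signature here is an assumption:

```scala
import java.util.concurrent.atomic.AtomicInteger

// Simplified stand-in for the storage interface under test.
trait SeekableStorage {
  def openSeekable(path: String): Array[Byte]
}

// Counts physical opens so a test can assert that N logical blob
// reads collapsed into M merged I/O operations, while delegating
// the actual read to the wrapped storage.
class CountingStorage(underlying: SeekableStorage) extends SeekableStorage {
  val openSeekableCount = new AtomicInteger(0)

  override def openSeekable(path: String): Array[Byte] = {
    openSeekableCount.incrementAndGet()
    underlying.openSeekable(path)
  }
}
```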
Drop tests that re-verify the merge algorithm at the I/O layer when TestBatchedBlobReaderMerge already covers it. Keep only the cases that prove the merge result maps 1:1 to physical I/O ops:

- batching reduces I/O end-to-end (single-file, batched vs baseline)
- mixed gaps in one file produce one I/O per merged group
- multi-file routing with interleaved input and mixed gap patterns

Reduces TestBatchedBlobReaderIO from 9 to 3 tests (~67% fewer Spark DataFrame builds and real file reads, the dominant runtime cost).
```scala
def testSingleRowProducesSingleRange(): Unit = {
  val merged = reader().mergeRanges(Seq(row("/f", 1000, 200, index = 7)), maxGap = 100)
  assertEquals(1, merged.size)
  val r = merged.head
  assertEquals("/f", r.filePath)
  assertEquals(1000L, r.startOffset)
  assertEquals(1200L, r.endOffset)
  assertEquals(1, r.rows.size)
  assertEquals(7L, r.rows.head.index)
}
```
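For comparison, a hedged sketch of the inclusive-boundary case the commit messages describe, written in the same style and reusing the reader/row helpers shown above (the exact assertion names in the real suite may differ):

```scala
def testGapEqualToThresholdMerges(): Unit = {
  // First range ends at 1200, second starts at 1300: gap of 100 equals
  // maxGap, so the inclusive threshold should merge them into one range.
  val merged = reader().mergeRanges(
    Seq(row("/f", 1000, 200), row("/f", 1300, 50)), maxGap = 100)
  assertEquals(1, merged.size)

  // One byte past the threshold should split into two ranges.
  val split = reader().mergeRanges(
    Seq(row("/f", 1000, 200), row("/f", 1301, 50)), maxGap = 100)
  assertEquals(2, split.size)
}
```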
It would be better to test this with SQL so that it works end-to-end.
TestBatchedBlobReaderMerge already catches algorithmic regressions in the merge logic via direct calls to mergeRanges/identifyConsecutiveRanges, so the I/O-count layer was duplicative. Existing TestBatchedBlobReader and TestReadBlobSQL cover the readBatched API and the SQL surface respectively at byte-level granularity.

Replace TestBatchedBlobReaderIO with TestReadBlobBatching: SQL-driven tests that exercise the read_blob() path with batching configurations chosen to drive specific merge behaviors:

- 20 small reads in one file batched under maxGap=4096
- mixed small/large gaps in one file (above and below threshold)
- threshold-boundary case (gap == maxGap)
- multi-file interleaved input with mixed gap patterns

Each test asserts the returned bytes match the deterministic pattern at each row's recorded offset, validating both query success and output correctness through the SQL planner / BatchedBlobReadExec path.

Move to org.apache.hudi.blob (no longer needs private[blob] access). Drop CountingHoodieStorage. The package-allowlist entry for org.apache.spark.sql.hudi.blob is retained since TestBatchedBlobReaderMerge still needs same-package access to the merge helpers.
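A hedged sketch of what one such SQL-driven scenario could look like. The config key appears elsewhere in this thread, but the read_blob() argument list, table layout, and the expectedPatternAt helper are assumptions for illustration only:

```scala
// Assumes a SparkSession `spark` with the Hudi extensions registered,
// and a table blob_meta recording (id, file_path, offset, length).
spark.sql("SET hoodie.blob.batching.max.gap.bytes=4096")

val rows = spark.sql(
  """SELECT id, offset, read_blob(file_path, offset, length) AS payload
    |FROM blob_meta
    |ORDER BY id""".stripMargin).collect()

// Assert bytes, not I/O counts: each payload must equal the
// deterministic pattern recorded for that row's offset.
// expectedPatternAt is a hypothetical helper of the test fixture.
rows.foreach { r =>
  val payload = r.getAs[Array[Byte]]("payload")
  assert(payload.sameElements(expectedPatternAt(r.getAs[Long]("offset"))))
}
```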
yihua
left a comment
Existing coverage before this PR
TestBatchedBlobReader (calls readBatched API, validates bytes):
- testBasicBatchedRead — contiguous reads in one file
- testGapThresholdSmallGaps — 4 reads, 20-byte gaps, maxGapBytes=4096 → directly covers new IO scenario 3
- testGapThresholdLargeGaps — 4 reads, 9.9KB gaps, maxGapBytes=1000 → directly covers new IO scenario 4
- testNoBatchingDifferentFiles — multi-file, one read each
- testMixedScenario — batchable + non-batchable groups across 2 files → covers new IO scenarios 6, 7, 8
- testPreserveInputOrder, testEmptyDataset, testOverlappingRangesThrowsException, etc.
TestReadBlobSQL (uses read_blob() SQL function):
- testConfigurationParameters — sets hoodie.blob.batching.max.gap.bytes=10000 + lookahead.size=100, 3 reads with 4.9KB gap (only existing SQL test that exercises the batching config knob)
- testReadBlobMultipleFiles — multi-file reads via SQL
- testReadOutOfLineBlobOnHudiBackedTable, testBasicReadBlobSQL, joins, subqueries, etc.
So, it would be good to add SQL-level tests for batch reading of blobs. We should rewrite TestBatchedBlobReaderIO as a SQL-driven test focused on batching scenarios that the existing TestReadBlobSQL doesn't already exercise. Validate bytes, not I/O counts.
TestBatchedBlobReaderMerge is the only place that catches a merge-algorithm regression, and we can keep it.
Each scenario now bulk_inserts a Hudi table with the blob column and
reads it back via sparkSession.read.format("hudi"), then runs
SELECT read_blob(...) against the loaded view. This exercises
HoodieFileIndex and BatchedBlobReadExec serialization in addition to
the merge/batching path, matching the production read plan rather than
a Spark-only temp view.
Adds writeHudiBlobTable helper that coerces the input DataFrame to the
canonical BlobType schema (nullable reference struct, as required by
HoodieSparkSchemaConverters.validateBlobStructure) before save.
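A hedged sketch of the write-then-read round trip this commit describes. The option keys shown are standard Hudi write options, but the helper body is a simplification: the coercion to the canonical BlobType schema is elided because the exact struct is not shown in this thread, and the table name is assumed:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch of the writeHudiBlobTable helper: bulk_insert the DataFrame
// into a Hudi table, then load it back through the Hudi relation so
// the query plan goes through HoodieFileIndex rather than a
// Spark-only temp view.
def writeHudiBlobTable(spark: SparkSession, df: DataFrame, basePath: String): DataFrame = {
  df.write.format("hudi")
    .option("hoodie.table.name", "blob_test") // assumed table name
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .mode("overwrite")
    .save(basePath)

  spark.read.format("hudi").load(basePath)
}
```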
Codecov Report ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@              Coverage Diff              @@
##             master   #18736       +/-   ##
=============================================
- Coverage     68.14%   54.01%   -14.14%
+ Complexity    29051    12460    -16591
=============================================
  Files          2516     1434     -1082
  Lines        140935    72161    -68774
  Branches      17472     8245     -9227
=============================================
- Hits          96047    38978    -57069
+ Misses        36993    29688     -7305
+ Partials       7895     3495     -4400
```
Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
The existing tests for BatchedBlobReader (introduced in #18098) cover byte-level correctness via the readBatched API (TestBatchedBlobReader) and the read_blob() SQL surface (TestReadBlobSQL), but they do not exercise the merge/batching algorithm itself. A regression that silently disables the I/O reduction — a bad gap-threshold comparison, a broken range merge, sort/group breakage — would still pass byte-level assertions. This PR adds focused coverage for the batch-mode path.

Summary and Changelog
- TestBatchedBlobReaderMerge — unit tests against mergeRanges and identifyConsecutiveRanges. Asserts merged-range counts, gap-threshold inclusive/exclusive boundaries, multi-file grouping, sort, index preservation, and rejection of overlapping ranges. No Spark, no I/O.
- TestReadBlobBatching — SQL-driven correctness tests on Hudi-backed tables. Each scenario bulk_inserts a Hudi table containing the blob column, reads it back via sparkSession.read.format("hudi"), and runs SELECT read_blob(...) with a batching configuration chosen to exercise specific merge behaviors:
  - 20 small reads in one file batched under maxGap=4096 (high-ratio merge)
  - mixed small/large gaps in one file (above and below threshold)
  - threshold-boundary case (gap == maxGap)
  - multi-file interleaved input with mixed gap patterns

Driving the query through a real Hudi table exercises HoodieFileIndex and BatchedBlobReadExec serialization (the production read plan) on top of the batching path. Each test asserts the returned bytes match the deterministic pattern at each row's recorded offset — validating both query success and output correctness.

- Relaxes the visibility of two BatchedBlobReader helpers from private to package-private so the merge tests can call them directly.
- Adds org.apache.spark.sql.hudi.blob to the test-package allowlist (TestSparkSqlHudiPackageStructure, the Azure pipeline filter, and the GitHub Actions filter) so the merge tests can use private[blob] access without widening internal visibility further.

Impact
Tests only. No production code logic changes. The visibility relaxation on the two helpers is scoped to the same package as the reader.
Risk Level
none
Documentation Update
none
Contributor's checklist