
test(spark): Add tests for batch-mode blob reads #18736

Merged
yihua merged 6 commits into apache:master from voonhous:blob-batch-tests
May 15, 2026

Conversation

@voonhous
Member

@voonhous voonhous commented May 14, 2026

Describe the issue this Pull Request addresses

The existing tests for BatchedBlobReader (introduced in #18098) cover byte-level correctness via the readBatched API (TestBatchedBlobReader) and the read_blob() SQL surface (TestReadBlobSQL), but they do not exercise the merge/batching algorithm itself. A regression that silently disables the I/O reduction — a bad gap-threshold comparison, a broken range merge, sort/group breakage — would still pass byte-level assertions. This PR adds focused coverage for the batch-mode path.

Summary and Changelog

  • TestBatchedBlobReaderMerge — unit tests against mergeRanges and identifyConsecutiveRanges. Asserts merged-range counts, gap-threshold inclusive/exclusive boundaries, multi-file grouping, sort, index preservation, and rejection of overlapping ranges. No Spark, no I/O.
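The merge behavior these unit tests pin down can be sketched as follows: a minimal, self-contained illustration of gap-threshold range merging with per-file grouping, sorting, and overlap rejection. `BlobRange` and `mergeRanges` here are hypothetical stand-ins, not the Hudi types, and the inclusive `<= maxGap` comparison is an assumption about the threshold semantics.

```java
import java.util.*;

// Hypothetical stand-in for the ranges BatchedBlobReader merges; not the Hudi types.
public class RangeMergeSketch {
    record BlobRange(String filePath, long start, long end) {}

    // Merge ranges within one file when the gap between consecutive ranges
    // is <= maxGap (assuming an inclusive threshold). Overlapping input
    // ranges are rejected, mirroring the behavior the tests assert.
    static List<BlobRange> mergeRanges(List<BlobRange> input, long maxGap) {
        Map<String, List<BlobRange>> byFile = new TreeMap<>();
        for (BlobRange r : input) {
            byFile.computeIfAbsent(r.filePath(), k -> new ArrayList<>()).add(r);
        }
        List<BlobRange> merged = new ArrayList<>();
        for (List<BlobRange> perFile : byFile.values()) {
            perFile.sort(Comparator.comparingLong(BlobRange::start));
            BlobRange cur = null;
            for (BlobRange r : perFile) {
                if (cur == null) { cur = r; continue; }
                if (r.start() < cur.end()) {
                    throw new IllegalArgumentException("overlapping ranges");
                }
                if (r.start() - cur.end() <= maxGap) {
                    // Gap within threshold: extend the current merged range.
                    cur = new BlobRange(cur.filePath(), cur.start(),
                                        Math.max(cur.end(), r.end()));
                } else {
                    merged.add(cur);
                    cur = r;
                }
            }
            if (cur != null) merged.add(cur);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<BlobRange> in = List.of(
            new BlobRange("/f", 0, 100), new BlobRange("/f", 150, 200),  // gap 50: merged
            new BlobRange("/f", 9000, 9100));                            // gap 8800: separate
        System.out.println(mergeRanges(in, 4096).size()); // prints 2
    }
}
```

A regression in the threshold comparison or the per-file grouping changes the merged-range count, which is exactly what the count assertions in TestBatchedBlobReaderMerge detect.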

  • TestReadBlobBatching — SQL-driven correctness tests on Hudi-backed tables. Each scenario bulk_inserts a Hudi table containing the blob column, reads it back via sparkSession.read.format("hudi"), and runs SELECT read_blob(...) with a batching configuration chosen to exercise specific merge behaviors:

    • 20 small reads in one file batched under maxGap=4096 (high-ratio merge)
    • mixed small/large gaps in one file (above and below threshold in the same query)
    • threshold-boundary case (gap == maxGap)
    • multi-file interleaved input with mixed gap patterns

    Driving the query through a real Hudi table exercises HoodieFileIndex and BatchedBlobReadExec serialization (the production read plan) on top of the batching path. Each test asserts the returned bytes match the deterministic pattern at each row's recorded offset — validates query success and output correctness.
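The byte-level assertion style can be sketched like this. The `(offset % 251)` pattern is a hypothetical stand-in for whatever deterministic pattern the tests actually write; the PR description does not specify it.

```java
// Sketch of "validate bytes, not I/O counts": regenerate the deterministic
// pattern at each row's recorded offset and compare against the bytes the
// query returned. The modulo-251 pattern is illustrative, not Hudi's.
public class BlobPatternSketch {
    static byte patternAt(long offset) {
        return (byte) (offset % 251);
    }

    // Bytes that would have been written for a blob at (offset, length).
    static byte[] blobAt(long offset, int length) {
        byte[] out = new byte[length];
        for (int i = 0; i < length; i++) out[i] = patternAt(offset + i);
        return out;
    }

    // Per-row assertion: returned bytes equal the pattern at that offset.
    static boolean matchesPattern(byte[] returned, long offset) {
        for (int i = 0; i < returned.length; i++) {
            if (returned[i] != patternAt(offset + i)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] returned = blobAt(4096, 32);
        System.out.println(matchesPattern(returned, 4096)); // prints true
    }
}
```

Because the expected bytes are a pure function of the offset, a batching bug that returns bytes from the wrong position in a merged range fails this check even though the query itself succeeds.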

  • Relaxes the visibility of two BatchedBlobReader helpers from private to package-private so the merge tests can call them directly.

  • Adds org.apache.spark.sql.hudi.blob to the test-package allowlist (TestSparkSqlHudiPackageStructure, the Azure pipeline filter, and the GitHub Actions filter) so the merge tests can use private[blob] access without widening internal visibility further.

Impact

Tests only. No production code logic changes. The visibility relaxation on the two helpers is scoped to the same package as the reader.

Risk Level

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

test(spark): Add merge-algorithm and I/O-count tests for read_blob batching

Existing tests for the BatchedBlobReader (PR apache#18098) only assert byte
correctness of the data returned, so they cannot detect a regression
that causes the batching optimization to stop reducing I/O. Add two
test classes that close that gap:

- TestBatchedBlobReaderMerge: direct unit tests on mergeRanges and
  identifyConsecutiveRanges (newly package-private), asserting merged
  range counts, gap-threshold boundaries, multi-file grouping, sort,
  index preservation, and overlap rejection. No Spark, no I/O.

- TestBatchedBlobReaderIO: integration tests that drive processPartition
  with a CountingHoodieStorage wrapper around a real storage, asserting
  openSeekable/seek counts across four scenarios — many blobs in one
  file, contiguous zero-gap blobs, threshold-controlled small/large
  gaps including the inclusive boundary, and multi-file queries
  (per-file batching, mixed gap patterns, interleaved input order).
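The counting-decorator idea behind CountingHoodieStorage can be sketched as follows. The `Storage` interface here is a hypothetical stand-in, not Hudi's HoodieStorage API; only the wrap-and-count pattern is the point.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical storage interface standing in for the real HoodieStorage API.
interface Storage {
    byte[] readRange(String path, long offset, int length);
}

// Decorator that counts physical reads while delegating to the wrapped storage,
// mirroring how CountingHoodieStorage counts openSeekable/seek calls.
class CountingStorage implements Storage {
    private final Storage delegate;
    final AtomicInteger readCount = new AtomicInteger();

    CountingStorage(Storage delegate) { this.delegate = delegate; }

    @Override
    public byte[] readRange(String path, long offset, int length) {
        readCount.incrementAndGet();          // one physical I/O per call
        return delegate.readRange(path, offset, length);
    }
}

public class CountingStorageDemo {
    public static void main(String[] args) {
        CountingStorage s = new CountingStorage((path, offset, length) -> new byte[length]);
        s.readRange("/f", 0, 128);
        s.readRange("/f", 8192, 128);
        System.out.println(s.readCount.get()); // prints 2
    }
}
```

A test drives the batched reader through such a wrapper and asserts the counter equals the expected merged-range count, so a merge regression that silently stops reducing I/O shows up as an extra read.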
@voonhous voonhous changed the title from "test(spark): Add merge-algorithm and I/O-count tests for read_blob batching" to "test(spark): Add merge-algorithm and I/O-count tests for read_blob" May 14, 2026
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds unit tests for BatchedBlobReader's merge algorithm and integration tests that count I/O operations via a CountingHoodieStorage decorator, plus a minimal private → private[blob] visibility relaxation on two helper methods. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One naming nit in the merge-test helper — everything else reads cleanly.

cc @yihua

private def reader(maxGapBytes: Int = 4096) =
  new BatchedBlobReader(storage = null, maxGapBytes = maxGapBytes, lookaheadRows = 50)

private def row(filePath: String, offset: Long, length: Long, index: Long = 0L): RowInfo[Row] =
Contributor


🤖 nit: the helper is named row but returns a RowInfo[Row], and org.apache.spark.sql.Row is imported at the top — could you rename it to rowInfo to avoid the double-take when reading call sites like Seq(row("/f", 1000, 200))?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label May 14, 2026
TestSparkSqlHudiPackageStructure rejects Scala test classes under
org.apache.spark.sql.hudi.* that are not in the curated allow-list,
which is mirrored in two CI configs that drive the wildcard suite
filter. The new BatchedBlobReader merge/IO tests live in the
.blob package so they can reach private[blob] helpers (mergeRanges,
identifyConsecutiveRanges, RowAccessor) without widening internal
visibility - matching the same-package pattern already used for the
.analysis, .catalog, and .command production sub-packages.

Add 'org.apache.spark.sql.hudi.blob' to all three places:
- ALLOWED_PACKAGES in TestSparkSqlHudiPackageStructure
- job6HudiSparkDdlOthersWildcardSuites in azure-pipelines-20230430.yml
- SCALA_TEST_OTHERS_FILTER in .github/workflows/bot.yml
Contributor

@hudi-agent hudi-agent left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds a focused suite of merge-algorithm unit tests and I/O-count integration tests for BatchedBlobReader, plus a minimal private → private[blob] visibility relaxation needed to call the merge helpers directly. The test math (gap boundaries, file-grouping, interleaved input) checks out, and the CountingHoodieStorage decorator delegates correctly — close() propagation is safe because the underlying HoodieHadoopStorage.close() is intentionally a no-op. No issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

@voonhous voonhous requested review from rahil-c and yihua May 14, 2026 15:23
@voonhous voonhous added this to the release-1.2.0 milestone May 14, 2026
Drop tests that re-verify the merge algorithm at the I/O layer when
TestBatchedBlobReaderMerge already covers it. Keep only the cases that
prove the merge result maps 1:1 to physical I/O ops:

- batching reduces I/O end-to-end (single-file, batched vs baseline)
- mixed gaps in one file produce one I/O per merged group
- multi-file routing with interleaved input and mixed gap patterns

Reduces TestBatchedBlobReaderIO from 9 to 3 tests (~67% fewer Spark
DataFrame builds and real file reads, the dominant runtime cost).
Comment on lines +58 to +66
def testSingleRowProducesSingleRange(): Unit = {
  val merged = reader().mergeRanges(Seq(row("/f", 1000, 200, index = 7)), maxGap = 100)
  assertEquals(1, merged.size)
  val r = merged.head
  assertEquals("/f", r.filePath)
  assertEquals(1000L, r.startOffset)
  assertEquals(1200L, r.endOffset)
  assertEquals(1, r.rows.size)
  assertEquals(7L, r.rows.head.index)
Contributor


It would be better to test this with SQL so that it works end-to-end.

TestBatchedBlobReaderMerge already catches algorithmic regressions in
the merge logic via direct calls to mergeRanges/identifyConsecutiveRanges,
so the I/O-count layer was duplicative. Existing TestBatchedBlobReader
and TestReadBlobSQL cover the readBatched API and the SQL surface
respectively at byte-level granularity.

Replace TestBatchedBlobReaderIO with TestReadBlobBatching: SQL-driven
tests that exercise the read_blob() path with batching configurations
chosen to drive specific merge behaviors:
  - 20 small reads in one file batched under maxGap=4096
  - mixed small/large gaps in one file (above and below threshold)
  - threshold-boundary case (gap == maxGap)
  - multi-file interleaved input with mixed gap patterns

Each test asserts the returned bytes match the deterministic pattern at
each row's recorded offset, validating both query success and output
correctness through the SQL planner / BatchedBlobReadExec path.

Move to org.apache.hudi.blob (no longer needs private[blob] access).
Drop CountingHoodieStorage. The package-allowlist entry for
org.apache.spark.sql.hudi.blob is retained since TestBatchedBlobReaderMerge
still needs same-package access to the merge helpers.
Contributor

@yihua yihua left a comment


Existing coverage before this PR

TestBatchedBlobReader (calls readBatched API, validates bytes):

  • testBasicBatchedRead — contiguous reads in one file
  • testGapThresholdSmallGaps — 4 reads, 20-byte gaps, maxGapBytes=4096 → directly covers new IO scenario 3
  • testGapThresholdLargeGaps — 4 reads, 9.9KB gaps, maxGapBytes=1000 → directly covers new IO scenario 4
  • testNoBatchingDifferentFiles — multi-file, one read each
  • testMixedScenario — batchable + non-batchable groups across 2 files → covers new IO scenarios 6, 7, 8
  • testPreserveInputOrder, testEmptyDataset, testOverlappingRangesThrowsException, etc.

TestReadBlobSQL (uses read_blob() SQL function):

  • testConfigurationParameters — sets hoodie.blob.batching.max.gap.bytes=10000 + lookahead.size=100, 3 reads with 4.9KB gap (only existing SQL test that exercises the batching config knob)
  • testReadBlobMultipleFiles — multi-file reads via SQL
  • testReadOutOfLineBlobOnHudiBackedTable, testBasicReadBlobSQL, joins, subqueries, etc.

So, it would be good to add SQL-level tests for batch reading of blobs. We should rewrite TestBatchedBlobReaderIO as a SQL-driven test focused on batching scenarios that the existing TestReadBlobSQL doesn't already exercise. Validate bytes, not I/O counts.

TestBatchedBlobReaderMerge is the only place that catches a merge-algorithm regression, and we can keep it.

@yihua yihua changed the title from "test(spark): Add merge-algorithm and I/O-count tests for read_blob" to "test(spark): Add tests for batch-mode blob reads" May 14, 2026
Each scenario now bulk_inserts a Hudi table with the blob column and
reads it back via sparkSession.read.format("hudi"), then runs
SELECT read_blob(...) against the loaded view. This exercises
HoodieFileIndex and BatchedBlobReadExec serialization in addition to
the merge/batching path, matching the production read plan rather than
a Spark-only temp view.

Adds writeHudiBlobTable helper that coerces the input DataFrame to the
canonical BlobType schema (nullable reference struct, as required by
HoodieSparkSchemaConverters.validateBlobStructure) before save.
Contributor

@yihua yihua left a comment


LGTM

@hudi-bot
Collaborator

CI report:

Bot commands — @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua yihua merged commit 9230f7c into apache:master May 15, 2026
20 of 62 checks passed
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.01%. Comparing base (f2fdca2) to head (5f4faad).
⚠️ Report is 11 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (f2fdca2) and HEAD (5f4faad).

HEAD has 32 uploads less than BASE
Flag BASE (f2fdca2) HEAD (5f4faad)
common-and-other-modules 1 0
spark-java-tests 18 0
spark-scala-tests 12 0
utilities 1 0
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18736       +/-   ##
=============================================
- Coverage     68.14%   54.01%   -14.14%     
+ Complexity    29051    12460    -16591     
=============================================
  Files          2516     1434     -1082     
  Lines        140935    72161    -68774     
  Branches      17472     8245     -9227     
=============================================
- Hits          96047    38978    -57069     
+ Misses        36993    29688     -7305     
+ Partials       7895     3495     -4400     
Flag Coverage Δ
common-and-other-modules ?
hadoop-mr-java-client 44.96% <ø> (-0.02%) ⬇️
spark-client-hadoop-common 48.32% <ø> (-0.03%) ⬇️
spark-java-tests ?
spark-scala-tests ?
utilities ?

Flags with carried forward coverage won't be shown.
see 1863 files with indirect coverage changes


