test(spark): Simplify I/O-count tests for read_blob batching by yihua · Pull Request #18740 · apache/hudi

yihua · 2026-05-14T21:22:48Z

Describe the issue this Pull Request addresses

Follow-up to #18736. The newly added TestBatchedBlobReaderIO had several scenarios that re-verified the merge algorithm at the I/O layer when TestBatchedBlobReaderMerge already covers them as cheap unit tests. Each removed I/O test built a Spark DataFrame and read a real file, so the duplication was the dominant runtime cost of the new suite.

Summary and Changelog

Drops I/O tests whose assertions duplicate merge unit tests:
- contiguous-zero-gap merge
- small-gaps-within-threshold merge
- large-gaps-above-threshold no-merge
- exact threshold boundary (both inclusive and exclusive halves)
- interleaved input ordering
Folds the single-file batched + non-batched pair into one test (same input, two maxGap values).
Folds three multi-file scenarios (per-file batching, mixed gap patterns, interleaved order) into one combined test.
Result: TestBatchedBlobReaderIO goes from 9 → 3 tests, retaining the three distinct integration claims: I/O reduction end-to-end, one I/O per merged group, and per-file routing across multiple files.

Impact

Tests only. No production code changes. Merge-algorithm coverage is unchanged — every removed I/O scenario has a corresponding test in TestBatchedBlobReaderMerge.

This is stacked on #18736; the diff includes that PR's commits until it lands.

Risk Level

none

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

…tching Existing tests for the BatchedBlobReader (PR apache#18098) only assert byte correctness of the data returned, so they cannot detect a regression that causes the batching optimization to stop reducing I/O. Add two test classes that close that gap: - TestBatchedBlobReaderMerge: direct unit tests on mergeRanges and identifyConsecutiveRanges (newly package-private), asserting merged range counts, gap-threshold boundaries, multi-file grouping, sort, index preservation, and overlap rejection. No Spark, no I/O. - TestBatchedBlobReaderIO: integration tests that drive processPartition with a CountingHoodieStorage wrapper around a real storage, asserting openSeekable/seek counts across four scenarios — many blobs in one file, contiguous zero-gap blobs, threshold-controlled small/large gaps including the inclusive boundary, and multi-file queries (per-file batching, mixed gap patterns, interleaved input order).

TestSparkSqlHudiPackageStructure rejects Scala test classes under org.apache.spark.sql.hudi.* that are not in the curated allow-list, which is mirrored in two CI configs that drive the wildcard suite filter. The new BatchedBlobReader merge/IO tests live in the .blob package so they can reach private[blob] helpers (mergeRanges, identifyConsecutiveRanges, RowAccessor) without widening internal visibility - matching the same-package pattern already used for the .analysis, .catalog, and .command production sub-packages. Add 'org.apache.spark.sql.hudi.blob' to all three places: - ALLOWED_PACKAGES in TestSparkSqlHudiPackageStructure - job6HudiSparkDdlOthersWildcardSuites in azure-pipelines-20230430.yml - SCALA_TEST_OTHERS_FILTER in .github/workflows/bot.yml

Drop tests that re-verify the merge algorithm at the I/O layer when TestBatchedBlobReaderMerge already covers it. Keep only the cases that prove the merge result maps 1:1 to physical I/O ops: - batching reduces I/O end-to-end (single-file, batched vs baseline) - mixed gaps in one file produce one I/O per merged group - multi-file routing with interleaved input and mixed gap patterns Reduces TestBatchedBlobReaderIO from 9 to 3 tests (~67% fewer Spark DataFrame builds and real file reads, the dominant runtime cost).

codecov-commenter · 2026-05-14T22:46:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.14%. Comparing base (f2fdca2) to head (2ff7167).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #18740      +/-   ##
============================================
- Coverage     68.14%   68.14%   -0.01%     
- Complexity    29051    29099      +48     
============================================
  Files          2516     2517       +1     
  Lines        140935   141113     +178     
  Branches      17472    17508      +36     
============================================
+ Hits          96047    96155     +108     
- Misses        36993    37054      +61     
- Partials       7895     7904       +9

Flag	Coverage Δ
common-and-other-modules	`44.41% <ø> (+<0.01%)`	⬆️
hadoop-mr-java-client	`45.02% <ø> (+0.04%)`	⬆️
spark-client-hadoop-common	`48.32% <ø> (-0.03%)`	⬇️
spark-java-tests	`48.97% <ø> (-0.05%)`	⬇️
spark-scala-tests	`44.90% <ø> (+<0.01%)`	⬆️
utilities	`37.63% <ø> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...apache/spark/sql/hudi/blob/BatchedBlobReader.scala	`84.83% <ø> (+0.36%)`	⬆️

... and 32 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

voonhous and others added 3 commits May 14, 2026 11:08

yihua closed this May 14, 2026

yihua deleted the simplify-blob-batch-io-tests branch May 14, 2026 21:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(spark): Simplify I/O-count tests for read_blob batching#18740

test(spark): Simplify I/O-count tests for read_blob batching#18740
yihua wants to merge 3 commits into
apache:masterfrom
yihua:simplify-blob-batch-io-tests

yihua commented May 14, 2026

Uh oh!

codecov-commenter commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

yihua commented May 14, 2026

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

codecov-commenter commented May 14, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants