
test(spark): Simplify I/O-count tests for read_blob batching#18740

Closed
yihua wants to merge 3 commits into apache:master from yihua:simplify-blob-batch-io-tests

Conversation

yihua (Contributor) commented May 14, 2026

Describe the issue this Pull Request addresses

Follow-up to #18736. The newly added TestBatchedBlobReaderIO had several scenarios that re-verified the merge algorithm at the I/O layer when TestBatchedBlobReaderMerge already covers them as cheap unit tests. Each removed I/O test built a Spark DataFrame and read a real file, so the duplication was the dominant runtime cost of the new suite.

Summary and Changelog

  • Drops I/O tests whose assertions duplicate merge unit tests:
    • contiguous-zero-gap merge
    • small-gaps-within-threshold merge
    • large-gaps-above-threshold no-merge
    • exact threshold boundary (both inclusive and exclusive halves)
    • interleaved input ordering
  • Folds the single-file batched + non-batched pair into one test (same input, two maxGap values).
  • Folds three multi-file scenarios (per-file batching, mixed gap patterns, interleaved order) into one combined test.
  • Result: TestBatchedBlobReaderIO goes from 9 → 3 tests, retaining the three distinct integration claims: I/O reduction end-to-end, one I/O per merged group, and per-file routing across multiple files.
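The merge behaviors that now live only in the unit-test suite boil down to a single gap-threshold pass over sorted ranges. A minimal sketch follows; the `ByteRange` record, `mergeRanges` signature, and class name are illustrative assumptions, not Hudi's actual `private[blob]` API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RangeMergeSketch {
  // end is exclusive; overlapping ranges are out of scope for this sketch
  record ByteRange(long start, long end) {}

  // Merge ranges whose gap to the previous merged range is <= maxGap (inclusive boundary).
  static List<ByteRange> mergeRanges(List<ByteRange> ranges, long maxGap) {
    List<ByteRange> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingLong(ByteRange::start));
    List<ByteRange> merged = new ArrayList<>();
    for (ByteRange r : sorted) {
      if (!merged.isEmpty() && r.start() - merged.get(merged.size() - 1).end() <= maxGap) {
        ByteRange last = merged.remove(merged.size() - 1);
        merged.add(new ByteRange(last.start(), Math.max(last.end(), r.end())));
      } else {
        merged.add(r);
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    // Interleaved input order; the gap between (0,10) and (12,20) is 2 bytes.
    List<ByteRange> in = List.of(new ByteRange(12, 20), new ByteRange(0, 10));
    System.out.println(mergeRanges(in, 2).size()); // 1: gap == maxGap, inclusive boundary merges
    System.out.println(mergeRanges(in, 1).size()); // 2: gap > maxGap, no merge
  }
}
```

This is also why the folded single-file test can reuse one input with two maxGap values: only the threshold argument changes between the batched and non-batched runs.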

Impact

Tests only. No production code changes. Merge-algorithm coverage is unchanged — every removed I/O scenario has a corresponding test in TestBatchedBlobReaderMerge.

This is stacked on #18736; the diff includes that PR's commits until it lands.

Risk Level

none

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

voonhous and others added 3 commits May 14, 2026 11:08
…tching

Existing tests for the BatchedBlobReader (PR apache#18098) only assert byte
correctness of the data returned, so they cannot detect a regression
that causes the batching optimization to stop reducing I/O. Add two
test classes that close that gap:

- TestBatchedBlobReaderMerge: direct unit tests on mergeRanges and
  identifyConsecutiveRanges (newly package-private), asserting merged
  range counts, gap-threshold boundaries, multi-file grouping, sort,
  index preservation, and overlap rejection. No Spark, no I/O.

- TestBatchedBlobReaderIO: integration tests that drive processPartition
  with a CountingHoodieStorage wrapper around a real storage, asserting
  openSeekable/seek counts across four scenarios — many blobs in one
  file, contiguous zero-gap blobs, threshold-controlled small/large
  gaps including the inclusive boundary, and multi-file queries
  (per-file batching, mixed gap patterns, interleaved input order).
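The CountingHoodieStorage wrapper described above is, in spirit, a counting decorator around the storage layer. A minimal sketch, using a made-up `Storage` interface and `readRange` method rather than Hudi's real `HoodieStorage`/`openSeekable` API:

```java
import java.util.concurrent.atomic.AtomicLong;

public class CountingStorageSketch {
  // Illustrative minimal storage interface; Hudi's actual HoodieStorage API differs.
  interface Storage {
    byte[] readRange(long offset, int length);
  }

  // Decorator that counts physical read operations, in the spirit of the
  // CountingHoodieStorage wrapper the I/O tests use to assert operation counts.
  static final class CountingStorage implements Storage {
    private final Storage delegate;
    final AtomicLong reads = new AtomicLong();

    CountingStorage(Storage delegate) { this.delegate = delegate; }

    @Override
    public byte[] readRange(long offset, int length) {
      reads.incrementAndGet(); // one count per physical read
      return delegate.readRange(offset, length);
    }
  }

  public static void main(String[] args) {
    byte[] backing = new byte[64];
    CountingStorage counting = new CountingStorage((off, len) ->
        java.util.Arrays.copyOfRange(backing, (int) off, (int) off + len));
    counting.readRange(0, 16);  // first merged group
    counting.readRange(32, 16); // second merged group
    System.out.println(counting.reads.get()); // 2: one I/O per merged group
  }
}
```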
TestSparkSqlHudiPackageStructure rejects Scala test classes under
org.apache.spark.sql.hudi.* that are not in the curated allow-list,
which is mirrored in two CI configs that drive the wildcard suite
filter. The new BatchedBlobReader merge/IO tests live in the
.blob package so they can reach private[blob] helpers (mergeRanges,
identifyConsecutiveRanges, RowAccessor) without widening internal
visibility, matching the same-package pattern already used for the
.analysis, .catalog, and .command production sub-packages.

Add 'org.apache.spark.sql.hudi.blob' to all three places:
- ALLOWED_PACKAGES in TestSparkSqlHudiPackageStructure
- job6HudiSparkDdlOthersWildcardSuites in azure-pipelines-20230430.yml
- SCALA_TEST_OTHERS_FILTER in .github/workflows/bot.yml
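The allow-list check this commit updates reduces to a set-membership test on a test class's package. A hedged sketch; the package names come from the commit message above, but the class and method names here are illustrative, not the actual TestSparkSqlHudiPackageStructure implementation:

```java
import java.util.Set;

public class PackageAllowListSketch {
  // Package names taken from the commit message; the check logic is a sketch.
  static final Set<String> ALLOWED_PACKAGES = Set.of(
      "org.apache.spark.sql.hudi.analysis",
      "org.apache.spark.sql.hudi.catalog",
      "org.apache.spark.sql.hudi.command",
      "org.apache.spark.sql.hudi.blob"); // the entry this commit adds

  static boolean isAllowed(String testClassName) {
    int lastDot = testClassName.lastIndexOf('.');
    return lastDot > 0 && ALLOWED_PACKAGES.contains(testClassName.substring(0, lastDot));
  }

  public static void main(String[] args) {
    System.out.println(isAllowed("org.apache.spark.sql.hudi.blob.TestBatchedBlobReaderMerge")); // true
    System.out.println(isAllowed("org.apache.spark.sql.hudi.rogue.TestSomething")); // false
  }
}
```

Keeping the allow-list, the Azure wildcard suites, and the GitHub Actions filter in sync is what forces the three-place update.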
Drop tests that re-verify the merge algorithm at the I/O layer when
TestBatchedBlobReaderMerge already covers it. Keep only the cases that
prove the merge result maps 1:1 to physical I/O ops:

- batching reduces I/O end-to-end (single-file, batched vs baseline)
- mixed gaps in one file produce one I/O per merged group
- multi-file routing with interleaved input and mixed gap patterns

Reduces TestBatchedBlobReaderIO from 9 to 3 tests (~67% fewer Spark
DataFrame builds and real file reads, the dominant runtime cost).
@yihua yihua closed this May 14, 2026
@yihua yihua deleted the simplify-blob-batch-io-tests branch May 14, 2026 21:24
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.14%. Comparing base (f2fdca2) to head (2ff7167).
⚠️ Report is 3 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18740      +/-   ##
============================================
- Coverage     68.14%   68.14%   -0.01%     
- Complexity    29051    29099      +48     
============================================
  Files          2516     2517       +1     
  Lines        140935   141113     +178     
  Branches      17472    17508      +36     
============================================
+ Hits          96047    96155     +108     
- Misses        36993    37054      +61     
- Partials       7895     7904       +9     
Flag                       | Coverage   | Δ
common-and-other-modules   | 44.41% <ø> | +<0.01% ⬆️
hadoop-mr-java-client      | 45.02% <ø> | +0.04% ⬆️
spark-client-hadoop-common | 48.32% <ø> | -0.03% ⬇️
spark-java-tests           | 48.97% <ø> | -0.05% ⬇️
spark-scala-tests          | 44.90% <ø> | +<0.01% ⬆️
utilities                  | 37.63% <ø> | +<0.01% ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines                              | Coverage   | Δ
...apache/spark/sql/hudi/blob/BatchedBlobReader.scala | 84.83% <ø> | +0.36% ⬆️

... and 32 files with indirect coverage changes



3 participants