fix: Support data pruning using nested partition columns#18126

Merged
yihua merged 20 commits into apache:master from linliu-code:nested_partitioning
May 15, 2026

@linliu-code
Collaborator

@linliu-code linliu-code commented Feb 7, 2026

Describe the issue this Pull Request addresses

There has been a change in behavior for SparkHoodieTableFileIndex since 0.14.1: the StructType(partitionFields) it returns no longer carries the full field path, which causes data validation failures. This behavior was changed as part of this PR: https://github.com/apache/hudi/pull/9863/changes

Meanwhile, Spark does not support nested partition columns.

Summary and Changelog

  1. If a table has a nested partition column whose leaf name conflicts with another top-level field, the partitionedSchema passed to the new file group reader is incorrect. The fix is to return the partition field with its full path name instead of the inner field name.
  2. Spark passes the wrong filters to the Hudi index because Spark does not support nested partition columns. Therefore, when Hudi receives the partition and data filters, it reclassifies them based on the partition column metadata.
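
The reclassification in item 2 can be sketched in plain Scala. This is a minimal illustration, not the code in this PR: `Filter`, `reclassify`, and the column names are hypothetical stand-ins, and a real implementation would work on Spark Catalyst expressions rather than on sets of strings. The assumption made here is that a filter counts as a partition filter only when every column path it references is, by its full dotted name, a partition column.

```scala
object FilterReclassification {
  // Hypothetical stand-in for a predicate: only the referenced
  // full column paths matter for classification.
  final case class Filter(sql: String, references: Set[String])

  // Splits filters into (partitionFilters, dataFilters).
  // A filter is a partition filter only if it references at least one column
  // and every referenced full path is a partition column.
  def reclassify(
      filters: Seq[Filter],
      partitionColumns: Set[String]): (Seq[Filter], Seq[Filter]) =
    filters.partition { f =>
      f.references.nonEmpty && f.references.subsetOf(partitionColumns)
    }
}
```

Matching on the full path (for example `nested_record.level`) rather than the leaf name (`level`) is also what keeps a nested partition column from being confused with a top-level field of the same leaf name, as described in item 1.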

Impact

Medium

Risk Level

Low.

Documentation Update

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions Bot added the size:M PR with lines of changes in (100, 300] label Feb 7, 2026
@linliu-code linliu-code force-pushed the nested_partitioning branch 3 times, most recently from d6f9ca7 to 413fa60 Compare February 8, 2026 01:00
@linliu-code linliu-code changed the title fix: Reproduce nested partition columns pruning data validation failure fix: Support data pruning using nested partition columns Feb 8, 2026
@linliu-code linliu-code marked this pull request as ready for review February 8, 2026 05:50
@linliu-code linliu-code requested a review from yihua February 8, 2026 05:54
@nsivabalan
Contributor

@hudi-bot run azure

@linliu-code
Collaborator Author

@hudi-bot run azure

The command does not seem to be working. Let me push again to trigger the Azure test.

@apache apache deleted a comment from hudi-bot Feb 10, 2026
@@ -2546,6 +2453,204 @@ class TestCOWDataSource extends HoodieSparkClientTestBase with ScalaAssertionSup
writeToHudi(opt, firstUpdateDF, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
})
}

@ParameterizedTest
@CsvSource(Array("COW", "MOR"))
Contributor

Could this test be extracted and called from TestCOWDataSource for COW table and TestMORDataSource for MOR table?

Collaborator Author

sg.

@github-actions github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Feb 17, 2026
@linliu-code linliu-code force-pushed the nested_partitioning branch 2 times, most recently from 29981f5 to e8557d3 Compare February 18, 2026 17:42
@linliu-code linliu-code force-pushed the nested_partitioning branch 2 times, most recently from 482ac01 to 0b6dc83 Compare April 1, 2026 00:25
Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

[Update review] This incremental review of PR #18126 focuses on the nested partition column pruning support. The approach — caching partition-pruned file slices in HoodieFileIndex and reconstructing a nested StructType for Spark's optimizer — is sound in concept and the test coverage is thorough. However, the core concern from the previous review cycle (struct-parent prefix matching being too broad) remains unaddressed. A filter on a non-partition nested field of a partition-containing struct (e.g., nested_record.nested_int = 10 when only nested_record.level is a partition column) would be misclassified as a partition-pruning predicate, leading to an evaluation failure at runtime. Additionally, the single-use-then-clear cache pattern in listFiles risks full table scans under AQE re-planning.
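
The misclassification described in this review can be illustrated with a small sketch. It assumes one plausible reading of "struct-parent prefix matching", namely that a referenced field is accepted whenever its top-level struct also contains a partition column. Both function names are hypothetical and are not taken from the PR.

```scala
object PartitionPathMatching {
  // Top-level struct name of a dotted path, e.g. "nested_record.level" -> "nested_record".
  private def topLevel(path: String): String = path.takeWhile(_ != '.')

  // Too broad: accepts any field whose top-level struct also holds a partition column.
  def matchesByStructParent(ref: String, partitionColumns: Set[String]): Boolean =
    partitionColumns.exists(p => topLevel(p) == topLevel(ref))

  // Stricter: accepts only a reference whose full dotted path is a partition column.
  def matchesExactPath(ref: String, partitionColumns: Set[String]): Boolean =
    partitionColumns.contains(ref)
}
```

With `nested_record.level` as the only partition column, `matchesByStructParent("nested_record.nested_int", ...)` returns true while `matchesExactPath` returns false, which is the gap the review points at.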

@linliu-code linliu-code force-pushed the nested_partitioning branch from 63de989 to 57b9cc9 Compare April 10, 2026 19:34
Contributor

@yihua yihua left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! This is a solid approach to handling nested partition columns — the core idea of re-extracting partition predicates from data filters and building a nested schema for Spark's planner is well thought out. A few issues to address in the inline comments, most notably a case-sensitivity inconsistency in the GetStructField binding path.

@yihua yihua added this to the release-1.2.0 milestone May 15, 2026
@yihua yihua force-pushed the nested_partitioning branch from 3ec0d25 to 59d4bab Compare May 15, 2026 20:20
yihua added 2 commits May 15, 2026 13:50
- SparkHoodieTableFileIndex: add `DataType` and `LinkedHashMap` imports,
  drop the inline fully-qualified references in `NestedFieldNode` and
  `buildNestedPartitionSchema`.
- TestCOWDataSource: add imports for `LogicalRelation`, `HadoopFsRelation`,
  `HoodieFileIndex`, `HoodieBaseRelation`; drop FQN in
  `runNestedFieldPartitionTest`.
- TestHoodieFileIndex: collapse six `testBuildNestedPartitionSchema*`
  cases into a single `@ParameterizedTest` driven by
  `buildNestedPartitionSchemaCases`; collapse three
  `testExtractNestedPartitionFilters*` cases into
  `extractNestedPartitionFiltersCases`. Keeps the conflict-throws case
  as a focused `@Test`.
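
For readers unfamiliar with the idea behind `buildNestedPartitionSchema`, the following is a rough sketch of how dotted partition paths could be folded into a nested structure using an insertion-ordered map. It is not the implementation in this PR: `Node`, `buildTree`, and `render` are hypothetical names, data types are omitted, and the conflict rule shown (a path used both as a leaf and as a struct parent) is an assumption about what the conflict-throws case covers.

```scala
import scala.collection.mutable

object NestedSchemaSketch {
  // Hypothetical tree node; children keep insertion order via LinkedHashMap.
  final class Node(val name: String) {
    val children: mutable.LinkedHashMap[String, Node] = mutable.LinkedHashMap.empty
    var isLeaf: Boolean = false
  }

  // Builds a tree from dotted partition paths such as "nested_record.level".
  // Throws if a path is used both as a leaf and as a struct parent.
  def buildTree(paths: Seq[String]): Node = {
    val root = new Node("")
    paths.foreach { path =>
      val parts = path.split('.')
      var node = root
      parts.zipWithIndex.foreach { case (part, i) =>
        val last = i == parts.length - 1
        val child = node.children.getOrElseUpdate(part, new Node(part))
        if (child.isLeaf && !last)
          throw new IllegalArgumentException(s"'$part' in '$path' is already a leaf")
        if (last) {
          if (child.children.nonEmpty)
            throw new IllegalArgumentException(s"'$path' is already a struct parent")
          child.isLeaf = true
        }
        node = child
      }
    }
    root
  }

  // Renders the tree as a compact, readable description.
  def render(node: Node): String =
    node.children.values
      .map(c => if (c.isLeaf) c.name else s"${c.name}<${render(c)}>")
      .mkString(", ")
}
```

In this sketch, the paths `country`, `nested_record.level`, and `nested_record.region` render as `country, nested_record<level, region>`, while passing both `a` and `a.b` throws.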
Contributor

@yihua yihua left a comment

LGTM after minor revision

@yihua yihua force-pushed the nested_partitioning branch from ac03916 to 525f932 Compare May 15, 2026 21:11
@hudi-bot
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua merged commit dca76ca into apache:master May 15, 2026
60 of 63 checks passed
@codecov-commenter

Codecov Report

❌ Patch coverage is 82.65306% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.14%. Comparing base (ce1f5f0) to head (4bc966c).
⚠️ Report is 8 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...la/org/apache/hudi/SparkHoodieTableFileIndex.scala | 85.93% | 2 Missing and 7 partials ⚠️ |
| ...c/main/scala/org/apache/hudi/HoodieFileIndex.scala | 76.00% | 3 Missing and 3 partials ⚠️ |
| ...alysis/Spark3HoodiePruneFileSourcePartitions.scala | 75.00% | 0 Missing and 1 partial ⚠️ |
| ...alysis/Spark4HoodiePruneFileSourcePartitions.scala | 75.00% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18126       +/-   ##
=============================================
+ Coverage     54.01%   68.14%   +14.12%     
- Complexity    12461    29110    +16649     
=============================================
  Files          1434     2517     +1083     
  Lines         72161   141194    +69033     
  Branches       8245    17528     +9283     
=============================================
+ Hits          38979    96218    +57239     
- Misses        29686    37062     +7376     
- Partials       3496     7914     +4418     

| Flag | Coverage Δ |
|---|---|
| common-and-other-modules | 44.40% <37.63%> (?) |
| hadoop-mr-java-client | 45.00% <ø> (-0.01%) ⬇️ |
| spark-client-hadoop-common | 48.32% <ø> (-0.01%) ⬇️ |
| spark-java-tests | 48.98% <82.65%> (?) |
| spark-scala-tests | 44.91% <53.06%> (?) |
| utilities | 37.61% <49.46%> (?) |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| ...lysis/Spark33HoodiePruneFileSourcePartitions.scala | 78.26% <100.00%> (ø) |
| ...alysis/Spark3HoodiePruneFileSourcePartitions.scala | 80.85% <75.00%> (ø) |
| ...alysis/Spark4HoodiePruneFileSourcePartitions.scala | 78.26% <75.00%> (ø) |
| ...c/main/scala/org/apache/hudi/HoodieFileIndex.scala | 82.57% <76.00%> (ø) |
| ...la/org/apache/hudi/SparkHoodieTableFileIndex.scala | 75.26% <85.93%> (ø) |

... and 1857 files with indirect coverage changes


Labels

size:L PR with lines of changes in (300, 1000]

6 participants