fix: Support data pruning using nested partition columns by linliu-code · Pull Request #18126 · apache/hudi

linliu-code · 2026-02-07T23:25:56Z

Describe the issue this Pull Request addresses

The behavior of SparkHoodieTableFileIndex has changed since 0.14.1: the StructType(partitionFields) it returns no longer carries the full path of nested partition fields, which causes data validation failures. The behavior was changed as part of this PR: https://github.com/apache/hudi/pull/9863/changes

Meanwhile, Spark does not support nested partition columns.

Summary and Changelog

If a table has a nested partition column whose leaf name conflicts with another top-level field, the partitionedSchema passed to the new file group reader is incorrect. The fix is to return the partition field with its full path name instead of the inner field name.
Because Spark does not support nested partition columns, it passes the wrong filters to the Hudi index. Therefore, when Hudi receives the partition and data filters, it reclassifies them based on the partition column metadata.
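
To illustrate the naming conflict described above, here is a minimal sketch in plain Scala. It does not use Hudi or Spark types; `TableLayout`, `leafNames`, `fullPathNames`, and `collisions` are hypothetical names introduced only for this example, and the column names are made up.

```scala
object PartitionFieldNaming {
  // Hypothetical flat view of a table: top-level column names plus the
  // dotted paths of its partition columns. Not a Hudi or Spark type.
  final case class TableLayout(topLevelColumns: Seq[String], partitionPaths: Seq[String])

  // Naming by leaf only: keeps just the innermost field name of each path.
  def leafNames(layout: TableLayout): Seq[String] =
    layout.partitionPaths.map(path => path.split('.').last)

  // Naming by full path: keeps the dotted path unchanged.
  def fullPathNames(layout: TableLayout): Seq[String] = layout.partitionPaths

  // Partition names that clash with a top-level column under a naming scheme.
  def collisions(layout: TableLayout, names: Seq[String]): Seq[String] =
    names.filter(name => layout.topLevelColumns.contains(name))
}
```

With top-level columns `id`, `level`, and `nested_record`, and a partition path `nested_record.level`, the leaf name `level` clashes with the top-level `level` column, while the full path does not.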

Impact

Medium

Risk Level

Low.

Documentation Update

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

nsivabalan · 2026-02-09T03:10:07Z

@hudi-bot run azure

linliu-code · 2026-02-09T17:33:52Z

@hudi-bot run azure

The command does not seem to be working. Let me push again to trigger the Azure test.

yihua · 2026-02-10T16:25:57Z

@@ -2546,6 +2453,204 @@ class TestCOWDataSource extends HoodieSparkClientTestBase with ScalaAssertionSup
      writeToHudi(opt, firstUpdateDF, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
    })
  }
+
+  @ParameterizedTest
+  @CsvSource(Array("COW", "MOR"))


Could this test be extracted and called from TestCOWDataSource for COW table and TestMORDataSource for MOR table?

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

[Update review] This incremental review of PR #18126 focuses on the nested partition column pruning support. The approach — caching partition-pruned file slices in HoodieFileIndex and reconstructing a nested StructType for Spark's optimizer — is sound in concept and the test coverage is thorough. However, the core concern from the previous review cycle (struct-parent prefix matching being too broad) remains unaddressed. A filter on a non-partition nested field of a partition-containing struct (e.g., nested_record.nested_int = 10 when only nested_record.level is a partition column) would be misclassified as a partition-pruning predicate, leading to an evaluation failure at runtime. Additionally, the single-use-then-clear cache pattern in listFiles risks full table scans under AQE re-planning.
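
The prefix-matching concern above can be sketched in plain Scala. This is an illustration under stated assumptions, not the PR's code: `Filter`, `isPartitionFilterByParent`, and `isPartitionFilterExact` are hypothetical names, and a predicate is reduced to the set of dotted column paths it references.

```scala
object FilterClassification {
  // Hypothetical stand-in for a predicate: a label plus the dotted column
  // paths it references. Not a Spark Expression.
  final case class Filter(description: String, referencedPaths: Set[String])

  // Broad rule: a reference counts as a partition reference when its
  // top-level struct also contains some partition column.
  def isPartitionFilterByParent(f: Filter, partitionPaths: Set[String]): Boolean = {
    val parents = partitionPaths.map(p => p.split('.').head)
    f.referencedPaths.nonEmpty &&
      f.referencedPaths.forall(ref => parents.contains(ref.split('.').head))
  }

  // Strict rule: every referenced path must be exactly a partition column path.
  def isPartitionFilterExact(f: Filter, partitionPaths: Set[String]): Boolean =
    f.referencedPaths.nonEmpty &&
      f.referencedPaths.forall(ref => partitionPaths.contains(ref))
}
```

For a table partitioned only on `nested_record.level`, a filter on `nested_record.nested_int` is accepted by the broad rule but rejected by the strict one, which is the misclassification the review describes.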

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! This is a solid approach to handling nested partition columns — the core idea of re-extracting partition predicates from data filters and building a nested schema for Spark's planner is well thought out. A few issues to address in the inline comments, most notably a case-sensitivity inconsistency in the GetStructField binding path.
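
The review does not spell out the inconsistency here. As a generic sketch only, one way to keep field lookups consistent is to route every name comparison along a nested path through a single helper that takes the case-sensitivity setting as a parameter. `resolveIndex` is a hypothetical helper, not part of Hudi or Spark.

```scala
object FieldResolution {
  // Find the position of `name` among a struct's field names, applying one
  // case-sensitivity setting to the comparison. Hypothetical helper.
  def resolveIndex(fieldNames: Seq[String], name: String, caseSensitive: Boolean): Option[Int] = {
    val idx =
      if (caseSensitive) fieldNames.indexWhere(_ == name)
      else fieldNames.indexWhere(_.equalsIgnoreCase(name))
    if (idx >= 0) Some(idx) else None
  }
}
```

Using the same flag for both the top-level column lookup and the nested field lookup avoids a path that resolves at one level but fails at the next.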

- SparkHoodieTableFileIndex: add `DataType` and `LinkedHashMap` imports, drop the inline fully-qualified references in `NestedFieldNode` and `buildNestedPartitionSchema`.
- TestCOWDataSource: add imports for `LogicalRelation`, `HadoopFsRelation`, `HoodieFileIndex`, `HoodieBaseRelation`; drop FQN in `runNestedFieldPartitionTest`.
- TestHoodieFileIndex: collapse six `testBuildNestedPartitionSchema*` cases into a single `@ParameterizedTest` driven by `buildNestedPartitionSchemaCases`; collapse three `testExtractNestedPartitionFilters*` cases into `extractNestedPartitionFiltersCases`. Keeps the conflict-throws case as a focused `@Test`.
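
For readers unfamiliar with the idea behind building a nested schema from dotted partition paths, the following is a rough sketch in plain Scala. It is not the PR's `NestedFieldNode` or `buildNestedPartitionSchema`; the `Node` types, `build`, and `render` are hypothetical, types are plain strings, and the conflict rule (a name used both as a leaf and as a struct) is an assumption about what a conflict case might look like.

```scala
import scala.collection.mutable

object NestedSchemaSketch {
  // Hypothetical tree: a leaf holds a type name, a struct holds ordered children.
  sealed trait Node
  final case class LeafNode(typeName: String) extends Node
  final case class StructNode(children: mutable.LinkedHashMap[String, Node]) extends Node

  private def emptyStruct(): StructNode =
    StructNode(mutable.LinkedHashMap.empty[String, Node])

  // Insert (dotted path, type name) pairs; LinkedHashMap keeps first-seen order.
  def build(paths: Seq[(String, String)]): StructNode = {
    val root = emptyStruct()
    paths.foreach { case (path, typeName) =>
      val parts = path.split('.')
      var current = root
      // Walk or create the intermediate structs.
      parts.init.foreach { part =>
        current.children.getOrElseUpdate(part, emptyStruct()) match {
          case s: StructNode => current = s
          case _: LeafNode =>
            throw new IllegalArgumentException(s"'$part' in '$path' is already a leaf")
        }
      }
      // Place the leaf, rejecting a name that is already used as a struct.
      current.children.get(parts.last) match {
        case Some(_: StructNode) =>
          throw new IllegalArgumentException(s"'$path' is already a struct")
        case _ => current.children.update(parts.last, LeafNode(typeName))
      }
    }
    root
  }

  // Render the tree as a DDL-like string for inspection.
  def render(node: Node): String = node match {
    case LeafNode(t) => t
    case StructNode(children) =>
      children.map { case (name, child) => s"$name: ${render(child)}" }
        .mkString("struct<", ", ", ">")
  }
}
```

Paths that share a parent, such as `nested_record.level` and `nested_record.region`, end up under one struct node, and a table of (input paths, expected rendering) pairs lends itself to the parameterized test style mentioned in the list above.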

yihua

LGTM after minor revision

hudi-bot · 2026-05-15T21:12:42Z

CI report:

0da3ad8 UNKNOWN
fe5742b UNKNOWN
5ac8719 UNKNOWN
68c1aad UNKNOWN
3ec0d25 Azure: SUCCESS
59d4bab UNKNOWN
830c281 UNKNOWN
3f38ecc UNKNOWN
ac03916 UNKNOWN
525f932 UNKNOWN

Bot commands

@hudi-bot supports the following commands:

`@hudi-bot run azure`: re-run the last Azure build

hudi-bot · 2026-05-15T21:16:58Z

CI report:

0da3ad8 UNKNOWN
fe5742b UNKNOWN
5ac8719 UNKNOWN
68c1aad UNKNOWN
3ec0d25 Azure: SUCCESS
59d4bab UNKNOWN
830c281 UNKNOWN
3f38ecc UNKNOWN
ac03916 UNKNOWN
525f932 Azure: PENDING
4bc966c UNKNOWN

codecov-commenter · 2026-05-15T23:00:12Z

Codecov Report

❌ Patch coverage is 82.65306% with 17 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.14%. Comparing base (ce1f5f0) to head (4bc966c).
⚠️ Report is 8 commits behind head on master.

Files with missing lines	Patch %	Lines
...la/org/apache/hudi/SparkHoodieTableFileIndex.scala	85.93%	2 Missing and 7 partials ⚠️
...c/main/scala/org/apache/hudi/HoodieFileIndex.scala	76.00%	3 Missing and 3 partials ⚠️
...alysis/Spark3HoodiePruneFileSourcePartitions.scala	75.00%	0 Missing and 1 partial ⚠️
...alysis/Spark4HoodiePruneFileSourcePartitions.scala	75.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##             master   #18126       +/-   ##
=============================================
+ Coverage     54.01%   68.14%   +14.12%     
- Complexity    12461    29110    +16649     
=============================================
  Files          1434     2517     +1083     
  Lines         72161   141194    +69033     
  Branches       8245    17528     +9283     
=============================================
+ Hits          38979    96218    +57239     
- Misses        29686    37062     +7376     
- Partials       3496     7914     +4418

Flag	Coverage Δ
common-and-other-modules	`44.40% <37.63%> (?)`
hadoop-mr-java-client	`45.00% <ø> (-0.01%)`	⬇️
spark-client-hadoop-common	`48.32% <ø> (-0.01%)`	⬇️
spark-java-tests	`48.98% <82.65%> (?)`
spark-scala-tests	`44.91% <53.06%> (?)`
utilities	`37.61% <49.46%> (?)`

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
...lysis/Spark33HoodiePruneFileSourcePartitions.scala	`78.26% <100.00%> (ø)`
...alysis/Spark3HoodiePruneFileSourcePartitions.scala	`80.85% <75.00%> (ø)`
...alysis/Spark4HoodiePruneFileSourcePartitions.scala	`78.26% <75.00%> (ø)`
...c/main/scala/org/apache/hudi/HoodieFileIndex.scala	`82.57% <76.00%> (ø)`
...la/org/apache/hudi/SparkHoodieTableFileIndex.scala	`75.26% <85.93%> (ø)`

... and 1857 files with indirect coverage changes

github-actions Bot added the size:M PR with lines of changes in (100, 300] label Feb 7, 2026

linliu-code force-pushed the nested_partitioning branch 3 times, most recently from d6f9ca7 to 413fa60 Compare February 8, 2026 01:00

linliu-code changed the title ~~fix: Reproduce nested partition columns pruning data validation failure~~ fix: Support data pruning using nested partition columns Feb 8, 2026

linliu-code marked this pull request as ready for review February 8, 2026 05:50

linliu-code requested a review from yihua February 8, 2026 05:54

nsivabalan approved these changes Feb 9, 2026

linliu-code force-pushed the nested_partitioning branch from 413fa60 to eaf9bd8 Compare February 9, 2026 17:34

apache deleted a comment from hudi-bot Feb 10, 2026

yihua reviewed Feb 10, 2026

Comment thread ...-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala Outdated

yihua reviewed Feb 10, 2026

Comment thread ...-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala Outdated

linliu-code force-pushed the nested_partitioning branch from f2d9632 to c5a44d7 Compare February 17, 2026 23:03

github-actions Bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Feb 17, 2026

linliu-code force-pushed the nested_partitioning branch 2 times, most recently from 29981f5 to e8557d3 Compare February 18, 2026 17:42

yihua reviewed Feb 18, 2026

Comment thread ...-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala Outdated

linliu-code force-pushed the nested_partitioning branch 2 times, most recently from 482ac01 to 0b6dc83 Compare April 1, 2026 00:25

yihua reviewed Apr 4, 2026

linliu-code force-pushed the nested_partitioning branch from 63de989 to 57b9cc9 Compare April 10, 2026 19:34

yihua reviewed Apr 10, 2026

yihua mentioned this pull request Apr 10, 2026

[OSS PR #18126] fix: Support data pruning using nested partition columns yihua/hudi#33

Open

linliu-code force-pushed the nested_partitioning branch 2 times, most recently from 68c1aad to 4f087e0 Compare April 11, 2026 00:19

yihua added this to the release-1.2.0 milestone May 15, 2026

vinishjail97 and others added 16 commits May 15, 2026 13:19

Handle nested map and array columns in MDT

5b94415

Fix the issue and add tests

0e26dc8

Addressed the comments

ade116e

Fix another bug for partition filter

7c63922

remove reclassification

7dbec4b

fix CI failures

9ed6496

fix CI issues

af1aa0f

refactor

874aa25

refactor

45e6fb3

address comments

aa9c501

Address comments

c44767c

address comments

0b85f2e

refactor

2d85550

clean comments

f8a0bb8

fix ci

3d86257

Ci failures

59d4bab

yihua force-pushed the nested_partitioning branch from 3ec0d25 to 59d4bab Compare May 15, 2026 20:20

yihua reviewed May 15, 2026

Comment thread ...-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala Outdated

Comment thread ...park-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala Outdated

yihua mentioned this pull request May 15, 2026

Address review comments: replace FQN with imports, parameterize tests linliu-code/hudi#5

Closed

3 tasks

yihua added 2 commits May 15, 2026 13:50

Fix imports

3f38ecc

yihua approved these changes May 15, 2026

Fix build

525f932

yihua force-pushed the nested_partitioning branch from ac03916 to 525f932 Compare May 15, 2026 21:11

Fix one more build issue

4bc966c

yihua merged commit dca76ca into apache:master May 15, 2026
60 of 63 checks passed

6 participants
