fix: Support data pruning using nested partition columns#18126
Conversation
@hudi-bot run azure
The command doesn't seem to be working. Let me push again to trigger the Azure test.
@@ -2546,6 +2453,204 @@ class TestCOWDataSource extends HoodieSparkClientTestBase with ScalaAssertionSup
      writeToHudi(opt, firstUpdateDF, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
    })
  }

  @ParameterizedTest
  @CsvSource(Array("COW", "MOR"))
Could this test be extracted and called from TestCOWDataSource for COW table and TestMORDataSource for MOR table?
yihua left a comment:
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
[Update review] This incremental review of PR #18126 focuses on the nested partition column pruning support. The approach — caching partition-pruned file slices in HoodieFileIndex and reconstructing a nested StructType for Spark's optimizer — is sound in concept and the test coverage is thorough. However, the core concern from the previous review cycle (struct-parent prefix matching being too broad) remains unaddressed. A filter on a non-partition nested field of a partition-containing struct (e.g., nested_record.nested_int = 10 when only nested_record.level is a partition column) would be misclassified as a partition-pruning predicate, leading to an evaluation failure at runtime. Additionally, the single-use-then-clear cache pattern in listFiles risks full table scans under AQE re-planning.
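The misclassification described above comes down to how a nested attribute path is matched against the declared partition columns. Below is a small Python model of the two checks (hypothetical helper names; the real logic is Scala in `HoodieFileIndex`): a prefix-style check that only compares the struct parent, versus an exact-path check that only accepts the partition column itself.

```python
def is_partition_predicate_prefix(attr_path, partition_cols):
    """Too-broad check: any field under a struct that contains a partition
    column is treated as a partition-pruning predicate."""
    return any(attr_path[0] == col.split(".")[0] for col in partition_cols)

def is_partition_predicate_exact(attr_path, partition_cols):
    """Exact-path check: only the partition column's full dotted path qualifies."""
    return ".".join(attr_path) in partition_cols

partition_cols = ["nested_record.level"]

# nested_record.nested_int shares the struct parent but is NOT a partition column:
# the prefix check misclassifies it, the exact check does not.
assert is_partition_predicate_prefix(["nested_record", "nested_int"], partition_cols)
assert not is_partition_predicate_exact(["nested_record", "nested_int"], partition_cols)
assert is_partition_predicate_exact(["nested_record", "level"], partition_cols)
```

With the prefix check, a filter like `nested_record.nested_int = 10` would be routed to partition-path evaluation, where `nested_int` does not exist — the runtime failure the review describes.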
yihua left a comment:
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for contributing! This is a solid approach to handling nested partition columns — the core idea of re-extracting partition predicates from data filters and building a nested schema for Spark's planner is well thought out. A few issues to address in the inline comments, most notably a case-sensitivity inconsistency in the GetStructField binding path.
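The case-sensitivity concern is about binding a `GetStructField` reference consistently with the rest of the resolver. A minimal Python model of the idea (hypothetical names; Spark's actual behavior is governed by `spark.sql.caseSensitive`): both the lookup key and the table of partition column paths must be normalized under the same mode, or a reference like `Nested_Record.Level` binds in one path and fails in another.

```python
def resolve_partition_column(ref_path, partition_cols, case_sensitive):
    """Bind a dotted field reference to a declared partition column,
    honoring the session's case-sensitivity setting."""
    lookup = {c if case_sensitive else c.lower(): c for c in partition_cols}
    key = ref_path if case_sensitive else ref_path.lower()
    return lookup.get(key)

# Case-insensitive mode (Spark's default) binds regardless of casing;
# case-sensitive mode requires an exact match.
assert resolve_partition_column("Nested_Record.Level",
                                ["nested_record.level"], False) == "nested_record.level"
assert resolve_partition_column("Nested_Record.Level",
                                ["nested_record.level"], True) is None
```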
- SparkHoodieTableFileIndex: add `DataType` and `LinkedHashMap` imports, drop the inline fully-qualified references in `NestedFieldNode` and `buildNestedPartitionSchema`.
- TestCOWDataSource: add imports for `LogicalRelation`, `HadoopFsRelation`, `HoodieFileIndex`, `HoodieBaseRelation`; drop FQN in `runNestedFieldPartitionTest`.
- TestHoodieFileIndex: collapse six `testBuildNestedPartitionSchema*` cases into a single `@ParameterizedTest` driven by `buildNestedPartitionSchemaCases`; collapse three `testExtractNestedPartitionFilters*` cases into `extractNestedPartitionFiltersCases`. Keeps the conflict-throws case as a focused `@Test`.
Codecov Report

@@ Coverage Diff @@
## master #18126 +/- ##
=============================================
+ Coverage 54.01% 68.14% +14.12%
- Complexity 12461 29110 +16649
=============================================
Files 1434 2517 +1083
Lines 72161 141194 +69033
Branches 8245 17528 +9283
=============================================
+ Hits 38979 96218 +57239
- Misses 29686 37062 +7376
- Partials 3496 7914 +4418
Flags with carried forward coverage won't be shown.
Describe the issue this Pull Request addresses
There's a change in behavior for SparkHoodieTableFileIndex since 0.14.1. The StructType(partitionFields) it returns no longer carries the full nested path, causing data validation failures. This behavior changed as part of this PR: https://github.com/apache/hudi/pull/9863/changes
Meanwhile, Spark does not support nested partition columns.
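Since Spark cannot express nested partition columns directly, the fix reconstructs a nested schema from the dotted partition column paths so the optimizer sees the fields at their full paths. A Python sketch of that folding step, modeling a `StructType` as nested dicts (hypothetical helper name; the real implementation is `buildNestedPartitionSchema` in Scala):

```python
def build_nested_partition_schema(partition_paths):
    """Fold dotted paths like 'nested_record.level' into a nested mapping
    that mirrors a StructType: struct fields become sub-dicts, leaf
    partition columns become None placeholders for their data types."""
    root = {}
    for path in partition_paths:
        node = root
        parts = path.split(".")
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if child is None:
                # A path already ended at this leaf; it cannot also be a struct.
                raise ValueError(f"conflicting partition paths at '{part}'")
            node = child
        node[parts[-1]] = None
    return root

schema = build_nested_partition_schema(["nested_record.level", "region"])
assert schema == {"nested_record": {"level": None}, "region": None}
```

The conflict branch mirrors the "conflict-throws" case kept as a focused `@Test` in TestHoodieFileIndex: a path cannot be both a leaf partition column and the parent of another one.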
Summary and Changelog
Impact
Medium
Risk Level
Low.
Documentation Update
Contributor's checklist