
fix: Enable schema merging for incremental and dfs sources #18385

Merged
yihua merged 9 commits into apache:master from linliu-code:enforce_schema_merging on May 15, 2026

Conversation

linliu-code (Collaborator) commented Mar 25, 2026

Describe the issue this Pull Request addresses

#18382 — source files with heterogeneous schemas in a single Hudi-Streamer batch can silently drop columns. This PR closes the FS-based-source side of that bug class. The MetadataBootstrap paths (ParquetBootstrapMetadataHandler, OrcBootstrapMetadataHandler) are deferred to a follow-up because they need partition-level schema discovery, which is architecturally bigger than the per-read knobs in this PR.

Summary and Changelog

Cloud incremental sources (S3/GCS):

  1. New hoodie.streamer.source.cloud.data.merge.schema.enable (since 1.2.0, advanced), default true. Covers both Parquet and ORC.
  2. In CloudObjectsSelectorCommon, the format gate is widened to parquet || orc, and the Spark reader option mergeSchema=true is set before SPARK_DATASOURCE_OPTIONS is applied, so users can still override it per read via JSON options (e.g. {"mergeSchema":"false"}); a sketch of this ordering follows this list.
  3. TestCloudObjectsSelectorCommon covers Parquet end-to-end (merged by default) plus a predicate test for the format dispatch. End-to-end ORC isn't exercised in this module because hudi-utilities pulls in orc-core-nohive, which conflicts with Spark 3.x's ORC writer; the predicate test pins the format dispatch, and the end-to-end ORC behaviour matches Parquet via the shared helper.
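
A minimal sketch of the option ordering in item 2, assuming an illustrative helper shape (the real CloudObjectsSelectorCommon signature and surrounding logic differ):

```java
import java.util.Map;

import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.SparkSession;

final class MergeSchemaOrderingSketch {
  static DataFrameReader buildReader(SparkSession spark,
                                     String fileFormat,
                                     boolean mergeSchemaEnabled,
                                     Map<String, String> sparkDatasourceOptions) {
    DataFrameReader reader = spark.read().format(fileFormat);
    // Hudi-level default first: inject mergeSchema=true for parquet/orc reads.
    if (mergeSchemaEnabled) {
      reader = reader.option("mergeSchema", "true");
    }
    // User-provided SPARK_DATASOURCE_OPTIONS last: a later option(...) call for
    // the same key wins, so {"mergeSchema":"false"} still overrides the default.
    for (Map.Entry<String, String> e : sparkDatasourceOptions.entrySet()) {
      reader = reader.option(e.getKey(), e.getValue());
    }
    return reader;
  }
}
```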

ParquetDFSSource:
4. ParquetDFSSourceConfig.PARQUET_DFS_MERGE_SCHEMA default flipped from false to true. Heterogeneous-schema parquet files in a single Streamer batch now get a unioned schema instead of silently dropping columns. The new primary key is hoodie.streamer.source.parquet.dfs.merge.schema.enable; the previous underscore-style key (...merge_schema.enable, since 0.15.0) is preserved as a back-compat alternative.
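
For reference, a hedged sketch of the key-plus-alternative wiring using Hudi's ConfigProperty builder; the constant placement and documentation string here are illustrative rather than copied from ParquetDFSSourceConfig:

```java
import org.apache.hudi.common.config.ConfigProperty;

public final class ParquetDfsMergeSchemaSketch {

  public static final ConfigProperty<Boolean> PARQUET_DFS_MERGE_SCHEMA = ConfigProperty
      .key("hoodie.streamer.source.parquet.dfs.merge.schema.enable")   // new primary key
      .defaultValue(true)                                              // flipped from false
      .withAlternatives("hoodie.streamer.source.parquet.dfs.merge_schema.enable") // 0.15.0 key
      .markAdvanced()
      .sinceVersion("0.15.0")
      .withDocumentation("Merge schemas across all parquet files in a Streamer batch "
          + "instead of reading with a single file's schema.");
}
```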

⚠️ Release-notes call-out (broader scope than the issue title): this default flip applies to all ParquetDFSSource users, not only S3/GCS incremental. Plain DFS Parquet ingest now incurs footer reads across all files in a batch and may resolve a different schema than before for tables with heterogeneous-schema source files. This is intentional — the silent-drop bug class is the same on both paths — but operators upgrading should know. Set the property to false to restore the previous reader behavior.

ORCDFSSource:
5. New ORCDFSSourceConfig.ORC_DFS_MERGE_SCHEMA (hoodie.streamer.source.orc.dfs.merge.schema.enable, since 1.2.0, advanced), default true. Mirrors the parquet config. Plumbed into ORCDFSSource.fromFiles() as reader.option("mergeSchema", flag). Requires spark.sql.orc.impl=native (default since Spark 2.4); silently ignored under the Hive impl.
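
A minimal sketch of the plumbing described in item 5, with illustrative names (the real ORCDFSSource.fromFiles() signature differs):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

final class OrcDfsReadSketch {
  // pathStr is a comma-separated list of ORC paths, mirroring the DFS-source style.
  static Dataset<Row> fromFiles(SparkSession spark, boolean mergeSchema, String pathStr) {
    // Only honored by the native ORC reader (spark.sql.orc.impl=native, the
    // default since Spark 2.4); the Hive impl silently ignores the option.
    return spark.read()
        .option("mergeSchema", String.valueOf(mergeSchema))
        .orc(pathStr.split(","));
  }
}
```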

AvroDFSSource regression test:
6. TestAvroDFSSource.testAdditiveSchemaEvolutionAcrossFiles writes one narrow + one wider Avro file under a unique subdirectory, configures the source's reader schema to the wider one, and asserts records from the narrow file get the wider schema's default for the new field while records from the wider file preserve their value. Locks in Avro reader/writer schema-resolution behaviour end-to-end through AvroDFSSource.
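
The schema-resolution rule the test locks in can be shown standalone with plain Avro; the record and schema shapes below are made up for illustration, not taken from the test:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroResolutionSketch {
  public static void main(String[] args) throws IOException {
    Schema narrow = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"R\",\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}");
    Schema wide = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"R\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"int\"},"
            + "{\"name\":\"note\",\"type\":\"string\",\"default\":\"n/a\"}]}");

    // Write one record with the narrow schema.
    GenericRecord rec = new GenericData.Record(narrow);
    rec.put("id", 1);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(narrow).write(rec, enc);
    enc.flush();

    // Read it back with writer=narrow, reader=wide: the new field gets its default.
    BinaryDecoder dec = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
    GenericRecord widened = new GenericDatumReader<GenericRecord>(narrow, wide).read(null, dec);
    System.out.println(widened.get("note")); // prints "n/a"
  }
}
```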

Impact

  • Cloud incremental Parquet AND ORC ingestion (e.g. HoodieStreamer cloud sources, S3EventsHoodieIncrSource, GcsEventsHoodieIncrSource) gets a unified schema across files in each read by default.
  • DFS Parquet ingestion gets a unified schema by default (was opt-in since 0.15.0). Affects all ParquetDFSSource users, not just cloud incremental.
  • DFS ORC ingestion gets a unified schema by default (new in this PR).
  • AvroDFSSource behaviour locked in by regression test.

Compatibility:

  • Default is on for cloud-Parquet/ORC and DFS-Parquet/ORC.
  • Set the new properties to false to restore the previous reader behavior (see the example after this list).
  • The previous underscore-style parquet.dfs.merge_schema.enable key continues to work via withAlternatives — explicit overrides preserved.
  • Performance: mergeSchema can add work (footer/schema aggregation) versus a single-schema read; usually acceptable relative to correctness for heterogeneous batches.
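
For example, a hedged snippet aggregating the three kill-switch keys quoted above into streamer properties (the TypedProperties wrapper is illustrative; the keys are the ones this PR defines):

```java
import org.apache.hudi.common.config.TypedProperties;

final class LegacyReaderBehaviorSketch {
  // Opt out of schema merging on all three paths touched by this PR.
  static TypedProperties legacyReaderBehavior() {
    TypedProperties props = new TypedProperties();
    props.setProperty("hoodie.streamer.source.parquet.dfs.merge.schema.enable", "false");
    props.setProperty("hoodie.streamer.source.orc.dfs.merge.schema.enable", "false");
    props.setProperty("hoodie.streamer.source.cloud.data.merge.schema.enable", "false");
    return props;
  }
}
```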

Risk Level

Medium. Touches default behaviour for three FS-based source paths; mitigations: each format has its own kill-switch config, and SPARK_DATASOURCE_OPTIONS continues to override per-read.

Documentation Update

The new configs hoodie.streamer.source.cloud.data.merge.schema.enable and hoodie.streamer.source.orc.dfs.merge.schema.enable are documented in CloudSourceConfig and ORCDFSSourceConfig respectively. The flipped default is documented in ParquetDFSSourceConfig. Release notes should highlight the ParquetDFSSource default flip explicitly, since it changes behaviour for non-cloud DFS users.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable — TestCloudObjectsSelectorCommon, TestParquetDFSSource, TestAvroDFSSource all pass locally (14 tests, 0 failures, 0 errors).

github-actions bot added the size:M (PR with lines of changes in (100, 300]) label Mar 25, 2026
linliu-code force-pushed the enforce_schema_merging branch from b13559e to 5adce47 March 25, 2026 23:21
linliu-code marked this pull request as ready for review March 26, 2026 03:23
yihua (Contributor) left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The fix is well-motivated and the implementation is clean: applying mergeSchema before SPARK_DATASOURCE_OPTIONS keeps the user-override ordering correct. One test-reliability concern is worth addressing before merging.

linliu-code force-pushed the enforce_schema_merging branch from 5adce47 to a262c46 April 15, 2026 09:44
yihua (Contributor) left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — Minor consistency issue in test code: first test method doesn't extract result.get() to a variable like the other test methods do.

yihua (Contributor) left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM — clean, well-scoped fix that adds mergeSchema support for Parquet reads in cloud source ingestion. The config is properly defined with alternative keys and sensible defaults, and the ordering (apply before SPARK_DATASOURCE_OPTIONS) correctly allows user overrides. Tests cover the key scenarios.

linliu-code changed the title from "fix: Merge schema for bootstrap" to "fix: Enable schema merging for Parquet and ORC in S3/GCS incremental sources" Apr 29, 2026
linliu-code requested a review from yihua April 29, 2026 20:28
hudi-agent (Contributor) left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR extends the cloud incremental source mergeSchema behavior to ORC and unifies it under a single config key with backward-compatible aliases. No critical correctness issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review.

cc @yihua

hudi-agent (Contributor) left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR enables Spark's mergeSchema option for Parquet/ORC reads in S3/GCS incremental cloud sources, adds a unified config with back-compat aliases for the prior parquet-only key, and flips the existing ParquetDFSSourceConfig default. One question worth confirming about the scope of the default change: please take a look at the inline comment. After that, this should be ready for a Hudi committer or PMC member to take it from here. A couple of minor naming/consistency suggestions below.

Comment thread on hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/ORCDFSSource.java (outdated)
linliu-code force-pushed the enforce_schema_merging branch from cadeb4b to 86ed1c6 May 5, 2026 17:42
linliu-code changed the title from "fix: Enable schema merging for Parquet and ORC in S3/GCS incremental sources" to "fix: Enable schema merging for incremental and dfs sources" May 5, 2026
linliu-code force-pushed the enforce_schema_merging branch 2 times, most recently from 6652082 to 4a96745 May 5, 2026 18:18
Comment on lines +190 to +198
// Back-compat aliases: an earlier iteration of this PR used a `merge_schema` (underscore)
// form and an earlier dot-style `merge.schema` form, plus the original parquet-only key.
// All four are still honored.
STREAMER_CONFIG_PREFIX + "source.cloud.data.merge_schema.enable",
DELTA_STREAMER_CONFIG_PREFIX + "source.cloud.data.merge_schema.enable",
STREAMER_CONFIG_PREFIX + "source.cloud.data.merge.schema",
DELTA_STREAMER_CONFIG_PREFIX + "source.cloud.data.merge.schema",
STREAMER_CONFIG_PREFIX + "source.cloud.data.parquet.merge.schema",
DELTA_STREAMER_CONFIG_PREFIX + "source.cloud.data.parquet.merge.schema")
linliu-code (Collaborator, Author):

We don't need these alternative configs: the aliased keys only existed in earlier iterations of this PR and were never merged or used by any code.

Comment on lines +47 to +49
// Back-compat aliases for the underscore-style keys used in an earlier iteration of this PR.
STREAMER_CONFIG_PREFIX + "source.orc.dfs.merge_schema.enable",
DELTA_STREAMER_CONFIG_PREFIX + "source.orc.dfs.merge_schema.enable")
linliu-code (Collaborator, Author):

Same here.

hudi-agent (Contributor) left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR enables mergeSchema by default on the DFS Parquet/ORC sources and the cloud-incremental S3/GCS source so heterogeneous-schema batches don't silently drop columns, with back-compat aliases for the previous config keys. No new issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. One minor naming nit in CloudObjectsSelectorCommon; the rest of the change is clean and well-documented.

cc @yihua

    if (fileFormat == null) {
      return false;
    }
    String f = fileFormat.trim();
Contributor:

🤖 nit: the single-character name f doesn't communicate intent here — could you rename it to trimmed (or just inline fileFormat.trim() in the return expression) so it's immediately clear what the variable represents?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@linliu-code linliu-code force-pushed the enforce_schema_merging branch from 4a96745 to ff794b0 Compare May 5, 2026 19:51
hudi-agent (Contributor) left a comment


🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR enables Spark mergeSchema for cloud-incremental and DFS Parquet/ORC sources to fix silent column drops on heterogeneous-schema batches, with appropriate back-compat alternatives and a release-notes call-out for the default flip. No new critical correctness issues flagged from this automated pass — a Hudi committer or PMC member can take it from here for a final review. A few naming suggestions below, but the change is clean overall.

cc @yihua

  static boolean isParquetOrOrcFileFormat(String fileFormat) {
    if (fileFormat == null) {
      return false;
    }
Contributor:

🤖 nit: could you rename f to something like trimmed or normalizedFormat? Single-letter locals make sense in tiny lambdas but here it's a named local in a package-private method that test code calls directly, so a slightly longer name would make the reader's intent clearer at a glance.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

yihua added this to the release-1.2.0 milestone May 15, 2026
linliu-code and others added 8 commits May 15, 2026 14:54
Widens the parquet-only mergeSchema injection in CloudObjectsSelectorCommon
to also cover ORC. Spark's native ORC reader honors the per-read
`mergeSchema` option on Spark 3.0+ (the native ORC impl has been the default
since Spark 2.4); on older runtimes the option is silently ignored, which
is harmless.

Renames the config from CLOUD_INCREMENTAL_PARQUET_MERGE_SCHEMA to
CLOUD_INCREMENTAL_MERGE_SCHEMA (key: source.cloud.data.merge.schema) and
keeps the previous parquet-only key as a back-compat alternative so
existing customer overrides continue to work.

Renames the helper applyParquetMergeSchemaOption -> applyMergeSchemaOption
and isParquetFileFormat -> isParquetOrOrcFileFormat. No external production
callers reference the old names by symbol.

Adds three ORC tests mirroring the existing parquet ones:
- orcMixedSchemasMergedByDefault
- orcMixedSchemasDropExtraColumnsWhenMergeDisabled
- orcSparkDatasourceOptionsMergeSchemaFalseDropsExtraColumns

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
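
For illustration, a minimal sketch of the widened gate this commit message describes; not the exact Hudi code (the real method lives in CloudObjectsSelectorCommon and its string handling may differ):

```java
final class FormatGateSketch {
  static boolean isParquetOrOrcFileFormat(String fileFormat) {
    if (fileFormat == null) {
      return false;
    }
    String trimmed = fileFormat.trim();
    return "parquet".equalsIgnoreCase(trimmed) || "orc".equalsIgnoreCase(trimmed);
  }
}
```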
…conflict)

The three orcMixedSchemas* tests added in the prior commit fail in
hudi-utilities CI with `NoSuchFieldError: type` from
`OrcMapredRecordWriter.addVariableLengthColumns`. Hudi-utilities pulls
in `orc-core-nohive` (kept for compile-time reasons per the pom comment)
while Spark 3.x's ORC writer was compiled against regular `orc-core`,
which has a `type` field the nohive variant lacks. Result:
`sparkSession.write().orc(...)` cannot be used in this module's tests.

Replaces the three e2e ORC tests with a single predicate test on
`CloudObjectsSelectorCommon.isParquetOrOrcFileFormat(String)` that
verifies the format-gating logic without exercising Spark's ORC writer.

The e2e behaviour for ORC mirrors Parquet via the shared helper
`applyMergeSchemaOption`, which is already covered by the three Parquet
e2e tests. The ORC-vs-Parquet difference is one boolean check in the
predicate, which the new test pins.

`isParquetOrOrcFileFormat` becomes package-private (was private) so the
test in the same package can reference it without reflection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per PR review feedback: this test asserts that with the Hudi
CLOUD_INCREMENTAL_MERGE_SCHEMA flag turned off, the result has only one
file's columns (2, not 3). It implicitly relies on Spark's session-level
spark.sql.parquet.mergeSchema being false, which is the Spark default
but can be overridden by the test runner. If a session enables
mergeSchema globally, Spark merges anyway and the test fails.

The other two parquet tests are reliable because they set the
mergeSchema option explicitly per-read (which overrides session-level
config):
- parquetMixedSchemasMergedByDefault: Hudi default true sets
  reader.option("mergeSchema", "true").
- parquetSparkDatasourceOptionsMergeSchemaFalseDropsExtraColumns:
  SPARK_DATASOURCE_OPTIONS sets reader.option("mergeSchema", "false").

The override path (SPARK_DATASOURCE_OPTIONS) and the default-on path
(CLOUD_INCREMENTAL_MERGE_SCHEMA defaults to true) are both still
covered. The flag-off-but-session-default path is not covered and is
the part deemed flaky.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
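
The precedence this commit message relies on can be sketched standalone (illustrative helper, not Hudi code): a per-read DataFrameReader option overrides the session-level Spark SQL conf for that read only:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

final class MergeSchemaPrecedenceSketch {
  static Dataset<Row> readWithoutMerging(SparkSession spark, String path) {
    // Session-level conf says merge...
    spark.conf().set("spark.sql.parquet.mergeSchema", "true");
    // ...but the per-read option wins for this read, so no merging happens here.
    // Tests that set the per-read option explicitly are therefore not
    // runner-dependent; tests that rely on the session default are.
    return spark.read().option("mergeSchema", "false").parquet(path);
  }
}
```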
…olumns

Per PR review feedback: this test relies on the per-read mergeSchema
option overriding the session-level spark.sql.parquet.mergeSchema. While
that's the documented Spark behaviour, a test runner with the session-
level config set to true could surface a false negative depending on
order of operations. The reviewer flagged it as the same flakiness class
as the previously-removed parquetMixedSchemasDropExtraColumnsWhenMergeDisabled.

Coverage that remains:
- parquetMixedSchemasMergedByDefault: end-to-end happy path, Hudi default
  (true) → mergeSchema=true on the reader → merged result. Reliable
  because the per-read option is set, overriding any session config.
- isParquetOrOrcFileFormatRecognisesBothFormats: format-gating predicate.

The override path (SPARK_DATASOURCE_OPTIONS overriding the Hudi flag) is
no longer covered by an e2e test; the option-merge order is exercised
by Spark's own DataFrameReader semantics, not by Hudi-side logic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both removed flaky tests (parquetMixedSchemasDropExtraColumnsWhenMergeDisabled
and parquetSparkDatasourceOptionsMergeSchemaFalseDropsExtraColumns) were
the only references to CloudSourceConfig in this test file. Drop the
now-unused import to satisfy the UnusedImports checkstyle rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urce

Follows up on the cloud-incremental-source fix in this PR. Closes the
remaining FS-based source gaps from ENG-41047 except the MetadataBootstrap
paths (handled separately as those need partition-level schema discovery).

- ParquetDFSSourceConfig.PARQUET_DFS_MERGE_SCHEMA: default flipped from
  false to true so that heterogeneous-schema parquet files in a single
  Hudi Streamer batch get a unioned schema instead of silently dropping
  columns that exist only in some files. Set to false to restore the
  prior single-file-schema-wins behavior.
- New ORCDFSSourceConfig.ORC_DFS_MERGE_SCHEMA, default true, mirroring
  the parquet config. Plumbed through ORCDFSSource.fromFiles() as
  reader.option("mergeSchema", flag). Requires spark.sql.orc.impl=native
  (default since Spark 2.4); silently ignored under the Hive impl.
- TestAvroDFSSource: regression test confirming additive-schema evolution
  across files works end-to-end. Writes one narrow + one wider Avro file,
  configures the source's reader schema to the wider one, and asserts
  records from the narrow file get the wider schema's default for the
  new field while records from the wider file preserve their value.
- Drop withAlternatives(DELTA_STREAMER_CONFIG_PREFIX + ...) on the new
  CLOUD_INCREMENTAL_MERGE_SCHEMA and ORC_DFS_MERGE_SCHEMA configs; legacy
  prefix alternatives are not needed for new keys.
- Drop the new-key-with-legacy-prefix alternative on PARQUET_DFS_MERGE_SCHEMA;
  keep underscore-style back-compat aliases since the key itself was renamed.
- Static-import assertTrue/assertFalse in the new tests.
yihua force-pushed the enforce_schema_merging branch from ff794b0 to f609bff May 15, 2026 22:00
yihua (Contributor) left a comment


LGTM

hudi-bot (Collaborator) commented

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

yihua merged commit ae9866a into apache:master May 15, 2026
61 of 63 checks passed
codecov-commenter commented

Codecov Report

❌ Patch coverage is 65.62500% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.14%. Comparing base (6e32f36) to head (f609bff).
⚠️ Report is 2 commits behind head on master.

Files with missing lines                                 Patch %   Lines
...ache/hudi/utilities/config/ORCDFSSourceConfig.java      0.00%   7 Missing ⚠️
...rg/apache/hudi/utilities/sources/ORCDFSSource.java      0.00%   2 Missing ⚠️
...es/sources/helpers/CloudObjectsSelectorCommon.java     80.00%   1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master   #18385   +/-   ##
=========================================
  Coverage     68.14%   68.14%           
- Complexity    29109    29118    +9     
=========================================
  Files          2517     2518    +1     
  Lines        141197   141221   +24     
  Branches      17529    17531    +2     
=========================================
+ Hits          96212    96238   +26     
- Misses        37068    37069    +1     
+ Partials       7917     7914    -3     
Flag                         Coverage Δ
common-and-other-modules     44.40% <65.62%> (+<0.01%) ⬆️
hadoop-mr-java-client        45.01% <ø> (-0.01%) ⬇️
spark-client-hadoop-common   48.32% <ø> (+<0.01%) ⬆️
spark-java-tests             48.97% <0.00%> (+0.05%) ⬆️
spark-scala-tests            44.90% <0.00%> (-0.01%) ⬇️
utilities                    37.65% <40.62%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines                                 Coverage Δ
...pache/hudi/utilities/config/CloudSourceConfig.java     99.10% <100.00%> (+0.05%) ⬆️
.../hudi/utilities/config/ParquetDFSSourceConfig.java     87.50% <100.00%> (ø)
...pache/hudi/utilities/sources/ParquetDFSSource.java    100.00% <100.00%> (ø)
...rg/apache/hudi/utilities/sources/ORCDFSSource.java      0.00% <0.00%> (ø)
...es/sources/helpers/CloudObjectsSelectorCommon.java     75.00% <80.00%> (+0.11%) ⬆️
...ache/hudi/utilities/config/ORCDFSSourceConfig.java      0.00% <0.00%> (ø)

... and 11 files with indirect coverage changes


Labels

size:M PR with lines of changes in (100, 300]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants