fix: timestamp logical types #18132
Conversation
val enableVectorizedReader: Boolean =
  sqlConf.parquetVectorizedReaderEnabled &&
  resultSchema.forall(_.dataType.isInstanceOf[AtomicType])
  ParquetUtils.isBatchReadSupportedForSchema(sqlConf, resultSchema)
This is the fix.
lokeshj1703
left a comment
@linliu-code Found a few optimisations which can be added. These would be needed in branch-0.x as well.
Schema writerSchema = mergeHandle.getWriterSchemaWithMetaFields();
Schema readerSchema = baseFileReader.getSchema();
Schema readerSchema = AvroSchemaUtils.getRepairedSchema(baseFileReader.getSchema(), writerSchema);
I can see this code getting executed in the executor. Ran test: org.apache.hudi.functional.TestRecordLevelIndex#testRLIUpsert
This might be a problem with branch-0.x as well.
Can you help me understand why we are not trying to infer isLogicalTimestampRepairEnabled in the executor code?
Isn't the idea to parse the schema once in the driver, infer it in the executor, and avoid the schema repair code altogether if the table does not contain any logical types at all?
This is difficult to fix. Multiple nested callers and functions like handleUpdateInternal are involved here.
But why can't we add it to the hadoopConfiguration that's part of table.getHadoopConf() in the driver and then fetch it from here?
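A rough sketch of that suggestion (illustrative only; table, writerSchema and baseFileReader come from the surrounding discussion, and the key reuses the HoodieFileReader.ENABLE_LOGICAL_TIMESTAMP_REPAIR property mentioned later in this thread):

// Driver side: compute the flag once from the already-parsed writer schema and
// stamp it on the Hadoop configuration that is shipped to the executors.
val hadoopConf = table.getHadoopConf
hadoopConf.setBoolean(HoodieFileReader.ENABLE_LOGICAL_TIMESTAMP_REPAIR,
  AvroSchemaUtils.hasTimestampMillisField(writerSchema))

// Executor side: a cheap config lookup instead of re-parsing the schema;
// default to true so behaviour is unchanged when the flag was never set.
val readerSchema =
  if (hadoopConf.getBoolean(HoodieFileReader.ENABLE_LOGICAL_TIMESTAMP_REPAIR, true)) {
    AvroSchemaUtils.getRepairedSchema(baseFileReader.getSchema, writerSchema)
  } else {
    baseFileReader.getSchema
  }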
Schema orignalReaderSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(config.getSchema()), config.allowOperationMetadataField());
// config.getSchema is not canonicalized, while config.getWriteSchema is canonicalized. So, we have to use the canonicalized schema to read the existing data.
baseFileReaderSchema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(config.getWriteSchema()), config.allowOperationMetadataField());
// Repair reader schema.
This would also be executed in the executor, and we can probably optimise by adding a flag in the caller. Could not validate this though.
lets fix this. should not take much effort to fix.
lets fix 0.15.1 if need be.
this.internalSchema = internalSchema == null ? InternalSchema.getEmptyInternalSchema() : internalSchema;
this.enableOptimizedLogBlocksScan = enableOptimizedLogBlocksScan;
this.enableLogicalTimestampFieldRepair = !hoodieTableMetaClient.isMetadataTable() && fs.getConf().getBoolean(HoodieFileReader.ENABLE_LOGICAL_TIMESTAMP_REPAIR,
    readerSchema != null && AvroSchemaUtils.hasTimestampMillisField(readerSchema));
This is also getting executed in the executor code path. We can optimise by adding a flag in the caller instead. Validated in the logical repair tests added in TestHoodieDeltaStreamer.
I think we took an informed decision to add optimisation for just metadata table here. We can probably check the others.
Will address these comments after cherry-picking commits from branch-0.x.
2b481e3 to 722842e
…se FileSystem and hadoop Configuration changes (0.14 equivalent)
yihua
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Why is the PR description empty? Can we fill that in, please?
nsivabalan
left a comment
I see we are removing SparkAvroPostProcessor.
Can we just leave it and not add it by default?
In a minor release, wondering if we should remove a utility class.
 */
def hasTimestampMillisField(schema: Schema): Boolean = {
  if (schema == null) {
    true
why do we return true if the schema is null?
nsivabalan
left a comment
I need to pull down the changes and do some additional inspection. For now, you can address the existing feedback.
 * @param avroSchema the table's Avro schema
 * @return set of field names whose type is timestamp-millis or local-timestamp-millis
 */
def getTimestampMillisColumns(avroSchema: org.apache.avro.Schema): Set[String] = {
are we considering only top level fields here?
It seems like it's only top-level fields being considered here.
Apache master uses index definition to get source fields and validates those source fields
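For reference, a rough sketch of how nested fields could also be covered (illustrative only, not the PR's implementation; it relies only on Avro's public Schema and LogicalType API):

import org.apache.avro.Schema
import scala.collection.JavaConverters._

// Walks records, unions, arrays and maps so that timestamp-millis /
// local-timestamp-millis fields are detected even when they are nested.
// A production version would also need to guard against recursive record definitions.
def hasTimestampMillisFieldNested(schema: Schema): Boolean = {
  if (schema == null) {
    true // conservative: repair stays enabled when no schema is available
  } else schema.getType match {
    case Schema.Type.RECORD =>
      schema.getFields.asScala.exists(f => hasTimestampMillisFieldNested(f.schema()))
    case Schema.Type.UNION =>
      schema.getTypes.asScala.exists(hasTimestampMillisFieldNested)
    case Schema.Type.ARRAY =>
      hasTimestampMillisFieldNested(schema.getElementType)
    case Schema.Type.MAP =>
      hasTimestampMillisFieldNested(schema.getValueType)
    case _ =>
      val lt = schema.getLogicalType
      lt != null && (lt.getName == "timestamp-millis" || lt.getName == "local-timestamp-millis")
  }
}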
ss.emptyDataFrame
// Avoid calling isEmpty() which can cause serialization issues with Ordering$Reverse
// Check partition count instead, which doesn't require task serialization
val structType = convertAvroSchemaToStructType(new Schema.Parser().parse(schemaStr))
Is this optimization applied to 0.15.1 as well?
The code is different in 0.15.1. This optimisation is not required there.
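For context, a hypothetical sketch of the pattern the diff comments describe (df stands for the DataFrame being checked; ss, schemaStr and convertAvroSchemaToStructType come from the surrounding code):

import org.apache.avro.Schema
import org.apache.spark.sql.Row

// rdd.getNumPartitions is driver-side metadata, so it avoids the Spark job and the
// task serialization that isEmpty() triggers (the Ordering$Reverse issue noted above).
// It is a conservative check: zero partitions implies empty, not the other way round.
val structType = convertAvroSchemaToStructType(new Schema.Parser().parse(schemaStr))
val result =
  if (df.rdd.getNumPartitions == 0) {
    ss.createDataFrame(ss.sparkContext.emptyRDD[Row], structType)
  } else {
    df
  }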
// sure that in case the file-schema is not equal to read-schema we'd still
// be able to read that file (in case projection is a proper one)
if (!requestedSchema.isPresent()) {
  Schema repairedFileSchema = getRepairedSchema(getSchema(), schema);
When we are instantiating the base file reader in L84 in HoodieMergeHelper, if we can embed a boolean flag in hadoopConf, we can fetch it again here and avoid repair calls for tables w/o any logical type.
conf.setBoolean(AvroReadSupport.AVRO_COMPATIBILITY, enableCompatibility);
}
return new HoodieAvroReadSupport<>(model);
return new HoodieAvroReadSupport<>(model, Option.ofNullable(tableSchema).map(schema -> getAvroSchemaConverter(conf).convert(schema)),
if hadoopConf has the value for hasLogicalTsField, we can also avoid the additional call in L90
new HoodieParquetReadSupport(
  convertTz,
  enableVectorizedReader = false,
  enableTimestampFieldRepair = true,
shouldn't we set the value to hasTimestampMillisFieldInTableSchema
}

@ParameterizedTest
@CsvSource(value = {"SIX,AVRO,CLUSTER", "CURRENT,AVRO,NONE", "CURRENT,AVRO,CLUSTER", "CURRENT,SPARK,NONE", "CURRENT,SPARK,CLUSTER"})
in 0.x, we don't have support for multiple writer versions. CURRENT and SIX are one and the same.
can we trim the unnecessary combinations.
@ParameterizedTest
@CsvSource(value = {
  "SIX,AVRO,CLUSTER,AVRO",
can we trim the unnecessary combinations.
And can you point me to the places where we avoid calls to repairSchema if it's the metadata table? I don't remember seeing it.
nsivabalan
left a comment
For MOR reads: in HoodieMergeOnReadRDD, L89 is where we broadcast the hadoop conf; let's inject the hasLogicalTs flag into it in the driver.
This eventually gets wired all the way to AbstractHoodieLogRecordReader.fs.getHadoopConf.
That piece of code is executed in the executor, so we can rely on the boolean flag to decide whether to call repairSchema or not.
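A rough sketch of that wiring (names are approximate; tableAvroSchema stands for the relation's already-resolved Avro schema, and the key reuses HoodieFileReader.ENABLE_LOGICAL_TIMESTAMP_REPAIR):

// Driver side, computed once per relation (lazy val) so that building N
// HoodieMergeOnReadRDDs for N file groups does not re-parse the schema N times.
lazy val hasLogicalTs: Boolean = AvroSchemaUtils.hasTimestampMillisField(tableAvroSchema)

// Stamp the flag on the Hadoop conf before it is broadcast; downstream,
// AbstractHoodieLogRecordReader can read it back through fs.getConf() in the
// executor and skip the repair entirely when no logical timestamp field exists.
hadoopConf.setBoolean(HoodieFileReader.ENABLE_LOGICAL_TIMESTAMP_REPAIR, hasLogicalTs)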
this.forceFullScan = forceFullScan;
this.internalSchema = internalSchema == null ? InternalSchema.getEmptyInternalSchema() : internalSchema;
this.enableOptimizedLogBlocksScan = enableOptimizedLogBlocksScan;
this.enableLogicalTimestampFieldRepair = !hoodieTableMetaClient.isMetadataTable() && fs.getConf().getBoolean(HoodieFileReader.ENABLE_LOGICAL_TIMESTAMP_REPAIR,
So, this is the only place where we optimize for mdt, is it?
btw, HoodieMergeOnReadRDD L89 is going to be invoked N number of times for N file groups in the driver, so we might be parsing the schema N times unless we cache the value of
@CsvSource(value = {"SIX,AVRO,CLUSTER", "CURRENT,AVRO,NONE", "CURRENT,AVRO,CLUSTER", "CURRENT,SPARK,NONE", "CURRENT,SPARK,CLUSTER"})
void testCOWLogicalRepair(String tableVersion, String recordType, String operation) throws Throwable {
@CsvSource(value = {"CLUSTER", "NONE"})
void testCOWLogicalRepair(String operation) throws Throwable {
lets include SPARK as well and not just AVRO
ParquetReader<IndexedRecord> reader =
    new HoodieAvroParquetReaderBuilder<IndexedRecord>(path)
    new HoodieAvroParquetReaderBuilder<IndexedRecord>(path,
        AvroSchemaUtils.isLogicalTimestampRepairNeeded(conf, false) || schema == null || AvroSchemaUtils.hasTimestampMillisField(schema))
Why is the default value false here, but true in all other places?
This has been removed now
Schema readerSchema = AvroSchemaUtils.getRepairedSchema(baseFileReader.getSchema(), writerSchema);

Schema readerSchema;
if (AvroSchemaUtils.isLogicalTimestampRepairNeeded(table.getHadoopConf(), true)) {
This line is executed in the executor. We should parse the schema in the driver and set the boolean flag in the hadoop conf at one of the caller sites in the driver.
This is just a check on the hadoop config; it's not parsing the schema. I had checked the caller, I think it was HoodieCompactor.
if (AvroSchemaUtils.hasTimestampMillisField(writerSchema)) {
  AvroSchemaUtils.setLogicalTimestampRepairNeeded(table.getHadoopConf());
}
AvroSchemaUtils.setLogicalTimestampRepairIfNotSet(table.getHadoopConf(), () -> AvroSchemaUtils.hasTimestampMillisField(writerSchema));
do we need to check for mdt here and avoid setting the config?
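If such a guard is wanted, a minimal sketch could look like this (illustrative; it reuses the metadata-table short-circuit shown earlier for enableLogicalTimestampFieldRepair and the JFunction.toJavaSupplier helper used elsewhere in this PR):

// Skip stamping the repair flag for the metadata table, mirroring the
// !hoodieTableMetaClient.isMetadataTable() check in the log record reader.
if (!table.getMetaClient.isMetadataTable) {
  AvroSchemaUtils.setLogicalTimestampRepairIfNotSet(
    table.getHadoopConf,
    JFunction.toJavaSupplier(() => Boolean.box(AvroSchemaUtils.hasTimestampMillisField(writerSchema))))
}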
jobConf.set(HoodieFileReader.ENABLE_LOGICAL_TIMESTAMP_REPAIR,
  java.lang.Boolean.toString(hasTimestampMillisFieldInTableSchema))
AvroSchemaUtils.setLogicalTimestampRepairIfNotSet(jobConf, JFunction.toJavaSupplier(() => hasTimestampMillisFieldInTableSchema.asInstanceOf[java.lang.Boolean]))
do we need to check for mdt here?
The PR adds an optimisation for the merged read handle so that the schema is repaired only if the schema has a timestamp-millis field. The existence of the timestamp-millis field is computed in the driver. It also addresses review comments in PR #18132 for the 0.15.1 version so that repair is applied only in cases where a timestamp-millis field is present in the schema.
---------
Co-authored-by: Lokesh Jain <ljain@Lokeshs-MacBook-Pro.local>
Co-authored-by: Lokesh Jain <ljain@192.168.1.5>
Co-authored-by: sivabalan <n.siva.b@gmail.com>

Describe the issue this Pull Request addresses
This PR mainly combines two PRs that fix the timestamp_millis logical type issue.
Summary and Changelog
Below is the PR description from #14161.
PR #9743 added more schema evolution functionality and schema processing. However, we used the InternalSchema system to do various operations such as fixing null ordering, reordering, and adding columns. At the time, InternalSchema only had a single Timestamp type. When converting back to Avro, this was assumed to be micros. Therefore, if the schema provider had any millis columns, the processed schema would end up with those columns as micros.
In the PR that updated column stats with better support for logical types (#13711), the schema issues were fixed, along with additional issues in the handling and conversion of timestamps during ingestion.
This PR aims to add functionality to the Spark and Hive readers and writers to automatically repair affected tables.
After switching to the 1.1 binary, the affected columns will undergo evolution from timestamp-micros to timestamp-millis. While this would normally be a lossy evolution that is not supported, it is safe here because the data is actually still timestamp-millis; it is just mislabeled as micros in the Parquet and table schemas.
Impact
When reading from a Hudi table using the Spark or Hive reader, if the table schema has a column as millis but the data schema is micros, we assume that this column is affected and read it as a millis value instead of a micros value. This correction is also applied to all readers that the default write paths use. As the table is rewritten, the Parquet files become correct. A table's latest snapshot can be immediately fixed by writing one commit with the 1.1 binary.
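For intuition (purely illustrative, not code from this PR), a value stored as epoch milliseconds but interpreted as microseconds is off by a factor of 1000:

import java.time.Instant

val epochMillis = 1700000000000L
// Read correctly as milliseconds: 2023-11-14T22:13:20Z
println(Instant.ofEpochMilli(epochMillis))
// The same long misread as microseconds corresponds to epochMillis / 1000 milliseconds,
// i.e. 1970-01-20T16:13:20Z -- which is why the mislabeled columns need the repair.
println(Instant.ofEpochMilli(epochMillis / 1000))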
Risk Level
High. Extensive testing was done and functional tests were added.
Documentation Update
Contributor's checklist