Flink dataloss fixes #18737
Conversation
The original OSS test was designed for the multi-checkpoint EventBuffers model. Adapt it for this fork's single eventBuffer model:
- Remove the intermediate DATA_SET_PART3 batch that would be lost on restart
- Add checkpoint(2) after assertNextEvent to capture coordinator state WITH the event buffer (the coordinator checkpoint saves state before the write function flush, so ckp-2 captures the events from ckp-1's flush)
- Write new data (DATA_SET_PART4) after restart under a new instant
- Verify both the recommitted data (par1) and the new data (par2)

Made-with: Cursor
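For illustration, the adapted flow looks roughly like this (a sketch only; the harness method names `consume`, `checkpointComplete`, `checkWrittenData`, and `end` are assumptions based on TestWriteBase's fluent style, and `EXPECTED` is a hypothetical expected-rows mapping):

```java
// Hedged sketch of the adapted test -- not the exact PR test code.
preparePipeline(conf)
    .consume(TestData.DATA_SET_PART1)   // write par1 data
    .checkpoint(1)
    .assertNextEvent()
    .checkpoint(2)                      // captures coordinator state WITH the event buffer
    // no checkpointComplete(2): simulate failure before the commit
    .restartCoordinator()               // restores the par1 event from ckp-2 state and recommits it
    .consume(TestData.DATA_SET_PART4)   // new data under a new instant
    .checkpoint(3)
    .assertNextEvent()
    .checkpointComplete(3)
    .checkWrittenData(EXPECTED)         // verify recommitted par1 and new par2 data
    .end();
```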
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The PR fixes a real data-loss scenario by persisting the coordinator's eventBuffer to checkpoint state and recommitting any pending instant on coordinator restart. The core logic in restoreEvents() looks reasonable. However, the test-side changes appear to have compile errors that would prevent the PR from building, and there are a couple of edge cases worth considering. Please take a look at the inline comments; once those are addressed, this should be ready for a Hudi committer or PMC member to take it from here. A couple of minor nits as well: one idiomatic Java cleanup and one potential dataset duplication worth clarifying.
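For orientation, a rough reconstruction of the restore-and-recommit flow, pieced together from the inline snippets below — an approximation, not the PR's exact code (`getInstantTime` is an assumed accessor on the event type):

```java
// Approximate reconstruction: find the pending instant in the restored
// eventBuffer and recommit it when the coordinator starts.
private void restoreEvents() {
  String restoreInstant = Arrays.stream(this.eventBuffer)
      .filter(Objects::nonNull)
      .filter(e -> e.getWriteStatuses().size() > 0)
      .findFirst()
      .map(WriteMetadataEvent::getInstantTime)   // assumed accessor
      .orElse(null);
  if (restoreInstant == null || containsInstant(restoreInstant)) {
    return; // nothing pending, or the instant was already committed
  }
  if (writeClient.getConfig().getFailedWritesCleanPolicy().isLazy()) {
    writeClient.getHeartbeatClient().start(restoreInstant);
  }
  commitInstant(restoreInstant);
}
```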
```java
public void restartCoordinator() throws Exception {
  this.coordinator.close();
  this.coordinator = new StreamWriteOperatorCoordinator(conf, this.coordinatorContext);
```
🤖 The coordinator field is declared private final StreamWriteOperatorCoordinator coordinator; (line 75), so reassigning it here should fail to compile. Did you intend to drop the final modifier on the field declaration as part of this change?
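If so, the change would be roughly a one-liner (sketch; assumes no other code depends on the field being final):

```java
// drop `final` so restartCoordinator() can reassign the coordinator
private StreamWriteOperatorCoordinator coordinator;
```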
```java
    .assertNextEvent()
    // simulate failure: no checkpointComplete was called, data is NOT committed.
    // restart coordinator: restores par1 event from ckp-2 state and recommits it.
    .restartCoordinator()
```
🤖 .restartCoordinator() is invoked on the TestHarness returned by preparePipeline(conf), but TestHarness in TestWriteBase.java only defines coordinatorFails() — there's no restartCoordinator() wrapper. This test should fail to compile as is. Could you add a restartCoordinator() method to TestHarness that delegates to this.pipeline.restartCoordinator()?
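Something like the following (a sketch; assumes TestHarness's other fluent methods return `this` for chaining, as elsewhere in TestWriteBase):

```java
// Suggested wrapper: delegate to the pipeline and keep the fluent style.
public TestHarness restartCoordinator() throws Exception {
  this.pipeline.restartCoordinator();
  return this;
}
```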
```java
public void resetToCheckpoint(long checkpointID, byte[] checkpointData) {
  // no operation
  if (checkpointData != null && checkpointData.length > 0) {
    this.eventBuffer = SerializationUtils.deserialize(checkpointData);
```
🤖 If the coordinator's parallelism is rescaled between checkpoints (Flink supports this for operator coordinators), the restored eventBuffer will be sized to the OLD parallelism. After restore, this.parallelism reflects the NEW value, but the buffer length doesn't — could handleWriteMetaEvent then hit ArrayIndexOutOfBoundsException when event.getTaskID() >= eventBuffer.length, or silently drop slots if shrinking? Might be worth resizing the buffer here based on the current parallelism.
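A possible shape for the resize, assuming `this.parallelism` already reflects the post-restore value and the buffer's element type is WriteMetadataEvent as in upstream Hudi (untested sketch):

```java
// Restore into a buffer sized to the current parallelism; copy what fits.
WriteMetadataEvent[] restored = SerializationUtils.deserialize(checkpointData);
this.eventBuffer = new WriteMetadataEvent[this.parallelism];
System.arraycopy(restored, 0, this.eventBuffer, 0,
    Math.min(restored.length, this.eventBuffer.length));
```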
```java
    writeClient.getHeartbeatClient().start(restoreInstant);
  }
  commitInstant(restoreInstant);
}
```
🤖 When containsInstant(restoreInstant) returns true (instant was already committed before the crash), this path leaves the stale events in eventBuffer rather than calling reset(). As bootstrap events arrive after restart they will overwrite slots one by one, but until then any intervening checkpoint or commit attempt would re-serialize/re-process stale data. Would it be safer to reset() here in the already-committed branch?
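Roughly (a sketch; `containsInstant` and `reset()` are the names referenced above, and the surrounding control flow is assumed):

```java
if (containsInstant(restoreInstant)) {
  // already committed before the crash: drop the stale events so a later
  // checkpoint or commit attempt doesn't re-process them
  reset();
  return;
}
```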
```java
if (writeClient.getConfig().getFailedWritesCleanPolicy().isLazy()) {
  writeClient.getHeartbeatClient().start(restoreInstant);
}
commitInstant(restoreInstant);
```
🤖 restoreEvents() runs synchronously inside start() and may perform a full Hudi commit (writeClient.commit on potentially large WriteStatus lists). If this takes a long time, could it exceed Flink's coordinator start timeout? The pre-existing recommit on bootstrap path runs through the executor — would it be safer to dispatch this through executor.execute(...) as well, or is the synchronous behavior intentional here?
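For example, something along these lines, assuming the coordinator's NonThrownExecutor is available as `executor` (as on the bootstrap recommit path); the action-name string is illustrative:

```java
// Sketch: move the potentially slow recommit off the start() thread.
executor.execute(
    () -> commitInstant(restoreInstant),
    "recommit restored instant %s", restoreInstant);
```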
```java
}
String restoreInstant = Arrays.stream(this.eventBuffer)
    .filter(Objects::nonNull)
    .filter(e -> e.getWriteStatuses().size() > 0)
```
🤖 nit: `.size() > 0` could be flipped to `!e.getWriteStatuses().isEmpty()` — a bit more idiomatic Java and clearer about intent.
```java
    insertRow(StringData.fromString("id1"), StringData.fromString("Danny"), 23,
        TimestampData.fromEpochMillis(1), StringData.fromString("par1")));

public static List<RowData> DATA_SET_PART1 = Collections.singletonList(
```
🤖 nit: DATA_SET_PART1 looks identical to the existing DATA_SET_SINGLE_INSERT (id1/Danny/23/ts=1/par1). Could you either reuse DATA_SET_SINGLE_INSERT in the test, or add a brief comment here explaining why a separate constant is needed? As-is, a future reader will wonder if the duplication is intentional.