Skip to content

feat(flink): Implement continuous sorting feature for append write#18083

Merged
nsivabalan merged 9 commits into
apache:masterfrom
prashantwason:pw_oss_commit_porting_36
Apr 20, 2026
Merged

feat(flink): Implement continuous sorting feature for append write#18083
nsivabalan merged 9 commits into
apache:masterfrom
prashantwason:pw_oss_commit_porting_36

Conversation

@prashantwason
Copy link
Copy Markdown
Member

Describe the issue this Pull Request addresses

This PR implements a continuous sorting feature for the Flink append write sink. The continuous sorting mode maintains sorted order incrementally using a TreeMap, which avoids large pause times from batch sorting and reduces single-partition lag during ingestion by minimizing backpressure.

Summary and Changelog

Summary: Added a continuous sorting mode (AppendWriteFunctionWithContinuousSort) that provides non-blocking O(log n) inserts and incremental draining, offering predictable latency without sort spikes.

Changelog:

  • Added AppendWriteFunctionWithContinuousSort class that keeps records in a TreeMap keyed by a code-generated normalized key and an insertion sequence
  • When buffer reaches max capacity, oldest entries are drained and written immediately
  • Updated AppendWriteFunctions.create to instantiate the continuous sorter when WRITE_BUFFER_SORT_CONTINUOUS_ENABLED is true
  • Introduced new FlinkOptions:
    • write.buffer.sort.continuous.enabled - Whether to use continuous sorting instead of batch sorting
    • write.buffer.sort.continuous.drain.size - Number of records to drain each time max capacity is reached
  • Added ITTestAppendWriteFunctionWithContinuousSort integration tests covering buffer flush triggers, sorted output correctness, drain behaviors, and invalid-parameter error cases

Impact

New Configuration Options:

  • write.buffer.sort.continuous.enabled (default: false) - Enables continuous sorting mode
  • write.buffer.sort.continuous.drain.size (default: 1) - Controls drain batch size

No breaking changes to existing functionality. The feature is disabled by default.

Risk Level

low - This is a new optional feature that is disabled by default. Existing behavior is unchanged unless the user explicitly enables continuous sorting.

Documentation Update

The config description is included in the code. Documentation update for the Hudi website may be needed to describe the new continuous sorting feature and its configuration options.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

… sorted order incrementally and avoid single-partition lag during ingestion by reducing large pause time from sort and backpressure

Summary:
- Added AppendWriteFunctionWithContinuousSort which keeps records in a TreeMap keyed by a code-generated normalized key and an insertion sequence, drains oldest entries when a configurable threshold is reached, and writes drained records immediately; snapshot/endInput drain remaining records.
- Updated AppendWriteFunctions.create to instantiate the continuous sorter when WRITE_BUFFER_SORT_CONTINUOUS_ENABLED is true.
- Introduced three new FlinkOptions: WRITE_BUFFER_SORT_CONTINUOUS_ENABLED, WRITE_BUFFER_SORT_CONTINUOUS_DRAIN_THRESHOLD_PERCENT, and WRITE_BUFFER_SORT_CONTINUOUS_DRAIN_SIZE, and added runtime validation (buffer > 0, 0 < threshold < 100, drainSize > 0, parsed non-empty sort keys).
- Added ITTestAppendWriteFunctionWithContinuousSort integration tests covering buffer flush triggers, sorted output correctness (with and without continuous drain), drain threshold/size behaviors, and invalid-parameter error cases.

Verified that we can push higher CPU utilization and input rate with continuous sorting enabled (See before and after 11:30AM)
@github-actions github-actions Bot added the size:L PR with lines of changes in (300, 1000] label Feb 4, 2026
@prashantwason
Copy link
Copy Markdown
Member Author

@hudi-bot run azure

@cshuo
Copy link
Copy Markdown
Collaborator

cshuo commented Feb 6, 2026

@prashantwason thks for the contribution.

A high-level question before reviewing the details, I noticed this PR doesn't support async write like #13892, and actually the time cost of flushing records to file will be greater than sorting. Do you have any benchmark number about perf comparison of this PR and the previous version.

@prashantwason
Copy link
Copy Markdown
Member Author

@hudi-bot run azure

@prashantwason
Copy link
Copy Markdown
Member Author

Thanks @cshuo for the review!

Regarding async write support:
You're correct that this PR doesn't include async write like #13892. The goal of this PR is more focused - it specifically addresses the sorting bottleneck in append write by switching from batch sorting to continuous (incremental) sorting. This eliminates the large pause times that occur when sorting a full buffer.

The key benefits of continuous sorting:

  • Non-blocking O(log n) inserts vs O(n log n) batch sort at flush time
  • Predictable latency - no sort spikes when buffer fills up
  • Incremental draining - oldest entries are drained and written immediately when buffer reaches max capacity
  • Reduced backpressure - minimizes single-partition lag during ingestion

Regarding benchmarks:
I don't have formal benchmark numbers comparing to the existing batch sort approach. The primary motivation was to eliminate the unpredictable latency spikes from batch sorting rather than overall throughput improvement. That said, adding async write on top of continuous sorting would be a natural extension that could improve throughput further.

Would you like me to add some basic benchmarking to quantify the latency improvements? Or do you have specific concerns about the approach that I should address first?

@hudi-bot run azure

@cshuo
Copy link
Copy Markdown
Collaborator

cshuo commented Feb 24, 2026

@prashantwason thks for the detail explanation.

Would you like me to add some basic benchmarking to quantify the latency improvements? Or do you have specific concerns about the approach that I should address first?

Regarding the append write with buffer sorting, there are already two approaches, AppendWriteFunctionWithBIMBufferSort and AppendWriteFunctionWithDisruptorBufferSort in the repository. Both implement sorting and flushing in an asynchronous manner, aiming to mitigate spike issues during the flush process.

This PR adopts a continuous sorting approach, which can indeed alleviate sorting spikes. But the flush process involves not only sorting but also file-writing overhead and continuous sort will spread sorting overhead across each record writing process. Therefore, I’d like to understand how much performance gain this PR will bring compared to the existed functions.

@nsivabalan
Copy link
Copy Markdown
Contributor

hey @prashantwason : can you respond to @cshuo on this asks.

- Move config validation before super.open() to avoid IllegalStateException
  when Flink runtime context is not initialized (fixes 5 validation tests)
- Add RecordComparator for full comparison fallback when normalized keys
  are equal, ensuring correct multi-field sort order (fixes 2 sort tests)
- Store RowData reference in SortKey for RecordComparator access
@prashantwason
Copy link
Copy Markdown
Member Author

@cshuo Thanks for the thoughtful follow-up. You raise a valid point about the existing AppendWriteFunctionWithBIMBufferSort and AppendWriteFunctionWithDisruptorBufferSort.

Here's how continuous sorting differs from the existing async approaches:

Existing approaches (BIM/Disruptor):

  • Batch sort (O(n log n)) at flush time, but move the sort+write to a background thread
  • Require double-buffering (2x memory) or a ring buffer
  • Add threading complexity (synchronization, error propagation, buffer swaps)
  • Sorting still happens as a single O(n log n) burst, just on a different thread

Continuous sorting (this PR):

  • O(log n) per insert, no batch sort at all
  • Single buffer (no double-buffering overhead)
  • No threading complexity — simpler to reason about and debug
  • Incremental draining — when buffer fills, oldest sorted records are written immediately
  • Better for latency-sensitive workloads where predictable per-record cost is preferred over throughput

The key trade-off is:

  • BIM/Disruptor optimize throughput by overlapping sort+write with ingestion (async), at the cost of memory and complexity
  • Continuous sort optimizes latency predictability by eliminating sort spikes entirely (no batch sort), at the cost of higher per-record overhead (O(log n) vs O(1) insert)

These approaches are complementary — continuous sorting could potentially be combined with async write in the future. This PR adds it as an opt-in alternative via write.buffer.sort.continuous.enabled=true.

I don't have formal benchmark numbers yet. Would it be helpful if I ran a comparison benchmark against the BIM approach to quantify the latency distribution differences?

@prashantwason
Copy link
Copy Markdown
Member Author

Pushed a fix for the CI test failures:

Fixed 5 validation tests (testInvalidDrainSizeNegative, testInvalidDrainSizeZero, testInvalidBufferSizeZero, testInvalidBufferSizeNegative, testInvalidSortKeysOnlyWhitespace, testInvalidSortKeysOnlyCommas):

  • Moved config validation before super.open() which requires a Flink runtime context. Tests that directly instantiate the function without a test harness were getting IllegalStateException instead of IllegalArgumentException.

Fixed 2 sort order tests (testSortedResult, testSortedResultWithContinuousDrain):

  • Added RecordComparator for full comparison fallback when normalized keys are equal. Normalized keys are a prefix optimization and may not fully encode all sort fields, so a full RecordComparator is needed for correct multi-field sort order.

The Azure CI failure was infrastructure-related (Docker process exit code 1), not a test failure.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 10, 2026

Codecov Report

❌ Patch coverage is 6.87023% with 122 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.79%. Comparing base (ec04479) to head (1caf9fa).
⚠️ Report is 1124 commits behind head on master.

Files with missing lines Patch % Lines
.../append/AppendWriteFunctionWithContinuousSort.java 0.00% 117 Missing ⚠️
.../apache/hudi/sink/append/AppendWriteFunctions.java 28.57% 1 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18083      +/-   ##
============================================
+ Coverage     61.43%   68.79%   +7.35%     
- Complexity    23082    28221    +5139     
============================================
  Files          2108     2461     +353     
  Lines        127636   135388    +7752     
  Branches      14534    16410    +1876     
============================================
+ Hits          78409    93135   +14726     
+ Misses        42873    34875    -7998     
- Partials       6354     7378    +1024     
Flag Coverage Δ
common-and-other-modules 44.54% <6.87%> (?)
hadoop-mr-java-client 44.86% <ø> (?)
spark-client-hadoop-common 48.45% <ø> (?)
spark-java-tests 48.91% <ø> (?)
spark-scala-tests 45.51% <ø> (?)
utilities 38.23% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...va/org/apache/hudi/configuration/FlinkOptions.java 99.78% <100.00%> (ø)
...n/java/org/apache/hudi/sink/buffer/BufferType.java 100.00% <100.00%> (ø)
.../apache/hudi/sink/append/AppendWriteFunctions.java 67.74% <28.57%> (ø)
.../append/AppendWriteFunctionWithContinuousSort.java 0.00% <0.00%> (ø)

... and 4521 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Collaborator

@cshuo cshuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thks for the detail explanation, make sense to me.

Would it be helpful if I ran a comparison benchmark against the BIM approach to quantify the latency distribution differences?

yeah, that will be helpful, a concrete result would provide more valuable reference, e.g., how much can the spikes be reduced and how much does the average throughput decrease.

Left some comments for the implementation.

}

// Parse and validate sort keys
String sortKeys = config.get(FlinkOptions.WRITE_BUFFER_SORT_KEYS);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing append-sort code resolves missing write.buffer.sort.keys by falling back to the record key via AppendWriteFunctions.resolveSortKeys, can the function also follow the behavior?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Now using AppendWriteFunctions.resolveSortKeys(config) which falls back to record key when write.buffer.sort.keys is not set.

normalizedKeyComputer.putKey(data, reusableKeySegment, 0);

// Create sort key (copies the normalized key from reusable segment)
SortKey key = new SortKey(reusableKeySegment, normalizedKeySize, data, insertionSequence++);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The map stores raw RowData references and later compares them. With Flink object reuse enabled, those retained rows can be mutated after insertion, which breaks
TreeMap ordering guarantees and risks incorrect output.

Use getRuntimeContext().isObjectReuseEnabled(); to check whether object reuse is enabled, and copy rowData here if necessary.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added object reuse detection via getRuntimeContext().isObjectReuseEnabled() and RowDataSerializer.copy() in processElement() when object reuse is enabled.

// Sort key computation
private transient NormalizedKeyComputer normalizedKeyComputer;
private transient RecordComparator recordComparator;
private transient byte[] reusableKeyBuffer;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can be local variable.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Made reusableKeyBuffer a local variable in open().

}

// Clear buffer and reset sequence
sortedRecords.clear();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this line inside the upper if branch?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Moved sortedRecords.clear() and insertionSequence = 0L inside the if (!sortedRecords.isEmpty()) branch.

}

// Reset for next checkpoint interval
sortedRecords.clear();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this line inside the upper if branch?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Moved sortedRecords.clear() and insertionSequence = 0L inside the if (!sortedRecords.isEmpty()) branch.

/**
* Sink function to write data with continuous sorting for improved compression.
*
* <p>Unlike {@link AppendWriteFunctionWithBufferSort} which uses batch sorting,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Class name AppendWriteFunctionWithBufferSort is stale and has been renamed.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Fixed to reference AppendWriteFunctionWithBIMBufferSort.

.defaultValue(false) // default use batch sorting
.withDescription("Whether to use continuous sorting (TreeMap-based) instead of batch sorting. "
+ "Continuous sorting provides O(log n) inserts and incremental draining, "
+ "but has higher per-record overhead. Requires write.buffer.sort.enabled=true.");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write.buffer.sort.enabled is deprecated, and append writing uses WRITE_BUFFER_TYPE to select different write functions, can we also use WRITE_BUFFER_TYPE here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Removed WRITE_BUFFER_SORT_CONTINUOUS_ENABLED config entirely. Now uses WRITE_BUFFER_TYPE=CONTINUOUS_SORT via the BufferType enum. Added CONTINUOUS_SORT to BufferType and updated AppendWriteFunctions.create() to route accordingly. Updated tests as well.

…ject reuse, cleanup

- Replace WRITE_BUFFER_SORT_CONTINUOUS_ENABLED with WRITE_BUFFER_TYPE=CONTINUOUS_SORT
- Add CONTINUOUS_SORT to BufferType enum
- Use AppendWriteFunctions.resolveSortKeys() for sort key resolution with record key fallback
- Copy RowData when Flink object reuse is enabled to prevent mutation
- Move sortedRecords.clear()/insertionSequence reset inside if branches
- Fix stale class name reference in javadoc
- Update test to use new WRITE_BUFFER_TYPE config

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@prashantwason
Copy link
Copy Markdown
Member Author

@cshuo All review comments have been addressed:

  1. Sort key resolution: Now uses AppendWriteFunctions.resolveSortKeys() which falls back to record key when write.buffer.sort.keys is not set
  2. Object reuse safety: Added RowDataSerializer.copy() when Flink object reuse is enabled to prevent TreeMap ordering corruption
  3. Local variable: Made reusableKeyBuffer a local variable in open()
  4. endInput cleanup: Moved sortedRecords.clear() and insertionSequence = 0L inside the if branch
  5. snapshotState cleanup: Same as above
  6. Stale class name: Fixed javadoc reference to AppendWriteFunctionWithBIMBufferSort
  7. Config migration: Removed WRITE_BUFFER_SORT_CONTINUOUS_ENABLED, now uses WRITE_BUFFER_TYPE=CONTINUOUS_SORT via BufferType enum. Updated AppendWriteFunctions.create() routing and tests.

Please take another look when you get a chance. Thanks!

prashantwason and others added 2 commits March 19, 2026 16:48
…tions

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Collaborator

@cshuo cshuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thks for updating.

} catch (IOException e) {
throw new HoodieIOException("Failed to drain buffer during snapshot", e);
} finally {
super.snapshotState();
Copy link
Copy Markdown
Collaborator

@cshuo cshuo Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move super.snapshotState(); out of the finally clause like other append write functions, since write meta event should not be sent to coordinator when there is ingestion exception.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

this.reusableKeySegment = MemorySegmentFactory.wrap(reusableKeyBuffer);

// Detect object reuse mode and create serializer for copying if needed
this.objectReuseEnabled = getRuntimeContext().isObjectReuseEnabled();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is compiling error in CI, seems we need to adapt isObjectReuseEnabled() for lower flink versions, like flink 1.17, 1.18.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides, can you also add a test with object reuse enabled?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Using RuntimeContextUtils.isObjectReuseEnabled(getRuntimeContext()) which adapts across Flink versions. For Flink 1.17-1.20, it delegates to getExecutionConfig().isObjectReuseEnabled(), and for Flink 2.0+/2.1+, it calls runtimeContext.isObjectReuseEnabled() directly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added testObjectReuseEnabled() which uses a single reusable GenericRowData instance and verifies all 3 records are distinct after write.

// Initialize TreeMap with comparator that uses normalized keys for fast comparison
// and falls back to RecordComparator for full comparison when normalized keys are equal
this.sortedRecords = new TreeMap<>((k1, k2) -> {
MemorySegment seg1 = MemorySegmentFactory.wrap(k1.keyBytes);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TreeMap comparator wraps keyBytes into new MemorySegment objects for every comparison. Since comparisons happen frequently on insert/drain, this creates avoidable allocation pressure in the hot path, which increases GC preasure.

We can reuse/precompute comparable representations per SortKey, i.e., store wrapped segments in SortKey.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. SortKey now copies the normalized key bytes and wraps them as a MemorySegment once in its constructor, so the TreeMap comparator uses pre-computed k1.keySegment/k2.keySegment without any per-comparison allocation.


} catch (IOException e) {
throw new HoodieIOException("Failed to drain buffer during endInput", e);
} finally {
Copy link
Copy Markdown
Collaborator

@cshuo cshuo Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move super.endInput(); out of the finally clause like other append write functions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. super.endInput() is already outside the try/catch block, consistent with the other append write functions.

…e test

- Fix isObjectReuseEnabled() compilation on Flink 1.17/1.18 by using
  getExecutionConfig().isObjectReuseEnabled() instead of RuntimeContext method
- Remove unused imports (java.util.Arrays, java.util.stream.Collectors) to fix
  checkstyle violations
- Move super.snapshotState() and super.endInput() out of finally clauses to
  match other append write functions (avoid sending write meta event on error)
- Optimize TreeMap comparator: store pre-wrapped MemorySegment in SortKey to
  avoid per-comparison allocation and reduce GC pressure
- Filter empty strings after trim in resolveSortKeys() to properly validate
  whitespace-only and comma-only sort key configs
- Add testObjectReuseEnabled test that verifies records are correctly copied
  when Flink object reuse is enabled (prevents TreeMap corruption)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@prashantwason
Copy link
Copy Markdown
Member Author

@cshuo All review comments from the latest round have been addressed in commit e592166:

  1. super.snapshotState() / super.endInput() out of finally: Moved both out of finally clauses to match other append write functions — write meta event is no longer sent to coordinator when there is an ingestion exception.
  2. isObjectReuseEnabled() compilation on Flink 1.17/1.18: Changed to getRuntimeContext().getExecutionConfig().isObjectReuseEnabled() for compatibility with lower Flink versions.
  3. MemorySegment allocation in TreeMap comparator: Pre-wrapped MemorySegment is now stored in SortKey.keySegment to avoid per-comparison allocation and reduce GC pressure.
  4. Test with object reuse enabled: Added testObjectReuseEnabled test that verifies records are correctly copied when Flink object reuse is enabled (prevents TreeMap corruption from mutated references).

Please take another look when you get a chance. Thanks!

this.reusableKeySegment = MemorySegmentFactory.wrap(reusableKeyBuffer);

// Detect object reuse mode and create serializer for copying if needed
this.objectReuseEnabled = getRuntimeContext().getExecutionConfig().isObjectReuseEnabled();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is some compile error for higher flink versions.

Caused by: org.apache.maven.plugin.compiler.CompilationFailureException: Compilation failure
/home/runner/work/hudi/hudi/hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/append/AppendWriteFunctionWithContinuousSort.java:[158,50] cannot find symbol
  symbol:   method getExecutionConfig()

You can refer to RuntimeContextUtils to adapt object reuse check for different flink versions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Using RuntimeContextUtils.isObjectReuseEnabled() which delegates to getExecutionConfig().isObjectReuseEnabled() for Flink 1.17-1.20 and runtimeContext.isObjectReuseEnabled() directly for Flink 2.0+/2.1+.

…useEnabled

getExecutionConfig() was removed from RuntimeContext in Flink 2.0+. Add
isObjectReuseEnabled() to version-specific RuntimeContextUtils adapters:
- Flink 1.17-1.20: uses getExecutionConfig().isObjectReuseEnabled()
- Flink 2.0-2.1: uses getGlobalJobParameters() with PipelineOptions.OBJECT_REUSE

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AUTO_WATERMARK_INTERVAL.defaultValue().toMillis() + ""));
}

public static boolean isObjectReuseEnabled(RuntimeContext runtimeContext) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use runtimeContext.isObjectReuseEnabled() directly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Using runtimeContext.isObjectReuseEnabled() directly for Flink 2.0.x.

}

public static boolean isObjectReuseEnabled(RuntimeContext runtimeContext) {
Map<String, String> jobParameters = runtimeContext.getGlobalJobParameters();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use runtimeContext.isObjectReuseEnabled() directly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Using runtimeContext.isObjectReuseEnabled() directly for Flink 2.1.x.

@nsivabalan
Copy link
Copy Markdown
Contributor

hey @cshuo: can you re-review the patch.

@cshuo
Copy link
Copy Markdown
Collaborator

cshuo commented Apr 2, 2026

hey @cshuo: can you re-review the patch.

@nsivabalan Except for the last two comments, which are not resolved yet @prashantwason , the PR LGTM overall.

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — This PR introduces a new continuous sorting buffer type for Flink append writes, which is a great feature. The implementation looks solid, with good test coverage for various scenarios, including object reuse and error handling. Overall, the code is clean, well-structured, and adheres to Hudi's coding standards. I have a few minor nits below.

* <p>Strategy:
* <ol>
* <li>Records are inserted in sorted order (TreeMap)</li>
* <li>When buffer reaches max capacity, oldest record(s) are drained synchronously</li>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: Could maxCapacity and drainSize be made final as they are configured once and not changed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already final. maxCapacity is private final long and drainSize is private final int, set once in the constructor.

private transient RecordComparator recordComparator;
private transient MemorySegment reusableKeySegment;
private transient int normalizedKeySize;
private transient boolean objectReuseEnabled;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: It might be clearer to initialize insertionSequence to 0L when sortedRecords is cleared in snapshotState() and endInput() to consistently reflect a fresh state, similar to how it's done in the constructor. Or, is it intended to continue the sequence?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already done. insertionSequence is reset to 0L alongside sortedRecords.clear() in both snapshotState() and endInput().

String.format("Buffer capacity must be positive, got: %d", maxCapacity));
}

if (drainSize <= 0) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: The check if (cmp != 0) is a common pattern, but it's redundant. recordComparator.compare and Long.compare also return 0 if equal, so you could directly return their results if cmp == 0 without an explicit check.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if (cmp != 0) guards are intentional here - this is a 3-level comparator chain (normalized key -> full record -> insertion order). The early returns avoid the more expensive recordComparator.compare() when the cheaper normalizedKeyComputer.compareKey() already distinguishes the records. Keeping the pattern explicit for readability.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 97: 🤖 nit: The s -> !s.isEmpty() filter is good. Could we also add a .trim() to this stream pipeline before filtering, to handle cases like "key1, ,key2" that would result in an empty string after trimming?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already handled. The stream pipeline does .map(String::trim).filter(s -> !s.isEmpty()), so "key1, ,key2" splits to ["key1", " ", "key2"], trims to ["key1", "", "key2"], then filters out the empty string to get ["key1", "key2"].

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

LGTM! This is a solid addition for incremental sorting that helps mitigate the latency spikes of batch sorting. I've left a couple of architectural notes for you to consider around memory footprint tracking and minimizing short-lived allocations during draining.

drainRecords(drainSize);

// Verify there's space after draining
if (sortedRecords.size() >= maxCapacity) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Unlike BIMBufferSort, this sorting mode does not bound its memory footprint using write.task.max.size.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added TotalSizeTracer to bound the buffer memory footprint using write.task.max.size. Now processElement() triggers draining when either the record count limit (write.buffer.size) or memory size limit is exceeded. Record size is estimated once via ObjectSizeCalculator and tracked via TotalSizeTracer.trace()/countDown()/reset().

totalDrainOperations++;
totalDrainedRecords += drained;
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Allocating an Iterator for every drained record can create unnecessary garbage when drainSize is small.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced the Iterator-based loop with TreeMap.pollFirstEntry() which removes and returns the first entry directly, avoiding Iterator object allocation entirely. This matters especially when drainSize=1 (the default), where each incoming record at capacity would have created a new Iterator.

prashantwason and others added 2 commits April 6, 2026 14:57
…rectly for Flink 2.0+/2.1+

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add TotalSizeTracer to bound buffer memory footprint using
  write.task.max.size, triggering drain when memory limit is exceeded
  in addition to record count limit
- Replace Iterator-based drain loop with TreeMap.pollFirstEntry()
  to avoid iterator allocation per drain call (matters when drainSize=1)
- Estimate record size once via ObjectSizeCalculator and track
  cumulative memory via TotalSizeTracer.trace()/countDown()/reset()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for addressing the feedback — both my prior findings are resolved nicely. The memory footprint is now bounded via TotalSizeTracer with write.task.max.size, and the drain loop uses pollFirstEntry() to avoid iterator allocation. The RuntimeContextUtils simplification for Flink 2.0.x/2.1.x to use runtimeContext.isObjectReuseEnabled() directly is clean. All other reviewer comments appear addressed as well. LGTM.

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

CodeRabbit Walkthrough: This PR introduces continuous sorting functionality for Hudi Flink datasource. A new TreeMap-based buffer type (CONTINUOUS_SORT) enables in-memory sorted write operations with configurable incremental draining. The implementation includes new configuration options, a complete write function with lifecycle management, comprehensive integration tests, and utility enhancements for object reuse support across Flink versions.

Sequence Diagram (CodeRabbit):

sequenceDiagram
    participant Client as Client
    participant Processor as AppendWriteFunctionWithContinuousSort
    participant TreeMap as TreeMap Buffer
    participant Writer as Underlying Writer
    participant State as State Backend

    Client->>Processor: processElement(RowData)
    Note over Processor: Compute normalized key
    Processor->>TreeMap: insert(SortKey, record)
    Note over Processor: Check buffer capacity/memory
    alt Buffer exceeds limit
        Processor->>TreeMap: pollFirstEntry() x drain_size
        TreeMap-->>Processor: oldest records
        Processor->>Writer: flush records
    end
    
    Client->>Processor: snapshotState() / endInput()
    Processor->>TreeMap: pollFirstEntry() until empty
    TreeMap-->>Processor: all buffered records
    Processor->>Writer: flush remaining
    Processor->>State: save checkpoint state
    
    Client->>Processor: close()
    Processor->>Writer: delegate to super.close()
Loading

CodeRabbit: yihua#46 (review)

private RowType rowType;

@TempDir
protected File tempFile;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Shadowed @TempDir tempFile field.

This class declares @TempDir protected File tempFile (line 61-62), but TestWriteBase already declares @TempDir protected File tempFile (inherited). This shadowing may cause confusion and could lead to unexpected behavior if both are initialized differently by JUnit.

Consider removing the local declaration since the inherited field should suffice.

🔧 Proposed fix
 public class ITTestAppendWriteFunctionWithContinuousSort extends TestWriteBase {
   private Configuration conf;
   private RowType rowType;
-
-  `@TempDir`
-  protected File tempFile;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/append/ITTestAppendWriteFunctionWithContinuousSort.java`
around lines 61 - 62, The class ITTestAppendWriteFunctionWithContinuousSort
currently redeclares the JUnit TempDir field as "protected File tempFile",
shadowing the same "@TempDir protected File tempFile" declared in the superclass
TestWriteBase; remove the local declaration of tempFile from
ITTestAppendWriteFunctionWithContinuousSort so the test uses the inherited
TestWriteBase.tempFile, and update any local references in that class to rely on
the inherited field name (no other changes required).

CodeRabbit (original) (source:comment#3095501107)

@cshuo
Copy link
Copy Markdown
Collaborator

cshuo commented Apr 17, 2026

+1
cc @nsivabalan @danny0405

@nsivabalan nsivabalan merged commit 3a387da into apache:master Apr 20, 2026
55 of 56 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants