Skip to content

fix: Follow-ups to JsonKinesisSource: numeric sequence comparison and call-site fixes#18689

Merged
yihua merged 2 commits into
apache:masterfrom
linliu-code:kinesis-fixes-from-internal
May 15, 2026
Merged

fix: Follow-ups to JsonKinesisSource: numeric sequence comparison and call-site fixes#18689
yihua merged 2 commits into
apache:masterfrom
linliu-code:kinesis-fixes-from-internal

Conversation

@linliu-code
Copy link
Copy Markdown
Collaborator

@linliu-code linliu-code commented May 4, 2026

Describe the issue this Pull Request addresses

Follow-up to #18224 (JsonKinesisSource). Small correctness fixes and a naming consistency tweak that surfaced after the initial merge.

Summary and Changelog

  1. Numeric sequence-number comparison. Kinesis sequence numbers are 128-bit integers represented as decimal strings whose lengths can vary, so String.compareTo is not correct for ordering them. Introduce KinesisOffsetGen.CheckpointUtils.compareSequenceNumbers(a, b) using BigInteger, and use it in:

    • KinesisShardRange.hasUnreadRecords (closed-shard fully-consumed check)
    • KinesisOffsetGen.checkDataLossOnExpiredShards (also adds an empty-lastSeq guard for |endSeq checkpoints — a closed shard that has not yet had any record consumed is treated as not consumed)
    • KinesisOffsetGen.checkDataLossOnAvailableShards (trim-horizon check)
  2. Null-safe millisBehindLatest. GetRecordsResponse#millisBehindLatest() is documented as nullable in the AWS SDK; the previous response.millisBehindLatest() == 0 would NPE on auto-unbox if a server returned null. Extract to a Long and null-check before comparing.

  3. Integer-overflow guard on per-call limit. (int) (maxTotalRecords - totalConsumed) could silently truncate if the remaining budget exceeded Integer.MAX_VALUE. Switch to Math.toIntExact(...) so it fails loudly instead.

  4. KinesisReadConfig getter naming consistency.

    • Rename the field enableDeaggregationdeaggregationEnabled so Lombok's @Getter produces the desired isDeaggregationEnabled() (adjective form), letting us drop the manual override.
    • Keep an explicit shouldAddMetaFields() method (with @Getter(AccessLevel.NONE) on the field) since Lombok would otherwise generate isShouldAddMetaFields(), which doesn't read well.
    • Update the two call sites in JsonKinesisSource (isEnableDeaggregation()isDeaggregationEnabled(), isShouldAddMetaFields()shouldAddMetaFields()).

Impact

No user-visible behavior change in the common case. The numeric-comparison fix only matters when comparing two Kinesis sequence numbers of different decimal lengths; Math.toIntExact and the null guard are defensive and only change behavior on edge-case inputs that previously truncated or NPE'd. The KinesisReadConfig change is internal — the public getters remain isDeaggregationEnabled() / shouldAddMetaFields().

Risk Level

Low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable — existing Kinesis tests (TestKinesisCheckpointUtils, TestJsonKinesisSource, TestShardRecordIterator, TestKinesisShardRange, TestKinesisSourceFiltering, TestKinesisDataLossChecks) all pass: 80 tests, 0 failures, 0 errors, 2 pre-existing skips.

@github-actions github-actions Bot added the size:S PR with lines of changes in (10, 100] label May 4, 2026
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the follow-up! These are targeted correctness fixes on the JsonKinesisSource path: numeric sequence-number comparison via BigInteger, null-safe handling of millisBehindLatest, a Math.toIntExact guard for the per-call limit, and explicit getter names on KinesisReadConfig. I traced each call site of compareSequenceNumbers and confirmed the empty/null lastSeq cases are guarded everywhere it's invoked. No correctness issues found. A few style/readability suggestions in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here. One small naming inconsistency between the backing field and its newly-added getter in KinesisReadConfig; the rest of the changes are clean.

cc @yihua

}

public boolean isDeaggregationEnabled() {
return enableDeaggregation;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: the getter is isDeaggregationEnabled() (adjective form) but the backing field is enableDeaggregation (verb form) — could you rename the field to deaggregationEnabled to keep them consistent?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

@linliu-code linliu-code force-pushed the kinesis-fixes-from-internal branch from a90b3db to 45e1219 Compare May 5, 2026 13:36
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The follow-up correctness fixes (BigInteger sequence-number ordering, null-safe millisBehindLatest, and explicit accessor renames) look good and the empty-lastSeq guard in checkDataLossOnExpiredShards is a sensible defensive addition. One question worth checking on the Math.toIntExact change in the inline comment. Please take a look at any inline comments, and this should be ready for a Hudi committer or PMC member to take it from here. One small redundancy to consider in KinesisReadConfig — otherwise the changes look clean.

- KinesisOffsetGen: add CheckpointUtils.compareSequenceNumbers using
  BigInteger and use it in hasUnreadRecords, checkDataLossOnExpiredShards,
  and checkDataLossOnAvailableShards. Kinesis sequence numbers are 128-bit
  decimal strings whose lengths can vary, so lexicographic String.compareTo
  is not correct for numeric ordering. Also treat empty lastSeq (e.g. a
  "|endSeq" checkpoint) as not-consumed in checkDataLossOnExpiredShards.
- KinesisSource: null-guard response.millisBehindLatest() before unboxing
  (the AWS SDK documents the field as nullable), and use Math.toIntExact
  on the per-call limit so we fail loudly instead of silently truncating
  if the remaining record budget exceeds Integer.MAX_VALUE.
- KinesisReadConfig: replace Lombok-generated isShouldAddMetaFields and
  isEnableDeaggregation with explicit shouldAddMetaFields and
  isDeaggregationEnabled, and update call sites in JsonKinesisSource.
@linliu-code linliu-code force-pushed the kinesis-fixes-from-internal branch from 45e1219 to 33b430d Compare May 5, 2026 13:49
Copy link
Copy Markdown
Contributor

@hudi-agent hudi-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the follow-up! Numeric BigInteger comparison for sequence numbers, the null-safe millisBehindLatest check, and the KinesisReadConfig getter naming cleanup all look reasonable, and I traced the new compareSequenceNumbers call sites — each has prior guards against null/empty inputs. No new issues flagged from this automated pass beyond what earlier rounds already raised — a Hudi committer or PMC member can take it from here for a final review. One small naming/consistency suggestion in KinesisReadConfig; the rest of the changes are clean.

cc @yihua

private final String accessKey; // null if not set
private final String secretKey; // null if not set
private final KinesisSourceConfig.KinesisStartingPositionStrategy startingPosition;
@Getter(AccessLevel.NONE)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 nit: could you rename the field to metaFieldsEnabled (mirroring deaggregationEnabled) and let Lombok generate isMetaFieldsEnabled() automatically? That would let you drop the @Getter(AccessLevel.NONE) suppression and the hand-written getter below, keeping the boolean accessor style consistent across the class.

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@yihua yihua added this to the release-1.2.0 milestone May 15, 2026
@yihua yihua changed the title [MINOR] Follow-ups to JsonKinesisSource: numeric sequence comparison and call-site fixes fix: Follow-ups to JsonKinesisSource: numeric sequence comparison and call-site fixes May 15, 2026
Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua merged commit 6e32f36 into apache:master May 15, 2026
19 of 62 checks passed
@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 54.00%. Comparing base (4d0e9cd) to head (3a1e9be).
⚠️ Report is 36 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (4d0e9cd) and HEAD (3a1e9be). Click for more details.

HEAD has 32 uploads less than BASE
Flag BASE (4d0e9cd) HEAD (3a1e9be)
spark-scala-tests 12 0
spark-java-tests 18 0
common-and-other-modules 1 0
utilities 1 0
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18689       +/-   ##
=============================================
- Coverage     68.08%   54.00%   -14.08%     
+ Complexity    28940    12456    -16484     
=============================================
  Files          2519     1434     -1085     
  Lines        140646    72161    -68485     
  Branches      17427     8245     -9182     
=============================================
- Hits          95757    38973    -56784     
+ Misses        37030    29690     -7340     
+ Partials       7859     3498     -4361     
Flag Coverage Δ
common-and-other-modules ?
hadoop-mr-java-client 44.99% <ø> (+0.03%) ⬆️
spark-client-hadoop-common 48.32% <ø> (-0.12%) ⬇️
spark-java-tests ?
spark-scala-tests ?
utilities ?

Flags with carried forward coverage won't be shown. Click here to find out more.
see 1872 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants