[ENG-41301] Fix V9 batcher dropping all batches when savepoint shares instant time with deltacommit#187

Open
tiennguyen-onehouse wants to merge 1 commit into main from fix-batcher-savepoint-interleave

Conversation

@tiennguyen-onehouse
Contributor

Summary

  • For V9/V2 tables with daily-savepoint workloads (e.g. Concentric perf), ActiveTimelineInstantBatcher.createBatches was returning zero batches and silently halting upload of the active timeline.
  • Pre-extracting complete savepoint pairs into their own 2-file batches lets the deltacommit/commit triplets at the same instant time group cleanly.

Root cause

A savepoint pinned to instant time T writes two files: <T>.savepoint.inflight and <T>_<completionTs>.savepoint. After the lexicographic sort these interleave with the three-file group of the deltacommit at the same T, leaving <T>.savepoint.inflight at position 3. The triplet check areRelatedInstants(inflight, requested, completed) then fails because the states are {inflight, requested, inflight}, and with the default upload strategy this sets shouldStopIteration=true at the very first group, producing zero batches. The failure is silent: it is logged only at INFO level (Could not create batches with completed commits…).
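
The interleaving can be reproduced outside the batcher. The following is a minimal sketch, not the actual ActiveTimelineInstantBatcher code: the class name, the helpers stateOf, lexSort and firstGroupStates, and the instant and completion timestamps are all hypothetical, and it assumes the sort is a plain String comparison of file names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: not the real ActiveTimelineInstantBatcher code.
class SavepointInterleaveSketch {

    // Hypothetical helper: infer the instant state from the file suffix.
    static String stateOf(String fileName) {
        if (fileName.endsWith(".inflight")) {
            return "inflight";
        }
        if (fileName.endsWith(".requested")) {
            return "requested";
        }
        return "completed";
    }

    // Sort file names in natural String order ('.' is 0x2E, '_' is 0x5F,
    // so "<T>.xxx" files sort before "<T>_<completion>.xxx" files).
    static List<String> lexSort(List<String> fileNames) {
        List<String> sorted = new ArrayList<>(fileNames);
        Collections.sort(sorted);
        return sorted;
    }

    // States of the first three files, i.e. what a triplet check that
    // takes the first three entries as one group would see.
    static List<String> firstGroupStates(List<String> sorted) {
        List<String> states = new ArrayList<>();
        for (String name : sorted.subList(0, Math.min(3, sorted.size()))) {
            states.add(stateOf(name));
        }
        return states;
    }

    public static void main(String[] args) {
        // Made-up instant and completion times, chosen only for illustration.
        List<String> files = Arrays.asList(
            "20250430000000000_20250430000009000.savepoint",
            "20250430000000000.savepoint.inflight",
            "20250430000000000_20250430000005000.deltacommit",
            "20250430000000000.deltacommit.requested",
            "20250430000000000.deltacommit.inflight");
        List<String> sorted = lexSort(files);
        sorted.forEach(System.out::println);
        System.out.println(firstGroupStates(sorted));
    }
}
```

With these inputs the third sorted entry is the savepoint inflight file, so the first group's states come out as inflight, requested, inflight rather than inflight, requested, completed.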

Confirmed in staging by cloning Concentric's lhperftest/.../permissions_bronze/.hoodie/ (which has 5 daily savepoint groups + 1 compaction-with-savepoint) and observing the looping INFO log every poll cycle with no progress.

Test plan

  • Existing ActiveTimelineInstantBatcherTest cases still pass (V1 layout / no savepoints).
  • Existing V9 tests still pass.
  • Add a new V9-with-savepoint test case modelled on permissions_bronze's timeline.
  • Re-run the staging clone of permissions_bronze and observe Uploading batch … logs progressing past April 30.

Related

  • ENG-41301 — Concentric Lakeview v9 stats missing / 500
  • gateway-controller PR #9103 (read-side fixes for the lifecycle sentinel rows that surface from the same incident)

🤖 Generated with Claude Code

… instant time with deltacommit

In Hudi V2 / table-version-9 layout, a savepoint pinned to the instant time T
of a deltacommit leaves five files at the same T:

  <T>.deltacommit.inflight
  <T>.deltacommit.requested
  <T>.savepoint.inflight
  <T>_<completionA>.deltacommit
  <T>_<completionB>.savepoint

After lex-sort, <T>.savepoint.inflight lands at position 3 of the same-T
deltacommit's group, breaking the (inflight, requested, completed) triplet
check in createBatches: the third file is a savepoint inflight, so states
become {inflight, requested, inflight} instead of the expected superset of
{inflight, requested, completed}. With the default upload strategy this
sets shouldStopIteration=true at the very first group, returning zero
batches and silently halting upload of the entire active timeline. The
extractor logs only the INFO "Could not create batches with completed
commits" each cycle, so the table never advances.

Fix: pre-extract complete savepoint pairs (one inflight + one completed at
the same T) into their own 2-file batches before the triplet loop runs,
restoring the deltacommit/commit triplet to a clean three-file group at
that T. Partial savepoints (single file) stay in the main flow so the
existing single-savepoint handling can still process them. Savepoint
batches are appended after commit batches; savepoints carry no write
metadata so this ordering is benign.
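
A minimal sketch of the pre-extraction step described above, working on
file names only. This is not the code in this commit: the class name and
the helpers isSavepoint, instantTime, isCompletePair and
extractCompletePairs are hypothetical, and the real batcher works on its
own instant types rather than raw strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: hypothetical names, not the batcher's real API.
class SavepointPairExtractionSketch {

    static boolean isSavepoint(String fileName) {
        return fileName.endsWith(".savepoint") || fileName.endsWith(".savepoint.inflight");
    }

    // Instant time T is the prefix before the first '.' or '_'.
    static String instantTime(String fileName) {
        for (int i = 0; i < fileName.length(); i++) {
            char c = fileName.charAt(i);
            if (c == '.' || c == '_') {
                return fileName.substring(0, i);
            }
        }
        return fileName;
    }

    // A complete pair is exactly one inflight plus one completed savepoint file.
    static boolean isCompletePair(List<String> group) {
        if (group.size() != 2) {
            return false;
        }
        boolean hasInflight = false;
        boolean hasCompleted = false;
        for (String name : group) {
            if (name.endsWith(".inflight")) {
                hasInflight = true;
            } else {
                hasCompleted = true;
            }
        }
        return hasInflight && hasCompleted;
    }

    // Returns the 2-file savepoint batches; every other file (including
    // partial savepoints) is appended to 'remaining' in its original order.
    static List<List<String>> extractCompletePairs(List<String> sorted, List<String> remaining) {
        Map<String, List<String>> savepointsByTime = new LinkedHashMap<>();
        for (String name : sorted) {
            if (isSavepoint(name)) {
                savepointsByTime.computeIfAbsent(instantTime(name), k -> new ArrayList<>()).add(name);
            }
        }
        List<List<String>> pairs = new ArrayList<>();
        Set<String> extracted = new HashSet<>();
        for (List<String> group : savepointsByTime.values()) {
            if (isCompletePair(group)) {
                pairs.add(group);
                extracted.addAll(group);
            }
        }
        for (String name : sorted) {
            if (!extracted.contains(name)) {
                remaining.add(name);
            }
        }
        return pairs;
    }
}
```

Run on the five files listed earlier, this leaves the three deltacommit
files in the remaining list, where they again form a clean triplet, and
returns one savepoint batch that a caller would append after the commit
batches.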

Reproduction: any V9 MoR table that runs daily savepoints (e.g. the
Concentric perf-test workload). After this fix LakeView resumes uploading
post-savepoint instants — including any compaction the customer triggers
once log files accumulate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@nimahajan

@tiennguyen-onehouse left a comment

/push-image hudi1.1-May5