[ENG-41301] Fix V9 batcher dropping all batches when savepoint shares instant time with deltacommit #187
Open
tiennguyen-onehouse wants to merge 1 commit into
In the Hudi V2 / table-version-9 layout, a savepoint pinned to the instant
time T of a deltacommit leaves five files at the same T:
<T>.deltacommit.inflight
<T>.deltacommit.requested
<T>.savepoint.inflight
<T>_<completionA>.deltacommit
<T>_<completionB>.savepoint
After lexicographic sort ('.' orders before '_'), <T>.savepoint.inflight
lands at position 3 of the same-T deltacommit's group, breaking the
(inflight, requested, completed) triplet check in createBatches: the third
file is a savepoint inflight, so the states become
{inflight, requested, inflight} instead of the expected superset of
{inflight, requested, completed}. With the default upload strategy this
sets shouldStopIteration=true at the very first group, so createBatches
returns zero batches and upload of the entire active timeline silently
halts. The extractor logs only the INFO message "Could not create batches
with completed commits" each cycle, so the table never advances.
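
The ordering described above can be reproduced with a minimal Java sketch. The instant and completion timestamps are made-up values, and SavepointSortDemo with its stateOf helper is a hypothetical illustration, not LakeView's createBatches code; classifying state by file-name suffix is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: names here are hypothetical, not taken from LakeView.
final class SavepointSortDemo {

    // The five timeline files at one instant T (timestamps are made up).
    static List<String> sortedFilesAtT() {
        List<String> files = new ArrayList<>(Arrays.asList(
            "20250430000000000_20250430000005000.savepoint",
            "20250430000000000.savepoint.inflight",
            "20250430000000000_20250430000001000.deltacommit",
            "20250430000000000.deltacommit.requested",
            "20250430000000000.deltacommit.inflight"));
        // Natural String order: '.' (0x2E) sorts before '_' (0x5F), so every
        // "<T>.<action>..." file precedes every "<T>_<completion>..." file.
        Collections.sort(files);
        return files;
    }

    // Simplified state classification by file-name suffix (an assumption).
    static String stateOf(String fileName) {
        if (fileName.endsWith(".inflight")) {
            return "inflight";
        }
        if (fileName.endsWith(".requested")) {
            return "requested";
        }
        return "completed";
    }

    public static void main(String[] args) {
        List<String> sorted = sortedFilesAtT();
        for (int i = 0; i < sorted.size(); i++) {
            String name = sorted.get(i);
            System.out.println((i + 1) + ": " + name + " -> " + stateOf(name));
        }
    }
}
```

In this sketch the first three sorted entries carry the states inflight, requested, inflight, which is the broken triplet; the completed deltacommit only appears in fourth position.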
Fix: pre-extract complete savepoint pairs (one inflight + one completed at
the same T) into their own 2-file batches before the triplet loop runs,
restoring the deltacommit/commit triplet to a clean three-file group at
that T. Partial savepoints (single file) stay in the main flow so the
existing single-savepoint handling can still process them. Savepoint
batches are appended after commit batches; savepoints carry no write
metadata so this ordering is benign.
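
A rough sketch of the pre-extraction step, under stated assumptions: files are plain name strings, the instant time is the prefix before the first '.' or '_', and SavepointPairExtractor, Result, and extract are hypothetical names rather than the actual change in this PR.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: a simplified stand-in for the pre-extraction described above.
final class SavepointPairExtractor {

    static final class Result {
        // Files left for the main triplet loop (includes partial savepoints).
        final List<String> remaining = new ArrayList<>();
        // One 2-file batch per complete savepoint pair.
        final List<List<String>> savepointBatches = new ArrayList<>();
    }

    // Assumption: the instant time is everything before the first '.' or '_'.
    static String instantTime(String fileName) {
        int end = fileName.length();
        for (int i = 0; i < fileName.length(); i++) {
            char c = fileName.charAt(i);
            if (c == '.' || c == '_') {
                end = i;
                break;
            }
        }
        return fileName.substring(0, end);
    }

    static boolean isSavepointFile(String f) {
        return f.endsWith(".savepoint.inflight") || f.endsWith(".savepoint");
    }

    static Result extract(List<String> files) {
        // Group savepoint files by instant time, preserving input order.
        Map<String, List<String>> savepointsByT = new LinkedHashMap<>();
        for (String f : files) {
            if (isSavepointFile(f)) {
                savepointsByT.computeIfAbsent(instantTime(f), k -> new ArrayList<>()).add(f);
            }
        }
        Result result = new Result();
        Set<String> extracted = new HashSet<>();
        for (List<String> group : savepointsByT.values()) {
            int inflight = 0;
            int completed = 0;
            for (String f : group) {
                if (f.endsWith(".savepoint.inflight")) {
                    inflight++;
                } else {
                    completed++;
                }
            }
            // Only a complete pair (one inflight + one completed) is pulled out.
            if (inflight == 1 && completed == 1) {
                result.savepointBatches.add(new ArrayList<>(group));
                extracted.addAll(group);
            }
        }
        for (String f : files) {
            if (!extracted.contains(f)) {
                result.remaining.add(f);
            }
        }
        return result;
    }
}
```

Applied to the five files listed earlier, this leaves the three deltacommit files in remaining and yields one two-file savepoint batch; a lone <T>.savepoint.inflight is not extracted and stays in remaining.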
Reproduction: any V9 MoR table that runs daily savepoints (e.g. the
Concentric perf-test workload). After this fix LakeView resumes uploading
post-savepoint instants — including any compaction the customer triggers
once log files accumulate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
tiennguyen-onehouse
left a comment
/push-image hudi1.1-May5
Summary
ActiveTimelineInstantBatcher.createBatches was returning zero batches and silently halting upload of the active timeline.
Root cause
A savepoint pinned to instant time T writes <T>.savepoint.inflight + <T>_<completionTs>.savepoint. After lex-sort these interleave with the same-T deltacommit's three-file group, leaving <T>.savepoint.inflight at position 3. The triplet check areRelatedInstants(inflight, requested, completed) fails (states become {inflight, requested, inflight}), and with the default upload strategy this trips shouldStopIteration=true at the very first group, producing zero batches. The bug is silent (logged only at INFO level: "Could not create batches with completed commits…").
Confirmed in staging by cloning Concentric's lhperftest/.../permissions_bronze/.hoodie/ (which has 5 daily savepoint groups + 1 compaction-with-savepoint) and observing the looping INFO log every poll cycle with no progress.
Test plan
ActiveTimelineInstantBatcherTest cases still pass (V1 layout / no savepoints).
[…] permissions_bronze's timeline.
[…] permissions_bronze and observe "Uploading batch …" logs progressing past April 30.
Related
🤖 Generated with Claude Code