Skip to content

[FLINK-39728] Fix pauseOrResumeSplits race with unassigned partitions#257

Open
jnh5y wants to merge 1 commit into
apache:mainfrom
jnh5y:FLINK-39728-pause-resume-race
Open

[FLINK-39728] Fix pauseOrResumeSplits race with unassigned partitions#257
jnh5y wants to merge 1 commit into
apache:mainfrom
jnh5y:FLINK-39728-pause-resume-race

Conversation

@jnh5y
Copy link
Copy Markdown

@jnh5y jnh5y commented May 21, 2026

Filter against consumer.assignment() before calling pause()/resume() to prevent IllegalStateException when a partition is concurrently unassigned by fetch() or removeEmptySplits().

Generated-by: Claude Opus 4.6 (Anthropic)

Filter against consumer.assignment() before calling pause()/resume()
to prevent IllegalStateException when a partition is concurrently
unassigned by fetch() or removeEmptySplits().

Generated-by: Claude Opus 4.6 (Anthropic)
@boring-cyborg
Copy link
Copy Markdown

boring-cyborg Bot commented May 21, 2026

Thanks for opening this pull request! Please check out our contributing guidelines. (https://flink.apache.org/contributing/how-to-contribute.html)

Copy link
Copy Markdown
Contributor

@Savonitar Savonitar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for the fix. The filter against consumer.assignment() before each pause/resume call is the right defensive guard.

I find “concurrently” in the comments/description a bit misleading: the unassign and the pause/resume call are sequential on the SplitFetcher thread, not parallel. The race is between the mailbox thread’s view at enqueue time and the SplitFetcher’s state at execution time.

But it does not affect the proposed fix.

LGTM

Copy link
Copy Markdown
Contributor

@Efrat19 Efrat19 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed this IllegalStateException can occur when the split finished event doesn't propagate in time to abort alignment check on the sourceOperator, causing pauseOrResumeSplits to be called on a finished split
LGTM % nit

Collection<KafkaPartitionSplit> splitsToResume) {
// Filter against current assignment to avoid IllegalStateException when a partition
// was concurrently unassigned by fetch() or removeEmptySplits().
Set<TopicPartition> assigned = consumer.assignment();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Wdyt about leaving a warn here to unmask any edge case where pauseOrResumeSplits is called for unassigned partition from an unfinished split?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants