@lluwm lluwm commented Jan 21, 2026

Problem Statement

Fixed a data integrity issue where DIV (Data Integrity Validator) would report "data missing" errors during leader transitions.

This occurred when:

  • A follower became leader before consuming all local VT (Version Topic) messages
  • The follower fell back to follower role when the original leader returned
  • The original leader would skip messages produced by the temporary leader because they carried duplicate DIV sequence numbers
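
To make the "duplicate sequence number" failure mode concrete, here is a toy sequence validator. It is only an illustration of the general idea (a consumer tracks the last sequence number seen per producer and drops anything at or below it); it is not Venice's DIV implementation, and the class, enum, and producer names are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: NOT the Venice DIV code. All names are hypothetical.
final class ToySequenceValidator {
  enum Result { OK, DUPLICATE, MISSING }

  // Last sequence number accepted (or skipped past) for each producer.
  private final Map<String, Integer> lastSeqByProducer = new HashMap<>();

  Result validate(String producerId, int seq) {
    Integer last = lastSeqByProducer.get(producerId);
    int expected = (last == null) ? 0 : last + 1;
    if (seq < expected) {
      // Already-seen sequence number: the message is treated as a duplicate and skipped.
      return Result.DUPLICATE;
    }
    // Record the new position whether or not there was a gap.
    lastSeqByProducer.put(producerId, seq);
    // A jump past the expected number means something in between was never seen.
    return (seq > expected) ? Result.MISSING : Result.OK;
  }

  public static void main(String[] args) {
    ToySequenceValidator validator = new ToySequenceValidator();
    for (int seq = 0; seq <= 4; seq++) {
      validator.validate("producer-A", seq); // 0..4 are accepted in order
    }
    // A writer with stale state reuses sequence number 3 for a new message.
    System.out.println(validator.validate("producer-A", 3)); // prints DUPLICATE
    // A later message arrives with a gap before it.
    System.out.println(validator.validate("producer-A", 7)); // prints MISSING
  }
}
```

In this toy model, a writer that starts producing from stale state reuses sequence numbers the consumer has already seen, so its new messages are skipped even though their payloads are new.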

Solution

The fix adds an extra constraint for regular stores: before EOP (End of Push) is received, a regular-store follower must wait until its local partition has been fully consumed before it is allowed to switch to leader.
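
A minimal sketch of the added constraint, assuming it reduces to a simple boolean condition. The class, method, and parameter names below are hypothetical and are not taken from the PR's code; the actual check in the ingestion task may combine this with other conditions.

```java
// Hypothetical sketch of the added guard; not the code from this PR.
final class LeaderSwitchGuardSketch {

  /**
   * Returns true when the new constraint does not block a follower from switching to leader.
   *
   * @param isRegularStore            whether the store is a regular store (the constraint only applies to these)
   * @param eopReceived               whether EOP has already been received for the partition
   * @param localVtFullyConsumed      whether the local VT partition has been fully consumed
   */
  static boolean passesLocalVtGuard(boolean isRegularStore, boolean eopReceived, boolean localVtFullyConsumed) {
    if (!isRegularStore) {
      // Assumption: non-regular stores are not affected by the new constraint.
      return true;
    }
    // Before EOP, a regular store must have caught up on its local VT partition.
    return eopReceived || localVtFullyConsumed;
  }

  public static void main(String[] args) {
    boolean[] values = {false, true};
    // Print the full truth table for the three inputs.
    for (boolean regular : values) {
      for (boolean eop : values) {
        for (boolean consumed : values) {
          System.out.printf(
              "regular=%b eop=%b consumed=%b -> %b%n",
              regular, eop, consumed, passesLocalVtGuard(regular, eop, consumed));
        }
      }
    }
  }
}
```

Under this sketch, the only blocked combination is a regular store that has neither received EOP nor fully consumed its local VT partition.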

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

…grity issue in leader transitions

Fixed a data integrity issue where DIV (Data Integrity Validator) would report "data missing" errors during leader transitions.

This occurred when:
- A follower became leader before consuming all local VT (Version Topic) messages
- The follower fell back to follower role when the original leader returned
- The original leader would skip messages produced by the temporary leader because they carried duplicate DIV sequence numbers

The fix adds an extra constraint for regular stores: before EOP is received, a regular-store follower
must wait until its local partition has been fully consumed before it is allowed to switch to leader.
Copilot AI review requested due to automatic review settings January 21, 2026 17:30

Copilot AI left a comment

Pull request overview

This PR fixes a data integrity issue where DIV (Data Integrity Validator) reported "data missing" errors during leader transitions. The issue occurred when a follower became leader before consuming all local VT (Version Topic) messages, then fell back to follower when the original leader returned, causing the original leader to skip messages produced by the temporary leader.

Changes:

  • Modified canSwitchToLeaderTopic logic to require regular stores to either have received EOP or fully consumed local VT partition before switching to leader
  • Added comprehensive unit tests to verify the new boolean logic
  • Reduced test message count from 1000 to 500 in testDataValidationCheckPointing for performance optimization

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java | Simplified canSwitchToLeaderTopic logic to add constraint for regular stores before EOP; changed visibility to package-private for testing |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTaskTest.java | Added comprehensive unit test testCanSwitchToLeaderTopicLogic to verify all scenarios of the new boolean logic |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/StoreIngestionTaskTest.java | Reduced message count from 1000 to 500 to improve test performance |
