[server] Add DoL loopback to ensure new leader is fully caught up on VT #2314
base: main
Pull request overview
This PR introduces a Declaration-of-Leadership (DoL) loopback mechanism to ensure new leader replicas are fully caught up on the Version Topic (VT) before switching to consume from remote VT or Real-Time (RT) topics. This replaces the previous time-based heuristic with a deterministic approach that eliminates the risk of duplicate consumption and data inconsistencies during leader transitions.
Key Changes:
- New DoL control message type with unique GUID that leaders produce to local VT during STANDBY→LEADER transition
- Leader waits to consume its own DoL message back (loopback confirmation) before switching to remote sources
- Configurable rollout via separate flags for system stores and user stores (SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM_FOR_SYSTEM_STORES and SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM_FOR_USER_STORES)
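The two rollout flags suggest a simple per-store-type switch. The following is a minimal sketch of how such a check could look, assuming the store type is already known. The class `DolRolloutConfig` and the `shouldUseDolMechanism(boolean)` signature are hypothetical: in the PR, `shouldUseDolMechanism()` is called without arguments, so it presumably resolves the store type from the ingestion task itself.

```java
// Hypothetical sketch: selects the DoL rollout flag by store type.
// Names are illustrative and do not reflect VeniceServerConfig's actual getters.
final class DolRolloutConfig {
  private final boolean useDolForSystemStores; // value of SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM_FOR_SYSTEM_STORES
  private final boolean useDolForUserStores;   // value of SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM_FOR_USER_STORES

  DolRolloutConfig(boolean useDolForSystemStores, boolean useDolForUserStores) {
    this.useDolForSystemStores = useDolForSystemStores;
    this.useDolForUserStores = useDolForUserStores;
  }

  /** Returns whether DoL should gate the leader switch for a store of the given type. */
  boolean shouldUseDolMechanism(boolean isSystemStore) {
    return isSystemStore ? useDolForSystemStores : useDolForUserStores;
  }

  public static void main(String[] args) {
    // Ramp DoL on system stores first while user stores stay on the legacy path.
    DolRolloutConfig config = new DolRolloutConfig(true, false);
    System.out.println(config.shouldUseDolMechanism(true));  // true
    System.out.println(config.shouldUseDolMechanism(false)); // false
  }
}
```

Keeping two independent flags lets operators ramp the mechanism on low-risk system stores before enabling it for user stores.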
Reviewed changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| DolStamp.java | New class tracking DoL state (produced/consumed flags, leadership term, host ID) during leader transition |
| DolGuidGenerator.java | GUID generator for DoL control messages using UUID type 3 |
| DoLStampGuidGenerator.java | Duplicate GUID generator implementation (identical to DolGuidGenerator) |
| KafkaKey.java | Adds DOL_STAMP constant for DoL control message key |
| VeniceWriter.java | Implements sendDoLStamp() and getDoLStampKME() for producing DoL messages |
| StoreIngestionTask.java | Adds checkAndHandleDoLMessage() to detect and handle consumed DoL messages, validates DoL messages like heartbeats |
| LeaderFollowerStoreIngestionTask.java | Orchestrates DoL mechanism: initializes DoL state, sends DoL stamp, checks readiness in canSwitchToLeaderTopic(), falls back to legacy behavior when DoL disabled |
| PartitionConsumptionState.java | Tracks DoL state and highest observed leadership term per partition |
| LeaderFollowerPartitionStateModel.java | Uses Helix message creation timestamp as leadership term |
| SharedKafkaConsumer.java | Adds region name and index to toString() for better debugging |
| ConfigKeys.java | Defines two new config flags for DoL mechanism enablement |
| VeniceServerConfig.java | Exposes DoL config flags via getters |
| VeniceServerWrapper.java | Enables DoL mechanism for both system and user stores in integration tests |
| VeniceClusterWrapper.java | Increases timeout for version wait from 60s to 120s to accommodate DoL latency |
| TestHybrid.java | Adds timeout and unique store name for log compaction test |
| TestHybridMultiRegion.java | Updates test to use sendEmptyPushAndWait() and improves error message assertion |
| TestTopicRequestOnHybridDelete.java | Removes unused imports and deletes deleteStoreAfterStartedPushAllowsNewPush test |
| log4j2.properties | Updates logging configuration (contains hardcoded user path) |
| StoreIngestionTaskTest.java | Renames test method from resolveRtTopicPartitionWithPubSubBrokerAddress to resolveTopicPartitionWithPubSubBrokerAddress |
| SharedKafkaConsumerTest.java | Updates test to pass region name and index to SharedKafkaConsumer constructor |
| ActiveActiveStoreIngestionTask.java | Updates method calls from resolveRtTopicPartitionWithPubSubBrokerAddress to resolveTopicPartitionWithPubSubBrokerAddress |
| KafkaConsumerService.java | Passes region name and index when creating SharedKafkaConsumer instances |
| PartitionWiseKafkaConsumerService.java | Adds consumer instance to log output for better debugging |
| HelixReadWriteSchemaRepository.java | Adds store name to exception log message |
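The table notes that DolGuidGenerator uses UUID type 3, i.e. a name-based UUID derived from an MD5 hash. In Java, `UUID.nameUUIDFromBytes` produces exactly that UUID version. The sketch below is only an illustration: the name components (host ID, version topic, partition, leadership term) and the helper `dolGuid` are assumptions, not the inputs the PR's generator actually hashes.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical sketch of a name-based (type 3, MD5) GUID for a DoL marker.
// The name components below are assumptions, not DolGuidGenerator's real inputs.
final class DolGuidSketch {
  static UUID dolGuid(String hostId, String versionTopic, int partition, long leadershipTerm) {
    String name = hostId + "|" + versionTopic + "|" + partition + "|" + leadershipTerm;
    // UUID.nameUUIDFromBytes returns a version 3 (MD5 name-based) UUID.
    return UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    UUID a = dolGuid("host-1", "store_v1", 0, 1700000000000L);
    UUID b = dolGuid("host-1", "store_v1", 0, 1700000000000L);
    UUID c = dolGuid("host-1", "store_v1", 0, 1700000000001L);
    System.out.println(a.version()); // 3
    System.out.println(a.equals(b)); // true: same inputs give the same GUID
    System.out.println(a.equals(c)); // false: a new term gives a new GUID
  }
}
```

Because a type 3 UUID is deterministic for a given name, including the leadership term in the name is what would make each leadership attempt's marker distinct.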
- ...inci-client/src/main/java/com/linkedin/davinci/kafka/consumer/PartitionConsumptionState.java (resolved)
- clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/DolStamp.java (resolved)
- ...ient/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java (resolved)
- internal/venice-common/src/main/java/com/linkedin/venice/writer/VeniceWriter.java (resolved)
- internal/venice-test-common/src/integrationTest/resources/log4j2.properties (resolved)
- ...ient/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java (resolved)
- clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/DolStamp.java (resolved)
- ...ient/src/main/java/com/linkedin/davinci/kafka/consumer/LeaderFollowerStoreIngestionTask.java (resolved)
- internal/venice-common/src/main/java/com/linkedin/venice/guid/DolGuidGenerator.java (resolved)
Add SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM config to enable the new Declaration of Leadership (DoL) mechanism for fast leader handover.

Changes:
- Add SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM config key in ConfigKeys.java
- Add leaderHandoverUseDoLMechanism field and getter in VeniceServerConfig
- Refactor canSwitchToLeaderTopic() to check config and route logic
- Extract canSwitchToLeaderTopicLegacy() with original time-based logic
- Add comprehensive design document for DoL mechanism

Default: false (maintains backward compatibility with legacy time-based mechanism)

This is step 1 of the DoL implementation. The actual DoL loopback logic will be implemented in subsequent commits when the config is enabled.

- Add a separate config
- Create DoL message
- Add leadership term
- Add leadership term in LeaderSessionIdChecker
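To show why the legacy path is only a heuristic, here is a minimal sketch of a time-based promotion check in the spirit of canSwitchToLeaderTopicLegacy(). It assumes the check compares the time since the last consumed VT message against a quiet period; the class name, parameters, and the 5-minute value are illustrative assumptions, not the actual implementation or its default.

```java
// Hypothetical sketch of a time-based promotion check (legacy-style).
// The class name, parameters, and threshold are illustrative assumptions.
final class LegacyPromotionCheck {
  private final long quietPeriodMs;

  LegacyPromotionCheck(long quietPeriodMs) {
    this.quietPeriodMs = quietPeriodMs;
  }

  /**
   * Returns true once nothing has been consumed from local VT for quietPeriodMs.
   * This only infers catch-up: a stalled or lagging consumer also looks "quiet"
   * while unconsumed messages are still sitting in VT.
   */
  boolean canSwitch(long lastConsumedTimestampMs, long nowMs) {
    return nowMs - lastConsumedTimestampMs >= quietPeriodMs;
  }

  public static void main(String[] args) {
    LegacyPromotionCheck check = new LegacyPromotionCheck(5 * 60 * 1000L); // illustrative value
    long last = 1_000_000L;
    System.out.println(check.canSwitch(last, last + 60_000L));  // false: still within the quiet period
    System.out.println(check.canSwitch(last, last + 300_000L)); // true: quiet period elapsed
  }
}
```

Nothing in this check observes the VT itself, which is the gap the DoL loopback closes.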
0998fab to 1bbff55
sixpluszero left a comment
Thank you for your change! Great work improving ingestion stability. I left some comments for clarification.
```java
private void initializeAndSendDoLStamp(PartitionConsumptionState partitionConsumptionState, long leadershipTerm) {
  if (!shouldUseDolMechanism()) {
    LOGGER.debug(
        "Skipping DoL stamp initialization for replica: {} as DoL mechanism is disabled",
```
I think it is valuable to have this log at INFO level, since this feature will be ramped via config. I also expect the rate of this message to be low.
If we change the log level to INFO, we will keep printing this log until the feature is turned on, which is not very helpful IMHO. When the feature is enabled, we will see "Initialized DoL state:", which is a better signal.
```java
        leadershipTerm,
        exception);
    // Clear DoL state on failure
    pcs.clearDolState();
```
But in this case, shouldn't we mark something and retry in the promotion check logic?
All Venice writers (producers) are configured with infinite retries, so we don't need additional retries here.
```java
        leadershipTerm,
        dolStamp);
  } else {
    LOGGER.warn(
```
How could this happen, unless you have S->L->S->L transitions in a short time?
```java
private boolean canSwitchToLeaderTopic(PartitionConsumptionState pcs) {
  // Check if DoL mechanism is enabled via config (system stores vs user stores)
  DolStamp dolStamp = pcs.getDolState();
  if (shouldUseDolMechanism() && dolStamp != null) {
```
If the produce fails, will this check be skipped entirely?
Right. When the produce fails, we clear the DoL record in the PCS, and the check then falls back to legacy mode.
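The fallback described in this thread can be summarized as a small routing rule: if DoL is enabled and a DoL state exists, wait for the loopback; if the state was never created or was cleared after a failed produce, defer to the legacy check. The sketch below is single-threaded and hypothetical (`LeaderSwitchRouter`, `DolState`, and the callback names are not from the PR); the real code updates this state from asynchronous produce and consume paths.

```java
import java.util.function.BooleanSupplier;

// Hypothetical, single-threaded sketch of the DoL-vs-legacy routing.
// Names are illustrative; this is not LeaderFollowerStoreIngestionTask code.
final class LeaderSwitchRouter {
  /** Minimal stand-in for the per-partition DoL state. */
  static final class DolState {
    boolean loopbackConsumed; // set once the leader reads its own DoL back from local VT
  }

  private final boolean dolEnabled;          // outcome of the per-store-type DoL config check
  private final BooleanSupplier legacyCheck; // stand-in for the legacy time-based check
  private DolState dolState;                 // null if never initialized or cleared after a failed produce

  LeaderSwitchRouter(boolean dolEnabled, BooleanSupplier legacyCheck) {
    this.dolEnabled = dolEnabled;
    this.legacyCheck = legacyCheck;
  }

  void onDolProduceStarted() {
    dolState = new DolState();
  }

  void onDolProduceFailed() {
    dolState = null; // clearing the state makes the next check fall back to legacy mode
  }

  void onOwnDolConsumed() {
    if (dolState != null) {
      dolState.loopbackConsumed = true;
    }
  }

  boolean canSwitchToLeaderTopic() {
    if (dolEnabled && dolState != null) {
      return dolState.loopbackConsumed; // deterministic: wait for the loopback
    }
    return legacyCheck.getAsBoolean(); // DoL disabled or state cleared: legacy heuristic decides
  }

  public static void main(String[] args) {
    LeaderSwitchRouter router = new LeaderSwitchRouter(true, () -> true);
    router.onDolProduceStarted();
    System.out.println(router.canSwitchToLeaderTopic()); // false: DoL not yet consumed back
    router.onOwnDolConsumed();
    System.out.println(router.canSwitchToLeaderTopic()); // true

    LeaderSwitchRouter failed = new LeaderSwitchRouter(true, () -> false);
    failed.onDolProduceStarted();
    failed.onDolProduceFailed();
    System.out.println(failed.canSwitchToLeaderTopic()); // false: decided by the legacy check
  }
}
```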
```java
// Ignore DoL from different host
if (!expectedHostId.equals(consumedHostId)) {
  LOGGER.debug(
```
Same here for the DEBUG level. IMO these can all be INFO for easier debugging.
These would end up polluting the logs, since it should be fairly common to see old DoL messages in the topic.
| } | ||
|
|
||
| // Handle DoL from future term - indicates race or concurrent leadership change | ||
| if (consumedTermId > expectedTermId) { |
If this happens, will we ever flip to leader?
Unfortunately, yes (until we have a termId-based end-to-end implementation, we will not take any action).
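The excerpts in these threads show two checks on a consumed DoL: a host ID mismatch is ignored, and a DoL from a future term is treated as a race that is logged but not acted on. Below is a hypothetical classifier that collects those cases in one place. The older-term branch and all names (`DolLoopbackValidator`, `Outcome`, `classify`) are assumptions for illustration. Per the file table, the term itself comes from the Helix message creation timestamp.

```java
// Hypothetical sketch of classifying a consumed DoL message. The host and
// future-term checks mirror the review excerpts; the older-term branch is an assumption.
final class DolLoopbackValidator {
  enum Outcome { IGNORED_OTHER_HOST, IGNORED_OLDER_TERM, FUTURE_TERM, LOOPBACK_CONFIRMED }

  private final String expectedHostId;
  private final long expectedTermId;

  DolLoopbackValidator(String expectedHostId, long expectedTermId) {
    this.expectedHostId = expectedHostId;
    this.expectedTermId = expectedTermId;
  }

  Outcome classify(String consumedHostId, long consumedTermId) {
    if (!expectedHostId.equals(consumedHostId)) {
      return Outcome.IGNORED_OTHER_HOST; // e.g. an old DoL written by a previous leader
    }
    if (consumedTermId < expectedTermId) {
      return Outcome.IGNORED_OLDER_TERM; // this host's DoL from an earlier leadership term
    }
    if (consumedTermId > expectedTermId) {
      return Outcome.FUTURE_TERM; // race or concurrent leadership change; no action taken yet
    }
    return Outcome.LOOPBACK_CONFIRMED; // own DoL for the current term: caught up on VT
  }

  public static void main(String[] args) {
    DolLoopbackValidator validator = new DolLoopbackValidator("host-a", 100L);
    System.out.println(validator.classify("host-b", 100L)); // IGNORED_OTHER_HOST
    System.out.println(validator.classify("host-a", 99L));  // IGNORED_OLDER_TERM
    System.out.println(validator.classify("host-a", 101L)); // FUTURE_TERM
    System.out.println(validator.classify("host-a", 100L)); // LOOPBACK_CONFIRMED
  }
}
```

Returning an outcome rather than logging inside each branch also leaves the log-level choice debated above to the caller.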
Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions.
[server] Add DoL loopback to ensure new leader is fully caught up on VT
Newly elected leaders were re-consuming the NR source topic because the
promotion logic relied only on elapsed time since the last consumed
message before switching to the remote version topic (VT). This
time-based heuristic is insufficient and can cause duplicate consumption
and data inconsistencies.
Fix the issue by requiring the new leader to produce a Declaration-of-
Leadership (DoL) marker to the local VT and wait until it consumes that
same marker back. This provides a deterministic guarantee that the
leader has fully caught up on VT before switching to RT or NR sources.
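The guarantee rests on the VT partition being an append-only, totally ordered log: once the leader consumes the marker it appended, it must already have consumed every message that preceded the marker. The in-memory model below illustrates that argument under simplifying assumptions (a `List` stands in for the partition, and a string stands in for the DoL GUID). All names are hypothetical, and none of this is Venice code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of one local VT partition (append-only, totally ordered).
// It illustrates the loopback argument only; it is not Venice code.
final class DolLoopbackDemo {
  static final class Message {
    final String dolGuid; // non-null only for DoL markers
    final String value;   // payload for regular data messages

    private Message(String dolGuid, String value) {
      this.dolGuid = dolGuid;
      this.value = value;
    }

    static Message data(String value) { return new Message(null, value); }
    static Message dol(String guid) { return new Message(guid, null); }
    boolean isDol() { return dolGuid != null; }
  }

  /** Appends a DoL marker to the log and returns its offset. */
  static int produceDol(List<Message> log, String guid) {
    log.add(Message.dol(guid));
    return log.size() - 1;
  }

  /**
   * Consumes from fromOffset, applying data messages, until this leader's own DoL is found.
   * Returns the DoL offset, or -1 if the marker has not been consumed yet.
   */
  static int consumeUntilOwnDol(List<Message> log, int fromOffset, String guid, List<String> applied) {
    for (int offset = fromOffset; offset < log.size(); offset++) {
      Message m = log.get(offset);
      if (!m.isDol()) {
        applied.add(m.value);
      } else if (guid.equals(m.dolGuid)) {
        return offset; // log order guarantees every earlier message was consumed
      }
      // A DoL with a different GUID (e.g. from a previous leader) is skipped.
    }
    return -1;
  }

  public static void main(String[] args) {
    List<Message> vt = new ArrayList<>();
    vt.add(Message.data("k1=v1"));
    vt.add(Message.dol("old-leader-guid"));
    vt.add(Message.data("k2=v2"));
    int produced = produceDol(vt, "new-leader-guid");

    List<String> applied = new ArrayList<>();
    int consumed = consumeUntilOwnDol(vt, 0, "new-leader-guid", applied);
    System.out.println(produced == consumed); // true
    System.out.println(applied);              // [k1=v1, k2=v2]
  }
}
```

Unlike the time-based check, the decision here depends only on what has been read from the log, not on how long the consumer has been idle.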
Code changes
Concurrency-Specific Checks
Both reviewer and PR author to verify
synchronized, RWLock) are used where needed.
ConcurrentHashMap, CopyOnWriteArrayList).
How was this PR tested?
Does this PR introduce any user-facing or breaking changes?