@sushantmane
Contributor

[server] Add DoL loopback to ensure new leader is fully caught up on VT

Newly elected leaders were re-consuming the NR source topic because the
promotion logic relied only on elapsed time since the last consumed
message before switching to the remote version topic (VT). This
time-based heuristic is insufficient and can cause duplicate consumption
and data inconsistencies.

Fix the issue by requiring the new leader to produce a Declaration-of-
Leadership (DoL) marker to the local VT and wait until it consumes that
same marker back. This provides a deterministic guarantee that the
leader has fully caught up on VT before switching to RT or NR sources.
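
A minimal sketch of the loopback idea described above, assuming a simplified in-memory model. The class `DolLoopbackSketch`, its nested `Stamp` type, and all method names are hypothetical illustrations and are not the classes added by this PR; the actual produce and consume steps go through the local VT.

```java
import java.util.UUID;

/** Hypothetical, simplified model of the DoL loopback; not the PR's actual classes. */
final class DolLoopbackSketch {
  /** Marker that a newly elected leader writes to its local VT. */
  static final class Stamp {
    final String hostId;
    final long leadershipTerm;
    final UUID guid;

    Stamp(String hostId, long leadershipTerm, UUID guid) {
      this.hostId = hostId;
      this.leadershipTerm = leadershipTerm;
      this.guid = guid;
    }
  }

  private final String hostId;
  private Stamp pending; // stamp produced by this replica, not yet consumed back
  private boolean caughtUp;

  DolLoopbackSketch(String hostId) {
    this.hostId = hostId;
  }

  /** Step 1: on STANDBY->LEADER, create the marker the caller would produce to the local VT. */
  Stamp declareLeadership(long leadershipTerm) {
    pending = new Stamp(hostId, leadershipTerm, UUID.randomUUID());
    caughtUp = false;
    return pending;
  }

  /** Step 2: invoked for every DoL marker read back from the local VT. */
  void onStampConsumed(Stamp consumed) {
    // Only the marker this replica produced proves that it has read everything before it.
    if (pending != null && pending.guid.equals(consumed.guid)) {
      caughtUp = true;
    }
  }

  /** Step 3: the switch to RT or NR sources is allowed only after the loopback completes. */
  boolean canSwitchToLeaderTopic() {
    return caughtUp;
  }
}
```

Because the marker is appended after everything already in the local VT, consuming it back implies that all earlier records were consumed, which is the deterministic guarantee the time-based heuristic could not give.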

Code changes

  • Added new code behind a config. If so list the config names and their default values in the PR description.
  • Introduced new log lines.
    • Confirmed if logs need to be rate limited to avoid excessive logging.

Concurrency-Specific Checks

Both reviewer and PR author to verify

  • Code has no race conditions or thread safety issues.
  • Proper synchronization mechanisms (e.g., synchronized, RWLock) are used where needed.
  • No blocking calls inside critical sections that could lead to deadlocks or performance degradation.
  • Verified thread-safe collections are used (e.g., ConcurrentHashMap, CopyOnWriteArrayList).
  • Validated proper exception handling in multi-threaded code to avoid silent thread termination.

How was this PR tested?

  • New unit tests added.
  • New integration tests added.
  • Modified or extended existing tests.
  • Verified backward compatibility (if applicable).

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.
  • Yes. Clearly explain the behavior change and its impact.

Copilot AI review requested due to automatic review settings November 24, 2025 18:46
Copilot AI left a comment

Pull request overview

This PR introduces a Declaration-of-Leadership (DoL) loopback mechanism to ensure new leader replicas are fully caught up on the Version Topic (VT) before switching to consume from remote VT or Real-Time (RT) topics. This replaces the previous time-based heuristic with a deterministic approach that eliminates the risk of duplicate consumption and data inconsistencies during leader transitions.

Key Changes:

  • New DoL control message type with unique GUID that leaders produce to local VT during STANDBY→LEADER transition
  • Leader waits to consume its own DoL message back (loopback confirmation) before switching to remote sources
  • Configurable rollout via separate flags for system stores and user stores (SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM_FOR_SYSTEM_STORES and SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM_FOR_USER_STORES)
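
The split rollout in the last bullet can be sketched as follows. This is an illustration only: the property key strings, the `false` defaults, the class name `DolRolloutConfigSketch`, and the `isSystemStore` parameter are assumptions, and the real flags are read through `ConfigKeys` and `VeniceServerConfig`.

```java
import java.util.Properties;

/** Hypothetical sketch of routing between the two DoL rollout flags. */
final class DolRolloutConfigSketch {
  // Assumed property names; the actual string values live in ConfigKeys.java.
  static final String SYSTEM_STORES_KEY = "server.leader.handover.use.dol.mechanism.for.system.stores";
  static final String USER_STORES_KEY = "server.leader.handover.use.dol.mechanism.for.user.stores";

  private final boolean enabledForSystemStores;
  private final boolean enabledForUserStores;

  DolRolloutConfigSketch(Properties props) {
    // Defaulting to false is an assumption that keeps the legacy behavior unless a flag is set.
    this.enabledForSystemStores = Boolean.parseBoolean(props.getProperty(SYSTEM_STORES_KEY, "false"));
    this.enabledForUserStores = Boolean.parseBoolean(props.getProperty(USER_STORES_KEY, "false"));
  }

  /** Picks the flag that applies to the kind of store being ingested. */
  boolean shouldUseDolMechanism(boolean isSystemStore) {
    return isSystemStore ? enabledForSystemStores : enabledForUserStores;
  }
}
```

Keeping the flags separate lets an operator enable the mechanism for one class of stores first and extend it to the other once it has proven stable.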

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| DolStamp.java | New class tracking DoL state (produced/consumed flags, leadership term, host ID) during leader transition |
| DolGuidGenerator.java | GUID generator for DoL control messages using UUID type 3 |
| DoLStampGuidGenerator.java | Duplicate GUID generator implementation (identical to DolGuidGenerator) |
| KafkaKey.java | Adds DOL_STAMP constant for DoL control message key |
| VeniceWriter.java | Implements sendDoLStamp() and getDoLStampKME() for producing DoL messages |
| StoreIngestionTask.java | Adds checkAndHandleDoLMessage() to detect and handle consumed DoL messages, validates DoL messages like heartbeats |
| LeaderFollowerStoreIngestionTask.java | Orchestrates DoL mechanism: initializes DoL state, sends DoL stamp, checks readiness in canSwitchToLeaderTopic(), falls back to legacy behavior when DoL disabled |
| PartitionConsumptionState.java | Tracks DoL state and highest observed leadership term per partition |
| LeaderFollowerPartitionStateModel.java | Uses Helix message creation timestamp as leadership term |
| SharedKafkaConsumer.java | Adds region name and index to toString() for better debugging |
| ConfigKeys.java | Defines two new config flags for DoL mechanism enablement |
| VeniceServerConfig.java | Exposes DoL config flags via getters |
| VeniceServerWrapper.java | Enables DoL mechanism for both system and user stores in integration tests |
| VeniceClusterWrapper.java | Increases timeout for version wait from 60s to 120s to accommodate DoL latency |
| TestHybrid.java | Adds timeout and unique store name for log compaction test |
| TestHybridMultiRegion.java | Updates test to use sendEmptyPushAndWait() and improves error message assertion |
| TestTopicRequestOnHybridDelete.java | Removes unused imports and deletes deleteStoreAfterStartedPushAllowsNewPush test |
| log4j2.properties | Updates logging configuration (contains hardcoded user path) |
| StoreIngestionTaskTest.java | Renames test method from resolveRtTopicPartitionWithPubSubBrokerAddress to resolveTopicPartitionWithPubSubBrokerAddress |
| SharedKafkaConsumerTest.java | Updates test to pass region name and index to SharedKafkaConsumer constructor |
| ActiveActiveStoreIngestionTask.java | Updates method calls from resolveRtTopicPartitionWithPubSubBrokerAddress to resolveTopicPartitionWithPubSubBrokerAddress |
| KafkaConsumerService.java | Passes region name and index when creating SharedKafkaConsumer instances |
| PartitionWiseKafkaConsumerService.java | Adds consumer instance to log output for better debugging |
| HelixReadWriteSchemaRepository.java | Adds store name to exception log message |
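
For context on the "UUID type 3" entry above: in Java, `UUID.nameUUIDFromBytes` produces a type 3 (name-based) UUID. The sketch below only demonstrates that JDK call; the class name `Type3GuidSketch` and the seed string format are hypothetical, and the bytes that `DolGuidGenerator` actually hashes are not shown in this conversation.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical illustration of type 3 (name-based) UUID generation with the JDK. */
final class Type3GuidSketch {
  /** Derives a UUID deterministically from the given seed string. */
  static UUID guidFor(String seed) {
    // nameUUIDFromBytes hashes the input, so equal seeds always yield equal UUIDs.
    return UUID.nameUUIDFromBytes(seed.getBytes(StandardCharsets.UTF_8));
  }
}
```

Since a type 3 UUID is a pure function of its input, any uniqueness of the resulting GUID comes from the uniqueness of the seed.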

Add SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM config to enable the new
Declaration of Leadership (DoL) mechanism for fast leader handover.

Changes:
- Add SERVER_LEADER_HANDOVER_USE_DOL_MECHANISM config key in ConfigKeys.java
- Add leaderHandoverUseDoLMechanism field and getter in VeniceServerConfig
- Refactor canSwitchToLeaderTopic() to check config and route logic
- Extract canSwitchToLeaderTopicLegacy() with original time-based logic
- Add comprehensive design document for DoL mechanism

Default: false (maintains backward compatibility with legacy time-based mechanism)

This is step 1 of the DoL implementation. The actual DoL loopback logic
will be implemented in subsequent commits when the config is enabled.

Add a separate config

Create DoL message

Add leadership term

Add leadership term in LeaderSessionIdChecker
Remove docs/dev_guide/declaration_of_leadership_design.md
Works well

Full working copy

Close partition in VW before sending DoL

add latency

Delete bogus test

Fix flaky test

Improve logs

Fix flaky test
@sushantmane sushantmane force-pushed the ST-AckBack-DuringLeaderPromotion branch from 0998fab to 1bbff55 Compare November 24, 2025 21:22
Contributor

@sixpluszero sixpluszero left a comment

Thank you for your change! Great work improving ingestion stability. I left some comments for clarification.

private void initializeAndSendDoLStamp(PartitionConsumptionState partitionConsumptionState, long leadershipTerm) {
  if (!shouldUseDolMechanism()) {
    LOGGER.debug(
        "Skipping DoL stamp initialization for replica: {} as DoL mechanism is disabled",
Contributor

I think this log is valuable, since the feature will be ramped via config. I also expect the rate of this message to be low.

Contributor Author

If we change the log level to INFO, we will keep printing this log until the feature is turned on, which is not very helpful IMHO. When the feature is enabled, we will see "Initialized DoL state:", which is a better signal.

leadershipTerm,
exception);
// Clear DoL state on failure
pcs.clearDolState();
Contributor

But in this case, shouldn't we record the failure and retry in the promotion check logic?

Contributor Author

All Venice writers (producers) are configured with infinite retries, so we don't need additional retries here.

leadershipTerm,
dolStamp);
} else {
LOGGER.warn(
Contributor

How could this happen, unless there is an S->L->S->L transition within a short time?

private boolean canSwitchToLeaderTopic(PartitionConsumptionState pcs) {
  // Check if DoL mechanism is enabled via config (system stores vs user stores)
  DolStamp dolStamp = pcs.getDolState();
  if (shouldUseDolMechanism() && dolStamp != null) {
Contributor

If the produce fails, will this branch be skipped entirely?

Contributor Author

Right. When the produce fails, we clear the DoL state in the PCS, and the check then falls back to legacy mode.
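
A rough sketch of the fallback discussed in this thread, under simplifying assumptions: the class `LeaderSwitchSketch`, the nested `DolState` holder, and the elapsed-time parameters are hypothetical stand-ins for the PR's `DolStamp`, `PartitionConsumptionState`, and legacy time-based check.

```java
/** Hypothetical sketch of choosing between the DoL check and the legacy time-based check. */
final class LeaderSwitchSketch {
  /** Simplified stand-in for the per-partition DoL state. */
  static final class DolState {
    boolean produced;
    boolean consumed;
  }

  /**
   * A null dolState models the case where the state was cleared after a failed produce,
   * in which case the decision falls through to the legacy heuristic.
   */
  static boolean canSwitchToLeaderTopic(
      boolean dolEnabled,
      DolState dolState,
      long msSinceLastConsumedMessage,
      long legacyWaitMs) {
    if (dolEnabled && dolState != null) {
      // DoL path: ready only once the stamp has been produced and consumed back.
      return dolState.produced && dolState.consumed;
    }
    // Legacy path: elapsed time since the last consumed message.
    return msSinceLastConsumedMessage >= legacyWaitMs;
  }
}
```

With this shape, a cleared DoL state never blocks promotion; it only removes the deterministic guarantee for that transition.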


// Ignore DoL from different host
if (!expectedHostId.equals(consumedHostId)) {
  LOGGER.debug(
Contributor

Same here for the DEBUG level. IMO these can all be INFO for easier debugging.

Contributor Author

These would end up polluting the logs, since seeing old DoL messages in the topic should be a fairly common occurrence.

}

// Handle DoL from future term - indicates race or concurrent leadership change
if (consumedTermId > expectedTermId) {
Contributor

If this happens, will we ever flip to leader?

Contributor Author

Unfortunately, yes. Until we have a termId-based end-to-end implementation, we won't take any action.
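
The checks discussed in this thread can be summarized in a small sketch. It follows the order visible in the diff fragments (host check first, then the future-term check); the class `DolStampCheckSketch`, the `Outcome` enum, and the treatment of an older term as ignored are assumptions made for illustration.

```java
/** Hypothetical classification of a consumed DoL stamp against the expected one. */
final class DolStampCheckSketch {
  enum Outcome {
    IGNORED_OTHER_HOST,
    FUTURE_TERM_NO_ACTION,
    IGNORED_STALE_TERM,
    LOOPBACK_CONFIRMED
  }

  static Outcome classify(
      String expectedHostId,
      long expectedTermId,
      String consumedHostId,
      long consumedTermId) {
    // A stamp written by another host says nothing about this replica's progress.
    if (!expectedHostId.equals(consumedHostId)) {
      return Outcome.IGNORED_OTHER_HOST;
    }
    // A future term suggests a race or concurrent leadership change; per the reply above,
    // no action is taken for it yet.
    if (consumedTermId > expectedTermId) {
      return Outcome.FUTURE_TERM_NO_ACTION;
    }
    // Assumption: a stamp left over from an earlier term of the same host is skipped.
    if (consumedTermId < expectedTermId) {
      return Outcome.IGNORED_STALE_TERM;
    }
    return Outcome.LOOPBACK_CONFIRMED;
  }
}
```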

@github-actions

Hi there. This pull request has been inactive for 30 days. To keep our review queue healthy, we plan to close it in 7 days unless there is new activity. If you are still working on this, please push a commit, leave a comment, or convert it to draft to signal intent. Thank you for your time and contributions.

@github-actions github-actions bot added the stale label Jan 16, 2026