Skip to content

Conversation

@manujose0
Copy link
Contributor

What changes and why

User-initiated batch pushes were incorrectly blocked when a compliance repush (TTL repush) was ongoing, even though user pushes should be able to kill repushes and proceed. This issue was reported in DEPEND-92712.

Root Cause:
The bug was in VeniceParentHelixAdmin.incrementVersionIdempotent() at lines 1813-1815. The code only checked Version.isPushIdRePush() which looks for the "venice_re_push_" prefix. However, TTL repushes (compliance repushes) use the "venice_ttl_re_push_" prefix, so they were not being detected as repushes. This caused the logic to fall through to throwing a ConcurrentBatchPushException, blocking user-initiated pushes.

How the fix achieves the goal

Updated the repush detection logic to check for both regular repushes and TTL repushes:

// Before:
boolean isExistingPushJobARepush = Version.isPushIdRePush(existingPushJobId);
boolean isIncomingPushJobARepush = Version.isPushIdRePush(pushJobId);

// After (lines 1813-1815):
boolean isExistingPushJobARepush =
    Version.isPushIdRePush(existingPushJobId) || Version.isPushIdTTLRePush(existingPushJobId);
boolean isIncomingPushJobARepush = 
    Version.isPushIdRePush(pushJobId) || Version.isPushIdTTLRePush(pushJobId);

When isExistingPushJobARepush is true and the incoming push is a user push (not a repush), the code at line 1841 correctly kills the existing repush and allows the user push to proceed.

Testing done

Unit Tests Added:

  • testTTLRepushDetection() - Verifies that TTL repush IDs are correctly identified as repushes using the new logic
  • testRepushTypeIdentification() - Validates all repush type detection including the bug scenario

Test Results:

com.linkedin.venice.controller.TestVeniceParentHelixAdmin > testRepushTypeIdentification PASSED
com.linkedin.venice.controller.TestVeniceParentHelixAdmin > testTTLRepushDetection PASSED
BUILD SUCCESSFUL

Pre-commit Validation:

  • All pre-commit hooks passed (spotless formatting applied)
  • Code compiles successfully with Java 17

Files Changed

  • services/venice-controller/src/main/java/com/linkedin/venice/controller/VeniceParentHelixAdmin.java - Bug fix
  • services/venice-controller/src/test/java/com/linkedin/venice/controller/TestVeniceParentHelixAdmin.java - Unit tests

Checklist

  • Tests added demonstrating the bug fix
  • All tests pass
  • Code follows Venice style guide (spotless applied)
  • Relates to DEPEND-92712

User-initiated batch pushes were incorrectly blocked when a compliance
repush (TTL repush) was ongoing. The issue was that TTL repushes use the
"venice_ttl_re_push_" prefix, but the code only checked for the
"venice_re_push_" prefix when determining if an existing push is a repush.

Updated VeniceParentHelixAdmin.incrementVersionIdempotent() to check both
isPushIdRePush() and isPushIdTTLRePush() when identifying repushes,
allowing user pushes to correctly kill TTL repushes and proceed.

Added unit tests to verify the fix.
Copilot AI review requested due to automatic review settings January 22, 2026 01:23
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where user-initiated batch pushes were incorrectly blocked by ongoing TTL (compliance) repushes. The root cause was that the repush detection logic only checked for regular repushes with the "venice_re_push_" prefix, missing TTL repushes that use the "venice_ttl_re_push_" prefix.

Changes:

  • Updated repush detection logic in incrementVersionIdempotent() to check for both regular and TTL repushes
  • Added comprehensive unit tests to verify TTL repush detection and all repush type identification
  • Added clarifying comments explaining that user-initiated pushes should be able to kill compliance repushes

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
VeniceParentHelixAdmin.java Fixed repush detection logic on lines 1813-1815 to include TTL repushes; added explanatory comments on lines 1846-1847
TestVeniceParentHelixAdmin.java Added two comprehensive test methods (testTTLRepushDetection and testRepushTypeIdentification) to verify the fix

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@manujose0
Copy link
Contributor Author

Thank you for the feedback - you were right to question this! After reviewing the actual DEPEND-92712 incident details, I now understand the issue:

Actual Incident Analysis

The blocking push job ID was:

1766008760794_https://observe.prod.linkedin.com/g/d/...

This is a regular push with no prefix (not venice_re_push_ or venice_ttl_re_push_). It was a compliance/privacy deletion push that ran for ~5 hours.

The Real Problem

The current logic at line 1841 only kills existing pushes if they are repushes:

} else if (isExistingPushJobARepush && !pushType.isIncremental() && !isIncomingPushJobARepush) {
    killOfflinePush(clusterName, currentPushTopic.get(), true);
}

Since the compliance push had no repush prefix, it wasn't killed, and the user push was blocked.

My Fix is Incorrect

My current fix adds TTL repushes to repush detection, but this doesn't solve the actual problem since the blocking push was a regular push, not a repush.

What Should the Fix Be?

The requirement from DEPEND-92712 is: "user-initiated pushes should be able to kill compliance/automated pushes"

However, there's no way to distinguish user-initiated vs compliance pushes just from the push job ID.

Questions:

  1. Is there metadata or a flag that identifies whether a push is user-initiated vs compliance/automated?
  2. Should we change the logic to allow user pushes to kill ANY ongoing push (not just repushes)?
  3. Or should compliance pushes use a specific prefix or push type?

I'll hold off on updating this PR until we clarify the correct approach. Sorry for the confusion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant