feat(controller): Add leader election for high availability #851

Merged
shivamerla merged 1 commit into kubernetes-sigs:main from herb-duan:feat/controller-leader-election
Mar 29, 2026

Conversation

@herb-duan
Contributor

@herb-duan herb-duan commented Feb 2, 2026

Fixes #815

The compute-domain-controller currently operates as a singleton, which presents a single point of failure (SPOF) in a production environment. To enhance the reliability and availability of the controller, this change introduces a leader election mechanism to support a multi-replica, high-availability (HA) deployment model.
When HA mode is enabled, only one of the controller replicas becomes the leader and executes the core business logic as well as all change operations (e.g., configuration updates, resource modifications). All other replicas remain in a hot-standby state and will not perform any business or change-related work. This leader election dependency is critical for change operations in particular — concurrent execution of change logic by multiple controller replicas must be strictly prohibited to avoid data inconsistency, conflicting operations, or unintended side effects.
If the current leader fails, one of the standby replicas will automatically take over as the new leader, ensuring uninterrupted service continuity and consistent execution of both core business and change operations.

@copy-pr-bot

copy-pr-bot Bot commented Feb 2, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@herb-duan herb-duan force-pushed the feat/controller-leader-election branch 3 times, most recently from f7ac650 to 0df5f10 Compare February 5, 2026 17:04
Comment thread cmd/compute-domain-controller/main.go Outdated
},
OnNewLeader: func(identity string) {
if identity == lockID {
return
Contributor

  • Can you add a code comment here, explaining that scenario? Does that mean the OnNewLeader() callback can get invoked for us with us being the leader before and after the callback gets invoked?

  • Can you emit a log message here on level 6? Something like klog.V(6).Infof("OnNewLeader() callback, new identity is still my lock ID")

Contributor Author

@herb-duan herb-duan Feb 5, 2026

Great catch.

Source code comments for OnNewLeader:

OnNewLeader is called when the client observes a leader that is not the previously observed leader. This includes the first observed leader when the client starts.

I will add the requested klog.V(6) log to track this. The if identity == lockID { return } check is a standard practice in K8s controllers to avoid redundant processing when the 'new' leader is actually ourselves.
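
To make the quoted contract concrete, here is a small sketch of an observer that fires a callback only when the observed leader changes, including the first observation. The `leaderObserver` type is hypothetical and only mimics the documented behavior; it is not how client-go is implemented.

```go
package main

import "fmt"

// leaderObserver is a hypothetical helper that mimics the documented contract.
type leaderObserver struct {
	self        string
	lastSeen    string
	onNewLeader func(identity string)
}

// observe fires the callback only when the observed leader changes,
// which includes the very first observation.
func (o *leaderObserver) observe(identity string) {
	if identity == o.lastSeen {
		return
	}
	o.lastSeen = identity
	o.onNewLeader(identity)
}

func main() {
	o := &leaderObserver{self: "replica-a"}
	o.onNewLeader = func(identity string) {
		if identity == o.self {
			fmt.Println("new leader is this replica; nothing to announce")
			return
		}
		fmt.Println("new leader elected:", identity)
	}
	o.observe("replica-a") // first observation, and it is ourselves
	o.observe("replica-a") // unchanged: callback is not invoked
	o.observe("replica-b") // leadership moved to another replica
}
```

The first call shows why the self-identity check matters: a replica that wins the election also sees itself reported as the "new" leader.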

Contributor Author

@herb-duan herb-duan Feb 6, 2026

done
I've just pushed an update that adds the requested V(6) logs and includes a detailed architecture note in the code to explain the lifecycle and shutdown strategy. The implementation has been verified with the network partition test as discussed.

Contributor Author

@herb-duan herb-duan Mar 13, 2026

Done. Added comment and klog.V(6) log as suggested.

Comment thread cmd/compute-domain-controller/main.go Outdated
}
},
OnStoppedLeading: func() {
klog.Warningf("Lost leader election (id: %s), waiting to re-compete", lockID)
Contributor

This can happen when we previously were the leader, but now we're transitioning into not being the leader anymore, correct? At least, a typical failover scenario has to be for the old leader to not be the leader anymore. Are there any more actions that we need to perform here in that case? Do we need to perform a controller shutdown? If not: why?

If we later become the leader again, it looks like we would call controller.Run() again, potentially for the Nth time during our lifetime. That of course needs to be safe. Is it safe as of now?

Contributor Author

@herb-duan herb-duan Feb 5, 2026

This is a critical point. In the new implementation, we rely on context-based cancellation.

  1. Shutdown: When we lose leadership, the leaderelection library cancels the leaderCtx passed to controller.Run(leaderCtx). Our controller is designed to watch this context and shut down all workers gracefully when it's canceled.
  2. Process Exit: By calling cancelElector() in the defer block inside OnStartedLeading, we ensure that as soon as the controller stops, the entire elector.Run loop terminates. This allows the pod to exit and be restarted by Kubernetes (the 'crash-everything-and-start-again' strategy), which is the safest way to clear any in-memory state.
  3. Safety: Re-running controller.Run() within the same process lifetime is generally avoided here because we prefer the pod restart to ensure a clean slate. Should I clarify this in the comments?

Contributor Author

@herb-duan herb-duan Mar 13, 2026

Done. The controller now shuts down via context cancellation, and the pod restarts for clean state.

Comment thread cmd/compute-domain-controller/main.go Outdated
klog.Infof("Became leader, starting controller (id: %s)", lockID)
if err := controller.Run(ctx); err != nil {
klog.Errorf("Error running controller as leader: %v", err)
return
Contributor

Do we need to return an error here?

Contributor Author

Yes, but we cannot return it directly from the callback because the callback signature doesn't support it.
In the updated logic, we capture it in the controllerErr variable and then call cancelElector(). This signals the main goroutine (blocked at elector.Run) to wake up, see the error, and return it to the CLI framework. This ensures the fail-fast behavior.

Contributor Author

@herb-duan herb-duan Mar 13, 2026

Done. Error is propagated correctly via error channel now.

Comment thread cmd/compute-domain-controller/main.go Outdated
klog.InfoS("Context canceled, stopping leader elector", "lockID", lockID)
}()

elector.Run(ctx)
Contributor

Does elector.Run(ctx) have an interesting return value?

Contributor Author

@herb-duan herb-duan Feb 5, 2026

elector.Run itself has no return value. It is a blocking call that returns when the context is canceled or when leadership is lost.

Contributor Author

Done

}()

elector.Run(ctx)
return nil
Contributor

Do we always want to return nil here?

Contributor Author

In our updated version, we do not always return nil. We check controllerErr after elector.Run returns. If the controller failed while it was the leader, we propagate that error so the process exits with code 1. If it returns nil, it means the process received a standard SIGTERM and is exiting gracefully.

Contributor Author

Done

case err := <-errChan:
cancel()
if err != nil {
return fmt.Errorf("run controller: %w", err)
Contributor

Here, previously, we would return an error to the CLI framework -- which would after all terminate this process with a non-zero exit code.

We still need a way for the program to crash upon well-defined situations, and return with a non-zero exit code. Is this still ensured?

Contributor Author

@herb-duan herb-duan Feb 5, 2026

Yes, this is still strictly ensured.

In the updated logic, if controller.Run(leaderCtx) returns an error, it is captured in the controllerErr variable. Immediately after, we call cancelElector(), which causes the blocking elector.Run(electorCtx) to return.

Once elector.Run returns, the function checks controllerErr. If it's non-nil, we return a wrapped error back to the caller (the CLI framework). Since the CLI framework receives a non-nil error, it will handle the process termination with a non-zero exit code as before.

This approach allows us to achieve two things:
Graceful Lease Release: It ensures ReleaseOnCancel is triggered so the leader identity is cleared from the Lease object immediately.
Non-zero Exit: It maintains the existing behavior of crashing the process with an error state when the controller fails.

Contributor Author

Done

Contributor

@jgehrcke jgehrcke left a comment

Hey! Thanks for the contribution! Great!

As always, what's critical, is to design the error handling story well, and to then implement it according to that design.

In the spirit of that, I have left a few comments and questions.

Generally, with distributed system fail-overs, I consider everything 'broken' that is not explicitly tested with fault injection tests. I know that testing such things can be a lot of effort. Can you describe to which extent you could test certain scenarios so far?

After all, with leader election in place, the state space becomes much more complicated than with a single instance of the controller. When the single instance crashes, it gets started again from scratch, and has a great chance to heal that way.

With leader election in place, there is a chance for the entire ensemble to transition into a long-term dysfunctional state because the "crash-everything-and-start-again" story isn't that simple anymore. From that point of view, we need to make extra sure that this type of HA deployment with leader election only ever improves things, and never worsens things (compared to the single-replica situation). To that end, we need quite a bit of discussion and testing. I'm confident that this will lead to something good!

Thanks for your work on this!

@herb-duan herb-duan closed this Feb 5, 2026
@herb-duan herb-duan reopened this Feb 5, 2026
@herb-duan
Contributor Author

@jgehrcke
I completely agree that HA should never worsen the situation. I have performed the following manual fault injection tests:

  1. Leader Crash: Sent kill -9 to the leader pod's process, and separately deleted the leader pod. Observed the Lease being taken over by a standby replica after the LeaseDuration.
  2. Graceful Upgrade: Scaling down the deployment. Observed the leader calling cancelElector(), clearing the Lease holder, and the next pod becoming leader in <2 seconds.

TODO: Network Partition: Simulating API Server unreachability for the leader. Observed that leaderCtx was canceled once RenewDeadline passed, stopping the controller logic before it lost the lease.

To ensure we don't worsen things, we've implemented UUID-based lock IDs. Even if a Pod is restarted with the same name, the new instance gets a fresh ID, preventing it from accidentally assuming it still holds a stale lease.
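
The PR does not show how the lock ID is built, so the following is only a sketch of the idea: combine the pod name with a random suffix so that a restarted process never reuses the identity of its predecessor. The `newLockID` helper is hypothetical, and a random 128-bit hex suffix stands in for a UUID here.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newLockID returns podName plus a random 128-bit suffix, so two process
// instances that share a pod name never share an election identity.
func newLockID(podName string) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generate lock ID: %w", err)
	}
	return podName + "_" + hex.EncodeToString(buf), nil
}

func main() {
	first, err := newLockID("compute-domain-controller-0")
	if err != nil {
		panic(err)
	}
	second, err := newLockID("compute-domain-controller-0")
	if err != nil {
		panic(err)
	}
	fmt.Println("identities differ across restarts:", first != second)
}
```

With distinct identities, a restarted instance that finds its predecessor's ID in the Lease treats it as another holder and waits for expiry instead of assuming leadership.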

@herb-duan
Contributor Author

Oops, sorry for the noise! I accidentally hit the 'Close with comment' button. Reopening now—please ignore the notification.

@herb-duan
Contributor Author

TODO: Network Partition: Simulating API Server unreachability for the leader. Observed that leaderCtx was canceled once RenewDeadline passed, stopping the controller logic before it lost the lease.

Hi @jgehrcke , I've conducted fault injection tests (simulating API server unreachability) and the logs confirm the robustness of the implementation.
Observations from logs:

  1. Fail-fast on Timeout: When the API server became unreachable, the leader failed to renew its lease. Once the RenewDeadline was reached (14:16:53.901), the leaderelection library correctly signaled a failure.
  2. Graceful Shutdown: My implementation's OnStoppedLeading was triggered (14:16:53.903), and the elector.Run loop returned cleanly as shown by the log: "Leader election loop ended gracefully".
  3. Clean Slate Recovery: The process then exited. Upon restart by Kubernetes (14:16:56.195), the new instance initialized correctly and recognized the existing leader (-9nwqr), entering standby mode without any conflict.

This confirms that the "crash-everything-and-start-again" strategy is preserved and correctly coordinated with the leader election lifecycle. No zombie leaders or split-brain scenarios were observed.

@klueska klueska added this to the Backlog milestone Feb 8, 2026
@klueska klueska added feature issue/PR that proposes a new feature or functionality robustness issue/pr: edge cases & fault tolerance labels Feb 8, 2026
@klueska klueska modified the milestones: Backlog, v26.4.0 Feb 8, 2026
@herb-duan herb-duan force-pushed the feat/controller-leader-election branch from dff5bd2 to 1e75766 Compare February 9, 2026 02:57
@herb-duan
Contributor Author

Hi @jgehrcke, I have updated the PR and pushed the latest changes via commit amend:

  1. Code Documentation:
    • Enhanced the ARCHITECTURE NOTE in OnStartedLeading to clarify ctx cancellation/lease release logic
    • Added ARCHITECTURE NOTE in OnStoppedLeading to explicitly document that controller shutdown is handled by leaderCtx cancellation (only logging here)
  2. Observability: Added the requested klog.V(6) log in OnNewLeader to handle self-identity check scenarios and avoid redundant logs
  3. Verification: As discussed, fault injection tests (leader crash/graceful upgrade/network partition) confirmed:
    • No zombie leader/split-brain issues
    • Controller shuts down gracefully upon lease loss (triggered by leaderCtx cancellation)
    • Lease is released immediately via ReleaseOnCancel for fast failover

The implementation now fully reflects our discussion in code comments and has been validated with fault injection. Looking forward to your final review!

Comment thread cmd/compute-domain-controller/main.go Outdated
elector.Run(electorCtx)

// If exiting due to a controller failure, propagate the error to main
if controllerErr != nil {
Contributor

controllerErr is modified inside of OnStartedLeading and read in run after elector.Run returns.

OnStartedLeading runs in a separate goroutine, so I'm wondering whether that could cause a subtle data race.
Consider the following scenario:

  1. The pod becomes leader.
  2. leaderelection.Run runs OnStartedLeading in a separate goroutine, which runs controller.Run(leaderCtx):
	go le.config.Callbacks.OnStartedLeading(ctx)
	le.renew(ctx)
  3. A network failure occurs, so leadership is lost.
  4. Since leadership is lost, the context is canceled.
  5. Two things happen concurrently:
    a. The goroutine of run tries to read controllerErr (if controllerErr != nil).
    b. The goroutine of OnStartedLeading is still in the middle of controller.Run and modifies controllerErr.

Timing is nondeterministic

  • run() may see controllerErr as nil before the write happens, and so incorrectly return a nil error
  • the read might happen while controllerErr is being written

Contributor

It might be safer to pass controllerErr to an error channel that's read inside of run

Contributor Author

Fixed. Replaced shared variable with thread-safe error channel, no data race now.

Comment thread cmd/compute-domain-controller/main.go Outdated
Comment thread deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml
Comment thread deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml Outdated
Comment thread deployments/helm/nvidia-dra-driver-gpu/templates/controller.yaml Outdated
Comment thread cmd/compute-domain-controller/main.go Outdated
@herb-duan
Contributor Author

Thanks for your reviews! I will fix all the issues you mentioned one by one (data race, flag abstraction, env naming, RBAC, code refactor, affinity, etc.) and push the updated code shortly.

Will keep you posted!

@herb-duan
Contributor Author

Hi @klueska @jgehrcke @shengnuo @dims ,
All review comments have been addressed and included in the latest commit.
Ready for final review.

Comment thread cmd/compute-domain-controller/main.go Outdated
Comment thread cmd/compute-domain-controller/main.go Outdated
Comment thread deployments/helm/nvidia-dra-driver-gpu/values.yaml
Comment thread cmd/compute-domain-controller/main.go
Comment on lines +313 to +316
select {
case controllerErrCh <- err:
default:
}
Contributor

Is this select necessary? I would hope that erroring here would trigger elector.Run() to fail, meaning we can / should just do a direct write here, i.e.:

controllerErrCh <- err

Which will then block until the follow up call to:

err := <-controllerErrCh

happens later (which we want to block to ensure the controller has shutdown completely before returning the error).

},
}

controllerErrCh := make(chan error, 1)
Contributor

Do we want this to be a buffered channel? I would think we want it to be unbuffered.

return fmt.Errorf("controller execution failed: %w", err)
}
default:
}
Contributor

As above, I don't think we want a select here, just:

if err := <-controllerErrCh; err != nil {
	klog.ErrorS(err, "Process exiting due to controller failure")
	return fmt.Errorf("controller execution failed: %w", err)
}

We can / should block here until the controller has pushed an error into this channel.
Is there ever a case we could get here without that happening?

Contributor Author

Hi @klueska , thanks for the review!

Actually, making the channel unbuffered and removing the select blocks would lead to deadlocks in both the error path and the graceful shutdown path.

  1. Why make(chan error, 1) and the non-blocking write are required:
    elector.Run() does not automatically return when the callback (OnStartedLeading) errors out. We must explicitly trigger cancelElector() to stop the leader election loop. If we use an unbuffered channel and a blocking write (controllerErrCh <- err), the callback goroutine will block forever waiting for a reader. Since it's blocked, defer cancelElector() is never reached, elector.Run() never returns, and the reader at the bottom is never reached. Deadlock.

  2. Why the non-blocking read (the bottom select) is required:
    You asked: "Is there ever a case we could get here without that happening?" Yes, absolutely! During a normal pod termination (e.g., receiving SIGTERM), the global ctx is cancelled, causing elector.Run() to return gracefully. In this case, controller.Run() exits without error, and nothing is pushed to controllerErrCh. If we block on <-controllerErrCh at the bottom, the process will hang forever during graceful shutdown until Kubernetes SIGKILLs it. The select + default allows us to safely check for errors without hanging during a normal exit.

The current buffered channel + non-blocking select pattern acts as a safe 'error mailbox' across goroutine boundaries, ensuring we never block the crucial cancelElector() call or the graceful shutdown flow.

Contributor

OK. That makes sense. I'm still not 100% happy with the way it reads with these selects, but I'll defer to @shivamerla to decide if something should be changed here.

Contributor

@klueska klueska left a comment

Looking good, just a few small comments about error propagation / proper shutdown semantics.

@klueska
Contributor

klueska commented Mar 25, 2026

/ok to test f3b4a4e

@klueska
Contributor

klueska commented Mar 25, 2026

Can you please squash to a single commit?

@herb-duan herb-duan force-pushed the feat/controller-leader-election branch from f3b4a4e to d9414d1 Compare March 26, 2026 02:29
@herb-duan
Contributor Author

Can you please squash to a single commit?

Hi @klueska , I've squashed all the changes into a single commit as requested. The commit history is clean now. I think a new /ok-to-test might be needed since the commit SHA has changed after the force push. Thanks!

@jgehrcke
Contributor

/ok-to-test d9414d1

Comment thread cmd/compute-domain-controller/main.go Outdated
for {
select {
case <-sigs:
klog.Info("Received signal, shutting down")
Contributor

We don't need to do this in this PR. But we should make sure to log the exact signal received.

Contributor Author

Good catch! It's a quick fix, so I've updated the log to include the exact signal (klog.InfoS(..., "signal", sig.String())) in the latest rebase.

enabled: false
leaseDuration: "15s"
renewDeadline: "10s"
retryPeriod: "2s"
Contributor

Where did we take inspiration for these values? :)

Contributor Author

@herb-duan herb-duan Mar 26, 2026

https://github.com/kubernetes/client-go/blob/v0.34.0/tools/leaderelection/leaderelection.go#L116

type LeaderElectionConfig struct {
	// Lock is the resource that will be used for locking
	Lock rl.Interface

	// LeaseDuration is the duration that non-leader candidates will
	// wait to force acquire leadership. This is measured against time of
	// last observed ack.
	//
	// A client needs to wait a full LeaseDuration without observing a change to
	// the record before it can attempt to take over. When all clients are
	// shutdown and a new set of clients are started with different names against
	// the same leader record, they must wait the full LeaseDuration before
	// attempting to acquire the lease. Thus LeaseDuration should be as short as
	// possible (within your tolerance for clock skew rate) to avoid a possible
	// long waits in the scenario.
	//
	// Core clients default this value to 15 seconds.
	LeaseDuration time.Duration
	// RenewDeadline is the duration that the acting master will retry
	// refreshing leadership before giving up.
	//
	// Core clients default this value to 10 seconds.
	RenewDeadline time.Duration
	// RetryPeriod is the duration the LeaderElector clients should wait
	// between tries of actions.
	//
	// Core clients default this value to 2 seconds.
	RetryPeriod time.Duration

the recommended defaults in the client-go leaderelection package.
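
The three values are not independent. To my understanding, client-go rejects a configuration unless LeaseDuration is greater than RenewDeadline, and RenewDeadline is greater than RetryPeriod multiplied by its JitterFactor of 1.2. The sketch below checks the Helm-style strings against that ordering; `validateTimings` is a hypothetical helper, not part of this PR or of client-go.

```go
package main

import (
	"fmt"
	"time"
)

// jitterFactor mirrors client-go's leaderelection.JitterFactor (1.2).
const jitterFactor = 1.2

// validateTimings parses Helm-style duration strings and checks the ordering
// leaseDuration > renewDeadline > jitterFactor*retryPeriod.
func validateTimings(leaseDuration, renewDeadline, retryPeriod string) error {
	lease, err := time.ParseDuration(leaseDuration)
	if err != nil {
		return fmt.Errorf("parse leaseDuration: %w", err)
	}
	renew, err := time.ParseDuration(renewDeadline)
	if err != nil {
		return fmt.Errorf("parse renewDeadline: %w", err)
	}
	retry, err := time.ParseDuration(retryPeriod)
	if err != nil {
		return fmt.Errorf("parse retryPeriod: %w", err)
	}

	if lease <= renew {
		return fmt.Errorf("leaseDuration (%s) must be greater than renewDeadline (%s)", lease, renew)
	}
	if renew <= time.Duration(jitterFactor*float64(retry)) {
		return fmt.Errorf("renewDeadline (%s) must be greater than retryPeriod (%s) * %.1f", renew, retry, jitterFactor)
	}
	return nil
}

func main() {
	// Defaults quoted in the thread.
	fmt.Println(validateTimings("15s", "10s", "2s"))
	// A renewDeadline equal to leaseDuration is rejected.
	fmt.Println(validateTimings("10s", "10s", "2s"))
}
```

Checking the values early, for example at flag-parsing time, would surface a bad Helm override as a clear startup error.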

Contributor

@jgehrcke jgehrcke left a comment

Thanks for all the great work and discussion here.

Can this be tested pragmatically in CI?

Note that we currently have

  • two k8s nodes in total in the CI envs
  • only one of these nodes being labeled with control-plane

Maybe this would need a bit of monkey-patching (removing a label selector?). Anyway, I can imagine that with a bit of trickery we can add a test to the current test suite that at least covers basic leader election code paths. Of course it would be nice to still do this in this PR -- but I also don't want to slow this down.

Summary: if this is testable with reasonable effort in current CI we must do it, either in this PR or in a follow-up patch.

I can also help building the test if needed.

@jgehrcke
Contributor

@herb-duan looking at e.g. this makes me think that you didn't rebase on the current HEAD of main in this repo -- please do that, only then CI has a chance of succeeding :).

- Implement client-go leaderelection with safe error propagation.
- Group leader election flags in `pkg/flags/leaderelection.go`.
- Update Helm charts for leader election.

Change-Id: I2296b433f6d8dcdda6b95fdf487c91f4f195f35e
Signed-off-by: Herb Duan <herbertduan@qq.com>
@herb-duan herb-duan force-pushed the feat/controller-leader-election branch from d9414d1 to 3ec74e0 Compare March 26, 2026 12:53
@herb-duan
Contributor Author

@herb-duan looking at e.g. this makes me think that you didn't rebase on the current HEAD of main in this repo -- please do that, only then CI has a chance of succeeding :).

Good catch! I've just rebased the branch onto the latest main and force-pushed. (Also added the exact signal logging in the same squashed commit).

Could you please trigger /ok-to-test again since the commit SHA has changed? Thanks!

@herb-duan
Contributor Author

herb-duan commented Mar 26, 2026

  • two k8s nodes in total in the CI envs
  • only one of these nodes being labeled with control-plane

Maybe this would need a bit of monkey-patching (removing a label selector?). Anyway, I can imagine that with a bit of trickery we can add a test to the current test suite that at least covers basic leader election code paths.

Hi @jgehrcke, since podAntiAffinity is just a 'preferred' rule, enabling leaderElection and setting replicas: 2 will actually force both pods onto the single control-plane node anyway, making it technically testable in the current environment.

However, considering the complexity of adding an end-to-end chaos test (e.g., killing the active pod and verifying the lease handover), I completely agree with your suggestion to handle this in a follow-up patch to keep this PR focused.

@jgehrcke
Contributor

/ok-to-test 3ec74e0

Contributor

@jgehrcke jgehrcke left a comment

CI passed. Approved! Thank you @herb-duan.

(largely relying on Kevin's last review)

@jgehrcke
Contributor

@shivamerla do you want to merge this? :)

@shivamerla shivamerla merged commit c657396 into kubernetes-sigs:main Mar 29, 2026
32 of 36 checks passed

Labels

feature issue/PR that proposes a new feature or functionality robustness issue/pr: edge cases & fault tolerance

Development

Successfully merging this pull request may close these issues.

Add Leader Election Logic for High Availability in ComputeDomain Controller

5 participants