feat(controller): Add leader election for high availability by herb-duan · Pull Request #851 · kubernetes-sigs/dra-driver-nvidia-gpu

herb-duan · 2026-02-02T09:40:17Z

Fixes #815

The compute-domain-controller currently operates as a singleton, which presents a single point of failure (SPOF) in a production environment. To enhance the reliability and availability of the controller, this change introduces a leader election mechanism to support a multi-replica, high-availability (HA) deployment model.
When HA mode is enabled, only one of the controller replicas becomes the leader and executes the core business logic as well as all change operations (e.g., configuration updates, resource modifications). All other replicas remain in a hot-standby state and will not perform any business or change-related work. This leader election dependency is critical for change operations in particular — concurrent execution of change logic by multiple controller replicas must be strictly prohibited to avoid data inconsistency, conflicting operations, or unintended side effects.
If the current leader fails, one of the standby replicas will automatically take over as the new leader, ensuring uninterrupted service continuity and consistent execution of both core business and change operations.

copy-pr-bot · 2026-02-02T09:40:21Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jgehrcke · 2026-02-05T18:18:08Z

+			},
+			OnNewLeader: func(identity string) {
+				if identity == lockID {
+					return


Can you add a code comment here, explaining that scenario? Does that mean the OnNewLeader() callback can get invoked for us with us being the leader before and after the callback gets invoked?

Can you emit a log message here on level 6? Something like klog.V(6).Infof("OnNewLeader() callback, new identity is still my lock ID")

Great catch.

Source code comments for OnNewLeader:

OnNewLeader is called when the client observes a leader that is not the previously observed leader. This includes the first observed leader when the client starts.

I will add the requested klog.V(6) log to track this. The if identity == lockID { return } check is a standard practice in K8s controllers to avoid redundant processing when the 'new' leader is actually ourselves.

done
I've just pushed an update that adds the requested V(6) logs and includes a detailed architecture note in the code to explain the lifecycle and shutdown strategy. The implementation has been verified with the network partition test as discussed.

Done. Added comment and klog.V(6) log as suggested.
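
To make the self-identity case concrete, here is a minimal sketch of the check discussed above. It assumes a local `callbacks` struct as a stand-in for the election library's callback type, and it records messages through a hypothetical `logf` function instead of klog; neither is taken from the PR.

```go
package main

import "fmt"

// callbacks is a stand-in for the callback struct of a leader election
// library. It is NOT the client-go type; it only mirrors the OnNewLeader shape.
type callbacks struct {
	OnNewLeader func(identity string)
}

// newCallbacks builds callbacks for a replica whose own lock ID is lockID.
// Messages go to logf (hypothetical) so the behavior can be observed.
func newCallbacks(lockID string, logf func(string)) callbacks {
	return callbacks{
		OnNewLeader: func(identity string) {
			if identity == lockID {
				// The newly observed leader is this replica itself, for
				// example right after it acquired the lease. Emit only a
				// debug-level message and skip the "new leader" handling.
				logf("debug: OnNewLeader() callback, new identity is still my lock ID")
				return
			}
			logf(fmt.Sprintf("info: new leader elected: %s", identity))
		},
	}
}

// observe feeds a sequence of observed leader identities through the
// callback and returns the messages that were logged.
func observe(lockID string, leaders ...string) []string {
	var logs []string
	cb := newCallbacks(lockID, func(m string) { logs = append(logs, m) })
	for _, l := range leaders {
		cb.OnNewLeader(l)
	}
	return logs
}

func main() {
	for _, m := range observe("replica-a", "replica-b", "replica-a") {
		fmt.Println(m)
	}
}
```

In the sketch, the replica first observes another leader and later observes itself, which is the sequence in which the early return matters.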

jgehrcke · 2026-02-05T18:21:08Z

+				}
+			},
+			OnStoppedLeading: func() {
+				klog.Warningf("Lost leader election (id: %s), waiting to re-compete", lockID)


This can happen when we previously were the leader, but now we're transitioning into not being the leader anymore, correct? At least, a typical failover scenario has to be for the old leader to not be the leader anymore. Are there any more actions that we need to perform here in that case? Do we need to perform a controller shutdown? If not: why?

If we later become the leader again, it looks like we would call controller.Run() again, potentially for the Nth time during our lifetime. That of course needs to be safe. Is it safe as of now?

This is a critical point. In the new implementation, we rely on context-based cancellation.

Shutdown: When we lose leadership, the leaderelection library cancels the leaderCtx passed to controller.Run(leaderCtx). Our controller is designed to watch this context and shut down all workers gracefully when it's canceled.

Process Exit: By calling cancelElector() in the defer block inside OnStartedLeading, we ensure that as soon as the controller stops, the entire elector.Run loop terminates. This allows the pod to exit and be restarted by Kubernetes (the 'crash-everything-and-start-again' strategy), which is the safest way to clear any in-memory state.

Safety: Re-running controller.Run() within the same process lifetime is generally avoided here because we prefer the pod restart to ensure a clean slate. Should I clarify this in the comments?

Done. The controller now shuts down via context cancellation, and the pod restarts for clean state.
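
The context-based shutdown described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `runController` and `simulateLeadershipLoss` are hypothetical names, and a timer stands in for the election library canceling the leader context on lease loss.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runController starts n workers and blocks until ctx is canceled and every
// worker has returned. It reports how many workers stopped via cancellation.
func runController(ctx context.Context, n int) int {
	var wg sync.WaitGroup
	var stopped int32
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			ticker := time.NewTicker(5 * time.Millisecond)
			defer ticker.Stop()
			for {
				select {
				case <-ctx.Done():
					// Leadership lost or process shutting down: stop work.
					atomic.AddInt32(&stopped, 1)
					return
				case <-ticker.C:
					// One unit of reconcile work would happen here.
				}
			}
		}()
	}
	wg.Wait()
	return int(atomic.LoadInt32(&stopped))
}

// simulateLeadershipLoss runs the controller under a leader context and
// cancels that context after afterMillis milliseconds, standing in for the
// election library canceling the context when the lease is lost.
func simulateLeadershipLoss(n int, afterMillis int) int {
	leaderCtx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(time.Duration(afterMillis)*time.Millisecond, cancel)
	return runController(leaderCtx, n)
}

func main() {
	fmt.Println("workers stopped:", simulateLeadershipLoss(3, 20))
}
```

Because `runController` returns only after all workers have exited, the caller can treat its return as the point at which no controller logic is running anymore.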

jgehrcke · 2026-02-05T18:22:59Z

+				klog.Infof("Became leader, starting controller (id: %s)", lockID)
+				if err := controller.Run(ctx); err != nil {
+					klog.Errorf("Error running controller as leader: %v", err)
+					return


Do we need to return an error here?

Yes, but we cannot return it directly from the callback because the callback signature doesn't support it.
In the updated logic, we capture it in the controllerErr variable and then call cancelElector(). This signals the main goroutine (blocked at elector.Run) to wake up, see the error, and return it to the CLI framework. This ensures the fail-fast behavior.

Done. Error is propagated correctly via error channel now.

jgehrcke · 2026-02-05T18:23:45Z

+		klog.InfoS("Context canceled, stopping leader elector", "lockID", lockID)
+	}()
+
+	elector.Run(ctx)


Does elector.Run(ctx) have an interesting return value?

leaderelection.Run itself doesn't return anything (it's a blocking call that returns when the context is canceled).

jgehrcke · 2026-02-05T18:23:57Z

+	}()
+
+	elector.Run(ctx)
+	return nil


Do we always want to return nil here?

In our updated version, we do not always return nil. We check controllerErr after elector.Run returns. If the controller failed while it was the leader, we propagate that error so the process exits with code 1. If it returns nil, it means the process received a standard SIGTERM and is exiting gracefully.

jgehrcke · 2026-02-05T18:24:43Z

-				case err := <-errChan:
-					cancel()
-					if err != nil {
-						return fmt.Errorf("run controller: %w", err)


Here, previously, we would return an error to the CLI framework -- which would after all terminate this process with a non-zero exit code.

We still need a way for the program to crash upon well-defined situations, and return with a non-zero exit code. Is this still ensured?

Yes, this is still strictly ensured.

In the updated logic, if controller.Run(leaderCtx) returns an error, it is captured in the controllerErr variable. Immediately after, we call cancelElector(), which causes the blocking elector.Run(electorCtx) to return.

Once elector.Run returns, the function checks controllerErr. If it's non-nil, we return a wrapped error back to the caller (the CLI framework). Since the CLI framework receives a non-nil error, it will handle the process termination with a non-zero exit code as before.

This approach allows us to achieve two things:
Graceful Lease Release: It ensures ReleaseOnCancel is triggered so the leader identity is cleared from the Lease object immediately.
Non-zero Exit: It maintains the existing behavior of crashing the process with an error state when the controller fails.

jgehrcke

Hey! Thanks for the contribution! Great!

As always, what's critical is to design the error handling story well, and then to implement it according to that design.

In the spirit of that, I have left a few comments and questions.

Generally, with distributed system fail-overs, I consider everything 'broken' that is not explicitly tested with fault injection tests. I know that testing such things can be a lot of effort. Can you describe to what extent you have been able to test certain scenarios so far?

After all, with leader election in place, the state space becomes much more complicated than with a single instance of the controller. When the single instance crashes, it gets started again from scratch, and has a great chance to heal that way.

With leader election in place, there is a chance for the entire ensemble to transition into a long-term dysfunctional state because the "crash-everything-and-start-again" story isn't that simple anymore. From that point of view, we need to make extra sure that this type of HA deployment with leader election only ever improves things, and never worsens things (compared to the single-replica situation). To that end, we need quite a bit of discussion and testing. I'm confident that this will lead to something good!

Thanks for your work on this!

herb-duan · 2026-02-05T20:28:57Z

@jgehrcke
I completely agree that HA should never worsen the situation. I have performed the following manual fault injection tests:

Leader Crash: kill -9 the leader pod process and kubelet delete the leader pod. Observed the Lease being taken over by a standby replica after the LeaseDuration.
Graceful Upgrade: Scaling down the deployment. Observed the leader calling cancelElector(), clearing the Lease holder, and the next pod becoming leader in <2 seconds.

TODO: Network Partition: Simulating API Server unreachability for the leader. Observed that leaderCtx was canceled once RenewDeadline passed, stopping the controller logic before it lost the lease.

To ensure we don't worsen things, we've implemented UUID-based lock IDs. Even if a Pod is restarted with the same name, the new instance gets a fresh ID, preventing it from accidentally assuming it still holds a stale lease.
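
As an illustration of per-instance lock IDs, the sketch below appends a random 128-bit hex suffix to the pod name. The exact ID format used in the PR is not shown in this thread, so `newLockID`, the underscore separator, and the hex encoding (used here in place of a formatted UUID) are all assumptions.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newLockID returns podName followed by a random 128-bit hex suffix, so that
// two process instances with the same pod name never share an identity.
// The format is hypothetical; it only illustrates the idea.
func newLockID(podName string) (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("generate lock ID suffix: %w", err)
	}
	return podName + "_" + hex.EncodeToString(buf), nil
}

func main() {
	// Two "restarts" of a pod with the same name get different lock IDs.
	first, err := newLockID("controller-0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	second, err := newLockID("controller-0")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(first)
	fmt.Println(second)
	fmt.Println("identical:", first == second)
}
```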

herb-duan · 2026-02-05T20:32:13Z

Oops, sorry for the noise! I accidentally hit the 'Close with comment' button. Reopening now—please ignore the notification.

herb-duan · 2026-02-06T06:29:57Z

TODO: Network Partition: Simulating API Server unreachability for the leader. Observed that leaderCtx was canceled once RenewDeadline passed, stopping the controller logic before it lost the lease.

Hi @jgehrcke , I've conducted fault injection tests (simulating API server unreachability) and the logs confirm the robustness of the implementation.
Observations from logs:

Fail-fast on Timeout: When the API server became unreachable, the leader failed to renew its lease. Once the RenewDeadline was reached (14:16:53.901), the leaderelection library correctly signaled a failure.
Graceful Shutdown: My implementation's OnStoppedLeading was triggered (14:16:53.903), and the elector.Run loop returned cleanly as shown by the log: "Leader election loop ended gracefully".
Clean Slate Recovery: The process then exited. Upon restart by Kubernetes (14:16:56.195), the new instance initialized correctly and recognized the existing leader (-9nwqr), entering standby mode without any conflict.

This confirms that the "crash-everything-and-start-again" strategy is preserved and correctly coordinated with the leader election lifecycle. No zombie leaders or split-brain scenarios were observed.

herb-duan · 2026-02-09T03:02:43Z

Hi @jgehrcke, I have updated the PR and pushed the latest changes via commit amend:

Code Documentation:
- Enhanced the ARCHITECTURE NOTE in OnStartedLeading to clarify ctx cancellation/lease release logic
- Added ARCHITECTURE NOTE in OnStoppedLeading to explicitly document that controller shutdown is handled by leaderCtx cancellation (only logging here)
Observability: Added the requested klog.V(6) log in OnNewLeader to handle self-identity check scenarios and avoid redundant logs
Verification: As discussed, fault injection tests (leader crash/graceful upgrade/network partition) confirmed:
- No zombie leader/split-brain issues
- Controller shuts down gracefully upon lease loss (triggered by leaderCtx cancellation)
- Lease is released immediately via ReleaseOnCancel for fast failover

The implementation now fully reflects our discussion in code comments and has been validated with fault injection. Looking forward to your final review!

shengnuo · 2026-03-04T18:32:52Z

+	elector.Run(electorCtx)
+
+	// If exiting due to a controller failure, propagate the error to main
+	if controllerErr != nil {


controllerErr is modified inside of OnStartedLeading and read in run after elector.Run returns.

OnStartedLeading runs in a separate goroutine, so I'm wondering whether that could cause a subtle data race.
Consider the following scenario:

Pod becomes leader

leaderelection.Run runs OnStartedLeading in a separate goroutine, which runs controller.Run(leaderCtx)

go le.config.Callbacks.OnStartedLeading(ctx)
le.renew(ctx)

Network failure, so leadership is lost

Since leadership is lost, the context is canceled

Two things happen concurrently
a. The goroutine of run tries to read controllerErr (if controllerErr != nil)
b. The goroutine of OnStartedLeading is still in the middle of controller.Run and modifies controllerErr

Timing is nondeterministic

run() may see controllerErr as nil before the write happens, and so incorrectly return a nil error

The read might happen while controllerErr is being written

It might be safer to pass the controllerErr to an error channel that's read inside of run

Fixed. Replaced shared variable with thread-safe error channel, no data race now.

herb-duan · 2026-03-13T06:51:25Z

Thanks for your reviews! I will fix all the issues you mentioned one by one (data race, flag abstraction, env naming, RBAC, code refactor, affinity, etc.) and push the updated code shortly.

Will keep you posted!

herb-duan · 2026-03-13T09:08:38Z

Hi @klueska @jgehrcke @shengnuo @dims ,
All review comments have been addressed and included in the latest commit.
Ready for final review.

klueska · 2026-03-25T11:30:22Z

+				select {
+				case controllerErrCh <- err:
+				default:
+				}


Is this select necessary? I would hope that erroring here would trigger elector.Run() to fail, meaning we can / should just do a direct write here, i.e.:

controllerErrCh <- err

Which will then block until the follow up call to:

err := <-controllerErrCh

happens later (which we want to block to ensure the controller has shutdown completely before returning the error).

klueska · 2026-03-25T11:32:27Z

+		},
+	}
+
+	controllerErrCh := make(chan error, 1)


Do we want this to be a buffered channel? I would think we want it to be unbuffered.

klueska · 2026-03-25T11:33:32Z

+			return fmt.Errorf("controller execution failed: %w", err)
+		}
+	default:
+	}


As above, I don't think we want a select here, just:

if err := <-controllerErrCh; err != nil {
	klog.ErrorS(err, "Process exiting due to controller failure")
	return fmt.Errorf("controller execution failed: %w", err)
}

We can / should block here until the controller has pushed an error into this channel.
Is there ever a case we could get here without that happening?

Hi @klueska , thanks for the review!

Actually, making the channel unbuffered and removing the select blocks would lead to deadlocks in both the error path and the graceful shutdown path.

Why make(chan error, 1) and the non-blocking write are required:
elector.Run() does not automatically return when the callback (OnStartedLeading) errors out. We must explicitly trigger cancelElector() to stop the leader election loop. If we use an unbuffered channel and a blocking write (controllerErrCh <- err), the callback goroutine will block forever waiting for a reader. Since it's blocked, defer cancelElector() is never reached, elector.Run() never returns, and the reader at the bottom is never reached. Deadlock.

Why the non-blocking read (the bottom select) is required:
You asked: "Is there ever a case we could get here without that happening?" Yes, absolutely! During a normal pod termination (e.g., receiving SIGTERM), the global ctx is cancelled, causing elector.Run() to return gracefully. In this case, controller.Run() exits without error, and nothing is pushed to controllerErrCh. If we block on <-controllerErrCh at the bottom, the process will hang forever during graceful shutdown until Kubernetes SIGKILLs it. The select + default allows us to safely check for errors without hanging during a normal exit.

The current buffered channel + non-blocking select pattern acts as a safe 'error mailbox' across goroutine boundaries, ensuring we never block the crucial cancelElector() call or the graceful shutdown flow.

OK. That makes sense. I'm still not 100% happy with the way it reads with these selects, but I'll defer to @shivamerla to decide if something should be changed here.
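
For readers following the exchange above, the buffered channel plus non-blocking select pattern can be sketched with the standard library alone. This is a simplified model, not the PR's code: `fakeElectorRun` stands in for `elector.Run`, and `failingController`, `idleController`, `runWithFailure`, and `runWithShutdown` are hypothetical helpers added for the demonstration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// fakeElectorRun stands in for elector.Run: it invokes the leading callback
// in its own goroutine and blocks until ctx is canceled.
func fakeElectorRun(ctx context.Context, onStartedLeading func(context.Context)) {
	go onStartedLeading(ctx)
	<-ctx.Done()
}

// run mirrors the shape discussed in the thread.
func run(parent context.Context, controllerRun func(context.Context) error) error {
	electorCtx, cancelElector := context.WithCancel(parent)
	defer cancelElector()

	// Capacity 1: the callback can deposit an error without a waiting reader.
	controllerErrCh := make(chan error, 1)

	fakeElectorRun(electorCtx, func(leaderCtx context.Context) {
		// Runs after the send below, so the elector is stopped only once
		// the error (if any) is already in the channel.
		defer cancelElector()
		if err := controllerRun(leaderCtx); err != nil {
			select {
			case controllerErrCh <- err:
			default: // mailbox already full; never block the callback
			}
		}
	})

	// Non-blocking read: on graceful shutdown nothing was ever sent.
	select {
	case err := <-controllerErrCh:
		return fmt.Errorf("controller execution failed: %w", err)
	default:
		return nil
	}
}

var errBoom = errors.New("boom")

// failingController simulates a controller that fails while leading.
func failingController(ctx context.Context) error { return errBoom }

// idleController simulates a healthy controller that exits on cancellation.
func idleController(ctx context.Context) error { <-ctx.Done(); return nil }

func runWithFailure() error { return run(context.Background(), failingController) }

func runWithShutdown() error {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate a termination signal canceling the global context
	return run(ctx, idleController)
}

func main() {
	fmt.Println("failure path: ", runWithFailure())
	fmt.Println("shutdown path:", runWithShutdown())
}
```

In the failure path the send completes before `cancelElector()` fires, so the final select finds the error. In the shutdown path the channel stays empty and the `default` branch returns nil instead of hanging.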

klueska

Looking good, just a few small comments about error propagation / proper shutdown semantics.

klueska · 2026-03-25T14:24:36Z

/ok to test f3b4a4e

klueska · 2026-03-25T14:24:56Z

Can you please squash to a single commit?

herb-duan · 2026-03-26T02:39:34Z

Can you please squash to a single commit?

Hi @klueska , I've squashed all the changes into a single commit as requested. The commit history is clean now. I think a new /ok-to-test might be needed since the commit SHA has changed after the force push. Thanks!

jgehrcke · 2026-03-26T10:17:16Z

/ok-to-test d9414d1

jgehrcke · 2026-03-26T10:20:47Z

 			for {
 				select {
 				case <-sigs:
+					klog.Info("Received signal, shutting down")


We don't need to do this in this PR. But we should make sure to log the exact signal received.

Good catch! It's a quick fix, so I've updated the log to include the exact signal (klog.InfoS(..., "signal", sig.String())) in the latest rebase.
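
A small sketch of logging the exact signal follows. It uses `fmt` as a stand-in for the klog call mentioned above, and `shutdownMessage` and `demoMessage` are hypothetical helpers; to keep the example self-terminating, the signal is placed on the channel directly rather than delivered by the operating system.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// shutdownMessage formats the log line, including the name of the signal
// that triggered the shutdown.
func shutdownMessage(sig os.Signal) string {
	return fmt.Sprintf("Received signal, shutting down: signal=%s", sig.String())
}

// demoMessage returns the message produced for SIGTERM.
func demoMessage() string {
	return shutdownMessage(syscall.SIGTERM)
}

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
	defer signal.Stop(sigs)

	// Put a signal value on the channel directly so the sketch terminates
	// without an external kill; a real process would just block on the receive.
	sigs <- syscall.SIGTERM

	sig := <-sigs
	fmt.Println(shutdownMessage(sig))
}
```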

jgehrcke · 2026-03-26T10:22:49Z

+    enabled: false
+    leaseDuration: "15s"
+    renewDeadline: "10s"
+    retryPeriod: "2s"


Where did we take inspiration for these values? :)

https://github.com/kubernetes/client-go/blob/v0.34.0/tools/leaderelection/leaderelection.go#L116

type LeaderElectionConfig struct {
	// Lock is the resource that will be used for locking
	Lock rl.Interface

	// LeaseDuration is the duration that non-leader candidates will
	// wait to force acquire leadership. This is measured against time of
	// last observed ack.
	//
	// A client needs to wait a full LeaseDuration without observing a change to
	// the record before it can attempt to take over. When all clients are
	// shutdown and a new set of clients are started with different names against
	// the same leader record, they must wait the full LeaseDuration before
	// attempting to acquire the lease. Thus LeaseDuration should be as short as
	// possible (within your tolerance for clock skew rate) to avoid a possible
	// long waits in the scenario.
	//
	// Core clients default this value to 15 seconds.
	LeaseDuration time.Duration
	// RenewDeadline is the duration that the acting master will retry
	// refreshing leadership before giving up.
	//
	// Core clients default this value to 10 seconds.
	RenewDeadline time.Duration
	// RetryPeriod is the duration the LeaderElector clients should wait
	// between tries of actions.
	//
	// Core clients default this value to 2 seconds.
	RetryPeriod time.Duration

These are the recommended defaults in the client-go leaderelection package.
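
Since the Helm values are strings, a sketch of parsing and sanity-checking them may be useful. The checks below follow the ordering that client-go enforces when constructing an elector (lease duration above renew deadline, renew deadline above the jittered retry period); the `timings` type and `parseTimings` function are hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"time"
)

// jitterFactor mirrors the JitterFactor constant (1.2) of the client-go
// leaderelection package.
const jitterFactor = 1.2

// timings holds the three parsed leader election durations.
type timings struct {
	LeaseDuration time.Duration
	RenewDeadline time.Duration
	RetryPeriod   time.Duration
}

// parseTimings parses duration strings such as "15s" and checks that they
// are ordered consistently with each other.
func parseTimings(lease, renew, retry string) (timings, error) {
	var t timings
	var err error
	if t.LeaseDuration, err = time.ParseDuration(lease); err != nil {
		return timings{}, fmt.Errorf("leaseDuration: %w", err)
	}
	if t.RenewDeadline, err = time.ParseDuration(renew); err != nil {
		return timings{}, fmt.Errorf("renewDeadline: %w", err)
	}
	if t.RetryPeriod, err = time.ParseDuration(retry); err != nil {
		return timings{}, fmt.Errorf("retryPeriod: %w", err)
	}
	if t.LeaseDuration <= t.RenewDeadline {
		return timings{}, fmt.Errorf("leaseDuration (%s) must be greater than renewDeadline (%s)",
			t.LeaseDuration, t.RenewDeadline)
	}
	if t.RenewDeadline <= time.Duration(jitterFactor*float64(t.RetryPeriod)) {
		return timings{}, fmt.Errorf("renewDeadline (%s) must be greater than retryPeriod*%.1f (%s)",
			t.RenewDeadline, jitterFactor, t.RetryPeriod)
	}
	return t, nil
}

func main() {
	// The chart defaults from the diff above.
	t, err := parseTimings("15s", "10s", "2s")
	fmt.Println(t.LeaseDuration, t.RenewDeadline, t.RetryPeriod, err)

	// A misconfiguration: lease duration equal to the renew deadline.
	_, err = parseTimings("10s", "10s", "2s")
	fmt.Println(err)
}
```

Validating at flag-parsing time gives a clearer error than letting elector construction fail later, though whether the PR does so is not stated in this thread.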

jgehrcke

Thanks for all the great work and discussion here.

Can this be tested pragmatically in CI?

Note that we currently have

two k8s nodes in total in the CI envs
only one of these nodes being labeled with control-plane

Maybe this would need a bit of monkey-patching (removing a label selector?). Anyway, I can imagine that with a bit of trickery we can add a test to the current test suite that at least covers basic leader election code paths. Of course it would be nice to still do this in this PR -- but I also don't want to slow this down.

Summary: if this is testable with reasonable effort in current CI we must do it, either in this PR or in a follow-up patch.

I can also help building the test if needed.

jgehrcke · 2026-03-26T10:33:23Z

@herb-duan looking at e.g. this makes me think that you didn't rebase on the current HEAD of main in this repo -- please do that, only then CI has a chance of succeeding :).

- Implement client-go leaderelection with safe error propagation.
- Group leader election flags in `pkg/flags/leaderelection.go`.
- Update Helm charts for leader election.

Change-Id: I2296b433f6d8dcdda6b95fdf487c91f4f195f35e
Signed-off-by: Herb Duan <herbertduan@qq.com>

herb-duan · 2026-03-26T13:23:15Z

@herb-duan looking at e.g. this makes me think that you didn't rebase on the current HEAD of main in this repo -- please do that, only then CI has a chance of succeeding :).

Good catch! I've just rebased the branch onto the latest main and force-pushed. (Also added the exact signal logging in the same squashed commit).

Could you please trigger /ok-to-test again since the commit SHA has changed? Thanks!

herb-duan · 2026-03-26T13:36:07Z

two k8s nodes in total in the CI envs

only one of these nodes being labeled with control-plane

Maybe this would need a bit of monkey-patching (removing a label selector?). Anyway, I can imagine that with a bit of trickery we can add a test to the current test suite that at least covers basic leader election code paths.

Hi @jgehrcke, since podAntiAffinity is just a 'preferred' rule, enabling leaderElection and setting replicas: 2 will actually place both pods on the single control-plane node anyway, making it technically testable in the current environment.

However, considering the complexity of adding an end-to-end chaos test (e.g., killing the active pod and verifying the lease handover), I completely agree with your suggestion to handle this in a follow-up patch to keep this PR focused.

jgehrcke · 2026-03-26T14:23:46Z

/ok-to-test 3ec74e0

jgehrcke

CI passed. Approved! Thank you @herb-duan.

(largely relying on Kevin's last review)

jgehrcke · 2026-03-28T15:25:34Z

@shivamerla do you want to merge this? :)

herb-duan force-pushed the feat/controller-leader-election branch 3 times, most recently from f7ac650 to 0df5f10 Compare February 5, 2026 17:04

jgehrcke reviewed Feb 5, 2026


herb-duan closed this Feb 5, 2026

herb-duan reopened this Feb 5, 2026

klueska added this to the Backlog milestone Feb 8, 2026

klueska assigned jgehrcke Feb 8, 2026

klueska added the feature (issue/PR that proposes a new feature or functionality) and robustness (issue/pr: edge cases & fault tolerance) labels Feb 8, 2026

klueska modified the milestones: Backlog, v26.4.0 Feb 8, 2026

herb-duan force-pushed the feat/controller-leader-election branch from dff5bd2 to 1e75766 Compare February 9, 2026 02:57

herb-duan mentioned this pull request Feb 10, 2026

Add Leader Election Logic for High Availability in ComputeDomain Controller #815

Closed

shengnuo reviewed Mar 4, 2026
